How to Concatenate columns in PySpark DataFrame
In this PySpark tutorial, we will see how to concatenate columns in a PySpark DataFrame.
Let's create a PySpark DataFrame with 5 rows and 3 columns.
Install pyspark:
pip install pyspark
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#display dataframe
data.show()
Output:
+-----+------+-------------------+
|marks|rollno| student name|
+-----+------+-------------------+
| 98| 1|Gottumukkala Sravan|
| 89| 2| Gottumukkala Bobby|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
+-----+------+-------------------+
Method 1 : concat()
concat() joins the values of two or more columns in the DataFrame, so it can be used to build a new column that holds the concatenated values.
It is used with the select() method, and alias() is used to name the new column that stores the concatenated values.
Syntax:
dataframe.select(concat(column1, column2, …).alias("new_column_name"))
where dataframe is the input PySpark DataFrame and new_column_name is the name of the concatenated column.
Example:
In this example, we concatenate the marks and rollno columns.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
#import concat function
from pyspark.sql.functions import concat
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#concatenate marks and rollno
data.select(data.marks, data.rollno, concat(data.marks, data.rollno).alias("concatenated_column")).show()
We display the marks and rollno columns along with the concatenated marks-rollno values.
Output:
+-----+------+-------------------+
|marks|rollno|concatenated_column|
+-----+------+-------------------+
| 98| 1| 981|
| 89| 2| 892|
| 90| 3| 903|
| 78| 4| 784|
| 100| 5| 1005|
+-----+------+-------------------+
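Besides select(), concat() can also be combined with withColumn() to keep every existing column and append the concatenated one. Below is a minimal sketch reusing the data DataFrame created above; the column name marks_rollno is just an illustrative choice. Note that concat() returns NULL for any row where one of its inputs is NULL.
#append the concatenated column while keeping all existing columns
from pyspark.sql.functions import concat

# "marks_rollno" is an illustrative column name, not part of the original example
data.withColumn("marks_rollno", concat(data.marks, data.rollno)).show()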
Method 2 : concat_ws()
concat_ws() also joins the values of two or more columns in the DataFrame, so it can likewise be used to build a new column that holds the concatenated values. The difference between concat() and concat_ws() is that concat() does not take a separator, while concat_ws() takes a separator that is placed between the concatenated values.
It is used with the select() method, and alias() is used to name the new column that stores the concatenated values.
Syntax:
dataframe.select(concat_ws(separator, column1, column2, …).alias("new_column_name"))
where dataframe is the input PySpark DataFrame, separator is the string placed between the values, and new_column_name is the name of the concatenated column.
Example:
In this example, we concatenate the marks and rollno columns separated by "-->".
# import the below modules
import pyspark
from pyspark.sql import SparkSession
#import concat_ws function
from pyspark.sql.functions import concat_ws
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#concatenate marks and rollno
data.select(data.marks, data.rollno, concat_ws("-->", data.marks, data.rollno).alias("concatenated_column")).show()
Output:
We can see that the concatenated values are separated by -->.
+-----+------+-------------------+
|marks|rollno|concatenated_column|
+-----+------+-------------------+
| 98| 1| 98-->1|
| 89| 2| 89-->2|
| 90| 3| 90-->3|
| 78| 4| 78-->4|
| 100| 5| 100-->5|
+-----+------+-------------------+
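concat_ws() accepts any number of columns after the separator, so the student name column can be included as well. Below is a minimal sketch reusing the same data DataFrame; the alias details is just an illustrative name. Unlike concat(), concat_ws() skips NULL values instead of returning NULL.
#concatenate student name, marks and rollno separated by ","
from pyspark.sql.functions import concat_ws

# "details" is an illustrative alias; bracket notation is used because "student name" contains a space
data.select(concat_ws(",", data["student name"], data.marks, data.rollno).alias("details")).show(truncate=False)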
About the Author
Gottumukkala Sravan Kumar 171FA07058
B.Tech (Hons) - IT from Vignan's University.
Published 1400+ Technical Articles on Python, R, Swift, Java, C#, LISP, PHP - MySQL and Machine Learning
Published Date: Jun 14, 2024