How to Concatenate columns in PySpark DataFrame


In this PySpark tutorial, we will see how to concatenate columns in a PySpark DataFrame.

Let's create a PySpark DataFrame with 5 rows and 3 columns.

First, install PySpark:

pip install pyspark
# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]

# create the dataframe from the values
data = spark.createDataFrame(values)

# display the dataframe
data.show()

Output:

+-----+------+-------------------+
|marks|rollno|       student name|
+-----+------+-------------------+
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|  100|     5|  Chennupati Rohith|
+-----+------+-------------------+

Method 1: concat()

concat() joins the values of two or more columns in the DataFrame, so it can be used to create a new column holding the concatenated values.

It is used with the select() method, and alias() is used to name the new column that stores the concatenated values.

Syntax:

dataframe.select(concat(column1, column2, ...).alias("new_column_name"))

where dataframe is the input PySpark DataFrame and new_column_name is the column that holds the concatenated values.

Example:

In this example, we are concatenating the marks and rollno columns.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# import the concat function
from pyspark.sql.functions import concat

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]

# create the dataframe from the values
data = spark.createDataFrame(values)

# concatenate marks and rollno into a new column
data.select(data.marks, data.rollno,
            concat(data.marks, data.rollno).alias("concatenated_column")).show()

Here we display marks, rollno, and the concatenated marks-rollno values.

Output:

+-----+------+-------------------+
|marks|rollno|concatenated_column|
+-----+------+-------------------+
|   98|     1|                981|
|   89|     2|                892|
|   90|     3|                903|
|   78|     4|                784|
|  100|     5|               1005|
+-----+------+-------------------+

Method 2: concat_ws()

concat_ws() also joins the values of two or more columns in the DataFrame, so it can likewise be used to create a new column holding the concatenated values. The difference between the two is that concat() takes no separator, while concat_ws() takes a separator string that is placed between the concatenated values.

It is used with the select() method, and alias() is used to name the new column that stores the concatenated values.

Syntax:

dataframe.select(concat_ws(separator, column1, column2, ...).alias("new_column_name"))

where dataframe is the input PySpark DataFrame, separator is the string placed between values, and new_column_name is the column that holds the concatenated values.

Example:

In this example, we are concatenating the marks and rollno columns, separated by "-->".

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# import the concat_ws function
from pyspark.sql.functions import concat_ws

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]

# create the dataframe from the values
data = spark.createDataFrame(values)

# concatenate marks and rollno with "-->" as the separator
data.select(data.marks, data.rollno,
            concat_ws("-->", data.marks, data.rollno).alias("concatenated_column")).show()

Output:

We can see that concatenated values are separated by -->.

+-----+------+-------------------+
|marks|rollno|concatenated_column|
+-----+------+-------------------+
|   98|     1|             98-->1|
|   89|     2|             89-->2|
|   90|     3|             90-->3|
|   78|     4|             78-->4|
|  100|     5|            100-->5|
+-----+------+-------------------+

About the Author
Gottumukkala Sravan Kumar 171FA07058
B.Tech (Hon's) - IT from Vignan's University. Published 1400+ Technical Articles on Python, R, Swift, Java, C#, LISP, PHP - MySQL and Machine Learning
Published Date : Jun 14, 2024