How to Concatenate columns in PySpark DataFrame
In this PySpark tutorial, we will see how to concatenate columns in a PySpark DataFrame.
Let's create a PySpark DataFrame with 5 rows and 3 columns.
Install pyspark:
pip install pyspark
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#display dataframe
data.show()
Output:
+-----+------+-------------------+
|marks|rollno| student name|
+-----+------+-------------------+
| 98| 1|Gottumukkala Sravan|
| 89| 2| Gottumukkala Bobby|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
+-----+------+-------------------+
Method 1 : concat()
concat() joins the values of two or more columns in the DataFrame, so it can be used to build a new column that holds the concatenated values.
It is used with the select() method, and alias() is used to name the new column that stores the concatenated values.
Syntax:
dataframe.select(concat(column1, column2, …).alias("new_column_name"))
where dataframe is the input PySpark DataFrame and new_column_name is the name of the concatenated column.
Example:
In this example, we concatenate the marks and rollno columns.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
#import concat function
from pyspark.sql.functions import concat
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#concatenate marks and rollno
data.select(data.marks, data.rollno, concat(data.marks, data.rollno).alias("concatenated_column")).show()
We display the marks and rollno columns along with the concatenated marks-rollno values.
Output:
+-----+------+-------------------+
|marks|rollno|concatenated_column|
+-----+------+-------------------+
| 98| 1| 981|
| 89| 2| 892|
| 90| 3| 903|
| 78| 4| 784|
| 100| 5| 1005|
+-----+------+-------------------+
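Besides select(), concat() can also be combined with withColumn() to keep every existing column and append the concatenated one. Below is a minimal sketch reusing the data DataFrame created above; the column name marks_rollno is just an illustrative choice. Note that concat() returns NULL for any row where one of its inputs is NULL.
#append the concatenated column while keeping all existing columns
from pyspark.sql.functions import concat

# "marks_rollno" is an illustrative column name, not part of the original example
data.withColumn("marks_rollno", concat(data.marks, data.rollno)).show()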
Method 2 : concat_ws()
concat_ws() also joins the values of two or more columns in the DataFrame, so it can likewise be used to build a new column that holds the concatenated values. The difference between concat() and concat_ws() is that concat() does not take a separator, while concat_ws() takes a separator that is placed between the concatenated values.
It is used with the select() method, and alias() is used to name the new column that stores the concatenated values.
Syntax:
dataframe.select(concat_ws(separator, column1, column2, …).alias("new_column_name"))
where dataframe is the input PySpark DataFrame, separator is the string placed between the values, and new_column_name is the name of the concatenated column.
Example:
In this example, we concatenate the marks and rollno columns separated by "-->".
# import the below modules
import pyspark
from pyspark.sql import SparkSession
#import concat_ws function
from pyspark.sql.functions import concat_ws
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#concatenate marks and rollno
data.select(data.marks, data.rollno, concat_ws("-->", data.marks, data.rollno).alias("concatenated_column")).show()
Output:
We can see that the concatenated values are separated by -->.
+-----+------+-------------------+
|marks|rollno|concatenated_column|
+-----+------+-------------------+
| 98| 1| 98-->1|
| 89| 2| 89-->2|
| 90| 3| 90-->3|
| 78| 4| 78-->4|
| 100| 5| 100-->5|
+-----+------+-------------------+
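concat_ws() accepts any number of columns after the separator, so the student name column can be included as well. Below is a minimal sketch reusing the same data DataFrame; the alias details is just an illustrative name. Unlike concat(), concat_ws() skips NULL values instead of returning NULL.
#concatenate student name, marks and rollno separated by ","
from pyspark.sql.functions import concat_ws

# "details" is an illustrative alias; bracket notation is used because "student name" contains a space
data.select(concat_ws(",", data["student name"], data.marks, data.rollno).alias("details")).show(truncate=False)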
About the Author
Gottumukkala Sravan Kumar 171FA07058
B.Tech (Hons) - IT from Vignan's University.
Published 1400+ Technical Articles on Python, R, Swift, Java, C#, LISP, PHP - MySQL and Machine Learning
Published Date: Jun 14, 2024