Union on PySpark DataFrames
In this PySpark tutorial, we will see how to perform a union on two DataFrames.
A union appends the rows of one DataFrame to another; it is not a relational join. We will create two DataFrames to demonstrate it.
First, install the pyspark module:
pip install pyspark
Example:
The first DataFrame is created with 5 rows and 3 columns, and the second with 6 rows and 3 columns.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values1 = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
#create a list of data
values2 = [{'rollno': 11, 'student name': 'harish','marks': 48},
{'rollno': 22, 'student name': 'deepak','marks': 69},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 33, 'student name': 'Lasya','marks': 80}]
# create the dataframe from the values
data1 = spark.createDataFrame(values1)
# create the dataframe from the values
data2 = spark.createDataFrame(values2)
#display
data1.show()
#display
data2.show()
Output:
Two dataframes
+-----+------+-------------------+
|marks|rollno|       student name|
+-----+------+-------------------+
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|  100|     5|  Chennupati Rohith|
+-----+------+-------------------+
+-----+------+------------------+
|marks|rollno|      student name|
+-----+------+------------------+
|   48|    11|            harish|
|   69|    22|            deepak|
|   89|     2|Gottumukkala Bobby|
|   90|     3|       Lavu Ojaswi|
|   78|     4|      Lavu Gnanesh|
|   80|    33|             Lasya|
+-----+------+------------------+
union() in PySpark combines two DataFrames by appending the rows of the second DataFrame to the first. Columns are matched by position, not by name, and duplicate rows are kept.
Syntax:
dataframe1.union(dataframe2)
where dataframe1 is the first DataFrame and dataframe2 is the second DataFrame.
Example:
Combine the two DataFrames above using union().
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values1 = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
#create a list of data
values2 = [{'rollno': 11, 'student name': 'harish','marks': 48},
{'rollno': 22, 'student name': 'deepak','marks': 69},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 33, 'student name': 'Lasya','marks': 80}]
# create the dataframe from the values
data1 = spark.createDataFrame(values1)
# create the dataframe from the values
data2 = spark.createDataFrame(values2)
#perform union
data1.union(data2).show()
Output:
We can see that the rows of the second DataFrame are appended to the first DataFrame. The duplicate rows (rollno 2, 3 and 4) are kept.
union() raises an error if the two DataFrames have a different number of columns.
+-----+------+-------------------+
|marks|rollno|       student name|
+-----+------+-------------------+
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|  100|     5|  Chennupati Rohith|
|   48|    11|             harish|
|   69|    22|             deepak|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|   80|    33|              Lasya|
+-----+------+-------------------+
unionAll() in PySpark also appends the rows of the second DataFrame to the first and behaves the same as union(). unionAll() was deprecated in Spark 2.0 in favor of union(); since Spark 3.0 it is kept as an alias of union().
Syntax:
dataframe1.unionAll(dataframe2)
where dataframe1 is the first DataFrame and dataframe2 is the second DataFrame.
Example:
Combine the two DataFrames above using unionAll().
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values1 = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
#create a list of data
values2 = [{'rollno': 11, 'student name': 'harish','marks': 48},
{'rollno': 22, 'student name': 'deepak','marks': 69},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 33, 'student name': 'Lasya','marks': 80}]
# create the dataframe from the values
data1 = spark.createDataFrame(values1)
# create the dataframe from the values
data2 = spark.createDataFrame(values2)
#perform unionAll
data1.unionAll(data2).show()
Output:
We can see that the rows of the second DataFrame are appended to the first DataFrame, exactly as with union().
unionAll() raises an error if the two DataFrames have a different number of columns.
+-----+------+-------------------+
|marks|rollno|       student name|
+-----+------+-------------------+
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|  100|     5|  Chennupati Rohith|
|   48|    11|             harish|
|   69|    22|             deepak|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|   80|    33|              Lasya|
+-----+------+-------------------+
About the Author
Gottumukkala Sravan Kumar 171FA07058
B.Tech (Hon's) - IT from Vignan's University.
Published 1400+ Technical Articles on Python, R, Swift, Java, C#, LISP, PHP - MySQL and Machine Learning
Published Date: Jun 14, 2024