In this tutorial we will perform SQL joins in PySpark.
Syntax to install:
pip install pyspark
To perform a join we need two dataframes, so let's create two PySpark dataframes.
# import the required module
from pyspark.sql import SparkSession

# create the SparkSession (app)
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]
# create the dataframe from the values
data1 = spark.createDataFrame(values)

# create a list of data
values = [{'rollno': 1, 'gender': 'Male'},
          {'rollno': 2, 'gender': 'Female'},
          {'rollno': 6, 'gender': 'Male'}]
# create the dataframe from the values
data2 = spark.createDataFrame(values)

# display both dataframes
data1.show()
data2.show()
Output:
Two pyspark dataframes:
+-----+------+-------------------+
|marks|rollno| student name|
+-----+------+-------------------+
| 98| 1|Gottumukkala Sravan|
| 89| 2| Gottumukkala Bobby|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
+-----+------+-------------------+
+------+------+
|gender|rollno|
+------+------+
| Male| 1|
|Female| 2|
| Male| 6|
+------+------+
Steps to perform Joins:
1. Create temporary views for both the dataframes.
We can create one using the createOrReplaceTempView() method.
Syntax:
input_dataframe.createOrReplaceTempView("view_name")
where view_name is the name given to the dataframe's temporary view.
2. Perform joins using spark.sql() function.
Let’s see INNER JOIN
INNER JOIN combines the two dataframes by keeping only the rows whose values in the join column match in both dataframes.
Syntax:
spark.sql("select * from view1 INNER JOIN view2 ON view1.column == view2.column")
where,
- view1 is the first PySpark dataframe view
- view2 is the second PySpark dataframe view
Example:
In this example, we will perform INNER JOIN on rollno column from two dataframes.
# data1 and data2 are the dataframes created above
# create view for data1 as data1
data1.createOrReplaceTempView("data1")
# create view for data2 as data2
data2.createOrReplaceTempView("data2")
# perform sql - inner join
spark.sql("select * from data1 INNER JOIN data2 ON data1.rollno == data2.rollno").show()
Output:
+-----+------+-------------------+------+------+
|marks|rollno| student name|gender|rollno|
+-----+------+-------------------+------+------+
| 98| 1|Gottumukkala Sravan| Male| 1|
| 89| 2| Gottumukkala Bobby|Female| 2|
+-----+------+-------------------+------+------+
Since we join on the rollno column, only the 2 rows common (matching) to both dataframes, with rollno 1 and 2, are returned.
LEFT JOIN
LEFT JOIN combines the two dataframes by taking all rows from the first dataframe and only the matching rows from the second, based on the join column. For unmatched rows, the columns coming from the second dataframe are filled with null.
Syntax:
spark.sql("select * from view1 LEFT JOIN view2 ON view1.column == view2.column")
where,
- view1 is the first PySpark dataframe view
- view2 is the second PySpark dataframe view
Example:
Perform Left Join based on rollno column
# the dataframes and temporary views (data1, data2) were created above
# perform sql - left join
spark.sql("select * from data1 LEFT JOIN data2 ON data1.rollno == data2.rollno").show()
Output:
+-----+------+-------------------+------+------+
|marks|rollno| student name|gender|rollno|
+-----+------+-------------------+------+------+
| 98| 1|Gottumukkala Sravan| Male| 1|
| 89| 2| Gottumukkala Bobby|Female| 2|
| 90| 3| Lavu Ojaswi| null| null|
| 78| 4| Lavu Gnanesh| null| null|
| 100| 5| Chennupati Rohith| null| null|
+-----+------+-------------------+------+------+
We performed a LEFT JOIN, so for rows 3, 4 and 5, which have no match in the 2nd dataframe (data2), the data2 columns are filled with null: those rollnos are not present in data2.
RIGHT JOIN
RIGHT JOIN combines the two dataframes by taking all rows from the second dataframe and only the matching rows from the first, based on the join column. For unmatched rows, the columns coming from the first dataframe are filled with null.
Syntax:
spark.sql("select * from view1 RIGHT JOIN view2 ON view1.column == view2.column")
where,
- view1 is the first PySpark dataframe view
- view2 is the second PySpark dataframe view
Example:
In this example, we will perform RIGHT JOIN on rollno column from two dataframes.
# the dataframes and temporary views (data1, data2) were created above
# perform sql - right join
spark.sql("select * from data1 RIGHT JOIN data2 ON data1.rollno == data2.rollno").show()
Output:
+-----+------+-------------------+------+------+
|marks|rollno| student name|gender|rollno|
+-----+------+-------------------+------+------+
| 98| 1|Gottumukkala Sravan| Male| 1|
| 89| 2| Gottumukkala Bobby|Female| 2|
| null| null| null| Male| 6|
+-----+------+-------------------+------+------+
We performed a RIGHT JOIN, so for rollno 6, which has no match in the 1st dataframe (data1), the data1 columns are filled with null: that row is not present in data1.
FULL JOIN
FULL JOIN (full outer join) combines the two dataframes by taking all rows from both. For any unmatched row, the columns from the other dataframe are filled with null.
Syntax:
spark.sql("select * from view1 FULL JOIN view2 ON view1.column == view2.column")
where,
- view1 is the first PySpark dataframe view
- view2 is the second PySpark dataframe view
Example:
In this example, we will perform FULL JOIN on rollno column from two dataframes.
# the dataframes and temporary views (data1, data2) were created above
# perform sql - full join
spark.sql("select * from data1 FULL JOIN data2 ON data1.rollno == data2.rollno").show()
Output:
+-----+------+-------------------+------+------+
|marks|rollno| student name|gender|rollno|
+-----+------+-------------------+------+------+
| 98| 1|Gottumukkala Sravan| Male| 1|
| 89| 2| Gottumukkala Bobby|Female| 2|
| 90| 3| Lavu Ojaswi| null| null|
| 78| 4| Lavu Gnanesh| null| null|
| 100| 5| Chennupati Rohith| null| null|
| null| null| null| Male| 6|
+-----+------+-------------------+------+------+
We performed a FULL JOIN, so null is returned in the data2 columns for rollno 3, 4 and 5 (those rollnos are not available in data2), and null is returned in the data1 columns for rollno 6 (that rollno is not available in data1).
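To summarize, the only difference between the four join types is which join-key values survive into the result. A small pure-Python sketch (no Spark required, with None playing the role of SQL null) mirrors the rollno data above:

```python
# rollno -> marks (data1) and rollno -> gender (data2), as plain dicts
marks = {1: 98, 2: 89, 3: 90, 4: 78, 5: 100}
gender = {1: 'Male', 2: 'Female', 6: 'Male'}

def join_keys(left, right, how):
    """Return the rollnos that appear in the result of each join type."""
    if how == 'inner':
        return sorted(left.keys() & right.keys())   # matched on both sides
    if how == 'left':
        return sorted(left.keys())                  # every row of data1
    if how == 'right':
        return sorted(right.keys())                 # every row of data2
    if how == 'full':
        return sorted(left.keys() | right.keys())   # everything from both
    raise ValueError(how)

for how in ('inner', 'left', 'right', 'full'):
    # .get() returns None where a rollno is unmatched, like SQL null
    rows = [(k, marks.get(k), gender.get(k)) for k in join_keys(marks, gender, how)]
    print(how, rows)
```

The printed rows reproduce the shapes of the four outputs above: 2 rows for inner, 5 for left, 3 for right and 6 for full.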
About the Author
Gottumukkala Sravan Kumar 171FA07058
B.Tech (Hon's) - IT from Vignan's University.
Published 1400+ Technical Articles on Python, R, Swift, Java, C#, LISP, PHP - MySQL and Machine Learning
Published Date: Jun 14, 2024