In this PySpark tutorial, we will see how to perform an inner join on two PySpark DataFrames.
Let's create the two DataFrames. Before that, we have to install the pyspark module:
pip install pyspark
First PySpark DataFrame:
It has three columns and 5 rows.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
# create the dataframe from the values
data1 = spark.createDataFrame(values)
# display the first dataframe
data1.show()
Output:
+-----+------+-------------------+
|marks|rollno| student name|
+-----+------+-------------------+
| 98| 1|Gottumukkala Sravan|
| 89| 2| Gottumukkala Bobby|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
+-----+------+-------------------+
Second PySpark DataFrame:
It has two columns and 3 rows.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'gender': 'Male'},
{'rollno': 2, 'gender': 'Female'},
{'rollno': 6, 'gender': 'Male'},
]
# create the dataframe from the values
data2 = spark.createDataFrame(values)
# display the second dataframe
data2.show()
Output:
+------+------+
|gender|rollno|
+------+------+
| Male| 1|
|Female| 2|
| Male| 6|
+------+------+
From the above DataFrames, we can observe that the rollno column is common to both, so we will join the two DataFrames on it using an INNER JOIN.
An INNER JOIN returns only the rows that match in both DataFrames, based on the column provided.
The keyword used is inner.
Syntax:
data1.join(data2, data1.column == data2.column, "inner")
where,
data1 is the 1st PySpark DataFrame, data2 is the 2nd PySpark DataFrame, and column is the name of the column on which the DataFrames are joined.
Example:
Perform an inner join on the above two DataFrames.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
# create the dataframe from the values
data1 = spark.createDataFrame(values)
#create a list of data
values = [{'rollno': 1, 'gender': 'Male'},
{'rollno': 2, 'gender': 'Female'},
{'rollno': 6, 'gender': 'Male'},
]
# create the dataframe from the values
data2 = spark.createDataFrame(values)
# perform inner join on the rollno column
data1.join(data2,data1.rollno == data2.rollno,"inner").show()
Output:
+-----+------+-------------------+------+------+
|marks|rollno| student name|gender|rollno|
+-----+------+-------------------+------+------+
| 98| 1|Gottumukkala Sravan| Male| 1|
| 89| 2| Gottumukkala Bobby|Female| 2|
+-----+------+-------------------+------+------+
We performed the inner join on the rollno column of both DataFrames, so the result contains only the 2 rows that match in both DataFrames: those with rollno 1 and 2. Note that because the join condition references the rollno column of both DataFrames, that column appears twice in the output.
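To make the matching logic concrete, the same inner-join semantics can be sketched in plain Python over the tutorial's data (the dictionaries below mirror the two DataFrames):

```python
# rows of the first DataFrame (rollno + marks)
left = [{'rollno': 1, 'marks': 98}, {'rollno': 2, 'marks': 89},
        {'rollno': 3, 'marks': 90}, {'rollno': 4, 'marks': 78},
        {'rollno': 5, 'marks': 100}]
# rows of the second DataFrame (rollno + gender)
right = [{'rollno': 1, 'gender': 'Male'}, {'rollno': 2, 'gender': 'Female'},
         {'rollno': 6, 'gender': 'Male'}]

# index the right side by the join key
right_by_key = {row['rollno']: row for row in right}

# keep only left rows whose key also exists on the right,
# merging the matching right row into each result
inner = [{**l, **right_by_key[l['rollno']]}
         for l in left if l['rollno'] in right_by_key]
print(inner)  # rows for rollno 1 and 2 only
```

Rollno 3, 4, and 5 exist only on the left, and rollno 6 only on the right, so none of them survive the inner join, matching the PySpark output above.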
About the Author
Gottumukkala Sravan Kumar 171FA07058
B.Tech (Hon's) - IT from Vignan's University.
Published 1400+ Technical Articles on Python, R, Swift, Java, C#, LISP, PHP - MySQL and Machine Learning
Published Date: Jun 14, 2024