In this PySpark tutorial, we will see how to perform anti join on two PySpark dataframes.
Let's create two pyspark dataframes and before that we have to install pyspark module.
pip install pyspark
First PySpark DataFrame:
It has three columns with 5 rows.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
# create the dataframe from the values
data1 = spark.createDataFrame(values)
#pdisplay first dataframe
data1.show()
Output:
+-----+------+-------------------+
|marks|rollno| student name|
+-----+------+-------------------+
| 98| 1|Gottumukkala Sravan|
| 89| 2| Gottumukkala Bobby|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
+-----+------+-------------------+
First PySpark DataFrame:
It has two columns with 3 rows.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'gender': 'Male'},
{'rollno': 2, 'gender': 'Female'},
{'rollno': 6, 'gender': 'Male'},
]
# create the dataframe from the values
data2 = spark.createDataFrame(values)
#pdisplay second dataframe
data2.show()
Output:
+------+------+
|gender|rollno|
+------+------+
| Male| 1|
|Female| 2|
| Male| 6|
+------+------+
From the above dataframes, we can observe that rollno column is common in both the dataframes. anti join is quite opposite to semi join. semi join will return only 1st dataframe after performing join by getting only similar rows from both the dataframes.Anti Join will return rows except similar rows. We can perform anti join uisng ANTI and LEFTANTI keywords.
Syntax:
data1.join(data2,data1.column== data2.column,"anti")
data1.join(data2,data1.column== data2.column,"leftanti")
where,
data1 is the 1st pyspark dataframe and data2 is the 2nd pyspark dataframe. column is the column name in which the dataframes are joined based on this column.
Example 1:
Perform anti join on the above two dataframes with anti keyword.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
# create the dataframe from the values
data1 = spark.createDataFrame(values)
#create a list of data
values = [{'rollno': 1, 'gender': 'Male'},
{'rollno': 2, 'gender': 'Female'},
{'rollno': 6, 'gender': 'Male'},
]
# create the dataframe from the values
data2 = spark.createDataFrame(values)
#perform anti join
data1.join(data2,data1.rollno == data2.rollno,"anti").show()
Output:
+-----+------+-----------------+
|marks|rollno| student name|
+-----+------+-----------------+
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5|Chennupati Rohith|
+-----+------+-----------------+
We can see that 1,2 rollno are common in both the dataframes. so anti join returns the remaining rows except rollno-1,2 from 1st dataframe.
Example 2:
Perform anti join on the above two dataframes with leftanti keyword.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
# create the dataframe from the values
data1 = spark.createDataFrame(values)
#create a list of data
values = [{'rollno': 1, 'gender': 'Male'},
{'rollno': 2, 'gender': 'Female'},
{'rollno': 6, 'gender': 'Male'},
]
# create the dataframe from the values
data2 = spark.createDataFrame(values)
#perform anti join
data1.join(data2,data1.rollno == data2.rollno,"leftanti").show()
Output:
+-----+------+-----------------+
|marks|rollno| student name|
+-----+------+-----------------+
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5|Chennupati Rohith|
+-----+------+-----------------+
We can see that 1,2 rollno are common in both the dataframes. so anti join returns the remaining rows except rollno-1,2 from 1st dataframe.
Would you like to see your article here on tutorialsinhand.
Join
Write4Us program by tutorialsinhand.com
About the Author
Gottumukkala Sravan Kumar 171FA07058
B.Tech (Hon's) - IT from Vignan's University.
Published 1400+ Technical Articles on Python, R, Swift, Java, C#, LISP, PHP - MySQL and Machine Learning
Page Views :
Published Date :
Jun 14,2024