SEMI JOIN PySpark

In this PySpark tutorial, we will see how to perform semi join on two PySpark dataframes.

Let's create two pyspark dataframes and before that we have to install pyspark module.

pip install pyspark

First PySpark DataFrame:

It has three columns with 5 rows.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

#create a  list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},

        {'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},

        {'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},

        {'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},

        {'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]


# create the dataframe from the values
data1 = spark.createDataFrame(values)

#pdisplay first dataframe
data1.show()

Output:

+-----+------+-------------------+
|marks|rollno|       student name|
+-----+------+-------------------+
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|  100|     5|  Chennupati Rohith|
+-----+------+-------------------+

First PySpark DataFrame:

It has two columns with 3 rows.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

#create a  list of data
values = [{'rollno': 1, 'gender': 'Male'},

        {'rollno': 2, 'gender': 'Female'},

        {'rollno': 6, 'gender': 'Male'},

        ]


# create the dataframe from the values
data2 = spark.createDataFrame(values)

#pdisplay second dataframe
data2.show()

Output:

+------+------+
|gender|rollno|
+------+------+
|  Male|     1|
|Female|     2|
|  Male|     6|
+------+------+

From the above dataframes, we can observe that rollno column is common in both the dataframes. semi join is similar to inner join. Inner join will join only similar rows from both the dataframes and display all the columns from both the dataframes. But the semi join will return only 1st dataframe after performing join. We can perform semi join uisng SEMI and LEFTSEMI keywords.

Syntax:

data1.join(data2,data1.column==  data2.column,"semi")
data1.join(data2,data1.column==  data2.column,"leftsemi")

where,

data1 is the 1st pyspark dataframe and data2 is the 2nd pyspark dataframe. column is the column name in which the dataframes are joined based on this column.

Example 1:

Perform semi join on the above two dataframes with semi keyword.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

#create a  list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},

        {'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},

        {'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},

        {'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},

        {'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]


# create the dataframe from the values
data1 = spark.createDataFrame(values)

#create a  list of data
values = [{'rollno': 1, 'gender': 'Male'},

        {'rollno': 2, 'gender': 'Female'},

        {'rollno': 6, 'gender': 'Male'},

        ]


# create the dataframe from the values
data2 = spark.createDataFrame(values)

#perform semi join
data1.join(data2,data1.rollno ==  data2.rollno,"semi").show()

Output:

+-----+------+-------------------+
|marks|rollno|       student name|
+-----+------+-------------------+
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
+-----+------+-------------------+

We can see that 1,2 rollno are common in both the dataframes. so semi join returns the rows from 1st dataframe.

Example 2:

Perform semi join on the above two dataframes with leftsemi keyword.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

#create a  list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},

        {'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},

        {'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},

        {'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},

        {'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]


# create the dataframe from the values
data1 = spark.createDataFrame(values)

#create a  list of data
values = [{'rollno': 1, 'gender': 'Male'},

        {'rollno': 2, 'gender': 'Female'},

        {'rollno': 6, 'gender': 'Male'},

        ]


# create the dataframe from the values
data2 = spark.createDataFrame(values)

#perform semi join
data1.join(data2,data1.rollno ==  data2.rollno,"leftsemi").show()

Output:

+-----+------+-------------------+
|marks|rollno|       student name|
+-----+------+-------------------+
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
+-----+------+-------------------+

We can see that 1,2 rollno are common in both the dataframes. so semi join returns the rows from 1st dataframe.

pyspark

Would you like to see your article here on tutorialsinhand. Join Write4Us program by tutorialsinhand.com

About the Author

Gottumukkala Sravan Kumar 171FA07058
B.Tech (Hon's) - IT from Vignan's University. Published 1400+ Technical Articles on Python, R, Swift, Java, C#, LISP, PHP - MySQL and Machine Learning

Page Views : Published Date : Jun 14,2024

Please Share this page

SEMI JOIN PySpark

SEMI JOIN PySpark

Related Articles

You might be interested in