Intersection on PySpark DataFrames

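In this PySpark tutorial, we will see how to perform an intersection of two dataframes.

An intersection returns only the rows that appear in both dataframes. It is a set operation, not a join: whole rows are compared, so both dataframes must have the same number of columns. Let's create two dataframes to work with.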

First, install the pyspark module:

pip install pyspark

Example:

The first dataframe is created with 7 rows and 3 columns, and the second dataframe with 8 rows and 3 columns.

# import the SparkSession entry point
from pyspark.sql import SparkSession

# create (or reuse) a Spark session for this app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# rows for the first dataframe (rollno 2 and 3 appear twice)
values1 = [
    {'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
    {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
    {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
    {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
    {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
    {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
    {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100},
]

# rows for the second dataframe (rollno 2 and 3 appear twice here as well)
values2 = [
    {'rollno': 11, 'student name': 'harish', 'marks': 48},
    {'rollno': 22, 'student name': 'deepak', 'marks': 69},
    {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
    {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
    {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
    {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
    {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
    {'rollno': 33, 'student name': 'Lasya', 'marks': 80},
]

# create the dataframes from the values
data1 = spark.createDataFrame(values1)
data2 = spark.createDataFrame(values2)

# display both dataframes
data1.show()
data2.show()

Output:

Two dataframes

+-----+------+-------------------+
|marks|rollno|       student name|
+-----+------+-------------------+
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|  100|     5|  Chennupati Rohith|
+-----+------+-------------------+

+-----+------+------------------+
|marks|rollno|      student name|
+-----+------+------------------+
|   48|    11|            harish|
|   69|    22|            deepak|
|   89|     2|Gottumukkala Bobby|
|   90|     3|       Lavu Ojaswi|
|   89|     2|Gottumukkala Bobby|
|   90|     3|       Lavu Ojaswi|
|   78|     4|      Lavu Gnanesh|
|   80|    33|             Lasya|
+-----+------+------------------+
  • intersect()

intersect() in PySpark returns a new dataframe containing only the rows that are present in both dataframes. Rows are compared across all columns. If a common row is duplicated in the dataframes, intersect() returns it only once.

Syntax:

dataframe1.intersect(dataframe2)

where dataframe1 is the first dataframe and dataframe2 is the second dataframe.

Example:

Intersect the above dataframes using intersect().

# import the SparkSession entry point
from pyspark.sql import SparkSession

# create (or reuse) a Spark session for this app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# rows for the first dataframe (rollno 2 and 3 appear twice)
values1 = [
    {'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
    {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
    {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
    {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
    {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
    {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
    {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100},
]

# rows for the second dataframe (rollno 2 and 3 appear twice here as well)
values2 = [
    {'rollno': 11, 'student name': 'harish', 'marks': 48},
    {'rollno': 22, 'student name': 'deepak', 'marks': 69},
    {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
    {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
    {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
    {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
    {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
    {'rollno': 33, 'student name': 'Lasya', 'marks': 80},
]

# create the dataframes from the values
data1 = spark.createDataFrame(values1)
data2 = spark.createDataFrame(values2)

# perform intersect: distinct rows present in both dataframes
data1.intersect(data2).show()

Output:

We can see that only the unique common rows are returned. The order of the rows in the result is not guaranteed.

+-----+------+------------------+
|marks|rollno|      student name|
+-----+------+------------------+
|   90|     3|       Lavu Ojaswi|
|   89|     2|Gottumukkala Bobby|
|   78|     4|      Lavu Gnanesh|
+-----+------+------------------+
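
The behaviour of intersect() can be pictured as an ordinary set intersection over whole rows. The following is only a plain-Python sketch of that idea, not how Spark executes the operation; the helper name intersect_sketch and the tuple layout (marks, rollno, student name) are assumptions made for illustration.

```python
# Rows of the two dataframes above, written as (marks, rollno, student name) tuples.
rows1 = [
    (98, 1, "Gottumukkala Sravan"),
    (89, 2, "Gottumukkala Bobby"),
    (89, 2, "Gottumukkala Bobby"),
    (90, 3, "Lavu Ojaswi"),
    (90, 3, "Lavu Ojaswi"),
    (78, 4, "Lavu Gnanesh"),
    (100, 5, "Chennupati Rohith"),
]
rows2 = [
    (48, 11, "harish"),
    (69, 22, "deepak"),
    (89, 2, "Gottumukkala Bobby"),
    (90, 3, "Lavu Ojaswi"),
    (89, 2, "Gottumukkala Bobby"),
    (90, 3, "Lavu Ojaswi"),
    (78, 4, "Lavu Gnanesh"),
    (80, 33, "Lasya"),
]


def intersect_sketch(left, right):
    """Hypothetical helper: distinct rows that occur in both lists."""
    # Converting to sets drops duplicates; & keeps only rows found on both sides.
    # sorted() is used only to make the printed order deterministic.
    return sorted(set(left) & set(right))


common = intersect_sketch(rows1, rows2)
for row in common:
    print(row)
```

The sketch yields the same three rows as the intersect() output above, each exactly once, even though two of them are duplicated in both inputs.
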
  • intersectAll()

intersectAll() in PySpark also returns only the rows that are present in both dataframes, but it preserves duplicates. If a common row appears m times in the first dataframe and n times in the second, it appears min(m, n) times in the result.

Syntax:

dataframe1.intersectAll(dataframe2)

where dataframe1 is the first dataframe and dataframe2 is the second dataframe.

Example:

Intersect the above dataframes using intersectAll().

# import the SparkSession entry point
from pyspark.sql import SparkSession

# create (or reuse) a Spark session for this app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# rows for the first dataframe (rollno 2 and 3 appear twice)
values1 = [
    {'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
    {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
    {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
    {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
    {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
    {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
    {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100},
]

# rows for the second dataframe (rollno 2 and 3 appear twice here as well)
values2 = [
    {'rollno': 11, 'student name': 'harish', 'marks': 48},
    {'rollno': 22, 'student name': 'deepak', 'marks': 69},
    {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
    {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
    {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
    {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
    {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
    {'rollno': 33, 'student name': 'Lasya', 'marks': 80},
]

# create the dataframes from the values
data1 = spark.createDataFrame(values1)
data2 = spark.createDataFrame(values2)

# perform intersectAll: common rows, with common duplicates kept
data1.intersectAll(data2).show()

Output:

+-----+------+------------------+
|marks|rollno|      student name|
+-----+------+------------------+
|   89|     2|Gottumukkala Bobby|
|   89|     2|Gottumukkala Bobby|
|   90|     3|       Lavu Ojaswi|
|   90|     3|       Lavu Ojaswi|
|   78|     4|      Lavu Gnanesh|
+-----+------+------------------+

We can see that intersectAll() keeps the duplicate rows that are common to both dataframes.
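
The duplicate-preserving behaviour of intersectAll() corresponds to a multiset intersection, where each row is kept as many times as it occurs on both sides. The following plain-Python sketch uses collections.Counter to illustrate this; it is not how Spark executes the operation, and the helper name intersect_all_sketch and the tuple layout (marks, rollno, student name) are assumptions made for illustration.

```python
from collections import Counter

# Rows of the two dataframes above, written as (marks, rollno, student name) tuples.
rows1 = [
    (98, 1, "Gottumukkala Sravan"),
    (89, 2, "Gottumukkala Bobby"),
    (89, 2, "Gottumukkala Bobby"),
    (90, 3, "Lavu Ojaswi"),
    (90, 3, "Lavu Ojaswi"),
    (78, 4, "Lavu Gnanesh"),
    (100, 5, "Chennupati Rohith"),
]
rows2 = [
    (48, 11, "harish"),
    (69, 22, "deepak"),
    (89, 2, "Gottumukkala Bobby"),
    (90, 3, "Lavu Ojaswi"),
    (89, 2, "Gottumukkala Bobby"),
    (90, 3, "Lavu Ojaswi"),
    (78, 4, "Lavu Gnanesh"),
    (80, 33, "Lasya"),
]


def intersect_all_sketch(left, right):
    """Hypothetical helper: common rows, each kept min(count_left, count_right) times."""
    # Counter & Counter keeps, for every row, the smaller of its two counts.
    shared = Counter(left) & Counter(right)
    # elements() repeats each row according to its count; sorted() fixes the order.
    return sorted(shared.elements())


common_all = intersect_all_sketch(rows1, rows2)
for row in common_all:
    print(row)
```

The sketch yields five rows: the rollno 2 and rollno 3 rows twice each, because they occur twice in both inputs, and the rollno 4 row once, matching the intersectAll() output above.
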


About the Author
Gottumukkala Sravan Kumar 171FA07058
B.Tech (Hon's) - IT from Vignan's University. Published 1400+ Technical Articles on Python, R, Swift, Java, C#, LISP, PHP - MySQL and Machine Learning
Published Date: Jun 14, 2024