
OUTER JOIN PySpark



In this PySpark tutorial, we will see how to perform an outer join on two PySpark dataframes.

Let's create two PySpark dataframes. Before that, we have to install the pyspark module:

pip install pyspark

First PySpark DataFrame:

It has three columns and 5 rows.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]

# create the dataframe from the values
data1 = spark.createDataFrame(values)

# display the first dataframe
data1.show()

Output:

+-----+------+-------------------+
|marks|rollno|       student name|
+-----+------+-------------------+
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|  100|     5|  Chennupati Rohith|
+-----+------+-------------------+

Second PySpark DataFrame:

It has two columns and 3 rows.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'gender': 'Male'},
          {'rollno': 2, 'gender': 'Female'},
          {'rollno': 6, 'gender': 'Male'}]

# create the dataframe from the values
data2 = spark.createDataFrame(values)

# display the second dataframe
data2.show()

Output:

+------+------+
|gender|rollno|
+------+------+
|  Male|     1|
|Female|     2|
|  Male|     6|
+------+------+

From the above dataframes, we can observe that the rollno column is common to both, so we will join the two dataframes on it using OUTER JOIN.

OUTER JOIN returns all the rows from both dataframes and fills in null values wherever rows are unmatched.
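Before looking at the PySpark syntax, the idea itself can be sketched in plain Python: keep the union of keys from both sides, and use None where a key has no partner. The dicts below mirror the two dataframes above.

```python
# marks keyed by rollno (mirrors data1) and gender keyed by rollno (mirrors data2)
marks = {1: 98, 2: 89, 3: 90, 4: 78, 5: 100}
gender = {1: 'Male', 2: 'Female', 6: 'Male'}

# full outer join: union of all keys, None where a side has no match
all_rollnos = sorted(set(marks) | set(gender))
joined = {r: (marks.get(r), gender.get(r)) for r in all_rollnos}

for rollno, (mark, g) in joined.items():
    print(rollno, mark, g)
```

Rollnos 3, 4 and 5 pair their marks with None, and rollno 6 pairs None with 'Male' — exactly the null pattern the PySpark examples below produce.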

 

Syntax:

data1.join(data2, data1.column == data2.column, "outer")
data1.join(data2, data1.column == data2.column, "full")
data1.join(data2, data1.column == data2.column, "fullouter")

where,

data1 is the 1st PySpark dataframe, data2 is the 2nd PySpark dataframe, and column is the common column on which the two dataframes are joined.

Example 1:

Perform outer join on the above two dataframes with the "outer" keyword.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]

# create the first dataframe from the values
data1 = spark.createDataFrame(values)

# create a list of data
values = [{'rollno': 1, 'gender': 'Male'},
          {'rollno': 2, 'gender': 'Female'},
          {'rollno': 6, 'gender': 'Male'}]

# create the second dataframe from the values
data2 = spark.createDataFrame(values)

# perform outer join
data1.join(data2, data1.rollno == data2.rollno, "outer").show()

Output:

+-----+------+-------------------+------+------+
|marks|rollno|       student name|gender|rollno|
+-----+------+-------------------+------+------+
|   98|     1|Gottumukkala Sravan|  Male|     1|
|   89|     2| Gottumukkala Bobby|Female|     2|
|   90|     3|        Lavu Ojaswi|  null|  null|
|   78|     4|       Lavu Gnanesh|  null|  null|
|  100|     5|  Chennupati Rohith|  null|  null|
| null|  null|               null|  Male|     6|
+-----+------+-------------------+------+------+

We performed the outer join using the "outer" keyword. In the result, the data2 columns (gender, rollno) hold null for rollno 3, 4 and 5, because those rollnos are not present in data2; likewise, the data1 columns hold null for rollno 6, because rollno 6 is not present in data1.

 

Example 2:

Perform outer join on the above two dataframes with the "full" keyword.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]

# create the first dataframe from the values
data1 = spark.createDataFrame(values)

# create a list of data
values = [{'rollno': 1, 'gender': 'Male'},
          {'rollno': 2, 'gender': 'Female'},
          {'rollno': 6, 'gender': 'Male'}]

# create the second dataframe from the values
data2 = spark.createDataFrame(values)

# perform outer join
data1.join(data2, data1.rollno == data2.rollno, "full").show()

Output:

+-----+------+-------------------+------+------+
|marks|rollno|       student name|gender|rollno|
+-----+------+-------------------+------+------+
|   98|     1|Gottumukkala Sravan|  Male|     1|
|   89|     2| Gottumukkala Bobby|Female|     2|
|   90|     3|        Lavu Ojaswi|  null|  null|
|   78|     4|       Lavu Gnanesh|  null|  null|
|  100|     5|  Chennupati Rohith|  null|  null|
| null|  null|               null|  Male|     6|
+-----+------+-------------------+------+------+

We performed the outer join using the "full" keyword. The result is identical to Example 1: the data2 columns hold null for rollno 3, 4 and 5, because those rollnos are not present in data2, and the data1 columns hold null for rollno 6, because rollno 6 is not present in data1.

 

Example 3:

Perform outer join on the above two dataframes with the "fullouter" keyword.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]

# create the first dataframe from the values
data1 = spark.createDataFrame(values)

# create a list of data
values = [{'rollno': 1, 'gender': 'Male'},
          {'rollno': 2, 'gender': 'Female'},
          {'rollno': 6, 'gender': 'Male'}]

# create the second dataframe from the values
data2 = spark.createDataFrame(values)

# perform outer join
data1.join(data2, data1.rollno == data2.rollno, "fullouter").show()

Output:

+-----+------+-------------------+------+------+
|marks|rollno|       student name|gender|rollno|
+-----+------+-------------------+------+------+
|   98|     1|Gottumukkala Sravan|  Male|     1|
|   89|     2| Gottumukkala Bobby|Female|     2|
|   90|     3|        Lavu Ojaswi|  null|  null|
|   78|     4|       Lavu Gnanesh|  null|  null|
|  100|     5|  Chennupati Rohith|  null|  null|
| null|  null|               null|  Male|     6|
+-----+------+-------------------+------+------+

We performed the outer join using the "fullouter" keyword. Again the result is the same: the data2 columns hold null for rollno 3, 4 and 5, because those rollnos are not present in data2, and the data1 columns hold null for rollno 6, because rollno 6 is not present in data1. All three keywords — "outer", "full" and "fullouter" — are interchangeable names for the same join type.



About the Author
Gottumukkala Sravan Kumar 171FA07058
B.Tech (Hons) in IT from Vignan's University. Published 1400+ technical articles on Python, R, Swift, Java, C#, LISP, PHP - MySQL and Machine Learning.
Published Date: Jun 14, 2024