
PySpark SQL Joins



In this tutorial, we will perform SQL joins using PySpark.

Command to install PySpark:

pip install pyspark

To perform a join, we need two dataframes, so let's create two PySpark dataframes.

# import the required modules
import pyspark
from pyspark.sql import SparkSession

# create a Spark session for the app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]

# create the first dataframe from the values
data1 = spark.createDataFrame(values)

# create a list of data
values = [{'rollno': 1, 'gender': 'Male'},
          {'rollno': 2, 'gender': 'Female'},
          {'rollno': 6, 'gender': 'Male'}]

# create the second dataframe from the values
data2 = spark.createDataFrame(values)

# display both dataframes
data1.show()
data2.show()

Output:

The two PySpark dataframes:

+-----+------+-------------------+
|marks|rollno|       student name|
+-----+------+-------------------+
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|  100|     5|  Chennupati Rohith|
+-----+------+-------------------+

+------+------+
|gender|rollno|
+------+------+
|  Male|     1|
|Female|     2|
|  Male|     6|
+------+------+

Steps to perform Joins:

1.      Create temporary views for both dataframes.

We can create them using the createOrReplaceTempView() method.

Syntax:

input_dataframe.createOrReplaceTempView("view_name")

where view_name is the name given to the dataframe view.

2.      Perform joins using the spark.sql() function.

INNER JOIN

INNER JOIN combines the two dataframes by keeping only the rows that match in both dataframes, based on the join column.

Syntax:

spark.sql("select * from view1 INNER JOIN view2 ON view1.column == view2.column")

where,

  1. view1 is the first PySpark dataframe view
  2. view2 is the second PySpark dataframe view

Example:

In this example, we will perform an INNER JOIN on the rollno column of the two dataframes.

# import the required modules
import pyspark
from pyspark.sql import SparkSession

# create a Spark session for the app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]

# create the first dataframe from the values
data1 = spark.createDataFrame(values)

# create a list of data
values = [{'rollno': 1, 'gender': 'Male'},
          {'rollno': 2, 'gender': 'Female'},
          {'rollno': 6, 'gender': 'Male'}]

# create the second dataframe from the values
data2 = spark.createDataFrame(values)

# create a view for data1 named "data1"
data1.createOrReplaceTempView("data1")

# create a view for data2 named "data2"
data2.createOrReplaceTempView("data2")

# perform the SQL inner join
spark.sql("select * from data1 INNER JOIN data2 ON data1.rollno == data2.rollno").show()

Output:

+-----+------+-------------------+------+------+
|marks|rollno|       student name|gender|rollno|
+-----+------+-------------------+------+------+
|   98|     1|Gottumukkala Sravan|  Male|     1|
|   89|     2| Gottumukkala Bobby|Female|     2|
+-----+------+-------------------+------+------+

Since the inner join is performed on the rollno column, only the 2 rows that match in both dataframes (rollno 1 and 2) appear in the result.
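The matching rule an inner join applies can be sketched in plain Python. This is an illustration of the semantics only, not PySpark code; the lists of dictionaries mirror the two dataframes above:

```python
# Plain-Python sketch of inner-join semantics (illustration only, not PySpark)
students = [{'rollno': 1, 'name': 'Gottumukkala Sravan', 'marks': 98},
            {'rollno': 2, 'name': 'Gottumukkala Bobby', 'marks': 89},
            {'rollno': 3, 'name': 'Lavu Ojaswi', 'marks': 90},
            {'rollno': 4, 'name': 'Lavu Gnanesh', 'marks': 78},
            {'rollno': 5, 'name': 'Chennupati Rohith', 'marks': 100}]
genders = [{'rollno': 1, 'gender': 'Male'},
           {'rollno': 2, 'gender': 'Female'},
           {'rollno': 6, 'gender': 'Male'}]

# index the second table by the join key
gender_by_rollno = {g['rollno']: g['gender'] for g in genders}

# inner join: keep only rows whose rollno exists in BOTH tables
inner = [{**s, 'gender': gender_by_rollno[s['rollno']]}
         for s in students if s['rollno'] in gender_by_rollno]

print(inner)  # only rollno 1 and 2 match
```

Spark performs the same matching in a distributed way, but the result rows are the same: only keys present on both sides survive.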

LEFT JOIN

LEFT JOIN takes all rows from the first dataframe and only the matched rows from the second dataframe, based on the join column. Columns from the second dataframe are filled with null in unmatched rows.

Syntax:

spark.sql("select * from view1 LEFT JOIN view2 ON view1.column == view2.column")

where,

  1. view1 is the first PySpark dataframe view
  2. view2 is the second PySpark dataframe view

Example:

Perform a LEFT JOIN based on the rollno column:

# import the required modules
import pyspark
from pyspark.sql import SparkSession

# create a Spark session for the app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]

# create the first dataframe from the values
data1 = spark.createDataFrame(values)

# create a list of data
values = [{'rollno': 1, 'gender': 'Male'},
          {'rollno': 2, 'gender': 'Female'},
          {'rollno': 6, 'gender': 'Male'}]

# create the second dataframe from the values
data2 = spark.createDataFrame(values)

# create a view for data1 named "data1"
data1.createOrReplaceTempView("data1")

# create a view for data2 named "data2"
data2.createOrReplaceTempView("data2")

# perform the SQL left join
spark.sql("select * from data1 LEFT JOIN data2 ON data1.rollno == data2.rollno").show()

Output:

+-----+------+-------------------+------+------+
|marks|rollno|       student name|gender|rollno|
+-----+------+-------------------+------+------+
|   98|     1|Gottumukkala Sravan|  Male|     1|
|   89|     2| Gottumukkala Bobby|Female|     2|
|   90|     3|        Lavu Ojaswi|  null|  null|
|   78|     4|       Lavu Gnanesh|  null|  null|
|  100|     5|  Chennupati Rohith|  null|  null|
+-----+------+-------------------+------+------+

We performed the left join with the LEFT JOIN keyword. Null is placed in the data2 columns for the rows with rollno 3, 4 and 5: those rows exist in the first dataframe (data1) but have no match in the second dataframe (data2).
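The null-filling behaviour of a left join can be sketched in plain Python. Again this only illustrates the semantics, it is not PySpark code:

```python
# Plain-Python sketch of left-join semantics (illustration only, not PySpark)
students = [{'rollno': 1, 'name': 'Gottumukkala Sravan', 'marks': 98},
            {'rollno': 2, 'name': 'Gottumukkala Bobby', 'marks': 89},
            {'rollno': 3, 'name': 'Lavu Ojaswi', 'marks': 90},
            {'rollno': 4, 'name': 'Lavu Gnanesh', 'marks': 78},
            {'rollno': 5, 'name': 'Chennupati Rohith', 'marks': 100}]
genders = [{'rollno': 1, 'gender': 'Male'},
           {'rollno': 2, 'gender': 'Female'},
           {'rollno': 6, 'gender': 'Male'}]

# index the second table by the join key
gender_by_rollno = {g['rollno']: g['gender'] for g in genders}

# left join: EVERY row of the first table survives;
# the gender column becomes None (Spark's null) when there is no match
left = [{**s, 'gender': gender_by_rollno.get(s['rollno'])} for s in students]

print(left)  # 5 rows; rollno 3, 4, 5 get gender None
```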

RIGHT JOIN

RIGHT JOIN takes all rows from the second dataframe and only the matched rows from the first dataframe, based on the join column. Columns from the first dataframe are filled with null in unmatched rows.

Syntax:

spark.sql("select * from view1 RIGHT JOIN view2 ON view1.column == view2.column")

where,

  1. view1 is the first PySpark dataframe view
  2. view2 is the second PySpark dataframe view

Example:

In this example, we will perform a RIGHT JOIN on the rollno column of the two dataframes.

# import the required modules
import pyspark
from pyspark.sql import SparkSession

# create a Spark session for the app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]

# create the first dataframe from the values
data1 = spark.createDataFrame(values)

# create a list of data
values = [{'rollno': 1, 'gender': 'Male'},
          {'rollno': 2, 'gender': 'Female'},
          {'rollno': 6, 'gender': 'Male'}]

# create the second dataframe from the values
data2 = spark.createDataFrame(values)

# create a view for data1 named "data1"
data1.createOrReplaceTempView("data1")

# create a view for data2 named "data2"
data2.createOrReplaceTempView("data2")

# perform the SQL right join
spark.sql("select * from data1 RIGHT JOIN data2 ON data1.rollno == data2.rollno").show()

Output:

+-----+------+-------------------+------+------+
|marks|rollno|       student name|gender|rollno|
+-----+------+-------------------+------+------+
|   98|     1|Gottumukkala Sravan|  Male|     1|
|   89|     2| Gottumukkala Bobby|Female|     2|
| null|  null|               null|  Male|     6|
+-----+------+-------------------+------+------+

We performed the right join with the RIGHT JOIN keyword. Null is placed in the data1 columns for the row with rollno 6: that row exists in the second dataframe (data2) but has no match in the first dataframe (data1).
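A right join is a left join with the roles of the two tables swapped, which can also be sketched in plain Python (illustration of the semantics only, not PySpark code):

```python
# Plain-Python sketch of right-join semantics (illustration only, not PySpark)
students = [{'rollno': 1, 'name': 'Gottumukkala Sravan', 'marks': 98},
            {'rollno': 2, 'name': 'Gottumukkala Bobby', 'marks': 89},
            {'rollno': 3, 'name': 'Lavu Ojaswi', 'marks': 90},
            {'rollno': 4, 'name': 'Lavu Gnanesh', 'marks': 78},
            {'rollno': 5, 'name': 'Chennupati Rohith', 'marks': 100}]
genders = [{'rollno': 1, 'gender': 'Male'},
           {'rollno': 2, 'gender': 'Female'},
           {'rollno': 6, 'gender': 'Male'}]

# index the FIRST table by the join key this time
students_by_rollno = {s['rollno']: s for s in students}

# right join: every row of the second table survives;
# name and marks become None (Spark's null) when there is no match
right = [{'rollno': g['rollno'], 'gender': g['gender'],
          'name': students_by_rollno.get(g['rollno'], {}).get('name'),
          'marks': students_by_rollno.get(g['rollno'], {}).get('marks')}
         for g in genders]

print(right)  # 3 rows; rollno 6 gets name None and marks None
```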

 

FULL JOIN

FULL JOIN takes all rows from both dataframes. Where a row has no match in the other dataframe, the missing columns are filled with null.

Syntax:

spark.sql("select * from view1 FULL JOIN view2 ON view1.column == view2.column")

where,

  1. view1 is the first PySpark dataframe view
  2. view2 is the second PySpark dataframe view

Example:

# import the required modules
import pyspark
from pyspark.sql import SparkSession

# create a Spark session for the app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]

# create the first dataframe from the values
data1 = spark.createDataFrame(values)

# create a list of data
values = [{'rollno': 1, 'gender': 'Male'},
          {'rollno': 2, 'gender': 'Female'},
          {'rollno': 6, 'gender': 'Male'}]

# create the second dataframe from the values
data2 = spark.createDataFrame(values)

# create a view for data1 named "data1"
data1.createOrReplaceTempView("data1")

# create a view for data2 named "data2"
data2.createOrReplaceTempView("data2")

# perform the SQL full join
spark.sql("select * from data1 FULL JOIN data2 ON data1.rollno == data2.rollno").show()

Output:

+-----+------+-------------------+------+------+
|marks|rollno|       student name|gender|rollno|
+-----+------+-------------------+------+------+
|   98|     1|Gottumukkala Sravan|  Male|     1|
|   89|     2| Gottumukkala Bobby|Female|     2|
|   90|     3|        Lavu Ojaswi|  null|  null|
|   78|     4|       Lavu Gnanesh|  null|  null|
|  100|     5|  Chennupati Rohith|  null|  null|
| null|  null|               null|  Male|     6|
+-----+------+-------------------+------+------+

We performed the full (outer) join with the FULL JOIN keyword. Null is returned in the data2 columns for rollno 3, 4 and 5, because those rollno values are not present in data2, and null is returned in the data1 columns for rollno 6, because that rollno is not present in data1.
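A full join keeps the union of the join keys from both sides, which can be sketched in plain Python (illustration of the semantics only, not PySpark code):

```python
# Plain-Python sketch of full-join semantics (illustration only, not PySpark)
students = [{'rollno': 1, 'name': 'Gottumukkala Sravan', 'marks': 98},
            {'rollno': 2, 'name': 'Gottumukkala Bobby', 'marks': 89},
            {'rollno': 3, 'name': 'Lavu Ojaswi', 'marks': 90},
            {'rollno': 4, 'name': 'Lavu Gnanesh', 'marks': 78},
            {'rollno': 5, 'name': 'Chennupati Rohith', 'marks': 100}]
genders = [{'rollno': 1, 'gender': 'Male'},
           {'rollno': 2, 'gender': 'Female'},
           {'rollno': 6, 'gender': 'Male'}]

# index both tables by the join key
students_by_rollno = {s['rollno']: s for s in students}
gender_by_rollno = {g['rollno']: g['gender'] for g in genders}

# full join: take the UNION of the join keys from both tables;
# columns missing on either side become None (Spark's null)
all_rollnos = sorted(set(students_by_rollno) | set(gender_by_rollno))
full = [{'rollno': r,
         'marks': students_by_rollno.get(r, {}).get('marks'),
         'gender': gender_by_rollno.get(r)} for r in all_rollnos]

print(full)  # 6 rows: rollno 1-6
```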



About the Author
Gottumukkala Sravan Kumar 171FA07058
B.Tech (Hon's) - IT from Vignan's University. Published 1400+ Technical Articles on Python, R, Swift, Java, C#, LISP, PHP - MySQL and Machine Learning
Published Date: Jun 14, 2024
