In this PySpark tutorial, we will discuss how to filter the rows by using where() in PySpark DataFrame.
Introduction:
DataFrame in PySpark is an two dimensional data structure that will store data in two dimensional format. One dimension refers to a row and second dimension refers to a column, So It will store the data in rows and columns.
Let's install pyspark module before going to this. The command to install any module in python is "pip".
Syntax:
pip install module_name
Installing PySpark:
pip install pyspark
Steps to create dataframe in PySpark:
1. Import the below modules
import pyspark
from pyspark.sql import SparkSession
2. Create spark app named tutorialsinhand using getOrCreate() method
Syntax:
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
3. Create list of values for dataframe
4. Pass this list to createDataFrame() method to create pyspark dataframe
Syntax:
spark.createDataFrame(list of values)
Let's create PySpark DataFrame with 5 rows and 3 columns.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#display
data.show()
PySpark DataFrame:
+-----+------+-------------------+
|marks|rollno| student name|
+-----+------+-------------------+
| 98| 1|Gottumukkala Sravan|
| 89| 2| Gottumukkala Bobby|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
+-----+------+-------------------+
where() is used to filter the rows in pyspark dataframe using conditional operators.
Let's see different scenarios to filter the rows by applying different conditions.
Scenario - 1 : Using operator
We can use relational operators for conditions. based on the conditions, we will get the results.
Syntax:
dataframe.where(dataframe.column_name operator value)
where, operator refers to the relational operator to check the condition and value is used to compare with values in each column in the pyspark dataframe.
Example:
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#get the rows where marks are greater than 87
print(data.where(data.marks > 87).collect())
print()
#get the rows where rollno is less than 3
print(data.where(data.rollno <3).collect())
print()
#get the row where rollno equal to 1
print(data.where(data.rollno ==1).collect())
Output:
In the first output, we are getting the rows from the dataframe where marks are greater than 87.
In the second output, we are getting the rows where values in rollno column are less than3.
In the last output, we are getting row from rollno column where values equals to 1.
[Row(marks=98, rollno=1, student name='Gottumukkala Sravan'), Row(marks=89, rollno=2, student name='Gottumukkala Bobby'), Row(marks=90, rollno=3, student name='Lavu Ojaswi'), Row(marks=100, rollno=5, student name='Chennupati Rohith')]
[Row(marks=98, rollno=1, student name='Gottumukkala Sravan'), Row(marks=89, rollno=2, student name='Gottumukkala Bobby')]
[Row(marks=98, rollno=1, student name='Gottumukkala Sravan')]
Scenario - 2 : Using SQL Expression
We can use SQL expression inside where() method, this will work as condition.
Syntax:
dataframe.where("SQL EXPRESSION")
dataframe.where("column_name operator value")
where, operator refers to the relational operator to check the condition in given column( column_name) and value is used to compare with values in each column in the pyspark dataframe.
Example:
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#get the rows where marks are greater than 87
print(data.where("marks > 87").collect())
print()
#get the rows where rollno is less than 3
print(data.where("rollno <3").collect())
print()
#get the row where rollno equal to 1
print(data.where("rollno ==1").collect())
Output:
In the first output, we are getting the rows from the dataframe where marks are greater than 87.
In the second output, we are getting the rows where values in rollno column are less than3.
In the last output, we are getting row from rollno column where values equals to 1.
[Row(marks=98, rollno=1, student name='Gottumukkala Sravan'), Row(marks=89, rollno=2, student name='Gottumukkala Bobby'), Row(marks=90, rollno=3, student name='Lavu Ojaswi'), Row(marks=100, rollno=5, student name='Chennupati Rohith')]
[Row(marks=98, rollno=1, student name='Gottumukkala Sravan'), Row(marks=89, rollno=2, student name='Gottumukkala Bobby')]
[Row(marks=98, rollno=1, student name='Gottumukkala Sravan')]
Scenario 3 : Filtering using string functions
In this case, we are using string in-built functions performed on string value columns in pyspark DataFrame.
ti return only particular rows. so these will act as conditons in where() method.
1. startswith()
Checks whether the value starts with the given character.
Syntax:
dataframe.where(dataframe["column_name"].startswith("character"))
where., character represents the letter to be compared with first character of values in particular column (column_name)
2. endswith()
Checks whether the value ends with the given character.
Syntax:
dataframe.where(dataframe["column_name"].endswith("character"))
where., character represents the letter to be compared with last character of values in particular column (column_name)
3. contains()
Checks whether the value contains the character or not.
Syntax:
dataframe.where(dataframe["column_name"].contains("character"))
where., character represents the letter present in values of column(column_name) or not.
For all three functions, if charcater exists/matches, then rows will be returned.
Example:
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#get the rows where values in 'student name' column starts with 'L'
print(data.where(data["student name"].startswith("L")).collect())
print()
#get the rows where values in 'student name' column ends with 'y'
print(data.where(data["student name"].endswith("y")).collect())
print()
#get the rows where values in 'student name' contains 'i'
print(data.where(data["student name"].contains("i")).collect())
Output:
In the first output, we will get the rows where values in 'student name' column starts with 'L'
In the second output, we will get the rows where values in 'student name' column ends with 'y'
In the last output, we will get the rows where values in 'student name' contains 'i'
[Row(marks=90, rollno=3, student name='Lavu Ojaswi'), Row(marks=78, rollno=4, student name='Lavu Gnanesh')]
[Row(marks=89, rollno=2, student name='Gottumukkala Bobby')]
[Row(marks=90, rollno=3, student name='Lavu Ojaswi'), Row(marks=100, rollno=5, student name='Chennupati Rohith')]
Would you like to see your article here on tutorialsinhand.
Join
Write4Us program by tutorialsinhand.com
About the Author
Gottumukkala Sravan Kumar 171FA07058
B.Tech (Hon's) - IT from Vignan's University.
Published 1400+ Technical Articles on Python, R, Swift, Java, C#, LISP, PHP - MySQL and Machine Learning
Page Views :
Published Date :
Jun 12,2023