In this PySpark tutorial, we will discuss what groupBy() is and how to use it with aggregate functions on a PySpark DataFrame.
Introduction:
A DataFrame in PySpark is a two-dimensional data structure that stores data in rows and columns: one dimension refers to the rows and the other to the columns.
Let's install the pyspark module before going further. Python modules are installed using the pip command.
Syntax:
pip install module_name
Installing PySpark:
pip install pyspark
Steps to create dataframe in PySpark:
1. Import the below modules
import pyspark
from pyspark.sql import SparkSession
2. Create a Spark app named tutorialsinhand using the getOrCreate() method
Syntax:
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
3. Create a list of values for the dataframe
4. Pass this list to the createDataFrame() method to create the PySpark dataframe
Syntax:
spark.createDataFrame(list of values)
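Putting the four steps together, here is a minimal runnable sketch; the two sample rows are taken from the examples below:
# import the required modules
import pyspark
from pyspark.sql import SparkSession
# create the spark app (reuses an existing session if one is running)
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
# step 3: a list of values, one dictionary per row
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89}]
# step 4: create and display the dataframe
data = spark.createDataFrame(values)
data.show()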
groupBy() is used to group rows that have the same value in a column.
If we perform aggregate operations such as sum() or min() together with groupBy(), the result is computed over the grouped rows.
Syntax:
data.groupBy("group_column").aggregate_function("column")
where group_column is the column whose matching values are grouped together and column is the column aggregated within each group.
aggregate_function can be any of the following:
- sum() - Returns the total sum in each group
- min() - Returns the minimum value in each group
- max() - Returns the maximum value in each group
- count() - Returns the number of values in each group
- avg() - Returns the average in each group
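For instance, with the data dataframe created in the sketch above, the shorthand syntax applies one aggregate directly to the grouped rows. As an assumed alternative (not used in the examples below), agg() with functions from pyspark.sql.functions can apply several aggregates at once:
from pyspark.sql import functions as F
# shorthand: one aggregate function per call
data.groupBy("student name").sum("marks").show()
# agg(): several aggregates over the same groups in one pass
data.groupBy("student name").agg(F.sum("marks"), F.avg("marks")).show()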
Example 1:
In this example, we will create a PySpark dataframe with 11 rows and 3 columns and apply the sum() aggregate function to each group.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 60},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 58},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 70}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#display dataframe
data.show()
#display sum by group
data.groupBy("student name").sum("marks").show()
Output:
From the dataframe, we can see that there are five unique student names. We group by the student name column and get the total sum of the marks column within each group.
+-----+------+-------------------+
|marks|rollno| student name|
+-----+------+-------------------+
| 98| 1|Gottumukkala Sravan|
| 89| 2| Gottumukkala Bobby|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
| 60| 3| Lavu Ojaswi|
| 58| 4| Lavu Gnanesh|
| 70| 5| Chennupati Rohith|
+-----+------+-------------------+
+-------------------+----------+
| student name|sum(marks)|
+-------------------+----------+
| Lavu Gnanesh| 214|
|Gottumukkala Sravan| 98|
| Lavu Ojaswi| 240|
| Gottumukkala Bobby| 89|
| Chennupati Rohith| 270|
+-------------------+----------+
Example 2:
In this example, we will create a PySpark dataframe with 11 rows and 3 columns and apply the min() aggregate function to each group.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 60},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 58},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 70}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#display dataframe
data.show()
#display minimum from each group
data.groupBy("student name").min("marks").show()
Output:
From the dataframe, we can see that there are five unique student names. We group by the student name column and get the minimum value of the marks column within each group.
+-----+------+-------------------+
|marks|rollno| student name|
+-----+------+-------------------+
| 98| 1|Gottumukkala Sravan|
| 89| 2| Gottumukkala Bobby|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
| 60| 3| Lavu Ojaswi|
| 58| 4| Lavu Gnanesh|
| 70| 5| Chennupati Rohith|
+-----+------+-------------------+
+-------------------+----------+
| student name|min(marks)|
+-------------------+----------+
| Lavu Gnanesh| 58|
|Gottumukkala Sravan| 98|
| Lavu Ojaswi| 60|
| Gottumukkala Bobby| 89|
| Chennupati Rohith| 70|
+-------------------+----------+
Example 3:
In this example, we will create a PySpark dataframe with 11 rows and 3 columns and apply the max() aggregate function to each group.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 60},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 58},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 70}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#display dataframe
data.show()
#display maximum from each group
data.groupBy("student name").max("marks").show()
Output:
+-----+------+-------------------+
|marks|rollno| student name|
+-----+------+-------------------+
| 98| 1|Gottumukkala Sravan|
| 89| 2| Gottumukkala Bobby|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
| 60| 3| Lavu Ojaswi|
| 58| 4| Lavu Gnanesh|
| 70| 5| Chennupati Rohith|
+-----+------+-------------------+
+-------------------+----------+
| student name|max(marks)|
+-------------------+----------+
| Lavu Gnanesh| 78|
|Gottumukkala Sravan| 98|
| Lavu Ojaswi| 90|
| Gottumukkala Bobby| 89|
| Chennupati Rohith| 100|
+-------------------+----------+
Example 4:
In this example, we will create a PySpark dataframe with 11 rows and 3 columns and apply the avg() aggregate function to each group.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 60},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 58},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 70}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#display dataframe
data.show()
#display average in each group
data.groupBy("student name").avg("marks").show()
Output:
+-----+------+-------------------+
|marks|rollno| student name|
+-----+------+-------------------+
| 98| 1|Gottumukkala Sravan|
| 89| 2| Gottumukkala Bobby|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
| 60| 3| Lavu Ojaswi|
| 58| 4| Lavu Gnanesh|
| 70| 5| Chennupati Rohith|
+-----+------+-------------------+
+-------------------+-----------------+
| student name| avg(marks)|
+-------------------+-----------------+
| Lavu Gnanesh|71.33333333333333|
|Gottumukkala Sravan| 98.0|
| Lavu Ojaswi| 80.0|
| Gottumukkala Bobby| 89.0|
| Chennupati Rohith| 90.0|
+-------------------+-----------------+
Example 5:
In this example, we will create a PySpark dataframe with 11 rows and 3 columns and apply the count() aggregate function to each group.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 60},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 58},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 70}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#display dataframe
data.show()
#display count in each group
data.groupBy("student name").count().show()
Output:
The count of rows in each group is returned.
Note - We don't specify a column name inside count(); passing one throws an error. A sketch of counting a specific column instead appears after the output below.
+-----+------+-------------------+
|marks|rollno| student name|
+-----+------+-------------------+
| 98| 1|Gottumukkala Sravan|
| 89| 2| Gottumukkala Bobby|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
| 60| 3| Lavu Ojaswi|
| 58| 4| Lavu Gnanesh|
| 70| 5| Chennupati Rohith|
+-----+------+-------------------+
+-------------------+-----+
| student name|count|
+-------------------+-----+
| Lavu Gnanesh| 3|
|Gottumukkala Sravan| 1|
| Lavu Ojaswi| 3|
| Gottumukkala Bobby| 1|
| Chennupati Rohith| 3|
+-------------------+-----+
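As noted above, count() takes no arguments here. To count only the non-null values of a particular column, one assumed alternative is F.count() from pyspark.sql.functions inside agg() - a minimal sketch:
from pyspark.sql import functions as F
# count() on grouped data takes no column argument
data.groupBy("student name").count().show()
# counting the non-null values of a specific column via agg()
data.groupBy("student name").agg(F.count("marks")).show()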