In this PySpark tutorial, we will discuss what groupBy() is and how to use it with aggregate functions on a PySpark DataFrame.
Introduction:
A DataFrame in PySpark is a two-dimensional data structure that stores data in rows and columns: one dimension refers to the rows and the other to the columns.
Let's install the pyspark module before going further. Python modules are installed using the pip command.
Syntax:
pip install module_name
Installing PySpark:
pip install pyspark
Steps to create dataframe in PySpark:
1. Import the below modules
import pyspark
from pyspark.sql import SparkSession
2. Create a Spark app named tutorialsinhand using the getOrCreate() method
Syntax:
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
3. Create a list of values for the dataframe
4. Pass this list to the createDataFrame() method to create the PySpark dataframe
Syntax:
spark.createDataFrame(list of values)
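Putting the four steps together, here is a minimal runnable sketch; the two sample rows are taken from the examples below:
# import the required modules
import pyspark
from pyspark.sql import SparkSession
# create the spark app (reuses an existing session if one is running)
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
# step 3: a list of values, one dictionary per row
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89}]
# step 4: create and display the dataframe
data = spark.createDataFrame(values)
data.show()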
groupBy() is used to group rows that have the same value in a column.
If we perform aggregate operations such as sum() or min() together with groupBy(), the result is computed over the grouped rows.
Syntax:
data.groupBy("group_column").aggregate_function("column")
where group_column is the column whose matching values are grouped together and column is the column aggregated within each group.
aggregate_function can be any of the following:
- sum() - Returns the total sum in each group
- min() - Returns the minimum value in each group
- max() - Returns the maximum value in each group
- count() - Returns the number of values in each group
- avg() - Returns the average in each group
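For instance, with the data dataframe created in the sketch above, the shorthand syntax applies one aggregate directly to the grouped rows. As an assumed alternative (not used in the examples below), agg() with functions from pyspark.sql.functions can apply several aggregates at once:
from pyspark.sql import functions as F
# shorthand: one aggregate function per call
data.groupBy("student name").sum("marks").show()
# agg(): several aggregates over the same groups in one pass
data.groupBy("student name").agg(F.sum("marks"), F.avg("marks")).show()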
Example 1:
In this example, we will create a PySpark dataframe with 11 rows and 3 columns and apply the sum() aggregate function to each group.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 60},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 58},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 70}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#display dataframe
data.show()
#display sum by group
data.groupBy("student name").sum("marks").show()
Output:
From the dataframe, we can see that there are five unique student names. We group by the student name column and get the total sum of the marks column within each group.
+-----+------+-------------------+
|marks|rollno| student name|
+-----+------+-------------------+
| 98| 1|Gottumukkala Sravan|
| 89| 2| Gottumukkala Bobby|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
| 60| 3| Lavu Ojaswi|
| 58| 4| Lavu Gnanesh|
| 70| 5| Chennupati Rohith|
+-----+------+-------------------+
+-------------------+----------+
| student name|sum(marks)|
+-------------------+----------+
| Lavu Gnanesh| 214|
|Gottumukkala Sravan| 98|
| Lavu Ojaswi| 240|
| Gottumukkala Bobby| 89|
| Chennupati Rohith| 270|
+-------------------+----------+
Example 2:
In this example, we will create a PySpark dataframe with 11 rows and 3 columns and apply the min() aggregate function to each group.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 60},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 58},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 70}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#display dataframe
data.show()
#display minimum from each group
data.groupBy("student name").min("marks").show()
Output:
From the dataframe, we can see that there are five unique student names. We group by the student name column and get the minimum value of the marks column within each group.
+-----+------+-------------------+
|marks|rollno| student name|
+-----+------+-------------------+
| 98| 1|Gottumukkala Sravan|
| 89| 2| Gottumukkala Bobby|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
| 60| 3| Lavu Ojaswi|
| 58| 4| Lavu Gnanesh|
| 70| 5| Chennupati Rohith|
+-----+------+-------------------+
+-------------------+----------+
| student name|min(marks)|
+-------------------+----------+
| Lavu Gnanesh| 58|
|Gottumukkala Sravan| 98|
| Lavu Ojaswi| 60|
| Gottumukkala Bobby| 89|
| Chennupati Rohith| 70|
+-------------------+----------+
Example 3:
In this example, we will create a PySpark dataframe with 11 rows and 3 columns and apply the max() aggregate function to each group.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 60},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 58},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 70}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#display dataframe
data.show()
#display maximum from each group
data.groupBy("student name").max("marks").show()
Output:
+-----+------+-------------------+
|marks|rollno| student name|
+-----+------+-------------------+
| 98| 1|Gottumukkala Sravan|
| 89| 2| Gottumukkala Bobby|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
| 60| 3| Lavu Ojaswi|
| 58| 4| Lavu Gnanesh|
| 70| 5| Chennupati Rohith|
+-----+------+-------------------+
+-------------------+----------+
| student name|max(marks)|
+-------------------+----------+
| Lavu Gnanesh| 78|
|Gottumukkala Sravan| 98|
| Lavu Ojaswi| 90|
| Gottumukkala Bobby| 89|
| Chennupati Rohith| 100|
+-------------------+----------+
Example 4:
In this example, we will create a PySpark dataframe with 11 rows and 3 columns and apply the avg() aggregate function to each group.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 60},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 58},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 70}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#display dataframe
data.show()
#display average in each group
data.groupBy("student name").avg("marks").show()
Output:
+-----+------+-------------------+
|marks|rollno| student name|
+-----+------+-------------------+
| 98| 1|Gottumukkala Sravan|
| 89| 2| Gottumukkala Bobby|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
| 60| 3| Lavu Ojaswi|
| 58| 4| Lavu Gnanesh|
| 70| 5| Chennupati Rohith|
+-----+------+-------------------+
+-------------------+-----------------+
| student name| avg(marks)|
+-------------------+-----------------+
| Lavu Gnanesh|71.33333333333333|
|Gottumukkala Sravan| 98.0|
| Lavu Ojaswi| 80.0|
| Gottumukkala Bobby| 89.0|
| Chennupati Rohith| 90.0|
+-------------------+-----------------+
Example 5:
In this example, we will create a PySpark dataframe with 11 rows and 3 columns and apply the count() aggregate function to each group.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 60},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 58},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 70}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#display dataframe
data.show()
#display count in each group
data.groupBy("student name").count().show()
Output:
The count of rows in each group is returned.
Note - We don't specify a column name inside count(); passing one throws an error. A sketch of counting a specific column instead appears after the output below.
+-----+------+-------------------+
|marks|rollno| student name|
+-----+------+-------------------+
| 98| 1|Gottumukkala Sravan|
| 89| 2| Gottumukkala Bobby|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
| 60| 3| Lavu Ojaswi|
| 58| 4| Lavu Gnanesh|
| 70| 5| Chennupati Rohith|
+-----+------+-------------------+
+-------------------+-----+
| student name|count|
+-------------------+-----+
| Lavu Gnanesh| 3|
|Gottumukkala Sravan| 1|
| Lavu Ojaswi| 3|
| Gottumukkala Bobby| 1|
| Chennupati Rohith| 3|
+-------------------+-----+
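As noted above, count() takes no arguments here. To count only the non-null values of a particular column, one assumed alternative is F.count() from pyspark.sql.functions inside agg() - a minimal sketch:
from pyspark.sql import functions as F
# count() on grouped data takes no column argument
data.groupBy("student name").count().show()
# counting the non-null values of a specific column via agg()
data.groupBy("student name").agg(F.count("marks")).show()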