
PySpark - groupBy()


In this PySpark tutorial, we will discuss what groupBy() is and how to use it with aggregate functions on a PySpark DataFrame.

Introduction:

A DataFrame in PySpark is a two-dimensional data structure that stores data in rows and columns: one dimension refers to a row and the other to a column.

Let's install the pyspark module before going further. The command to install any module in Python is "pip".

Syntax:

pip install module_name

Installing PySpark:

pip install pyspark

Steps to create a dataframe in PySpark:

1. Import the below modules

      import pyspark
      from pyspark.sql import SparkSession

2. Create a Spark app named tutorialsinhand using the getOrCreate() method

     Syntax:

     spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

3. Create a list of values for the dataframe

4. Pass this list to the createDataFrame() method to create the pyspark dataframe

    Syntax:
    spark.createDataFrame(list of values)

groupBy() is used to combine rows that have similar values in a column.

If we perform any aggregate operation such as sum() or min() with groupBy(), the result is computed from the grouped rows.

Syntax:

data.groupBy("group_column").aggregate_function("column")

where group_column is the column whose similar values are grouped together, and column is the column aggregated within each group.

aggregate_function can be one of the following:

  1. sum() - Returns the total sum of the values in each group
  2. min() - Returns the minimum value in each group
  3. max() - Returns the maximum value in each group
  4. count() - Returns the number of values in each group
  5. avg() - Returns the average of the values in each group
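
To make these semantics concrete, here is what each function computes on a single group, sketched in plain Python rather than PySpark. The values are Lavu Gnanesh's three marks from the examples below:

```python
# Marks for one group of rows (Lavu Gnanesh's three rows in the examples)
marks = [78, 78, 58]

total = sum(marks)        # sum()   -> 214
minimum = min(marks)      # min()   -> 58
maximum = max(marks)      # max()   -> 78
count = len(marks)        # count() -> 3
average = total / count   # avg()   -> 71.33...

print(total, minimum, maximum, count, average)
```

PySpark computes the same quantities, but per group and in parallel across the cluster.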

Example 1:

In this example, we will create a PySpark dataframe with 11 rows and 3 columns and apply the sum() aggregate function to each group.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 60},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 58},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 70}]


# create the dataframe from the values
data = spark.createDataFrame(values)

#display dataframe
data.show()

#display sum by group
data.groupBy("student name").sum("marks").show()

Output:

From the dataframe, we can see that there are five unique student names. We group by the student name column and get the total sum of the marks column in each group.

+-----+------+-------------------+
|marks|rollno|       student name|
+-----+------+-------------------+
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|  100|     5|  Chennupati Rohith|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|  100|     5|  Chennupati Rohith|
|   60|     3|        Lavu Ojaswi|
|   58|     4|       Lavu Gnanesh|
|   70|     5|  Chennupati Rohith|
+-----+------+-------------------+

+-------------------+----------+
|       student name|sum(marks)|
+-------------------+----------+
|       Lavu Gnanesh|       214|
|Gottumukkala Sravan|        98|
|        Lavu Ojaswi|       240|
| Gottumukkala Bobby|        89|
|  Chennupati Rohith|       270|
+-------------------+----------+
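
The sum(marks) column above can be cross-checked in plain Python (not PySpark) by grouping the same values list by hand:

```python
# Same 11 rows as in the example above
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 60},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 58},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 70}]

# accumulate marks per student name, like groupBy("student name").sum("marks")
sums = {}
for row in values:
    sums[row['student name']] = sums.get(row['student name'], 0) + row['marks']

print(sums)
```

The totals match the sum(marks) column: 214 for Lavu Gnanesh, 240 for Lavu Ojaswi, 270 for Chennupati Rohith, 98 for Gottumukkala Sravan, and 89 for Gottumukkala Bobby.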

Example 2:

In this example, we will create a PySpark dataframe with 11 rows and 3 columns and apply the min() aggregate function to each group.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 60},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 58},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 70}]


# create the dataframe from the values
data = spark.createDataFrame(values)

#display dataframe
data.show()

#display minimum from each group
data.groupBy("student name").min("marks").show()

Output:

From the dataframe, we can see that there are five unique student names. We group by the student name column and get the minimum value of the marks column in each group.

+-----+------+-------------------+
|marks|rollno|       student name|
+-----+------+-------------------+
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|  100|     5|  Chennupati Rohith|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|  100|     5|  Chennupati Rohith|
|   60|     3|        Lavu Ojaswi|
|   58|     4|       Lavu Gnanesh|
|   70|     5|  Chennupati Rohith|
+-----+------+-------------------+

+-------------------+----------+
|       student name|min(marks)|
+-------------------+----------+
|       Lavu Gnanesh|        58|
|Gottumukkala Sravan|        98|
|        Lavu Ojaswi|        60|
| Gottumukkala Bobby|        89|
|  Chennupati Rohith|        70|
+-------------------+----------+

Example 3:

In this example, we will create a PySpark dataframe with 11 rows and 3 columns and apply the max() aggregate function to each group.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 60},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 58},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 70}]


# create the dataframe from the values
data = spark.createDataFrame(values)

#display dataframe
data.show()

#display maximum from each group
data.groupBy("student name").max("marks").show()

Output:

+-----+------+-------------------+
|marks|rollno|       student name|
+-----+------+-------------------+
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|  100|     5|  Chennupati Rohith|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|  100|     5|  Chennupati Rohith|
|   60|     3|        Lavu Ojaswi|
|   58|     4|       Lavu Gnanesh|
|   70|     5|  Chennupati Rohith|
+-----+------+-------------------+

+-------------------+----------+
|       student name|max(marks)|
+-------------------+----------+
|       Lavu Gnanesh|        78|
|Gottumukkala Sravan|        98|
|        Lavu Ojaswi|        90|
| Gottumukkala Bobby|        89|
|  Chennupati Rohith|       100|
+-------------------+----------+
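
The min() and max() results above can be cross-checked in plain Python on the three students that appear more than once in the data:

```python
# Marks per student for the three repeated names in the example
marks_by_name = {
    'Lavu Gnanesh': [78, 78, 58],
    'Lavu Ojaswi': [90, 90, 60],
    'Chennupati Rohith': [100, 100, 70],
}

# per-group minimum and maximum, like min("marks") and max("marks")
minimums = {name: min(m) for name, m in marks_by_name.items()}
maximums = {name: max(m) for name, m in marks_by_name.items()}

print(minimums)
print(maximums)
```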

Example 4:

In this example, we will create a PySpark dataframe with 11 rows and 3 columns and apply the avg() aggregate function to each group.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 60},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 58},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 70}]


# create the dataframe from the values
data = spark.createDataFrame(values)

#display dataframe
data.show()

#display average in each group
data.groupBy("student name").avg("marks").show()

Output:

+-----+------+-------------------+
|marks|rollno|       student name|
+-----+------+-------------------+
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|  100|     5|  Chennupati Rohith|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|  100|     5|  Chennupati Rohith|
|   60|     3|        Lavu Ojaswi|
|   58|     4|       Lavu Gnanesh|
|   70|     5|  Chennupati Rohith|
+-----+------+-------------------+

+-------------------+-----------------+
|       student name|       avg(marks)|
+-------------------+-----------------+
|       Lavu Gnanesh|71.33333333333333|
|Gottumukkala Sravan|             98.0|
|        Lavu Ojaswi|             80.0|
| Gottumukkala Bobby|             89.0|
|  Chennupati Rohith|             90.0|
+-------------------+-----------------+
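
The avg(marks) column, including the non-terminating 71.33... for Lavu Gnanesh, can be reproduced in plain Python: the average in each group is simply the group's total divided by its row count.

```python
# Marks per student for the three repeated names in the example
marks_by_name = {
    'Lavu Gnanesh': [78, 78, 58],
    'Lavu Ojaswi': [90, 90, 60],
    'Chennupati Rohith': [100, 100, 70],
}

# avg() in each group = sum of the group / number of rows in the group
averages = {name: sum(m) / len(m) for name, m in marks_by_name.items()}
print(averages)
```

For example, Lavu Gnanesh's average is (78 + 78 + 58) / 3 = 214 / 3 = 71.333..., matching the table above.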

Example 5:

In this example, we will create a PySpark dataframe with 11 rows and 3 columns and apply the count() aggregate function to each group.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 60},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 58},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 70}]


# create the dataframe from the values
data = spark.createDataFrame(values)

#display dataframe
data.show()

#display count in each group
data.groupBy("student name").count().show()

Output:

The total count in each group is returned.

Note - Do not specify a column name inside count(). Unlike the other aggregate functions, count() takes no column; passing one will throw an error.

+-----+------+-------------------+
|marks|rollno|       student name|
+-----+------+-------------------+
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|  100|     5|  Chennupati Rohith|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|  100|     5|  Chennupati Rohith|
|   60|     3|        Lavu Ojaswi|
|   58|     4|       Lavu Gnanesh|
|   70|     5|  Chennupati Rohith|
+-----+------+-------------------+

+-------------------+-----+
|       student name|count|
+-------------------+-----+
|       Lavu Gnanesh|    3|
|Gottumukkala Sravan|    1|
|        Lavu Ojaswi|    3|
| Gottumukkala Bobby|    1|
|  Chennupati Rohith|    3|
+-------------------+-----+
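
The count column is just the number of rows per student name, which can be cross-checked in plain Python with collections.Counter:

```python
from collections import Counter

# Student names as they appear across the 11 rows of the example
names = [
    'Gottumukkala Sravan', 'Gottumukkala Bobby', 'Lavu Ojaswi',
    'Lavu Gnanesh', 'Chennupati Rohith', 'Lavu Ojaswi',
    'Lavu Gnanesh', 'Chennupati Rohith', 'Lavu Ojaswi',
    'Lavu Gnanesh', 'Chennupati Rohith',
]

# number of rows per student, like groupBy("student name").count()
counts = Counter(names)
print(counts)
```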

 



About the Author
Gottumukkala Sravan Kumar 171FA07058
B.Tech (Hon's) - IT from Vignan's University. Published 1400+ Technical Articles on Python, R, Swift, Java, C#, LISP, PHP - MySQL and Machine Learning
Published Date: Jun 12, 2023