PySpark - mean()

In this PySpark tutorial, we will discuss two ways to get the average value of a single column or of multiple columns in a PySpark DataFrame.

Introduction:

A DataFrame in PySpark is a two-dimensional data structure that stores data in rows and columns: one dimension refers to the rows and the other to the columns.

Let's install the pyspark module before going further. Python modules are installed with the "pip" command.

Syntax:

pip install module_name

Installing PySpark:

pip install pyspark

Steps to create a DataFrame in PySpark:

1. Import the below modules
      import pyspark
      from pyspark.sql import SparkSession

2. Create a Spark app named tutorialsinhand using the getOrCreate() method

     Syntax:
     spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

3. Create a list of values for the DataFrame

4. Pass this list to the createDataFrame() method to create the PySpark DataFrame
    Syntax:
    spark.createDataFrame(list_of_values)

Let's see the methods.

Method 1: Using select()

mean() is an aggregate function used to get the average value from the given column in the PySpark DataFrame.

We have to import the mean() function from pyspark.sql.functions:

from pyspark.sql.functions import mean

Syntax:

dataframe.select(mean("column_name"), ...)

where column_name is the column whose average value is returned. Pass one mean() call per column to get the average of several columns.

Example 1:

In this example, we create a PySpark DataFrame with 5 rows and 3 columns and get the average value of the marks column.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan kumar', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]

# create the dataframe from the values
data = spark.createDataFrame(values)

# import the mean function
from pyspark.sql.functions import mean

# display the mean of marks
print(data.select(mean("marks")).collect())

Output:

The collect() method returns the result as a list of Row objects.

[Row(avg(marks)=91.0)]
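
As a quick sanity check on the output above, the same average can be worked out by hand in plain Python. This is only an illustrative sketch: the list name marks and the variable marks_mean are our own, the values are copied from the example data, and nothing here uses Spark.

```python
# Marks copied from the five rows of the example DataFrame
marks = [98, 89, 90, 78, 100]

# The mean is the sum of the values divided by the number of values
marks_mean = sum(marks) / len(marks)

print(marks_mean)  # 91.0
```

The total is 455 over 5 rows, which gives the same 91.0 that mean("marks") reported.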

Example 2:

In this example, we create a PySpark DataFrame with 5 rows and 3 columns and get the average values of the marks and rollno columns.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan kumar', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]

# create the dataframe from the values
data = spark.createDataFrame(values)

# import the mean function
from pyspark.sql.functions import mean

# display the average of marks and rollno
print(data.select(mean("marks"), mean("rollno")).collect())

Output:

The average values of the marks and rollno columns are returned.

[Row(avg(marks)=91.0, avg(rollno)=3.0)]

Method 2: Using agg()

agg() performs an aggregation over the whole DataFrame. It accepts a dictionary in which each key is a column name and each value is the name of an aggregate function, here 'mean'. It returns the average of every column given as a key.

Syntax:

dataframe.agg({'column_name': 'mean', ...})

where column_name is the column whose average is returned.

Example:

In this example, we create a PySpark DataFrame with 5 rows and 3 columns and get the average values of the marks and rollno columns.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan kumar', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]

# create the dataframe from the values
data = spark.createDataFrame(values)

# display the average of marks and rollno
# (print is needed to see the result when this runs as a script)
print(data.agg({'marks': 'mean', 'rollno': 'mean'}).collect())

Output:

The average values of the marks and rollno columns are returned.

[Row(avg(rollno)=3.0, avg(marks)=91.0)]

About the Author

Gottumukkala Sravan Kumar 171FA07058
B.Tech (Hon's) - IT from Vignan's University. Published 1400+ Technical Articles on Python, R, Swift, Java, C#, LISP, PHP - MySQL and Machine Learning

Published Date: Jun 12, 2023
