Get Standard Deviation from PySpark DataFrame

In this PySpark tutorial, we will discuss how to get the standard deviation of one or more columns of a PySpark DataFrame.

Introduction:

A DataFrame in PySpark is a two-dimensional data structure that stores data in rows and columns: one dimension refers to the rows and the other to the columns.

Let's install the pyspark module before we begin. The command to install any module in Python is pip.

Syntax:

pip install module_name

Installing PySpark:

pip install pyspark

Steps to create dataframe in PySpark:

1. Import the below modules

      import pyspark
      from pyspark.sql import SparkSession


2. Create spark app named tutorialsinhand using getOrCreate() method

     Syntax:
     spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

3. Create list of values for dataframe

4. Pass this list to createDataFrame() method to create pyspark dataframe

    Syntax:
    spark.createDataFrame(list of values)

Method 1 : Using select()

We can get the sample standard deviation by using the stddev() or stddev_samp() functions, and the population standard deviation by using the stddev_pop() function.

We have to import them from pyspark.sql.functions.

from pyspark.sql.functions import stddev,stddev_samp,stddev_pop

Syntax:

dataframe.select(stddev("column_name"),stddev_samp("column_name"),stddev_pop("column_name"))

Example:

In this example, we will get the sample standard deviation in two ways - using the stddev() and stddev_samp() methods - and the population standard deviation using stddev_pop().

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

#create a  list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan kumar','marks': 98},

        {'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},

        {'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},

        {'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},

        {'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]


# create the dataframe from the values
data = spark.createDataFrame(values)


#import stddev,stddev_samp,stddev_pop functions
from pyspark.sql.functions import stddev,stddev_samp,stddev_pop

#display standard deviation of marks
print(data.select(stddev("marks"),stddev_samp("marks"),stddev_pop("marks")).collect())

#display  standard deviation of rollno
print(data.select(stddev("rollno"),stddev_samp("rollno"),stddev_pop("rollno")).collect())

Output:

The standard deviation of the marks and rollno columns is returned. Note that stddev() is an alias for stddev_samp(), which is why both appear as stddev_samp in the output.

[Row(stddev_samp(marks)=8.717797887081348, stddev_samp(marks)=8.717797887081348, stddev_pop(marks)=7.797435475847172)]
[Row(stddev_samp(rollno)=1.5811388300841898, stddev_samp(rollno)=1.5811388300841898, stddev_pop(rollno)=1.4142135623730951)]

Method 2 : Using agg()

agg() stands for aggregate. It takes a dictionary in which each key is a column name and each value is the name of a standard deviation function, and it computes that aggregate for each column.

Syntax:

dataframe.agg({'column1':'stddev_pop','column2':'stddev_pop',..........})
dataframe.agg({'column1':'stddev_samp','column2':'stddev_samp',..........})
dataframe.agg({'column1':'stddev','column2':'stddev',..........})

Example:

In this example, we will get the standard deviation in a sample and the standard deviation in a population from the marks and rollno columns.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

#create a  list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan kumar','marks': 98},

        {'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},

        {'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},

        {'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},

        {'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]


# create the dataframe from the values
data = spark.createDataFrame(values)


#display sample standard deviation of marks and rollno
print(data.agg({'marks':'stddev','rollno':'stddev'}).collect())

#display sample standard deviation of marks and rollno using stddev_samp
print(data.agg({'marks':'stddev_samp','rollno':'stddev_samp'}).collect())

#display population standard deviation of marks and rollno
print(data.agg({'marks':'stddev_pop','rollno':'stddev_pop'}).collect())

Output:

[Row(stddev(rollno)=1.5811388300841898, stddev(marks)=8.717797887081348)]
[Row(stddev_samp(rollno)=1.5811388300841898, stddev_samp(marks)=8.717797887081348)]
[Row(stddev_pop(rollno)=1.4142135623730951, stddev_pop(marks)=7.797435475847172)]


About the Author
Gottumukkala Sravan Kumar 171FA07058
B.Tech (Hon's) - IT from Vignan's University. Published 1400+ Technical Articles on Python, R, Swift, Java, C#, LISP, PHP - MySQL and Machine Learning
Published Date: Jun 14, 2024