Get variance from PySpark DataFrame
In this PySpark tutorial, we will discuss how to get the variance of one or more columns in a PySpark DataFrame.
Introduction:
A DataFrame in PySpark is a two-dimensional data structure. One dimension refers to rows and the other to columns, so the data is stored in rows and columns.
Let's install the pyspark module before going further. Python modules are installed with the "pip" command.
Syntax:
pip install module_name
Installing PySpark:
pip install pyspark
Steps to create dataframe in PySpark:
1. Import the below modules
import pyspark
from pyspark.sql import SparkSession
2. Create a Spark app named tutorialsinhand using the getOrCreate() method
Syntax:
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
3. Create a list of values for the DataFrame
4. Pass this list to the createDataFrame() method to create the PySpark DataFrame
Syntax:
spark.createDataFrame(list of values)
Method 1 : Using select()
We can get the sample variance with variance() or var_samp(), and the population variance with var_pop(). variance() is an alias for var_samp(), so the two return the same value.
We have to import them from pyspark.sql.functions.
from pyspark.sql.functions import variance,var_samp,var_pop
Syntax:
dataframe.select(variance("column_name"),var_samp("column_name"),var_pop("column_name"))
Example:
In this example, we will get the sample variance of the marks and rollno columns in two ways, with variance() and with var_samp(), and the population variance with var_pop().
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan kumar','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#import variance,var_samp,var_pop functions
from pyspark.sql.functions import variance,var_samp,var_pop
#display variance of marks
print(data.select(variance("marks"),var_samp("marks"),var_pop("marks")).collect())
#display variance of rollno
print(data.select(variance("rollno"),var_samp("rollno"),var_pop("rollno")).collect())
Output:
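The variance of the marks and rollno columns is returned. In the output below, the variance() column is also labeled var_samp(...), since both functions compute the sample variance.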
[Row(var_samp(marks)=76.00000000000001, var_samp(marks)=76.00000000000001, var_pop(marks)=60.80000000000001)]
[Row(var_samp(rollno)=2.5, var_samp(rollno)=2.5, var_pop(rollno)=2.0)]
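The figures above can be cross-checked without Spark. The sketch below is plain Python, not part of the PySpark API: the helper names sample_variance and population_variance are hypothetical, introduced only for illustration. It shows that the only difference between the two variances is the divisor: n - 1 for a sample, n for a population.

```python
def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def population_variance(xs):
    """Sum of squared deviations from the mean, divided by n."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


# Same values as the marks and rollno columns in the example
marks = [98, 89, 90, 78, 100]
rollno = [1, 2, 3, 4, 5]

marks_samp = sample_variance(marks)
marks_pop = population_variance(marks)
rollno_samp = sample_variance(rollno)
rollno_pop = population_variance(rollno)

print(marks_samp, marks_pop)
print(rollno_samp, rollno_pop)
```

For marks, the squared deviations from the mean of 91 sum to 304, which gives about 76 when divided by 4 and about 60.8 when divided by 5, matching the Spark output up to floating-point rounding.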
Method 2 : Using agg()
agg() stands for aggregate. It takes a dictionary in which each key is a column name and each value is the name of a variance function ('variance', 'var_samp' or 'var_pop').
It then computes that variance for each listed column.
Syntax:
dataframe.agg({'column1':'var_pop','column2':'var_pop',..........})
dataframe.agg({'column1':'var_samp','column2':'var_samp',..........})
dataframe.agg({'column1':'variance','column2':'variance',..........})
Example:
In this example, we will get the sample variance and the population variance of the marks and rollno columns.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan kumar','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#display sample variance of marks and rollno using variance
print(data.agg({'marks':'variance','rollno':'variance'}).collect())
#display sample variance of marks and rollno using var_samp
print(data.agg({'marks':'var_samp','rollno':'var_samp'}).collect())
#display population variance of marks and rollno using var_pop
print(data.agg({'marks':'var_pop','rollno':'var_pop'}).collect())
Output:
[Row(variance(rollno)=2.5, variance(marks)=76.00000000000001)]
[Row(var_samp(rollno)=2.5, var_samp(marks)=76.00000000000001)]
[Row(var_pop(rollno)=2.0, var_pop(marks)=60.80000000000001)]
About the Author
Gottumukkala Sravan Kumar 171FA07058
B.Tech (Hon's) - IT from Vignan's University.
Published 1400+ Technical Articles on Python, R, Swift, Java, C#, LISP, PHP - MySQL and Machine Learning
Published Date: Jun 12, 2023