PySpark - sumDistinct(), countDistinct()


In this PySpark tutorial, we will discuss how to use the sumDistinct() and countDistinct() functions on a PySpark DataFrame.

Introduction:

A DataFrame in PySpark is a two-dimensional data structure: one dimension is the rows and the other is the columns, so the data is stored in rows and columns.

Let's install the pyspark module before going further. Python modules are installed with the "pip" command.

Syntax:

pip install module_name

Installing PySpark:

pip install pyspark

Steps to create a DataFrame in PySpark:

1. Import the below modules

     import pyspark
     from pyspark.sql import SparkSession

2. Create a Spark app named tutorialsinhand using the getOrCreate() method

     Syntax:

     spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

3. Create a list of values for the DataFrame

4. Pass this list to the createDataFrame() method to create the PySpark DataFrame

     Syntax:

     spark.createDataFrame(list_of_values)

  • sumDistinct()

sumDistinct() returns the sum of the distinct values in a column, so each duplicated value is added only once.

Example:

If a column contains the values 1, 2, 3, 2, 3, the result is 1 + 2 + 3 = 6, because the duplicated 2 and 3 are each added only once.
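The arithmetic above can be mirrored in plain Python before moving to Spark. This is only a sketch of the semantics using a built-in set; it does not call PySpark, and the variable names are illustrative assumptions.

```python
# Plain-Python sketch of "sum of distinct values" (not a PySpark call).
values = [1, 2, 3, 2, 3]

# A set keeps each value once, so duplicates are dropped before summing.
distinct_values = set(values)
distinct_sum = sum(distinct_values)

print(distinct_sum)  # 6, whereas sum(values) is 11
```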

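We have to import it from the pyspark.sql.functions module. Since Spark 3.2, sumDistinct() is deprecated in favor of sum_distinct(), which is imported from the same module and used the same way.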

Syntax:

from pyspark.sql.functions import sumDistinct

It can be used with the select() method.

Syntax:

dataframe.select(sumDistinct('column_name'),.............)

where column_name is the column whose distinct values are summed.

Example:

In this example, we create a PySpark DataFrame with 11 rows and 3 columns and get the distinct sum of the marks and rollno columns.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data (11 rows, several of them duplicated)
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]

# create the dataframe from the values
data = spark.createDataFrame(values)

# display dataframe
data.show()

# import sumDistinct
from pyspark.sql.functions import sumDistinct

# return distinct sum from marks and rollno column
data.select(sumDistinct('marks'), sumDistinct('rollno')).show()

Output:

The DataFrame is displayed first, followed by the distinct sums of the marks and rollno columns.

+-----+------+-------------------+
|marks|rollno|       student name|
+-----+------+-------------------+
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|   90|     3|        Lavu Ojaswi|
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|  100|     5|  Chennupati Rohith|
|  100|     5|  Chennupati Rohith|
+-----+------+-------------------+

+-------------------+--------------------+
|sum(DISTINCT marks)|sum(DISTINCT rollno)|
+-------------------+--------------------+
|                455|                  15|
+-------------------+--------------------+
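The two totals in this output can be cross-checked by hand. The sketch below repeats the rollno and marks values from the 11 rows in plain Python (no Spark session involved; the names rows, distinct_marks_sum and distinct_rollno_sum are illustrative assumptions) and sums each column after dropping duplicates.

```python
# (rollno, marks) pairs copied from the 11 rows of the example DataFrame.
rows = [(1, 98), (2, 89), (3, 90), (4, 78), (3, 90), (1, 98),
        (2, 89), (3, 90), (4, 78), (5, 100), (5, 100)]

# Set comprehensions keep each value once before summing.
distinct_marks_sum = sum({marks for _, marks in rows})      # 98+89+90+78+100
distinct_rollno_sum = sum({rollno for rollno, _ in rows})   # 1+2+3+4+5

print(distinct_marks_sum, distinct_rollno_sum)  # 455 15
```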

  • countDistinct()

countDistinct() returns the number of distinct values in a column, so each duplicated value is counted only once.

Example:

If a column contains the values 1, 2, 3, 2, 3, the distinct values are 1, 2 and 3, so the count is 3, because the duplicated 2 and 3 are each counted only once.
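The same idea in plain Python, as a rough sketch of the semantics only (it does not call PySpark, and the variable names are illustrative assumptions):

```python
# Plain-Python sketch of "count of distinct values" (not a PySpark call).
values = [1, 2, 3, 2, 3]

# The size of the set is the number of distinct values.
distinct_count = len(set(values))

print(distinct_count)  # 3, whereas len(values) is 5
```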

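We have to import it from the pyspark.sql.functions module. Since Spark 3.2, countDistinct() is an alias of count_distinct(), which is imported from the same module and is the encouraged name.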

Syntax:

from pyspark.sql.functions import countDistinct

It can be used with the select() method.

Syntax:

dataframe.select(countDistinct('column_name'),.............)

where column_name is the column whose distinct values are counted.

Example:

In this example, we create a PySpark DataFrame with 11 rows and 3 columns and get the distinct count of the marks and rollno columns.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data (11 rows, several of them duplicated)
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]

# create the dataframe from the values
data = spark.createDataFrame(values)

# display dataframe
data.show()

# import countDistinct
from pyspark.sql.functions import countDistinct

# return distinct count from marks and rollno column
data.select(countDistinct('marks'), countDistinct('rollno')).show()

Output:

The DataFrame is displayed first, followed by the distinct counts of the marks and rollno columns.

+-----+------+-------------------+
|marks|rollno|       student name|
+-----+------+-------------------+
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|   90|     3|        Lavu Ojaswi|
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|  100|     5|  Chennupati Rohith|
|  100|     5|  Chennupati Rohith|
+-----+------+-------------------+

+---------------------+----------------------+
|count(DISTINCT marks)|count(DISTINCT rollno)|
+---------------------+----------------------+
|                    5|                     5|
+---------------------+----------------------+
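As a quick sanity check on these counts, the sketch below repeats the rollno and marks values from the 11 rows in plain Python (no Spark session involved; the names rows, distinct_marks_count and distinct_rollno_count are illustrative assumptions) and counts the distinct values per column.

```python
# (rollno, marks) pairs copied from the 11 rows of the example DataFrame.
rows = [(1, 98), (2, 89), (3, 90), (4, 78), (3, 90), (1, 98),
        (2, 89), (3, 90), (4, 78), (5, 100), (5, 100)]

# The size of each set is the number of distinct values in that column.
distinct_marks_count = len({marks for _, marks in rows})
distinct_rollno_count = len({rollno for rollno, _ in rows})

print(len(rows), distinct_marks_count, distinct_rollno_count)  # 11 5 5
```

Both columns have 5 distinct values even though the DataFrame has 11 rows, which matches the count(DISTINCT ...) columns shown above.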
 


About the Author
Gottumukkala Sravan Kumar 171FA07058
B.Tech (Hon's) - IT from Vignan's University. Published 1400+ Technical Articles on Python, R, Swift, Java, C#, LISP, PHP - MySQL and Machine Learning
Published Date: Jun 14, 2024