PySpark - sumDistinct(), countDistinct()

In this PySpark tutorial, we will discuss how to use the sumDistinct() and countDistinct() functions on a PySpark DataFrame.

Introduction:

A DataFrame in PySpark is a two-dimensional data structure: one dimension is the rows and the other is the columns, so the data is stored in tabular form.

Install the pyspark module before going further. Python packages are installed with the "pip" command.

Syntax:

pip install module_name

Installing PySpark:

pip install pyspark

Steps to create a DataFrame in PySpark:

1. Import the modules below

    import pyspark
    from pyspark.sql import SparkSession

2. Create a Spark app named tutorialsinhand using the getOrCreate() method

    Syntax:

    spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

3. Create a list of values for the DataFrame

4. Pass this list to the createDataFrame() method to create the PySpark DataFrame

    Syntax:

    spark.createDataFrame(list_of_values)

sumDistinct()

sumDistinct() returns the sum of the distinct values in a column, so each duplicated value is added only once.

Example:

If a column contains the values 1, 2, 3, 2, 3, the result is 1 + 2 + 3 = 6, because the repeated 2 and 3 are each added only once.
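The arithmetic in this example can be checked in plain Python before involving Spark. This is only an illustration of the idea, not a description of how Spark computes the result internally.

```python
values = [1, 2, 3, 2, 3]

# A set keeps each value once, which mirrors "distinct"
distinct_values = set(values)

distinct_sum = sum(distinct_values)   # 1 + 2 + 3
plain_sum = sum(values)               # 1 + 2 + 3 + 2 + 3

print(distinct_sum, plain_sum)
```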

sumDistinct() has to be imported from the pyspark.sql.functions module.

Syntax:

from pyspark.sql.functions import sumDistinct

It can be used with the select() method.

Syntax:

dataframe.select(sumDistinct('column_name'), ...)

where column_name is the column whose distinct values are summed.

Example:

In this example, we create a PySpark DataFrame with 11 rows and 3 columns and get the distinct sum of the rollno and marks columns.

# import the required modules
import pyspark
from pyspark.sql import SparkSession

# create the Spark app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of rows; several rows are deliberately duplicated
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]

# create the dataframe from the values
data = spark.createDataFrame(values)

# display the dataframe
data.show()

# import sumDistinct (deprecated since Spark 3.2 in favour of sum_distinct)
from pyspark.sql.functions import sumDistinct

# return the distinct sum of the marks and rollno columns
data.select(sumDistinct('marks'), sumDistinct('rollno')).show()

Output:

The DataFrame is displayed first, followed by the distinct sums of the marks and rollno columns.

+-----+------+-------------------+
|marks|rollno|       student name|
+-----+------+-------------------+
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|   90|     3|        Lavu Ojaswi|
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|  100|     5|  Chennupati Rohith|
|  100|     5|  Chennupati Rohith|
+-----+------+-------------------+

+-------------------+--------------------+
|sum(DISTINCT marks)|sum(DISTINCT rollno)|
+-------------------+--------------------+
|                455|                  15|
+-------------------+--------------------+
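The two results can be cross-checked in plain Python using the column values from the DataFrame shown above. This sketch only mirrors the arithmetic; the list names are illustrative.

```python
# Column values copied from the 11 rows of the example DataFrame
marks = [98, 89, 90, 78, 90, 98, 89, 90, 78, 100, 100]
rollno = [1, 2, 3, 4, 3, 1, 2, 3, 4, 5, 5]

# Distinct sums: each value is added once
marks_distinct_sum = sum(set(marks))     # 98 + 89 + 90 + 78 + 100
rollno_distinct_sum = sum(set(rollno))   # 1 + 2 + 3 + 4 + 5

# Ordinary sums over all 11 rows, for contrast
marks_plain_sum = sum(marks)
rollno_plain_sum = sum(rollno)

print(marks_distinct_sum, rollno_distinct_sum)
print(marks_plain_sum, rollno_plain_sum)
```

The distinct sums (455 and 15) match the Spark output, while the ordinary sums over all rows are larger because the duplicated rows are included.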

countDistinct()

countDistinct() returns the number of distinct values in a column, so each duplicated value is counted only once.

Example:

If a column contains the values 1, 2, 3, 2, 3, the distinct values are 1, 2 and 3, so the result is 3, because the repeated 2 and 3 are each counted only once.
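The same count can be reproduced in plain Python, as an illustration of the idea rather than of Spark's internals.

```python
values = [1, 2, 3, 2, 3]

distinct_count = len(set(values))   # counts 1, 2 and 3 once each
plain_count = len(values)           # counts every entry

print(distinct_count, plain_count)
```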

countDistinct() has to be imported from the pyspark.sql.functions module.

Syntax:

from pyspark.sql.functions import countDistinct

It can be used with the select() method.

Syntax:

dataframe.select(countDistinct('column_name'), ...)

where column_name is the column whose distinct values are counted.

Example:

In this example, we create a PySpark DataFrame with 11 rows and 3 columns and get the distinct count of the rollno and marks columns.

# import the required modules
import pyspark
from pyspark.sql import SparkSession

# create the Spark app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of rows; several rows are deliberately duplicated
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]

# create the dataframe from the values
data = spark.createDataFrame(values)

# display the dataframe
data.show()

# import countDistinct (also available as count_distinct from Spark 3.2 onwards)
from pyspark.sql.functions import countDistinct

# return the distinct count of the marks and rollno columns
data.select(countDistinct('marks'), countDistinct('rollno')).show()

Output:

The DataFrame is displayed first, followed by the distinct counts of the marks and rollno columns.

+-----+------+-------------------+
|marks|rollno|       student name|
+-----+------+-------------------+
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|   90|     3|        Lavu Ojaswi|
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|  100|     5|  Chennupati Rohith|
|  100|     5|  Chennupati Rohith|
+-----+------+-------------------+

+---------------------+----------------------+
|count(DISTINCT marks)|count(DISTINCT rollno)|
+---------------------+----------------------+
|                    5|                     5|
+---------------------+----------------------+


About the Author

Gottumukkala Sravan Kumar 171FA07058
B.Tech (Hon's) - IT from Vignan's University. Published 1400+ Technical Articles on Python, R, Swift, Java, C#, LISP, PHP - MySQL and Machine Learning

Published Date: Jun 14, 2024
