
PySpark - collect_list(),collect_set()



In this PySpark tutorial, we will discuss how to apply the collect_list() and collect_set() methods on a PySpark DataFrame.

Introduction:

A DataFrame in PySpark is a two-dimensional data structure that stores data in rows and columns: one dimension refers to the rows and the other to the columns.

Let's install the pyspark module before we begin. The command to install any module in Python is pip.

Syntax:

pip install module_name

Installing PySpark:

pip install pyspark

Steps to create a dataframe in PySpark:

1. Import the required modules

      import pyspark
      from pyspark.sql import SparkSession

2. Create a Spark app named tutorialsinhand using the getOrCreate() method

     Syntax:

     spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

3. Create a list of values for the dataframe

4. Pass this list to the createDataFrame() method to create the PySpark dataframe

    Syntax:
    spark.createDataFrame(list of values)

  • collect_list()

collect_list() is used to get all the values from a column, including duplicates. We have to import this method from the pyspark.sql.functions module.

Syntax:

from pyspark.sql.functions import collect_list

It can be used with the select() method.

Syntax:

dataframe.select(collect_list("column_name"), ...)

where column_name is the column whose values are collected into a list.

Example:

In this example, we create a PySpark dataframe with 3 columns and 11 rows (6 of which are duplicates) and get the values from the rollno and marks columns using collect_list().

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]


# create the dataframe from the values
data = spark.createDataFrame(values)


#display dataframe
data.show()

#import collect_list
from pyspark.sql.functions import collect_list

# get the list of values from the rollno and marks columns
data.select(collect_list("rollno"),collect_list("marks")).collect()

Output:

The lists of values from the rollno and marks columns are returned.

+-----+------+-------------------+
|marks|rollno|       student name|
+-----+------+-------------------+
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|   90|     3|        Lavu Ojaswi|
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|  100|     5|  Chennupati Rohith|
|  100|     5|  Chennupati Rohith|
+-----+------+-------------------+

[Row(collect_list(rollno)=[1, 2, 3, 4, 3, 1, 2, 3, 4, 5, 5], collect_list(marks)=[98, 89, 90, 78, 90, 98, 89, 90, 78, 100, 100])]

  • collect_set()

collect_set() is used to get the values from a column without duplicates. We have to import this method from the pyspark.sql.functions module.

Syntax:

from pyspark.sql.functions import collect_set

It can be used with the select() method.

Syntax:

dataframe.select(collect_set("column_name"), ...)

where column_name is the column whose distinct values are collected into a list.

Example:

In this example, we create a PySpark dataframe with 3 columns and 11 rows (6 of which are duplicates) and get the distinct values from the rollno and marks columns using collect_set().

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]


# create the dataframe from the values
data = spark.createDataFrame(values)


#display dataframe
data.show()

#import collect_set
from pyspark.sql.functions import collect_set

# get the list of values from the rollno and marks columns without duplicates
data.select(collect_set("rollno"),collect_set("marks")).collect()

Output:

collect_set() returns the lists of values from the rollno and marks columns without duplicates.

+-----+------+-------------------+
|marks|rollno|       student name|
+-----+------+-------------------+
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|   90|     3|        Lavu Ojaswi|
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|  100|     5|  Chennupati Rohith|
|  100|     5|  Chennupati Rohith|
+-----+------+-------------------+

[Row(collect_set(rollno)=[1, 5, 2, 3, 4], collect_set(marks)=[78, 100, 89, 90, 98])]

 



About the Author
Gottumukkala Sravan Kumar 171FA07058
B.Tech (Hon's) - IT from Vignan's University. Published 1400+ Technical Articles on Python, R, Swift, Java, C#, LISP, PHP - MySQL and Machine Learning
Published Date : Jun 14, 2024
