PySpark - collect_list(), collect_set()
In this PySpark tutorial, we will discuss how to apply the collect_list() and collect_set() functions to a PySpark DataFrame.
Introduction:
A DataFrame in PySpark is a two-dimensional data structure that stores data in rows and columns: one dimension refers to the rows and the other to the columns.
Let's install the pyspark module before proceeding. Python modules are installed with the "pip" command.
Syntax:
pip install module_name
Installing PySpark:
pip install pyspark
Steps to create a dataframe in PySpark:
1. Import the following modules
import pyspark
from pyspark.sql import SparkSession
2. Create a spark app named tutorialsinhand using the getOrCreate() method
Syntax:
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
3. Create a list of values for the dataframe
4. Pass this list to the createDataFrame() method to create the PySpark dataframe (a consolidated sketch of these steps follows the syntax below)
Syntax:
spark.createDataFrame(list of values)
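A minimal sketch that puts these four steps together (the sample column names and values below are only illustrative):
# import the required modules
import pyspark
from pyspark.sql import SparkSession

# create a spark app named tutorialsinhand
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of values (illustrative sample data)
values = [{'rollno': 1, 'marks': 98}, {'rollno': 2, 'marks': 89}]

# pass the list to createDataFrame() to create the pyspark dataframe
dataframe = spark.createDataFrame(values)
dataframe.show()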
collect_list() is an aggregate function that returns all the values from a column as a list, duplicates included. It has to be imported from the pyspark.sql.functions module.
Syntax:
from pyspark.sql.functions import collect_list
It can be used with the select() method.
Syntax:
dataframe.select(collect_list("column_name"), ...)
where column_name is the column whose values are collected into a list.
Example:
In this example, we create a PySpark dataframe with 3 columns and 11 rows and get the values from the rollno and marks columns using collect_list().
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#display dataframe
data.show()
#import collect_list
from pyspark.sql.functions import collect_list
#get the list of values from rollno and marks column
data.select(collect_list("rollno"),collect_list("marks")).collect()
Output:
collect_list() returns the lists of values from the rollno and marks columns, duplicates included.
+-----+------+-------------------+
|marks|rollno| student name|
+-----+------+-------------------+
| 98| 1|Gottumukkala Sravan|
| 89| 2| Gottumukkala Bobby|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 90| 3| Lavu Ojaswi|
| 98| 1|Gottumukkala Sravan|
| 89| 2| Gottumukkala Bobby|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
| 100| 5| Chennupati Rohith|
+-----+------+-------------------+
[Row(collect_list(rollno)=[1, 2, 3, 4, 3, 1, 2, 3, 4, 5, 5], collect_list(marks)=[98, 89, 90, 78, 90, 98, 89, 90, 78, 100, 100])]
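collect_list() is also commonly used as an aggregate inside groupBy(), which returns one list per group instead of one list for the whole column. A short sketch on the same dataframe (the grouping column here is chosen only for illustration):
from pyspark.sql.functions import collect_list

# group the dataframe created above by student name and
# collect each student's marks into one list per group
data.groupBy("student name").agg(collect_list("marks")).show(truncate=False)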
collect_set() is an aggregate function that returns the values from a column as a list with duplicates removed. It has to be imported from the pyspark.sql.functions module.
Syntax:
from pyspark.sql.functions import collect_set
It can be used with the select() method.
Syntax:
dataframe.select(collect_set("column_name"), ...)
where column_name is the column whose distinct values are collected into a list.
Example:
In this example, we create a PySpark dataframe with 3 columns and 11 rows (6 of them duplicates) and get the distinct values from the rollno and marks columns using collect_set().
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 1, 'student name': 'Gottumukkala Sravan','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#display dataframe
data.show()
#import collect_set
from pyspark.sql.functions import collect_set
#get the list of values from rollno and marks column without duplicates
data.select(collect_set("rollno"),collect_set("marks")).collect()
Output:
collect_set() returns the values from the rollno and marks columns without duplicates.
+-----+------+-------------------+
|marks|rollno| student name|
+-----+------+-------------------+
| 98| 1|Gottumukkala Sravan|
| 89| 2| Gottumukkala Bobby|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 90| 3| Lavu Ojaswi|
| 98| 1|Gottumukkala Sravan|
| 89| 2| Gottumukkala Bobby|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
| 100| 5| Chennupati Rohith|
+-----+------+-------------------+
[Row(collect_set(rollno)=[1, 5, 2, 3, 4], collect_set(marks)=[78, 100, 89, 90, 98])]
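Like collect_list(), collect_set() can be used as an aggregate inside groupBy() to return one de-duplicated list per group. A short sketch on the same dataframe (the grouping column here is chosen only for illustration):
from pyspark.sql.functions import collect_set

# group the dataframe created above by student name and
# collect each student's distinct marks into one list per group
data.groupBy("student name").agg(collect_set("marks")).show(truncate=False)

Note that neither collect_list() nor collect_set() guarantees the order of the collected elements, which is why the values in the output above are not sorted.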