In this PySpark tutorial, we will discuss how to use collect() to get all rows, particular rows, and particular column values from a PySpark dataframe.
Introduction:
A DataFrame in PySpark is a two-dimensional data structure. One dimension refers to rows and the other refers to columns, so the data is stored in rows and columns.
Let's install the pyspark module before going further. Python modules are installed with the "pip" command.
Syntax:
pip install module_name
Installing PySpark:
pip install pyspark
Steps to create dataframe in PySpark:
1. Import the below modules
import pyspark
from pyspark.sql import SparkSession
2. Create a Spark app named tutorialsinhand using the getOrCreate() method
Syntax:
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
3. Create a list of values for the dataframe
4. Pass this list to the createDataFrame() method to create the PySpark dataframe
Syntax:
spark.createDataFrame(list_of_values)
Scenario 1 : Get all rows and columns
We can get all rows and columns by calling the collect() method, which returns every row of the dataframe to the driver program.
Syntax:
dataframe.collect()
It returns the data row-wise, as a Python list of Row objects.
Example:
In this example, we create a dataframe with 3 columns and 5 rows and display it using collect().
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan kumar','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#display using collect()
data.collect()
Output:
[Row(marks=98, rollno=1, student name='Gottumukkala Sravan kumar'),
Row(marks=89, rollno=2, student name='Gottumukkala Bobby'),
Row(marks=90, rollno=3, student name='Lavu Ojaswi'),
Row(marks=78, rollno=4, student name='Lavu Gnanesh'),
Row(marks=100, rollno=5, student name='Chennupati Rohith')]
Scenario 2 : Get particular rows
If we want to get a particular row, we index into the list returned by collect() with the row index. The index starts at 0.
Syntax:
data.collect()[row_index]
Example:
In this example, we will get the first, third and fifth rows.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan kumar','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
# create the dataframe from the values
data = spark.createDataFrame(values)
# collect once (each collect() call runs a Spark job), then index the list
rows = data.collect()
#get first row
print(rows[0])
#get third row
print(rows[2])
#get fifth row
print(rows[4])
Output:
Row(marks=98, rollno=1, student name='Gottumukkala Sravan kumar')
Row(marks=90, rollno=3, student name='Lavu Ojaswi')
Row(marks=100, rollno=5, student name='Chennupati Rohith')
Scenario 3 : Get particular columns
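If we want to get a particular column value from a row, we specify the row index followed by the column index. Both indexes start at 0. As the output in Scenario 1 shows, the columns of this dataframe are ordered alphabetically by name (marks, rollno, student name), so column index 0 is marks and column index 1 is rollno.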
Syntax:
data.collect()[row_index][col_index]
where,
1. row_index is the row index
2. col_index is the column index
Example:
In this example, we will get the values of the first and second columns (marks and rollno) from the first, third and fifth rows.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan kumar','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
# create the dataframe from the values
data = spark.createDataFrame(values)
# collect once (each collect() call runs a Spark job), then index the list
rows = data.collect()
#get first row - first column
print(rows[0][0])
#get first row - second column
print(rows[0][1])
#get third row - first column
print(rows[2][0])
#get third row - second column
print(rows[2][1])
#get fifth row - first column
print(rows[4][0])
#get fifth row - second column
print(rows[4][1])
Output:
98
1
90
3
100
5
About the Author
Gottumukkala Sravan Kumar 171FA07058
B.Tech (Hon's) - IT from Vignan's University.
Published 1400+ Technical Articles on Python, R, Swift, Java, C#, LISP, PHP - MySQL and Machine Learning
Published Date: Jun 12, 2023