PySpark - collect()


In this PySpark tutorial, we will discuss how to use collect() to get all rows, particular rows, or particular column values from a PySpark dataframe.

Introduction:

A DataFrame in PySpark is a two-dimensional data structure. One dimension refers to rows and the other to columns, so the data is stored in rows and columns.

Let's install the pyspark module before going further. Python modules are installed with the "pip" command.

Syntax:

pip install module_name

Installing PySpark:

pip install pyspark

Steps to create dataframe in PySpark:

1. Import the below modules
     import pyspark
     from pyspark.sql import SparkSession

2. Create a Spark app named tutorialsinhand using the getOrCreate() method
     Syntax:
     spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

3. Create a list of values for the dataframe

4. Pass this list to the createDataFrame() method to create the PySpark dataframe
     Syntax:
     spark.createDataFrame(list_of_values)

Scenario 1 : Get all rows and columns

We can get all rows and columns simply by using the collect() method.

Syntax:

dataframe.collect()

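It returns the data row-wise, as a Python list of Row objects. Because collect() brings every row to the driver, use it only on data small enough to fit in the driver's memory.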

Example:

In this example, we create a dataframe with 3 columns and 5 rows and display it using collect().

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan kumar', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]


# create the dataframe from the values
data = spark.createDataFrame(values)

#display using collect()
data.collect()

Output:

[Row(marks=98, rollno=1, student name='Gottumukkala Sravan kumar'),
 Row(marks=89, rollno=2, student name='Gottumukkala Bobby'),
 Row(marks=90, rollno=3, student name='Lavu Ojaswi'),
 Row(marks=78, rollno=4, student name='Lavu Gnanesh'),
 Row(marks=100, rollno=5, student name='Chennupati Rohith')]

Scenario 2 : Get particular rows

To get a particular row, index into the list returned by collect() with the row index. Indexing starts at 0.

Syntax:

data.collect()[row_index]

Example:

In this example, we will get the first, third and fifth rows.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan kumar', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]


# create the dataframe from the values
data = spark.createDataFrame(values)

# collect once and reuse the list; every collect() call runs a new Spark job
rows = data.collect()

# get the first row
print(rows[0])

# get the third row
print(rows[2])

# get the fifth row
print(rows[4])

Output:

Row(marks=98, rollno=1, student name='Gottumukkala Sravan kumar')
Row(marks=90, rollno=3, student name='Lavu Ojaswi')
Row(marks=100, rollno=5, student name='Chennupati Rohith')

Scenario 3 : Get particular columns

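To get a particular column value from a row, specify the row index followed by the column index. Both indexes start at 0. In this dataframe the columns appear in alphabetical order (marks, rollno, student name), as the outputs in the earlier scenarios show, so column index 0 is marks and column index 1 is rollno.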

Syntax:

data.collect()[row_index][col_index]

where,

1. row_index is the position of the row in the collected list

2. col_index is the position of the column within that row

Example:

In this example, we will get the first and second column values (marks and rollno) from the first, third and fifth rows.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan kumar', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]


# create the dataframe from the values
data = spark.createDataFrame(values)

# collect once, then index by row and by column
rows = data.collect()

# get first row - first column
print(rows[0][0])

# get first row - second column
print(rows[0][1])

# get third row - first column
print(rows[2][0])

# get third row - second column
print(rows[2][1])

# get fifth row - first column
print(rows[4][0])

# get fifth row - second column
print(rows[4][1])

Output:

98
1
90
3
100
5


About the Author
Gottumukkala Sravan Kumar 171FA07058
B.Tech (Hon's) - IT from Vignan's University. Published 1400+ Technical Articles on Python, R, Swift, Java, C#, LISP, PHP - MySQL and Machine Learning
Published Date : Jun 12, 2023