In this PySpark tutorial, we will discuss how to use collect() to get all rows, particular rows, and particular columns from a PySpark dataframe.
Introduction:
A DataFrame in PySpark is a two-dimensional data structure that stores data in rows and columns: one dimension refers to the rows and the other to the columns.
Before going further, let's install the pyspark module. The command to install any module in Python is pip.
Syntax:
pip install module_name
Installing PySpark:
pip install pyspark
Steps to create a dataframe in PySpark:
1. Import the below modules
import pyspark
from pyspark.sql import SparkSession
2. Create a Spark session for an app named tutorialsinhand using the getOrCreate() method
Syntax:
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
3. Create a list of values for the dataframe
4. Pass this list to the createDataFrame() method to create the PySpark dataframe (a combined sketch is shown after the syntax below)
Syntax:
spark.createDataFrame(list of values)
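Putting these steps together, the short sketch below creates the Spark session and builds a dataframe from a list of dictionaries, then calls printSchema() to confirm the inferred structure. The column names and values here are only placeholders for illustration.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create the spark session
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
# sample values - any list of dictionaries will do
values = [{'rollno': 1, 'marks': 98}, {'rollno': 2, 'marks': 89}]
# create the dataframe and inspect the inferred schema
df = spark.createDataFrame(values)
df.printSchema()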
Scenario 1: Get all Rows and Columns
We can get all rows and columns simply by using the collect() method.
Syntax:
dataframe.collect()
It returns the data row-wise, as a list of Row objects.
Example:
In this example, we create a dataframe with 3 columns and 5 rows and display it using collect().
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan kumar','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#display using collect()
data.collect()
Output:
[Row(marks=98, rollno=1, student name='Gottumukkala Sravan kumar'),
Row(marks=89, rollno=2, student name='Gottumukkala Bobby'),
Row(marks=90, rollno=3, student name='Lavu Ojaswi'),
Row(marks=78, rollno=4, student name='Lavu Gnanesh'),
Row(marks=100, rollno=5, student name='Chennupati Rohith')]
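Each element returned by collect() is a Row object. If a plain Python dictionary is easier to work with, a Row can be converted with its asDict() method. A minimal follow-up sketch, assuming the data dataframe created above:
# convert every collected Row into a regular Python dict
rows_as_dicts = [row.asDict() for row in data.collect()]
# print the first record as a dict
print(rows_as_dicts[0])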
Scenario 2: Get particular Rows
If we want to get a particular row, we have to specify its row index. Indexing starts at 0.
Syntax:
data.collect()[row_index]
Example:
In this example, we will get the first, third, and fifth rows.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan kumar','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#get first row
print(data.collect()[0])
#get third row
print(data.collect()[2])
#get fifth row
print(data.collect()[4])
Output:
Row(marks=98, rollno=1, student name='Gottumukkala Sravan kumar')
Row(marks=90, rollno=3, student name='Lavu Ojaswi')
Row(marks=100, rollno=5, student name='Chennupati Rohith')
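Note that collect() brings every row back to the driver, so indexing into its result just to read one record can be wasteful on large dataframes. If only the leading rows are needed, take(n) returns just those rows. A minimal sketch, assuming the same data dataframe:
# fetch only the first three rows instead of collecting everything
first_three = data.take(3)
# first and third rows
print(first_three[0])
print(first_three[2])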
Scenario 3: Get particular Columns
If we want to get particular columns of a row, we have to specify the row index along with the column index. Indexing starts at 0 for both.
Syntax:
data.collect()[row_index][col_index]
where,
1. row_index is the row index
2. col_index is the column index
Example:
In this example, we will get the first and second columns from the first, third, and fifth rows.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan kumar','marks': 98},
{'rollno': 2, 'student name': 'Gottumukkala Bobby','marks': 89},
{'rollno': 3, 'student name': 'Lavu Ojaswi','marks': 90},
{'rollno': 4, 'student name': 'Lavu Gnanesh','marks': 78},
{'rollno': 5, 'student name': 'Chennupati Rohith','marks': 100}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#get first row- first column
print(data.collect()[0][0])
#get first row- second column
print(data.collect()[0][1])
#get third row - first column
print(data.collect()[2][0])
#get third row - second column
print(data.collect()[2][1])
#get fifth row - first column
print(data.collect()[4][0])
#get fifth row - second column
print(data.collect()[4][1])
Output:
98
1
90
3
100
5
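The numeric column positions follow the field order of the Row (alphabetical here, as the output above shows, because the dataframe was built from dictionaries), so relying on column indexes can be fragile. A Row also supports lookup by column name, which is usually clearer. A minimal sketch, assuming the same data dataframe:
# access columns of the first row by name instead of by position
first = data.collect()[0]
print(first['rollno'])
print(first['student name'])
print(first['marks'])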
About the Author
Gottumukkala Sravan Kumar 171FA07058
B.Tech (Hon's) - IT from Vignan's University.
Published 1400+ Technical Articles on Python, R, Swift, Java, C#, LISP, PHP - MySQL and Machine Learning