PySpark - head(), tail(), take(), first()

In this PySpark tutorial, we will discuss how to retrieve the top and bottom rows of a PySpark DataFrame using the head(), tail(), first() and take() methods.

Introduction:

A DataFrame in PySpark is a two-dimensional data structure: one dimension refers to rows and the other to columns, so the data is stored in rows and columns.

Let's install the pyspark module before going further. Python modules are installed with the "pip" command.

Syntax:

pip install module_name

Installing PySpark:

pip install pyspark

Steps to create a DataFrame in PySpark:

1. Import the below modules

      import pyspark
      from pyspark.sql import SparkSession

2. Create a Spark session for an app named tutorialsinhand using the getOrCreate() method

     Syntax:

     spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

3. Create a list of values for the DataFrame

4. Pass this list to the createDataFrame() method to create the PySpark DataFrame

    Syntax:
    spark.createDataFrame(list_of_values)

Let's create a PySpark DataFrame with 3 columns and 5 rows.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]


# create the dataframe from the values
data = spark.createDataFrame(values)


#display
data.show()

Output:

In this output the columns appear in alphabetical order rather than in the order of the dictionary keys.

+-----+------+-------------------+
|marks|rollno|       student name|
+-----+------+-------------------+
|   98|     1|Gottumukkala Sravan|
|   89|     2| Gottumukkala Bobby|
|   90|     3|        Lavu Ojaswi|
|   78|     4|       Lavu Gnanesh|
|  100|     5|  Chennupati Rohith|
+-----+------+-------------------+

head()

head(n) returns the first n rows of the DataFrame as a list of Row objects. It takes an integer parameter n that specifies the number of rows to return. If n is omitted, head() returns only the first row as a single Row.

Syntax:

dataframe.head(n)

Example:

In this example, we will return the top 3 rows.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]


# create the dataframe from the values
data = spark.createDataFrame(values)


#display top 3 rows
data.head(3)

Output:

The top 3 rows are returned as a list of Row objects.

[Row(marks=98, rollno=1, student name='Gottumukkala Sravan'),
 Row(marks=89, rollno=2, student name='Gottumukkala Bobby'),
 Row(marks=90, rollno=3, student name='Lavu Ojaswi')]

take()

take(n) also returns the first n rows of the DataFrame as a list of Row objects. It takes an integer parameter n that specifies the number of rows to return. Unlike head(), the parameter n is required.

Syntax:

dataframe.take(n)

Example:

In this example, we will return the top 3 rows.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]


# create the dataframe from the values
data = spark.createDataFrame(values)


#display top 3 rows
data.take(3)

Output:

The top 3 rows are returned as a list of Row objects.

[Row(marks=98, rollno=1, student name='Gottumukkala Sravan'),
 Row(marks=89, rollno=2, student name='Gottumukkala Bobby'),
 Row(marks=90, rollno=3, student name='Lavu Ojaswi')]

tail()

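tail(n) returns the last n rows of the DataFrame as a list of Row objects. It takes an integer parameter n that specifies the number of rows to return. tail() is available in Spark 3.0 and later, and it moves the requested rows into the driver process, so n should be kept small on large DataFrames.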

Syntax:

dataframe.tail(n)

Example:

In this example, we will return the last 3 rows.

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]


# create the dataframe from the values
data = spark.createDataFrame(values)


#display last 3 rows
data.tail(3)

Output:

The last 3 rows are returned as a list of Row objects.

[Row(marks=90, rollno=3, student name='Lavu Ojaswi'),
 Row(marks=78, rollno=4, student name='Lavu Gnanesh'),
 Row(marks=100, rollno=5, student name='Chennupati Rohith')]

first()

first() does not take any parameters. It returns the first row of the PySpark DataFrame as a single Row object, or None if the DataFrame is empty.

Syntax:

dataframe.first()

Example:

In this example, we will get the first row using first().

# import the below modules
import pyspark
from pyspark.sql import SparkSession

# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()

# create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]


# create the dataframe from the values
data = spark.createDataFrame(values)


#display 1st row
data.first()

Output:

Only the first row is returned, as a single Row object.

Row(marks=98, rollno=1, student name='Gottumukkala Sravan')

About the Author

Gottumukkala Sravan Kumar 171FA07058
B.Tech (Hon's) - IT from Vignan's University. Published 1400+ Technical Articles on Python, R, Swift, Java, C#, LISP, PHP - MySQL and Machine Learning

Published Date: Jun 12, 2023
