Convert PySpark DataFrame to pandas DataFrame
In this PySpark tutorial, we will discuss how to convert PySpark DataFrame to pandas DataFrame.
Introduction:
A DataFrame in PySpark is a two-dimensional data structure: one dimension refers to the rows and the other to the columns, so it stores data in rows and columns.
Let's install the pyspark module before we begin. The command to install any Python module is "pip".
Syntax:
pip install module_name
Installing PySpark:
pip install pyspark
Steps to create dataframe in PySpark:
1. Import the below modules
import pyspark
from pyspark.sql import SparkSession
2. Create a Spark session named tutorialsinhand using the getOrCreate() method
Syntax:
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
3. Create a list of values for the dataframe
4. Pass this list to the createDataFrame() method to create the pyspark dataframe
Syntax:
spark.createDataFrame(list of values)
Let's create a PySpark DataFrame with 5 rows and 3 columns.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#display dataframe
data.show()
Output:
+-----+------+-------------------+
|marks|rollno| student name|
+-----+------+-------------------+
| 98| 1|Gottumukkala Sravan|
| 89| 2| Gottumukkala Bobby|
| 90| 3| Lavu Ojaswi|
| 78| 4| Lavu Gnanesh|
| 100| 5| Chennupati Rohith|
+-----+------+-------------------+
A DataFrame in pandas is likewise a two-dimensional data structure that stores data in rows and columns.
pandas is a module used for data analysis.
Syntax to import:
import pandas
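For orientation, here is a minimal pure-pandas sketch showing the kind of DataFrame we will be converting to: pandas, like PySpark's createDataFrame(), can build a DataFrame directly from a list of dictionaries (two of the sample records from above are reused here).

```python
import pandas as pd

# two of the student records used in the PySpark examples
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89}]

# pandas creates one column per dictionary key
df = pd.DataFrame(values)
print(df)
print(type(df))
```

Note that pandas keeps the columns in the order the keys first appear, while PySpark's createDataFrame() sorts dictionary keys alphabetically, which is why the PySpark output above shows marks first.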
Method 1 : Using toPandas()
toPandas() is used to convert a PySpark dataframe to a pandas dataframe.
Syntax:
dataframe.toPandas()
where dataframe is the input pyspark dataframe that will be converted into a pandas dataframe.
Example:
In this example, we convert a pyspark dataframe into a pandas dataframe and also display the type of the converted dataframe with the type() function.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]
# create the dataframe from the values
data = spark.createDataFrame(values)
#convert into pandas dataframe
print(data.toPandas())
print()
#get the type
print(type(data.toPandas()))
Output:
   marks  rollno         student name
0     98       1  Gottumukkala Sravan
1     89       2   Gottumukkala Bobby
2     90       3          Lavu Ojaswi
3     78       4         Lavu Gnanesh
4    100       5    Chennupati Rohith
<class 'pandas.core.frame.DataFrame'>
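Because the result of toPandas() is an ordinary pandas DataFrame, every pandas operation now applies to it. The sketch below builds the same frame directly in pandas (standing in for data.toPandas(), so it runs without a Spark session) and applies two common pandas operations:

```python
import pandas as pd

# stand-in for data.toPandas(): the same records as the PySpark example
pdf = pd.DataFrame([{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
                    {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
                    {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
                    {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
                    {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}])

# ordinary pandas operations work on the converted frame
print(pdf['marks'].mean())                            # average marks
print(pdf.sort_values('marks', ascending=False))      # rank students by marks
```

Keep in mind that toPandas() collects the whole distributed dataframe into the driver's memory, so it is only suitable for data that fits on a single machine.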
Method 2 : Using iterrows() with toPandas()
In this method, we first convert the pyspark dataframe to a pandas dataframe using the toPandas() method, then iterate over it row by row with the pandas iterrows() method.
iterrows() yields two values per row, here named i and j: i is the row's index label, and j is the row itself as a pandas Series, whose values can be accessed by position (with iloc) or by column name.
Note - Index starts with 0.
Syntax:
for i, j in dataframe.toPandas().iterrows():
    print(j.iloc[index], ........)
Example:
In this example, we convert to a pandas dataframe and display all columns row by row.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
#create a list of data
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]
# create the dataframe from the values
data = spark.createDataFrame(values)
# iterate over the converted pandas dataframe; j is each row as a Series
for i, j in data.toPandas().iterrows():
    print(j.iloc[0], " ", j.iloc[1], " ", j.iloc[2])
Output:
98 1 Gottumukkala Sravan
89 2 Gottumukkala Bobby
90 3 Lavu Ojaswi
78 4 Lavu Gnanesh
100 5 Chennupati Rohith
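Positional access like j.iloc[0] depends on column order, which PySpark sorts alphabetically here; accessing values by column name is more readable and does not break if the order changes. A sketch of the same loop using column names, with a pandas frame standing in for data.toPandas() so it runs without a Spark session:

```python
import pandas as pd

# stand-in for data.toPandas(): the same records as the PySpark example
pdf = pd.DataFrame([{'rollno': 1, 'student name': 'Gottumukkala Sravan', 'marks': 98},
                    {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
                    {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90}])

rows = []
# i is the row's index label, j is the row as a pandas Series
for i, j in pdf.iterrows():
    rows.append((j['marks'], j['rollno'], j['student name']))
    print(j['marks'], j['rollno'], j['student name'])
```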
About the Author
Gottumukkala Sravan Kumar 171FA07058
B.Tech (Hon's) - IT from Vignan's University.
Published 1400+ Technical Articles on Python, R, Swift, Java, C#, LISP, PHP - MySQL and Machine Learning
Published Date :
Jun 14,2024