In this PySpark tutorial, we will discuss how to use the Row class to create a PySpark DataFrame.
Introduction:
A DataFrame in PySpark is a two-dimensional data structure that stores data in rows and columns: one dimension refers to the rows and the other to the columns.
Let's install the pyspark module before we begin. The command to install any module in Python is "pip".
Syntax:
pip install module_name
Installing PySpark:
pip install pyspark
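To confirm that the installation worked, a quick version check like the one below can be run (a minimal sketch; the printed version depends on the release pip installed):
# check that pyspark can be imported and print its version
import pyspark
print(pyspark.__version__)   # for example, 3.5.1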
Steps to create a DataFrame in PySpark:
1. Import the modules below
import pyspark
from pyspark.sql import SparkSession
2. Create a Spark session with the app name tutorialsinhand using the getOrCreate() method
Syntax:
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
3. Create rows using the Row class
4. Pass the list of Row objects to the createDataFrame() method to create the PySpark DataFrame
Syntax:
spark.createDataFrame(list_of_rows)
The Row class is used to create the rows for the DataFrame.
Syntax:
[Row(column=value),........]
where column represents the column name in the PySpark DataFrame and value represents the row value for that column.
We have to import Row from the pyspark.sql module.
Syntax:
from pyspark.sql import Row
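Before building a full DataFrame, it helps to see how a single Row behaves on its own. The short sketch below uses illustrative values and shows that a field can be read by attribute, by key, or converted to a dictionary with asDict():
from pyspark.sql import Row

# a single Row with column names given as keyword arguments
r = Row(rollno=1, name='Gottumukkala Sravan', marks=98)

print(r.rollno)    # 1 - fields are available as attributes
print(r['name'])   # Gottumukkala Sravan - or by key
print(r.asDict())  # {'rollno': 1, 'name': 'Gottumukkala Sravan', 'marks': 98}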
Example:
In this example, we will create a PySpark DataFrame with 5 rows and 3 columns.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
# create rows using the Row class
rows = [Row(rollno=1, name='Gottumukkala Sravan', marks=98),
        Row(rollno=2, name='Gottumukkala Bobby', marks=89),
        Row(rollno=3, name='Lavu Ojaswi', marks=90),
        Row(rollno=4, name='Lavu Gnanesh', marks=78),
        Row(rollno=5, name='Chennupati Rohith', marks=100)]
# create the dataframe from rows
data = spark.createDataFrame(rows)
# display the dataframe
data.show()
Output:
The column names are rollno, name and marks.
+------+-------------------+-----+
|rollno| name|marks|
+------+-------------------+-----+
| 1|Gottumukkala Sravan| 98|
| 2| Gottumukkala Bobby| 89|
| 3| Lavu Ojaswi| 90|
| 4| Lavu Gnanesh| 78|
| 5| Chennupati Rohith| 100|
+------+-------------------+-----+
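Since no schema was passed, PySpark infers the column types from the Python values inside the Rows. Continuing the example above, the inferred schema can be checked with printSchema() (a minimal sketch; integer columns are typically inferred as long and text columns as string):
# print the schema inferred from the Row values
data.printSchema()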
We can also define the column names first and then pass only the row values.
Syntax:
column_names=Row(column,...............)
[column_names(value1,..................),.........]
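In other words, Row can act as a row template: name the columns once and then build each row by passing only the values, in the same order. A minimal sketch (the person variable name is just illustrative):
# name the columns once, then build rows positionally
person = Row("rollno", "name", "marks")
row1 = person(1, 'Gottumukkala Sravan', 98)

print(row1)        # Row(rollno=1, name='Gottumukkala Sravan', marks=98)
print(row1.marks)  # 98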
Example:
In this example, we will create the same PySpark DataFrame with 5 rows and 3 columns, this time defining the column names first.
# import the below modules
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row
# create an app
spark = SparkSession.builder.appName('tutorialsinhand').getOrCreate()
# create the columns using the Row class
col=Row("rollno","name","marks")
rows = [col(1, 'Gottumukkala Sravan', 98),
        col(2, 'Gottumukkala Bobby', 89),
        col(3, 'Lavu Ojaswi', 90),
        col(4, 'Lavu Gnanesh', 78),
        col(5, 'Chennupati Rohith', 100)]
# create the dataframe from rows
data = spark.createDataFrame(rows)
# display the dataframe
data.show()
Output:
The column names are rollno, name and marks.
+------+-------------------+-----+
|rollno| name|marks|
+------+-------------------+-----+
| 1|Gottumukkala Sravan| 98|
| 2| Gottumukkala Bobby| 89|
| 3| Lavu Ojaswi| 90|
| 4| Lavu Gnanesh| 78|
| 5| Chennupati Rohith| 100|
+------+-------------------+-----+
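The rows can also be read back from the DataFrame as Row objects, for example with collect() (a minimal sketch, continuing from the example above):
# collect() returns the DataFrame rows to the driver as a list of Row objects
collected = data.collect()
print(collected[0])        # Row(rollno=1, name='Gottumukkala Sravan', marks=98)
print(collected[0].name)   # Gottumukkala Sravan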