Showing results for tags 'dataframes'.

pyspark Simplify PySpark testing with DataFrame equality functions

Databricks posted a topic in Databases, Data Engineering & Data Science

The DataFrame equality test functions were introduced in Apache Spark™ 3.5 and Databricks Runtime 14.2 to simplify PySpark unit testing. The full set o... View the full article

python Pandas DataFrame Groupby()

Linux Hint posted a topic in Databases, Data Engineering & Data Science

When working with large datasets in Python, we often need to analyze the data group by group: split the rows into groups, then perform an operation on each group. The Pandas “DataFrame.groupby()” method does this. It groups the data by one or more columns (or other values) and applies a function to each group. This write-up is a detailed guide to the Pandas “DataFrame.groupby()” method, with the following contents:

What is the “DataFrame.groupby()” Method in Python?
Group the Data Based on a Specified Column
Group the Data Based on Multiple Columns
Group the Data Based on an Index Column
Apply a Function to Grouped Data
Sort the Grouped Data

View the full article
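The operations named in the contents list can be sketched roughly as follows. This is a minimal illustration rather than the article's own code: the sample data and the column names ("team", "role", "score") are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "team": ["A", "A", "B", "B", "B"],
    "role": ["dev", "ops", "dev", "dev", "ops"],
    "score": [10, 20, 30, 40, 50],
})

# Group by a single column and sum each group.
by_team = df.groupby("team")["score"].sum()

# Group by multiple columns and take the mean of each group.
by_team_role = df.groupby(["team", "role"])["score"].mean()

# Group by the index: make "team" the index, then group on index level 0.
by_index = df.set_index("team").groupby(level=0)["score"].max()

# Apply several aggregations at once, then sort the grouped result.
summary = (
    df.groupby("team")["score"]
    .agg(["count", "sum"])
    .sort_values("sum", ascending=False)
)

print(by_team)
print(by_team_role)
print(by_index)
print(summary)
```

Selecting the "score" column before aggregating keeps each result limited to the numeric data, which avoids errors from trying to aggregate text columns.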

pandas Pandas Add Header

Linux Hint posted a topic in Databases, Data Engineering & Data Science

Python supports a variety of modules and functions for data analysis and manipulation. The Pandas data structure named “DataFrame” is used to store and manipulate tabular data. The header of a DataFrame holds the column names, which makes it easy to identify and access the data. This post shows several ways to add a header to a Pandas DataFrame, with examples.

How to Add Header to Pandas DataFrame?

The following approaches add a header to a DataFrame:

Using “pd.DataFrame()” Columns Parameter
Using “DataFrame.columns” Attribute
Using “DataFrame.set_axis()” Method

Add Header to Pandas DataFrame Using “pd.DataFrame()” Columns Parameter

The “pd.DataFrame()” constructor takes a “columns” argument and uses it as the header of the newly created DataFrame. For example, in the code below, “pd.DataFrame()” creates a DataFrame whose column names are “Name”, “Age”, and “Height”.

    import pandas

    data1 = [["Joseph", 20, 5.3], ["Henry", 25, 4.6], ["Lily", 32, 4.7]]
    column_names = ["Name", "Age", "Height"]
    print(pandas.DataFrame(data1, columns=column_names))

The above code prints the three rows under the column names “Name”, “Age”, and “Height”.

Add Header to Pandas DataFrame Using “DataFrame.columns” Attribute

“DataFrame.columns” is an attribute, not a method, so the header is added by assigning a list of names to it. The following code adds a header to a DataFrame that was created without one:

    import pandas

    df = pandas.DataFrame([["Joseph", 20, 5.3], ["Henry", 25, 4.6], ["Lily", 32, 4.7]])
    column_names = ["Name", "Age", "Height"]
    df.columns = column_names
    print(df)

The above code prints the same DataFrame, now with the assigned column names.

Add Header to Pandas DataFrame Using “DataFrame.set_axis()” Method

The “set_axis()” method changes the labels of a DataFrame. We can change the labels of the columns or the rows by passing a list of labels as the “labels” argument. The code below uses it to add a header to a newly created DataFrame by passing the list of header values and the axis. The “axis” argument indicates which axis the labels are assigned to: “0” specifies the rows, and “1” specifies the columns. The method returns a new DataFrame, so the result is assigned back to “df”.

    import pandas

    df = pandas.DataFrame([["Joseph", 20, 5.3], ["Henry", 25, 4.6], ["Lily", 32, 4.7]])
    column_names = ["Name", "Age", "Height"]
    df = df.set_axis(column_names, axis=1)
    print(df)

The above code again prints the DataFrame with “Name”, “Age”, and “Height” as its header.

Add Multiple Header to Pandas DataFrame

To add multiple header levels to a Pandas DataFrame, the “pandas.MultiIndex.from_tuples()” method is used together with the “df.columns” attribute. In the code below, “pandas.DataFrame()” creates the DataFrame with a single header row. After that, “pandas.MultiIndex.from_tuples()” builds a two-level column index that pairs 'A', 'B', and 'C' with the existing column names, and the result is assigned to “df.columns”.

    import pandas

    df = pandas.DataFrame(
        [["Joseph", 20, 5.3], ["Henry", 25, 4.6], ["Lily", 32, 4.7]],
        columns=["Name", "Age", "Height"],
    )
    df.columns = pandas.MultiIndex.from_tuples(zip(['A', 'B', 'C'], df.columns))
    print(df)

The above code prints the DataFrame with two header levels: 'A', 'B', 'C' on top and “Name”, “Age”, “Height” beneath.

Conclusion

The “pd.DataFrame()” columns parameter, the “DataFrame.columns” attribute, and the “DataFrame.set_axis()” method can each be used to add a header to a Pandas DataFrame. The first applies while creating the DataFrame; the other two apply after it has been created. We can also add multiple header levels by combining “pandas.MultiIndex.from_tuples()” with “df.columns”.
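As a quick cross-check, the three single-header approaches can be compared side by side. This is a small sketch that reuses the article's sample rows; the variable names (df_ctor, df_attr, df_raw, df_axis) are made up for the illustration.

```python
import pandas as pd

rows = [["Joseph", 20, 5.3], ["Henry", 25, 4.6], ["Lily", 32, 4.7]]
column_names = ["Name", "Age", "Height"]

# 1. Pass the labels when the DataFrame is created.
df_ctor = pd.DataFrame(rows, columns=column_names)

# 2. Assign to the columns attribute of an existing DataFrame (in place).
df_attr = pd.DataFrame(rows)
df_attr.columns = column_names

# 3. set_axis() returns a new DataFrame and leaves the original untouched.
df_raw = pd.DataFrame(rows)
df_axis = df_raw.set_axis(column_names, axis=1)

# All three results carry the same data and the same header.
same_header = df_ctor.equals(df_attr) and df_ctor.equals(df_axis)
print(same_header)

# The DataFrame passed through set_axis() still has its default labels.
print(list(df_raw.columns))
```

The practical difference is that assigning to the columns attribute modifies the DataFrame in place, while set_axis() produces a new object, which is convenient in method chains.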

pyspark Check the Given Data is PySpark RDD or DataFrame

Linux Hint posted a topic in Databases, Data Engineering & Data Science

PySpark is the Python API for Apache Spark. RDD stands for Resilient Distributed Dataset, the fundamental data structure in Apache Spark. An RDD can be created from local data with parallelize().

Syntax:

    spark_app.sparkContext.parallelize(data)

We can also hold the data in a tabular format, meaning rows and columns. The data structure used for this is the DataFrame. In PySpark, we can create a DataFrame from the Spark app with the createDataFrame() method.

Syntax:

    spark_app.createDataFrame(input_data, columns)

Here input_data may be a list of dictionaries or a nested list. If input_data is a list of dictionaries, the columns argument is not needed. If it is a nested list, we have to provide the column names.

Now, let’s discuss how to check whether the given data is a PySpark RDD or a DataFrame.

Creation of PySpark RDD:

In this example, we will create an RDD named students and display it using the collect() action.

    # import SparkSession for creating a session
    from pyspark.sql import SparkSession

    # create an app named linuxhint
    spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

    # create student data with 5 rows and 6 attributes
    students = spark_app.sparkContext.parallelize([
        {'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'},
        {'rollno': '002', 'name': 'ojaswi', 'age': 16, 'height': 3.79, 'weight': 34, 'address': 'hyd'},
        {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7, 'height': 2.79, 'weight': 17, 'address': 'patna'},
        {'rollno': '004', 'name': 'rohith', 'age': 9, 'height': 3.69, 'weight': 28, 'address': 'hyd'},
        {'rollno': '005', 'name': 'sridevi', 'age': 37, 'height': 5.59, 'weight': 54, 'address': 'hyd'},
    ])

    # display the RDD using collect()
    print(students.collect())

Output:

    [{'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'},
     {'rollno': '002', 'name': 'ojaswi', 'age': 16, 'height': 3.79, 'weight': 34, 'address': 'hyd'},
     {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7, 'height': 2.79, 'weight': 17, 'address': 'patna'},
     {'rollno': '004', 'name': 'rohith', 'age': 9, 'height': 3.69, 'weight': 28, 'address': 'hyd'},
     {'rollno': '005', 'name': 'sridevi', 'age': 37, 'height': 5.59, 'weight': 54, 'address': 'hyd'}]

Creation of PySpark DataFrame:

In this example, we will create a DataFrame named df from the students’ data and display it using the show() method, which prints the five rows as a table.

    # import SparkSession for creating a session
    from pyspark.sql import SparkSession

    # create an app named linuxhint
    spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

    # create student data with 5 rows and 6 attributes
    students = [
        {'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'},
        {'rollno': '002', 'name': 'ojaswi', 'age': 16, 'height': 3.79, 'weight': 34, 'address': 'hyd'},
        {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7, 'height': 2.79, 'weight': 17, 'address': 'patna'},
        {'rollno': '004', 'name': 'rohith', 'age': 9, 'height': 3.69, 'weight': 28, 'address': 'hyd'},
        {'rollno': '005', 'name': 'sridevi', 'age': 37, 'height': 5.59, 'weight': 54, 'address': 'hyd'},
    ]

    # create the dataframe
    df = spark_app.createDataFrame(students)

    # display the dataframe
    df.show()

Method 1 : isinstance()

In Python, the built-in isinstance() function checks whether the given object (the data) is an instance of the given type (RDD or DataFrame).

Syntax:

    isinstance(object, RDD/DataFrame)

Parameters:

object refers to the data.
RDD is the type available in the pyspark.rdd module, and DataFrame is the type available in the pyspark.sql module.

It returns a Boolean value (True/False). If the data is an RDD and the type passed is RDD, it returns True; otherwise it returns False.
Similarly, if the data is a DataFrame and the type passed is DataFrame, it returns True; otherwise it returns False.

Example 1: Check for RDD object

In this example, we will apply isinstance() to an RDD object.

    # import SparkSession for creating a session and DataFrame for the type check
    from pyspark.sql import SparkSession, DataFrame

    # import RDD from pyspark.rdd
    from pyspark.rdd import RDD

    # create an app named linuxhint
    spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

    # create student data with 5 rows and 6 attributes
    students = spark_app.sparkContext.parallelize([
        {'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'},
        {'rollno': '002', 'name': 'ojaswi', 'age': 16, 'height': 3.79, 'weight': 34, 'address': 'hyd'},
        {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7, 'height': 2.79, 'weight': 17, 'address': 'patna'},
        {'rollno': '004', 'name': 'rohith', 'age': 9, 'height': 3.69, 'weight': 28, 'address': 'hyd'},
        {'rollno': '005', 'name': 'sridevi', 'age': 37, 'height': 5.59, 'weight': 54, 'address': 'hyd'},
    ])

    # check if the students object is an RDD
    print(isinstance(students, RDD))

    # check if the students object is a DataFrame
    print(isinstance(students, DataFrame))

Output:

    True
    False

First, we compared students with RDD, which returned True because students is an RDD. Then we compared students with DataFrame, which returned False because students is an RDD, not a DataFrame.

Example 2: Check for DataFrame object

In this example, we will apply isinstance() to the DataFrame object.
    # import SparkSession for creating a session and DataFrame for the type check
    from pyspark.sql import SparkSession, DataFrame

    # import RDD from pyspark.rdd
    from pyspark.rdd import RDD

    # create an app named linuxhint
    spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

    # create student data with 5 rows and 6 attributes
    students = [
        {'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'},
        {'rollno': '002', 'name': 'ojaswi', 'age': 16, 'height': 3.79, 'weight': 34, 'address': 'hyd'},
        {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7, 'height': 2.79, 'weight': 17, 'address': 'patna'},
        {'rollno': '004', 'name': 'rohith', 'age': 9, 'height': 3.69, 'weight': 28, 'address': 'hyd'},
        {'rollno': '005', 'name': 'sridevi', 'age': 37, 'height': 5.59, 'weight': 54, 'address': 'hyd'},
    ]

    # create the dataframe
    df = spark_app.createDataFrame(students)

    # check if df is an RDD
    print(isinstance(df, RDD))

    # check if df is a DataFrame
    print(isinstance(df, DataFrame))

Output:

    False
    True

First, we compared df with RDD, which returned False because df is a DataFrame. Then we compared df with DataFrame, which returned True because df is a DataFrame, not an RDD.

Method 2 : type()

In Python, the built-in type() function returns the class of the specified object. It takes the object as its parameter.

Syntax:

    type(object)

Example 1: Check for an RDD object

We will apply type() to the RDD object.
    # import SparkSession for creating a session
    from pyspark.sql import SparkSession

    # create an app named linuxhint
    spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

    # create student data with 5 rows and 6 attributes
    students = spark_app.sparkContext.parallelize([
        {'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'},
        {'rollno': '002', 'name': 'ojaswi', 'age': 16, 'height': 3.79, 'weight': 34, 'address': 'hyd'},
        {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7, 'height': 2.79, 'weight': 17, 'address': 'patna'},
        {'rollno': '004', 'name': 'rohith', 'age': 9, 'height': 3.69, 'weight': 28, 'address': 'hyd'},
        {'rollno': '005', 'name': 'sridevi', 'age': 37, 'height': 5.59, 'weight': 54, 'address': 'hyd'},
    ])

    # check the type of students
    print(type(students))

Output:

    <class 'pyspark.rdd.RDD'>

We can see that the class RDD is returned.

Example 2: Check for DataFrame object

We will apply type() to the DataFrame object.
    # import SparkSession for creating a session
    from pyspark.sql import SparkSession

    # create an app named linuxhint
    spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

    # create student data with 5 rows and 6 attributes
    students = [
        {'rollno': '001', 'name': 'sravan', 'age': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'},
        {'rollno': '002', 'name': 'ojaswi', 'age': 16, 'height': 3.79, 'weight': 34, 'address': 'hyd'},
        {'rollno': '003', 'name': 'gnanesh chowdary', 'age': 7, 'height': 2.79, 'weight': 17, 'address': 'patna'},
        {'rollno': '004', 'name': 'rohith', 'age': 9, 'height': 3.69, 'weight': 28, 'address': 'hyd'},
        {'rollno': '005', 'name': 'sridevi', 'age': 37, 'height': 5.59, 'weight': 54, 'address': 'hyd'},
    ]

    # create the dataframe
    df = spark_app.createDataFrame(students)

    # check the type of df
    print(type(df))

Output:

    <class 'pyspark.sql.dataframe.DataFrame'>

We can see that the class DataFrame is returned.

Conclusion

In this article, we saw two ways to check whether the given data or object is an RDD or a DataFrame: isinstance() and type(). Note that isinstance() returns a Boolean: True if the object is an instance of the given type, otherwise False. type() returns the class of the given data or object.
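One practical difference between the two checks is that isinstance() also accepts subclasses, while comparing the result of type() against a class does not. In PySpark this matters because a transformation such as map() returns a PipelinedRDD, which is a subclass of RDD. The sketch below uses stand-in classes with the same inheritance relationship so that it runs without a Spark installation; the classes and the kind_of() helper are assumptions made for illustration, not PySpark code.

```python
# Stand-in classes (NOT the real PySpark classes) that mirror the
# inheritance relationship between pyspark.rdd.RDD and PipelinedRDD.
class RDD:
    pass


class PipelinedRDD(RDD):
    pass


class DataFrame:
    pass


def kind_of(obj):
    """Hypothetical helper: classify an object with isinstance()."""
    if isinstance(obj, DataFrame):
        return "DataFrame"
    if isinstance(obj, RDD):
        return "RDD"
    return "other"


mapped = PipelinedRDD()

# isinstance() accepts the subclass, so the object is classified as an RDD.
print(kind_of(mapped))

# An exact type comparison rejects the subclass.
print(type(mapped) is RDD)

print(kind_of(DataFrame()))
```

For a yes/no check, isinstance() is therefore usually the safer choice, and type() is more useful for inspecting or logging the concrete class.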
