Introduction to Pandas
History: Pandas were initially developed by Wes McKinney in 2008 while he was working at AQR Capital Management. He convinced the AQR to allow him to open source the Pandas. Another AQR employee, Chang She, joined as the second major contributor to the library in 2012. Over time many versions of pandas have been released. The latest version of the pandas is 1.5.0, released on Sep 19, 2022.
Advantages
- Fast and efficient for manipulating and analyzing data.
- Data from different file objects can be loaded.
- Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
- Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Data set merging and joining.
- Flexible reshaping and pivoting of data sets
- Provides time-series functionality.
- Powerful group by functionality for performing split-apply-combine operations on data sets.
Getting Started
The first step of working in pandas is to ensure whether it is installed in the Python folder or not. If not then we need to install it in our system using pip command. Type cmd command in the search box and locate the folder using cd command where python-pip file has been installed. After locating it, type the command:
pip install pandas
After the pandas have been installed into the system, you need to import the library. This module is generally imported as:
import pandas as pd
Here, pd is referred to as an alias to the Pandas. However, it is not necessary to import the library using the alias, it just helps in writing less amount code every time a method or property is called.
Pandas generally provide two data structures for manipulating data, They are:
- Series
- DataFrame
Series: Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.). The axis labels are collectively called indexes. Pandas Series is nothing but a column in an excel sheet. Labels need not be unique but must be a hashable type. The object supports both integer and label-based indexing and provides a host of methods for performing operations involving the index.
Creating a Series
In the real world, a Pandas Series will be created by loading the datasets from existing storage, storage can be SQL Database, CSV file, an Excel file. Pandas Series can be created from the lists, dictionary, and from a scalar value etc.
Example:
- Python3
Output:
Series([], dtype: float64) 0 g 1 e 2 e 3 k 4 s dtype: object
DataFrame
Pandas DataFrame is a two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. Pandas DataFrame consists of three principal components, the data, rows, and columns.
Creating a DataFrame:
In the real world, a Pandas DataFrame will be created by loading the datasets from existing storage, storage can be SQL Database, CSV file, an Excel file. Pandas DataFrame can be created from the lists, dictionary, and from a list of dictionaries, etc.
Example:
- Python3
Output:
Empty DataFrame Columns: [] Index: [] 0 0 ujjwal 1 matoliya
Why Pandas is used for Data Science
Pandas are generally used for data science but have you wondered why? This is because pandas are used in conjunction with other libraries that are used for data science. It is built on the top of the NumPy library which means that a lot of structures of NumPy are used or replicated in Pandas. The data produced by Pandas are often used as input for plotting functions of Matplotlib, statistical analysis in SciPy, and machine learning algorithms in Scikit-learn.
Pandas program can be run from any text editor but it is recommended to use Jupyter Notebook for this as Jupyter given the ability to execute code in a particular cell rather than executing the entire file. Jupyter also provides an easy way to visualize pandas data frames and plots.
Key Features of Pandas
- It has a fast and efficient DataFrame object with the default and customized indexing.
- Used for reshaping and pivoting of the data sets.
- Group by data for aggregations and transformations.
- It is used for data alignment and integration of the missing data.
- Provide the functionality of Time Series.
- Process a variety of data sets in different formats like matrix data, tabular heterogeneous, time series.
- Handle multiple operations of the data sets such as subsetting, slicing, filtering, groupBy, re-ordering, and re-shaping.
- It integrates with the other libraries such as SciPy, and scikit-learn.
- Provides fast performance, and If you want to speed it, even more, you can use the Cython.
Benefits of Pandas
The benefits of pandas over using other language are as follows:
- Data Representation: It represents the data in a form that is suited for data analysis through its DataFrame and Series.
- Clear code: The clear API of the Pandas allows you to focus on the core part of the code. So, it provides clear and concise code for the user.
Python Pandas Data Structure
The Pandas provides two data structures for processing the data, i.e., Series and DataFrame, which are discussed below:
1) Series
It is defined as a one-dimensional array that is capable of storing various data types. The row labels of series are called the index. We can easily convert the list, tuple, and dictionary into series using "series' method. A Series cannot contain multiple columns. It has one parameter:
Data: It can be any list, dictionary, or scalar value.
Creating Series from Array:
Before creating a Series, Firstly, we have to import the numpy module and then use array() function in the program.
Output
0 P 1 a 2 n 3 d 4 a 5 s dtype: object
Explanation: In this code, firstly, we have imported the pandas and numpy library with the pd and np alias. Then, we have taken a variable named "info" that consist of an array of some values. We have called the info variable through a Series method and defined it in an "a" variable. The Series has printed by calling the print(a) method.
Python Pandas DataFrame
It is a widely used data structure of pandas and works with a two-dimensional array with labeled axes (rows and columns). DataFrame is defined as a standard way to store data and has two different indexes, i.e., row index and column index. It consists of the following properties:
- The columns can be heterogeneous types like int, bool, and so on.
- It can be seen as a dictionary of Series structure where both the rows and columns are indexed. It is denoted as "columns" in case of columns and "index" in case of rows.
Create a DataFrame using List:
We can easily create a DataFrame in Pandas using list.
Output
0 0 Python 1 Pandas
Explanation: In this code, we have defined a variable named "x" that consist of string values. The DataFrame constructor is being called on a list to print the values.