November 21, 2024

Introduction to Pandas DataFrame in Python

In this machine learning and data science tutorial, we explain the basics of the Pandas DataFrame, the primary data structure in Pandas. We explain how to construct DataFrame objects from NumPy arrays, how to perform basic operations on them, how to access their column and row entries, and how to use loc and iloc access.

A Pandas DataFrame can be seen as a two-dimensional table that can store heterogeneous data: integers, floats, strings, etc. To use Pandas DataFrame objects, we first need to install NumPy and Pandas. We can do that by opening a terminal and typing

pip install numpy 
pip install pandas 

We create a Pandas DataFrame from a NumPy matrix as follows:


# create a data frame from a NumPy matrix
import numpy as np
import pandas as pd

# create a random 4x4 matrix with standard normal entries
matrix1 = np.random.randn(4, 4)
# create a data frame and print it
frame1 = pd.DataFrame(matrix1)
print(frame1)

The output looks like this (the entries are random, so they will differ from run to run)

          0         1         2         3
0  0.563836  0.397757  1.305083 -0.770596
1  0.182204 -0.073712  0.890952  0.312442
2  1.487856 -1.396047  0.864808  1.452968
3 -1.571482  0.158208  2.035292 -0.322442

We can observe that the row and column labels are integers. If the labels of rows and columns are not provided when constructing DataFrame objects, Pandas automatically assigns integer indices, starting from 0, to the rows and columns.
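To make the default labeling concrete, the short sketch below inspects the labels of a DataFrame built without explicit labels. The names demo_matrix and demo_frame are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# build a 4x4 data frame without supplying any labels
demo_matrix = np.random.randn(4, 4)
demo_frame = pd.DataFrame(demo_matrix)

# both axes receive integer labels that start at 0
print(list(demo_frame.index))    # [0, 1, 2, 3]
print(list(demo_frame.columns))  # [0, 1, 2, 3]
```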

When creating a DataFrame from a NumPy array, we can also directly specify the labels for rows and columns:

matrix2 = np.arange(9).reshape(3, 3)
frame2 = pd.DataFrame(matrix2, index=['p', 'q', 'r'], columns=['Set 1', 'Set 2', 'Set 3'])
print(frame2)

The output will look like this

   Set 1  Set 2  Set 3
p      0      1      2
q      3      4      5
r      6      7      8

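We can also create DataFrame objects from a dictionary whose keys are the column names and whose values are one-dimensional NumPy arrays (Pandas Series objects can be used in the same way). We do it like this:

column1 = np.random.randn(5)
column2 = np.random.randn(5)

# dictionary keys become the column names
dict1 = {'c1': column1, 'c2': column2}

frame3 = pd.DataFrame(dict1)
print(frame3)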

The output looks like this (again, the values are random)

         c1        c2
0 -0.998464 -0.549452
1 -0.564923 -0.247877
2  1.001370  0.088014
3  1.070600 -0.787748
4  0.427591 -0.631744
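The same dictionary construction also accepts Pandas Series objects as values. The sketch below is one possible illustration: it uses two hypothetical Series, s1 and s2, whose row labels only partly overlap, to show that Pandas aligns the columns on the row labels and fills the missing entries with NaN.

```python
import pandas as pd

# two Series with partly overlapping row labels (illustrative data)
s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

# the columns are aligned on the row labels of the Series
aligned = pd.DataFrame({'c1': s1, 'c2': s2})

print(list(aligned.index))            # ['a', 'b', 'c', 'd']
print(aligned.isna().sum().tolist())  # [1, 1]: one missing entry per column
```

When the Series share the same index, as with default integer labels, no missing entries appear and the result matches the NumPy-array construction.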

We can manually change the row labels of the constructed DataFrame object like this:

frame3.index = ['r1', 'r2', 'r3', 'r4', 'r5']
print(frame3)

The output is

          c1        c2
r1 -0.998464 -0.549452
r2 -0.564923 -0.247877
r3  1.001370  0.088014
r4  1.070600 -0.787748
r5  0.427591 -0.631744

We can retrieve and print the shape, row names, and column names of Pandas DataFrame objects like this:

# retrieve the shape as a (rows, columns) tuple
print(frame2.shape)

# get the index (row) names
print(frame2.index)

# get the column names
print(frame2.columns)
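For reference, here is a self-contained sketch of what these three attributes hold for the 3x3 frame2 built earlier. The index and columns are converted to plain lists only to keep the printed output short.

```python
import numpy as np
import pandas as pd

# rebuild frame2 exactly as earlier in the tutorial
matrix2 = np.arange(9).reshape(3, 3)
frame2 = pd.DataFrame(matrix2, index=['p', 'q', 'r'],
                      columns=['Set 1', 'Set 2', 'Set 3'])

print(frame2.shape)          # (3, 3)
print(list(frame2.index))    # ['p', 'q', 'r']
print(list(frame2.columns))  # ['Set 1', 'Set 2', 'Set 3']
```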

We can access the stored values and retrieve the stored NumPy matrix like this

matrix2Extracted = frame2.values
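As a small check, the sketch below confirms that the extracted array matches the original matrix. It also shows the to_numpy() method, which the Pandas documentation recommends over the values attribute. The variable names extracted_values and extracted_numpy are illustrative.

```python
import numpy as np
import pandas as pd

matrix2 = np.arange(9).reshape(3, 3)
frame2 = pd.DataFrame(matrix2, index=['p', 'q', 'r'],
                      columns=['Set 1', 'Set 2', 'Set 3'])

# both calls return the stored data as a NumPy array
extracted_values = frame2.values
extracted_numpy = frame2.to_numpy()

print(type(extracted_numpy))                     # <class 'numpy.ndarray'>
print(np.array_equal(extracted_values, matrix2)) # True
print(np.array_equal(extracted_numpy, matrix2))  # True
```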

We can change the column and index names like this

frame2.columns=['c1','c2','c3']

frame2.index=['r1','r2','r3']
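An alternative that this tutorial does not otherwise use is the rename() method, which changes only the selected labels and returns a new DataFrame instead of modifying the existing one. The sketch below uses its own copy of the frame (named labeled, an illustrative name) so that it does not interfere with frame2.

```python
import numpy as np
import pandas as pd

labeled = pd.DataFrame(np.arange(9).reshape(3, 3),
                       index=['p', 'q', 'r'],
                       columns=['Set 1', 'Set 2', 'Set 3'])

# rename() takes old-name -> new-name mappings and returns a new frame
renamed = labeled.rename(columns={'Set 1': 'c1'}, index={'p': 'r1'})

print(list(renamed.columns))  # ['c1', 'Set 2', 'Set 3']
print(list(renamed.index))    # ['r1', 'q', 'r']
print(list(labeled.columns))  # ['Set 1', 'Set 2', 'Set 3'], original unchanged
```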

We access the columns of a Pandas DataFrame like this

frame2['c1']

frame2['c2']

We can access several columns of DataFrame objects like this

frame2[['c1','c2']]
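The two forms of bracket access return different types, which is worth keeping in mind. The sketch below rebuilds frame2 with the new labels and checks the types: a single column name gives a Series, while a list of names gives a DataFrame, even when the list contains only one name.

```python
import numpy as np
import pandas as pd

frame2 = pd.DataFrame(np.arange(9).reshape(3, 3),
                      index=['r1', 'r2', 'r3'],
                      columns=['c1', 'c2', 'c3'])

single = frame2['c1']            # one name -> Series
several = frame2[['c1', 'c2']]   # list of names -> DataFrame
one_as_frame = frame2[['c1']]    # list with one name -> DataFrame

print(type(single).__name__)     # Series
print(type(several).__name__)    # DataFrame
print(one_as_frame.shape)        # (3, 1)
```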

We can access the columns by using attribute access


frame2.c1
frame2.c2

We can get the values stored in a particular column of the DataFrame like this

frame2.c1.values
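Attribute access is convenient, but it only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method. The sketch below illustrates this with a hypothetical frame (named tricky) whose columns are chosen to trigger both cases. Bracket access works for every column name.

```python
import numpy as np
import pandas as pd

# 'Set 2' contains a space, and 'count' is also the name of a DataFrame method
tricky = pd.DataFrame(np.arange(9).reshape(3, 3),
                      columns=['c1', 'Set 2', 'count'])

print(tricky.c1.values)          # [0 3 6]
print(tricky['Set 2'].values)    # [1 4 7], tricky.Set 2 would be a syntax error
print(callable(tricky.count))    # True: this is the method, not the column
print(tricky['count'].values)    # [2 5 8]
```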

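There are several ways of accessing the rows of DataFrame objects, and some of them are confusing. Consequently, let us stick to the less confusing approaches based on loc (access by label) and iloc (access by integer position). Rows can be accessed by label like this

frame2.loc['r1']

The output is shown below. The reported dtype depends on the platform and may be int64 instead of int32.

c1    0
c2    1
c3    2
Name: r1, dtype: int32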

We can access two rows at the same time

frame2.loc[['r1','r2']]

Rows can be accessed by location using .iloc

frame2.iloc[0]

frame2.iloc[[0,1]]
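One difference between the two indexers that is easy to miss concerns slicing: a label slice with loc includes its end label, while a position slice with iloc excludes its end position, as in ordinary Python slicing. A minimal sketch, rebuilding frame2 with the labels used above:

```python
import numpy as np
import pandas as pd

frame2 = pd.DataFrame(np.arange(9).reshape(3, 3),
                      index=['r1', 'r2', 'r3'],
                      columns=['c1', 'c2', 'c3'])

by_label = frame2.loc['r1':'r2']   # end label included -> rows r1 and r2
by_position = frame2.iloc[0:1]     # end position excluded -> row r1 only

print(by_label.shape)      # (2, 3)
print(by_position.shape)   # (1, 3)

# the first row is the same whether selected by label or by position
print(frame2.loc['r1'].equals(frame2.iloc[0]))  # True
```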

We can access individual values of the Pandas DataFrame by row and column label, using the loc and at indexers or chained attribute access

frame2.loc['r1','c1']

frame2.c1.r2

frame2.at['r2','c2']

We can also access individual values of the Pandas DataFrame by integer position, using the iloc and iat indexers

frame2.iat[0,1]

frame2.iloc[2,2]
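To tie the four indexers together, the sketch below reads the same entry (row r2, column c2, which sits at position 1, 1) in all four ways and checks that the results agree. The final assignment through at is an addition for illustration and is not covered elsewhere in this tutorial.

```python
import numpy as np
import pandas as pd

frame2 = pd.DataFrame(np.arange(9).reshape(3, 3),
                      index=['r1', 'r2', 'r3'],
                      columns=['c1', 'c2', 'c3'])

# label-based access
v_loc = frame2.loc['r2', 'c2']
v_at = frame2.at['r2', 'c2']
# position-based access to the same entry
v_iloc = frame2.iloc[1, 1]
v_iat = frame2.iat[1, 1]

print(v_loc, v_at, v_iloc, v_iat)  # 4 4 4 4

# the at indexer can also be used to overwrite a single entry
frame2.at['r2', 'c2'] = 40
print(frame2.iat[1, 1])            # 40
```

The at and iat indexers accept only a single row and a single column, whereas loc and iloc also accept lists and slices.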