In this machine learning and data science tutorial, we explain the basics of the Pandas DataFrame, the primary data structure in Pandas. We explain how to construct DataFrame objects from NumPy arrays, how to perform basic operations on them, and how to access their columns, rows, and individual entries by using loc and iloc.
A Pandas DataFrame can be seen as a two-dimensional table that can store heterogeneous data: integers, floats, strings, etc. To use Pandas DataFrame objects, we first need to install NumPy and Pandas. We can do that by opening a terminal and typing
pip install numpy
pip install pandas
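As a quick check that the installation works, and to illustrate the point about heterogeneous data, the following minimal sketch builds a small table with one integer column, one float column, and one string column. The column names and the values are made up for this illustration.

```python
import pandas as pd

# hypothetical example data: each column holds a different type
mixed = pd.DataFrame({
    'quantity': [1, 2, 3],        # integers
    'weight': [0.5, 1.5, 2.5],    # floats
    'label': ['a', 'b', 'c'],     # strings
})

# dtypes reports the data type of every column
print(mixed.dtypes)
```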
We create a Pandas DataFrame from a NumPy matrix as follows
# create a data frame from a numpy matrix
import numpy as np
import pandas as pd
# create a random matrix
matrix1=np.random.randn(4,4)
# create a data frame
frame1=pd.DataFrame(matrix1)
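Displaying frame1, for example with print(frame1), produces output similar to the following. The entries are random, so they differ on every run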
          0         1         2         3
0  0.563836  0.397757  1.305083 -0.770596
1  0.182204 -0.073712  0.890952  0.312442
2  1.487856 -1.396047  0.864808  1.452968
3 -1.571482  0.158208  2.035292 -0.322442
We can observe that the row and column labels are integers. If row and column labels are not provided when constructing a DataFrame object, Pandas automatically assigns the default integer labels 0, 1, 2, and so on to both rows and columns.
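The default labels can be inspected directly. The sketch below rebuilds a frame like frame1 and reads its index and columns attributes; the helper names row_labels and column_labels are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# a 4x4 matrix of standard normal samples; the values differ on every run
matrix1 = np.random.randn(4, 4)
frame1 = pd.DataFrame(matrix1)

# no labels were given, so both axes receive the integer labels 0, 1, 2, 3
print(frame1.index)
print(frame1.columns)

# convert the labels to plain Python lists for easy inspection
row_labels = list(frame1.index)
column_labels = list(frame1.columns)
print(row_labels, column_labels)
```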
When creating a DataFrame from a NumPy array, we can also directly specify the labels for rows and columns
matrix2=np.arange(9).reshape(3,3)
frame2=pd.DataFrame(matrix2,index=['p','q','r'],columns=['Set 1', 'Set 2','Set 3'])
The output will look like this
   Set 1  Set 2  Set 3
p      0      1      2
q      3      4      5
r      6      7      8
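The number of labels has to match the shape of the array. As a small sketch of what happens otherwise, the code below passes two row labels for a matrix with three rows and catches the resulting error. The flag mismatch_rejected is a name introduced only for this illustration.

```python
import numpy as np
import pandas as pd

matrix2 = np.arange(9).reshape(3, 3)

# three row labels and three column labels match the 3x3 matrix
frame2 = pd.DataFrame(matrix2, index=['p', 'q', 'r'],
                      columns=['Set 1', 'Set 2', 'Set 3'])

# two row labels for three rows: Pandas rejects the mismatch
try:
    pd.DataFrame(matrix2, index=['p', 'q'])
    mismatch_rejected = False
except ValueError as error:
    mismatch_rejected = True
    print('ValueError:', error)
```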
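We can also create DataFrame objects from a dictionary. The dictionary keys become the column labels, and the values are one-dimensional NumPy arrays of equal length that become the columns (Pandas Series objects can be used in the same way). We do it like this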
column1=np.random.randn(5)
column2=np.random.randn(5)
dict1={'c1':column1, 'c2': column2}
frame3=pd.DataFrame(dict1)
The output is similar to the following (the entries are random)
         c1        c2
0 -0.998464 -0.549452
1 -0.564923 -0.247877
2  1.001370  0.088014
3  1.070600 -0.787748
4  0.427591 -0.631744
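The same dictionary-based construction also accepts Pandas Series objects as values. The following is a minimal sketch under the assumption that both Series share the same row labels; the names labels, series1, series2, and frame_from_series are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# row labels shared by both Series
labels = ['r1', 'r2', 'r3', 'r4', 'r5']

# each Series holds one column of random values
series1 = pd.Series(np.random.randn(5), index=labels)
series2 = pd.Series(np.random.randn(5), index=labels)

# dictionary keys become column labels, Series labels become row labels
frame_from_series = pd.DataFrame({'c1': series1, 'c2': series2})
print(frame_from_series)
```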
We can manually change the labels of rows of the constructed DataFrame object like this
frame3.index=['r1','r2','r3','r4','r5']
The output is
          c1        c2
r1 -0.998464 -0.549452
r2 -0.564923 -0.247877
r3  1.001370  0.088014
r4  1.070600 -0.787748
r5  0.427591 -0.631744
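Assigning to index replaces all row labels at once. If only some labels need to change, one alternative is the rename method, which is not used elsewhere in this tutorial. The sketch below assumes that we want to rename only 'r1'; the new label 'first' is made up for this illustration.

```python
import numpy as np
import pandas as pd

frame3 = pd.DataFrame({'c1': np.random.randn(5), 'c2': np.random.randn(5)})

# replace all row labels at once
frame3.index = ['r1', 'r2', 'r3', 'r4', 'r5']

# rename a single row label; rename returns a new frame and
# leaves frame3 unchanged
renamed = frame3.rename(index={'r1': 'first'})

print(list(frame3.index))
print(list(renamed.index))
```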
We can print and retrieve the shape, row names, and column names of Pandas DataFrame objects like this:
# we can retrieve the shape
frame2.shape
# we can get the index names
frame2.index
# we can get the column names
frame2.columns
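For reference, here is a self-contained sketch of what these three attributes hold for frame2 as it was constructed above. The helper names n_rows, n_columns, row_names, and column_names are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

matrix2 = np.arange(9).reshape(3, 3)
frame2 = pd.DataFrame(matrix2, index=['p', 'q', 'r'],
                      columns=['Set 1', 'Set 2', 'Set 3'])

# shape is a tuple (number of rows, number of columns)
n_rows, n_columns = frame2.shape

# index and columns are Pandas Index objects; list() gives plain lists
row_names = list(frame2.index)
column_names = list(frame2.columns)

print(n_rows, n_columns)
print(row_names)
print(column_names)
```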
We can access the stored values and retrieve the stored NumPy matrix like this
matrix2Extracted = frame2.values
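The Pandas documentation recommends the to_numpy() method over the values attribute for this purpose. As a small sketch, the code below extracts the array both ways and checks that the result matches the original matrix; the names extracted, matches_original, and matches_values are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

matrix2 = np.arange(9).reshape(3, 3)
frame2 = pd.DataFrame(matrix2, index=['p', 'q', 'r'],
                      columns=['Set 1', 'Set 2', 'Set 3'])

# extract the stored data as a NumPy array
extracted = frame2.to_numpy()

# the extracted array has the same entries as the original matrix
matches_original = np.array_equal(extracted, matrix2)

# the values attribute gives the same entries
matches_values = np.array_equal(extracted, frame2.values)

print(type(extracted), matches_original, matches_values)
```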
We can change the column and index names like this
frame2.columns=['c1','c2','c3']
frame2.index=['r1','r2','r3']
We access the individual columns of a Pandas DataFrame like this
frame2['c1']
frame2['c2']
We can access several columns of DataFrame objects like this
frame2[['c1','c2']]
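The two forms of column access return different types: a single label gives a Series, whereas a list of labels gives a DataFrame, even if the list contains only one label. The sketch below checks this on frame2, constructed here directly with the renamed labels; the names single, several, and one_column_frame are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

frame2 = pd.DataFrame(np.arange(9).reshape(3, 3),
                      index=['r1', 'r2', 'r3'],
                      columns=['c1', 'c2', 'c3'])

# a single label returns a one-dimensional Series
single = frame2['c1']

# a list of labels returns a two-dimensional DataFrame
several = frame2[['c1', 'c2']]

# a list with one label still returns a DataFrame
one_column_frame = frame2[['c1']]

print(type(single).__name__, single.shape)
print(type(several).__name__, several.shape)
print(type(one_column_frame).__name__, one_column_frame.shape)
```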
We can access the columns by using attribute access
frame2.c1
frame2.c2
We can get the values stored in a particular column of a DataFrame as a NumPy array like this
frame2.c1.values
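Attribute access is convenient, but it has limits. A column label that is not a valid Python identifier, such as 'Set 1', cannot be written after the dot, and a label that coincides with an existing DataFrame attribute or method is shadowed by it. The sketch below illustrates the second case with a hypothetical column named 'count', which clashes with the DataFrame method count.

```python
import pandas as pd

# hypothetical frame whose second column label clashes with a method name
frame = pd.DataFrame({'c1': [0, 3, 6], 'count': [1, 4, 7]})

# attribute access works for an ordinary label
column_c1 = frame.c1

# frame.count is the DataFrame method, not the column
attribute_is_method = callable(frame.count)

# bracket access always returns the column
column_count = frame['count']

print(attribute_is_method)
print(list(column_count))
```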
There are several ways of accessing the rows of DataFrame objects, and some of them are confusing. Consequently, let us stick to the less confusing approaches based on loc, which selects by label, and iloc, which selects by integer position. Rows can be accessed by label like this
frame2.loc['r1']
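The output is (the reported dtype is int32 or int64, depending on the platform and the NumPy version)
c1    0
c2    1
c3    2
Name: r1, dtype: int32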
We can access two rows at the same time
frame2.loc[['r1','r2']]
Rows can be accessed by location using .iloc
frame2.iloc[0]
frame2.iloc[[0,1]]
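To make the difference between the two accessors concrete, the following sketch selects the same row by label and by position and then compares slicing. With loc, the end label of a slice is included; with iloc, the end position is excluded, as in ordinary Python slicing. The frame is rebuilt here with the renamed labels, and the result names are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

frame2 = pd.DataFrame(np.arange(9).reshape(3, 3),
                      index=['r1', 'r2', 'r3'],
                      columns=['c1', 'c2', 'c3'])

# the first row, selected by label and by integer position
by_label = frame2.loc['r1']
by_position = frame2.iloc[0]
same_row = by_label.equals(by_position)

# label slice: both 'r1' and 'r2' are included
label_slice = frame2.loc['r1':'r2']

# position slice: position 2 is excluded, so rows 0 and 1 are returned
position_slice = frame2.iloc[0:2]

print(same_row)
print(list(label_slice.index))
print(list(position_slice.index))
```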
We can retrieve individual entries of the Pandas DataFrame by row and column labels, by using the loc and at accessors or chained attribute access
frame2.loc['r1','c1']
frame2.c1.r2
frame2.at['r2','c2']
We can also retrieve individual entries of the Pandas DataFrame by integer position, by using the iloc and iat accessors
frame2.iat[0,1]
frame2.iloc[2,2]
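The same accessors can also be used to overwrite single entries. The sketch below first reads one entry by label and by position and then modifies a copy of the frame, so that frame2 itself stays unchanged. The new values 100 and -1 and the result names are made up for this illustration.

```python
import numpy as np
import pandas as pd

frame2 = pd.DataFrame(np.arange(9).reshape(3, 3),
                      index=['r1', 'r2', 'r3'],
                      columns=['c1', 'c2', 'c3'])

# read the entry in row 'r2', column 'c2' by label and by position
value_by_label = frame2.at['r2', 'c2']
value_by_position = frame2.iat[1, 1]

# work on a copy so that the original frame is not modified
modified = frame2.copy()

# overwrite single entries by label and by position
modified.at['r1', 'c1'] = 100
modified.iat[2, 2] = -1

print(value_by_label, value_by_position)
print(modified)
```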