December 16, 2024

2. Introduction to Pandas structures: Series and DataFrame structures

This post is a sequel to the post on time series analysis in Python. Here we introduce two Pandas data structures that are important for time series modeling in Python: Series and DataFrame. Note: I am using the Spyder Python editing environment, so I do not need to type “print(variable)” to display the value of “variable”; typing “variable” in the console is enough. If you run the code as a plain script, wrap such expressions in print(). The GitHub page with the code used in this and in previous tutorials can be found here. The YouTube video accompanying this post is given here:

1. Pandas Series

We start with the Series data structure. The following code creates a Series from a random NumPy array.

import pandas as pd 
import numpy as np

###############################################################################
#                   The Series Pandas structure
###############################################################################

# Create a simple random series
random_array=np.random.randn(10)
series=pd.Series(random_array)

print(series) 

The result is:

0    1.084495
1   -0.097727
2   -0.491279
3    0.421549
4    3.336025
5    0.910696
6   -0.023946
7   -0.235681
8    1.621403
9   -0.648340
dtype: float64
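Because the array is random, the numbers printed above will differ from run to run. A minimal sketch of how the values could be made reproducible is given below. It assumes NumPy's seeded generator interface; the seed value 0 and the variable names are arbitrary choices for illustration and are not part of the original code.

```python
import numpy as np
import pandas as pd

# Seeded random generator; the seed 0 is an arbitrary illustrative choice
rng = np.random.default_rng(0)
seeded_series = pd.Series(rng.standard_normal(10))

# A second generator with the same seed reproduces exactly the same values
rng_again = np.random.default_rng(0)
seeded_series_again = pd.Series(rng_again.standard_normal(10))

print(seeded_series.equals(seeded_series_again))
```

With a fixed seed, the two Series compare as equal, which makes it easier to check printed results against a tutorial.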

A Series consists of an index and the values stored at the index locations. Both can be accessed as follows:

# series consists of an index and a sequence of values 
print(series.index)
print(series.values)

The result is

RangeIndex(start=0, stop=10, step=1)
[ 1.08449476 -0.09772724 -0.49127897  0.42154907  3.33602525  0.91069632
 -0.02394647 -0.23568147  1.62140268 -0.64834049]

We can also define Series by labeling the indices and by specifying the values.

# create series by specifying values and by labeling the indices
series2=pd.Series([10,20,30,40], index=['i1','i2','i3','i4'])

print(series2)

The result is

i1    10
i2    20
i3    30
i4    40
dtype: int64

We can also define Series by defining a dictionary and by converting dictionaries to Series:

# create series from dictionaries - keys of dictionary are used as index labels

dict1={'i1':10, 'i2':20, 'i3':30, 'i4':40}
series3=pd.Series(dict1)

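Using bracket notation similar to MATLAB's, we can access the stored values. Note that, unlike in MATLAB, counting starts at 0 and the end of a positional slice is excluded: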

#indexing 
series[0]
series[1]
series[2]

#multiple value indexing
series[[0, 1, 2]]

# slice notation - similar to MATLAB 
series[0:3]
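The brackets above work on a Series with the default integer index. When the index holds labels, it is often clearer to state whether a label or a position is meant. The sketch below uses the standard .loc (label-based) and .iloc (position-based) accessors; the variable names are made up for illustration.

```python
import pandas as pd

labeled = pd.Series([10, 20, 30, 40], index=['i1', 'i2', 'i3', 'i4'])

# Label-based and position-based access to the same entry
by_label = labeled.loc['i2']
by_position = labeled.iloc[1]

# A label slice includes its end label; a positional slice excludes its end
label_slice = labeled.loc['i1':'i2']    # entries i1 and i2
position_slice = labeled.iloc[0:1]      # entry i1 only

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

The difference in how the slice end is treated is a common source of off-by-one mistakes, so it is worth checking on a small example like this one.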

To access the first or last few entries of Series, we use the following commands:

# beginning and end 
series.head() 
series.tail()

The result is

0    1.084495
1   -0.097727
2   -0.491279
3    0.421549
4    3.336025
dtype: float64

5    0.910696
6   -0.023946
7   -0.235681
8    1.621403
9   -0.648340
dtype: float64

We can determine the length, the shape, and the number of non-missing entries of a Series as follows:

# length, shape, number of elements, etc. 

len(series)
series.shape
series.count()
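For the random Series above, these three calls all report 10 entries. They can differ when values are missing, because count() skips NaN entries. A small sketch, using a made-up Series with one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical example data with one missing value
series_with_gap = pd.Series([1.0, np.nan, 3.0])

n_total = len(series_with_gap)        # all entries, missing or not
shape = series_with_gap.shape         # one-element tuple with the length
n_valid = series_with_gap.count()     # only the non-missing entries

print(n_total, shape, n_valid)
```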

We can also extract the distinct values of a Series:

# construct a series with repeated entries 

series4=pd.Series(np.array([1,1,1,2]))

# return the distinct values (each value is listed once)
series4.unique()

The result is

array([1, 2], dtype=int64)

That is, duplicates are collapsed, and we obtain an array in which every distinct value appears exactly once, in order of first appearance.
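If the goal is to count the distinct values rather than list them, or to see how often each value occurs, the related methods nunique() and value_counts() can be used. A short sketch on the same data (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

repeated = pd.Series(np.array([1, 1, 1, 2]))

distinct_values = repeated.unique()     # array of the distinct values
n_distinct = repeated.nunique()         # how many distinct values there are
frequencies = repeated.value_counts()   # occurrences of each distinct value

print(distinct_values)
print(n_distinct)
print(frequencies)
```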

We can also perform basic algebraic operations on Series. Here it should be mentioned that Series are aligned according to the index labels and not according to the index positions. This can be illustrated by the following code lines:

# alignment via index labels

series5=pd.Series([1,2,3], index=['i1','i2','i3'])

series6=pd.Series([5,6,7], index=['i2','i3','i1'])

series7=series5+series6

series7

The result is

i1    8
i2    7
i3    9
dtype: int64
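In the example above both Series share the same set of labels. When a label appears in only one of the operands, the plain sum yields NaN for that label. The sketch below shows this and one possible way of treating the missing side as zero with the add() method and its fill_value argument; the data are made up for illustration.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['i1', 'i2', 'i3'])
right = pd.Series([5, 6], index=['i2', 'i4'])

# Labels present in only one Series produce NaN in the plain sum
plain_sum = left + right

# add() with fill_value=0 treats the missing side as zero instead
filled_sum = left.add(right, fill_value=0)

print(plain_sum)
print(filled_sum)
```

Whether zero is the right substitute depends on the data; for many time series a missing observation is not the same as a zero observation.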

2. Pandas DataFrame

DataFrame objects can be defined as follows:

# create a DataFrame structure from a NumPy array
# index and column names are automatically assigned

frame1=pd.DataFrame(np.array([[1,2],[3,4]]))
frame1

#construction from a list of Series objects
seriesList1=[pd.Series([1,2,3,4]),pd.Series([5,6,7,8])]
dataFrame1=pd.DataFrame(seriesList1)
dataFrame1

The results are

   0  1
0  1  2
1  3  4

   0  1  2  3
0  1  2  3  4
1  5  6  7  8
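Note that each Series in the list becomes a row of the DataFrame. If the Series should become columns instead, one option is to pass a dictionary of Series. A minimal sketch, where the column names 'a' and 'b' are arbitrary illustrative choices:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([5, 6, 7, 8])

# A list of Series: each Series becomes one row
by_rows = pd.DataFrame([s1, s2])

# A dictionary of Series: each Series becomes one column,
# and the dictionary keys are used as column names
by_columns = pd.DataFrame({'a': s1, 'b': s2})

print(by_rows.shape)
print(by_columns.shape)
```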

The shape of the DataFrame object can be determined as follows

#shape
dataFrame1.shape

While defining the DataFrame objects we can also specify the column names:

#specify the column names while creating the DataFrame

dataFrame2=pd.DataFrame(np.array([[1,2],[3,4]]), columns=['a','b'])

dataFrame2

The result is

   a  b
0  1  2
1  3  4

We can access or change the column names as follows

#access the column names

dataFrame2.columns

#change the column names
dataFrame2.columns = ['c1','c2']

dataFrame2
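Assigning to the columns attribute replaces all column names at once and modifies the DataFrame in place. To change only some of the names, the rename() method can be used; it returns a new DataFrame and leaves the original untouched. A short sketch with illustrative variable names:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['a', 'b'])

# Rename only column 'a'; the result is a new DataFrame
renamed = frame.rename(columns={'a': 'c1'})

print(list(frame.columns))
print(list(renamed.columns))
```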

We can also specify the index labels and column names while defining the DataFrame

#specify the column names and index labels while creating the DataFrame

dataFrame3=pd.DataFrame(np.array([[10,20],[30,40]]),columns=['c1','c2'],index=['r1','r2'])

dataFrame3

The result is

    c1  c2
r1  10  20
r2  30  40

We can access the index and values as follows

# access the index values 

dataFrame3.index

# get the matrix values

dataFrame3.values

The results are:

Index(['r1', 'r2'], dtype='object')
array([[10, 20],
       [30, 40]])

We can select columns, rows, and individual values as follows:

# selecting the columns, values, rows, etc. 

# construct a DataFrame
dataFrame4=pd.DataFrame(np.array([[1,2,3],[4,5,6],[7,8,9]]),
                    columns=['c1','c2','c3'],
                    index=['r1','r2','r3'])

# select the second column
dataFrame4['c2']

# another way
dataFrame4.c2

#select two columns at the same time 
dataFrame4[['c1','c2']]


# row selection 

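# select the first two rows (positional slice, the end position is excluded)
dataFrame4[:2]

# the same rows via a label slice (the end label 'r2' is included)
dataFrame4['r1':'r2']
 
# explicitly select a row by specifying the index label 

dataFrame4.loc['r1']

# or by specifying the integer position
dataFrame4.iloc[0]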

# two rows at the same time 

dataFrame4.loc[['r1','r2']]

dataFrame4.iloc[[0,1]]


# scalar lookup of the value at a given row and column (by label, then by position)

dataFrame4.at['r2','c2']

dataFrame4.iat[1,1]
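The .loc and .iloc accessors also accept a row selector and a column selector together, which gives a sub-block of the DataFrame in one step. The sketch below rebuilds the same 3x3 DataFrame and additionally shows row selection by a boolean condition; the variable names and the threshold 4 are illustrative assumptions.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                     columns=['c1', 'c2', 'c3'],
                     index=['r1', 'r2', 'r3'])

# Rows r1, r2 and columns c1, c3, selected by label
block_by_label = frame.loc[['r1', 'r2'], ['c1', 'c3']]

# The same block selected by integer position
block_by_position = frame.iloc[0:2, [0, 2]]

# Rows in which the value in column c2 exceeds 4
rows_where_c2_large = frame[frame['c2'] > 4]

print(block_by_label)
print(rows_where_c2_large)
```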