This post is a sequel to the post on time series analysis in Python. Here we introduce two Pandas data structures that are important for time series modeling in Python: the “Series” and the “DataFrame”.
Note: I am using the Spyder Python editing environment, so I do not need to type “print(variable)” to display the value of “variable”; simply writing “variable” is enough. In a plain Python script, wrap such expressions in print().
The code used in this and in previous tutorials is available on the GitHub page. A YouTube video accompanies this post.
1. Pandas Series
We start with the Series data structure. The following code creates a Series from a random NumPy array.
import pandas as pd
import numpy as np
###############################################################################
# The Series Pandas structure
###############################################################################
# Create a simple random series
random_array=np.random.randn(10)
series=pd.Series(random_array)
print(series)
The result is (the values differ from run to run, since the array is random):
0 1.084495
1 -0.097727
2 -0.491279
3 0.421549
4 3.336025
5 0.910696
6 -0.023946
7 -0.235681
8 1.621403
9 -0.648340
dtype: float64
A Series consists of an index and the values stored at each index location. Both can be accessed as follows:
# series consists of an index and a sequence of values
print(series.index)
print(series.values)
The result is
RangeIndex(start=0, stop=10, step=1)
[ 1.08449476 -0.09772724 -0.49127897 0.42154907 3.33602525 0.91069632
-0.02394647 -0.23568147 1.62140268 -0.64834049]
We can also define Series by labeling the indices and by specifying the values.
# create series by specifying values and by labeling the indices
series2=pd.Series([10,20,30,40], index=['i1','i2','i3','i4'])
print(series2)
The result is
i1 10
i2 20
i3 30
i4 40
dtype: int64
We can also create a Series from a dictionary:
# create a series from a dictionary - the dictionary keys are used as index labels
dict1={'i1':10, 'i2':20, 'i3':30, 'i4':40}
series3=pd.Series(dict1)
We can access the stored values with a notation similar to MATLAB. Note, however, that Python indexing starts at 0 rather than 1:
#indexing
series[0]
series[1]
series[2]
#multiple value indexing
series[[0, 1, 2]]
# slice notation - similar to MATLAB, except that the end position (3) is excluded
series[0:3]
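The square-bracket access above works on the default integer index. For a Series with string labels, such as series2, it can be clearer to state explicitly whether a label or a position is meant. The following is a minimal sketch using the standard `.loc` (label-based) and `.iloc` (position-based) accessors; the variable names are chosen here for illustration only.

```python
import pandas as pd

# A labeled Series, mirroring series2 from the text
s = pd.Series([10, 20, 30, 40], index=['i1', 'i2', 'i3', 'i4'])

by_label = s.loc['i2']          # label-based lookup
by_position = s.iloc[1]         # position-based lookup (zero-based)
label_slice = s.loc['i1':'i3']  # label slices include the end label
pos_slice = s.iloc[0:3]         # positional slices exclude the end position

print(by_label, by_position)
print(label_slice)
print(pos_slice)
```

Both slices select the same three entries, even though the end point is included in one case and excluded in the other.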
To access the first or last few entries of Series, we use the following commands:
# beginning and end
series.head()
series.tail()
The result is
0 1.084495
1 -0.097727
2 -0.491279
3 0.421549
4 3.336025
dtype: float64
5 0.910696
6 -0.023946
7 -0.235681
8 1.621403
9 -0.648340
dtype: float64
We can determine the length, the shape, and the number of non-missing entries of a Series as follows:
# length, shape, number of elements, etc.
len(series)
series.shape
series.count()
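For the random series above, all three commands report 10 entries. They differ once missing values are present, because count() skips NaN entries. A small sketch, using a hypothetical series with one missing value:

```python
import numpy as np
import pandas as pd

# Illustrative series with one missing value (not from the original post)
s_missing = pd.Series([1.0, np.nan, 3.0])

n_total = len(s_missing)      # counts every entry, including NaN
shape = s_missing.shape       # tuple with the number of entries
n_valid = s_missing.count()   # counts only the non-missing entries

print(n_total, shape, n_valid)
```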
We can also extract the distinct values stored in a Series:
# construct a series with repeated entries
series4=pd.Series(np.array([1,1,1,2]))
# return the distinct values (each value is listed once)
series4.unique()
The result is
array([1, 2], dtype=int64)
That is, unique() returns an array that contains each distinct value once; it does not return a count.
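If an actual count is needed, Pandas offers two related methods: `nunique()` returns the number of distinct values, and `value_counts()` returns how often each value occurs. A minimal sketch on the same data as series4 (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array([1, 1, 1, 2]))

distinct = s.unique()      # array of the distinct values
n_distinct = s.nunique()   # number of distinct values
counts = s.value_counts()  # occurrences of each value, indexed by the value

print(distinct)
print(n_distinct)
print(counts)
```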
We can also perform basic arithmetic operations on Series. Note that Series are aligned according to their index labels, not according to the position of the entries. This is illustrated by the following code:
# alignment via index labels
series5=pd.Series([1,2,3], index=['i1','i2','i3'])
series6=pd.Series([5,6,7], index=['i2','i3','i1'])
series7=series5+series6
series7
The result is
i1 8
i2 7
i3 9
dtype: int64
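In the example above, both Series share the same set of labels. When a label appears in only one of the operands, the plain `+` operator produces NaN for that label. The sketch below, with illustrative Series that are not part of the original post, shows this behavior and one possible way around it using `Series.add` with `fill_value`:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['i1', 'i2', 'i3'])
b = pd.Series([10, 20], index=['i2', 'i4'])

# Labels present in only one Series give NaN
plain_sum = a + b

# Treat a missing label as 0 instead
filled_sum = a.add(b, fill_value=0)

print(plain_sum)
print(filled_sum)
```

Only the label 'i2' is shared, so it is the only entry of `plain_sum` that holds a number.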
2. Pandas DataFrame
DataFrame objects can be defined as follows:
# create a DataFrame structure from a NumPy array
# index and column names are automatically assigned
frame1=pd.DataFrame(np.array([[1,2],[3,4]]))
frame1
# construction from a list of Series objects (each Series becomes one row)
seriesList1=[pd.Series([1,2,3,4]),pd.Series([5,6,7,8])]
dataFrame1=pd.DataFrame(seriesList1)
dataFrame1
The results are
   0  1
0  1  2
1  3  4

   0  1  2  3
0  1  2  3  4
1  5  6  7  8
The shape of the DataFrame object can be determined as follows
#shape
dataFrame1.shape
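A list of Series is stacked row by row, which is why dataFrame1 has 2 rows and 4 columns. If each Series should instead become a column, one option is to pass a dictionary of Series. This is a sketch under that assumption; the column names 's1' and 's2' are made up for the example.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([5, 6, 7, 8])

# List of Series: each Series becomes a row
as_rows = pd.DataFrame([s1, s2])

# Dictionary of Series: each Series becomes a column named by its key
as_cols = pd.DataFrame({'s1': s1, 's2': s2})

print(as_rows.shape)
print(as_cols.shape)
```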
While defining the DataFrame objects we can also specify the column names:
#specify the column names while creating the DataFrame
dataFrame2=pd.DataFrame(np.array([[1,2],[3,4]]), columns=['a','b'])
dataFrame2
The result is
   a  b
0  1  2
1  3  4
We can access or change the column names as follows:
#access the column names
dataFrame2.columns
#change the column names
dataFrame2.columns = ['c1','c2']
dataFrame2
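Assigning to `columns` replaces all column names at once and modifies the DataFrame in place. To rename only some columns, `DataFrame.rename` with a mapping is an alternative; by default it returns a new DataFrame and leaves the original unchanged. A minimal sketch with illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['a', 'b'])

# Rename only column 'a'; 'b' keeps its name
renamed = df.rename(columns={'a': 'c1'})

print(list(renamed.columns))
print(list(df.columns))  # the original DataFrame is not modified
```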
We can also specify the index labels and column names while defining the DataFrame
#specify the column names and index labels while creating the DataFrame
dataFrame3=pd.DataFrame(np.array([[10,20],[30,40]]),columns=['c1','c2'],index=['r1','r2'])
dataFrame3
The result is
c1 c2
r1 10 20
r2 30 40
We can access the index and values as follows
# access the index values
dataFrame3.index
# get the matrix values
dataFrame3.values
The results are:
Index(['r1', 'r2'], dtype='object')
array([[10, 20],
[30, 40]])
We can select columns, rows, and individual values as follows:
# selecting the columns, values, rows, etc.
# construct a DataFrame
dataFrame4=pd.DataFrame(np.array([[1,2,3],[4,5,6],[7,8,9]]),
columns=['c1','c2','c3'],
index=['r1','r2','r3'])
# select the second column
dataFrame4['c2']
# another way
dataFrame4.c2
#select two columns at the same time
dataFrame4[['c1','c2']]
# row selection
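# select the first two rows by position (the end position is excluded)
dataFrame4[:2]
# select the same rows by label slice (the end label is included)
dataFrame4['r1':'r2']
# explicitly select a row by index label (loc) or by integer position (iloc)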
dataFrame4.loc['r1']
dataFrame4.iloc[0]
# two rows at the same time
dataFrame4.loc[['r1','r2']]
dataFrame4.iloc[[0,1]]
# scalar lookup by label (at) or by integer position (iat)
dataFrame4.at['r2','c2']
dataFrame4.iat[1,1]
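The accessors above can also select rows and columns in a single step by passing both a row selector and a column selector. The sketch below rebuilds the same 3x3 DataFrame and extracts a sub-block by label and by position; the last line additionally filters rows with a boolean condition, which goes slightly beyond what the post covers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['c1', 'c2', 'c3'],
                  index=['r1', 'r2', 'r3'])

# Rows r1 and r2, columns c1 and c3, selected by label
block_by_label = df.loc[['r1', 'r2'], ['c1', 'c3']]

# The same block, selected by integer position
block_by_pos = df.iloc[0:2, [0, 2]]

# Rows whose value in column c2 is greater than 4
rows_over_4 = df[df['c2'] > 4]

print(block_by_label)
print(rows_over_4)
```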