November 22, 2024

Starting With Multi-class Classification in Scikit-learn and Python (Using Support Vector Machines)


In this tutorial, we provide a hands-on introduction to multi-class classification in Scikit-learn and Python. We focus mainly on the implementation and only briefly explain the main theoretical concepts behind classification problems.

Dataset and Classification in Scikit-learn

In this tutorial, we use the Iris flower data set. This set consists of 150 samples: 50 samples of each of three different species of the Iris flower. Every sample is described by four features (attributes): the length and width of the sepals and petals (sepals and petals are characteristic parts of a flower that can be used to distinguish different species of flowers). We covered this data set in our previous tutorial.

The four features are

[‘sepal length (cm)’, ‘sepal width (cm)’, ‘petal length (cm)’, ‘petal width (cm)’]

The input data looks like this (rows are samples and columns are features)

[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4], …

where the columns are ‘sepal length (cm)’, ‘sepal width (cm)’, ‘petal length (cm)’, and ‘petal width (cm)’, respectively. Thus, the first row has the following interpretation

sepal length (cm) = 5.1
sepal width (cm) = 3.5
petal length (cm) = 1.4
petal width (cm) = 0.2

The three classes or subspecies of the Iris flower are

[‘setosa’, ‘versicolor’, ‘virginica’]

Every sample (row in the data set) corresponds to a certain class. The classes are encoded as integers: ‘setosa’=0, ‘versicolor’=1, ‘virginica’=2.
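
As a quick check, the feature names and the class names can be read directly from the data set object. Below is a minimal sketch; it uses the same load_iris() function that is used later in this tutorial.

# load the Iris data set bundled with Scikit-learn
from sklearn import datasets
dataSet=datasets.load_iris()
# feature (attribute) names
print(dataSet['feature_names'])
# class (subspecies) names; their positions give the integer labels 0, 1, 2
print(dataSet['target_names'])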

Our goal is to train a classifier that will properly predict the class (subspecies) for a given sample. Thus, if we give our trained classifier the sample

[5.4, 3.9, 1.7, 0.4]

it should output the proper class (0, 1, or 2).

In the sequel, we explain the Python code. First, we import the necessary libraries

# here, we import the data set library
from sklearn import datasets 
# here, we import the standard scaler in order to standardize the data
from sklearn.preprocessing import StandardScaler 
# here, we import the function for splitting the data set into training and test data sets
from sklearn.model_selection import train_test_split
# support vector machine classifier
from sklearn.svm import SVC
import numpy as np

First, we import the datasets module from the Scikit-learn library. Namely, this library contains a number of data sets that can be very useful for testing classification algorithms. Then, we import StandardScaler from the Scikit-learn library. As will be explained later, StandardScaler is used to remove the mean and scale the data set such that it has a unit variance. Then, we import “train_test_split”. This function is used to create the train and test data sets from the original data set. Next, we import the “SVC” classifier. This classifier implements the support vector machine classifier. Finally, we import the NumPy library.

Next, we load the data set, extract the input and output samples (stored in variables starting with X and Y, respectively), and split the samples into training and test data sets.

# load the data set
dataSet=datasets.load_iris()

# input data for classification
Xtotal=dataSet['data']
# output data (class labels)
Ytotal=dataSet['target']

# split the data set into training and test data sets
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
Xtrain, Xtest, Ytrain, Ytest = train_test_split(Xtotal, Ytotal, test_size=0.3)

The parameter test_size=0.3 means that the test data set size is 30% of the original data set. The training data set is used to train the classifier. The test data set is used to test the trained classifier on data it did not see during the training process. In this way, we can truly test the performance of the classifier. Also, it is important to mention that the function “train_test_split” shuffles the data before splitting, so that the training and test data sets contain a good mixture of all classes (an exact class balance can additionally be enforced with the stratify argument, as shown below).
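
For a reproducible split that also preserves the class proportions exactly, the optional random_state and stratify arguments of “train_test_split” can be passed. This is a minimal sketch; the value random_state=42 is an arbitrary choice.

# reproducible split that keeps the class proportions in both subsets
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    Xtotal, Ytotal, test_size=0.3, random_state=42, stratify=Ytotal)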

Next, we scale the data sets.

# create a standard scaler
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
scaler1=StandardScaler()
# scale the training and test input data
# fit_transform performs both fit and transform at the same time
XtrainScaled=scaler1.fit_transform(Xtrain)
# here we only need to transform
XtestScaled=scaler1.transform(Xtest)

StandardScaler() will remove the mean and scale the data such that the scaled data has zero mean and unit variance. Roughly speaking, this is done since data sets with wide variations of values across the samples and features can create numerical difficulties when training classifiers or neural networks. Consequently, it is a good idea to scale the data. The standardization is performed according to the classical definition of the z-value in statistics

(1)   z = \frac{x - \mu}{\sigma}

where x is a sample value of a feature, \mu is the mean of that feature, and \sigma is the standard deviation of that feature, both computed over all training samples (that is, column-wise).
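
The same standardization can be reproduced by hand with NumPy, which makes the column-wise (per-feature) nature of the computation explicit. This is only an illustrative sketch that uses the Xtrain array defined above.

# column-wise (per-feature) mean and standard deviation of the training data
mu=Xtrain.mean(axis=0)
sigma=Xtrain.std(axis=0)
# classical z-value; this matches scaler1.fit_transform(Xtrain)
# up to floating-point rounding
manuallyScaled=(Xtrain-mu)/sigma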

After creating the scaler, we call the function “fit_transform” on the input training data set. This function estimates (fits) the mean and standard deviation of the training data and then transforms the training data. After that, we call the “transform()” function on the input test data set. Note that the test data is transformed with the statistics estimated on the training data; the scaler must not be re-fitted on the test data, since that would leak information from the test set into the preprocessing step.
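
As a quick sanity check, we can verify that every feature (column) of the scaled training data indeed has approximately zero mean and unit variance.

# every column should have a mean close to 0 and a standard deviation close to 1
print(XtrainScaled.mean(axis=0))
print(XtrainScaled.std(axis=0))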

Next, we create the SVC classifier, fit it to the scaled training data, and make predictions on the basis of the scaled test data:

# initialize the classifier
# SVC automatically recognizes that the problem is a multi-class problem
# and performs one-vs-one classification
classifierSVM=SVC(decision_function_shape='ovo')

# train the classifier on the scaled training data
classifierSVM.fit(XtrainScaled,Ytrain)

# predict classes by using the trained classifier
# complete prediction on the scaled test data
predictedY=classifierSVM.predict(XtestScaled)
# perform a basic element-wise comparison
Ytest==predictedY

Every Scikit-learn classifier has a fit() function that trains the classifier on the data and a predict() function that predicts the classes of new samples. For multi-class problems, the support vector machine classifier uses the one-vs-one (“one against one” or “ovo”) classification strategy. The idea is to split the problem into a series of binary classifiers:

Classifier 1: class 0 against class 1
Classifier 2: class 0 against class 2
Classifier 3: class 1 against class 2

Each binary classifier predicts one of its two classes for a given sample, and we then use a majority-vote strategy across all pairwise classifiers to select the most appropriate class.
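
In general, for K classes the one-vs-one strategy trains K(K-1)/2 binary classifiers. With decision_function_shape='ovo', the decision function returns one column per pairwise classifier, which we can verify on our trained classifier (for the 3 Iris classes, 3*2/2 = 3 columns).

# one column per pairwise binary classifier: K*(K-1)/2 = 3 for K = 3 classes
print(classifierSVM.decision_function(XtestScaled).shape)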

We use the scaled input test data set to perform predictions. The predictions are stored in “predictedY”. Finally, we compare the predictions “predictedY” with the test outputs “Ytest”, and we can observe that all samples are accurately predicted. That is, the statement “Ytest==predictedY” returns an array of “True” values, which means that the entries of “Ytest” and “predictedY” perfectly match each other.
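
Instead of inspecting the array of True and False values by eye, we can summarize it as a single classification accuracy number. A short sketch, using the variables defined above:

# fraction of correctly predicted test samples (classification accuracy)
print(np.mean(Ytest==predictedY))
# equivalently, use the built-in score() method of the classifier
print(classifierSVM.score(XtestScaled,Ytest))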

We can also make a prediction on the basis of a single sample. This is done by using the following code lines

# single sample prediction (reshape creates a 2D array with a single row,
# which is the input format that predict() expects)
predictedSampleY=classifierSVM.predict(XtestScaled[5,:].reshape(1,-1))
# compare with the true class of the sample
Ytest[5]
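
Finally, note that a completely new (raw) sample must be scaled with the same scaler before it is passed to the classifier. A minimal sketch, using the sample [5.4, 3.9, 1.7, 0.4] shown earlier in this tutorial (it is a setosa sample, so the expected output is class 0):

# a raw sample must first be scaled with the scaler fitted on the training data
newSample=np.array([[5.4, 3.9, 1.7, 0.4]])
newSampleScaled=scaler1.transform(newSample)
# expected output: array([0]), that is, 'setosa'
print(classifierSVM.predict(newSampleScaled))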