In our series of articles discussing machine learning models in detail, we have already covered the basics and theory of support vector machine algorithms. In brief, this algorithm finds a hyperplane that separates the data points. The data points nearest to the separating hyperplane are called support vectors, and they determine the position and orientation of the hyperplane. The algorithm achieves high accuracy because it maximises the margin between the classes while minimising the error in regression or classification.
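Before moving to real data, here is a minimal sketch of this idea in code, using a hypothetical two-class toy dataset (the names toy_X, toy_y, and toy_model are illustrative only): we fit a linear SVM with sklearn and read back the support vectors that fix the hyperplane.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC
# a small, linearly separable two-class toy dataset (illustrative only)
toy_X, toy_y = make_blobs(n_samples=100, centers=2, random_state=0)
# fit a linear SVM; the points closest to the hyperplane become support vectors
toy_model = SVC(kernel='linear')
toy_model.fit(toy_X, toy_y)
# these support vectors determine the position and orientation of the hyperplane
print(toy_model.support_vectors_)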
Now that we know how the support vector machine works, we should test this algorithm with real-world data. In this article, we are going to look at how this algorithm works in practice and how we can implement it in a machine-learning project. To do so, we will follow the table of contents below.
Table of Contents
- Importing data
- Data Analysis
- Data Preprocessing
- Data Modelling
- Model Evaluation
Let’s start by gathering the data.
Importing data
In this article, we are going to use the MNIST dataset, a popular, large database of handwritten digits that is commonly used for image classification tasks.
Here, we will model this data with a support vector machine that can predict which class each image belongs to. This dataset is also accessible through the sklearn library.
Now let’s just start by importing the data into the environment.
import pandas as pd
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')
Now let’s convert the data into a pandas DataFrame object.
X, y = pd.DataFrame(mnist.data), pd.Series(mnist.target)
X.info()
Output:
Here we can see that the data is in the form of a DataFrame with around 70,000 entries and 784 columns, named pixel1 through pixel784. As we have already discussed, SVM performs well on data with a large number of features, so it can give optimal results here. Before applying an SVM model to this data, we need to perform some data analysis, so let’s start by exploring the data for insights.
Data Analysis
We will divide this section into two steps: first we will look at the descriptive statistics of the data, and then we will perform exploratory data analysis. Let’s find out what the data tells us.
Statistical Data Analysis
Here in this sub-part, we will take a look at the statistical details hidden inside the data.
X.info()
Output:
Here we can see the names of all 784 columns, and we can also see that there are no null values in any column of the data. Let’s use the describe method on the data.
X.describe()
Output:
Here, we can see some more details about the data. The values in the columns range from 0 up to roughly 255, which indicates that the pixel intensities of the images vary between 0 and 255; we can verify this range directly below.
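As a quick sanity check (a minimal sketch using the DataFrame we already loaded), we can print the global minimum and maximum pixel values:
# confirm that every pixel intensity lies in the 0-255 range
print("minimum pixel value:", X.values.min())
print("maximum pixel value:", X.values.max())
Now let’s take a look at the shape of the data.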
print("shape of X:", X.shape, "shape of y:", y.shape)
Output:
Let’s also take a quick look at the first few rows of X, as shown below.
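A single call on the DataFrame displays its head:
# display the first five rows of the feature DataFrame
X.head()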
After describing the data and seeing some rows, it is clear that no column in the data has null values; we will also confirm this in our next step. Let’s move towards some basic EDA.
Basic EDA
Let’s start by analysing our target variable, and then we will slowly move towards the other, independent variables of the data.
import matplotlib.pyplot as plt
print(y.value_counts())
y_counts = y.value_counts()
plt.figure(figsize=(8,6))
plt.bar(y_counts.index, y_counts.values)
plt.xlabel('Class Label')
plt.ylabel('Count')
plt.title('Distribution of Classes')
plt.show()
Output:
Here we can see that there is enough data for every class, which reduces the chances of a class imbalance problem, and we can also see how the counts of the different classes are distributed throughout the data. A quick way to quantify this balance is shown below.
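As a small optional check (a sketch reusing the y_counts computed above), we can compare the most and least frequent classes; a ratio close to 1 means the classes are well balanced:
# ratio of the largest class count to the smallest
print("imbalance ratio:", y_counts.max() / y_counts.min())
Now let’s move towards the independent variables.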
Let’s check for null values on the independent-variable side of the data.
# counting missing values in the data
missing_values_count = X.isnull().apply(pd.Series.value_counts)
counts = missing_values_count.sum(axis=1)
counts.plot(kind='bar')
Output:
Here we can see that there are no null values in the data. Now let’s try to draw one of the images from the data.
import matplotlib.pyplot as plt
# Plot the first number in X
plt.imshow(X.iloc[0].values.reshape(28, 28), cmap='gray')
plt.axis('off')
plt.show()
Output:
Here we can see what the images inside the data look like. Our next task is to preprocess the data, because the model package under the sklearn library works best with preprocessed data.
Data Preprocessing
As the values in this data are numerical, we need to normalise and standardise them. Scaling puts all the features on a comparable range, which helps the SVM train properly and prevents features with larger values from dominating the margin.
X = X/255.0
from sklearn.preprocessing import scale
X_scaled = scale(X)
The above code first normalises the pixel values to the range 0 to 1 and then standardises them to zero mean and unit variance. Now we can split the data.
from sklearn.model_selection import train_test_split
# use 20% of the samples for training and 30% for testing;
# the small training share keeps SVM training time manageable
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, train_size=0.2, random_state=10)
After splitting the data, it is worth confirming how many samples landed on each side before we model it.
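A quick shape check (a minimal sketch) makes the split sizes explicit:
# with train_size=0.2 and test_size=0.3, half of the 70,000 samples stay unused
print("train:", X_train.shape, "test:", X_test.shape)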
Data Modelling
To model this data using the SVM algorithm, we are going to use the SVC model provided under the svm package of the sklearn library.
from sklearn.svm import SVC
first_model = SVC(kernel='rbf')
first_model.fit(X_train, y_train)
Output:
This is how we can simply call and fit the model on the data. Let’s validate its results.
Model Evaluation
So far, we have covered data analysis, preprocessing, and modelling. Now that we have a trained model, we need to validate whether the process we followed is optimal. To do so, we can use a confusion matrix and the accuracy score. Using the below code, we can visualise the model’s performance as a confusion matrix.
y_pred = first_model.predict(X_test)
import seaborn as sns
# accuracy
from sklearn.metrics import confusion_matrix, accuracy_score
print("accuracy:", accuracy_score(y_true=y_test, y_pred=y_pred), "\n")
# Generate the confusion matrix
cm = confusion_matrix(y_test, y_pred)
cmap = sns.diverging_palette(10, 220, sep=80, n=7)
# Plot the confusion matrix as a heatmap
sns.heatmap(cm, annot=True, cmap=cmap, fmt='g')
Output:
Here we can see that the model we have defined is more than 94% accurate, and the confusion matrix shows no class that the model gets badly wrong. Now we can also check the classification report of the model.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
Output:
Here we can see that the model is performing well, with an accuracy of 94%. Now let’s conclude this topic, as we have obtained an optimal model for MNIST image classification.
Conclusion
In this article, we have seen how an SVM model can perform on real-life data with a huge number of features. As explained in the last article, SVM performs well when the number of features is large relative to the number of data points, and only a few fields generate such data. So if a dataset has a huge number of features and the task is classification, SVM becomes an optimal option for modelling the data, and it also requires less computation and power than many other statistical machine learning algorithms.
About DSW
DSW, specializing in Artificial Intelligence and Data Science, provides platforms and solutions for leveraging data through AI and advanced analytics. With offices located in Mumbai, India, and Dublin, Ireland, the company serves a broad range of customers across the globe.
Our mission is to democratize AI and Data Science, empowering customers with informed decision-making. Through fostering the AI ecosystem with data-driven, open-source technology solutions, we aim to benefit businesses, customers, and stakeholders and make AI available for everyone.
Our flagship platform ‘UnifyAI’ aims to streamline the data engineering process, provide a unified pipeline, and integrate AI capabilities to support businesses in transitioning from experimentation to full-scale production, ultimately enhancing operational efficiency and driving growth.