
Getting Started with Machine Learning Algorithms: Logistic Regression

Data science offers a wide variety of algorithms and models for regression and classification. Logistic regression is a natural starting point for learning classification and predictive modelling. Although it belongs to the regression family, it is used to classify data: it fits an S-shaped curve that maps the inputs to the probability of a class. We at DSW often prefer simple algorithms like this for small use-cases and problems because they are robust and easy to interpret. In this article, we are going to talk about logistic regression. Let’s start with what the logistic regression algorithm is.

What is logistic regression?

Logistic regression is one of the most basic and traditional supervised machine learning algorithms used for classification and predictive modelling. At its core, it models the probability of an event or a class. Because it belongs to the regression family, it is built on a linear function of the inputs, so we use it when the classes in the dataset can be separated, at least approximately, by a line, and when the required outcome is binary or dichotomous.

That means we use logistic regression for binary classification, where the target variable takes one of two classes. Simple examples of binary targets are yes/no, 0/1 and win/loss.

There are two types of logistic regression

  • Simple logistic regression
  • Multiple logistic regression

Simple logistic regression is used when only one independent variable affects the dependent variable, and multiple logistic regression is used when two or more independent variables affect the dependent (target) variable.

Logistic regression can also be extended to multinomial logistic regression, where the target has more than two unordered classes, and to ordinal logistic regression, where the target has more than two ordered classes. Since a single line cannot accurately separate more than two classes, we focus here on the simplest version of logistic regression, the one used for binary classification. Let’s see how this algorithm works.

How does Logistic Regression Work?

As discussed above, logistic regression works well on linearly separable data because, like linear regression, it is built on a linear function of the inputs. To understand how logistic regression works, we need to understand the mathematics behind it.

Mathematics

Let’s consider one predictor (independent variable) x and one dependent variable y, and let p be the probability that y is 1. In this situation the linear regression equation can be written as:

$$p = mx + m_0 \tag{1}$$

The right side of the above equation is linear and can take values outside the range 0 to 1, whereas a probability can only lie between 0 and 1. To overcome this, we can predict the odds in place of the probability, using the following formula:

$$\text{Odds} = \frac{p}{1-p}$$

Where,

p = probability of occurrence of any event.

1-p = probability of non-occurrence of any event.

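Using the odds, equation (1) can be written as:

$$\frac{p}{1-p} = mx + m_0 \tag{2}$$

The odds can only be positive, while the right side can take any real value. Taking the log of the left side of equation (2) handles this, because the log of the odds can be any real number:

$$\log\left(\frac{p}{1-p}\right) = mx + m_0 \tag{3}$$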

To recover p from the above equation, we take the exponential of both sides:

$$e^{\log\left(\frac{p}{1-p}\right)} = e^{mx + m_0} \tag{4}$$

Simplifying equation (4) gives the following equation:

$$p = (1-p)\,e^{mx + m_0} \tag{5}$$

Expanding the right side and collecting the terms in p, we can also write this equation as follows:

$$p = e^{mx + m_0} - p\,e^{mx + m_0}$$

$$p = \frac{e^{mx + m_0}}{1 + e^{mx + m_0}} \tag{6}$$

Now we can divide the numerator and denominator of equation (6) by $e^{mx + m_0}$:

$$p = \frac{1}{1 + e^{-(mx + m_0)}}$$
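The derivation can be checked numerically. The following is a minimal sketch using only the standard library; the slope `m` and intercept `m0` are hypothetical values chosen for illustration. It shows that taking the log-odds of the final probability gives back the linear expression on the right side of equation (3).

```python
import math

def odds(p):
    """Odds of an event that has probability p (0 <= p < 1)."""
    return p / (1.0 - p)

def logit(p):
    """Log-odds of p: the left side of equation (3)."""
    return math.log(odds(p))

def sigmoid(z):
    """Inverse of logit: maps any real z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical slope and intercept, for illustration only.
m, m0 = 2.0, -1.0

for x in (-3.0, 0.0, 0.5, 3.0):
    z = m * x + m0   # linear expression, can be any real number
    p = sigmoid(z)   # probability, strictly between 0 and 1
    # The log-odds of p recover the linear expression.
    assert math.isclose(logit(p), z, abs_tol=1e-9)
    print(f"x={x:+.1f}  z={z:+.1f}  p={p:.4f}  odds={odds(p):.4f}")
```

The linear expression is unbounded, the odds are always positive, and the probability stays between 0 and 1, which is the chain of transformations the equations walk through.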

Above is the final probability that logistic regression uses when there is a single predictor. If there are n predictors, the equation becomes:

$$p = \frac{1}{1 + e^{-(m_0 + m_1x_1 + m_2x_2 + m_3x_3 + \dots + m_nx_n)}}$$

The above is the final equation of logistic regression when there are n predictors. This equation is the sigmoid function applied to the linear combination of the predictors, which is what keeps the output between 0 and 1.
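The n-predictor equation can be sketched with NumPy as a single matrix-vector product. The coefficients, intercept and input rows below are hypothetical values chosen for illustration, and `predict_proba` here is a helper defined in this snippet, not a library function.

```python
import numpy as np

def predict_proba(X, coef, intercept):
    """Return p = 1 / (1 + exp(-(m0 + m1*x1 + ... + mn*xn))) for each row of X."""
    z = intercept + X @ coef          # one linear combination per row
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients m1..m3 and intercept m0, for illustration only.
coef = np.array([0.8, -1.5, 0.3])
intercept = 0.2

# Three hypothetical observations with three predictors each.
X = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 2.0, 1.0],
    [0.0, 0.0, 0.0],
])

p = predict_proba(X, coef, intercept)
print(p)
```

For the last row every predictor is zero, so the probability depends on the intercept alone.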

In the above, we can see how we started with the linear equation and ended with the curve.

Mathematically, the sigmoid function can be written as follows:

$$\sigma(z) = \frac{1}{1+e^{-z}}$$

In the above, we replace z with $mx + m_0$ to get the equation of logistic regression. From this equation, we can say that the image below represents the working of logistic regression.

In the above image, we can see how logistic regression keeps the curve between the values 0 and 1. Now before utilising logistic regression on any data we are required to consider some of the assumptions. Let’s take a look at the assumptions.

Assumptions

Before modelling data using a basic logistic regression algorithm we are required to consider the following assumptions:

  • Unless an extension is applied, the dependent variable needs to be binary.
  • The data points need to be independent of each other.
  • The independent variables need to have little or no multicollinearity with each other.
  • The independent variables need to be linearly related to the log odds of the outcome.
  • Logistic regression generally works best with a large sample size, so it is good to work with a large dataset.
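The multicollinearity assumption can be checked before modelling. One common diagnostic, which is an addition here and not part of the workflow in this article, is the variance inflation factor (VIF). The sketch below computes it with NumPy from the diagonal of the inverse correlation matrix; the data is synthetic and the helper name `variance_inflation_factors` is hypothetical.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF of each column of X, read from the diagonal of the inverse
    of the correlation matrix of the predictors."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Synthetic predictors, for illustration only.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)   # nearly a copy of `a`

X = np.column_stack([a, b, c])
vif = variance_inflation_factors(X)
print(vif)
```

A VIF close to 1 suggests a predictor is not explained by the others, while values far above 1 point to multicollinearity; a common rule of thumb treats values above 5 or 10 as a warning sign.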

Here we have seen some of the assumptions that need to be covered before applying logistic regression. Let’s see how we can apply logistic regression to any data.

Implementation

In this section, we will look at how we can apply logistic regression to data using the Python programming language. We can also use R, MATLAB or Excel to perform logistic regression, but to keep the article short we use only Python.

In Python, scikit-learn (sklearn) is a library that provides implementations of many machine learning algorithms. For logistic regression it offers the LogisticRegression class in the linear_model module, which we use here.

Let’s start with making a synthetic dataset using the make_classification function of Sklearn.

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=5, random_state=42)

Here we have created a dataset with 1000 rows and 5 columns of independent variables, and 1 dependent variable with two classes, 0 and 1. We can validate this by converting these arrays into a pandas DataFrame.

import pandas as pd

df = pd.DataFrame(data=X)

df['Target'] = y

df

Output:

Here we can see all the independent and dependent variables in one place.

Before modelling, we want to know which variables in our dataset are more strongly correlated with the target variable. Let’s check the correlation.

import seaborn as sns

corr=df.corr()

sns.heatmap(corr, annot=True, fmt='.1%')

Output:

Here we can see that variables 0, 1, and 3 have higher correlations with the target variable and we can consider them in the data modelling with logistic regression.
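The same information can be read without a plot. The sketch below rebuilds the DataFrame from the earlier steps so that it runs on its own, then ranks the features by the absolute value of their correlation with the target; the variable names `target_corr` and `ranked` are introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification

# Same synthetic dataset as in the article.
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
df = pd.DataFrame(data=X)
df['Target'] = y

# Correlation of every feature with the target, excluding the target itself.
target_corr = df.corr()['Target'].drop('Target')

# Rank by strength of the relationship, ignoring its sign.
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```

Ranking by absolute value matters because a strong negative correlation is as informative for the model as a strong positive one.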

Let’s trim and split the datasets

X = X[:, [0,1,3]]

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.29, random_state = 0)

Let’s import the function from Sklearn and model the data.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

model.fit(X_train, y_train)

Let’s make some predictions so that we can validate the model

y_pred = model.predict(X_test)

y_pred

Output:

Here we can see the prediction made by the model. Now we need to evaluate this model.
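To connect the fitted model back to the mathematics, the sketch below recomputes the predicted probabilities by hand from the learned intercept and coefficients. It repeats the data preparation so that it runs on its own, and it assumes the binary case with default settings, where `predict_proba` is the sigmoid of the linear combination.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same data, feature selection and split as in the article.
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
X = X[:, [0, 1, 3]]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.29, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)

# m0 + m1*x1 + m2*x2 + m3*x3 for every test row.
z = model.intercept_ + X_test @ model.coef_.ravel()
p_manual = 1.0 / (1.0 + np.exp(-z))

# Probability of class 1 as reported by the model.
p_model = model.predict_proba(X_test)[:, 1]
print(np.allclose(p_manual, p_model))

# Thresholding the probability at 0.5 gives the class labels.
labels_manual = (p_manual > 0.5).astype(int)
print(np.array_equal(labels_manual, model.predict(X_test)))
```

The 0.5 threshold is the default decision rule; a different threshold can be applied to `p_model` when one type of error is more costly than the other.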

Evaluation

Evaluation of a classification model can be done in various ways. Since it’s a binary classification model we find that there are two prime methods which can help us in the evaluation. These methods are as follows:

Accuracy score

It is a widely used method for evaluating classification models, which compares actual and predicted values using the following formula:

$$\text{accuracy}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} 1(\hat{y}_i = y_i)$$

Where,

y = actual values

ŷ= predicted values

This simply calculates the fraction of predictions that match the actual values, which can be reported as a percentage. Let’s see how we can calculate it for the above model.

from sklearn.metrics import accuracy_score

print("Accuracy of binary classification : ", accuracy_score(y_test, y_pred)*100, "%")

Output:

Here we can see the accuracy of our model is good enough. Let’s verify it using another evaluation method.
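The accuracy formula is short enough to compute by hand. The sketch below uses small hypothetical label arrays, not the article’s data, and counts the fraction of matching positions, which is what `accuracy_score` returns with its default settings.

```python
import numpy as np

# Hypothetical actual and predicted labels, for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# The comparison gives True/False per sample; the mean is the share of matches.
accuracy = np.mean(y_hat == y_true)
print(accuracy)  # 0.75, since 6 of the 8 predictions match
```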

Confusion matrix

This method tells us how many right and wrong decisions the model made for each class. As the name suggests, it is a matrix whose cells hold the counts of true negatives, false positives, false negatives and true positives.

The true positives and true negatives are the values that the model has predicted correctly; the other values are wrongly predicted. Let’s check how many values from the test data our model predicts correctly.

import numpy as np

import seaborn as sns

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

sns.heatmap(cm/np.sum(cm), annot=True, fmt='.1%')

Output:

In the above, we can see that 11.7% (7.6% + 4.1%) of the values are not accurately predicted by the model and 88.3% of the values are correctly predicted. Let’s take a look at how the logistic regression model is being utilised in the real world.
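The link between the confusion matrix and accuracy can be made explicit. The sketch below builds a 2x2 matrix by hand from small hypothetical label arrays, with rows for actual classes and columns for predicted classes, the same layout that `confusion_matrix` from sklearn uses. The correct predictions sit on the diagonal.

```python
import numpy as np

# Hypothetical actual and predicted labels, for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Rows index the actual class, columns index the predicted class.
cm = np.zeros((2, 2), dtype=int)
for actual, predicted in zip(y_true, y_hat):
    cm[actual, predicted] += 1

# With this layout the flattened order is TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of the diagonal (correct) cells.
accuracy = (tp + tn) / cm.sum()
print(cm)
print(accuracy)
```

Dividing the matrix by its total, as the heatmap code does, turns these counts into the percentages shown in the plot.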

Application of logistic regression

Logistic regression is used to solve a variety of use-cases in fields such as medicine, politics, engineering and marketing. For example, these models can be used to predict the risk of a person developing a disease from their characteristics, or to predict the mortality of injured humans and animals.

In politics, we can use it to predict whether a voter is going to vote for a party based on voter demographics. In engineering, it is used to predict the probability of failure of components, processes and systems. In marketing, it is used to predict a customer’s propensity to purchase a product or service from customer demographics.

Logistic regression can also be applied in other domains of AI, such as NLP and computer vision, for example as a simple classifier on top of extracted features.

Final words

In this article, we have discussed one of the basic algorithms in machine learning, logistic regression. There are many use cases where this algorithm is a reliable choice. For instance, it is used to predict the mortality of injured patients in the Trauma and Injury Severity Score (TRISS). Such examples suggest that for simple cases we can rely on logistic regression.


About DSW

Data Science Wizards (DSW) is an Artificial Intelligence and Data Science start-up that primarily offers platforms, solutions, and services for making use of data as a strategy through AI and data analytics solutions and consulting services to help enterprises in data-driven decisions.

DSW’s flagship platform UnifyAI is an end-to-end AI-enabled platform for enterprise customers to build, deploy, manage, and publish their AI models. UnifyAI helps you to build your business use case by leveraging AI capabilities and improving analytics outcomes.

Connect with us at contact@datasciencewizards.ai and visit us at www.datasciencewizards.ai