In data modelling, training a model on historical data eventually becomes the easy part. Still, because models and datasets differ in their characteristics, it can be difficult to evaluate a model with the right set of evaluation metrics. Model evaluation starts with understanding the model and the data, which in turn points to the appropriate metrics for the problem. Before applying any metric, we should know what the important metrics measure so that we can evaluate the model correctly. So in this article, we will cover the basics of different evaluation metrics. The list of these evaluation metrics is as follows:
Table of contents
- Confusion Matrix
- Classification Accuracy
- Precision
- Recall
- F-1 Score
- AUC-ROC
- Root Mean Square Error(RMSE)
- Cross-entropy Loss
- Gini Coefficient
- Jaccard Score
Confusion Matrix
It is a matrix of size (a x a), where ‘a’ is the number of classes in the classification data. The rows of this matrix can hold the actual values and the columns the predicted values, or vice versa. If the dataset has only two classes, i.e. it is a binary classification problem, then the size of the matrix will be 2 x 2.
We can also call it an error matrix: a matrix representation of model performance that compares the model’s predictions to the ground-truth labels. The below image is an example of a confusion matrix for a model classifying between Spam and Not Spam.
The cells of the confusion matrix are interpreted as follows:
- True Positive(TP): Correct Positive Predictions
- True Negative(TN): Correct Negative Predictions
- False Positive(FP): Incorrect Positive Predictions
- False Negative(FN): Incorrect Negative Predictions
Using the above values, we can calculate the following rates:
- True Positive Rate(TPR) = TP/Actual Positive = TP/(TP + FN) = 45/(45+25) = 0.64
- False Negative Rate(FNR) = FN/Actual Positive = FN/(TP + FN) = 25/(45+25) = 0.36
- True Negative Rate(TNR) = TN/Actual Negative = TN/(TN + FP) = 30/(30+5) = 0.86
- False Positive Rate(FPR) = FP/Actual Negative = FP/(TN + FP) = 5/(30+5) = 0.14
Here, using the values in the above confusion matrix, we have calculated four evaluation rates.
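As a quick sketch (not from the original article), these counts and rates can be reproduced with scikit-learn’s confusion_matrix; the label arrays below are hypothetical, constructed only to match the counts above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels chosen to reproduce the article's counts: TP=45, FN=25, TN=30, FP=5
y_true = np.array([1] * 70 + [0] * 35)                       # 70 actual positives, 35 actual negatives
y_pred = np.array([1] * 45 + [0] * 25 + [0] * 30 + [1] * 5)  # predictions aligned with y_true

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fn, tn, fp)    # 45 25 30 5

print(tp / (tp + fn))    # True Positive Rate  ~0.64
print(fn / (tp + fn))    # False Negative Rate ~0.36
print(tn / (tn + fp))    # True Negative Rate  ~0.86
print(fp / (tn + fp))    # False Positive Rate ~0.14
```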
Classification Accuracy
Using the above interpretation, we can easily calculate the classification accuracy using the following formula:
Classification accuracy = (correct predictions) / (all predictions) = (TP + TN) / (TP + TN + FP + FN)
According to the above confusion matrix, classification accuracy will be
Classification accuracy = (45 + 30)/ (45 + 30 + 5 + 25) = 0.71
Here we can see the accuracy of the model is 0.71, or 71%. This means that, on average, the model gets 71 predictions right out of every 100.
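For illustration, accuracy can be checked with scikit-learn’s accuracy_score; the labels below are the same hypothetical ones used for the confusion matrix sketch:

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels matching the confusion matrix: TP=45, FN=25, TN=30, FP=5
y_true = [1] * 70 + [0] * 35
y_pred = [1] * 45 + [0] * 25 + [0] * 30 + [1] * 5

print(accuracy_score(y_true, y_pred))   # (45 + 30) / 105 ≈ 0.71
```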
Precision
With imbalanced data, classification accuracy is not a reliable indicator of model performance. In such conditions, we need to look at class-specific behaviour, and precision (also called positive predictive value) is a good way to check the model’s performance on the positive class. To get its value, we divide the true positives by the sum of the true positives and false positives.
Precision = True Positive(TP) / (True Positive(TP) + False Positive(FP))
This calculation quantifies how many of the model’s positive predictions actually belong to the positive class. Let’s have a look at the below diagram:

Using the above values, we can calculate the precision:
Precision = 45 / (45+5) = 0.90
Here we can say that 90% of retrieved items are relevant.
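A minimal sketch of the same calculation with scikit-learn’s precision_score, again on hypothetical labels that match the counts above:

```python
from sklearn.metrics import precision_score

# Hypothetical labels matching the confusion matrix: TP=45, FN=25, TN=30, FP=5
y_true = [1] * 70 + [0] * 35
y_pred = [1] * 45 + [0] * 25 + [0] * 30 + [1] * 5

print(precision_score(y_true, y_pred))  # 45 / (45 + 5) = 0.9
```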
Recall
Recall is a metric that quantifies the correct positive predictions made out of all actual positive instances. Unlike precision, recall indicates how many of the actual positives the model missed (the false negatives). The below formula can be used to calculate the recall of any model:
Recall = True Positive(TP) / (True Positive(TP) + False Negative(FN))
Let’s take a look at the below diagram:
According to the above diagram, the recall will be:
Recall = 45/(45 + 25) = 0.64
This means that 64% of the relevant items are retrieved.
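The same figure can be reproduced with scikit-learn’s recall_score; this is a sketch using the same hypothetical labels as before:

```python
from sklearn.metrics import recall_score

# Hypothetical labels matching the confusion matrix: TP=45, FN=25, TN=30, FP=5
y_true = [1] * 70 + [0] * 35
y_pred = [1] * 45 + [0] * 25 + [0] * 30 + [1] * 5

print(recall_score(y_true, y_pred))   # 45 / (45 + 25) ≈ 0.64
```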
F-1 Score
Using precision and recall, we can calculate the F1 score, which is considered an excellent metric to use when the data is imbalanced.
F1 Score = 2*(Precision *Recall) / (Precision + Recall)
Using the above precision and recall we can calculate the F1 score in the following way:
F1 score = 2 * (0.90 * 0.64)/(0.90 + 0.64) = 0.75
We can consider this metric the harmonic mean of precision and recall, which is why it gives equal importance to both. The metric can also be adjusted by adding a weighting parameter β to the equation so that more weight can be given to one of them. For example:
Fβ = (1 + β²) * (Precision * Recall) / (β² * Precision + Recall)
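As an illustrative sketch (not part of the original article), scikit-learn provides both f1_score and the weighted fbeta_score; the labels below are the same hypothetical ones used earlier:

```python
from sklearn.metrics import f1_score, fbeta_score

# Hypothetical labels matching the confusion matrix: TP=45, FN=25, TN=30, FP=5
y_true = [1] * 70 + [0] * 35
y_pred = [1] * 45 + [0] * 25 + [0] * 30 + [1] * 5

print(f1_score(y_true, y_pred))              # 2 * (0.90 * 0.64) / (0.90 + 0.64) ≈ 0.75
print(fbeta_score(y_true, y_pred, beta=2))   # beta=2 weights recall more heavily than precision
```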
AUC-ROC
AUC-ROC (Area Under the Curve - Receiver Operating Characteristic) is a curve that plots the TPR against the FPR at different threshold values, separating the signal from the noise. The Area Under the Curve summarises the ROC curve in a single number and represents the model’s ability to discriminate between the classes.
The AUC varies between 0 and 1, and as the AUC increases, the classifier’s performance improves. If the AUC is 1, the classifier is able to distinguish between all the Positive and Negative class points correctly.
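A minimal sketch using scikit-learn’s roc_auc_score and roc_curve on a small set of hypothetical predicted probabilities (not data from the article):

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities of the positive class
y_true  = [0, 0, 0, 1, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.5, 0.3]

print(roc_auc_score(y_true, y_score))             # area under the ROC curve

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr)                                        # FPR at each threshold
print(tpr)                                        # TPR at each threshold
```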
Root Mean Square Error(RMSE)
This metric is used to measure the performance of a regression model and assumes that the errors are normally distributed and unbiased. It is the standard deviation of the prediction errors (residuals), which measure how far the data points lie from the regression line. We can calculate it using the formula below:
RMSE = √( (1/N) * Σ (Predictionᵢ - Actualᵢ)² ), where the sum runs from i = 1 to N
Using the below image, we can understand what the prediction errors are.
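As a small worked sketch with made-up regression values, RMSE can be computed directly with NumPy:

```python
import numpy as np

# Hypothetical actual targets and model predictions
actual    = np.array([3.0, -0.5, 2.0, 7.0])
predicted = np.array([2.5,  0.0, 2.0, 8.0])

# Square the residuals, average them, then take the square root
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
print(rmse)   # ≈ 0.612
```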
Cross-entropy Loss
It is also known as Log Loss and is popular for evaluating neural networks’ performance, partly because, as a training objective, it helps mitigate the vanishing gradient problem. It is calculated by averaging the negative log of the probability the model assigns to the true label of each observation, so confident wrong predictions are penalised heavily.
Hp(q) = -(1/N) * Σᵢ₌₁ᴺ [ yᵢ·log(p(yᵢ)) + (1 - yᵢ)·log(1 - p(yᵢ)) ]
To evaluate the model using this metric, we usually make a graph between the log loss and the predicted probability, as given in the below image:
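For illustration, scikit-learn’s log_loss computes this quantity; the labels and probabilities below are hypothetical:

```python
from sklearn.metrics import log_loss

# Hypothetical true labels and predicted probabilities of the positive class
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.1, 0.8, 0.35, 0.2]

print(log_loss(y_true, y_prob))   # lower is better; confident wrong predictions cost the most
```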
Gini Coefficient
This can be calculated from the AUC-ROC number: it is the ratio of the area between the ROC curve and the diagonal (random-guess) line to the area of the triangle above that line, which simplifies to the formula below. If the value of this coefficient is more than 60%, the model’s performance is considered good. One important thing to note is that it is used only with classification models.
Gini = 2*AUC - 1
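A short sketch of this relationship, reusing the hypothetical scores from the AUC-ROC example above:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class
y_true  = [0, 0, 0, 1, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.5, 0.3]

auc  = roc_auc_score(y_true, y_score)
gini = 2 * auc - 1          # rescale AUC from [0.5, 1] to [0, 1]
print(auc, gini)
```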
Jaccard Score
This score represents the similarity between two sets. It takes a value between 0 and 1, where 1 represents complete overlap. To calculate it, we divide the number of observations common to both sets (the intersection) by the total number of observations in either set (the union).
J(A, B) = |A∩B| / |A∪B|
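A minimal sketch with scikit-learn’s jaccard_score on hypothetical binary labels, where the two sets being compared are the predicted positives and the actual positives:

```python
from sklearn.metrics import jaccard_score

# Hypothetical true and predicted binary labels
y_true = [1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# |A ∩ B| / |A ∪ B| = TP / (TP + FP + FN) = 3 / 5
print(jaccard_score(y_true, y_pred))   # 0.6
```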
Final words
Above, we have discussed some of the important metrics we use to evaluate data models in real life. Since models and datasets have different conditions and characteristics, the same model can appear to perform at different levels depending on how it is measured. So model performance evaluation needs to be done properly, with an understanding of the characteristics of the different evaluation metrics.