Defining Success

sklearn
tidymodel
Author

Francois de Ryckel

Published

April 16, 2024

Modified

April 20, 2024

When evaluating models for a given ML algorithm, we need to define in advance the metric we will use to measure success. How do we decide whether a model performs well, or whether one model is better than another? The same metric also guides which hyper-parameters to use (aka which ones fine-tune into a better model).

This post is about defining what is ‘best’ or ‘better’ when comparing different supervised models. We’ll have 2 main parts: measures of success for regression models and measures of success for classification models.

In ML lingo, we are talking about loss functions: how to measure and minimize the errors between the predicted values (\(\hat y\)) and the actual values (\(y\)). The loss function measures the error on a single observation; the cost function, on the other hand, is the average loss over the entire dataset.

Regression models

When modeling for regression, we measure, in one way or another, the distance between our predictions and the actual observed values. When comparing models, we usually want to keep the model which gives the smallest sum of distances.

It has to be noted that quite a few of these concepts have deeper connections in ML: they are not only ‘success’ metrics but also the loss functions of ML algorithms.

RMSE

This is probably the most well-known measure when comparing regression models. Because we are squaring the distance between the predicted and the observed values, it penalizes predictions that are far off the real values. Hence this measure is used when we want to avoid ‘outlier’ predictions (predictions that are far off).

\[RMSE = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n}}\]

The SSE \(\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\) (aka sum of squared errors, i.e. without the square root and the averaging) is also the loss function of the linear regression algorithm. It is a convex function, hence it has a minimum; rather than solving the partial derivatives analytically, one can use Gradient Descent (more feasible/practical when it comes to ML).
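
A minimal sketch (the toy vectors below are made up for illustration; rmse_vec() is the vector interface from the yardstick package of tidymodels):

library(yardstick)

# made-up actual and predicted values
y     <- c(3.1, 2.4, 5.0, 4.2, 6.8)
y_hat <- c(2.9, 2.7, 4.6, 4.5, 6.1)

# by hand: square the errors, average them, take the square root
sqrt(mean((y - y_hat)^2))

# with yardstick
rmse_vec(truth = y, estimate = y_hat)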

MAE

With the Mean Absolute Error, we just take the average of the absolute errors. Useful when we do not want to heavily penalize predictions that are far off from the observed data.

\[MAE = \frac {\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert}{n}\]

Without the averaging process, this is also called the L1 loss or Laplace loss. It is robust to outliers (compared to the L2 loss). But the absolute error is not smooth (aka not differentiable everywhere, in particular at 0). This has implications when using Gradient Descent algorithms.
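
A minimal sketch with the same kind of made-up toy vectors (yardstick also provides mae_vec()):

library(yardstick)

# made-up actual and predicted values
y     <- c(3.1, 2.4, 5.0, 4.2, 6.8)
y_hat <- c(2.9, 2.7, 4.6, 4.5, 6.1)

# by hand: average of the absolute errors
mean(abs(y - y_hat))

# with yardstick
mae_vec(truth = y, estimate = y_hat)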

Huber Loss

Huber loss is a mixture of the MSE and the MAE: kind of the best of both worlds, basically. We define the Huber loss using a piecewise function.

\[L_\delta (y, f(x)) = \begin{cases} \frac{1}{2} (y - f(x))^2 & \text{if } \lvert y - f(x) \rvert \leq \delta \\ \delta \lvert y - f(x) \rvert - \frac{1}{2} \delta^2 & \text{otherwise} \end{cases}\]

Note how the Huber loss function requires a parameter \(\delta\) which acts as a threshold value (usually 1 by default).

Basically, the function uses the MSE for errors smaller than \(\delta\) and the MAE for errors greater than \(\delta\).

Some advantages of Huber Loss functions:

  • Robust to outliers.
  • Differentiable everywhere (even at the junction of the MAE and the MSE), meaning it can be used with Gradient Descent algorithms as well.
  • The transition from quadratic to linear behaviour in the Huber loss results in a smoother optimization landscape compared to the MSE. This can prevent issues related to gradient explosions and vanishing gradients, which may occur in certain cases with the MSE.

The main disadvantage of the Huber loss function is having to tune that \(\delta\) parameter.
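
A small base-R sketch of the piecewise definition above (the default \(\delta = 1\) mirrors the usual convention; yardstick also provides a huber_loss() metric):

# Huber loss, vectorised over vectors of truth / estimate
huber <- function(y, y_hat, delta = 1) {
  err <- abs(y - y_hat)
  ifelse(err <= delta,
         0.5 * err^2,                   # quadratic (MSE-like) region
         delta * err - 0.5 * delta^2)   # linear (MAE-like) region
}

# mean Huber loss over made-up toy values
mean(huber(c(3.1, 2.4, 5.0), c(2.9, 2.7, 6.5), delta = 1))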

Classification models

Type I and Type II errors

  • Type I error is the probability of a false positive; it is the probability of falsely rejecting the null hypothesis \(H_0\). The type I error rate corresponds to the chosen significance level.

  • Type II error is the probability of a false negative; it is the probability of NOT rejecting a null hypothesis that is actually false.
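
In classification terms, false positives (type I) and false negatives (type II) are the two off-diagonal cells of a confusion matrix. A quick sketch with yardstick's built-in two_class_example data set (truth and predicted are factors, Class1 and Class2 are predicted probabilities):

library(yardstick)

# cross-tabulate predicted vs actual classes; the off-diagonal counts are the
# false positives and false negatives
conf_mat(two_class_example, truth = truth, estimate = predicted)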

The log-loss function

The log-loss (aka binary cross-entropy) is the loss function used when fitting logistic regressions. With \(\hat{p}_i\) the predicted probability of the positive class and \(y_i \in \{0, 1\}\) the actual label:

\[LogLoss = - \frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]\]

It heavily penalizes confident predictions that turn out to be wrong.
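
A quick sketch with made-up labels and probabilities (yardstick's mn_log_loss_vec() expects the truth as a factor and the estimate as the probability of the first / event level):

library(yardstick)

# made-up labels and predicted probabilities of "yes"
truth    <- factor(c("yes", "no", "yes", "yes", "no"), levels = c("yes", "no"))
prob_yes <- c(0.9, 0.2, 0.65, 0.4, 0.1)

# by hand: average negative log-likelihood
y <- as.integer(truth == "yes")
-mean(y * log(prob_yes) + (1 - y) * log(1 - prob_yes))

# with yardstick
mn_log_loss_vec(truth = truth, estimate = prob_yes)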

Accuracy

Accuracy is the proportion of correct predictions (of any class) over all predictions made.

Shortcomings:

  • For imbalanced datasets, we can get a good accuracy by just predicting the most frequent class for every observation. For instance, in the case of a rare disease or a big financial meltdown, we can just predict the majority class (no disease, no meltdown) and still report a high accuracy while the model never catches the event we care about.
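
A sketch of that shortcoming with made-up imbalanced data, where a model that always predicts the majority class still scores 95% accuracy:

library(yardstick)

# 95 negatives, 5 positives, and a 'model' that always predicts "neg"
truth    <- factor(c(rep("neg", 95), rep("pos", 5)), levels = c("pos", "neg"))
estimate <- factor(rep("neg", 100), levels = c("pos", "neg"))

accuracy_vec(truth, estimate)   # 0.95 despite never detecting a single positive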

Precision

If you call it true, is it indeed true? In other words, the proportion of predicted positives that are actually positive.

\[Precision = \frac{TP}{TP + FP}\]

Recall

If there is an actual positive, did the model flag it as positive? In other words, the proportion of actual positives that the model correctly identified.

\[Recall = \frac{TP}{TP + FN}\]

F1 score

It is the harmonic mean of precision and recall. The harmonic mean penalizes models that have a very low precision or recall, which would not be the case with the arithmetic mean.

\[F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}\]
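
A sketch of the three metrics on a small made-up set of predictions (yardstick's precision(), recall() and f_meas() treat the first factor level as the event of interest by default):

library(yardstick)

# made-up predictions: 3 true positives, 2 false negatives, 1 false positive
df <- data.frame(
  truth    = factor(c("pos", "pos", "pos", "pos", "pos", "neg", "neg", "neg"),
                    levels = c("pos", "neg")),
  estimate = factor(c("pos", "pos", "pos", "neg", "neg", "pos", "neg", "neg"),
                    levels = c("pos", "neg"))
)

precision(df, truth, estimate)   # TP / (TP + FP) = 3 / 4
recall(df, truth, estimate)      # TP / (TP + FN) = 3 / 5
f_meas(df, truth, estimate)      # harmonic mean of precision and recall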

AUC & ROC Curve

The ROC curve plots the true positive rate against the false positive rate for every probability threshold; the AUC is the area under that curve. To compute them, we need the predictions as probabilities rather than as hard class labels.

library(yardstick)
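
# A quick sketch using the two_class_example data set that ships with yardstick:
# truth is a factor and Class1 holds the predicted probability of the event class.
roc_auc(two_class_example, truth, Class1)

# full ROC curve (one row per probability threshold), plottable with autoplot()
roc_curve(two_class_example, truth, Class1)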