This is part 1 of a two-part article on logistic regression.

Introduction

Logistic regression is a classification algorithm. It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables. To represent a binary or categorical outcome, we use dummy variables. You can also think of logistic regression as an extension of linear regression to a categorical outcome variable, where the log of the odds is modeled as a linear function of the predictors. In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function. Logistic regression is part of a larger class of algorithms known as Generalized Linear Models (GLM). Most logistic regression should strictly be called binomial logistic regression, since the variable to predict is binary. However, logistic regression can also be used to predict a dependent variable that can take more than two values; in that case we call the model multinomial logistic regression. A typical example would be classifying films as “Entertaining”, “Borderline” or “Boring”.

The logistic equation.

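The general equation of the logit model starts from the linear predictor

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

where $Y$ is the variable to predict, the $\beta_i$ are the coefficients of the predictors and the $x_i$ are the predictors (aka independent variables).

In logistic regression, we are only concerned with the probability of the outcome of the dependent variable (success or failure). Since a probability must be positive, we first rewrite our function in exponential form

$$p = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n}$$

This however does not guarantee that $p$ stays between 0 and 1. Let’s then have

$$p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n}}{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n} + 1}$$

or

$$p = \frac{e^{Y}}{1 + e^{Y}}$$

where $p$ is the probability of success. With a little further manipulation, we have

$$\frac{p}{1 - p} = e^{Y}$$

and if we remember what $Y$ was, we get

$$\ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

This is the equation used in logistic regression. Here $p / (1 - p)$ is the odds (often loosely called the odds ratio). Whenever the log of the odds is positive, the probability of success is more than 50%.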
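The link between the linear predictor, the probability and the log of the odds can be checked numerically. The sketch below relies on base R's plogis() (inverse logit) and qlogis() (logit); the coefficient and predictor values are made up purely for illustration.

```r
# Hypothetical coefficients and predictor value, for illustration only
beta0 <- -1.5
beta1 <- 0.8
x1 <- 3

# Linear predictor Y = beta0 + beta1 * x1
y <- beta0 + beta1 * x1

# Probability of success: p = e^Y / (1 + e^Y)
p_manual <- exp(y) / (1 + exp(y))
p_builtin <- plogis(y)  # same transformation, built into base R

# Taking the log of the odds recovers the linear predictor
log_odds <- log(p_manual / (1 - p_manual))

print(c(y = y, p = p_manual, log_odds = log_odds))
```

Because the linear predictor is positive here, the resulting probability is above 0.5, in line with the remark on the sign of the log odds.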

Performance of Logistic Regression Model.

To evaluate the performance of a logistic regression model, we can consider a few metrics.

  • AIC (Akaike Information Criterion). The analogue of adjusted R² in logistic regression is AIC. AIC is a measure of fit that penalizes a model for its number of coefficients. Therefore, among candidate models fitted to the same data, we prefer the one with the lowest AIC.
  • Null Deviance and Residual Deviance. Null deviance measures how well the response is predicted by a model with nothing but an intercept. Residual deviance measures how well the response is predicted once the independent variables are added. In both cases, the lower the value, the better the fit, and a large drop from null to residual deviance indicates that the predictors improve the model.
  • Confusion Matrix. It is nothing but a tabular representation of actual vs predicted values. It lets us compute the accuracy of the model and, when computed on held-out data, helps detect overfitting.
  • We can calculate the accuracy of our model by $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
  • From the confusion matrix, sensitivity and specificity can be derived as $\text{Sensitivity} = \frac{TP}{TP + FN}$ and $\text{Specificity} = \frac{TN}{TN + FP}$
  • ROC Curve. The Receiver Operating Characteristic (ROC) curve summarizes the model’s performance by evaluating the trade-offs between the true positive rate (sensitivity) and the false positive rate (1 - specificity). The curve is obtained by sweeping the cutoff applied to the predicted probability p over its whole range, so it summarizes the predictive power for all possible cutoffs rather than only the conventional 0.5. The area under the curve (AUC), also referred to as the index of accuracy (A) or concordance index, is a convenient single-number summary of the ROC curve. The higher the AUC, the better the predictive power of the model. The ROC of a perfect predictive model has a TPR of 1 and an FPR of 0, so the curve touches the top left corner of the graph.
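To make the AIC and deviance bullets concrete, here is a minimal sketch on simulated data (not the dataset used later in this article). The sample size, the "true" coefficients and the variable names are assumptions chosen for illustration. For a 0/1 response the residual deviance equals minus twice the log-likelihood, so the AIC is the residual deviance plus twice the number of coefficients.

```r
set.seed(42)  # simulated data, not the admission dataset

n <- 200
x <- rnorm(n)
# True model assumed for the simulation: log odds = -0.5 + 1.2 * x
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))

fit <- glm(y ~ x, family = "binomial")

k <- length(coef(fit))                # number of estimated coefficients
aic_by_hand <- deviance(fit) + 2 * k  # valid for 0/1 responses

print(c(null_deviance     = fit$null.deviance,
        residual_deviance = deviance(fit),
        aic_glm           = AIC(fit),
        aic_by_hand       = aic_by_hand))
```

The residual deviance cannot exceed the null deviance, since the intercept-only model is nested in the fitted one.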

Setting up

As usual we will use the tidyverse, along with the caret and ROCR packages.

library(tidyverse)
library(caret)      # for confusion matrix
library(ROCR)       # For the ROC curve

We can now get straight to business and see how to model logistic regression with R and then have the more interesting discussion on its performance.

Example 1

We use a dataset about factors influencing graduate admission that can be downloaded from the UCLA Institute for Digital Research and Education

The dataset has 4 variables

  • admit is the response variable
  • gre is the result of a standardized test
  • gpa is the student’s grade point average (school reported)
  • rank is the prestige of the undergraduate institution the student attended (1 = highest prestige, 4 = lowest)

Let’s have a quick look at the data and their summary. The goal is to get familiar with the data, type of predictors (continuous, discrete, categorical, etc.)

df <- read_csv("dataset/grad_admission.csv")
glimpse(df)
## Observations: 400
## Variables: 4
## $ admit <int> 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,...
## $ gre   <int> 380, 660, 800, 640, 520, 760, 560, 400, 540, 700, 800, 4...
## $ gpa   <dbl> 3.61, 3.67, 4.00, 3.19, 2.93, 3.00, 2.98, 3.08, 3.39, 3....
## $ rank  <int> 3, 3, 1, 4, 4, 2, 1, 2, 3, 2, 4, 1, 1, 2, 1, 3, 4, 3, 2,...
#Quick check to see if our response variable is balanced-ish
table(df$admit)
## 
##   0   1 
## 273 127
## Two-way contingency table of categorical outcome and predictors
round(prop.table(table(df$admit, df$rank), 2), 2)
##    
##        1    2    3    4
##   0 0.46 0.64 0.77 0.82
##   1 0.54 0.36 0.23 0.18

It seems about right … applicants from rank 1 institutions are admitted 54% of the time, while most applicants from rank 3 and rank 4 institutions are not admitted.

Modeling

Before we can run our model, let’s transform the rank explanatory variable to a factor.

df$rank <- factor(df$rank)

# Run the model
model_admission_lr <- glm(admit ~ ., data = df, family = "binomial")
summary(model_admission_lr)
## 
## Call:
## glm(formula = admit ~ ., family = "binomial", data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6268  -0.8662  -0.6388   1.1490   2.0790  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.989979   1.139951  -3.500 0.000465 ***
## gre          0.002264   0.001094   2.070 0.038465 *  
## gpa          0.804038   0.331819   2.423 0.015388 *  
## rank2       -0.675443   0.316490  -2.134 0.032829 *  
## rank3       -1.340204   0.345306  -3.881 0.000104 ***
## rank4       -1.551464   0.417832  -3.713 0.000205 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 458.52  on 394  degrees of freedom
## AIC: 470.52
## 
## Number of Fisher Scoring iterations: 4

The next part of the output shows the coefficients, their standard errors, the z-statistic (sometimes called a Wald z-statistic), and the associated p-values. Both gre and gpa are statistically significant, as are the three terms for rank. The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable. For every one unit change in gre, the log odds of admission (versus non-admission) increases by 0.002. For a one unit increase in gpa, the log odds of being admitted to graduate school increases by 0.804. The indicator variables for rank have a slightly different interpretation. For example, having attended an undergraduate institution with rank of 2, versus an institution with a rank of 1, changes the log odds of admission by -0.675.
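Log odds are hard to read directly, so it is common to exponentiate the coefficients to obtain odds ratios. The sketch below simply copies the estimates from the summary() output above into a named vector so that it runs on its own; with the fitted model at hand, exp(coef(model_admission_lr)) would give the same numbers.

```r
# Coefficient estimates copied from the summary() output above
coefs <- c("(Intercept)" = -3.989979, gre = 0.002264, gpa = 0.804038,
           rank2 = -0.675443, rank3 = -1.340204, rank4 = -1.551464)

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratios <- exp(coefs)
print(round(odds_ratios, 4))
```

For example, the odds ratio for gpa is roughly 2.23: a one unit increase in gpa multiplies the odds of admission by about 2.23, holding gre and rank fixed. The odds ratios for the rank terms are below 1, meaning lower odds of admission than for rank 1.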

To see how much each variable in the model contributes to the decrease in residual deviance, we can use the anova() function on our model.

anova(model_admission_lr)
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: admit
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev
## NULL                   399     499.98
## gre   1  13.9204       398     486.06
## gpa   1   5.7122       397     480.34
## rank  3  21.8265       394     458.52
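Each drop in deviance can be turned into a p-value, because under the null hypothesis that the added term has no effect the drop is approximately chi-squared distributed. The sketch below copies the numbers from the table above so that it is self-contained; calling anova(model_admission_lr, test = "Chisq") on the fitted model should report the same tests.

```r
# Deviance reductions and degrees of freedom copied from the table above
deviance_drop <- c(gre = 13.9204, gpa = 5.7122, rank = 21.8265)
df_used       <- c(gre = 1, gpa = 1, rank = 3)

# Upper-tail chi-squared probability for each sequentially added term
p_values <- pchisq(deviance_drop, df = df_used, lower.tail = FALSE)
print(signif(p_values, 3))

# Overall test of the full model against the intercept-only model
overall_p <- pchisq(499.98 - 458.52, df = 399 - 394, lower.tail = FALSE)
print(overall_p)
```

Note that these tests are sequential: each term is assessed given the terms listed before it, so the order of the terms in the formula matters.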

We can test for an overall effect of rank using the wald.test function of the aod library. The order in which the coefficients are given in the table of coefficients is the same as the order of the terms in the model. This is important because wald.test refers to the coefficients by their order in the model. In the call, b supplies the coefficients, Sigma supplies the variance-covariance matrix of the coefficient estimates, and Terms tells R which terms in the model are to be tested; in this case terms 4, 5 and 6 are the three terms for the levels of rank.

library(aod)
wald.test(Sigma = vcov(model_admission_lr), b = coef(model_admission_lr), Terms = 4:6)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 20.9, df = 3, P(> X2) = 0.00011

The chi-squared test statistic of 20.9, with three degrees of freedom is associated with a p-value of 0.00011 indicating that the overall effect of rank is statistically significant.

Let’s check how our model is performing. As mentioned earlier, we need to make a choice on the cutoff value (returned probability) to check our accuracy. In this first example, let’s just stick with the usual 0.5 cutoff value.

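# Predicted probabilities on the training data (the argument is newdata, not data)
prediction_admission_lr <- predict(model_admission_lr, newdata = df, type = "response")
# Classify with a 0.5 cutoff; confusionMatrix() expects factors with identical levels
prediction_admission_lr <- factor(if_else(prediction_admission_lr > 0.5, 1, 0), levels = c(0, 1))
confusionMatrix(data = prediction_admission_lr, 
                reference = factor(df$admit, levels = c(0, 1)), 
                positive = "1")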
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 254  97
##          1  19  30
##                                          
##                Accuracy : 0.71           
##                  95% CI : (0.6628, 0.754)
##     No Information Rate : 0.6825         
##     P-Value [Acc > NIR] : 0.1293         
##                                          
##                   Kappa : 0.1994         
##  Mcnemar's Test P-Value : 8.724e-13      
##                                          
##             Sensitivity : 0.2362         
##             Specificity : 0.9304         
##          Pos Pred Value : 0.6122         
##          Neg Pred Value : 0.7236         
##              Prevalence : 0.3175         
##          Detection Rate : 0.0750         
##    Detection Prevalence : 0.1225         
##       Balanced Accuracy : 0.5833         
##                                          
##        'Positive' Class : 1              
## 

We have an interesting situation here. Although all our variables were significant in our model, its accuracy of 71% is only slightly higher than the basic benchmark, the no-information model (i.e. always predicting the most frequent class), which reaches 68.25% here. The sensitivity is also low: only about 24% of the admitted students are correctly identified.
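The headline statistics can be reproduced by hand from the four cells of the confusion matrix above, which makes the definitions of accuracy, sensitivity and specificity concrete. The variable names below are only illustrative.

```r
# Cell counts copied from the confusion matrix above (positive class = 1)
tn <- 254  # predicted 0, actual 0
fn <- 97   # predicted 0, actual 1
fp <- 19   # predicted 1, actual 0
tp <- 30   # predicted 1, actual 1
total <- tp + tn + fp + fn

accuracy    <- (tp + tn) / total
sensitivity <- tp / (tp + fn)  # share of actual 1s predicted as 1
specificity <- tn / (tn + fp)  # share of actual 0s predicted as 0
# No-information rate: accuracy of always predicting the most frequent class
no_information_rate <- max(tn + fp, tp + fn) / total

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, nir = no_information_rate), 4))
```

The values match the Accuracy, Sensitivity, Specificity and No Information Rate lines reported by confusionMatrix().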

ROC and cutoff point

Before we do a ROC curve, let’s have a quick reminder on ROC. A ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) as the cutoff varies. So ideally we want to have 100% TPR and 0% FPR.

[Figure: perfect ROC curve]

Pure random guessing should lead to a curve that follows the diagonal.

[Figure: random guess]

With that in mind, let’s do a ROC curve on our model

prediction_admission_lr <- predict(model_admission_lr, newdata = df, type = "response")
pr_admission <- prediction(prediction_admission_lr, df$admit)
prf_admission <- performance(pr_admission, measure = "tpr", x.measure = "fpr")
plot(prf_admission, colorize = TRUE, lwd=3)

At least it is better than just random guessing.
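To see what the ROC curve is made of, it can help to build one by hand. The sketch below uses base R only and simulated scores rather than the admission model; the class balance and score distribution are assumptions made for illustration. For each cutoff it computes the share of actual positives and of actual negatives that score above it.

```r
set.seed(1)  # simulated scores, not the admission model

actual <- rbinom(300, size = 1, prob = 0.3)
# Hypothetical predicted probabilities: positives tend to score higher
score <- plogis(rnorm(300, mean = ifelse(actual == 1, 0.5, -0.5)))

cutoffs <- seq(0, 1, by = 0.01)
# True and false positive rates at each cutoff
tpr <- sapply(cutoffs, function(cut) mean(score[actual == 1] > cut))
fpr <- sapply(cutoffs, function(cut) mean(score[actual == 0] > cut))

plot(fpr, tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)  # reference line for random guessing
```

Raising the cutoff can only lower both rates, which is why the curve runs from the top right corner (cutoff 0) to the bottom left corner (cutoff 1).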

In some applications of ROC curves, you want the point closest to a TPR of 1 and an FPR of 0. This cut point is “optimal” in the sense that it weighs sensitivity and specificity equally. The ROCR package also offers a cost measure that can be used to create a performance object; by default it gives false positives and false negatives the same cost. We use it to find the cutoff with minimum cost.

cost_admission_perf <- performance(pr_admission, "cost")
pr_admission@cutoffs[[1]][which.min(cost_admission_perf@y.values[[1]])]
##      392 
## 0.487194

Using that cutoff value we should get our sensitivity and specificity a bit more in balance. Let’s try

prediction_admission_lr <- predict(model_admission_lr, newdata = df, type = "response")
# confusionMatrix() expects factors with identical levels
prediction_admission_lr <- factor(if_else(prediction_admission_lr > 0.4721949, 1, 0), levels = c(0, 1))
confusionMatrix(data = prediction_admission_lr, 
                reference = factor(df$admit, levels = c(0, 1)), 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 247  89
##          1  26  38
##                                           
##                Accuracy : 0.7125          
##                  95% CI : (0.6654, 0.7564)
##     No Information Rate : 0.6825          
##     P-Value [Acc > NIR] : 0.1077          
##                                           
##                   Kappa : 0.2352          
##  Mcnemar's Test P-Value : 7.402e-09       
##                                           
##             Sensitivity : 0.2992          
##             Specificity : 0.9048          
##          Pos Pred Value : 0.5938          
##          Neg Pred Value : 0.7351          
##              Prevalence : 0.3175          
##          Detection Rate : 0.0950          
##    Detection Prevalence : 0.1600          
##       Balanced Accuracy : 0.6020          
##                                           
##        'Positive' Class : 1               
## 

And bonus, we even gained some accuracy!

I have seen a very cool graph on this website that plots this tradeoff between specificity and sensitivity and shows how this cutoff point can enhance the understanding of the predictive power of our model.

# Create tibble with both prediction and actual value
cutoff <- 0.5
cutoff_plot <- tibble(predicted = predict(model_admission_lr, newdata = df, type = "response"), 
                      actual = as.factor(df$admit)) %>% 
  # Label each observation by its cell in the confusion matrix
  mutate(type = case_when(predicted >= cutoff & actual == 1 ~ "TP", 
                          predicted >= cutoff & actual == 0 ~ "FP", 
                          predicted <  cutoff & actual == 0 ~ "TN", 
                          TRUE                              ~ "FN"), 
         type = as.factor(type))

ggplot(cutoff_plot, aes(x = actual, y = predicted, color = type)) + 
  geom_violin(fill = "white", color = NA) + 
  geom_jitter(shape = 1) + 
  geom_hline(yintercept = cutoff, color = "blue", alpha = 0.5) + 
  scale_y_continuous(limits = c(0, 1)) + 
  ggtitle(paste0("Confusion Matrix with cutoff at ", cutoff))

AUC

Last thing … the AUC, aka Area Under the Curve. The AUC is simply the area under the ROC curve. You can think of the AUC as a holistic number that represents how good your TPR and FPR look in aggregate, across all cutoffs.

AUC = 0.5 -> no better than random guessing; AUC = 1 -> perfect classifier

[Figure: area under the ROC curve]

So in the context of a ROC curve, the more “up and left” it looks, the larger the AUC will be and thus the better your classifier is. Comparing AUC values is also really useful when comparing different models, as we can select the model with the highest AUC rather than just look at the curves.

In our situation with our model model_admission_lr, we can find our AUC with the ROCR package.

prediction_admission_lr <- predict(model_admission_lr, newdata = df, type = "response")
pr_admission <- prediction(prediction_admission_lr, df$admit)
auc_admission <- performance(pr_admission, measure = "auc")

# and to get the exact value  
auc_admission@y.values[[1]]
## [1] 0.6928413
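The AUC also has a direct interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below checks this in base R on simulated scores (not the admission model; the score distribution is an assumption), using both the rank-based formula and a brute-force comparison of all positive/negative pairs.

```r
set.seed(1)  # simulated scores, not the admission model

actual <- rbinom(300, size = 1, prob = 0.3)
score <- plogis(rnorm(300, mean = ifelse(actual == 1, 0.5, -0.5)))

n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)

# Rank-based (Mann-Whitney) form of the AUC
auc_rank <- (sum(rank(score)[actual == 1]) - n_pos * (n_pos + 1) / 2) /
  (n_pos * n_neg)

# Direct definition: share of (positive, negative) pairs ranked correctly,
# with ties counted as half
pos <- score[actual == 1]
neg <- score[actual == 0]
auc_pairs <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

print(c(auc_rank = auc_rank, auc_pairs = auc_pairs))
```

Read this way, the 0.69 obtained for our model means that an admitted student gets a higher predicted probability than a non-admitted one in roughly 69% of such pairs.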

References