 Introduction
 The logistic equation.
 Performance of Logistic Regression Model.
 Setting up
 Example 1
 References
This is part 1 of a two-part article on logistic regression.
Introduction
Logistic regression is a classification algorithm. It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables. Binary or categorical outcomes are represented with dummy variables. You can also think of logistic regression as a special case of linear regression in which the outcome variable is categorical and the log of the odds is used as the dependent variable. In simple words, it predicts the probability of occurrence of an event by fitting the data to a logit function. Logistic regression is part of a larger class of algorithms known as Generalized Linear Models (GLM). Most logistic regression is really binomial logistic regression, since the variable to predict is binary. However, logistic regression can also be used to predict a dependent variable that can take more than 2 values, in which case we call the model multinomial logistic regression. A typical example would be classifying films as “Entertaining”, “borderline” or “boring”.
The logistic equation.
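The general equation of the logit model starts from the linear predictor

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

where $Y$ is the variable to predict, the $\beta_i$ are the coefficients of the predictors and the $x_i$ are the predictors (aka independent variables).
In logistic regression, we are only concerned with the probability of the outcome of the dependent variable (success or failure). A probability cannot be negative, so we first rewrite our function as

$$p = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n} = e^{Y}$$

This, however, does not guarantee that p lies between 0 and 1, since $e^{Y}$ can exceed 1. Let’s then have

$$p = \frac{e^{Y}}{e^{Y} + 1}$$

or, equivalently,

$$p = \frac{1}{1 + e^{-Y}}$$

where p is the probability of success. With a little further manipulation, we have

$$\frac{p}{1 - p} = e^{Y}$$

and

$$\log\left(\frac{p}{1 - p}\right) = Y$$

If we remember what Y was, we get

$$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

This is the equation used in logistic regression. Here $p / (1 - p)$ is the odds of success. Whenever the log of the odds is positive, the probability of success is more than 50%.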
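The link between the log odds and the probability can be checked numerically. The following is a minimal sketch in base R; the coefficient values and the predictor value are made up for illustration and do not come from any fitted model.

```r
# Hypothetical coefficients and predictor value, chosen only for illustration
beta0 <- -1.5
beta1 <- 0.8
x1 <- 3

# Linear predictor Y, which is also the log odds
log_odds <- beta0 + beta1 * x1

# Inverse logit: p = e^Y / (e^Y + 1); plogis() is the base R equivalent
p_manual <- exp(log_odds) / (exp(log_odds) + 1)
p_builtin <- plogis(log_odds)

# Going back: log(p / (1 - p)) recovers Y; qlogis() is the base R equivalent
log_odds_back <- log(p_manual / (1 - p_manual))

# A positive log odds corresponds to a probability above 0.5
p_manual > 0.5
```

Here the log odds is 0.9, which is positive, so the resulting probability sits above 0.5.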
Performance of Logistic Regression Model.
To evaluate the performance of a logistic regression model, we can consider a few metrics.
 AIC (Akaike Information Criterion) The analogue of adjusted R² in logistic regression is AIC. AIC is a measure of fit that penalizes a model for its number of coefficients. Therefore, among candidate models fitted to the same data, we prefer the one with the lowest AIC value.
 Null Deviance and Residual Deviance Null deviance measures how well the response is predicted by a model with nothing but an intercept. Residual deviance measures how well the response is predicted once the independent variables are added. In both cases, the lower the value, the better the fit; a large drop from null to residual deviance indicates that the predictors add information.
 Confusion Matrix It is nothing but a tabular representation of actual vs predicted values. It lets us compute the accuracy of the model, and comparing it between training data and held-out data helps to detect overfitting.
 We can calculate the accuracy of our model by $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
 From the confusion matrix, sensitivity and specificity can be derived as $\text{Sensitivity} = \frac{TP}{TP + FN}$ and $\text{Specificity} = \frac{TN}{TN + FP}$
 ROC Curve The Receiver Operating Characteristic (ROC) curve summarizes the model’s performance by evaluating the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) as the probability cutoff varies. A cutoff of p > 0.5 is the usual default for labelling a success, but the ROC curve summarizes the predictive power over all possible cutoff values. The area under the curve (AUC), also referred to as the index of accuracy (A) or concordance index, is a convenient summary metric for the ROC curve. The higher the area under the curve, the better the predictive power of the model. The ROC curve of a perfect predictive model has a true positive rate of 1 and a false positive rate of 0, so it touches the top left corner of the graph.
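The confusion matrix metrics above are simple ratios of four counts. A minimal sketch in base R, using made-up counts (they are an assumption for illustration, not results from this article):

```r
# Hypothetical confusion matrix counts, for illustration only
tp <- 40  # actual 1, predicted 1
fn <- 10  # actual 1, predicted 0
fp <- 5   # actual 0, predicted 1
tn <- 45  # actual 0, predicted 0

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # true positive rate, y-axis of the ROC curve
specificity <- tn / (tn + fp)   # true negative rate
fpr         <- 1 - specificity  # false positive rate, x-axis of the ROC curve

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, fpr = fpr), 2)
```

Each cutoff on the predicted probability produces one such set of counts, and therefore one (fpr, sensitivity) point on the ROC curve.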
Setting up
As usual we will use the tidyverse and caret packages.
library(tidyverse)
library(caret) # for confusion matrix
library(ROCR) # For the ROC curve
We can now get straight to business and see how to fit a logistic regression model with R, and then have the more interesting discussion on its performance.
Example 1
We use a dataset about factors influencing graduate admission that can be downloaded from the UCLA Institute for Digital Research and Education.
The dataset has 4 variables:
admit is the response variable (1 = admitted, 0 = not admitted)
gre is the result of a standardized test
gpa is the student’s GPA (school reported)
rank is the prestige of the applicant’s undergraduate institution (1 = highest prestige, 4 = lowest)
Let’s have a quick look at the data and their summary. The goal is to get familiar with the data, type of predictors (continuous, discrete, categorical, etc.)
df <- read_csv("dataset/grad_admission.csv")
glimpse(df)
## Observations: 400
## Variables: 4
## $ admit <int> 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,...
## $ gre <int> 380, 660, 800, 640, 520, 760, 560, 400, 540, 700, 800, 4...
## $ gpa <dbl> 3.61, 3.67, 4.00, 3.19, 2.93, 3.00, 2.98, 3.08, 3.39, 3....
## $ rank <int> 3, 3, 1, 4, 4, 2, 1, 2, 3, 2, 4, 1, 1, 2, 1, 3, 4, 3, 2,...
# Quick check to see if our response variable is balanced-ish
table(df$admit)
##
## 0 1
## 273 127
# Two-way contingency table of categorical outcome and predictors
round(prop.table(table(df$admit, df$rank), 2), 2)
##
## 1 2 3 4
## 0 0.46 0.64 0.77 0.82
## 1 0.54 0.36 0.23 0.18
It seems about right … the admission rate falls steadily with the rank of the institution, from 54% for rank 1 down to 18% for rank 4.
Modeling
Before we can run our model, let’s transform the rank explanatory variable to a factor.
df$rank <- factor(df$rank)
# Run the model
model_admission_lr <- glm(admit ~ ., data = df, family = "binomial")
summary(model_admission_lr)
##
## Call:
## glm(formula = admit ~ ., family = "binomial", data = df)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.6268  -0.8662  -0.6388   1.1490   2.0790
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.989979   1.139951  -3.500 0.000465 ***
## gre          0.002264   0.001094   2.070 0.038465 *
## gpa          0.804038   0.331819   2.423 0.015388 *
## rank2       -0.675443   0.316490  -2.134 0.032829 *
## rank3       -1.340204   0.345306  -3.881 0.000104 ***
## rank4       -1.551464   0.417832  -3.713 0.000205 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 458.52 on 394 degrees of freedom
## AIC: 470.52
##
## Number of Fisher Scoring iterations: 4
The coefficients table of the output shows the coefficients, their standard errors, the z-statistic (sometimes called a Wald z-statistic), and the associated p-values. Both gre and gpa are statistically significant, as are the three terms for rank. The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable.
For every one unit change in gre, the log odds of admission (versus non-admission) increases by 0.002.
For a one unit increase in gpa, the log odds of being admitted to graduate school increases by 0.804.
The indicator variables for rank have a slightly different interpretation. For example, having attended an undergraduate institution with a rank of 2, versus an institution with a rank of 1, changes the log odds of admission by -0.675.
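Log odds are hard to read, so a common follow-up is to exponentiate the coefficients and read them as odds ratios. A minimal sketch, with the gre and gpa estimates copied by hand from the summary output rather than extracted from the model object (on the fitted model, exp(coef(model_admission_lr)) should give the same thing for every term):

```r
# Estimates copied from the coefficients table of the model summary
log_odds_coef <- c(gre = 0.002264, gpa = 0.804038)

# Exponentiating a log odds coefficient gives an odds ratio
odds_ratio <- exp(log_odds_coef)
round(odds_ratio, 4)

# Multiplicative effect of a 100-point increase in gre on the odds of admission
gre_100 <- exp(100 * log_odds_coef[["gre"]])
round(gre_100, 2)
```

Read this way, a one unit increase in gpa multiplies the odds of admission by roughly 2.23, while a single gre point barely moves them; a 100-point gre increase is a more meaningful unit.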
To see how each variable in the model contributes to the decrease in residual deviance, we can use the anova function on our model.
anova(model_admission_lr)
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: admit
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 399 499.98
## gre 1 13.9204 398 486.06
## gpa 1 5.7122 397 480.34
## rank 3 21.8265 394 458.52
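The deviance figures can be tied back to the metrics discussed earlier. The sketch below copies the numbers by hand from the model summary (nothing is refitted) and checks two things: whether the overall drop in deviance is significant, and how the reported AIC relates to the residual deviance. The second check relies on the fact that, for a 0/1 response, the residual deviance equals -2 times the log-likelihood.

```r
# Values copied from the model summary
null_deviance     <- 499.98
residual_deviance <- 458.52
df_null     <- 399
df_residual <- 394

# Drop in deviance and the number of parameters that bought it
deviance_drop <- null_deviance - residual_deviance
df_drop <- df_null - df_residual

# Chi-squared test of the full model against the intercept-only model
p_value <- pchisq(deviance_drop, df = df_drop, lower.tail = FALSE)

# AIC = -2 * log-likelihood + 2 * (number of coefficients)
n_coef <- 6  # intercept, gre, gpa and three rank indicators
aic <- residual_deviance + 2 * n_coef

c(deviance_drop = deviance_drop, df = df_drop, p_value = p_value, aic = aic)
```

The drop of 41.46 is also the sum of the three Deviance entries in the table (up to rounding), and the reconstructed AIC matches the 470.52 printed in the summary.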
We can test for an overall effect of rank using the wald.test function of the aod library. The order in which the coefficients are given in the table of coefficients is the same as the order of the terms in the model. This is important because the wald.test function refers to the coefficients by their order in the model. In the call, b supplies the coefficients, Sigma supplies the variance-covariance matrix of the coefficient estimates, and Terms tells R which terms in the model are to be tested; in this case, terms 4, 5, and 6 are the three terms for the levels of rank.
library(aod)
wald.test(Sigma = vcov(model_admission_lr), b = coef(model_admission_lr), Terms = 4:6)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 20.9, df = 3, P(> X2) = 0.00011
The chi-squared test statistic of 20.9, with three degrees of freedom, is associated with a p-value of 0.00011, indicating that the overall effect of rank is statistically significant.
Let’s check how our model is performing. As mentioned earlier, we need to choose a cutoff value (on the returned probability) to check our accuracy. In this first example, let’s just stick with the usual 0.5 cutoff value.
# predict() expects `newdata`; a `data` argument would be silently ignored
prediction_admission_lr <- predict(model_admission_lr, newdata = df, type = "response")
prediction_admission_lr <- if_else(prediction_admission_lr > 0.5, 1, 0)
# confusionMatrix() needs factors that share the same levels
confusionMatrix(data = factor(prediction_admission_lr, levels = c(0, 1)),
                reference = factor(df$admit, levels = c(0, 1)),
                positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 254 97
## 1 19 30
##
## Accuracy : 0.71
## 95% CI : (0.6628, 0.754)
## No Information Rate : 0.6825
## P-Value [Acc > NIR] : 0.1293
##
## Kappa : 0.1994
## Mcnemar's Test P-Value : 8.724e-13
##
## Sensitivity : 0.2362
## Specificity : 0.9304
## Pos Pred Value : 0.6122
## Neg Pred Value : 0.7236
## Prevalence : 0.3175
## Detection Rate : 0.0750
## Detection Prevalence : 0.1225
## Balanced Accuracy : 0.5833
##
## 'Positive' Class : 1
##
We have an interesting situation here. Although all our variables were significant in our model, the accuracy of our model, 71%, is just a little bit higher than the basic benchmark, the no-information model (i.e. we just predict the most frequent class), which in this case reaches 68.25%.
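The no-information benchmark can be recomputed by hand. This sketch copies the class counts from table(df$admit) and the four cells of the confusion matrix at the 0.5 cutoff; nothing is refitted.

```r
# Class counts from table(df$admit)
class_counts <- c(`0` = 273, `1` = 127)

# Confusion matrix cells at the 0.5 cutoff
tn <- 254; fn <- 97; fp <- 19; tp <- 30

# No-information rate: always predict the most frequent class
nir <- max(class_counts) / sum(class_counts)

# Accuracy of the fitted model at the 0.5 cutoff
accuracy <- (tp + tn) / (tp + tn + fp + fn)

c(nir = nir, accuracy = accuracy, gain = accuracy - nir)
```

The gain over always predicting "not admitted" is under 3 percentage points, which is why the P-Value [Acc > NIR] line of the output is not significant.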
ROC and cutoff point
Before we plot a ROC curve, let’s have a quick reminder on ROC. A ROC curve plots the true positive rate against the false positive rate for every possible cutoff. So ideally we want to have 100% TP and 0% FP.
Pure random guessing should lead to the diagonal line going from (0, 0) to (1, 1).
With that in mind, let’s do a ROC curve on our model
prediction_admission_lr <- predict(model_admission_lr, newdata = df, type = "response")
pr_admission <- prediction(prediction_admission_lr, df$admit)
prf_admission <- performance(pr_admission, measure = "tpr", x.measure = "fpr")
plot(prf_admission, colorize = TRUE, lwd = 3)
At least it is better than just random guessing.
In some applications of ROC curves, you want the point closest to the TPR of 1 and FPR of 0. This cut point is “optimal” in the sense it weighs both sensitivity and specificity equally. Now, there is a cost measure in the ROCR package that you can use to create a performance object. Use it to find the cutoff with minimum cost.
cost_admission_perf = performance(pr_admission, "cost")
pr_admission@cutoffs[[1]][which.min(cost_admission_perf@y.values[[1]])]
## 392
## 0.487194
Using that cutoff value we should get our sensitivity and specificity a bit more in balance. Let’s try
prediction_admission_lr <- predict(model_admission_lr, newdata = df, type = "response")
prediction_admission_lr <- if_else(prediction_admission_lr > 0.4721949, 1, 0)
# confusionMatrix() needs factors that share the same levels
confusionMatrix(data = factor(prediction_admission_lr, levels = c(0, 1)),
                reference = factor(df$admit, levels = c(0, 1)),
                positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 247 89
## 1 26 38
##
## Accuracy : 0.7125
## 95% CI : (0.6654, 0.7564)
## No Information Rate : 0.6825
## P-Value [Acc > NIR] : 0.1077
##
## Kappa : 0.2352
## Mcnemar's Test P-Value : 7.402e-09
##
## Sensitivity : 0.2992
## Specificity : 0.9048
## Pos Pred Value : 0.5938
## Neg Pred Value : 0.7351
## Prevalence : 0.3175
## Detection Rate : 0.0950
## Detection Prevalence : 0.1600
## Balanced Accuracy : 0.6020
##
## 'Positive' Class : 1
##
And bonus, we even gained some accuracy!
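To get a feel for how the cutoff moves sensitivity, specificity and accuracy, it can help to tabulate them over a grid of cutoffs. The sketch below does this in base R on simulated data; the data, the coefficients used to simulate it and the helper name metrics_at are all assumptions made for illustration, not part of the admission example.

```r
set.seed(42)

# Simulated data (an assumption for illustration, not the admission dataset)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.8 + 1.2 * x))

fit <- glm(y ~ x, family = "binomial")
p_hat <- predict(fit, type = "response")

# Sensitivity, specificity and accuracy for one cutoff
metrics_at <- function(cutoff) {
  pred <- as.integer(p_hat > cutoff)
  c(cutoff = cutoff,
    sensitivity = sum(pred == 1 & y == 1) / sum(y == 1),
    specificity = sum(pred == 0 & y == 0) / sum(y == 0),
    accuracy = mean(pred == y))
}

# One row per cutoff
cutoffs <- seq(0.1, 0.9, by = 0.1)
cutoff_table <- as.data.frame(t(sapply(cutoffs, metrics_at)))
cutoff_table
```

Raising the cutoff can only lower sensitivity and raise specificity, so the choice of cutoff is a trade-off rather than a free gain; accuracy usually peaks somewhere in between.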
I have seen a very cool graph on this website that plots this tradeoff between specificity and sensitivity and shows how this cutoff point can enhance the understanding of the predictive power of our model.
# Create tibble with both prediction and actual value
cutoff = 0.5
cutoff_plot <- tibble(predicted = predict(model_admission_lr, newdata = df, type = "response"),
                      actual = as.factor(df$admit)) %>%
  mutate(type = if_else(predicted >= cutoff & actual == 1, "TP",
                if_else(predicted >= cutoff & actual == 0, "FP",
                if_else(predicted < cutoff & actual == 0, "TN", "FN"))))
cutoff_plot$type <- as.factor(cutoff_plot$type)
ggplot(cutoff_plot, aes(x = actual, y = predicted, color = type)) +
  geom_violin(fill = "white", color = NA) +
  geom_jitter(shape = 1) +
  geom_hline(yintercept = cutoff, color = "blue", alpha = 0.5) +
  scale_y_continuous(limits = c(0, 1)) +
  ggtitle(paste0("Confusion Matrix with cutoff at ", cutoff))
AUC
Last thing … the AUC, aka Area Under the Curve. The AUC is basically the area under the ROC curve. You can think of the AUC as a sort of holistic number that represents how well your TP and FP rates are looking in aggregate.
AUC = 0 -> BAD (0.5 is what random guessing achieves)
AUC = 1 -> GOOD
So in the context of a ROC curve, the more “up and left” it looks, the larger the AUC will be and thus, the better your classifier is. Comparing AUC values is also really useful when comparing different models, as we can select the model with the highest AUC value rather than just look at the curves.
In our situation with our model model_admission_lr, we can find our AUC with the ROCR package.
prediction_admission_lr <- predict(model_admission_lr, newdata = df, type = "response")
pr_admission <- prediction(prediction_admission_lr, df$admit)
auc_admission <- performance(pr_admission, measure = "auc")
# and to get the exact value
auc_admission@y.values[[1]]
## [1] 0.6928413
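The AUC also has a direct reading as a concordance index: it is the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case. A minimal sketch in base R on a tiny made-up example (the scores and labels are hypothetical, not taken from the admission model):

```r
# Tiny hypothetical example: predicted probabilities and true classes
scores <- c(0.10, 0.40, 0.35, 0.80)
labels <- c(0, 0, 1, 1)

pos <- scores[labels == 1]
neg <- scores[labels == 0]

# Concordance: share of (positive, negative) pairs where the positive
# gets the higher score; ties count for one half
diffs <- outer(pos, neg, FUN = "-")
auc_pairs <- mean((diffs > 0) + 0.5 * (diffs == 0))

# Same value from the rank-sum (Mann-Whitney) formula
r <- rank(scores)
n_pos <- length(pos)
n_neg <- length(neg)
auc_ranks <- (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

c(auc_pairs = auc_pairs, auc_ranks = auc_ranks)
```

Three of the four positive/negative pairs are ordered correctly here, so both computations give 0.75. Read the same way, the 0.69 of our model means that an admitted student gets a higher predicted probability than a non-admitted one in about 69% of such pairs.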
References

The Introduction is from the AV website

The UCLA Institute for Digital Research and Education site where we got the dataset for our first example.

The UCI Machine learning site where we got the dataset for our second example.