Chapter 7 KNN - K-Nearest Neighbors
Despite what its name might suggest, KNN is a supervised learning technique and should not be confused with k-means clustering. Clustering is an unsupervised learning technique: the task of grouping a set of objects so that objects in the same cluster are more similar to each other than to objects in other clusters, where similarity is a measure that reflects the strength of the relationship between two data objects. Clustering is mainly used for exploratory data mining; KNN, by contrast, learns from labelled examples.
The KNN algorithm is a robust and versatile classifier that is often used as a benchmark for more complex classifiers such as Artificial Neural Networks (ANN) and Support Vector Machines (SVM). Despite its simplicity, KNN can outperform more powerful classifiers and is used in a variety of applications.
The KNN classifier is also a non-parametric, instance-based learning algorithm.
Non-parametric means it makes no explicit assumptions about the functional form of h, the mapping from features to outcome, which avoids the danger of mismodeling the underlying distribution of the data. For example, suppose our data is highly non-Gaussian but the learning model we choose assumes a Gaussian form; in that case, our algorithm would make extremely poor predictions.
Instance-based learning means that our algorithm doesn’t explicitly learn a model (it is a “lazy learner”). Instead, it memorizes the training instances, which are subsequently used as “knowledge” during the prediction phase. Concretely, only when a query is made (i.e. when we ask it to predict a label given an input) does the algorithm use the training instances to produce an answer.
It is worth noting that the minimal training phase of KNN comes at both a memory cost, since we must store a potentially huge data set, and a computational cost at test time, since classifying a given observation requires a pass over the whole data set. Practically speaking, this is undesirable, since we usually want fast responses.
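To make that cost concrete, here is a small timing sketch on simulated data (our own toy example, not one of this chapter's datasets), using the knn() function from the class package: because the "model" is just the stored training set, the time to classify a fixed batch of test points grows with the number of training rows.
library(class)   # knn() performs the whole classification in one call

set.seed(42)
# Hypothetical two-column data, only used to time the prediction step
make_data <- function(n) data.frame(x1 = rnorm(n), x2 = rnorm(n))

test <- make_data(500)
for (n in c(1e3, 1e4, 1e5)) {
  train  <- make_data(n)
  labels <- factor(sample(c("A", "B"), n, replace = TRUE))
  elapsed <- system.time(knn(train, test, cl = labels, k = 5))["elapsed"]
  cat(n, "training rows:", elapsed, "seconds\n")
}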
The principle behind the KNN (K-Nearest Neighbors) algorithm is to find a predefined number K of training samples closest in distance to a new point, and to predict the label of the new point from those samples, typically by majority vote.
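As a minimal illustration of this principle, a sketch on R's built-in iris data (again our own toy example, using the class package):
library(class)

set.seed(42)
idx   <- sample(nrow(iris), 120)    # 120 rows for training, 30 held out
train <- iris[idx, 1:4]             # the four numeric measurements
test  <- iris[-idx, 1:4]

# Each held-out flower is labelled by a majority vote among its
# 5 nearest training neighbours (Euclidean distance)
pred <- knn(train, test, cl = iris$Species[idx], k = 5)
mean(pred == iris$Species[-idx])    # proportion classified correctly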
When K is small, we are shrinking the region that influences a given prediction and forcing our classifier to be “more blind” to the overall distribution. A small value of K provides the most flexible fit, which will have low bias but high variance. Graphically, our decision boundary will be more jagged.
On the other hand, a higher K averages more voters in each prediction and hence is more resilient to outliers. Larger values of K will have smoother decision boundaries which means lower variance but increased bias.
What we are observing here is that increasing K decreases variance and increases bias, while decreasing K increases variance and decreases bias. At low K, predictions vary widely from one training set to another; as K increases, this variability is reduced. But if we increase K too much, we no longer follow the true boundary line and we observe high bias. This is the nature of the bias-variance tradeoff.
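A quick way to see the tradeoff is to sweep K on the same toy iris setup as above and watch the held-out accuracy; the exact numbers depend on the random split, but very small and very large values of K both tend to hurt:
set.seed(42)
idx <- sample(nrow(iris), 120)
ks  <- c(1, 3, 5, 11, 21, 41, 81)   # candidate neighbourhood sizes
acc <- sapply(ks, function(k) {
  pred <- knn(iris[idx, 1:4], iris[-idx, 1:4],
              cl = iris$Species[idx], k = k)
  mean(pred == iris$Species[-idx])
})
data.frame(k = ks, accuracy = acc)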
Returning for a moment to clustering, it can be broadly divided into two subgroups (a short sketch contrasting the two follows this list):
- Hard clustering: each data object or point either belongs to a cluster completely or not at all. For example, in the Uber dataset, each location belongs to either one borough or another.
- Soft clustering: a data point can belong to more than one cluster, with some probability or likelihood value. For example, some locations could be identified as border points belonging to two or more boroughs.
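Here is a minimal sketch of that distinction on the iris measurements (our own toy example), using base R's kmeans() for the hard case and, assuming the e1071 package is installed, its cmeans() fuzzy c-means implementation for the soft case:
library(e1071)   # provides cmeans(), a fuzzy c-means implementation

set.seed(42)
x <- as.matrix(iris[, 1:4])

# Hard clustering: kmeans() assigns every row to exactly one cluster
hard <- kmeans(x, centers = 3)
head(hard$cluster)                  # one integer label per observation

# Soft clustering: cmeans() returns a degree of membership per cluster
soft <- cmeans(x, centers = 3)
head(round(soft$membership, 2))     # each row sums to 1 across clusters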
7.1 Example 1. Prostate Cancer dataset
library(tidyverse)   # for read_csv() and glimpse()
df <- read_csv("dataset/prostate_cancer.csv")
glimpse(df)
## Observations: 100
## Variables: 10
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
## $ diagnosis_result <chr> "M", "B", "M", "M", "M", "B", "M", "M", "M", "…
## $ radius <dbl> 23, 9, 21, 14, 9, 25, 16, 15, 19, 25, 24, 17, …
## $ texture <dbl> 12, 13, 27, 16, 19, 25, 26, 18, 24, 11, 21, 15…
## $ perimeter <dbl> 151, 133, 130, 78, 135, 83, 120, 90, 88, 84, 1…
## $ area <dbl> 954, 1326, 1203, 386, 1297, 477, 1040, 578, 52…
## $ smoothness <dbl> 0.143, 0.143, 0.125, 0.070, 0.141, 0.128, 0.09…
## $ compactness <dbl> 0.278, 0.079, 0.160, 0.284, 0.133, 0.170, 0.10…
## $ symmetry <dbl> 0.242, 0.181, 0.207, 0.260, 0.181, 0.209, 0.17…
## $ fractal_dimension <dbl> 0.079, 0.057, 0.060, 0.097, 0.059, 0.076, 0.05…
Change the diagnosis result into a factor, then remove the id variable, as it carries no predictive information.
df$diagnosis_result <- factor(df$diagnosis_result, levels = c("B", "M"),
labels = c("Benign", "Malignant"))
df2 <- df %>% select(-id)
# Check how balanced the dependent variable is
prop.table(table(df2$diagnosis_result))
##
## Benign Malignant
## 0.38 0.62
It is quite typical for medical datasets like this one to be unbalanced; we will have to keep this in mind when evaluating the model.
As with PCA, KNN is quite sensitive to the scale of the variables, so it is important to standardize them first. This time we'll do this using the preProcess function of the caret package.
library(caret)
param_preproc_df2 <- preProcess(df2[,2:9], method = c("scale", "center"))
df3_stdize <- predict(param_preproc_df2, df2[, 2:9])
summary(df3_stdize)
## radius texture perimeter area
## Min. :-1.60891 Min. :-1.3923 Min. :-1.8914 Min. :-1.5667
## 1st Qu.:-0.99404 1st Qu.:-0.8146 1st Qu.:-0.6031 1st Qu.:-0.7073
## Median : 0.03074 Median :-0.1406 Median :-0.1174 Median :-0.1842
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.85057 3rd Qu.: 0.7741 3rd Qu.: 0.7379 3rd Qu.: 0.6697
## Max. : 1.67039 Max. : 1.6888 Max. : 3.1770 Max. : 3.6756
## smoothness compactness symmetry fractal_dimension
## Min. :-2.23539 Min. :-1.4507 Min. :-1.8896 Min. :-1.4342
## 1st Qu.:-0.63039 1st Qu.:-0.7556 1st Qu.:-0.6877 1st Qu.:-0.6981
## Median :-0.04986 Median :-0.1341 Median :-0.1030 Median :-0.2073
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.63312 3rd Qu.: 0.4956 3rd Qu.: 0.5142 3rd Qu.: 0.5288
## Max. : 2.75035 Max. : 3.5703 Max. : 3.6001 Max. : 3.9639
We can now see that all means are centered around 0. Now we reconstruct our data frame with the response variable and split it into a training and a testing set.
df3_stdize <- bind_cols(diagnosis = df2$diagnosis_result, df3_stdize)
param_split <- createDataPartition(df3_stdize$diagnosis, times = 1, p = 0.8,
list = FALSE)
train_df3 <- df3_stdize[param_split, ]
test_df3 <- df3_stdize[-param_split, ]
# We can check that the class proportions were preserved by the split
prop.table(table(train_df3$diagnosis))
##
## Benign Malignant
## 0.382716 0.617284
It is nice to see that the proportion of Malignant vs Benign has been preserved.
We use KNN with cross-validation (discussed in more detail in Section 14.3) to train our model.
trnctrl_df3 <- trainControl(method = "cv", number = 10)
model_knn_df3 <- train(diagnosis ~., data = train_df3, method = "knn",
trControl = trnctrl_df3,
tuneLength = 10)
model_knn_df3
## k-Nearest Neighbors
##
## 81 samples
## 8 predictor
## 2 classes: 'Benign', 'Malignant'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 73, 73, 73, 73, 73, 73, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8319444 0.6205678
## 7 0.8555556 0.6662155
## 9 0.8555556 0.6662155
## 11 0.8555556 0.6700251
## 13 0.8555556 0.6662155
## 15 0.8555556 0.6624060
## 17 0.8555556 0.6761056
## 19 0.8305556 0.6260615
## 21 0.8305556 0.6195489
## 23 0.8430556 0.6580104
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 17.
plot(model_knn_df3)
predict_knn_df3 <- predict(model_knn_df3, test_df3)
confusionMatrix(predict_knn_df3, test_df3$diagnosis, positive = "Malignant")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Benign Malignant
## Benign 4 0
## Malignant 3 12
##
## Accuracy : 0.8421
## 95% CI : (0.6042, 0.9662)
## No Information Rate : 0.6316
## P-Value [Acc > NIR] : 0.04241
##
## Kappa : 0.6275
## Mcnemar's Test P-Value : 0.24821
##
## Sensitivity : 1.0000
## Specificity : 0.5714
## Pos Pred Value : 0.8000
## Neg Pred Value : 1.0000
## Prevalence : 0.6316
## Detection Rate : 0.6316
## Detection Prevalence : 0.7895
## Balanced Accuracy : 0.7857
##
## 'Positive' Class : Malignant
##
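Sensitivity is perfect here (every malignant case was caught), while specificity suffers from the three benign cases flagged as malignant. To tie the summary statistics back to the table, the headline metrics can be recomputed by hand:
# Cells of the confusion matrix above, with Malignant as the positive class
TP <- 12; FN <- 0    # malignant cases caught / missed
TN <- 4;  FP <- 3    # benign cases cleared / flagged

TP / (TP + FN)                      # sensitivity = 1.0000
TN / (TN + FP)                      # specificity = 4/7   = 0.5714
(TP + TN) / (TP + TN + FP + FN)     # accuracy    = 16/19 = 0.8421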
7.2 Example 2. Wine dataset
We load the dataset and do some quick cleaning.
df <- read_csv("dataset/Wine_UCI.csv", col_names = FALSE)
colnames(df) <- c("Origin", "Alcohol", "Malic_acid", "Ash", "Alkalinity_of_ash",
"Magnesium", "Total_phenols", "Flavanoids", "Nonflavonoids_phenols",
"Proanthocyanins", "Color_intensity", "Hue", "OD280_OD315_diluted_wines",
"Proline")
glimpse(df)
## Observations: 178
## Variables: 14
## $ Origin <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Alcohol <dbl> 14.23, 13.20, 13.16, 14.37, 13.24, 14.…
## $ Malic_acid <dbl> 1.71, 1.78, 2.36, 1.95, 2.59, 1.76, 1.…
## $ Ash <dbl> 2.43, 2.14, 2.67, 2.50, 2.87, 2.45, 2.…
## $ Alkalinity_of_ash <dbl> 15.6, 11.2, 18.6, 16.8, 21.0, 15.2, 14…
## $ Magnesium <dbl> 127, 100, 101, 113, 118, 112, 96, 121,…
## $ Total_phenols <dbl> 2.80, 2.65, 2.80, 3.85, 2.80, 3.27, 2.…
## $ Flavanoids <dbl> 3.06, 2.76, 3.24, 3.49, 2.69, 3.39, 2.…
## $ Nonflavonoids_phenols <dbl> 0.28, 0.26, 0.30, 0.24, 0.39, 0.34, 0.…
## $ Proanthocyanins <dbl> 2.29, 1.28, 2.81, 2.18, 1.82, 1.97, 1.…
## $ Color_intensity <dbl> 5.64, 4.38, 5.68, 7.80, 4.32, 6.75, 5.…
## $ Hue <dbl> 1.04, 1.05, 1.03, 0.86, 1.04, 1.05, 1.…
## $ OD280_OD315_diluted_wines <dbl> 3.92, 3.40, 3.17, 3.45, 2.93, 2.85, 3.…
## $ Proline <dbl> 1065, 1050, 1185, 1480, 735, 1450, 129…
The origin is our dependent variable. Let’s make it a factor.
df$Origin <- as.factor(df$Origin)
# Let's check the distribution of our explained variable, Origin
round(prop.table(table(df$Origin)), 2)
##
## 1 2 3
## 0.33 0.40 0.27
That's nice: our explained variable is almost equally distributed across the 3 origins.
# Let's also check if we have any NA values
summary(df)
## Origin Alcohol Malic_acid Ash Alkalinity_of_ash
## 1:59 Min. :11.03 Min. :0.740 Min. :1.360 Min. :10.60
## 2:71 1st Qu.:12.36 1st Qu.:1.603 1st Qu.:2.210 1st Qu.:17.20
## 3:48 Median :13.05 Median :1.865 Median :2.360 Median :19.50
## Mean :13.00 Mean :2.336 Mean :2.367 Mean :19.49
## 3rd Qu.:13.68 3rd Qu.:3.083 3rd Qu.:2.558 3rd Qu.:21.50
## Max. :14.83 Max. :5.800 Max. :3.230 Max. :30.00
## Magnesium Total_phenols Flavanoids Nonflavonoids_phenols
## Min. : 70.00 Min. :0.980 Min. :0.340 Min. :0.1300
## 1st Qu.: 88.00 1st Qu.:1.742 1st Qu.:1.205 1st Qu.:0.2700
## Median : 98.00 Median :2.355 Median :2.135 Median :0.3400
## Mean : 99.74 Mean :2.295 Mean :2.029 Mean :0.3619
## 3rd Qu.:107.00 3rd Qu.:2.800 3rd Qu.:2.875 3rd Qu.:0.4375
## Max. :162.00 Max. :3.880 Max. :5.080 Max. :0.6600
## Proanthocyanins Color_intensity Hue
## Min. :0.410 Min. : 1.280 Min. :0.4800
## 1st Qu.:1.250 1st Qu.: 3.220 1st Qu.:0.7825
## Median :1.555 Median : 4.690 Median :0.9650
## Mean :1.591 Mean : 5.058 Mean :0.9574
## 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:1.1200
## Max. :3.580 Max. :13.000 Max. :1.7100
## OD280_OD315_diluted_wines Proline
## Min. :1.270 Min. : 278.0
## 1st Qu.:1.938 1st Qu.: 500.5
## Median :2.780 Median : 673.5
## Mean :2.612 Mean : 746.9
## 3rd Qu.:3.170 3rd Qu.: 985.0
## Max. :4.000 Max. :1680.0
Here we notice that the ranges of values across our variables are quite wide, which means our data will need to be standardized. We also note that we have no NA values. That's quite a nice surprise!
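If we want the missing-value check to be explicit rather than read off the summary, a one-liner does it:
# Count missing values per column; every count is zero for this dataset
colSums(is.na(df))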
7.2.1 Understand the data
We first split our data into a training and a testing set.
df2 <- df
param_split_df2 <- createDataPartition(df2$Origin, p = 0.75, list = FALSE)
train_df2 <- df2[param_split_df2, ]
test_df2 <- df2[-param_split_df2, ]
The great thing with caret is that we can standardize our data within the training phase itself, via the preProcess argument of train().
7.2.1.1 Model the data
Let's keep using caret for our training.
trnctrl_df2 <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
model_knn_df2 <- train(Origin ~., data = train_df2, method = "knn",
trControl = trnctrl_df2,
preProcess = c("center", "scale"),
tuneLength = 10)
model_knn_df2
## k-Nearest Neighbors
##
## 135 samples
## 13 predictor
## 3 classes: '1', '2', '3'
##
## Pre-processing: centered (13), scaled (13)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 123, 123, 121, 121, 121, 122, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9548291 0.9319335
## 7 0.9724664 0.9584705
## 9 0.9748779 0.9618555
## 11 0.9800061 0.9697717
## 13 0.9778083 0.9663714
## 15 0.9801893 0.9699160
## 17 0.9801893 0.9699160
## 19 0.9851343 0.9775459
## 21 0.9825702 0.9736065
## 23 0.9825702 0.9736065
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 19.
plot(model_knn_df2)
Let's use our model to make predictions on the test set.
prediction_knn_df2 <- predict(model_knn_df2, newdata = test_df2)
confusionMatrix(prediction_knn_df2, reference = test_df2$Origin)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 14 1 0
## 2 0 15 0
## 3 0 1 12
##
## Overall Statistics
##
## Accuracy : 0.9535
## 95% CI : (0.8419, 0.9943)
## No Information Rate : 0.3953
## P-Value [Acc > NIR] : 1.02e-14
##
## Kappa : 0.93
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 1.0000 0.8824 1.0000
## Specificity 0.9655 1.0000 0.9677
## Pos Pred Value 0.9333 1.0000 0.9231
## Neg Pred Value 1.0000 0.9286 1.0000
## Prevalence 0.3256 0.3953 0.2791
## Detection Rate 0.3256 0.3488 0.2791
## Detection Prevalence 0.3488 0.3488 0.3023
## Balanced Accuracy 0.9828 0.9412 0.9839
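Only two of the 43 test wines were misclassified, both truly from class 2. As a final note, caret can also return per-class probabilities at prediction time, which for KNN are essentially the proportions of the k = 19 neighbours falling in each class; a quick sketch:
# Per-class probabilities instead of hard labels
prob_knn_df2 <- predict(model_knn_df2, newdata = test_df2, type = "prob")
head(prob_knn_df2)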