x = [4, 5, 10, 4, 3, 11, 14, 8, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]
classes = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

import matplotlib.pyplot as plt
plt.scatter(x, y, c = classes)
Some very basic ML using KNN with Python and the R tidymodels framework.
Introduction
KNN stands for K Nearest Neighbor.
KNN is not really a machine learning technique in the sense of training a model: with KNN, there is no training phase. We wait for new data and only then work out which label (or value) the new data point will get. KNN is therefore called a lazy learner (as opposed to eager learners like SVM or Random Forests). Nonetheless, it is a supervised ML algorithm that can be used for both classification and regression. The intuition behind the model is that observations that are close to each other (close in terms of distance in the feature space) have similar labels (classification) or values (regression).
As mentioned, there is no training phase when using KNN. Instead, there is only prediction.
For a new observation, we find the K observations closest to it. For classification, we look at the labels of those K neighbors and assign the label by majority vote. For regression, we predict the average (or a weighted average) of the target values of the K neighbors.
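To make the mechanics concrete, here is a minimal from-scratch sketch of that prediction step; the helper knn_predict is purely illustrative (not part of this post's code) and uses Euclidean distance with a simple majority vote.

# hypothetical from-scratch sketch of the KNN classification step:
# Euclidean distance + majority vote among the K nearest neighbors
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k = 3):
    # distance from the new point to every stored observation
    distances = np.linalg.norm(np.asarray(X_train) - np.asarray(x_new), axis = 1)
    # indices of the K closest observations
    nearest = np.argsort(distances)[:k]
    # majority vote among their labels
    return Counter(np.asarray(y_train)[nearest]).most_common(1)[0][0]

# tiny usage example
print(knn_predict([[1, 2], [2, 3], [8, 8], [9, 10]], ['a', 'a', 'b', 'b'], x_new = [2, 2]))   # -> 'a'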
Looking at the above image, we can see that, using k=3, the 3 observations closest to the star (our data point to be classified) are all brown circles. Hence we should classify the star as a brown circle rather than as an orange rectangle.
Because KNN uses distances, it is important to scale the data as a pre-processing step. Otherwise, features on a large scale (say, a price) will dominate the distance relative to features on a smaller scale (say, a percentage).
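As a quick, hypothetical illustration of that point (assuming scikit-learn's StandardScaler; this example is not from the original code): with one feature in dollars and one expressed as a rate, the raw distances are driven almost entirely by the dollar column until we standardize.

# hypothetical example: one feature in dollars, one as a rate between 0 and 1
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[250_000, 0.02],
              [300_000, 0.35],
              [255_000, 0.90]])

X_scaled = StandardScaler().fit_transform(X)   # each column now has mean 0 and sd 1

# before scaling, the distance between rows is dominated by the price column
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(X_scaled[0] - X_scaled[1]))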
Using probability terminology, one can say that the KNN method is a direct attempt at approximating the conditional expectation using actual data.
In the case of regression, the estimation function could be written as \[\hat{f}(x) = \text{Average } [y_i|x_i \in \mathcal{N}_k(x)]\] where \(\mathcal{N}_k (x)\) is the neighborhood of x containing the k-closest observations.
In the case of classification, we use a majority voting system.
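As a small worked sketch (again illustrative, not from the original code), the regression estimator above is just the average of the targets of the k closest observations:

# hypothetical helper mirroring f_hat(x) = average of y_i for x_i in N_k(x)
import numpy as np

def knn_regress(X_train, y_train, x_new, k = 3):
    distances = np.linalg.norm(np.asarray(X_train) - np.asarray(x_new), axis = 1)
    nearest = np.argsort(distances)[:k]            # N_k(x): the k closest observations
    return np.mean(np.asarray(y_train)[nearest])   # average of their target values

print(knn_regress([[1], [2], [3], [10]], [1.0, 2.0, 3.0, 10.0], x_new = [2.5]))   # -> 2.0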
Pros-Cons of KNN
Pros
- Easy-to-understand intuition and mathematics (Euclidean distance)
- KNN is non-parametric: it makes no assumptions about the distribution of the data
- Only one main hyperparameter to tune (K)
Cons
- Memory-inefficient: the whole training set has to be stored
- Slow at prediction time, since each new observation has to be compared against the stored training data (see the timing sketch after this list)
- Not suitable for high-dimensional data
- Not suitable for big data sets
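A rough, hypothetical timing sketch of the speed drawback (not from the original post): since there is no trained model, predicting means searching the stored training set, so prediction slows down as the training set grows.

# illustrative timing only; numbers will vary by machine
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
for n in (1_000, 10_000, 100_000):
    X = rng.normal(size = (n, 10))
    y = rng.integers(0, 2, size = n)
    knn = KNeighborsClassifier(n_neighbors = 5).fit(X, y)   # 'training' essentially just stores the data
    start = time.perf_counter()
    knn.predict(rng.normal(size = (1_000, 10)))              # search the stored set for each query
    print(f"n = {n:>7}: {time.perf_counter() - start:.3f}s to predict 1,000 new points")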
Starting example with simple data
Now let’s create a KNN object and a new point
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(list(zip(x, y)), classes)
KNeighborsClassifier(n_neighbors=3)
new_x = 3
new_y = 20
new_point = [(new_x, new_y)]
prediction = knn.predict(new_point)

plt.scatter(x + [new_x], y + [new_y], c = classes + [prediction[0]])
plt.text(x = new_x-1, y = new_y-1, s = f"new point, class:{prediction[0]}")
library(dplyr)
library(ggplot2)
df <- tibble(x = c(4, 5, 10, 4, 3, 11, 14, 8, 10, 12), 
             y = c(21, 19, 24, 17, 16, 25, 24, 22, 21, 21), 
             classes = as.factor(c(0, 0, 1, 0, 0, 1, 1, 0, 1, 1)))
ggplot(df, aes(x = x, y = y, color = classes)) +
geom_point() +
theme(legend.position = 'none')
Using the tidymodels framework, we first create a recipe, then apply the transformations to the original data. We then apply these transformations to the new point as well. Using 'parsnip', we create the KNN object.
library(recipes)
library(parsnip)
df_rec <- recipe(classes ~ ., data = df) |> 
  step_normalize(-classes) |> 
  prep()

df_juiced <- juice(df_rec)

new_point = tibble(x = c(3), y = c(20))

df_baked <- bake(df_rec, new_data = new_point)

knn_model <- nearest_neighbor(neighbors = 3) |> 
  set_engine('kknn') |> 
  set_mode('classification')

knn_fit <- knn_model |> fit(classes ~., data = df_juiced)
knn_fit
parsnip model object
Call:
kknn::train.kknn(formula = classes ~ ., data = data, ks = min_rows(3, data, 5))
Type of response variable: nominal
Minimal misclassification: 0.1
Best kernel: optimal
Best k: 3
# evaluate the model now on new data
prediction <- predict(object = knn_fit, new_data = df_baked)

new_point_pred <- bind_cols(new_point, prediction) |> 
  rename(classes = .pred_class)

ggplot(data = bind_rows(df, new_point_pred), 
       aes(x = x, y = y, color = classes)) + 
  geom_point() + 
  geom_text(data = new_point_pred, 
            mapping = aes(label = paste0('new point, class:', new_point_pred$classes)), 
            nudge_x = 0.9, nudge_y = -0.2)
Example with synthetic data
In this example, we create a data set of 1000 observations (using numbers drawn from a normal distribution with a sd of 2). We'll make 4 clusters of 250 observations each. Because both features are generated on the same scale, there is no need to scale them this time. We'll then do the usual 80-20% split into training and testing sets, fit KNN with either K = 5 or K = 19, and then check the accuracy scores.
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
# create our synthetic data
X, y = make_blobs(n_samples = 1000, n_features = 2, 
                  centers = 4, cluster_std = 2, 
                  random_state = 4)

# visualizing the dataset
plt.scatter(X[:, 0], X[:, 1], c = y, s = 20)
Splitting our dataset into training & testing + running KNN on the data
# splitting our data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y, random_state = 0)

knn5 = KNeighborsClassifier(n_neighbors = 5)
knn19 = KNeighborsClassifier(n_neighbors = 19)
# fit our 'model' with either '5' or '19' Nearest Neighbors
knn5.fit(X_train, y_train)
KNeighborsClassifier()
knn19.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=19)
# apply prediction on our test set
y_pred_5 = knn5.predict(X_test)
y_pred_19 = knn19.predict(X_test)
from sklearn.metrics import accuracy_score
print('Accuracy with K = 5 is', round(accuracy_score(y_test, y_pred_5)*100, 2), '%')
Accuracy with K = 5 is 89.0 %
print('Accuracy with k = 19 is', round(accuracy_score(y_test, y_pred_19)*100, 2), '%')
Accuracy with k = 19 is 89.0 %
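Both values of K happen to give the same test accuracy here. A quick way to look at the effect of K more systematically (a hypothetical addition, not in the original post, reusing the split above) is to scan a range of values and plot the test accuracy:

# scan several values of K and compare test accuracy (illustrative addition)
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

k_values = range(1, 31)
scores = [accuracy_score(y_test, KNeighborsClassifier(n_neighbors = k)
                                 .fit(X_train, y_train)
                                 .predict(X_test))
          for k in k_values]

plt.plot(k_values, scores, marker = 'o')
plt.xlabel('K')
plt.ylabel('Test accuracy')
plt.title('Test accuracy as a function of K')
plt.show()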
Let’s visualize both ‘models’ and the impact of the choice of K.
# using subplots to compare
plt.figure(figsize = (9, 5))

# first subplot
plt.subplot(1, 2, 1)
plt.scatter(X_test[:, 0], X_test[:, 1], c = y_pred_5, s = 20)
plt.title('Predictions with K=5')

# second subplot
plt.subplot(1, 2, 2)
plt.scatter(X_test[:, 0], X_test[:, 1], c = y_pred_19, s = 20)
plt.title('Prediction with K=19')
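To see where the two choices of K actually disagree, one could also predict on a grid of points and draw the decision regions; this is a hypothetical extra visualization (not part of the original post), reusing X, X_test, y_test, knn5 and knn19 from above.

# color a grid of points by predicted class to show the decision regions
import numpy as np
import matplotlib.pyplot as plt

xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
grid = np.c_[xx.ravel(), yy.ravel()]

plt.figure(figsize = (9, 5))
for i, (name, model) in enumerate([('K=5', knn5), ('K=19', knn19)], start = 1):
    plt.subplot(1, 2, i)
    plt.contourf(xx, yy, model.predict(grid).reshape(xx.shape), alpha = 0.3)
    plt.scatter(X_test[:, 0], X_test[:, 1], c = y_test, s = 20)
    plt.title(f'Decision regions with {name}')
plt.show()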
# create our synthetic data
df1 <- tibble(x = rnorm(n = 250, mean = 0, sd = 2) + 4, 
              y = rnorm(n = 250, mean = 0, sd = 2) - 6, 
              classes = c(1))
df2 <- tibble(x = rnorm(n = 250, mean = 0, sd = 2) + 9, 
              y = rnorm(n = 250, mean = 0, sd = 2) - 9, 
              classes = c(2))
df3 <- tibble(x = rnorm(n = 250, mean = 0, sd = 2) + 9, 
              y = rnorm(n = 250, mean = 0, sd = 2) + 2, 
              classes = c(3))
df4 <- tibble(x = rnorm(n = 250, mean = 0, sd = 2) + 10, 
              y = rnorm(n = 250, mean = 0, sd = 2) + 5, 
              classes = c(4))

df <- bind_rows(df1, df2, df3, df4) |> 
  mutate(classes = as.factor(classes))
# visualizing the dataset
ggplot(df, aes(x, y, color = classes)) +
geom_point() +
theme(legend.position = 'none')
This time, we split the data into a training and a testing set.
Also, because the features were already on the same scale from the generating process, there is no need to redo the normalization step.
library(rsample)
library(yardstick)
df_split <- initial_split(df, prop = 0.8, strata = classes)
df_train <- training(df_split)
df_test <- testing(df_split)

# let's first try the ideal model
knn_model_5 <- nearest_neighbor(neighbors = 5) |> 
  set_engine('kknn') |> 
  set_mode('classification')

knn_model_19 <- nearest_neighbor(neighbors = 19) |> 
  set_engine('kknn') |> 
  set_mode('classification')

knn_fit_5 <- knn_model_5 |> fit(classes ~., data = df_train)
knn5 <- knn_fit_5 |> predict(df_test) |> 
  bind_cols(df_test) |> mutate(model = 'knn5')

knn5 |> metrics(truth = classes, estimate = .pred_class)
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy multiclass 0.825
2 kap multiclass 0.767
knn_fit_19 <- knn_model_19 |> fit(classes ~., data = df_train)
knn19 <- knn_fit_19 |> predict(df_test) |> 
  bind_cols(df_test) |> mutate(model = 'knn19')

knn19 |> metrics(truth = classes, estimate = .pred_class)
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy multiclass 0.8
2 kap multiclass 0.733
Let’s visualize both ‘models’ and the impact of the choice of K.
df_pred <- bind_rows(knn5, knn19)
ggplot(df_pred, aes(x = x, y = y, color = .pred_class)) +
facet_wrap(~ model, nrow = 1) +
geom_point() +
theme(legend.position = 'none')
Because the data are already pretty well separated, the only differences we can easily spot are at the junctions between two clusters of observations.
Example with a financial / time-series dataset
This time, we are going to use a stock price to perform KNN.
Loading the data, setting up and feature engineering
Loading and checking the data
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('../../../raw_data/AA.csv')
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(by = 'date', inplace = False)
df.set_index('date', inplace=True)
df.shape
(5821, 12)
df.head()
open high low ... vwap label changeOverTime
date ...
2001-01-02 80.50 80.95 76.60 ... 78.35 January 02, 01 -0.0373
2001-01-03 77.50 80.50 75.24 ... 78.10 January 03, 01 0.0135
2001-01-04 78.55 81.25 77.65 ... 80.00 January 04, 01 0.0325
2001-01-05 81.10 81.70 78.85 ... 80.05 January 05, 01 -0.0185
2001-01-08 79.60 85.91 79.00 ... 81.90 January 08, 01 0.0151
[5 rows x 12 columns]
plt.plot(df['adjClose'])
plt.show()
# only keep useful columns
df1a = df.drop(['unadjustedVolume', 'change', 'changePercent', 'vwap', 'label', 'changeOverTime'], axis = 1)
# or easier actually ;-)
df1b = df.iloc[:, :5]
df1a.describe()
open high ... adjClose volume
count 5821.000000 5821.000000 ... 5821.000000 5.821000e+03
mean 46.372343 47.097133 ... 40.404558 6.519558e+06
std 24.755757 25.075361 ... 18.945874 5.452542e+06
min 5.500000 5.950000 ... 5.360000 4.254680e+05
25% 24.990000 25.420000 ... 23.620000 2.656970e+06
50% 38.260000 38.780000 ... 36.270000 5.129900e+06
75% 69.210000 69.980000 ... 56.910000 8.773242e+06
max 115.010000 117.190000 ... 96.360000 1.007518e+08
[8 rows x 6 columns]
# check missing values
df1a.isnull().sum()
open 0
high 0
low 0
close 0
adjClose 0
volume 0
dtype: int64
No missing values and we can go ahead!
Setting up a few predictors.
import numpy as np

df1a['o_c'] = (df1a['open'] - df1a['close']) / df1a['close']
df1a['h_l'] = (df1a['high'] - df1a['low']) / df1a['close']
df1a['ret_21d'] = np.log(df1a['close'] / df1a['close'].shift(21))
df1a['roll_sd_ret21d_1Y'] = df1a['ret_21d'].rolling(window = 251).std()
df1a['volum_sma200'] = df1a['volume'].rolling(window = 200).mean()
df1a['perc_above_volu_sma200'] = np.log(df1a['volume'] / df1a['volum_sma200'])
df1a['roll_sd_volum_1Y'] = df1a['volume'].rolling(window = 251).std()
df1a['sma50'] = df1a['close'].rolling(window = 50).mean()
df1a['perc_above_sma50'] = np.log(df1a['close'] / df1a['sma50'])
df1a['sma200'] = df1a['close'].rolling(window = 200).mean()
df1a['perc_above_sma200'] = np.log(df1a['close'] / df1a['sma200'])
df1a['roll_corr_sma50_sma200'] = df1a['sma200'].rolling(window = 252).corr(df1a['sma50'])

# target: 1 if the close 11 days ahead is more than 5% above today's close, -1 otherwise
df1a['target'] = np.where(df1a['close'].shift(-11) > 1.05*df['close'], 1, -1)

df1a = df1a.drop(['open', 'high', 'low', 'close', 'adjClose', 'volume', 'sma50', 'sma200', 'volum_sma200'], axis = 1)
df1a = df1a.dropna()
target = df1a['target']
df1a = df1a.drop(['target'], axis = 1)
df1a.shape
(5371, 9)
Splitting the dataset for training and testing
from sklearn.model_selection import (train_test_split, GridSearchCV)
x_train, x_test, y_train, y_test = train_test_split(df1a, target, test_size = 0.2, shuffle = False)
print(f"The size for the train and test dataset are {len(x_train)}, {len(x_test)} observations")
The size for the train and test dataset are 4296, 1075 observations
library(readr)
library(dplyr)
library(ggplot2)
df <- read_csv('../../../raw_data/AA.csv') |> 
  select(date, open, high, low, close, volume)
ggplot(df, aes(x = date, y = close)) +
geom_line()
# check for missing values
df |> summarize_all(funs(sum(is.na(.))))
# A tibble: 1 × 6
date open high low close volume
<int> <int> <int> <int> <int> <int>
1 0 0 0 0 0 0
No missing values, we can go ahead!
Base Model
This is where we create the model and generate the predictions.
Now we will start building our KNN model using a pipeline (or a workflow in R). First, we need to scale the data, then we can move on to the classification task.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
knn_model = Pipeline([
    ('scaler', MinMaxScaler()), 
    ('classifier', KNeighborsClassifier())
])

knn_model.fit(x_train, y_train)
Pipeline(steps=[('scaler', MinMaxScaler()), ('classifier', KNeighborsClassifier())])
Moving on to predictions
from sklearn.metrics import (accuracy_score, ConfusionMatrixDisplay, classification_report)
y_pred = knn_model.predict(x_test)

# or we can also use a probability model
y_pred_proba = knn_model.predict_proba(x_test)

# a few checks
knn_model.classes_
array([-1, 1])
y_pred_proba[-10:, ]
array([[0.4, 0.6],
[1. , 0. ],
[0.8, 0.2],
[0.4, 0.6],
[0.8, 0.2],
[0.6, 0.4],
[0.8, 0.2],
[0.8, 0.2],
[1. , 0. ],
[1. , 0. ]])
# checking accuracy
acc_train = accuracy_score(y_train, knn_model.predict(x_train))
acc_test = accuracy_score(y_test, knn_model.predict(x_test))
print(f"Accuracy for training set is {acc_train:0.3} and Accuracy for testing set is {acc_test:0.3}")
Accuracy for training set is 0.886 and Accuracy for testing set is 0.582
Checking quality of predictions
conf_mat = ConfusionMatrixDisplay.from_estimator(
    knn_model, x_test, y_test, cmap = plt.cm.Blues
)
plt.title('Confusion Matrix')
plt.show()
print(classification_report(y_test, y_pred))
precision recall f1-score support
-1 0.64 0.81 0.71 687
1 0.35 0.18 0.24 388
accuracy 0.58 1075
macro avg 0.49 0.49 0.48 1075
weighted avg 0.53 0.58 0.54 1075
from sklearn.metrics import (roc_curve, RocCurveDisplay)
rocCurve = RocCurveDisplay.from_estimator(knn_model, x_test, y_test, name = 'Tuned KNN')
plt.title('ROC Curve')
plt.plot([0,1], [0,1], linestyle = '--', label = 'Random 50:50')
plt.legend()
plt.show()
Hyperparameter Tuning
We can always try to fine-tune our KNN algorithm to see if we can get a better result on our test set.
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=10, gap=10)
#for train, test in tscv.split(df1a):
#    print(train, test)
Let’s find the best parameters through a grid search
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score, auc
knn_model.get_params()
{'memory': None, 'steps': [('scaler', MinMaxScaler()), ('classifier', KNeighborsClassifier())], 'verbose': False, 'scaler': MinMaxScaler(), 'classifier': KNeighborsClassifier(), 'scaler__clip': False, 'scaler__copy': True, 'scaler__feature_range': (0, 1), 'classifier__algorithm': 'auto', 'classifier__leaf_size': 30, 'classifier__metric': 'minkowski', 'classifier__metric_params': None, 'classifier__n_jobs': None, 'classifier__n_neighbors': 5, 'classifier__p': 2, 'classifier__weights': 'uniform'}
param_grid = {'classifier__n_neighbors': np.arange(2, 51, 1)}

gs = GridSearchCV(knn_model, param_grid, scoring = 'roc_auc', n_jobs = -1, cv = tscv, verbose = 1)
gs.fit(x_train, y_train)
GridSearchCV(cv=TimeSeriesSplit(gap=10, max_train_size=None, n_splits=10, test_size=None), estimator=Pipeline(steps=[('scaler', MinMaxScaler()), ('classifier', KNeighborsClassifier())]), n_jobs=-1, param_grid={'classifier__n_neighbors': array([ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50])}, scoring='roc_auc', verbose=1)
print(f"Optimal Neighbours: {gs.best_params_['classifier__n_neighbors']}, Best, Score: {round(gs.best_score_,4)}")
Optimal Neighbours: 49, Best, Score: 0.4906
Let’s now use the best parameter found for our model.
tuned_knn_model = KNeighborsClassifier(n_neighbors = gs.best_params_['classifier__n_neighbors'])

tuned_knn_model.fit(x_train, y_train)
KNeighborsClassifier(n_neighbors=49)
y_pred_tuned = tuned_knn_model.predict(x_test)

acc_train = accuracy_score(y_train, tuned_knn_model.predict(x_train))
acc_test = accuracy_score(y_test, y_pred_tuned)
print(f'\n Training Accuracy \t: {acc_train :0.4} \n Test Accuracy \t\t: {acc_test :0.4}')
Training Accuracy : 0.7691
Test Accuracy : 0.6186
disp = ConfusionMatrixDisplay.from_estimator(
    tuned_knn_model,
    x_test,
    y_test,
    display_labels = tuned_knn_model.classes_,
    cmap = plt.cm.Blues
)
plt.title('Confusion matrix')
plt.show()
print(classification_report(y_test, y_pred_tuned))
precision recall f1-score support
-1 0.63 0.95 0.76 687
1 0.27 0.03 0.06 388
accuracy 0.62 1075
macro avg 0.45 0.49 0.41 1075
weighted avg 0.50 0.62 0.51 1075
Example with the adult dataset (from Kaggle)
import numpy as np
import pandas as pd
data = pd.read_csv('../../../raw_data/adult.csv')
data.head()
age workclass fnlwgt ... hours-per-week native-country income
0 25 Private 226802 ... 40 United-States <=50K
1 38 Private 89814 ... 50 United-States <=50K
2 28 Local-gov 336951 ... 40 United-States >50K
3 44 Private 160323 ... 40 United-States >50K
4 18 ? 103497 ... 30 United-States <=50K
[5 rows x 15 columns]
Let's do some cleaning of the data. All the cleaning ideas come from Kaggle notebooks (I have not deeply analysed the data).
# in a few columns, many cells have '?', let's replace those with NaN
data['workclass'] = data['workclass'].replace('?', np.nan)
data['occupation'] = data['occupation'].replace('?', np.nan)
data['native-country'] = data['native-country'].replace('?', np.nan)

df = data.copy()

# dropping rows with NA
df.dropna(how = 'any', inplace = True)

# dropping duplicate rows
df = df.drop_duplicates()

# drop columns