XGBoost

xgboost
tidymodel
Author

Francois de Ryckel

Published

April 15, 2024

Modified

April 16, 2024

Using XGBoost from a quant perspective. We go through a whole cycle of model building on a financial time series, and again show how to do it with both frameworks: scikit-learn for Python and tidymodels for R.

We use a single stock here, but the same approach can be applied to an index, commodity futures, etc.

Setting up the data frame

We load the data set and do the initial cleaning so that the feature engineering can proceed smoothly.

import pandas as pd
import matplotlib.pyplot as plt 
import numpy as np

df = pd.read_csv('../../../raw_data/AA.csv')
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(by = 'date')
df.set_index('date', inplace=True)

df = df[['open', 'high', 'low', 'close', 'volume', 'adjClose']]

df.head()
             open   high    low  close   volume  adjClose
date                                                     
2001-01-02  80.50  80.95  76.60  77.50  1592010     57.23
2001-01-03  77.50  80.50  75.24  78.55  2011985     58.01
2001-01-04  78.55  81.25  77.65  81.10  1992468     59.89
2001-01-05  81.10  81.70  78.85  79.60  1623845     58.78
2001-01-08  79.60  85.91  79.00  80.80  3073616     59.67
df.describe()
              open         high  ...        volume     adjClose
count  5821.000000  5821.000000  ...  5.821000e+03  5821.000000
mean     46.372343    47.097133  ...  6.519558e+06    40.404558
std      24.755757    25.075361  ...  5.452542e+06    18.945874
min       5.500000     5.950000  ...  4.254680e+05     5.360000
25%      24.990000    25.420000  ...  2.656970e+06    23.620000
50%      38.260000    38.780000  ...  5.129900e+06    36.270000
75%      69.210000    69.980000  ...  8.773242e+06    56.910000
max     115.010000   117.190000  ...  1.007518e+08    96.360000

[8 rows x 6 columns]
df.isnull().sum()
open        0
high        0
low         0
close       0
volume      0
adjClose    0
dtype: int64
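With no missing values, a quick visual sanity check of the series we will model is also useful. A minimal sketch, reusing the matplotlib import from above to plot the adjusted close:

# quick visual check of the adjusted close series
plt.clf()
df['adjClose'].plot(title = 'AA - adjusted close')
plt.show()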
library(readr)
library(dplyr)
library(skimr)

dfr = read_csv('../../../raw_data/AA.csv') |> 
  select(date, open, high, low, close, volume, adjClose)

skim(dfr)
Data summary
Name dfr
Number of rows 5821
Number of columns 7
_______________________
Column type frequency:
Date 1
numeric 6
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2001-01-02 2024-02-22 2012-07-27 5821

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
open 0 1 46.37 24.76 5.50 24.99 38.26 69.21 115.01 ▇▇▃▅▁
high 0 1 47.10 25.08 5.95 25.42 38.78 69.98 117.19 ▇▇▃▅▁
low 0 1 45.57 24.38 5.16 24.53 37.72 68.03 111.88 ▇▇▃▅▁
close 0 1 46.33 24.74 5.48 25.01 38.26 69.04 113.78 ▇▇▃▅▁
volume 0 1 6519558.18 5452542.48 425468.00 2656970.00 5129900.00 8773242.00 100751769.00 ▇▁▁▁▁
adjClose 0 1 40.40 18.95 5.36 23.62 36.27 56.91 96.36 ▆▇▅▅▁

Feature engineering

df['returns'] = np.log(df['adjClose'] / df['adjClose'].shift(1))
df['ret_1m'] = df['returns'].rolling(20).sum()

feature_list = []

for r in range(11, 81, 5): 
  df['ret_' + str(r)] = df['returns'].rolling(r).sum()
  df['std_' + str(r)] = df['returns'].rolling(r).std()
  feature_list.append('ret_' + str(r))
  feature_list.append('std_' + str(r))

df1a = df.copy()   # work on a copy so the original frame is left untouched

df1a['o_c'] = (df1a['open'] - df1a['close']) / df1a['close']
df1a['h_l'] = (df1a['high'] - df1a['low']) / df1a['close']
df1a['ret_21d'] = np.log(df1a['close'] / df1a['close'].shift(21))
df1a['roll_sd_ret21d_1Y'] = df1a['ret_21d'].rolling(window = 251).std()
df1a['volum_sma200'] = df1a['volume'].rolling(window = 200).mean()
df1a['perc_above_volu_sma200'] = np.log(df1a['volume'] / df1a['volum_sma200'])
df1a['roll_sd_volum_1Y'] = df1a['volume'].rolling(window = 251).std()
df1a['sma50'] = df1a['close'].rolling(window = 50).mean()
df1a['perc_above_sma50'] = np.log(df1a['close'] / df1a['sma50'])
df1a['sma200'] = df1a['close'].rolling(window = 200).mean()
df1a['perc_above_sma200'] = np.log(df1a['close'] / df1a['sma200'])
df1a['roll_corr_sma50_sma200'] = df1a['sma200'].rolling(window = 252).corr(df1a['sma50'])

# setting up a target variable: 
# is the stock up more than 1% in 41 trading days (about 2 months)? 
df1a['target'] = np.where(df1a['close'].shift(-41) > 1.01 * df1a['close'], 1, 0)

df1a = df1a.drop(['open', 'high', 'low', 'close', 'adjClose', 'volume', 'sma50', 'sma200', 'volum_sma200', 'returns'], axis = 1)
df1a = df1a.dropna()

target = df1a['target']
df1a = df1a.drop(['target'], axis = 1)


df.dropna(inplace = True)  

df1a.values
array([[ 0.15564846,  0.21637398,  0.04071166, ...,  0.05192485,
        -0.27383066,  0.86749277],
       [ 0.1761608 ,  0.22536981,  0.03979403, ...,  0.03793276,
        -0.28789777,  0.87185623],
       [ 0.123086  ,  0.19948936,  0.0422375 , ...,  0.00811607,
        -0.31777658,  0.87606062],
       ...,
       [-0.03425118, -0.09976226,  0.04401878, ..., -0.09835741,
        -0.12808572,  0.89240416],
       [-0.05395427,  0.04088161,  0.03663959, ..., -0.05392038,
        -0.0805411 ,  0.88992296],
       [-0.06992937,  0.00469569,  0.03579207, ..., -0.06330803,
        -0.08669331,  0.8874239 ]])
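Before modelling, it is worth checking how balanced the two classes of the target are, since that colours how we read the confusion matrix and classification report later. A quick sketch using the target series defined above:

# distribution of the two classes in the target
print(target.value_counts())
print(target.value_counts(normalize = True).round(3))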

Base model

from sklearn.model_selection import (train_test_split, RandomizedSearchCV, TimeSeriesSplit)

x_train, x_test, y_train, y_test = train_test_split(df1a, target, test_size = 0.2, random_state = 42, shuffle = False)

print(f"Train set size is {len(x_train)} and test set size is {len(x_test)}")
Train set size is 4296 and test set size is 1075

Let’s now fit a basic model without any tuning. Because we set shuffle = False in the split, the test set is made of the most recent observations, which is what we want for a time series.

from xgboost import XGBClassifier

model_xgb = XGBClassifier(verbosity = 1, random_state = 42)
model_xgb.fit(x_train, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=None, n_jobs=None,
              num_parallel_tree=None, random_state=42, ...)

# and now go onto prediction 
y_pred = model_xgb.predict(x_test)

# or we can also use probability prediction
y_pred_proba = model_xgb.predict_proba(x_test)

And we can check our results on this basic XGBoost model

from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
from sklearn.metrics import ConfusionMatrixDisplay, classification_report, RocCurveDisplay

acc_train = accuracy_score(y_train, model_xgb.predict(x_train))
acc_test = accuracy_score(y_test, model_xgb.predict(x_test))
print(f'Accuracy on the train set: {acc_train:0.3f} and on the test set: {acc_test:0.3f}')


disp = ConfusionMatrixDisplay.from_estimator(
        model_xgb,
        x_test,
        y_test,
        display_labels = model_xgb.classes_,
        cmap=plt.cm.Blues
    )
disp.ax_.set_title('Confusion matrix')
plt.show()

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.49      0.85      0.62       537
           1       0.46      0.13      0.21       538

    accuracy                           0.49      1075
   macro avg       0.48      0.49      0.42      1075
weighted avg       0.48      0.49      0.42      1075

And the ROC curve

#plt.clf()
disp_roc = RocCurveDisplay.from_estimator(
            model_xgb,
            x_test,
            y_test,
            name='XGBoost')
disp_roc.ax_.set_title('ROC Curve')
plt.plot([0,1], [0,1], linestyle='--')
plt.show()
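The roc_auc_score function was imported earlier but not used yet; as a quick sketch, we can also get the numeric AUC from the probability predictions computed above:

# AUC from the probability of the positive class (second column of predict_proba)
auc_test = roc_auc_score(y_test, y_pred_proba[:, 1])
print(f'Test AUC: {auc_test:0.4f}')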

Hyperparameters and fine tuning

from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import RandomizedSearchCV

tscv = TimeSeriesSplit(n_splits = 5, gap = 23)
model_xgb.get_params()
{'objective': 'binary:logistic', 'base_score': None, 'booster': None, 'callbacks': None, 'colsample_bylevel': None, 'colsample_bynode': None, 'colsample_bytree': None, 'device': None, 'early_stopping_rounds': None, 'enable_categorical': False, 'eval_metric': None, 'feature_types': None, 'gamma': None, 'grow_policy': None, 'importance_type': None, 'interaction_constraints': None, 'learning_rate': None, 'max_bin': None, 'max_cat_threshold': None, 'max_cat_to_onehot': None, 'max_delta_step': None, 'max_depth': None, 'max_leaves': None, 'min_child_weight': None, 'missing': nan, 'monotone_constraints': None, 'multi_strategy': None, 'n_estimators': None, 'n_jobs': None, 'num_parallel_tree': None, 'random_state': 42, 'reg_alpha': None, 'reg_lambda': None, 'sampling_method': None, 'scale_pos_weight': None, 'subsample': None, 'tree_method': None, 'validate_parameters': None, 'verbosity': 1}
param_grid = {'learning_rate': [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
              'max_depth': [3, 4, 5, 6, 8, 10, 12, 15],
              'min_child_weight': [1, 3, 5, 7],
              'gamma': [0.0, 0.1, 0.2 , 0.3, 0.4],
              'colsample_bytree': [0.3, 0.4, 0.5 , 0.7]}
            
xv_xgb = RandomizedSearchCV(model_xgb, param_grid, n_iter = 100, scoring = 'f1', cv = tscv, verbose = 1)

xv_xgb.fit(x_train, y_train, verbose = 1)
RandomizedSearchCV(cv=TimeSeriesSplit(gap=23, max_train_size=None, n_splits=5, test_size=None),
                   estimator=XGBClassifier(base_score=None, booster=None,
                                           callbacks=None,
                                           colsample_bylevel=None,
                                           colsample_bynode=None,
                                           colsample_bytree=None, device=None,
                                           early_stopping_rounds=None,
                                           enable_categorical=False,
                                           eval_metric=None, feature_types=None,
                                           gamma=None, grow_policy=...
                                           monotone_constraints=None,
                                           multi_strategy=None,
                                           n_estimators=None, n_jobs=None,
                                           num_parallel_tree=None,
                                           random_state=42, ...),
                   n_iter=100,
                   param_distributions={'colsample_bytree': [0.3, 0.4, 0.5,
                                                             0.7],
                                        'gamma': [0.0, 0.1, 0.2, 0.3, 0.4],
                                        'learning_rate': [0.05, 0.1, 0.15, 0.2,
                                                          0.25, 0.3],
                                        'max_depth': [3, 4, 5, 6, 8, 10, 12,
                                                      15],
                                        'min_child_weight': [1, 3, 5, 7]},
                   scoring='f1', verbose=1)
xv_xgb.best_params_
{'min_child_weight': 7, 'max_depth': 12, 'learning_rate': 0.1, 'gamma': 0.1, 'colsample_bytree': 0.7}
xv_xgb.best_score_
0.5009456435677468
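The search also keeps the full cross-validation results in cv_results_. If we want to see more than the single best combination, a small sketch (using the standard scikit-learn result columns) to list the top-ranked candidates:

# inspect the best-ranked parameter combinations from the randomized search
cv_res = pd.DataFrame(xv_xgb.cv_results_)
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(cv_res[cols].sort_values('rank_test_score').head())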

Now we need to train the model based on the best parameters from the cross-validation process.

from sklearn.model_selection import cross_val_score

model_xgb_tuned = XGBClassifier(**xv_xgb.best_params_)

model_xgb_tuned.fit(x_train, y_train, 
                    eval_set = [(x_train, y_train), (x_test, y_test)],         
                    #eval_metric = 'precision', 
                    verbose = True)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=0.7, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=0.1, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.1, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=12, max_leaves=None,
              min_child_weight=7, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=None, n_jobs=None,
              num_parallel_tree=None, random_state=None, ...)
eval_results = model_xgb_tuned.evals_result()
#eval_results

score = cross_val_score(model_xgb_tuned, x_train, y_train, cv = tscv)
print(f'Mean CV score: {score.mean():0.4}')
Mean CV score: 0.4961
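Since we passed an eval_set when fitting the tuned model, evals_result() holds the metric history on both the training and test sets. A minimal sketch of plotting those learning curves; with binary:logistic, recent xgboost versions record logloss by default, so adjust the metric key if your version reports a different name:

# plot the learning curves recorded during fit (eval_set order: train then test)
train_curve = eval_results['validation_0']['logloss']
test_curve = eval_results['validation_1']['logloss']

plt.clf()
plt.plot(train_curve, label = 'train')
plt.plot(test_curve, label = 'test')
plt.xlabel('boosting round')
plt.ylabel('log loss')
plt.legend()
plt.title('Learning curves for the tuned model')
plt.show()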

Feature importance