Logistic Regression

A dive into Logistic Regression.
logistic-regression
Author

Francois de Ryckel

Published

May 15, 2024

Modified

May 22, 2024

Logistic regression predicts a categorical variable from quantitative data. It does this using the logit function, which is linear in the features:

\[logit = log(odds) = w_0+w_1x_1+w_2x_2+ \cdots + w_nx_n\]

The logit is the log of the odds. It can be mapped back into a probability with the sigmoid function, and that probability is then turned into a class, typically by thresholding it (for example at 0.5).

The logistic sigmoid is defined as \[f(x) = \frac{1}{1+e^{-x}} = \frac{e^x}{e^x+1}\] The sigmoid maps any real value \(-\infty \lt x \lt \infty\) into the interval \(0 < f(x) < 1\), which is why its output can be read as a probability.
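As a quick numerical check, here is a minimal sketch that evaluates the sigmoid at a few points and inverts it with the logit. The helper names `sigmoid` and `logit`, the sample inputs, and the 0.5 classification threshold are illustrative assumptions, not part of the original post.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x use the equivalent form e^x / (e^x + 1)
    ex = math.exp(x)
    return ex / (ex + 1.0)

def logit(p: float) -> float:
    """Log of the odds for a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

# Arbitrary sample inputs, chosen only for illustration
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    p = sigmoid(x)
    label = int(p >= 0.5)  # the 0.5 threshold is an assumption
    print(f"x={x:+.1f}  p={p:.4f}  logit(p)={logit(p):+.4f}  class={label}")
```

Applying `logit` to the output of `sigmoid` recovers the original input (up to floating-point error), which is the sense in which the two functions are inverses of each other.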

Suppose we modelled the probability p(x) directly as a linear function of x. The problem is that a probability must lie between 0 and 1, whereas a linear function is unbounded. Modelling log p(x) as a linear function of x keeps p(x) positive but still leaves it unbounded above, so we go one step further and use the logit transformation, which keeps p(x) in the range (0,1).

\[log \left( \frac{p(x)}{1-p(x)} \right) = \alpha_0 + \alpha \cdot x\]

Solve for \(p(x)\) by exponentiating both sides, isolating \(p(x)\) and factoring. We get: \[p(x) = \frac{e^{\alpha_0 + \alpha \cdot x}}{e^{\alpha_0 + \alpha \cdot x}+1}\]
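The algebra behind that result can be spelled out step by step. The shorthand \(z = \alpha_0 + \alpha \cdot x\) is introduced here only for brevity; it is not notation from the text above.

\[
\begin{aligned}
log \left( \frac{p(x)}{1-p(x)} \right) &= z \\
\frac{p(x)}{1-p(x)} &= e^{z} \\
p(x) &= e^{z} - e^{z} \, p(x) \\
p(x) \left( 1 + e^{z} \right) &= e^{z} \\
p(x) &= \frac{e^{z}}{e^{z}+1} = \frac{1}{1+e^{-z}}
\end{aligned}
\]

The last line is the logistic sigmoid evaluated at \(z\).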

Since logistic regression predicts probabilities, we can fit it by maximum likelihood. For each training data point \(x_i\), the observed class is \(y_i\) (either 0 or 1). The likelihood can then be written as: \[L(\alpha_0, \alpha) = \prod_{i=1}^n p(x_i)^{y_i} \left(1-p(x_i) \right)^{1-y_i}\] Taking the log of both sides transforms that product into a sum: \[Log(L(\alpha_0, \alpha)) = \sum_{i=1}^n \left[ y_i \cdot log(p(x_i)) + (1-y_i) \cdot log(1-p(x_i)) \right]\]
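To make the log-likelihood concrete, here is a minimal sketch that evaluates it for a single feature and maximises it with plain gradient ascent. Everything specific in it is an assumption made for illustration: the toy data are made up, the function names are hypothetical, and gradient ascent with a fixed learning rate is just one simple way to maximise the sum (libraries typically use more refined solvers).

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (ez + 1.0)

def log_likelihood(alpha0, alpha, xs, ys):
    """Sum over all points of y*log(p) + (1-y)*log(1-p)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(alpha0 + alpha * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit(xs, ys, lr=0.05, steps=5000):
    """Maximise the log-likelihood with fixed-step gradient ascent."""
    alpha0, alpha = 0.0, 0.0
    for _ in range(steps):
        # Partial derivatives of the log-likelihood:
        # sum(y - p) for alpha0 and sum((y - p) * x) for alpha
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(alpha0 + alpha * x)
            g0 += err
            g1 += err * x
        alpha0 += lr * g0
        alpha += lr * g1
    return alpha0, alpha

# Made-up toy data (not from XME.csv): one feature, binary labels
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

ll_start = log_likelihood(0.0, 0.0, xs, ys)
a0, a = fit(xs, ys)
ll_end = log_likelihood(a0, a, xs, ys)
print(f"log-likelihood at (0, 0): {ll_start:.4f}")
print(f"fitted alpha0={a0:.4f}, alpha={a:.4f}, log-likelihood={ll_end:.4f}")
```

The toy labels overlap on purpose (a 1 at x = 2.0 and a 0 at x = 2.5): with perfectly separable data the likelihood has no finite maximum and the coefficients would keep growing.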

Examples

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

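# Load the XME price data, using the first column (the dates) as the index
df = pd.read_csv('../../../raw_data/XME.csv', index_col = 0, parse_dates = True, dayfirst=True)
# Sort chronologically, oldest row first
df = df.sort_index(ascending=True)
df.info()
df.head(5)

# Plot the closing price over time
plt.plot(df['close'])
plt.show()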
<class 'pandas.core.frame.DataFrame'>
Index: 4500 entries, 2006-06-22 to 2024-05-08
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   open              4500 non-null   float64
 1   high              4500 non-null   float64
 2   low               4500 non-null   float64
 3   close             4500 non-null   float64
 4   adjClose          4500 non-null   float64
 5   volume            4500 non-null   int64  
 6   unadjustedVolume  4500 non-null   int64  
 7   change            4500 non-null   float64
 8   changePercent     4500 non-null   float64
 9   vwap              4500 non-null   float64
 10  label             4500 non-null   object 
 11  changeOverTime    4500 non-null   float64
dtypes: float64(9), int64(2), object(1)
memory usage: 457.0+ KB

df.describe().T

# checking for missing values
# df.isnull().sum()
                   count           mean            std       min            25%            50%            75%            max
open              4500.0   4.224556e+01   1.496800e+01   11.7800   3.001000e+01   4.150000e+01   5.173500e+01   9.535000e+01
high              4500.0   4.279536e+01   1.514187e+01   12.1900   3.040000e+01   4.195500e+01   5.241250e+01   9.609000e+01
low               4500.0   4.160857e+01   1.472447e+01   11.3800   2.963000e+01   4.100500e+01   5.110250e+01   9.353000e+01
close             4500.0   4.220901e+01   1.494711e+01   11.9700   2.998750e+01   4.143500e+01   5.175000e+01   9.458000e+01
adjClose          4500.0   3.686377e+01   1.241720e+01   10.6000   2.709750e+01   3.530500e+01   4.583000e+01   7.545000e+01
volume            4500.0   3.141904e+06   2.199371e+06    0.0000   1.769072e+06   2.725635e+06   4.039529e+06   2.454848e+07
unadjustedVolume  4500.0   3.141904e+06   2.199371e+06    0.0000   1.769072e+06   2.725635e+06   4.039529e+06   2.454848e+07
change            4500.0  -3.655796e-02   8.801642e-01  -11.2400  -4.200000e-01  -2.001000e-02   3.900000e-01   3.980000e+00
changePercent     4500.0  -6.299599e-02   2.019533e+00  -13.8100  -1.110000e+00  -5.836411e-02   1.020000e+00   1.414000e+01
vwap              4500.0   4.221463e+01   1.493961e+01   11.8625   2.999438e+01   4.150750e+01   5.178875e+01   9.488750e+01
changeOverTime    4500.0  -6.299599e-04   2.019533e-02   -0.1381  -1.110000e-02  -5.836411e-04   1.020000e-02   1.414000e-01