The logit is the log of the odds. It can be mapped back to a probability (with the sigmoid function) and the probability thresholded to produce a class.
The logistic sigmoid is defined as \[f(x) = \frac{1}{1+e^{-x}} = \frac{e^x}{e^x+1}\] The sigmoid transforms values from \(-\infty \lt x \lt \infty\) into the interval \(0 < f(x) < 1\).
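The definition above can be sketched directly in Python. This is a minimal illustration (the function name `sigmoid` is ours); the two algebraically equivalent forms of \(f(x)\) are used on either side of zero to avoid overflow in `exp` for large \(|x|\):

```python
import math

def sigmoid(x):
    """Logistic sigmoid: maps any real x into the open interval (0, 1)."""
    if x >= 0:
        # 1 / (1 + e^{-x}) is safe here: e^{-x} cannot overflow for x >= 0.
        return 1.0 / (1.0 + math.exp(-x))
    # For x < 0, use the equivalent form e^x / (e^x + 1) so exp never overflows.
    ex = math.exp(x)
    return ex / (ex + 1.0)

print(sigmoid(0.0))   # 0.5, the midpoint of the curve
print(sigmoid(5.0))   # close to 1
print(sigmoid(-5.0))  # close to 0
```

Note the symmetry \(f(-x) = 1 - f(x)\), which follows from the two equivalent forms of the definition.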
Suppose we modeled \(p(x)\) as a linear function of \(x\). The problem is that \(p\) is a probability, which must lie between 0 and 1, whereas a linear function is unbounded. To address this, we instead model the log-odds of \(p(x)\) as a linear function of \(x\), using the logit transformation: \[\log \frac{p(x)}{1-p(x)} = \alpha_0 + \alpha \cdot x\] The logit maps probabilities in \((0,1)\) onto the whole real line, so its inverse maps the unbounded linear predictor back into \((0,1)\).
Solving for \(p(x)\): exponentiate both sides, isolate \(p(x)\), and factor out the coefficient. We get: \[p(x) = \frac{e^{\alpha_0 + \alpha \cdot x}}{e^{\alpha_0 + \alpha \cdot x}+1}\]
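A quick numerical check of this derivation: computing \(p(x)\) from the linear predictor and then applying the logit should recover \(\alpha_0 + \alpha \cdot x\) exactly. The function names `predict_proba` and `logit` are illustrative, not from a particular library:

```python
import math

def predict_proba(x, alpha0, alpha):
    """p(x) = e^{alpha0 + alpha*x} / (e^{alpha0 + alpha*x} + 1)."""
    z = alpha0 + alpha * x
    return 1.0 / (1.0 + math.exp(-z))  # algebraically equal to e^z / (e^z + 1)

def logit(p):
    """Inverse of the sigmoid: the log-odds log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# logit(p(x)) recovers the linear predictor alpha0 + alpha*x.
alpha0, alpha, x = -1.0, 0.5, 3.0
print(logit(predict_proba(x, alpha0, alpha)))  # should equal -1.0 + 0.5*3.0 = 0.5
```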
Since logistic regression predicts probabilities, we can fit it by maximum likelihood. For each training point \(x_i\) with observed class \(y_i \in \{0, 1\}\), the likelihood is \[L(\alpha_0, \alpha) = \prod_{i=1}^n p(x_i)^{y_i} \left(1-p(x_i) \right)^{1-y_i}\] Taking the log of both sides turns the product into a sum: \[\log L(\alpha_0, \alpha) = \sum_{i=1}^n y_i \cdot \log p(x_i) + (1-y_i) \cdot \log\left(1-p(x_i)\right)\]