Chapter 1 Prerequisites

Welcome to my reference book on machine learning. I have tried to put into it all the tricks, tips, how-tos, and must-knows that I consult almost every time I embark on a data science project. It is impossible to remember every coding practice, hence this book is my data-science-in-R vade mecum.
This book is basically a record of my journey in data analysis. I often spend time reading articles, blog posts, etc. and wish I could gather everything I am learning in one central location. It is a living document with constant additions.

So this book is a compilation of the techniques I have learned along the way. Most of what I have learned came through blog posts, Stack Overflow questions, etc. I am not taking any credit for the great ideas, examples, graphs, etc. found in this web book. Most of what you will find here has been directly taken from blogs, Kaggle kernels, and the like, and I have tried, as far as I could remember, to reference the origins of the ideas. I do take responsibility for all mistakes, typos, unclear explanations, and poor labeling or presentation of graphs. If you find anything that requires improvement, I would be grateful if you let me know: f.deryckel@gmail.com, or even better, post an issue on GitHub.

I am assuming that you are already somewhat familiar with:

  • The math behind most algorithms. This is not a math book.
  • The basics of how to use R. This is not an introductory R book.

I wish you loads of fun in your data science journey, and I hope that this book can contribute positively to your experience.

1.1 Prerequisites and conventions

As much as it makes sense, we will use the tidyverse and the conventions of tidy data throughout our journey.
Besides the hype surrounding the tidyverse, there are a couple of reasons for us to stick with it:

  • Learning a language is hard in itself. If we can be proficient and creative with one, so much the better. The tidyverse packages may not always be the best ones available (the most efficient, the most elegant), but I am happy to learn one opinionated framework inside out in order to apply it effortlessly and creatively.
  • Because many of the tidyverse packages do their background work in C++, they are usually quite efficient.
library(broom)
library(skimr)
library(knitr)
library(kableExtra)
#library(tidyverse)
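As a quick illustration of what "tidy data" means in practice, here is a toy example (the table and column names are hypothetical, made up just for this sketch): each variable gets its own column and each observation its own row, which `tidyr::pivot_longer()` achieves from a "wide" table.

```r
library(tidyr)
library(tibble)

# An untidy (wide) table: one column per year
untidy_df <- tibble(
  country = c("A", "B"),
  `2020`  = c(10, 20),
  `2021`  = c(12, 22)
)

# The tidy (long) version: one row per country-year observation
tidy_df <- untidy_df %>%
  pivot_longer(cols = -country, names_to = "year", values_to = "value")
```

Most tidyverse verbs (and many modeling functions) expect data in this long, one-observation-per-row shape.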

Here are some conventions we will be using throughout the book.

  • df denotes a data frame, usually the data frame built from a raw set of data.
  • We'll use df2, df3, etc. for other, “cleaner” versions of that raw data set.
  • model_pca_xxxx, model_lr_xxxx, etc. denote models; the second part of the name indicates the algorithm.
  • predict_svm_xxxx or predict_mlr_xxxx denotes the outcome of applying a model to a set of independent variables.
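The naming conventions above can be sketched on a built-in data set (the `mtcars` example and the `_mtcars` suffix are hypothetical, chosen only to illustrate the pattern):

```r
library(dplyr)

# df: the raw data frame
df <- mtcars

# df2: a "cleaner" version of the raw data
df2 <- df %>%
  mutate(cyl = as.factor(cyl))

# model_<algorithm>_<data>: here a linear regression (lr) model
model_lr_mtcars <- lm(mpg ~ wt + hp, data = df2)

# predict_<algorithm>_<data>: the outcome of applying the model
predict_lr_mtcars <- predict(model_lr_mtcars, newdata = df2)
```

Keeping the algorithm abbreviation in both the model and prediction names makes it easy to compare several algorithms on the same data set side by side.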

1.2 Organization

The first part of the book covers the nitty-gritty of each machine learning algorithm. We do not really go into the depths of how and why they work the way they do; instead, the focus is on how to leverage R and various R libraries to use the ML algorithms.
The second part of the book is about various case studies. Most of them come either from the UCI Machine Learning Repository or Kaggle.

The two parts can (and maybe should?) be read concomitantly. We use machine learning to model real-life situations, so I see it as essential to go from the algorithms and theory to the case studies and practical applications.

So in the first part, we start by talking about inference and tests in Chapter 2. We then move on to the various linear regression techniques in Chapter 3. Chapter 4 is about logistic regression and the various ways to evaluate a logistic model. We then move on to K Nearest Neighbours in Chapter 7.

After inference and regression, we look into unsupervised machine learning algorithms such as k-means (Chapter 8) and hierarchical clustering (Chapter 9). We finish with Principal Component Analysis, aka PCA, in Chapter 10.

The case studies are ordered by the skill required to approach each practical situation.
Chapter ?? is on the Titanic dataset from the very famous Kaggle competition. This case study is really about exploratory data analysis and feature engineering. Chapter 16 is based on the mushroom dataset; we delve into data partitioning, confusion matrices, and a first go at various algorithms such as decision trees, random forests, and SVM.
Chapter 18 is on the Breast Cancer dataset. There we really focus on model comparison; we also use the LDA and neural network algorithms.

1.3 Packages

In addition to the tidyverse, we also use the following packages:

library(aod)
library(caret)
library(corrr)
library(corrplot)
library(e1071)
library(ggfortify)
library(ggmosaic)
library(grid)
library(kableExtra)
library(leaps)
library(LiblineaR)
library(mice)
library(missForest)
library(pander)
library(randomForest)
library(ROCR)
library(rpart)
library(rpart.plot)
library(RTextTools)
library(simputation)
library(skimr)
library(tm)
library(visdat)
library(stringr)
library(tibble)
library(dplyr)
library(readr)
library(purrr)
library(ggplot2)
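If some of these packages are not yet on your machine, one possible way to install the missing ones in a single pass is sketched below (the vector of names simply mirrors the list above; `grid` is omitted because it ships with base R, and the CRAN mirror URL is an assumption you may want to change):

```r
# Packages used in this book (besides base R)
pkgs <- c("aod", "broom", "caret", "corrr", "corrplot", "e1071",
          "ggfortify", "ggmosaic", "kableExtra", "knitr", "leaps",
          "LiblineaR", "mice", "missForest", "pander", "randomForest",
          "ROCR", "rpart", "rpart.plot", "RTextTools", "simputation",
          "skimr", "tm", "visdat", "tidyverse")

# Install only the packages that are not already present
missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing) > 0) {
  install.packages(missing, repos = "https://cloud.r-project.org")
}
```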