Chapter 17 Case study - The adult dataset
17.1 Introduction
The adult data set is another well-known one from the UCI Machine Learning Repository. It is also known as the “Census Income” dataset and was extracted by Barry Becker from the 1994 Census database.
The goal is to predict whether a person's income exceeds $50K/yr based on census data.
Load the libraries
library(tidyverse) # provides read_csv(), glimpse(), map_*() and %>%
library(stringr)
17.2 Import the data
df <- read_csv("dataset/adult.csv")
glimpse(df)
## Observations: 32,561
## Variables: 15
## $ AGE <dbl> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30, 23…
## $ WORKCLASS <chr> "State-gov", "Self-emp-not-inc", "Private", "Priva…
## $ FNLWGT <dbl> 77516, 83311, 215646, 234721, 338409, 284582, 1601…
## $ EDUCATION <chr> "Bachelors", "Bachelors", "HS-grad", "11th", "Bach…
## $ EDUCATIONNUM <dbl> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13, 12…
## $ MARITALSTATUS <chr> "Never-married", "Married-civ-spouse", "Divorced",…
## $ OCCUPATION <chr> "Adm-clerical", "Exec-managerial", "Handlers-clean…
## $ RELATIONSHIP <chr> "Not-in-family", "Husband", "Not-in-family", "Husb…
## $ RACE <chr> "White", "White", "White", "Black", "Black", "Whit…
## $ SEX <chr> "Male", "Male", "Male", "Male", "Female", "Female"…
## $ CAPITALGAIN <dbl> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0, 0…
## $ CAPITALLOSS <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ HOURSPERWEEK <dbl> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40, 30…
## $ NATIVECOUNTRY <chr> "United-States", "United-States", "United-States",…
## $ ABOVE50K <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0,…
17.3 Tidy the data
Let’s check how many values are missing in each column.
map_dbl(df, function(.x) sum(is.na(.x)))
##           AGE     WORKCLASS        FNLWGT     EDUCATION  EDUCATIONNUM
##             0             0             0             0             0
## MARITALSTATUS    OCCUPATION  RELATIONSHIP          RACE           SEX
##             0             0             0             0             0
##   CAPITALGAIN   CAPITALLOSS  HOURSPERWEEK NATIVECOUNTRY      ABOVE50K
##             0             0             0             0             0
No missing data! That’s great news.
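The per-column check above can be sketched on a small toy table. This is only an illustration: the `toy` tibble and its values are made up and are not part of the adult dataset. It shows both the count and the share of missing values per column.

```r
library(purrr)
library(tibble)

# Toy data (hypothetical, not the adult dataset) with a few missing values
toy <- tibble(
  age = c(39, NA, 38, 53),
  workclass = c("Private", "State-gov", NA, NA)
)

# Number of missing values per column (sum() of a logical vector is an integer)
na_count <- map_int(toy, ~ sum(is.na(.x)))

# Share of missing values per column
na_share <- map_dbl(toy, ~ mean(is.na(.x)))

na_count
na_share
```

Reporting the share next to the count makes it easier to judge whether a column with missing values is still usable.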
Before we change the character columns into factors, let’s look at the unique values each of them takes.
df %>% select_if(is.character) %>% map(unique)
## $WORKCLASS
## [1] "State-gov" "Self-emp-not-inc" "Private"
## [4] "Federal-gov" "Local-gov" "?"
## [7] "Self-emp-inc" "Without-pay" "Never-worked"
##
## $EDUCATION
## [1] "Bachelors" "HS-grad" "11th" "Masters"
## [5] "9th" "Some-college" "Assoc-acdm" "Assoc-voc"
## [9] "7th-8th" "Doctorate" "Prof-school" "5th-6th"
## [13] "10th" "1st-4th" "Preschool" "12th"
##
## $MARITALSTATUS
## [1] "Never-married" "Married-civ-spouse" "Divorced"
## [4] "Married-spouse-absent" "Separated" "Married-AF-spouse"
## [7] "Widowed"
##
## $OCCUPATION
## [1] "Adm-clerical" "Exec-managerial" "Handlers-cleaners"
## [4] "Prof-specialty" "Other-service" "Sales"
## [7] "Craft-repair" "Transport-moving" "Farming-fishing"
## [10] "Machine-op-inspct" "Tech-support" "?"
## [13] "Protective-serv" "Armed-Forces" "Priv-house-serv"
##
## $RELATIONSHIP
## [1] "Not-in-family" "Husband" "Wife" "Own-child"
## [5] "Unmarried" "Other-relative"
##
## $RACE
## [1] "White" "Black" "Asian-Pac-Islander"
## [4] "Amer-Indian-Eskimo" "Other"
##
## $SEX
## [1] "Male" "Female"
##
## $NATIVECOUNTRY
## [1] "United-States" "Cuba"
## [3] "Jamaica" "India"
## [5] "?" "Mexico"
## [7] "South" "Puerto-Rico"
## [9] "Honduras" "England"
## [11] "Canada" "Germany"
## [13] "Iran" "Philippines"
## [15] "Italy" "Poland"
## [17] "Columbia" "Cambodia"
## [19] "Thailand" "Ecuador"
## [21] "Laos" "Taiwan"
## [23] "Haiti" "Portugal"
## [25] "Dominican-Republic" "El-Salvador"
## [27] "France" "Guatemala"
## [29] "China" "Japan"
## [31] "Yugoslavia" "Peru"
## [33] "Outlying-US(Guam-USVI-etc)" "Scotland"
## [35] "Trinadad&Tobago" "Greece"
## [37] "Nicaragua" "Vietnam"
## [39] "Hong" "Ireland"
## [41] "Hungary" "Holand-Netherlands"
All right, so there were no NAs, but there are quite a few “?” values.
These “?” placeholders should be treated as missing, so we re-import the data and tell read_csv() to read them as NA.
df <- read_csv("dataset/adult.csv", na = c("NA", "?"))
# Re-check the number of NAs per column
map_int(df, function(.x) sum(is.na(.x)))
##           AGE     WORKCLASS        FNLWGT     EDUCATION  EDUCATIONNUM
##             0          1836             0             0             0
## MARITALSTATUS    OCCUPATION  RELATIONSHIP          RACE           SEX
##             0          1843             0             0             0
##   CAPITALGAIN   CAPITALLOSS  HOURSPERWEEK NATIVECOUNTRY      ABOVE50K
##             0             0             0           583             0
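If the data is already in memory, re-importing is not the only option. The sketch below, on a made-up `toy` tibble (not the adult dataset), first counts the “?” placeholders per character column and then replaces them with NA using dplyr's na_if(). It assumes dplyr 1.0 or later for across() and where().

```r
library(dplyr)
library(purrr)

# Toy data (hypothetical) using "?" as a placeholder for missing values
toy <- tibble(
  workclass = c("Private", "?", "State-gov"),
  occupation = c("?", "Sales", "?"),
  age = c(39, 50, 38)
)

# Count the "?" placeholders in each character column
question_marks <- toy %>%
  select(where(is.character)) %>%
  map_int(~ sum(.x == "?"))

# Replace "?" with NA in every character column; numeric columns are untouched
toy_clean <- toy %>%
  mutate(across(where(is.character), ~ na_if(.x, "?")))

question_marks
toy_clean
```

Handling the placeholder at import time, as done above with the `na` argument, keeps the cleaning in one place; the in-memory variant is useful when the file cannot be read again.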
Let’s now rework the column names to better fit our naming conventions
colnames(df) <- c("age", "working_class", "final_weight", "education", "education_num", "marital_status",
"occupation", "relationship", "race", "gender", "capital_gain", "capital_loss", "hours_per_week",
"native_country", "above_50k")
Next, we convert the character columns to factors and check the levels of one of them.
df2 <- df %>% mutate_if(is.character, as.factor)
levels(df2$working_class)
## [1] "Federal-gov" "Local-gov" "Never-worked"
## [4] "Private" "Self-emp-inc" "Self-emp-not-inc"
## [7] "State-gov" "Without-pay"
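The colnames() assignment used earlier is positional, so it silently mislabels columns if their order in the file ever changes. A possible alternative is to rename by name with a lookup vector. The sketch below uses a made-up `toy` tibble with three of the original column names, and assumes dplyr 1.0 or later.

```r
library(dplyr)

# Toy data (hypothetical) with a subset of the original column names
toy <- tibble(
  AGE = c(39, 50),
  WORKCLASS = c("State-gov", "Private"),
  SEX = c("Male", "Female")
)

# Lookup vector written as new_name = "OLD_NAME"
lookup <- c(age = "AGE", working_class = "WORKCLASS", gender = "SEX")

# all_of() raises an error if one of the old names is missing from the data
toy_renamed <- toy %>% rename(all_of(lookup))

names(toy_renamed)
```

Because each new name is tied to an old one, a missing or misspelled column produces an error instead of a wrong label.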
summary(df2)
## age working_class final_weight
## Min. :17.00 Private :22696 Min. : 12285
## 1st Qu.:28.00 Self-emp-not-inc: 2541 1st Qu.: 117827
## Median :37.00 Local-gov : 2093 Median : 178356
## Mean :38.58 State-gov : 1298 Mean : 189778
## 3rd Qu.:48.00 Self-emp-inc : 1116 3rd Qu.: 237051
## Max. :90.00 (Other) : 981 Max. :1484705
## NA's : 1836
## education education_num marital_status
## HS-grad :10501 Min. : 1.00 Divorced : 4443
## Some-college: 7291 1st Qu.: 9.00 Married-AF-spouse : 23
## Bachelors : 5355 Median :10.00 Married-civ-spouse :14976
## Masters : 1723 Mean :10.08 Married-spouse-absent: 418
## Assoc-voc : 1382 3rd Qu.:12.00 Never-married :10683
## 11th : 1175 Max. :16.00 Separated : 1025
## (Other) : 5134 Widowed : 993
## occupation relationship race
## Prof-specialty : 4140 Husband :13193 Amer-Indian-Eskimo: 311
## Craft-repair : 4099 Not-in-family : 8305 Asian-Pac-Islander: 1039
## Exec-managerial: 4066 Other-relative: 981 Black : 3124
## Adm-clerical : 3770 Own-child : 5068 Other : 271
## Sales : 3650 Unmarried : 3446 White :27816
## (Other) :10993 Wife : 1568
## NA's : 1843
## gender capital_gain capital_loss hours_per_week
## Female:10771 Min. : 0 Min. : 0.0 Min. : 1.00
## Male :21790 1st Qu.: 0 1st Qu.: 0.0 1st Qu.:40.00
## Median : 0 Median : 0.0 Median :40.00
## Mean : 1078 Mean : 87.3 Mean :40.44
## 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.:45.00
## Max. :99999 Max. :4356.0 Max. :99.00
##
## native_country above_50k
## United-States:29170 Min. :0.0000
## Mexico : 643 1st Qu.:0.0000
## Philippines : 198 Median :0.0000
## Germany : 137 Mean :0.2408
## Canada : 121 3rd Qu.:0.0000
## (Other) : 1709 Max. :1.0000
## NA's : 583
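The factor conversion above can also be written with across() and where(), which recent dplyr versions recommend over the mutate_if() family. This is a minimal sketch on a made-up `toy` tibble, assuming dplyr 1.0 or later. It also shows that NA values do not become a factor level, which is why “?” no longer appears among the levels of working_class.

```r
library(dplyr)

# Toy data (hypothetical) with one character and one numeric column
toy <- tibble(
  working_class = c("Private", "State-gov", NA, "Private"),
  age = c(39, 50, 38, 53)
)

# Convert every character column to a factor; other columns are left as they are
toy_factors <- toy %>%
  mutate(across(where(is.character), as.factor))

# NA is kept as a missing value, not as a level
levels(toy_factors$working_class)
summary(toy_factors)
```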