Chapter 17 Case study - The adult dataset

17.1 Introduction

The Adult data set, also known as the “Census Income” data set, is another well-known data set from the UCI Machine Learning Repository.
The goal is to predict whether a person's income exceeds $50K/yr based on census data. Barry Becker extracted the data from the 1994 Census database.

Load the libraries.

library(tidyverse)  # readr (read_csv), dplyr (glimpse, select_if, mutate_if) and purrr (map_*)
library(stringr)

17.2 Import the data

df <- read_csv("dataset/adult.csv")
glimpse(df)

## Observations: 32,561
## Variables: 15
## $ AGE           <dbl> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30, 23…
## $ WORKCLASS     <chr> "State-gov", "Self-emp-not-inc", "Private", "Priva…
## $ FNLWGT        <dbl> 77516, 83311, 215646, 234721, 338409, 284582, 1601…
## $ EDUCATION     <chr> "Bachelors", "Bachelors", "HS-grad", "11th", "Bach…
## $ EDUCATIONNUM  <dbl> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13, 12…
## $ MARITALSTATUS <chr> "Never-married", "Married-civ-spouse", "Divorced",…
## $ OCCUPATION    <chr> "Adm-clerical", "Exec-managerial", "Handlers-clean…
## $ RELATIONSHIP  <chr> "Not-in-family", "Husband", "Not-in-family", "Husb…
## $ RACE          <chr> "White", "White", "White", "Black", "Black", "Whit…
## $ SEX           <chr> "Male", "Male", "Male", "Male", "Female", "Female"…
## $ CAPITALGAIN   <dbl> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0, 0…
## $ CAPITALLOSS   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ HOURSPERWEEK  <dbl> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40, 30…
## $ NATIVECOUNTRY <chr> "United-States", "United-States", "United-States",…
## $ ABOVE50K      <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0,…

17.3 Tidy the data

Let’s check how many values are missing in each column.

map_dbl(df, function(.x) sum(is.na(.x)))

##           AGE     WORKCLASS        FNLWGT     EDUCATION  EDUCATIONNUM 
##             0             0             0             0             0 
## MARITALSTATUS    OCCUPATION  RELATIONSHIP          RACE           SEX 
##             0             0             0             0             0 
##   CAPITALGAIN   CAPITALLOSS  HOURSPERWEEK NATIVECOUNTRY      ABOVE50K 
##             0             0             0             0             0

No missing data! That’s great news.

Before we change the variables into factors, let’s see what type of levels we have.

df %>% select_if(is.character) %>% map(unique)

## $WORKCLASS
## [1] "State-gov"        "Self-emp-not-inc" "Private"         
## [4] "Federal-gov"      "Local-gov"        "?"               
## [7] "Self-emp-inc"     "Without-pay"      "Never-worked"    
## 
## $EDUCATION
##  [1] "Bachelors"    "HS-grad"      "11th"         "Masters"     
##  [5] "9th"          "Some-college" "Assoc-acdm"   "Assoc-voc"   
##  [9] "7th-8th"      "Doctorate"    "Prof-school"  "5th-6th"     
## [13] "10th"         "1st-4th"      "Preschool"    "12th"        
## 
## $MARITALSTATUS
## [1] "Never-married"         "Married-civ-spouse"    "Divorced"             
## [4] "Married-spouse-absent" "Separated"             "Married-AF-spouse"    
## [7] "Widowed"              
## 
## $OCCUPATION
##  [1] "Adm-clerical"      "Exec-managerial"   "Handlers-cleaners"
##  [4] "Prof-specialty"    "Other-service"     "Sales"            
##  [7] "Craft-repair"      "Transport-moving"  "Farming-fishing"  
## [10] "Machine-op-inspct" "Tech-support"      "?"                
## [13] "Protective-serv"   "Armed-Forces"      "Priv-house-serv"  
## 
## $RELATIONSHIP
## [1] "Not-in-family"  "Husband"        "Wife"           "Own-child"     
## [5] "Unmarried"      "Other-relative"
## 
## $RACE
## [1] "White"              "Black"              "Asian-Pac-Islander"
## [4] "Amer-Indian-Eskimo" "Other"             
## 
## $SEX
## [1] "Male"   "Female"
## 
## $NATIVECOUNTRY
##  [1] "United-States"              "Cuba"                      
##  [3] "Jamaica"                    "India"                     
##  [5] "?"                          "Mexico"                    
##  [7] "South"                      "Puerto-Rico"               
##  [9] "Honduras"                   "England"                   
## [11] "Canada"                     "Germany"                   
## [13] "Iran"                       "Philippines"               
## [15] "Italy"                      "Poland"                    
## [17] "Columbia"                   "Cambodia"                  
## [19] "Thailand"                   "Ecuador"                   
## [21] "Laos"                       "Taiwan"                    
## [23] "Haiti"                      "Portugal"                  
## [25] "Dominican-Republic"         "El-Salvador"               
## [27] "France"                     "Guatemala"                 
## [29] "China"                      "Japan"                     
## [31] "Yugoslavia"                 "Peru"                      
## [33] "Outlying-US(Guam-USVI-etc)" "Scotland"                  
## [35] "Trinadad&Tobago"            "Greece"                    
## [37] "Nicaragua"                  "Vietnam"                   
## [39] "Hong"                       "Ireland"                   
## [41] "Hungary"                    "Holand-Netherlands"

All right, so there were no NAs, but there are quite a few “?” values.
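One way to quantify this is to count the “?” entries in each column. The following is a minimal sketch on a small made-up tibble: the `toy` object and its values are illustrative assumptions, not rows taken from the census file. The same call could be applied to `df`.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for a few columns of the census data
toy <- tibble(
  WORKCLASS  = c("Private", "?", "State-gov", "?"),
  OCCUPATION = c("Sales", "?", "Adm-clerical", "Tech-support"),
  AGE        = c(39, 50, 38, 53)
)

# Count the "?" placeholders in each column
question_marks <- map_int(toy, ~ sum(.x == "?", na.rm = TRUE))
question_marks
```

Numeric columns such as `AGE` simply report zero, so the check can be run on the whole data frame without selecting the character columns first.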

These “?” entries stand in for missing values, so we re-import the file and tell read_csv() to treat them as NA.

df <- read_csv("dataset/adult.csv", na = c("NA", "?"))

# Let's redo a check on the NA now
map_int(df, function(.x) sum(is.na(.x)))

##           AGE     WORKCLASS        FNLWGT     EDUCATION  EDUCATIONNUM 
##             0          1836             0             0             0 
## MARITALSTATUS    OCCUPATION  RELATIONSHIP          RACE           SEX 
##             0          1843             0             0             0 
##   CAPITALGAIN   CAPITALLOSS  HOURSPERWEEK NATIVECOUNTRY      ABOVE50K 
##             0             0             0           583             0
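If re-reading the file is not convenient, the same replacement could probably be done on a data frame that is already in memory. The sketch below uses `dplyr::na_if()` on a small made-up tibble; the `toy` data is an illustrative assumption, not part of the original analysis.

```r
library(dplyr)

# Hypothetical stand-in for the imported data
toy <- tibble(
  WORKCLASS = c("Private", "?", "State-gov"),
  AGE       = c(39, 50, 38)
)

# Replace "?" with NA in every character column
toy_clean <- toy %>%
  mutate(across(where(is.character), ~ na_if(.x, "?")))

# Same kind of check as above: number of NA per column
na_counts <- colSums(is.na(toy_clean))
na_counts
```

Handling the placeholder at import time with the `na` argument keeps the cleaning in one place, which is why the chapter takes that route.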

Let’s now rework the column names to better fit our naming conventions (lowercase, with words separated by underscores).

colnames(df) <- c("age", "working_class", "final_weight", "education", "education_num", "marital_status", 
                  "occupation", "relationship", "race", "gender", "capital_gain", "capital_loss", "hours_per_week", 
                  "native_country", "above_50k")

We can now convert the character columns into factors and check the resulting levels.

df2 <- df %>% mutate_if(is.character, as.factor)

levels(df2$working_class)

## [1] "Federal-gov"      "Local-gov"        "Never-worked"    
## [4] "Private"          "Self-emp-inc"     "Self-emp-not-inc"
## [7] "State-gov"        "Without-pay"
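Note that “?” no longer appears among the levels: once it is read as NA, it is treated as a missing value and not as a category. A minimal base R sketch of this behaviour, using a made-up vector (an illustrative assumption, not the census data):

```r
# Hypothetical values, with one missing entry
working_class <- c("Private", NA, "State-gov", "Private")

f <- as.factor(working_class)

# NA is kept as a missing value, not as a level
f_levels <- levels(f)
n_missing <- sum(is.na(f))

# If missingness should be modelled as its own category,
# addNA() turns NA into an explicit level
f_with_na <- addNA(f)
n_levels_with_na <- nlevels(f_with_na)
```

Whether to keep NA as missing or promote it to a level is a modelling choice; the chapter keeps it as missing at this stage.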

summary(df2)

##       age                 working_class    final_weight    
##  Min.   :17.00   Private         :22696   Min.   :  12285  
##  1st Qu.:28.00   Self-emp-not-inc: 2541   1st Qu.: 117827  
##  Median :37.00   Local-gov       : 2093   Median : 178356  
##  Mean   :38.58   State-gov       : 1298   Mean   : 189778  
##  3rd Qu.:48.00   Self-emp-inc    : 1116   3rd Qu.: 237051  
##  Max.   :90.00   (Other)         :  981   Max.   :1484705  
##                  NA's            : 1836                    
##         education     education_num                 marital_status 
##  HS-grad     :10501   Min.   : 1.00   Divorced             : 4443  
##  Some-college: 7291   1st Qu.: 9.00   Married-AF-spouse    :   23  
##  Bachelors   : 5355   Median :10.00   Married-civ-spouse   :14976  
##  Masters     : 1723   Mean   :10.08   Married-spouse-absent:  418  
##  Assoc-voc   : 1382   3rd Qu.:12.00   Never-married        :10683  
##  11th        : 1175   Max.   :16.00   Separated            : 1025  
##  (Other)     : 5134                   Widowed              :  993  
##            occupation            relationship                   race      
##  Prof-specialty : 4140   Husband       :13193   Amer-Indian-Eskimo:  311  
##  Craft-repair   : 4099   Not-in-family : 8305   Asian-Pac-Islander: 1039  
##  Exec-managerial: 4066   Other-relative:  981   Black             : 3124  
##  Adm-clerical   : 3770   Own-child     : 5068   Other             :  271  
##  Sales          : 3650   Unmarried     : 3446   White             :27816  
##  (Other)        :10993   Wife          : 1568                             
##  NA's           : 1843                                                    
##     gender       capital_gain    capital_loss    hours_per_week 
##  Female:10771   Min.   :    0   Min.   :   0.0   Min.   : 1.00  
##  Male  :21790   1st Qu.:    0   1st Qu.:   0.0   1st Qu.:40.00  
##                 Median :    0   Median :   0.0   Median :40.00  
##                 Mean   : 1078   Mean   :  87.3   Mean   :40.44  
##                 3rd Qu.:    0   3rd Qu.:   0.0   3rd Qu.:45.00  
##                 Max.   :99999   Max.   :4356.0   Max.   :99.00  
##                                                                 
##        native_country    above_50k     
##  United-States:29170   Min.   :0.0000  
##  Mexico       :  643   1st Qu.:0.0000  
##  Philippines  :  198   Median :0.0000  
##  Germany      :  137   Mean   :0.2408  
##  Canada       :  121   3rd Qu.:0.0000  
##  (Other)      : 1709   Max.   :1.0000  
##  NA's         :  583
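Since above_50k is coded as 0/1, its mean in the summary (0.2408) is the share of rows in the positive class: roughly 24% of the individuals earn more than $50K. The class balance can also be tabulated explicitly. The sketch below does so on a small made-up tibble (the `toy` values are an illustrative assumption); with the real data, `df2` would take its place.

```r
library(dplyr)

# Hypothetical 0/1 outcome column
toy <- tibble(above_50k = c(0, 0, 0, 1, 0, 1, 0, 0))

# Count each class and express it as a proportion of all rows
class_balance <- toy %>%
  count(above_50k) %>%
  mutate(proportion = n / sum(n))

class_balance
```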