Chapter 15 Case Study - Text classification: Spam and Ham.

This chapter has been inspired by the Coursera course on Machine Learning Foundations: A Case Study Approach given by Carlos Guestrin and by Emily Fox from Washington University. This course is part of the Machine Learning Specialization

The task was to apply classfification on an Amazon review dataset. Given a review, we create a model that will decide if the review is positive (associated with a rating of 4 or 5) or negative (associate with a rating of 1 or 2). This is a supervised learning task as the grading associated with the reviews is used as the response variable.

What we have done here is to create a subset of the dataset with only one product. The Philips Avent Bottle.

As usual, let’s first load the libraries

Let’s have a quick look at our data.

df <- read_csv("dataset/toyamazonPhilips.csv")
df <- as_tibble(df)
df2 <- df[,2:3]

#Let's have a quick look at the reviews
library(pander)
pandoc.table(df2[1:3,], 
             justify = 'left', style = 'grid')
## 
## 
## +--------------------------------+--------+
## | review                         | rating |
## +================================+========+
## | I was recommended to use these | 5      |
## | bottles by a girlfriend who    |        |
## | had used them with her baby    |        |
## | and sworn by them. I started   |        |
## | with a set of 4oz bottles with |        |
## | newborn nipples (#1) and have  |        |
## | graduated now to the 9oz       |        |
## | bottles with the next size up  |        |
## | nipples (#2). I will simply    |        |
## | buy new nipples when she's     |        |
## | ready for the next step. I am  |        |
## | absolutely thrilled with these |        |
## | bottles.  I do have an         |        |
## | occasional leak but it is      |        |
## | always my own fault for not    |        |
## | screwing the lid on tight      |        |
## | enough!I have not had any      |        |
## | problems with leaking when I   |        |
## | have them put together         |        |
## | correctly. They were           |        |
## | especially good for going back |        |
## | and forth between breast       |        |
## | feeding and bottle feeding. My |        |
## | now 4mos old baby never had    |        |
## | any difficulties feeding       |        |
## | either way and transitioned    |        |
## | very smoothly. One reader      |        |
## | referred to air being pushed   |        |
## | into the bottle during feeding |        |
## | and causing her baby to        |        |
## | swallow air. The air wooshing  |        |
## | in is a good thing! It         |        |
## | prevents a vacuum from         |        |
## | developing inside the bottle   |        |
## | and allows the baby to         |        |
## | continuously feed without      |        |
## | having to stop to relieve the  |        |
## | pressure inside the bottle.    |        |
## | The only time I ever had any   |        |
## | problems with mine swallowing  |        |
## | air was when she had pretty    |        |
## | much outgrown the newborn      |        |
## | nipples and would try to suck  |        |
## | too hard when she was very     |        |
## | hungry. This was immediately   |        |
## | corrected by buying the next   |        |
## | size up. They're also very     |        |
## | easy to clean and can be       |        |
## | effectively washed in the      |        |
## | dishwasher because of their    |        |
## | wide neck. Regular bottles     |        |
## | cannot. I, out of necessity,   |        |
## | used another brand of bottle   |        |
## | with her today at a relative's |        |
## | house and was reminded why I   |        |
## | like Avent bottles so much!    |        |
## +--------------------------------+--------+
## | If I had not been given a ton  | 2      |
## | of Avent bottles, I would have |        |
## | chosen some other system.  The |        |
## | leaking is terrible!!!  You    |        |
## | have to buy the disks          |        |
## | separately, you should get     |        |
## | them for free because they are |        |
## | absolutely essential.  The     |        |
## | only way to mix formula in the |        |
## | bottle or transport liquid is  |        |
## | to use the disks in the ring,  |        |
## | then switch to the nipple when |        |
## | you are ready to feed.  The    |        |
## | only reason I give it a two is |        |
## | because I do like that you can |        |
## | pump directly into the bottle  |        |
## | with the ISIS breast pump.     |        |
## | And, I like the sippy cups.    |        |
## +--------------------------------+--------+
## | Leaks! Especially difficult to | 1      |
## | get a tight seal if you use    |        |
## | one hand (while holding baby). |        |
## | A much better design is the    |        |
## | Breast Flow Learning Curve     |        |
## | First Years bottles. Instead   |        |
## | buy The First Years 3pk.       |        |
## | Breastflow 5oz. Bottles These  |        |
## | worked much better for me.     |        |
## +--------------------------------+--------+
#Let's see the table of ratings.  
table(df2$rating)
## 
##  1  2  3  4  5 
## 45 33 17 30 66

Interestingly the ratings on the Avent Bottles are quite spread on the extreme. It might be that people only write reviews if they are super excited or very frustrated with a product. Because we want this to be a binary classification exercise, we’ll do some transformation on these ratings. First we combine the positive reviews together (the 4 and 5 ratings) and the negative reviews together (the 1 and 2 ratings). Then we take out the neutral reviews.

# We'll put a 1 for great reviews (4 or 5) or a 0 for bad reviews (1 or 2)
# We remove all the reviews that have a rating of 3
df2 <- df %>% filter(rating != 3) %>% 
                    mutate(rating_new = if_else(rating >= 4, 1, 0))
df_training <-  df2[1:150, ]

Now we create our corpus, then tokenize it, then make it back to a data frame.

library(tm)
corpus_toy <- Corpus(VectorSource(df_training$review))
tdm_toy <- DocumentTermMatrix(corpus_toy, list(removePunctuation = TRUE, 
                                               removeNumbers = TRUE))

training_set_toy <- as.matrix(tdm_toy)

training_set_toy <- cbind(training_set_toy, df_training$rating_new)

colnames(training_set_toy)[ncol(training_set_toy)] <- "y"

training_set_toy <- as.data.frame(training_set_toy)
training_set_toy$y <- as.factor(training_set_toy$y)

Now that we have our data frame ready, let’s create our model using the svmLinear3 method.

review_toy_model <- caret::train(y ~., data = training_set_toy, method = 'svmLinear3')

Now we try our model on new review data

test_review_data <- df2[151:174, ]

test_corpus <- Corpus(VectorSource(test_review_data$review))
test_tdm <- DocumentTermMatrix(test_corpus, control=list(dictionary = Terms(tdm_toy)))
test_tdm <- as.matrix(test_tdm)

#Build the prediction  
model_toy_result <- predict(review_toy_model, newdata = test_tdm)

check_accuracy <- as.data.frame(cbind(prediction = model_toy_result, 
                                      rating = test_review_data$rating_new))

check_accuracy <- check_accuracy %>% mutate(prediction = as.integer(prediction) - 1)

check_accuracy$accuracy <- if_else(check_accuracy$prediction == check_accuracy$rating, 1, 0)
round(prop.table(table(check_accuracy$accuracy)), 3)
## 
##     0     1 
## 0.208 0.792

Another way to deal with text classification is to use the RtextTool library.
We can use the same dataframe that we used in our previous method. Like before we “DocumentTermMatrix”, we create a matrix of terms

library(RTextTools)
product_review_matrix <- create_matrix(df2[,2], language = "English", 
                                       removeNumbers = TRUE, 
                                       removePunctuation = TRUE, 
                                       removeStopwords = FALSE, stemWords = FALSE)

product_review_container <- create_container(product_review_matrix,
                                             df2$rating_new, 
                                             trainSize = 1:150, testSize = 151:174, 
                                             virgin = FALSE)

product_review_model <- train_model(product_review_container, algorithm = "SVM")

product_review_model_result <- classify_model(product_review_container, product_review_model)
x <- as.data.frame(cbind(df2$rating_new[151:174], product_review_model_result$SVM_LABEL))
colnames(x) <- c("actual_ratings", "predicted_ratings")
x <- x %>% mutate(predicted_ratings = predicted_ratings - 1)
round(prop.table(table(x$actual_ratings == x$predicted_ratings)), 3)
## 
## FALSE  TRUE 
##  0.25  0.75