Introduction and aim

The Cereal dataset contains nutrition data for close to 80 different type of cereals. A rating is associated with each cereal box. We don’t know if the rating is from consumers or some health organization. So our first question is what are common attributes of cereals with high rating. This might help us understand how the rating is done.

This post is about running a simple ANOVA test and checking on the assumptions made.

We will first run some basics EDA mainly visuals, then do some correlation and regression analysis.

Loadings of libraries, of the data frame and a first look at the structure.

library(tidyverse)
library(kableExtra)
library(corrplot)

options(knitr.table.format = "html") 
df <- read_csv("dataset/cereal.csv")
glimpse(df)
## Observations: 77
## Variables: 16
## $ name     <chr> "100% Bran", "100% Natural Bran", "All-Bran", "All-Br...
## $ mfr      <chr> "N", "Q", "K", "K", "R", "G", "K", "G", "R", "P", "Q"...
## $ type     <chr> "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C"...
## $ calories <int> 70, 120, 70, 50, 110, 110, 110, 130, 90, 90, 120, 110...
## $ protein  <int> 4, 3, 4, 4, 2, 2, 2, 3, 2, 3, 1, 6, 1, 3, 1, 2, 2, 1,...
## $ fat      <int> 1, 5, 1, 0, 2, 2, 0, 2, 1, 0, 2, 2, 3, 2, 1, 0, 0, 0,...
## $ sodium   <int> 130, 15, 260, 140, 200, 180, 125, 210, 200, 210, 220,...
## $ fiber    <dbl> 10.0, 2.0, 9.0, 14.0, 1.0, 1.5, 1.0, 2.0, 4.0, 5.0, 0...
## $ carbo    <dbl> 5.0, 8.0, 7.0, 8.0, 14.0, 10.5, 11.0, 18.0, 15.0, 13....
## $ sugars   <int> 6, 8, 5, 0, 8, 10, 14, 8, 6, 5, 12, 1, 9, 7, 13, 3, 2...
## $ potass   <int> 280, 135, 320, 330, -1, 70, 30, 100, 125, 190, 35, 10...
## $ vitamins <int> 25, 0, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25...
## $ shelf    <int> 3, 3, 3, 3, 3, 1, 2, 3, 1, 3, 2, 1, 2, 3, 2, 1, 1, 2,...
## $ weight   <dbl> 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.33, 1.00,...
## $ cups     <dbl> 0.33, 1.00, 0.33, 0.50, 0.75, 0.75, 1.00, 0.75, 0.67,...
## $ rating   <dbl> 68.40297, 33.98368, 59.42551, 93.70491, 34.38484, 29....

From a first look at the structure of the df, we will have to change a few variable types.

Let’s check now for the presence of any missing values.

#Any NA in the file
map_int(df, function(.x) {sum(is.na(.x))})
##     name      mfr     type calories  protein      fat   sodium    fiber 
##        0        0        0        0        0        0        0        0 
##    carbo   sugars   potass vitamins    shelf   weight     cups   rating 
##        0        0        0        0        0        0        0        0

Goodies!! No missing value!

df$type <- factor(df$type, labels = c("cold", "hot"))
df$mfr <- factor(df$mfr, labels = c("American Home \n Food Products", "General Mills", 
                                           "Kelloggs", "Nabisco", "Post", "Quaker Oats", 
                                           "Ralston Purina"))
df$shelf <- factor(df$shelf)

Look at the best 10 cereals from the rating perspective

knitr::kable(head(df %>% select(name, rating) %>% arrange(desc(rating)), 10), 
             caption = "Top 10 cereals by rating")  %>% 
  kable_styling(bootstrap_options = c("striped", "hover"))
Top 10 cereals by rating
name rating
All-Bran with Extra Fiber 93.70491
Shredded Wheat 'n'Bran 74.47295
Shredded Wheat spoon size 72.80179
100% Bran 68.40297
Shredded Wheat 68.23588
Cream of Wheat (Quick) 64.53382
Puffed Wheat 63.00565
Puffed Rice 60.75611
Nutri-grain Wheat 59.64284
All-Bran 59.42551

That really looks like a lot of branny stuff …

As we are interested in ratings, let’s see how it is distributed

ggplot(df, aes(x=rating, fill=I("dodgerblue3"))) + 
  geom_histogram(binwidth=5) + 
  labs(x = "Cereal ratings", y = "Frequency", title = "Distribution of cereal ratings")

Cereals ratings have a long right tail.

ggplot(df, aes(x = "rating", y = rating)) + 
  geom_boxplot() + 
  geom_jitter(aes(col = mfr)) + coord_flip() + 
  labs(x = "", title= "Distribution of rating") + 
  theme(legend.position = "bottom", legend.direction = "horizontal")

Looks like General Mills have been getting mostly low rating.

Correlation

Let’s check what variables are correlated with rating.

df_cor <- cor(df %>% select(-mfr, -type, -name, -shelf))
corrplot(df_cor, order="hclust")

From the look of things, it seems like fiber, protein and potassium are positively correlated with rating, while sugar, calories, fat and sodium are negatively correlated with rating. As our top 10 cereal box by rating showed earlier, it seems like the rating is more based on health concerns rather than taste (sugars and sodium :-)

Regression

We first build a complete model, then we will use backward elimination to eliminate non necessary variables and avoid the risk of overfitting.

df2 <- df %>% select(-name, -mfr, -type)

cereal_model <- lm(rating ~., data = df2)
summary(cereal_model)
## 
## Call:
## lm(formula = rating ~ ., data = df2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -5.665e-07 -2.300e-07  2.131e-08  1.850e-07  5.616e-07 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)  5.493e+01  3.322e-07  1.654e+08   <2e-16 ***
## calories    -2.227e-01  5.496e-09 -4.053e+07   <2e-16 ***
## protein      3.273e+00  4.896e-08  6.686e+07   <2e-16 ***
## fat         -1.691e+00  5.999e-08 -2.820e+07   <2e-16 ***
## sodium      -5.449e-02  4.791e-10 -1.137e+08   <2e-16 ***
## fiber        3.443e+00  4.146e-08  8.306e+07   <2e-16 ***
## carbo        1.092e+00  1.676e-08  6.518e+07   <2e-16 ***
## sugars      -7.249e-01  1.801e-08 -4.025e+07   <2e-16 ***
## potass      -3.399e-02  1.445e-09 -2.353e+07   <2e-16 ***
## vitamins    -5.121e-02  1.870e-09 -2.738e+07   <2e-16 ***
## shelf2      -2.680e-07  1.054e-07 -2.544e+00   0.0134 *  
## shelf3      -7.291e-08  1.016e-07 -7.180e-01   0.4757    
## weight      -2.514e-07  5.055e-07 -4.970e-01   0.6207    
## cups         2.304e-07  1.886e-07  1.222e+00   0.2264    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.926e-07 on 63 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.347e+16 on 13 and 63 DF,  p-value: < 2.2e-16
library(leaps)

cereal_model_bw <- regsubsets(formula(cereal_model), data = df, method = "backward")
plot(cereal_model_bw, scale="adjr2")

This graph makes our search of parsimonious explanatory variables quite easier and less tedious. On the y-axis, there is the adjusted r-square value one can reach using the variables (in black) from the x-axis. Now, I can run the model with just calories, protein, fat, sodium, fiber, carbo, sugars and vitamin and get a similar r squared.

cereal_model2 <- lm(rating ~ calories + protein + fat + sodium + 
                      fiber + carbo + sugars + vitamins, data = df2)
summary(cereal_model2)
## 
## Call:
## lm(formula = rating ~ calories + protein + fat + sodium + fiber + 
##     carbo + sugars + vitamins, data = df2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.91756 -0.54129 -0.05143  0.49141  2.43725 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 56.192430   0.731856   76.78  < 2e-16 ***
## calories    -0.220515   0.012888  -17.11  < 2e-16 ***
## protein      2.909377   0.140979   20.64  < 2e-16 ***
## fat         -2.003414   0.158340  -12.65  < 2e-16 ***
## sodium      -0.053995   0.001403  -38.49  < 2e-16 ***
## fiber        2.549274   0.056362   45.23  < 2e-16 ***
## carbo        1.033630   0.048934   21.12  < 2e-16 ***
## sugars      -0.839300   0.050656  -16.57  < 2e-16 ***
## vitamins    -0.052269   0.005074  -10.30 1.59e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8874 on 68 degrees of freedom
## Multiple R-squared:  0.9964, Adjusted R-squared:  0.996 
## F-statistic:  2372 on 8 and 68 DF,  p-value: < 2.2e-16

One more time, one can see that the ratings is health based as the variables that are not of interest are not health related such as:

  • shelf: display shelf (1, 2, or 3, counting from the floor)
  • weight: weight in ounces of one serving
  • cups: number of cups in one serving

Diagnostics.

library(ggfortify)
autoplot(cereal_model2, label.size = 3)

Let’s focus on the first diagnostic: the residuals.

df3 <- tibble(sugars = df2$sugars, calories = df2$calories, sodium = df2$sodium, 
              mfr = df$mfr, fiber = df2$fiber, 
             residuals = cereal_model2$residuals, fitted = cereal_model2$fitted)

ggplot(df3, aes(x = fitted, y = residuals)) + 
  geom_point(aes(colour = mfr)) + 
  labs(title = "Scatterplot of residuals vs fitted values") + 
  theme(legend.position = "bottom", legend.direction = "horizontal")

That is a very decent residual vs fitted plot. It has the main characteristics of it at least!

  • The residuals “bounce randomly” around the 0 line. This suggests that the assumption that the relationship is linear is reasonable.
  • The residuals roughly form a “horizontal band” around the 0 line. This suggests that the variances of the error terms are equal.
  • No one residual “stands out” from the basic random pattern of residuals. This suggests that there are no outliers. Well the 90+ rating data point does stand out.

We don’t have a funnel shape type of graph and it should be the case in a homoscedastic linear model with normally distributed errors.

ggplot(df3, aes(x = sugars, y = residuals)) + 
  geom_point(aes(colour = mfr)) + 
  labs(title = "Scatterplot of residuals vs sugars") + 
  theme(legend.position = "bottom", legend.direction = "horizontal")

ggplot(df3, aes(x = sodium, y = residuals)) + 
  geom_point(aes(colour = mfr)) + 
  labs(title = "Scatterplot of residuals vs sodium") + 
  theme(legend.position = "bottom", legend.direction = "horizontal")

ggplot(df3, aes(x = fiber, y = residuals)) + 
  geom_point(aes(colour = mfr)) + 
  labs(title = "Scatterplot of residuals vs fiber") + 
  theme(legend.position = "bottom", legend.direction = "horizontal")

ggplot(df3, aes(x = residuals, fill = I("dodgerblue3"))) + 
  geom_histogram(binwidth = 0.5) + 
  labs(title = "Histogram of residuals")

Conclusions

I am gathering from this that the rating comes from strict pre-defined health criteria rather than consumer feedback.

Zachs Challenge ;-)

On my Kaggle post, I received a comment from Zachs stating: “I would challenge you to try and prove that stores put the sugary cereals on the bottom two shelves so that the kids beg for it!”

Let’s first check the difference in mean sugars for each shelf.

df2 <- df %>% select(shelf, sugars) %>% 
  group_by(shelf) %>% summarize(mean = mean(sugars), 
                                sd = sd(sugars), 
                                n = n()) %>% 
  ungroup()

knitr::kable(df2, caption = "Statistics on sugars based on shelving") %>% 
  kable_styling(bootstrap_options = c("striped", "hover"))
Statistics on sugars based on shelving
shelf mean sd n
1 4.800000 4.572227 20
2 9.619048 4.128876 21
3 6.527778 3.835817 36

Clearly shelf 2 contains cereals with more sugars while most of the cereals are on shelf 3.

Let’s visualize these results. Parallel box & wisker plot might be best.

ggplot(df, aes(x = shelf, y = sugars)) + 
  geom_boxplot() + 
  geom_jitter(aes(color = mfr)) + 
  labs(y = "Amount of sugars", x = "Shelf level") + 
  theme(legend.position = "bottom")

Visually at least, the more sugary cereals seems to be put at medium level on the shelves (eyes / easy grab level). Shelf 2, the middle one, clearly tends to display the more sugary & calorific cereals. If shelf 2 is the one more at eyes level, Zachs had a good feeling. Sugary cereals are promoted more.

Anova & checking Assumptions

Now, it would be great to confirm this statistically. ANOVA seems the appropriate tool to use to confirm this. We are in the situation to compare 3 different means and see if there is a difference between them. The Null Hypothesis: Mean of Sugar shelf 1 = Mean of Sugar shelf 2 = Mean of Sugar shelf 3 Alternative Hypothesis: There is a difference in between at least one of these means.

Assumption of normality

Let’s first check the assumption of normality. 2 way to do this:

  • Q-Q plot
library(ggfortify)
autoplot(aov(sugars ~ shelf, data = df))[[2]]

This is not an ideal plot.

Assumption of homogeneity of variance

Usually the Bartlet test is use for this.

bartlett.test(sugars ~ shelf, data = df)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  sugars by shelf
## Bartlett's K-squared = 0.76837, df = 2, p-value = 0.681

We can keep our Null Hypothesis of homogeneity of variance.

Test performance

sugar_shelf_aov <- aov(sugars ~ shelf, data = df)
summary(sugar_shelf_aov)
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## shelf        2  248.4  124.20   7.335 0.00124 **
## Residuals   74 1253.1   16.93                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(sugar_shelf_aov)
##                  2.5 %   97.5 %
## (Intercept)  2.9665288 6.633471
## shelf2       2.2571819 7.380913
## shelf3      -0.5589625 4.014518

We get a high F-value on our test with a p-value of 0.001. Hence we can reject the null hypothesis and say that there is a difference between the mean of sugars in each of the shelf.

All we could conclude is that there is a significant difference between one (or more) of the pairs. As the ANOVA is significant, further ‘post hoc’ tests have to be carried out to confirm where those differences are. The post hoc tests are mostly t-tests with an adjustment to account for the multiple testing. Tukey’s is the most commonly used post hoc test but check if your discipline uses something else.

TukeyHSD(sugar_shelf_aov)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = sugars ~ shelf, data = df)
## 
## $shelf
##          diff       lwr        upr     p adj
## 2-1  4.819048  1.743892  7.8942034 0.0010122
## 3-1  1.727778 -1.017129  4.4726846 0.2942373
## 3-2 -3.091270 -5.793836 -0.3887034 0.0210121

Diagnostic plots.

autoplot(aov(sugars ~ shelf, data = df))

And now I can leave Zach with a challenge ;-) Conversly, are more fiber-ish cereals put at other shelf than shelf 2?

Sugars has been negatively correlated rating. Let’s see if fiber, positively correlated with rating, has also an influence on the shelving.

ggplot(df, aes(x = shelf, y = fiber)) + 
  geom_boxplot() + 
  geom_jitter(aes(color = mfr)) + 
  labs(y = "Amount of fibers", x = "Shelf level") + 
  theme(legend.position = "bottom")

Acknowledgements

The dataset has been gathered and cleaned up by Petra Isenberg, Pierre Dragicevic and Yvonne Jansen. The original source can be found here

References

https://onlinecourses.science.psu.edu/stat501/node/36 https://stats.stackexchange.com/questions/76226/interpreting-the-residuals-vs-fitted-values-plot-for-verifying-the-assumptions http://docs.statwing.com/interpreting-residual-plots-to-improve-your-regression/