- Introduction and aim
- Correlation
- Regression
- Conclusions
- Zach's Challenge ;-)
- Acknowledgements
- References

# Introduction and aim

The Cereal dataset contains nutrition data for close to 80 different types of cereal, along with a rating for each cereal box. We don’t know whether the rating comes from consumers or from some health organization, so our first question is: what are the common attributes of cereals with a high rating? Answering it might help us understand how the rating is produced.

This post culminates in a simple ANOVA test and a check of the assumptions behind it.

We will first run some basic EDA, mainly visual, then do some correlation and regression analysis.

Let’s load the libraries and the data frame, and take a first look at its structure.

```
library(tidyverse)
library(kableExtra)
library(corrplot)
options(knitr.table.format = "html")
```

```
df <- read_csv("dataset/cereal.csv")
glimpse(df)
```

```
## Observations: 77
## Variables: 16
## $ name <chr> "100% Bran", "100% Natural Bran", "All-Bran", "All-Br...
## $ mfr <chr> "N", "Q", "K", "K", "R", "G", "K", "G", "R", "P", "Q"...
## $ type <chr> "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C"...
## $ calories <int> 70, 120, 70, 50, 110, 110, 110, 130, 90, 90, 120, 110...
## $ protein <int> 4, 3, 4, 4, 2, 2, 2, 3, 2, 3, 1, 6, 1, 3, 1, 2, 2, 1,...
## $ fat <int> 1, 5, 1, 0, 2, 2, 0, 2, 1, 0, 2, 2, 3, 2, 1, 0, 0, 0,...
## $ sodium <int> 130, 15, 260, 140, 200, 180, 125, 210, 200, 210, 220,...
## $ fiber <dbl> 10.0, 2.0, 9.0, 14.0, 1.0, 1.5, 1.0, 2.0, 4.0, 5.0, 0...
## $ carbo <dbl> 5.0, 8.0, 7.0, 8.0, 14.0, 10.5, 11.0, 18.0, 15.0, 13....
## $ sugars <int> 6, 8, 5, 0, 8, 10, 14, 8, 6, 5, 12, 1, 9, 7, 13, 3, 2...
## $ potass <int> 280, 135, 320, 330, -1, 70, 30, 100, 125, 190, 35, 10...
## $ vitamins <int> 25, 0, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25...
## $ shelf <int> 3, 3, 3, 3, 3, 1, 2, 3, 1, 3, 2, 1, 2, 3, 2, 1, 1, 2,...
## $ weight <dbl> 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.33, 1.00,...
## $ cups <dbl> 0.33, 1.00, 0.33, 0.50, 0.75, 0.75, 1.00, 0.75, 0.67,...
## $ rating <dbl> 68.40297, 33.98368, 59.42551, 93.70491, 34.38484, 29....
```

From a first look at the structure of the df, we will have to change a few variable types.

Let’s check now for the presence of any missing values.

```
# Any NA in the file?
map_int(df, ~ sum(is.na(.x)))
```

```
## name mfr type calories protein fat sodium fiber
## 0 0 0 0 0 0 0 0
## carbo sugars potass vitamins shelf weight cups rating
## 0 0 0 0 0 0 0 0
```

Goodies!! No missing values!
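
One caveat: there are no explicit `NA`s, but the glimpse above shows a `-1` in `potass`. In this dataset `-1` reportedly encodes a missing measurement for `carbo`, `sugars` and `potass`, so a quick sentinel check is worthwhile (a sketch, in the same `map_int` style as above):

```
# Count -1 sentinel values, a likely encoding of "missing"
map_int(df %>% select(carbo, sugars, potass), ~ sum(.x == -1))
```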

```
df$type <- factor(df$type, labels = c("cold", "hot"))
df$mfr <- factor(df$mfr, labels = c("American Home \n Food Products", "General Mills",
                                    "Kelloggs", "Nabisco", "Post", "Quaker Oats",
                                    "Ralston Purina"))
df$shelf <- factor(df$shelf)
```

Let’s look at the top 10 cereals by rating.

```
knitr::kable(head(df %>% select(name, rating) %>% arrange(desc(rating)), 10),
             caption = "Top 10 cereals by rating") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```

name | rating |
---|---|
All-Bran with Extra Fiber | 93.70491 |
Shredded Wheat 'n'Bran | 74.47295 |
Shredded Wheat spoon size | 72.80179 |
100% Bran | 68.40297 |
Shredded Wheat | 68.23588 |
Cream of Wheat (Quick) | 64.53382 |
Puffed Wheat | 63.00565 |
Puffed Rice | 60.75611 |
Nutri-grain Wheat | 59.64284 |
All-Bran | 59.42551 |

That really looks like a lot of branny stuff …

As we are interested in ratings, let’s see how they are distributed.

```
ggplot(df, aes(x = rating, fill = I("dodgerblue3"))) +
  geom_histogram(binwidth = 5) +
  labs(x = "Cereal ratings", y = "Frequency", title = "Distribution of cereal ratings")
```

Cereal ratings have a long right tail.

```
ggplot(df, aes(x = "rating", y = rating)) +
  geom_boxplot() +
  geom_jitter(aes(col = mfr)) +
  coord_flip() +
  labs(x = "", title = "Distribution of rating") +
  theme(legend.position = "bottom", legend.direction = "horizontal")
```

Looks like General Mills has been getting mostly low ratings.

# Correlation

Let’s check what variables are correlated with rating.

```
df_cor <- cor(df %>% select(-mfr, -type, -name, -shelf))
corrplot(df_cor, order="hclust")
```

From the look of things, fiber, protein and potassium are positively correlated with rating, while sugars, calories, fat and sodium are negatively correlated with it. As our top 10 cereals by rating suggested earlier, the rating seems to be based on health concerns rather than taste (sugars and sodium :-)
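
To read the exact numbers behind the plot, we can also sort the rating column of the correlation matrix computed above (a small sketch):

```
# Correlations with rating, from most negative to most positive
sort(df_cor[, "rating"])
```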

# Regression

We first build a complete model, then use backward elimination to drop unnecessary variables and reduce the risk of overfitting.

```
df2 <- df %>% select(-name, -mfr, -type)
cereal_model <- lm(rating ~ ., data = df2)
summary(cereal_model)
```

```
##
## Call:
## lm(formula = rating ~ ., data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.665e-07 -2.300e-07 2.131e-08 1.850e-07 5.616e-07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.493e+01 3.322e-07 1.654e+08 <2e-16 ***
## calories -2.227e-01 5.496e-09 -4.053e+07 <2e-16 ***
## protein 3.273e+00 4.896e-08 6.686e+07 <2e-16 ***
## fat -1.691e+00 5.999e-08 -2.820e+07 <2e-16 ***
## sodium -5.449e-02 4.791e-10 -1.137e+08 <2e-16 ***
## fiber 3.443e+00 4.146e-08 8.306e+07 <2e-16 ***
## carbo 1.092e+00 1.676e-08 6.518e+07 <2e-16 ***
## sugars -7.249e-01 1.801e-08 -4.025e+07 <2e-16 ***
## potass -3.399e-02 1.445e-09 -2.353e+07 <2e-16 ***
## vitamins -5.121e-02 1.870e-09 -2.738e+07 <2e-16 ***
## shelf2 -2.680e-07 1.054e-07 -2.544e+00 0.0134 *
## shelf3 -7.291e-08 1.016e-07 -7.180e-01 0.4757
## weight -2.514e-07 5.055e-07 -4.970e-01 0.6207
## cups 2.304e-07 1.886e-07 1.222e+00 0.2264
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.926e-07 on 63 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.347e+16 on 13 and 63 DF, p-value: < 2.2e-16
```

```
library(leaps)
cereal_model_bw <- regsubsets(formula(cereal_model), data = df, method = "backward")
plot(cereal_model_bw, scale="adjr2")
```

Note first that the full model above fits perfectly (R-squared of 1, residuals on the order of 1e-07): rating appears to be an exact linear function of the nutritional variables. The regsubsets graph makes our search for a parsimonious set of explanatory variables much easier and less tedious. The y-axis shows the adjusted R-squared one can reach using the variables marked in black on the x-axis. Now I can run the model with just calories, protein, fat, sodium, fiber, carbo, sugars and vitamins and get a similar adjusted R-squared.
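
The same information can be extracted numerically from the `regsubsets` fit (a sketch; `adjr2` and `which` are standard components of its summary object):

```
bw_summary <- summary(cereal_model_bw)
# Adjusted R-squared for each model size considered
bw_summary$adjr2
# Variables retained in the best model of each size (TRUE = kept)
bw_summary$which
```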

```
cereal_model2 <- lm(rating ~ calories + protein + fat + sodium +
                      fiber + carbo + sugars + vitamins, data = df2)
summary(cereal_model2)
```

```
##
## Call:
## lm(formula = rating ~ calories + protein + fat + sodium + fiber +
## carbo + sugars + vitamins, data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.91756 -0.54129 -0.05143 0.49141 2.43725
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.192430 0.731856 76.78 < 2e-16 ***
## calories -0.220515 0.012888 -17.11 < 2e-16 ***
## protein 2.909377 0.140979 20.64 < 2e-16 ***
## fat -2.003414 0.158340 -12.65 < 2e-16 ***
## sodium -0.053995 0.001403 -38.49 < 2e-16 ***
## fiber 2.549274 0.056362 45.23 < 2e-16 ***
## carbo 1.033630 0.048934 21.12 < 2e-16 ***
## sugars -0.839300 0.050656 -16.57 < 2e-16 ***
## vitamins -0.052269 0.005074 -10.30 1.59e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8874 on 68 degrees of freedom
## Multiple R-squared: 0.9964, Adjusted R-squared: 0.996
## F-statistic: 2372 on 8 and 68 DF, p-value: < 2.2e-16
```

One more time, one can see that the rating is health-based: the variables eliminated from the model are precisely the ones that are not health-related, namely:

- shelf: display shelf (1, 2, or 3, counting from the floor)
- weight: weight in ounces of one serving
- cups: number of cups in one serving
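
We give up almost nothing in fit by dropping them; a quick check on the reduced model object fitted above:

```
# Adjusted R-squared of the reduced model, ~0.996 as in the summary
summary(cereal_model2)$adj.r.squared
```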

## Diagnostics

```
library(ggfortify)
autoplot(cereal_model2, label.size = 3)
```

Let’s focus on the first diagnostic: the residuals.

```
df3 <- tibble(sugars = df2$sugars, calories = df2$calories, sodium = df2$sodium,
              mfr = df$mfr, fiber = df2$fiber,
              residuals = cereal_model2$residuals, fitted = cereal_model2$fitted)
ggplot(df3, aes(x = fitted, y = residuals)) +
  geom_point(aes(colour = mfr)) +
  labs(title = "Scatterplot of residuals vs fitted values") +
  theme(legend.position = "bottom", legend.direction = "horizontal")
```

That is a very decent residuals vs fitted plot. At least it has the main desired characteristics:

- The residuals “bounce randomly” around the 0 line, suggesting that the assumption of a linear relationship is reasonable.
- The residuals roughly form a “horizontal band” around the 0 line, suggesting that the variances of the error terms are equal.
- No single residual “stands out” from the basic random pattern, suggesting that there are no outliers. Well, the 90+ rating data point does stand out a bit.

We don’t see a funnel shape, which is what we expect from a homoscedastic linear model with normally distributed errors.
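
For a formal complement to the eyeball test, one could run a Breusch-Pagan test (a sketch, assuming the `lmtest` package is available; a large p-value is consistent with homoscedastic errors):

```
library(lmtest)
# Breusch-Pagan test: H0 = constant error variance
bptest(cereal_model2)
```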

```
ggplot(df3, aes(x = sugars, y = residuals)) +
  geom_point(aes(colour = mfr)) +
  labs(title = "Scatterplot of residuals vs sugars") +
  theme(legend.position = "bottom", legend.direction = "horizontal")
```

```
ggplot(df3, aes(x = sodium, y = residuals)) +
  geom_point(aes(colour = mfr)) +
  labs(title = "Scatterplot of residuals vs sodium") +
  theme(legend.position = "bottom", legend.direction = "horizontal")
```

```
ggplot(df3, aes(x = fiber, y = residuals)) +
  geom_point(aes(colour = mfr)) +
  labs(title = "Scatterplot of residuals vs fiber") +
  theme(legend.position = "bottom", legend.direction = "horizontal")
```

```
ggplot(df3, aes(x = residuals, fill = I("dodgerblue3"))) +
  geom_histogram(binwidth = 0.5) +
  labs(title = "Histogram of residuals")
```
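
Beyond the visual check, a Shapiro-Wilk test gives a formal read on residual normality (a sketch; the null hypothesis is that the residuals are normally distributed):

```
shapiro.test(cereal_model2$residuals)
```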

# Conclusions

I gather from this that the rating comes from strict, pre-defined health criteria rather than consumer feedback; the fact that a linear model reproduces the rating almost exactly supports that.

# Zach's Challenge ;-)

On my Kaggle post, I received a comment from Zach stating: “I would challenge you to try and prove that stores put the sugary cereals on the bottom two shelves so that the kids beg for it!”

Let’s first check the mean amount of sugars on each shelf.

```
df2 <- df %>%
  select(shelf, sugars) %>%
  group_by(shelf) %>%
  summarize(mean = mean(sugars),
            sd = sd(sugars),
            n = n()) %>%
  ungroup()
knitr::kable(df2, caption = "Statistics on sugars based on shelving") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```

shelf | mean | sd | n |
---|---|---|---|
1 | 4.800000 | 4.572227 | 20 |
2 | 9.619048 | 4.128876 | 21 |
3 | 6.527778 | 3.835817 | 36 |

Clearly shelf 2 holds the cereals with the most sugars, while shelf 3 simply holds the most cereals.

Let’s visualize these results. A parallel box-and-whisker plot might be best.

```
ggplot(df, aes(x = shelf, y = sugars)) +
  geom_boxplot() +
  geom_jitter(aes(color = mfr)) +
  labs(y = "Amount of sugars", x = "Shelf level") +
  theme(legend.position = "bottom")
```

Visually at least, the more sugary cereals seem to be placed at the middle level of the shelves (eye / easy-grab level). Shelf 2, the middle one, clearly tends to display the more sugary and calorific cereals. If shelf 2 is the one closest to a child's eye level, Zach had a good hunch: sugary cereals are promoted more.

## ANOVA & Checking Assumptions

Now, it would be great to confirm this statistically, and ANOVA seems the appropriate tool: we are comparing three means and asking whether they differ. The null hypothesis: mean sugars on shelf 1 = mean sugars on shelf 2 = mean sugars on shelf 3. The alternative hypothesis: at least one of these means differs from the others.

### Assumption of normality

Let’s first check the assumption of normality. A Q-Q plot of the ANOVA residuals is a good start.

```
library(ggfortify)
autoplot(aov(sugars ~ shelf, data = df))[[2]]
```

This is not an ideal plot.
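
A Shapiro-Wilk test on the ANOVA residuals complements the Q-Q plot (a sketch; a small p-value would cast doubt on the normality assumption):

```
shapiro.test(residuals(aov(sugars ~ shelf, data = df)))
```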

### Assumption of homogeneity of variance

Usually the **Bartlett** test is used for this.

```
bartlett.test(sugars ~ shelf, data = df)
```

```
##
## Bartlett test of homogeneity of variances
##
## data: sugars by shelf
## Bartlett's K-squared = 0.76837, df = 2, p-value = 0.681
```

With a p-value of 0.681, we fail to reject the null hypothesis of homogeneity of variance.

## Test performance

```
sugar_shelf_aov <- aov(sugars ~ shelf, data = df)
summary(sugar_shelf_aov)
```

```
## Df Sum Sq Mean Sq F value Pr(>F)
## shelf 2 248.4 124.20 7.335 0.00124 **
## Residuals 74 1253.1 16.93
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

```
confint(sugar_shelf_aov)
```

```
## 2.5 % 97.5 %
## (Intercept) 2.9665288 6.633471
## shelf2 2.2571819 7.380913
## shelf3 -0.5589625 4.014518
```

We get a high F-value on our test, with a p-value of 0.00124. Hence we can reject the null hypothesis: mean sugar content is not the same on every shelf.

All we can conclude so far is that there is a significant difference for at least one of the pairs. Since the ANOVA is significant, further ‘post hoc’ tests have to be carried out to locate those differences. Post hoc tests are mostly t-tests with an adjustment to account for the multiple testing. Tukey’s HSD is the most commonly used post hoc test, but check whether your discipline uses something else.

```
TukeyHSD(sugar_shelf_aov)
```

```
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = sugars ~ shelf, data = df)
##
## $shelf
## diff lwr upr p adj
## 2-1 4.819048 1.743892 7.8942034 0.0010122
## 3-1 1.727778 -1.017129 4.4726846 0.2942373
## 3-2 -3.091270 -5.793836 -0.3887034 0.0210121
```

Diagnostic plots.

```
autoplot(aov(sugars ~ shelf, data = df))
```

And now I can leave Zach with a challenge ;-) Conversely, are the more fibrous cereals placed on shelves other than shelf 2?

Sugars was negatively correlated with rating. Let’s see whether fiber, which is positively correlated with rating, also has an influence on shelving.

```
ggplot(df, aes(x = shelf, y = fiber)) +
  geom_boxplot() +
  geom_jitter(aes(color = mfr)) +
  labs(y = "Amount of fibers", x = "Shelf level") +
  theme(legend.position = "bottom")
```
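
To pick up that challenge numerically, the same ANOVA machinery can be pointed at fiber (a sketch mirroring the sugars analysis above; the normality and homogeneity checks would need to be repeated):

```
# Does mean fiber content differ across shelves?
fiber_shelf_aov <- aov(fiber ~ shelf, data = df)
summary(fiber_shelf_aov)
```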

# Acknowledgements

The dataset has been gathered and cleaned up by Petra Isenberg, Pierre Dragicevic and Yvonne Jansen. The original source can be found here

# References

- https://onlinecourses.science.psu.edu/stat501/node/36
- https://stats.stackexchange.com/questions/76226/interpreting-the-residuals-vs-fitted-values-plot-for-verifying-the-assumptions
- http://docs.statwing.com/interpreting-residual-plots-to-improve-your-regression/