This project explores the `mtcars` data set to see how miles per gallon (MPG) is affected by different variables, specifically the effect of automatic versus manual transmissions. Two questions will be answered:
Is an automatic or manual transmission better for MPG?
How large is the MPG difference between automatic and manual transmissions?
Loading the libraries needed for data handling, plotting, and model selection, then reading the `mtcars` dataset and keeping an untouched copy of it with data.table’s `copy()`.
library(data.table)
library(ggplot2)
library(leaps)
library(printr)
data("mtcars")
mtcars_num <- copy(mtcars)
Viewing the first rows of `mtcars` and the class of each variable.
head(mtcars)
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
# sapply() inspects each column; apply() would coerce the data frame to a matrix first
as.data.frame(t(sapply(mtcars, class)))
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|
numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric |
Changing categorical variables to factors and relabeling the levels of `am` as `Automatic` and `Manual`.
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$am <- factor(mtcars$am, labels = c("Automatic","Manual"))
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
Plotting the miles per gallon (MPG) for automatic and manual transmissions.
plot1 <- ggplot(mtcars, aes(x = am, y = mpg)) +
  geom_boxplot(aes(fill = am)) +
  xlab("Transmission") +
  ylab("MPG") +
  theme(legend.position = "none")
plot1
The boxplot suggests a clear difference in MPG between the two transmission types. A t-test will help verify whether the difference in means is significant.
auto_vs_manu_ttest <- t.test(mpg ~ am, mtcars)
auto_vs_manu_ttest
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group Automatic mean in group Manual
## 17.14737 24.39231
The t-test rejects the null hypothesis that the difference in means is zero (\(p = 0.0014\)). Manual cars average 24.39 MPG against 17.15 MPG for automatics, a difference of about 7.24 MPG with a 95% confidence interval of 3.21 to 11.28.
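The numbers behind that conclusion can be pulled straight out of the test object. A minimal sketch, assuming a fresh copy of `mtcars` so that it runs on its own (the names `cars`, `tt`, `mean_diff` and `ci_diff` are illustrative only):

```r
# Fresh copy so the example does not depend on earlier steps
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

tt <- t.test(mpg ~ am, data = cars)

# Difference in group means, expressed as Manual minus Automatic
group_means <- tt$estimate
mean_diff <- unname(group_means[2] - group_means[1])

# t.test() reports the interval for (Automatic - Manual),
# so flip the sign and the order to match mean_diff
ci_diff <- rev(-as.numeric(tt$conf.int))

round(mean_diff, 2)
round(ci_diff, 2)
```

Reporting the difference as manual minus automatic keeps its sign consistent with the regression coefficients used later.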
Since the project is trying to quantify the difference in MPG between automatic and manual transmissions, the best starting place is a simple linear model with MPG as the dependent variable and transmission type as the only regressor.
basic_fit <- lm(mpg ~ am, mtcars)
summary(basic_fit)$coefficients
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 17.147368 | 1.124602 | 15.247492 | 0.000000 |
amManual | 7.244939 | 1.764422 | 4.106127 | 0.000285 |
summary(basic_fit)$r.squared
## [1] 0.3597989
The basic linear model with `am` as the only regressor explains only \(36\%\) of the variation in MPG, so it is not a very good model. Improving on it gets tricky once more than one variable is involved: regressors can be correlated not only with the response but also with each other, so adding a highly correlated variable could help the prediction but could also hurt it.
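With a single two-level factor as the regressor, the coefficients have a direct reading: the intercept is the mean MPG of the reference group (automatic), and `amManual` is the difference between the two group means. A quick check on a fresh copy of the data (`cars`, `fit` and `group_means` are illustrative names):

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ am, data = cars)
group_means <- tapply(cars$mpg, cars$am, mean)

# Intercept equals the mean of the reference level (Automatic)
all.equal(coef(fit)[["(Intercept)"]], group_means[["Automatic"]])

# Slope equals the Manual mean minus the Automatic mean
all.equal(coef(fit)[["amManual"]],
          group_means[["Manual"]] - group_means[["Automatic"]])
```

This is why the 7.24 MPG estimate matches the difference in means from the t-test: the one-regressor model adjusts for nothing else.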
Two automated approaches are used here. Stepwise regression adds and removes regressors, using AIC to choose the best model. Best subsets regression goes through all possible models built from the specified regressors and picks the best one according to several criteria.
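For reference, AIC trades goodness of fit against the number of estimated parameters, \(\mathrm{AIC} = -2\log L + 2k\). The sketch below recomputes it by hand for one model and then compares two candidates. The two formulas are illustrative and are not the models selected in this analysis. Note that `step()` itself calls `extractAIC()`, which for linear models differs from `AIC()` by an additive constant, so the ranking of models agrees even though the printed values do not.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

small  <- lm(mpg ~ am, data = cars)
larger <- lm(mpg ~ am + wt, data = cars)

# AIC by hand: -2 * log-likelihood + 2 * (number of estimated parameters).
# For lm the parameter count includes the residual variance.
ll <- logLik(small)
aic_by_hand <- -2 * as.numeric(ll) + 2 * attr(ll, "df")
all.equal(aic_by_hand, AIC(small))

# Side-by-side comparison; lower AIC is better
AIC(small, larger)
```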
everything_fit <- lm(mpg ~ ., mtcars)
step_fit <- step(everything_fit,direction="both",trace=FALSE)
best_subset <- regsubsets(mpg ~ ., mtcars, nvmax = 25)
best_subset_summary <- summary(best_subset)
adjr2 <- which.max(best_subset_summary$adjr2)
cp <- which.min(best_subset_summary$cp)
bic <- which.min(best_subset_summary$bic)
best_set <- best_subset_summary$outmat[c(adjr2,cp),]
best_set[,1:13]
| | cyl6 | cyl8 | disp | hp | drat | wt | qsec | vs1 | amManual | gear4 | gear5 | carb2 | carb3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 ( 1 ) | * | * | * | * | * | ||||||||
3 ( 1 ) | * | * | * |
sub3_fit <- lm(mpg ~ am + wt + qsec, mtcars)
sub5_fit <- lm(mpg ~ am + cyl + hp + wt + vs, mtcars)
Stepwise regression gave us one best model, and best subsets gave us two more. Mallows’s \(C_p\) and BIC both picked the three-variable model, while the five-variable model has the highest adjusted \(R^2\). The code below collects the adjusted \(R^2\) of each model along with the p-value of the transmission coefficient. Since the goal of the project is to quantify the MPG difference between transmission types, the best model should both estimate this coefficient with confidence and explain the variance well.
models <- c("mpg ~ am + wt + qsec", "mpg ~ am + wt + cyl + hp", "mpg ~ am + wt + cyl + hp + vs")
adj_r_squared <- round(c(summary(sub3_fit)$adj.r.squared,
summary(step_fit)$adj.r.squared,
summary(sub5_fit)$adj.r.squared),4)
amManual_Pvalues <- round(c(summary(sub3_fit)$coefficients["amManual",4],
summary(step_fit)$coefficients["amManual",4],
summary(sub5_fit)$coefficients["amManual",4]),4)
# data.frame() keeps the numeric columns numeric; cbind() would coerce everything to character
results <- data.frame(models, adj_r_squared, amManual_Pvalues)
results
models | adj_r_squared | amManual_Pvalues |
---|---|---|
mpg ~ am + wt + qsec | 0.8336 | 0.0467 |
mpg ~ am + wt + cyl + hp | 0.8401 | 0.2065 |
mpg ~ am + wt + cyl + hp + vs | 0.8418 | 0.1032 |
The only model with a p-value for transmission type below \(5\%\) is `mpg ~ am + wt + qsec`. It doesn’t have the highest adjusted \(R^2\), but it’s not much lower than the other two models.
summary(sub3_fit)
##
## Call:
## lm(formula = mpg ~ am + wt + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## amManual 2.9358 1.4109 2.081 0.046716 *
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
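A point estimate with a p-value just under 0.05 is easier to judge alongside an interval. A hedged sketch using `confint()` on the same model, refit here on a fresh copy of the data so the block runs on its own (`cars`, `fit3` and `ci_am` are illustrative names):

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

fit3 <- lm(mpg ~ am + wt + qsec, data = cars)

# 95% confidence interval for the manual-vs-automatic effect,
# holding weight and quarter-mile time fixed
ci_am <- confint(fit3, "amManual", level = 0.95)
ci_am
```

Because the p-value is only slightly below 0.05, the lower end of this interval sits just above zero, so the data are compatible with an adjusted effect anywhere from almost nothing to a little under 6 MPG.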
Everything so far looks solid, but let’s make sure this model fits our data well by printing the diagnostic plots.
par(mfrow = c(2,2))
plot(sub3_fit, col = "blue", lwd = 2)
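The four panels can be complemented with a few numeric checks. The sketch below is one possible set and is not part of the original analysis: a Shapiro-Wilk test on the residuals, plus leverage and Cook's distance to flag influential cars. The `2 * p / n` leverage cutoff is a common rule of thumb, used here as an assumption.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))
fit3 <- lm(mpg ~ am + wt + qsec, data = cars)

# Normality of residuals (a small p-value would suggest non-normal residuals)
shapiro.test(residuals(fit3))

# Leverage: flag cars above the rule-of-thumb cutoff 2 * p / n
lev <- hatvalues(fit3)
p <- length(coef(fit3))
n <- nrow(cars)
names(lev)[lev > 2 * p / n]

# The three most influential observations by Cook's distance
head(sort(cooks.distance(fit3), decreasing = TRUE), 3)
```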
The best transmission type for MPG appears to be the manual transmission. This is supported by the t-test as well as by our final linear model. Holding weight and quarter-mile time constant, a manual transmission is associated with an MPG about 2.94 higher than an automatic, as can be seen in the best model’s `amManual` coefficient.
The model fits well, with \(p < 0.05\) and \(R^2 = 0.85\), but the diagnostic plots did warn us that something may be missing in our model. I believe these trends are due to the small sample size, with little overlap between the transmission types on the regressors `wt` and `qsec`.
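One way to look at that overlap directly is to summarise `wt` and `qsec` within each transmission group. A minimal sketch on a fresh copy of the data (`cars`, `wt_range` and `qsec_range` are illustrative names):

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Range of weight (1000 lbs) and quarter-mile time (s) per transmission type
wt_range   <- tapply(cars$wt,   cars$am, range)
qsec_range <- tapply(cars$qsec, cars$am, range)
wt_range
qsec_range

# Number of cars in each group
group_sizes <- table(cars$am)
group_sizes
```

If the ranges of a regressor barely overlap between the groups, the model has to extrapolate to compare a manual and an automatic car at the same value of that regressor.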