This exercise focuses on applying the base TidyModels framework to the influenza data, and most of the code comes from the TidyModels’ “Build a Model” Tutorial.
The raw data for this exercise comes from the following citation: McKay, Brian et al. (2020), Virulence-mediated infectiousness and activity trade-offs and their impact on transmission potential of patients infected with influenza, Dryad, Dataset, https://doi.org/10.5061/dryad.51c59zw4v.
The processed data was produced on the Data Processing
page.
Within this analysis, the following definitions exist:
The overall steps of this exercise are as follows:
The following R packages are required for this exercise:
Load the data created on the Data Processing
page.
# path to data
# note the use of the here() package and not absolute paths
<- here::here("data","flu","processeddata.rds")
data_location
# load data using the "ReadRDS" function in base R.
<- base::readRDS(data_location) mydata
For a more robust exploration, refer to the Exploratory Data Analysis
page. However, for reference, we can look at the dataframe summary using the skimr
package.
#summary of data using skimr package
::skim(mydata) skimr
Name | mydata |
Number of rows | 730 |
Number of columns | 32 |
_______________________ | |
Column type frequency: | |
factor | 31 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
SwollenLymphNodes | 0 | 1 | FALSE | 2 | No: 418, Yes: 312 |
ChestCongestion | 0 | 1 | FALSE | 2 | Yes: 407, No: 323 |
ChillsSweats | 0 | 1 | FALSE | 2 | Yes: 600, No: 130 |
NasalCongestion | 0 | 1 | FALSE | 2 | Yes: 563, No: 167 |
CoughYN | 0 | 1 | FALSE | 2 | Yes: 655, No: 75 |
Sneeze | 0 | 1 | FALSE | 2 | Yes: 391, No: 339 |
Fatigue | 0 | 1 | FALSE | 2 | Yes: 666, No: 64 |
SubjectiveFever | 0 | 1 | FALSE | 2 | Yes: 500, No: 230 |
Headache | 0 | 1 | FALSE | 2 | Yes: 615, No: 115 |
Weakness | 0 | 1 | FALSE | 4 | Mod: 338, Mil: 223, Sev: 120, Non: 49 |
WeaknessYN | 0 | 1 | FALSE | 2 | Yes: 681, No: 49 |
CoughIntensity | 0 | 1 | FALSE | 4 | Mod: 357, Sev: 172, Mil: 154, Non: 47 |
CoughYN2 | 0 | 1 | FALSE | 2 | Yes: 683, No: 47 |
Myalgia | 0 | 1 | FALSE | 4 | Mod: 325, Mil: 213, Sev: 113, Non: 79 |
MyalgiaYN | 0 | 1 | FALSE | 2 | Yes: 651, No: 79 |
RunnyNose | 0 | 1 | FALSE | 2 | Yes: 519, No: 211 |
AbPain | 0 | 1 | FALSE | 2 | No: 639, Yes: 91 |
ChestPain | 0 | 1 | FALSE | 2 | No: 497, Yes: 233 |
Diarrhea | 0 | 1 | FALSE | 2 | No: 631, Yes: 99 |
EyePn | 0 | 1 | FALSE | 2 | No: 617, Yes: 113 |
Insomnia | 0 | 1 | FALSE | 2 | Yes: 415, No: 315 |
ItchyEye | 0 | 1 | FALSE | 2 | No: 551, Yes: 179 |
Nausea | 0 | 1 | FALSE | 2 | No: 475, Yes: 255 |
EarPn | 0 | 1 | FALSE | 2 | No: 568, Yes: 162 |
Hearing | 0 | 1 | FALSE | 2 | No: 700, Yes: 30 |
Pharyngitis | 0 | 1 | FALSE | 2 | Yes: 611, No: 119 |
Breathless | 0 | 1 | FALSE | 2 | No: 436, Yes: 294 |
ToothPn | 0 | 1 | FALSE | 2 | No: 565, Yes: 165 |
Vision | 0 | 1 | FALSE | 2 | No: 711, Yes: 19 |
Vomit | 0 | 1 | FALSE | 2 | No: 652, Yes: 78 |
Wheeze | 0 | 1 | FALSE | 2 | No: 510, Yes: 220 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
BodyTemp | 0 | 1 | 98.94 | 1.2 | 97.2 | 98.2 | 98.5 | 99.3 | 103.1 | ▇▇▂▁▁ |
The first linear model to create is a simple one using the continuous outcome of interest (BodyTemp) and main predictor of interest (RunnyNose). Using the parsnip package, specify the functional form of the model (linear regression) and method for fitting the model, aka engine (“lm”).
#save the model object as lm_mod
<-
lm_mod ::linear_reg() %>%
parsnip::set_engine("lm") parsnip
Now estimate the model using the fit function and summarize the linear model.
<- lm_mod %>%
lm_fit1 fit(BodyTemp ~ RunnyNose, data = mydata)
#summarize linear model
<- broom.mixed::tidy(lm_fit1)
lm_fit1_summary lm_fit1_summary
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 99.1 0.0819 1210. 0
## 2 RunnyNoseYes -0.293 0.0971 -3.01 0.00268
The intercept estimate (no runny nose) is 99.1F, which would make these patients febrile. We can interpret the slope estimate as: Patients with a runny nose on average have a 0.293F lower body temperature than patients without a runny nose.
Next, we can create a box and whisker plot for lm_fit1 output.
<- broom.mixed::tidy(lm_fit1) %>%
lm_fit1_bp ::dwplot(dot_args = list(size = 2, color = "blue"),
dotwhiskerwhisker_args = list(color = "blue"),
vline = geom_vline(xintercept = 0, colour = "grey50", linetype = 2))
lm_fit1_bp
This shows us the estimate is significant (i.e. doesn’t cross the null hypothesis), and runny nose is a protective factor against increased body temperature.
Last, we can use the glance function to examine goodness of fit measures.
<- modelsummary::glance(lm_fit1)
lm_fit1_gf lm_fit1_gf
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0123 0.0110 1.19 9.08 0.00268 1 -1162. 2329. 2343.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
This model has an extremely low R^2, high AIC, BIC.
The second linear model to create includes all variables in the dataset as predictors with the main outcome of interest as BodyTemp. We can use the same lm_mod function, so no need to respecify.
#create model including all predictors (defined using the . instead of specifying all variable names)
#doesn't include interaction terms
<- lm_mod %>%
lm_fit2 fit(BodyTemp ~ ., data = mydata)
Now we can summarize the full linear model
#add print function to show all rows
<- print(broom.mixed::tidy(lm_fit2), n = 38) lm_fit2_summary
## # A tibble: 38 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 97.9 0.304 322. 0
## 2 SwollenLymphNodesYes -0.165 0.0920 -1.80 0.0727
## 3 ChestCongestionYes 0.0873 0.0975 0.895 0.371
## 4 ChillsSweatsYes 0.201 0.127 1.58 0.114
## 5 NasalCongestionYes -0.216 0.114 -1.90 0.0584
## 6 CoughYNYes 0.314 0.241 1.30 0.193
## 7 SneezeYes -0.362 0.0983 -3.68 0.000249
## 8 FatigueYes 0.265 0.161 1.65 0.0996
## 9 SubjectiveFeverYes 0.437 0.103 4.22 0.0000271
## 10 HeadacheYes 0.0115 0.125 0.0913 0.927
## 11 WeaknessMild 0.0182 0.189 0.0964 0.923
## 12 WeaknessModerate 0.0989 0.198 0.500 0.617
## 13 WeaknessSevere 0.373 0.231 1.62 0.106
## 14 WeaknessYNYes NA NA NA NA
## 15 CoughIntensityMild 0.0849 0.280 0.303 0.762
## 16 CoughIntensityModerate -0.0614 0.302 -0.203 0.839
## 17 CoughIntensitySevere -0.0373 0.314 -0.119 0.906
## 18 CoughYN2Yes NA NA NA NA
## 19 MyalgiaMild 0.164 0.160 1.02 0.307
## 20 MyalgiaModerate -0.0241 0.168 -0.143 0.886
## 21 MyalgiaSevere -0.129 0.208 -0.622 0.534
## 22 MyalgiaYNYes NA NA NA NA
## 23 RunnyNoseYes -0.0805 0.109 -0.742 0.459
## 24 AbPainYes 0.0316 0.140 0.225 0.822
## 25 ChestPainYes 0.105 0.107 0.982 0.326
## 26 DiarrheaYes -0.157 0.130 -1.21 0.227
## 27 EyePnYes 0.132 0.130 1.01 0.311
## 28 InsomniaYes -0.00682 0.0908 -0.0752 0.940
## 29 ItchyEyeYes -0.00802 0.110 -0.0727 0.942
## 30 NauseaYes -0.0341 0.102 -0.334 0.739
## 31 EarPnYes 0.0938 0.114 0.824 0.410
## 32 HearingYes 0.232 0.222 1.05 0.296
## 33 PharyngitisYes 0.318 0.121 2.62 0.00906
## 34 BreathlessYes 0.0905 0.0998 0.907 0.365
## 35 ToothPnYes -0.0229 0.114 -0.201 0.841
## 36 VisionYes -0.275 0.278 -0.989 0.323
## 37 VomitYes 0.165 0.151 1.09 0.275
## 38 WheezeYes -0.0467 0.107 -0.436 0.663
lm_fit2_summary
## # A tibble: 38 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 97.9 0.304 322. 0
## 2 SwollenLymphNodesYes -0.165 0.0920 -1.80 0.0727
## 3 ChestCongestionYes 0.0873 0.0975 0.895 0.371
## 4 ChillsSweatsYes 0.201 0.127 1.58 0.114
## 5 NasalCongestionYes -0.216 0.114 -1.90 0.0584
## 6 CoughYNYes 0.314 0.241 1.30 0.193
## 7 SneezeYes -0.362 0.0983 -3.68 0.000249
## 8 FatigueYes 0.265 0.161 1.65 0.0996
## 9 SubjectiveFeverYes 0.437 0.103 4.22 0.0000271
## 10 HeadacheYes 0.0115 0.125 0.0913 0.927
## # ... with 28 more rows
#the NA lines are where all patients are reporting the symptom (so no comparison possibility)
#export results into a table using the gtsummary package
::tbl_regression(lm_fit2) gtsummary
## Extracting {parsnip} model fit with `tbl_regression(x = x$fit, ...)`
Characteristic | Beta | 95% CI1 | p-value |
---|---|---|---|
SwollenLymphNodes | |||
No | — | — | |
Yes | -0.17 | -0.35, 0.02 | 0.073 |
ChestCongestion | |||
No | — | — | |
Yes | 0.09 | -0.10, 0.28 | 0.4 |
ChillsSweats | |||
No | — | — | |
Yes | 0.20 | -0.05, 0.45 | 0.11 |
NasalCongestion | |||
No | — | — | |
Yes | -0.22 | -0.44, 0.01 | 0.058 |
CoughYN | |||
No | — | — | |
Yes | 0.31 | -0.16, 0.79 | 0.2 |
Sneeze | |||
No | — | — | |
Yes | -0.36 | -0.55, -0.17 | <0.001 |
Fatigue | |||
No | — | — | |
Yes | 0.26 | -0.05, 0.58 | 0.10 |
SubjectiveFever | |||
No | — | — | |
Yes | 0.44 | 0.23, 0.64 | <0.001 |
Headache | |||
No | — | — | |
Yes | 0.01 | -0.23, 0.26 | >0.9 |
Weakness | |||
None | — | — | |
Mild | 0.02 | -0.35, 0.39 | >0.9 |
Moderate | 0.10 | -0.29, 0.49 | 0.6 |
Severe | 0.37 | -0.08, 0.83 | 0.11 |
WeaknessYN | |||
No | — | — | |
Yes | |||
CoughIntensity | |||
None | — | — | |
Mild | 0.08 | -0.46, 0.63 | 0.8 |
Moderate | -0.06 | -0.65, 0.53 | 0.8 |
Severe | -0.04 | -0.65, 0.58 | >0.9 |
CoughYN2 | |||
No | — | — | |
Yes | |||
Myalgia | |||
None | — | — | |
Mild | 0.16 | -0.15, 0.48 | 0.3 |
Moderate | -0.02 | -0.35, 0.31 | 0.9 |
Severe | -0.13 | -0.54, 0.28 | 0.5 |
MyalgiaYN | |||
No | — | — | |
Yes | |||
RunnyNose | |||
No | — | — | |
Yes | -0.08 | -0.29, 0.13 | 0.5 |
AbPain | |||
No | — | — | |
Yes | 0.03 | -0.24, 0.31 | 0.8 |
ChestPain | |||
No | — | — | |
Yes | 0.11 | -0.10, 0.32 | 0.3 |
Diarrhea | |||
No | — | — | |
Yes | -0.16 | -0.41, 0.10 | 0.2 |
EyePn | |||
No | — | — | |
Yes | 0.13 | -0.12, 0.39 | 0.3 |
Insomnia | |||
No | — | — | |
Yes | -0.01 | -0.19, 0.17 | >0.9 |
ItchyEye | |||
No | — | — | |
Yes | -0.01 | -0.22, 0.21 | >0.9 |
Nausea | |||
No | — | — | |
Yes | -0.03 | -0.23, 0.17 | 0.7 |
EarPn | |||
No | — | — | |
Yes | 0.09 | -0.13, 0.32 | 0.4 |
Hearing | |||
No | — | — | |
Yes | 0.23 | -0.20, 0.67 | 0.3 |
Pharyngitis | |||
No | — | — | |
Yes | 0.32 | 0.08, 0.56 | 0.009 |
Breathless | |||
No | — | — | |
Yes | 0.09 | -0.11, 0.29 | 0.4 |
ToothPn | |||
No | — | — | |
Yes | -0.02 | -0.25, 0.20 | 0.8 |
Vision | |||
No | — | — | |
Yes | -0.27 | -0.82, 0.27 | 0.3 |
Vomit | |||
No | — | — | |
Yes | 0.17 | -0.13, 0.46 | 0.3 |
Wheeze | |||
No | — | — | |
Yes | -0.05 | -0.26, 0.16 | 0.7 |
1
CI = Confidence Interval
|
Within this fit, significant predictors at alpha = 0.05 are Sneeze, SubjectiveFever, & Pharyngitis.
Now we can create a box and whisker plot for lm_fit2 output.
<- broom.mixed::tidy(lm_fit2) %>%
lm_fit2_bp1 ::dwplot(dot_args = list(size = 2, color = "blue"),
dotwhiskerwhisker_args = list(color = "blue"),
vline = geom_vline(xintercept = 0, colour = "grey50", linetype = 2))
lm_fit2_bp1
#there's a lot of information here, but hard to identify the ones that are significant due to volume
#box and whisker plot for lm_fit2 significant predictors
#first filter significant results
<- broom.mixed::tidy(lm_fit2) %>%
lm_fit2_sig ::filter(p.value < 0.05)
dplyr
#box and whisker plot for lm_fit2 significant predictors
<- lm_fit2_sig %>%
lm_fit2_bp2 ::dwplot(dot_args = list(size = 2, color = "blue"),
dotwhiskerwhisker_args = list(color = "blue"),
vline = geom_vline(xintercept = 0, colour = "grey50", linetype = 2))
lm_fit2_bp2
Last, we can use the glance function to examine goodness of fit measures
<- modelsummary::glance(lm_fit2)
lm_fit2_gf lm_fit2_gf
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.129 0.0860 1.14 3.02 0.0000000420 34 -1116. 2304. 2469.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
In comparison to lm_fit1, this model has an increased R^2, slightly lower AIC, BIC, but we need the formal comparison to better understand the significance.
First, we want to combine results of the two models into one table.
#create a list of the two models
<- list(lm_fit1, lm_fit2)
lm_models
#using the model summary package
#list estimate, 95% confidence intervals, and highlight ones with significant p-values
#hide intercept estimate
::modelsummary(lm_models,
modelsummarystars = TRUE,
fmt = '%.3f',
estimate = "{estimate} [{conf.low}, {conf.high}] {stars}",
statistic = NULL,
coef_omit = "Intercept")
## Warning: In version 0.8.0 of the `modelsummary` package, the default significance markers produced by the `stars=TRUE` argument were changed to be consistent with R's defaults.
## This warning is displayed once per session.
Model 1 | Model 2 | |
---|---|---|
RunnyNoseYes | -0.293 [-0.483, -0.102] ** | -0.080 [-0.294, 0.133] |
SwollenLymphNodesYes | -0.165 [-0.346, 0.015] + | |
ChestCongestionYes | 0.087 [-0.104, 0.279] | |
ChillsSweatsYes | 0.201 [-0.049, 0.451] | |
NasalCongestionYes | -0.216 [-0.439, 0.008] + | |
CoughYNYes | 0.314 [-0.159, 0.787] | |
SneezeYes | -0.362 [-0.555, -0.169] *** | |
FatigueYes | 0.265 [-0.050, 0.580] + | |
SubjectiveFeverYes | 0.437 [0.234, 0.640] *** | |
HeadacheYes | 0.011 [-0.235, 0.258] | |
WeaknessMild | 0.018 [-0.353, 0.390] | |
WeaknessModerate | 0.099 [-0.290, 0.487] | |
WeaknessSevere | 0.373 [-0.080, 0.827] | |
WeaknessYNYes | ||
CoughIntensityMild | 0.085 [-0.465, 0.634] | |
CoughIntensityModerate | -0.061 [-0.654, 0.532] | |
CoughIntensitySevere | -0.037 [-0.654, 0.579] | |
CoughYN2Yes | ||
MyalgiaMild | 0.164 [-0.151, 0.479] | |
MyalgiaModerate | -0.024 [-0.354, 0.305] | |
MyalgiaSevere | -0.129 [-0.537, 0.279] | |
MyalgiaYNYes | ||
AbPainYes | 0.032 [-0.244, 0.307] | |
ChestPainYes | 0.105 [-0.105, 0.315] | |
DiarrheaYes | -0.157 [-0.411, 0.098] | |
EyePnYes | 0.132 [-0.123, 0.386] | |
InsomniaYes | -0.007 [-0.185, 0.171] | |
ItchyEyeYes | -0.008 [-0.224, 0.208] | |
NauseaYes | -0.034 [-0.234, 0.166] | |
EarPnYes | 0.094 [-0.130, 0.317] | |
HearingYes | 0.232 [-0.204, 0.668] | |
PharyngitisYes | 0.318 [0.079, 0.556] ** | |
BreathlessYes | 0.091 [-0.105, 0.287] | |
ToothPnYes | -0.023 [-0.246, 0.200] | |
VisionYes | -0.275 [-0.820, 0.271] | |
VomitYes | 0.165 [-0.132, 0.463] | |
WheezeYes | -0.047 [-0.257, 0.163] | |
Num.Obs. | 730 | 730 |
R2 | 0.012 | 0.129 |
R2 Adj. | 0.011 | 0.086 |
AIC | 2329.3 | 2303.8 |
BIC | 2343.1 | 2469.2 |
Log.Lik. | -1161.673 | -1115.920 |
In this comparison, we can see the significance of the runny nose estimate decreases, the runny nose Beta estimate gets closer to zero. Moreover, in comparison to lm_fit1, lm_fit2 has increased R^2, slightly lower AIC, BIC, log likelihood.
Next, we can conduct an ANOVA to formally compare the two linear regression models
<- anova(lm_fit1$fit, lm_fit2$fit, test = "Chisq")
lm_anova lm_anova
## Analysis of Variance Table
##
## Model 1: BodyTemp ~ RunnyNose
## Model 2: BodyTemp ~ SwollenLymphNodes + ChestCongestion + ChillsSweats +
## NasalCongestion + CoughYN + Sneeze + Fatigue + SubjectiveFever +
## Headache + Weakness + WeaknessYN + CoughIntensity + CoughYN2 +
## Myalgia + MyalgiaYN + RunnyNose + AbPain + ChestPain + Diarrhea +
## EyePn + Insomnia + ItchyEye + Nausea + EarPn + Hearing +
## Pharyngitis + Breathless + ToothPn + Vision + Vomit + Wheeze
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 728 1030.53
## 2 695 909.12 33 121.41 0.0000001357 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the p-value from the ANOVA, we can conclude the more complex model better describes the data than the SLR. This is also supported by the comparison of AIC and BIC above.
The next model to create is a simple one using the categorical outcome of interest (Nausea) and main predictor of interest (RunnyNose). Using the parsnip package, specify the functional form of the model (logistic regression) and method for fitting the model, aka engine (“glm”)
#save the model object as log_mod
<-
log_mod ::logistic_reg() %>%
parsnip::set_engine("glm") parsnip
We can now estimate the model using the fit function.
<- log_mod %>%
log_fit1 fit(Nausea ~ RunnyNose, data = mydata)
Next, we can summarize logistic model with the tidy function. However, we want to exponentiate estimates to make them interpretable odds ratios.
<- broom.mixed::tidy(log_fit1, exponentiate = TRUE)
log_fit1_summary log_fit1_summary
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.518 0.145 -4.53 0.00000589
## 2 RunnyNoseYes 1.05 0.172 0.292 0.770
This does not yield a significant estimate (p > 0.05).
Now, we can create a box and whisker plot for log_fit1 output.
<- broom.mixed::tidy(log_fit1) %>%
log_fit1_bp ::dwplot(dot_args = list(size = 2, color = "blue"),
dotwhiskerwhisker_args = list(color = "blue"),
vline = geom_vline(xintercept = 0, colour = "grey50", linetype = 2))
log_fit1_bp
This doesn’t really mean much since we only have one category beyond not significant results.
Lastly, we can use the glance function to examine goodness of fit measures.
<-modelsummary::glance(log_fit1)
log_fit1_gf log_fit1_gf
## # A tibble: 1 x 8
## null.deviance df.null logLik AIC BIC deviance df.residual nobs
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 945. 729 -472. 949. 958. 945. 728 730
We can see this fit has a lower AIC, BIC, loglikelihood than linear model. This is obviously not a direct comparison but still an interesting result.
The second logistic model to create includes all variables in the dataset as predictors with the main outcome of interest as Nausea. We can use the same log_mod function, so no need to respecify. We first want to create model including all predictors (defined using the . instead of specifying all variable names).
#doesn't include interaction terms
<- log_mod %>%
log_fit2 fit(Nausea ~ ., data = mydata)
Next, we can summarize logistic model with the tidy function. However, we want to exponentiate estimates to make them interpretable odds ratios.
#adding the print function to see all rows
<- print(broom.mixed::tidy(log_fit2, exponentiate = TRUE), n = 38) log_fit2_summary
## # A tibble: 38 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.25 7.83 0.0285 9.77e- 1
## 2 SwollenLymphNodesYes 0.778 0.196 -1.28 2.00e- 1
## 3 ChestCongestionYes 1.32 0.213 1.30 1.95e- 1
## 4 ChillsSweatsYes 1.32 0.288 0.952 3.41e- 1
## 5 NasalCongestionYes 1.53 0.255 1.67 9.44e- 2
## 6 CoughYNYes 0.869 0.519 -0.271 7.87e- 1
## 7 SneezeYes 1.19 0.210 0.840 4.01e- 1
## 8 FatigueYes 1.26 0.372 0.616 5.38e- 1
## 9 SubjectiveFeverYes 1.32 0.225 1.23 2.18e- 1
## 10 HeadacheYes 1.39 0.285 1.16 2.45e- 1
## 11 WeaknessMild 0.885 0.447 -0.272 7.86e- 1
## 12 WeaknessModerate 1.36 0.454 0.684 4.94e- 1
## 13 WeaknessSevere 2.28 0.510 1.61 1.07e- 1
## 14 WeaknessYNYes NA NA NA NA
## 15 CoughIntensityMild 0.802 0.584 -0.378 7.06e- 1
## 16 CoughIntensityModerate 0.696 0.631 -0.574 5.66e- 1
## 17 CoughIntensitySevere 0.387 0.658 -1.44 1.49e- 1
## 18 CoughYN2Yes NA NA NA NA
## 19 MyalgiaMild 0.996 0.368 -0.0113 9.91e- 1
## 20 MyalgiaModerate 1.23 0.373 0.549 5.83e- 1
## 21 MyalgiaSevere 1.13 0.445 0.271 7.86e- 1
## 22 MyalgiaYNYes NA NA NA NA
## 23 RunnyNoseYes 1.05 0.233 0.195 8.46e- 1
## 24 AbPainYes 2.56 0.281 3.34 8.46e- 4
## 25 ChestPainYes 1.07 0.228 0.311 7.56e- 1
## 26 DiarrheaYes 2.90 0.259 4.11 3.91e- 5
## 27 EyePnYes 0.710 0.278 -1.23 2.18e- 1
## 28 InsomniaYes 1.09 0.193 0.436 6.63e- 1
## 29 ItchyEyeYes 0.939 0.233 -0.273 7.85e- 1
## 30 EarPnYes 0.834 0.239 -0.760 4.47e- 1
## 31 HearingYes 1.38 0.452 0.714 4.75e- 1
## 32 PharyngitisYes 1.32 0.266 1.03 3.01e- 1
## 33 BreathlessYes 1.69 0.209 2.53 1.15e- 2
## 34 ToothPnYes 1.62 0.229 2.09 3.62e- 2
## 35 VisionYes 1.13 0.541 0.232 8.17e- 1
## 36 VomitYes 11.7 0.349 7.05 1.76e-12
## 37 WheezeYes 0.738 0.234 -1.30 1.93e- 1
## 38 BodyTemp 0.969 0.0798 -0.391 6.96e- 1
log_fit2_summary
## # A tibble: 38 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.25 7.83 0.0285 0.977
## 2 SwollenLymphNodesYes 0.778 0.196 -1.28 0.200
## 3 ChestCongestionYes 1.32 0.213 1.30 0.195
## 4 ChillsSweatsYes 1.32 0.288 0.952 0.341
## 5 NasalCongestionYes 1.53 0.255 1.67 0.0944
## 6 CoughYNYes 0.869 0.519 -0.271 0.787
## 7 SneezeYes 1.19 0.210 0.840 0.401
## 8 FatigueYes 1.26 0.372 0.616 0.538
## 9 SubjectiveFeverYes 1.32 0.225 1.23 0.218
## 10 HeadacheYes 1.39 0.285 1.16 0.245
## # ... with 28 more rows
#the NA lines are where all patients are reporting the symptom (so no comparison possibility)
#export results into a table using the gtsummary package
<- gtsummary::tbl_regression(log_fit2, exponentiate = TRUE) log_fit2_gtsummary
## Extracting {parsnip} model fit with `tbl_regression(x = x$fit, ...)`
log_fit2_gtsummary
Characteristic | OR1 | 95% CI1 | p-value |
---|---|---|---|
SwollenLymphNodes | |||
No | — | — | |
Yes | 0.78 | 0.53, 1.14 | 0.2 |
ChestCongestion | |||
No | — | — | |
Yes | 1.32 | 0.87, 2.00 | 0.2 |
ChillsSweats | |||
No | — | — | |
Yes | 1.32 | 0.75, 2.34 | 0.3 |
NasalCongestion | |||
No | — | — | |
Yes | 1.53 | 0.94, 2.54 | 0.094 |
CoughYN | |||
No | — | — | |
Yes | 0.87 | 0.32, 2.49 | 0.8 |
Sneeze | |||
No | — | — | |
Yes | 1.19 | 0.79, 1.81 | 0.4 |
Fatigue | |||
No | — | — | |
Yes | 1.26 | 0.62, 2.69 | 0.5 |
SubjectiveFever | |||
No | — | — | |
Yes | 1.32 | 0.85, 2.06 | 0.2 |
Headache | |||
No | — | — | |
Yes | 1.39 | 0.81, 2.47 | 0.2 |
Weakness | |||
None | — | — | |
Mild | 0.89 | 0.38, 2.22 | 0.8 |
Moderate | 1.36 | 0.58, 3.47 | 0.5 |
Severe | 2.28 | 0.86, 6.42 | 0.11 |
WeaknessYN | |||
No | — | — | |
Yes | |||
CoughIntensity | |||
None | — | — | |
Mild | 0.80 | 0.25, 2.49 | 0.7 |
Moderate | 0.70 | 0.20, 2.37 | 0.6 |
Severe | 0.39 | 0.10, 1.38 | 0.15 |
CoughYN2 | |||
No | — | — | |
Yes | |||
Myalgia | |||
None | — | — | |
Mild | 1.00 | 0.49, 2.09 | >0.9 |
Moderate | 1.23 | 0.60, 2.61 | 0.6 |
Severe | 1.13 | 0.48, 2.73 | 0.8 |
MyalgiaYN | |||
No | — | — | |
Yes | |||
RunnyNose | |||
No | — | — | |
Yes | 1.05 | 0.66, 1.66 | 0.8 |
AbPain | |||
No | — | — | |
Yes | 2.56 | 1.48, 4.47 | <0.001 |
ChestPain | |||
No | — | — | |
Yes | 1.07 | 0.68, 1.67 | 0.8 |
Diarrhea | |||
No | — | — | |
Yes | 2.90 | 1.75, 4.83 | <0.001 |
EyePn | |||
No | — | — | |
Yes | 0.71 | 0.41, 1.22 | 0.2 |
Insomnia | |||
No | — | — | |
Yes | 1.09 | 0.75, 1.59 | 0.7 |
ItchyEye | |||
No | — | — | |
Yes | 0.94 | 0.59, 1.48 | 0.8 |
EarPn | |||
No | — | — | |
Yes | 0.83 | 0.52, 1.33 | 0.4 |
Hearing | |||
No | — | — | |
Yes | 1.38 | 0.56, 3.33 | 0.5 |
Pharyngitis | |||
No | — | — | |
Yes | 1.32 | 0.79, 2.24 | 0.3 |
Breathless | |||
No | — | — | |
Yes | 1.69 | 1.12, 2.55 | 0.012 |
ToothPn | |||
No | — | — | |
Yes | 1.62 | 1.03, 2.54 | 0.036 |
Vision | |||
No | — | — | |
Yes | 1.13 | 0.38, 3.29 | 0.8 |
Vomit | |||
No | — | — | |
Yes | 11.7 | 6.07, 24.0 | <0.001 |
Wheeze | |||
No | — | — | |
Yes | 0.74 | 0.46, 1.16 | 0.2 |
BodyTemp | 0.97 | 0.83, 1.13 | 0.7 |
1
OR = Odds Ratio, CI = Confidence Interval
|
In this model, the significant predictors at alpha = 0.05 are AbPain, Diarrhea, Breathless, ToothPn, and Vomit. Vomit and Diarrhea makes sense as they often co-present with nausea, and abdominal pain as well as breathlessness make clinical sense. Tooth pain is a surprising result.
Now we can create a box and whisker plot for log_fit2 output.
<- broom.mixed::tidy(log_fit2) %>%
log_fit2_bp1 ::dwplot(dot_args = list(size = 2, color = "blue"),
dotwhiskerwhisker_args = list(color = "blue"),
vline = geom_vline(xintercept = 0, colour = "grey50", linetype = 2))
log_fit2_bp1
#there's a lot of information here, but hard to identify the ones that are significant due to volume
#box and whisker plot for log_fit2 significant predictors
#first filter significant results
<- broom.mixed::tidy(log_fit2) %>%
log_fit2_sig ::filter(p.value < 0.05)
dplyr
#box and whisker plot for log_fit2 significant predictors
<- log_fit2_sig %>%
log_fit2_bp2 ::dwplot(dot_args = list(size = 2, color = "blue"),
dotwhiskerwhisker_args = list(color = "blue"),
vline = geom_vline(xintercept = 0, colour = "grey50", linetype = 2))
log_fit2_bp2
All significant predictors in this model increase odds of having nausea. It makes sense vomit and diarrhea have the highest OR, given the frequent co-presentation of the three symptoms.
Lastly, we can use the glance function to examine goodness of fit measures.
<- modelsummary::glance(log_fit2)
log_fit2_gf log_fit2_gf
## # A tibble: 1 x 8
## null.deviance df.null logLik AIC BIC deviance df.residual nobs
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 945. 729 -376. 821. 982. 751. 695 730
In comparison to log_fit1, this model has a lower log likelihood, AIC, BIC, deviance, but we need the formal comparison to better understand the significance.
First, we need to combine results of the two models into one table.
<- list(log_fit1, log_fit2)
log_models
#list estimate, 95% confidence intervals, and highlight ones with significant p-values
#hide intercept estimate
::modelsummary(log_models,
modelsummarystars = TRUE,
fmt = '%.3f',
exponentiate = TRUE,
estimate = "{estimate} [{conf.low}, {conf.high}] {stars}",
statistic = NULL,
coef_omit = "Intercept")
Model 1 | Model 2 | |
---|---|---|
RunnyNoseYes | 1.051 [0.753, 1.477] | 1.046 [0.664, 1.656] |
SwollenLymphNodesYes | 0.778 [0.528, 1.140] | |
ChestCongestionYes | 1.317 [0.869, 2.003] | |
ChillsSweatsYes | 1.315 [0.755, 2.339] | |
NasalCongestionYes | 1.531 [0.935, 2.541] + | |
CoughYNYes | 0.869 [0.321, 2.487] | |
SneezeYes | 1.193 [0.791, 1.805] | |
FatigueYes | 1.257 [0.621, 2.693] | |
SubjectiveFeverYes | 1.320 [0.851, 2.063] | |
HeadacheYes | 1.393 [0.806, 2.473] | |
WeaknessMild | 0.885 [0.380, 2.219] | |
WeaknessModerate | 1.365 [0.577, 3.474] | |
WeaknessSevere | 2.278 [0.859, 6.423] | |
WeaknessYNYes | ||
CoughIntensityMild | 0.802 [0.249, 2.494] | |
CoughIntensityModerate | 0.696 [0.197, 2.368] | |
CoughIntensitySevere | 0.387 [0.104, 1.384] | |
CoughYN2Yes | ||
MyalgiaMild | 0.996 [0.491, 2.092] | |
MyalgiaModerate | 1.227 [0.600, 2.607] | |
MyalgiaSevere | 1.128 [0.475, 2.733] | |
MyalgiaYNYes | ||
AbPainYes | 2.558 [1.478, 4.469] *** | |
ChestPainYes | 1.073 [0.685, 1.675] | |
DiarrheaYes | 2.898 [1.749, 4.834] *** | |
EyePnYes | 0.710 [0.408, 1.215] | |
InsomniaYes | 1.088 [0.745, 1.590] | |
ItchyEyeYes | 0.939 [0.592, 1.476] | |
EarPnYes | 0.834 [0.519, 1.327] | |
HearingYes | 1.381 [0.558, 3.328] | |
PharyngitisYes | 1.317 [0.788, 2.242] | |
BreathlessYes | 1.694 [1.125, 2.551] * | |
ToothPnYes | 1.617 [1.030, 2.536] * | |
VisionYes | 1.134 [0.385, 3.294] | |
VomitYes | 11.687 [6.072, 23.999] *** | |
WheezeYes | 0.738 [0.464, 1.162] | |
BodyTemp | 0.969 [0.827, 1.132] | |
Num.Obs. | 730 | 730 |
AIC | 948.6 | 821.5 |
BIC | 957.8 | 982.2 |
Log.Lik. | -472.283 | -375.735 |
Interestingly, runny nose is not a significant predictor in either model. In comparison to log_fit1, log_fit2 has a slightly lower AIC and log likelihood, but higher BIC.
Next, we can conduct an ANOVA to compare the two linear regression models
<- anova(log_fit1$fit, log_fit2$fit, test = "Chisq")
log_anova log_anova
## Analysis of Deviance Table
##
## Model 1: Nausea ~ RunnyNose
## Model 2: Nausea ~ SwollenLymphNodes + ChestCongestion + ChillsSweats +
## NasalCongestion + CoughYN + Sneeze + Fatigue + SubjectiveFever +
## Headache + Weakness + WeaknessYN + CoughIntensity + CoughYN2 +
## Myalgia + MyalgiaYN + RunnyNose + AbPain + ChestPain + Diarrhea +
## EyePn + Insomnia + ItchyEye + EarPn + Hearing + Pharyngitis +
## Breathless + ToothPn + Vision + Vomit + Wheeze + BodyTemp
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 728 944.57
## 2 695 751.47 33 193.09 < 0.00000000000000022 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the p-value from the ANOVA, we can conclude the more complex model better describes the data than the simple logistic regression, which is also supported by the comparison of measures above. However, the higher BIC suggests the significance noted may be a result from the number of parameters included in the model. In a real-world analysis, we would need to examine the potential of overfitting occurring in the full logistic regression model.