Introduction
This exercise focuses on conducting exploratory data analysis (EDA) on the influenza data.
The raw data for this exercise comes from the following citation: McKay, Brian et al. (2020), Virulence-mediated infectiousness and activity trade-offs and their impact on transmission potential of patients infected with influenza, Dryad, Dataset, https://doi.org/10.5061/dryad.51c59zw4v.
The processed data was produced on the Data Processing tab.
Within this analysis, the main continuous outcome of interest is body temperature and the main categorical outcome of interest is nausea. For each symptom, we will do the following steps:
- (1) Produce and print some numerical output (e.g. table, summary statistics)
- (2) Create a histogram or density plot (continuous variables only)
- (3) Create a scatterplot, boxplot, or other similar plot against the main outcomes of interest
- (4) Carry out any other exploration steps that may be useful.
For each variable, the EDA steps are labeled in the code comments with the numbers listed above.
Required Packages
The following R packages are required for this exercise:
- here: for path setting
- tidyverse: for all packages in the Tidyverse (ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats)
- summarytools: for overall dataframe summary
- ggplot2: for plotting data
- car: for creating QQ plots
- table1: for creating tables for summary statistics / other numerical outputs
- scales: for percent calculation
- magrittr: for sequential piping
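Before running the analysis, it can help to confirm that these packages are installed. The following is a minimal sketch using only base R; the names `required_pkgs`, `missing_packages`, and `not_installed` are illustrative and not part of the original exercise.

```r
# Packages named in the list above (tidyverse already bundles ggplot2)
required_pkgs <- c("here", "tidyverse", "summarytools", "ggplot2",
                   "car", "table1", "scales", "magrittr")

# Return the subset of package names that cannot be loaded on this machine
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

not_installed <- missing_packages(required_pkgs)
if (length(not_installed) > 0) {
  message("Install before continuing: ", paste(not_installed, collapse = ", "))
}
```

Any names reported can then be passed to `install.packages()`.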
Load Processed Data
Load the data created on the Data Processing page.
# path to data
# note the use of the here() package and not absolute paths
data_location <- here::here("data","flu","processeddata.rds")
# load data using the "readRDS" function in base R.
processeddata <- base::readRDS(data_location)
# create a mirror dataset to avoid manipulating the cleaned dataset
EDAdata <- processeddata
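As a small illustration of why the .rds format and a working copy are convenient here, the sketch below round-trips a toy data frame through a temporary file. The `toy` data frame is a hypothetical stand-in, not the study data, and the temporary path is used only so the example is self-contained.

```r
# Hypothetical stand-in for the processed data (NOT the real dataset)
toy <- data.frame(
  Nausea   = factor(c("No", "Yes", "No"), levels = c("No", "Yes")),
  BodyTemp = c(98.3, 100.4, 99.1)
)

# Write to a temporary .rds file and read it back
tmp <- tempfile(fileext = ".rds")
saveRDS(toy, tmp)
toy_copy <- readRDS(tmp)

# Factor levels are stored with the object, so they survive the round trip
levels(toy_copy$Nausea)

# Modifying the copy leaves the object it was copied from unchanged
toy_copy$BodyTemp <- toy_copy$BodyTemp + 1
toy$BodyTemp
```

This is the same reasoning behind assigning `processeddata` to `EDAdata` before relabeling any variables.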
Data Overview
To better understand the processed data, let’s use summarytools to visualize the dataframe.
summarytools::dfSummary(EDAdata)
## Data Frame Summary
## Dimensions: 730 x 32
## Duplicates: 0
##
## -----------------------------------------------------------------------------------------------------------------
## No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
## ---- ------------------- ------------------------ -------------------- --------------------- ---------- ---------
## 1 SwollenLymphNodes 1. No 418 (57.3%) IIIIIIIIIII 730 0
## [factor] 2. Yes 312 (42.7%) IIIIIIII (100.0%) (0.0%)
##
## 2 ChestCongestion 1. No 323 (44.2%) IIIIIIII 730 0
## [factor] 2. Yes 407 (55.8%) IIIIIIIIIII (100.0%) (0.0%)
##
## 3 ChillsSweats 1. No 130 (17.8%) III 730 0
## [factor] 2. Yes 600 (82.2%) IIIIIIIIIIIIIIII (100.0%) (0.0%)
##
## 4 NasalCongestion 1. No 167 (22.9%) IIII 730 0
## [factor] 2. Yes 563 (77.1%) IIIIIIIIIIIIIII (100.0%) (0.0%)
##
## 5 CoughYN 1. No 75 (10.3%) II 730 0
## [factor] 2. Yes 655 (89.7%) IIIIIIIIIIIIIIIII (100.0%) (0.0%)
##
## 6 Sneeze 1. No 339 (46.4%) IIIIIIIII 730 0
## [factor] 2. Yes 391 (53.6%) IIIIIIIIII (100.0%) (0.0%)
##
## 7 Fatigue 1. No 64 ( 8.8%) I 730 0
## [factor] 2. Yes 666 (91.2%) IIIIIIIIIIIIIIIIII (100.0%) (0.0%)
##
## 8 SubjectiveFever 1. No 230 (31.5%) IIIIII 730 0
## [factor] 2. Yes 500 (68.5%) IIIIIIIIIIIII (100.0%) (0.0%)
##
## 9 Headache 1. No 115 (15.8%) III 730 0
## [factor] 2. Yes 615 (84.2%) IIIIIIIIIIIIIIII (100.0%) (0.0%)
##
## 10 Weakness 1. None 49 ( 6.7%) I 730 0
## [factor] 2. Mild 223 (30.5%) IIIIII (100.0%) (0.0%)
## 3. Moderate 338 (46.3%) IIIIIIIII
## 4. Severe 120 (16.4%) III
##
## 11 WeaknessYN 1. No 49 ( 6.7%) I 730 0
## [factor] 2. Yes 681 (93.3%) IIIIIIIIIIIIIIIIII (100.0%) (0.0%)
##
## 12 CoughIntensity 1. None 47 ( 6.4%) I 730 0
## [factor] 2. Mild 154 (21.1%) IIII (100.0%) (0.0%)
## 3. Moderate 357 (48.9%) IIIIIIIII
## 4. Severe 172 (23.6%) IIII
##
## 13 CoughYN2 1. No 47 ( 6.4%) I 730 0
## [factor] 2. Yes 683 (93.6%) IIIIIIIIIIIIIIIIII (100.0%) (0.0%)
##
## 14 Myalgia 1. None 79 (10.8%) II 730 0
## [factor] 2. Mild 213 (29.2%) IIIII (100.0%) (0.0%)
## 3. Moderate 325 (44.5%) IIIIIIII
## 4. Severe 113 (15.5%) III
##
## 15 MyalgiaYN 1. No 79 (10.8%) II 730 0
## [factor] 2. Yes 651 (89.2%) IIIIIIIIIIIIIIIII (100.0%) (0.0%)
##
## 16 RunnyNose 1. No 211 (28.9%) IIIII 730 0
## [factor] 2. Yes 519 (71.1%) IIIIIIIIIIIIII (100.0%) (0.0%)
##
## 17 AbPain 1. No 639 (87.5%) IIIIIIIIIIIIIIIII 730 0
## [factor] 2. Yes 91 (12.5%) II (100.0%) (0.0%)
##
## 18 ChestPain 1. No 497 (68.1%) IIIIIIIIIIIII 730 0
## [factor] 2. Yes 233 (31.9%) IIIIII (100.0%) (0.0%)
##
## 19 Diarrhea 1. No 631 (86.4%) IIIIIIIIIIIIIIIII 730 0
## [factor] 2. Yes 99 (13.6%) II (100.0%) (0.0%)
##
## 20 EyePn 1. No 617 (84.5%) IIIIIIIIIIIIIIII 730 0
## [factor] 2. Yes 113 (15.5%) III (100.0%) (0.0%)
##
## 21 Insomnia 1. No 315 (43.2%) IIIIIIII 730 0
## [factor] 2. Yes 415 (56.8%) IIIIIIIIIII (100.0%) (0.0%)
##
## 22 ItchyEye 1. No 551 (75.5%) IIIIIIIIIIIIIII 730 0
## [factor] 2. Yes 179 (24.5%) IIII (100.0%) (0.0%)
##
## 23 Nausea 1. No 475 (65.1%) IIIIIIIIIIIII 730 0
## [factor] 2. Yes 255 (34.9%) IIIIII (100.0%) (0.0%)
##
## 24 EarPn 1. No 568 (77.8%) IIIIIIIIIIIIIII 730 0
## [factor] 2. Yes 162 (22.2%) IIII (100.0%) (0.0%)
##
## 25 Hearing 1. No 700 (95.9%) IIIIIIIIIIIIIIIIIII 730 0
## [factor] 2. Yes 30 ( 4.1%) (100.0%) (0.0%)
##
## 26 Pharyngitis 1. No 119 (16.3%) III 730 0
## [factor] 2. Yes 611 (83.7%) IIIIIIIIIIIIIIII (100.0%) (0.0%)
##
## 27 Breathless 1. No 436 (59.7%) IIIIIIIIIII 730 0
## [factor] 2. Yes 294 (40.3%) IIIIIIII (100.0%) (0.0%)
##
## 28 ToothPn 1. No 565 (77.4%) IIIIIIIIIIIIIII 730 0
## [factor] 2. Yes 165 (22.6%) IIII (100.0%) (0.0%)
##
## 29 Vision 1. No 711 (97.4%) IIIIIIIIIIIIIIIIIII 730 0
## [factor] 2. Yes 19 ( 2.6%) (100.0%) (0.0%)
##
## 30 Vomit 1. No 652 (89.3%) IIIIIIIIIIIIIIIII 730 0
## [factor] 2. Yes 78 (10.7%) II (100.0%) (0.0%)
##
## 31 Wheeze 1. No 510 (69.9%) IIIIIIIIIIIII 730 0
## [factor] 2. Yes 220 (30.1%) IIIIII (100.0%) (0.0%)
##
## 32 BodyTemp Mean (sd) : 98.9 (1.2) 57 distinct values : 730 0
## [numeric] min < med < max: : : (100.0%) (0.0%)
## 97.2 < 98.5 < 103.1 : :
## IQR (CV) : 1.1 (0) : : :
## : : : : : . . . .
## -----------------------------------------------------------------------------------------------------------------
Looking at the data summary, there are multiple variables for the same symptom (e.g. a severity category and a yes/no presence indicator). Additionally, several of the categorical variables have an uneven proportion distribution, and BodyTemp appears to have some skew as well.
The outcome variables of interest were specified as part of the MADA course, so what predictor variables may be relevant?
- The data is about influenza, so symptoms commonly associated with influenza are certainly relevant (e.g. runny nose, nasal congestion, chills / sweating, and myalgia).
- Since we are interested in nausea, we should also probably include diarrhea and vomiting (co-presentation).
Main Continuous Outcome of Interest: Body Temperature
# start with main continuous outcome of interest: body temperature
# (1) since it is continuous, we can calculate summary statistics with the base summary function
base::summary(EDAdata$BodyTemp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 97.20 98.20 98.50 98.94 99.30 103.10
# (2) create a density plot for body temperature
ggplot2::ggplot(data = EDAdata, aes(x = BodyTemp)) +
geom_density()
# looking at the density plot and summary statistics, there appears to be a right skew relative to a normal distribution (mean > median, long upper tail)
# (3) create a boxplot to better visualize
# since body temperature is the main outcome of interest, no need to plot it against anything
ggplot2::ggplot(data = EDAdata, aes(y = BodyTemp)) +
geom_boxplot()
# there is clearly a skew in the data towards normal body temperature (98.6F)
# choosing to keep the points in the 101 - 103 F range as these are clinically reasonable values for influenza patients
# in other words, they are unlikely to be clinically significant outliers
# it is still worth examining the normality assumption, especially since we are moving to linear model fitting next
# (4) create a QQ-plot for body temperature
car::qqPlot(EDAdata$BodyTemp)
## [1] 172 388
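# this indicates that body temperature is not normally distributed, which is worth keeping in mind when fitting linear models
Looking at the density plot and summary statistics, body temperature is right-skewed relative to a normal distribution (the mean of 98.94 exceeds the median of 98.50, with a long tail toward 103.1). The QQ plot further affirms the departure from normality. Strictly speaking, linear models assume normally distributed residuals rather than a normally distributed outcome, so this is a flag for later model diagnostics rather than a violation in itself. All of the values of body temperature are within clinically reasonable bounds for influenza (likely febrile) patients.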
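The direction of a skew can also be checked numerically. The sketch below defines a simple moment-based skewness function (an illustrative helper, not part of the original analysis) and applies it to simulated temperatures. The simulated values are hypothetical and only mimic the general shape described above; they are not the study data.

```r
# Moment-based sample skewness: positive values indicate a longer right tail
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Simulated, hypothetical temperatures (NOT the study data): most values sit
# near 98.5 F, with a smaller group of febrile values forming an upper tail
set.seed(123)
sim_temp <- c(rnorm(600, mean = 98.5, sd = 0.5),
              rnorm(130, mean = 101.0, sd = 0.8))

# The sign of the skewness shows the direction of the skew
sample_skewness(sim_temp)

# A mean above the median is another sign of a long upper tail
mean(sim_temp) > median(sim_temp)

# Shapiro-Wilk test of normality from base R's stats package
shapiro.test(sim_temp)$p.value
```

With the real data, `sample_skewness(EDAdata$BodyTemp)` would be the analogous call.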
Main Categorical Outcome of Interest: Nausea
# now move onto main categorical outcome of interest: nausea
# before any analysis, create a label for nausea so outputs are more interpretable
EDAdata$Nausea <- base::factor(processeddata$Nausea, levels = c("No", "Yes"), labels = c("Nausea Absent", "Nausea Present"))
# (1) since it is categorical, we can only examine frequency and proportions of the variable
# this can be done with the summary tools package function "freq" and options to hide NAs (removed during processing)
summarytools::freq(EDAdata$Nausea, report.nas = FALSE)
## Frequencies
##
## Freq % % Cum.
## -------------------- ------ -------- --------
## Nausea Absent 475 65.07 65.07
## Nausea Present 255 34.93 100.00
## Total 730 100.00 100.00
# (2) skip as nausea is not a continuous variable
# (3) as this is a main outcome of interest, we can only create a bar plot in ggplot to illustrate distribution
# the two geom_text statements tell ggplot to calculate and display count and percentages at the top of each bar
ggplot2::ggplot(data = EDAdata, aes(x = Nausea)) +
geom_bar() +
geom_text(
aes(label = after_stat(count)),
stat = 'count',
nudge_x = -0.06,
nudge_y = 0.2,
vjust = -1) +
geom_text(
aes(label = after_stat(scales::percent(prop, prefix = "(", suffix = "%)", accuracy = 0.1)), group = 1),
stat = 'count',
nudge_x = 0.06,
nudge_y = 0.2,
vjust = -1)
# it is almost a 2/3 vs 1/3 split for nausea in patients captured in this dataset
# there isn't much more descriptive work we can do here
The patients included in this dataset have an almost 2/3 vs 1/3 split for reports of nausea.
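As a cross-check on the frequency table, the same percentages can be reproduced in base R from the two counts alone. This is a minimal sketch; the vector names are illustrative.

```r
# Counts copied from the frequency table above
nausea_counts <- c("Nausea Absent" = 475, "Nausea Present" = 255)

# Percentage of the total in each category, rounded to two decimals
nausea_pct <- round(100 * prop.table(nausea_counts), 2)
nausea_pct
```

On the real data, `prop.table(table(EDAdata$Nausea))` would give the same proportions without typing the counts by hand.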
Predictor Variable: Runny Nose
# (1) since it is categorical, we can only examine frequency and proportions of the variable
# this can be done with the summary tools package function "freq" and options to hide NAs (removed during processing)
summarytools::freq(EDAdata$RunnyNose, report.nas = FALSE)
## Frequencies
##
## Freq % % Cum.
## ----------- ------ -------- --------
## No 211 28.90 28.90
## Yes 519 71.10 100.00
## Total 730 100.00 100.00
# almost 3/4 of the patients captured in the dataset had a runny nose
# (2) skip as runny nose is not a continuous variable
# (3) examine graphical relationship with outcomes
# start with body temperature (i.e. create a box plot)
# include a jitter function to have a better idea of number of measurements and distribution
ggplot2::ggplot(data = EDAdata, aes(x = RunnyNose, y = BodyTemp)) +
geom_boxplot() +
geom_jitter(alpha = 0.5, width = 0.2, height = 0.2, color = "tomato")
# looking at this graph, it appears that patients with runny nose are potentially less likely to have a fever
# but this requires further analysis
# (3) now with nausea
# first create a table output of runny nose by nausea
# we can do this using the table 1 package
table1::label(EDAdata$RunnyNose) <- "Runny Nose"
table1::table1(~ RunnyNose | Nausea, data = EDAdata)
| | Nausea Absent (N=475) | Nausea Present (N=255) | Overall (N=730) |
| Runny Nose | | | |
| No | 139 (29.3%) | 72 (28.2%) | 211 (28.9%) |
| Yes | 336 (70.7%) | 183 (71.8%) | 519 (71.1%) |
# since both are categorical variables, we can use a stacked bar plot to understand the distribution of nausea within runny nose status
# to be able to include the percentages within each group, we will need to calculate percentages before creating a graph
# first need to define a sequential piping operator so the function knows to use objects defined in the operation
`%s>%` <- magrittr::pipe_eager_lexical
# the first part of this piping operation calculates the counts, then the percentage of each nausea category within each runny nose group
# the second part plots it using the ggplot2 package
# trying to visualize the proportion of patients with and without a runny nose who report the outcome of interest (nausea)
# spacing on the labels isn't ideal, so would need to adjust for an actual manuscript
EDAdata %s>%
dplyr::group_by(RunnyNose, Nausea) %s>%
dplyr::summarise(count_Nausea = n()) %s>%
dplyr::group_by(RunnyNose) %s>%
dplyr::mutate(count_RunnyNose = sum(count_Nausea)) %s>%
dplyr::mutate(pct = count_Nausea / count_RunnyNose) %s>% {
ggplot2::ggplot(., aes(x = RunnyNose,
y = count_Nausea,
fill = Nausea)) +
ggplot2::geom_bar(
position = "stack",
stat = "identity") +
ggplot2::geom_text(
aes(label = count_Nausea),
.,
stat = 'identity',
size = 4,
position = position_stack(vjust = 0.7, reverse = FALSE)) +
ggplot2::geom_text(
aes(label = scales::percent(pct)),
.,
stat = 'identity',
size = 4,
position = position_stack(vjust = 0.5, reverse = FALSE)) +
ggplot2::labs(.,
title = "Frequency of Nausea among Runny Nose Patients",
x = "Runny Nose?",
y = "Frequency of Nausea")
}
## `summarise()` has grouped output by 'RunnyNose'. You can override using the `.groups` argument.
# looking at the results of this graph, it seems that the distribution of the nausea outcome isn't affected by the presence of a runny nose
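Note that the table1 output reports percentages within each nausea column, whereas the stacked bar plot uses percentages within each runny nose group. A minimal base R sketch of the second calculation, using the counts from the table above (object names are illustrative):

```r
# Counts copied from the table1 output above
# (rows: runny nose status, columns: nausea status)
runny_by_nausea <- matrix(
  c(139,  72,
    336, 183),
  nrow = 2, byrow = TRUE,
  dimnames = list(RunnyNose = c("No", "Yes"),
                  Nausea = c("Nausea Absent", "Nausea Present"))
)

# margin = 1 gives proportions within each runny nose group (rows sum to 1)
row_pct <- prop.table(runny_by_nausea, margin = 1)
round(100 * row_pct, 1)
```

Using `margin = 2` instead would reproduce the column percentages shown by table1.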
Predictor Variable: Nasal Congestion
# (1) since it is categorical, we can only examine frequency and proportions of the variable
# this can be done with the summary tools package function "freq" and options to hide NAs (removed during processing)
summarytools::freq(EDAdata$NasalCongestion, report.nas = FALSE)
## Frequencies
##
## Freq % % Cum.
## ----------- ------ -------- --------
## No 167 22.88 22.88
## Yes 563 77.12 100.00
## Total 730 100.00 100.00
# more than 3/4 of the patients captured in the dataset had nasal congestion
# (2) skip as nasal congestion is not a continuous variable
# (3) examine graphical relationship with outcomes
# start with body temperature (i.e. create a box plot)
# include a jitter function to have a better idea of number of measurements and distribution
ggplot2::ggplot(data = EDAdata, aes(x = NasalCongestion, y = BodyTemp)) +
geom_boxplot() +
geom_jitter(alpha = 0.5, width = 0.2, height = 0.2, color = "tomato")
# looking at this graph, hard to tell potential difference
# likely due to totals in each group (yes = 563, no = 167)
# (3) now with nausea
# first create a table output of nasal congestion by nausea
# we can do this using the table 1 package
table1::label(EDAdata$NasalCongestion) <- "Nasal Congestion"
table1::table1(~ NasalCongestion | Nausea, data = EDAdata)
| | Nausea Absent (N=475) | Nausea Present (N=255) | Overall (N=730) |
| Nasal Congestion | | | |
| No | 120 (25.3%) | 47 (18.4%) | 167 (22.9%) |
| Yes | 355 (74.7%) | 208 (81.6%) | 563 (77.1%) |
# since both are categorical variables, we can use a stacked bar plot to understand the distribution of nausea within nasal congestion status
# to be able to include the percentages within each group, we will need to calculate percentages before creating a graph
# first need to define a sequential piping operator so the function knows to use objects defined in the operation
`%s>%` <- magrittr::pipe_eager_lexical
# the first part of this piping operation calculates the counts, then the percentage of each nausea category within each nasal congestion group
# the second part plots it using the ggplot2 package
# trying to visualize the proportion of patients with and without nasal congestion who report the outcome of interest (nausea)
# spacing on the labels isn't ideal, so would need to adjust for an actual manuscript
EDAdata %s>%
dplyr::group_by(NasalCongestion, Nausea) %s>%
dplyr::summarise(count_Nausea = n()) %s>%
dplyr::group_by(NasalCongestion) %s>%
dplyr::mutate(count_NasalCongestion = sum(count_Nausea)) %s>%
dplyr::mutate(pct = count_Nausea / count_NasalCongestion) %s>% {
ggplot2::ggplot(., aes(x = NasalCongestion,
y = count_Nausea,
fill = Nausea)) +
ggplot2::geom_bar(
position = "stack",
stat = "identity") +
ggplot2::geom_text(
aes(label = count_Nausea),
.,
stat = 'identity',
size = 4,
position = position_stack(vjust = 0.7, reverse = FALSE)) +
ggplot2::geom_text(
aes(label = scales::percent(pct)),
.,
stat = 'identity',
size = 4,
position = position_stack(vjust = 0.5, reverse = FALSE)) +
ggplot2::labs(.,
title = "Frequency of Nausea among Nasal Congestion Patients",
x = "Nasal Congestion?",
y = "Frequency of Nausea")
}
## `summarise()` has grouped output by 'NasalCongestion'. You can override using the `.groups` argument.
# looking at the results of this graph, nausea is potentially more common with nasal congestion (about 37%) than without it (about 28%)
Predictor Variable: Pharyngitis
# (1) since it is categorical, we can only examine frequency and proportions of the variable
# this can be done with the summary tools package function "freq" and options to hide NAs (removed during processing)
summarytools::freq(EDAdata$Pharyngitis, report.nas = FALSE)
## Frequencies
##
## Freq % % Cum.
## ----------- ------ -------- --------
## No 119 16.30 16.30
## Yes 611 83.70 100.00
## Total 730 100.00 100.00
# more than 80% of patients captured in this dataset have pharyngitis
# (2) skip as pharyngitis is not a continuous variable
# (3) examine graphical relationship with outcomes
# start with body temperature (i.e. create a box plot)
# include a jitter function to have a better idea of number of measurements and distribution
ggplot2::ggplot(data = EDAdata, aes(x = Pharyngitis, y = BodyTemp)) +
geom_boxplot() +
geom_jitter(alpha = 0.5, width = 0.2, height = 0.2, color = "tomato")
# looking at this graph, hard to tell potential difference
# likely due to totals in each group (yes = 611, no = 119)
# (3) now with nausea
# first create a table output of Pharyngitis by nausea
# we can do this using the table 1 package
table1::label(EDAdata$Pharyngitis) <- "Pharyngitis"
table1::table1(~ Pharyngitis | Nausea, data = EDAdata)
| | Nausea Absent (N=475) | Nausea Present (N=255) | Overall (N=730) |
| Pharyngitis | | | |
| No | 80 (16.8%) | 39 (15.3%) | 119 (16.3%) |
| Yes | 395 (83.2%) | 216 (84.7%) | 611 (83.7%) |
# since both are categorical variables, we can use a stacked bar plot to understand the distribution of nausea within pharyngitis status
# to be able to include the percentages within each group, we will need to calculate percentages before creating a graph
# first need to define a sequential piping operator so the function knows to use objects defined in the operation
`%s>%` <- magrittr::pipe_eager_lexical
# the first part of this piping operation calculates the counts, then the percentage of each nausea category within each pharyngitis group
# the second part plots it using the ggplot2 package
# trying to visualize the proportion of patients with and without pharyngitis who report the outcome of interest (nausea)
# spacing on the labels isn't ideal, so would need to adjust for an actual manuscript
EDAdata %s>%
dplyr::group_by(Pharyngitis, Nausea) %s>%
dplyr::summarise(count_Nausea = n()) %s>%
dplyr::group_by(Pharyngitis) %s>%
dplyr::mutate(count_Pharyngitis = sum(count_Nausea)) %s>%
dplyr::mutate(pct = count_Nausea / count_Pharyngitis) %s>% {
ggplot2::ggplot(., aes(x = Pharyngitis,
y = count_Nausea,
fill = Nausea)) +
ggplot2::geom_bar(
position = "stack",
stat = "identity") +
ggplot2::geom_text(
aes(label = count_Nausea),
.,
stat = 'identity',
size = 4,
position = position_stack(vjust = 0.7, reverse = FALSE)) +
ggplot2::geom_text(
aes(label = scales::percent(pct)),
.,
stat = 'identity',
size = 4,
position = position_stack(vjust = 0.5, reverse = FALSE)) +
ggplot2::labs(.,
title = "Frequency of Nausea among Pharyngitis Patients",
x = "Pharyngitis?",
y = "Frequency of Nausea")
}
## `summarise()` has grouped output by 'Pharyngitis'. You can override using the `.groups` argument.
# looking at the results of this graph, hard to see any real difference
Predictor Variable: Chills / Sweating
# (1) since it is categorical, we can only examine frequency and proportions of the variable
# this can be done with the summary tools package function "freq" and options to hide NAs (removed during processing)
summarytools::freq(EDAdata$ChillsSweats, report.nas = FALSE)
## Frequencies
##
## Freq % % Cum.
## ----------- ------ -------- --------
## No 130 17.81 17.81
## Yes 600 82.19 100.00
## Total 730 100.00 100.00
# more than 80% of patients captured in this dataset have chills
# (2) skip as chills is not a continuous variable
# (3) examine graphical relationship with outcomes
# start with body temperature (i.e. create a box plot)
# include a jitter function to have a better idea of number of measurements and distribution
ggplot2::ggplot(data = EDAdata, aes(x = ChillsSweats, y = BodyTemp)) +
geom_boxplot() +
geom_jitter(alpha = 0.5, width = 0.2, height = 0.2, color = "tomato")
# more chills / sweats reported with higher body temperature
# this difference makes sense as chills / sweats are often the result of a fever
# (3) now with nausea
# first create a table output of chills by nausea
# we can do this using the table 1 package
table1::label(EDAdata$ChillsSweats) <- "ChillsSweats"
table1::table1(~ ChillsSweats | Nausea, data = EDAdata)
| | Nausea Absent (N=475) | Nausea Present (N=255) | Overall (N=730) |
| ChillsSweats | | | |
| No | 103 (21.7%) | 27 (10.6%) | 130 (17.8%) |
| Yes | 372 (78.3%) | 228 (89.4%) | 600 (82.2%) |
# since both are categorical variables, we can use a stacked bar plot to understand the distribution of nausea within chills / sweats status
# to be able to include the percentages within each group, we will need to calculate percentages before creating a graph
# first need to define a sequential piping operator so the function knows to use objects defined in the operation
`%s>%` <- magrittr::pipe_eager_lexical
# the first part of this piping operation calculates the counts, then the percentage of each nausea category within each chills / sweats group
# the second part plots it using the ggplot2 package
# trying to visualize the proportion of patients with and without chills / sweats who report the outcome of interest (nausea)
# spacing on the labels isn't ideal, so would need to adjust for an actual manuscript
EDAdata %s>%
dplyr::group_by(ChillsSweats, Nausea) %s>%
dplyr::summarise(count_Nausea = n()) %s>%
dplyr::group_by(ChillsSweats) %s>%
dplyr::mutate(count_ChillsSweats = sum(count_Nausea)) %s>%
dplyr::mutate(pct = count_Nausea / count_ChillsSweats) %s>% {
ggplot2::ggplot(., aes(x = ChillsSweats,
y = count_Nausea,
fill = Nausea)) +
ggplot2::geom_bar(
position = "stack",
stat = "identity") +
ggplot2::geom_text(
aes(label = count_Nausea),
.,
stat = 'identity',
size = 4,
position = position_stack(vjust = 0.7, reverse = FALSE)) +
ggplot2::geom_text(
aes(label = scales::percent(pct)),
.,
stat = 'identity',
size = 4,
position = position_stack(vjust = 0.5, reverse = FALSE)) +
ggplot2::labs(.,
title = "Frequency of Nausea among Chills / Sweats",
x = "Chills or Sweats?",
y = "Frequency of Nausea")
}
## `summarise()` has grouped output by 'ChillsSweats'. You can override using the `.groups` argument.
# looking at the results of this graph, potentially more nausea with chills / sweats
# requires further analysis to determine significance
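One possible form of that further analysis is a two-sample test of proportions. The sketch below uses `prop.test()` from base R's stats package with the counts from the table above; it is an illustration of the idea under the usual large-sample assumptions, not the analysis performed in the original exercise, and the object names are illustrative.

```r
# Counts copied from the table1 output above:
# patients reporting nausea, and group sizes, by chills / sweats status
nausea_present <- c(NoChills = 27, Chills = 228)
group_total    <- c(NoChills = 130, Chills = 600)

# Test whether the proportion with nausea differs between the two groups
chills_test <- prop.test(x = nausea_present, n = group_total)

chills_test$estimate  # proportion with nausea in each group
chills_test$p.value   # p-value for the difference in proportions
```

As with any exploratory result, the p-value is a screening aid and does not account for other symptoms.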
Predictor Variable: Myalgia
# there are multiple variables for myalgia, but we can focus on the one that gives a severity scale of myalgia
# (1) since it is categorical, we can only examine frequency and proportions of the variable
# this can be done with the summary tools package function "freq" and options to hide NAs (removed during processing)
summarytools::freq(EDAdata$Myalgia, report.nas = FALSE)
## Frequencies
##
## Freq % % Cum.
## -------------- ------ -------- --------
## None 79 10.82 10.82
## Mild 213 29.18 40.00
## Moderate 325 44.52 84.52
## Severe 113 15.48 100.00
## Total 730 100.00 100.00
# nearly half of the patients in the dataset reported moderate myalgia
# approximately 3/4 of the patients in the dataset reported mild or moderate myalgia
# (2) skip as myalgia is not a continuous variable
# (3) examine graphical relationship with outcomes
# start with body temperature (i.e. create a box plot)
# include a jitter function to have a better idea of number of measurements and distribution
ggplot2::ggplot(data = EDAdata, aes(x = Myalgia, y = BodyTemp)) +
geom_boxplot() +
geom_jitter(alpha = 0.5, width = 0.2, height = 0.2, color = "tomato")
# looking at this graph, it appears that patients with no myalgia were less likely to have a fever
# it doesn't appear to have a great variation among the severity of myalgia symptoms
# but this requires further analysis
# (3) now with nausea
# first create a table output of myalgia by nausea
# we can do this using the table 1 package
table1::label(EDAdata$Myalgia) <- "Myalgia"
table1::table1(~ Myalgia | Nausea, data = EDAdata)
| | Nausea Absent (N=475) | Nausea Present (N=255) | Overall (N=730) |
| Myalgia | | | |
| None | 63 (13.3%) | 16 (6.3%) | 79 (10.8%) |
| Mild | 159 (33.5%) | 54 (21.2%) | 213 (29.2%) |
| Moderate | 198 (41.7%) | 127 (49.8%) | 325 (44.5%) |
| Severe | 55 (11.6%) | 58 (22.7%) | 113 (15.5%) |
# since both are categorical variables, we can use a stacked bar plot to understand the distribution of nausea within myalgia severity levels
# to be able to include the percentages within each group, we will need to calculate percentages before creating a graph
# first need to define a sequential piping operator so the function knows to use objects defined in the operation
`%s>%` <- magrittr::pipe_eager_lexical
# the first part of this piping operation calculates the counts, then the percentage of each nausea category within each myalgia severity level
# the second part plots it using the ggplot2 package
# trying to visualize the proportion of patients at each myalgia severity level who report the outcome of interest (nausea)
# spacing on the labels isn't ideal, so would need to adjust for an actual manuscript
EDAdata %s>%
dplyr::group_by(Myalgia, Nausea) %s>%
dplyr::summarise(count_Nausea = n()) %s>%
dplyr::group_by(Myalgia) %s>%
dplyr::mutate(count_Myalgia = sum(count_Nausea)) %s>%
dplyr::mutate(pct = count_Nausea / count_Myalgia) %s>% {
ggplot2::ggplot(., aes(x = Myalgia,
y = count_Nausea,
fill = Nausea)) +
ggplot2::geom_bar(
position = "stack",
stat = "identity") +
ggplot2::geom_text(
aes(label = count_Nausea),
.,
stat = 'identity',
size = 4,
position = position_stack(vjust = 0.7, reverse = FALSE)) +
ggplot2::geom_text(
aes(label = scales::percent(pct)),
.,
stat = 'identity',
size = 4,
position = position_stack(vjust = 0.5, reverse = FALSE)) +
ggplot2::labs(.,
title = "Frequency of Nausea among Myalgia Severity",
x = "Myalgia Severity",
y = "Frequency of Nausea")
}
## `summarise()` has grouped output by 'Myalgia'. You can override using the `.groups` argument.
# looking at the results of this graph, it seems that increasing myalgia severity is associated with increased nausea (about 20% with no myalgia vs. about 51% with severe myalgia)
# this makes sense clinically
# need further evaluation for significance
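Because myalgia severity is ordered, one way to evaluate the pattern is a test for trend in proportions. The sketch below uses `prop.trend.test()` from base R's stats package with the counts from the table above. The equally spaced default scores (1 to 4) are an assumption, and the object names are illustrative.

```r
# Counts copied from the table1 output above,
# ordered None < Mild < Moderate < Severe
nausea_present <- c(None = 16, Mild = 54, Moderate = 127, Severe = 58)
group_total    <- c(None = 79, Mild = 213, Moderate = 325, Severe = 113)

# Proportion reporting nausea at each myalgia severity level
nausea_prop <- nausea_present / group_total
round(nausea_prop, 3)

# Chi-squared test for a linear trend in proportions across the ordered levels
trend <- prop.trend.test(x = nausea_present, n = group_total)
trend$p.value
```

The per-level proportions are the same quantities shown as percentages on the stacked bar plot.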
Predictor Variable: Diarrhea
# (1) since it is categorical, we can only examine frequency and proportions of the variable
# this can be done with the summary tools package function "freq" and options to hide NAs (removed during processing)
summarytools::freq(EDAdata$Diarrhea, report.nas = FALSE)
## Frequencies
##
## Freq % % Cum.
## ----------- ------ -------- --------
## No 631 86.44 86.44
## Yes 99 13.56 100.00
## Total 730 100.00 100.00
# fewer than 15% of patients captured in this dataset have diarrhea (more than 85% do not)
# (2) skip as Diarrhea is not a continuous variable
# (3) examine graphical relationship with outcomes
# start with body temperature (i.e. create a box plot)
# include a jitter function to have a better idea of number of measurements and distribution
ggplot2::ggplot(data = EDAdata, aes(x = Diarrhea, y = BodyTemp)) +
geom_boxplot() +
geom_jitter(alpha = 0.5, width = 0.2, height = 0.2, color = "tomato")
# no clear difference in body temperature
# (3) now with nausea
# first create a table output of Diarrhea by nausea
# we can do this using the table 1 package
table1::label(EDAdata$Diarrhea) <- "Diarrhea"
table1::table1(~ Diarrhea | Nausea, data = EDAdata)
| | Nausea Absent (N=475) | Nausea Present (N=255) | Overall (N=730) |
| Diarrhea | | | |
| No | 435 (91.6%) | 196 (76.9%) | 631 (86.4%) |
| Yes | 40 (8.4%) | 59 (23.1%) | 99 (13.6%) |
# since both are categorical variables, we can use a stacked bar plot to understand the distribution of nausea within diarrhea status
# to be able to include the percentages within each group, we will need to calculate percentages before creating a graph
# first need to define a sequential piping operator so the function knows to use objects defined in the operation
`%s>%` <- magrittr::pipe_eager_lexical
# the first part of this piping operation calculates the counts, then the percentage of each nausea category within each diarrhea group
# the second part plots it using the ggplot2 package
# trying to visualize the proportion of patients with and without diarrhea who report the outcome of interest (nausea)
# spacing on the labels isn't ideal, so would need to adjust for an actual manuscript
EDAdata %s>%
dplyr::group_by(Diarrhea, Nausea) %s>%
dplyr::summarise(count_Nausea = n()) %s>%
dplyr::group_by(Diarrhea) %s>%
dplyr::mutate(count_Diarrhea = sum(count_Nausea)) %s>%
dplyr::mutate(pct = count_Nausea / count_Diarrhea) %s>% {
ggplot2::ggplot(., aes(x = Diarrhea,
y = count_Nausea,
fill = Nausea)) +
ggplot2::geom_bar(
position = "stack",
stat = "identity") +
ggplot2::geom_text(
aes(label = count_Nausea),
.,
stat = 'identity',
size = 4,
position = position_stack(vjust = 0.7, reverse = FALSE)) +
ggplot2::geom_text(
aes(label = scales::percent(pct)),
.,
stat = 'identity',
size = 4,
position = position_stack(vjust = 0.5, reverse = FALSE)) +
ggplot2::labs(.,
title = "Frequency of Nausea among Diarrhea",
x = "Diarrhea?",
y = "Frequency of Nausea")
}
## `summarise()` has grouped output by 'Diarrhea'. You can override using the `.groups` argument.
# based on the results, it appears that more patients with diarrhea had nausea
# this makes sense clinically as nausea and diarrhea often co-present
Predictor Variable: Vomiting
# (1) since it is categorical, we can only examine frequency and proportions of the variable
# this can be done with the summary tools package function "freq" and options to hide NAs (removed during processing)
summarytools::freq(EDAdata$Vomit, report.nas = FALSE)
## Frequencies
##
## Freq % % Cum.
## ----------- ------ -------- --------
## No 652 89.32 89.32
## Yes 78 10.68 100.00
## Total 730 100.00 100.00
# only about 11% of patients captured in this dataset report vomiting (nearly 90% do not)
# (2) skip as Vomit is not a continuous variable
# (3) examine graphical relationship with outcomes
# start with body temperature (i.e. create a box plot)
# include a jitter function to have a better idea of number of measurements and distribution
ggplot2::ggplot(data = EDAdata, aes(x = Vomit, y = BodyTemp)) +
geom_boxplot() +
geom_jitter(alpha = 0.5, width = 0.2, height = 0.2, color = "tomato")
# most obviously far fewer patients reporting vomiting
# but, potentially associated with an increased body temperature
# (3) now with nausea
# first create a table output of Vomit by nausea
# we can do this using the table 1 package
table1::label(EDAdata$Vomit) <- "Vomit"
table1::table1(~ Vomit | Nausea, data = EDAdata)
| | Nausea Absent (N=475) | Nausea Present (N=255) | Overall (N=730) |
| Vomit | | | |
| No | 461 (97.1%) | 191 (74.9%) | 652 (89.3%) |
| Yes | 14 (2.9%) | 64 (25.1%) | 78 (10.7%) |
# since both are categorical variables, we can use a stacked bar plot to understand the distribution of nausea within vomiting status
# to be able to include the percentages within each group, we will need to calculate percentages before creating a graph
# first need to define a sequential piping operator so the function knows to use objects defined in the operation
`%s>%` <- magrittr::pipe_eager_lexical
# the first part of this piping operation calculates the counts, then the percentage of each nausea category within each vomiting group
# the second part plots it using the ggplot2 package
# trying to visualize the proportion of patients with and without vomiting who report the outcome of interest (nausea)
# spacing on the labels isn't ideal, so would need to adjust for an actual manuscript
EDAdata %s>%
dplyr::group_by(Vomit, Nausea) %s>%
dplyr::summarise(count_Nausea = n()) %s>%
dplyr::group_by(Vomit) %s>%
dplyr::mutate(count_Vomit = sum(count_Nausea)) %s>%
dplyr::mutate(pct = count_Nausea / count_Vomit) %s>% {
ggplot2::ggplot(., aes(x = Vomit,
y = count_Nausea,
fill = Nausea)) +
ggplot2::geom_bar(
position = "stack",
stat = "identity") +
ggplot2::geom_text(
aes(label = count_Nausea),
.,
stat = 'identity',
size = 4,
position = position_stack(vjust = 0.7, reverse = FALSE)) +
ggplot2::geom_text(
aes(label = scales::percent(pct)),
.,
stat = 'identity',
size = 4,
position = position_stack(vjust = 0.5, reverse = FALSE)) +
ggplot2::labs(.,
title = "Frequency of Nausea among Vomiting",
x = "Vomiting?",
y = "Frequency of Nausea")
}
## `summarise()` has grouped output by 'Vomit'. You can override using the `.groups` argument.
# based on the results, it appears that more patients with vomiting had nausea
# this makes sense clinically as nausea and vomiting often co-present
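The count-and-percentage step is repeated for every predictor above, so it could be wrapped in a small helper. The sketch below is one way to do that in base R. The function name `outcome_pct_by`, its arguments, and the `toy` data frame are all hypothetical and not part of the original exercise, which uses dplyr for this step.

```r
# Count and within-group proportion of an outcome for each level of a predictor
# (base R only; function and argument names are illustrative)
outcome_pct_by <- function(data, predictor, outcome = "Nausea") {
  counts <- as.data.frame(table(data[[predictor]], data[[outcome]]),
                          stringsAsFactors = FALSE)
  names(counts) <- c(predictor, outcome, "count")
  # Total number of rows in each predictor group, repeated per row
  group_total <- ave(counts$count, counts[[predictor]], FUN = sum)
  counts$pct <- counts$count / group_total
  counts
}

# Hypothetical toy data, NOT the study data
toy <- data.frame(
  Vomit  = c("No", "No", "No", "Yes", "Yes", "No"),
  Nausea = c("No", "Yes", "No", "Yes", "Yes", "No")
)
outcome_pct_by(toy, "Vomit")
```

With the real data, a call such as `outcome_pct_by(EDAdata, "Vomit")` would return the table that feeds each stacked bar plot.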
Creating A “Table 1” For Categorical Outcome of Interest (Nausea)
Often the first table of a manuscript lists the predictors on each row with columns representing the outcome in question. The table1 package works extremely well for categorical variables.
# first, create the summary statistics within the Table1 package for predictor variables
# already created earlier, but placed here for reference
table1::label(EDAdata$RunnyNose) <- "Runny Nose"
table1::label(EDAdata$Pharyngitis) <- "Pharyngitis"
table1::label(EDAdata$NasalCongestion) <- "Nasal Congestion"
table1::label(EDAdata$CoughIntensity) <- "Cough Intensity"
table1::label(EDAdata$ChillsSweats) <- "Chills / Sweating"
table1::label(EDAdata$Myalgia) <- "Myalgia"
table1::label(EDAdata$Vomit) <- "Vomit"
table1::label(EDAdata$Diarrhea) <- "Diarrhea"
# now, load all into a table 1 where columns represent nausea categories
table1::table1(~ RunnyNose + Pharyngitis + NasalCongestion + CoughIntensity + ChillsSweats + Myalgia + Vomit + Diarrhea
| Nausea, data = EDAdata)
| | Nausea Absent (N=475) | Nausea Present (N=255) | Overall (N=730) |
| Runny Nose | | | |
| No | 139 (29.3%) | 72 (28.2%) | 211 (28.9%) |
| Yes | 336 (70.7%) | 183 (71.8%) | 519 (71.1%) |
| Pharyngitis | | | |
| No | 80 (16.8%) | 39 (15.3%) | 119 (16.3%) |
| Yes | 395 (83.2%) | 216 (84.7%) | 611 (83.7%) |
| Nasal Congestion | | | |
| No | 120 (25.3%) | 47 (18.4%) | 167 (22.9%) |
| Yes | 355 (74.7%) | 208 (81.6%) | 563 (77.1%) |
| Cough Intensity | | | |
| None | 30 (6.3%) | 17 (6.7%) | 47 (6.4%) |
| Mild | 99 (20.8%) | 55 (21.6%) | 154 (21.1%) |
| Moderate | 232 (48.8%) | 125 (49.0%) | 357 (48.9%) |
| Severe | 114 (24.0%) | 58 (22.7%) | 172 (23.6%) |
| Chills / Sweating | | | |
| No | 103 (21.7%) | 27 (10.6%) | 130 (17.8%) |
| Yes | 372 (78.3%) | 228 (89.4%) | 600 (82.2%) |
| Myalgia | | | |
| None | 63 (13.3%) | 16 (6.3%) | 79 (10.8%) |
| Mild | 159 (33.5%) | 54 (21.2%) | 213 (29.2%) |
| Moderate | 198 (41.7%) | 127 (49.8%) | 325 (44.5%) |
| Severe | 55 (11.6%) | 58 (22.7%) | 113 (15.5%) |
| Vomit | | | |
| No | 461 (97.1%) | 191 (74.9%) | 652 (89.3%) |
| Yes | 14 (2.9%) | 64 (25.1%) | 78 (10.7%) |
| Diarrhea | | | |
| No | 435 (91.6%) | 196 (76.9%) | 631 (86.4%) |
| Yes | 40 (8.4%) | 59 (23.1%) | 99 (13.6%) |
# in a real analysis, we could also use this table for our univariate analysis, so we could conduct a chi-square test
# to test for differences in each variable across the Nausea strata.
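The chi-square test mentioned in the closing comment could look like the following for a single predictor. This is a minimal sketch using `chisq.test()` from base R's stats package and the vomiting counts from the table above; it is not a result reported in the original exercise, and the object names are illustrative.

```r
# Counts copied from the Table 1 output above
# (rows: vomiting status, columns: nausea status)
vomit_by_nausea <- matrix(
  c(461, 191,
     14,  64),
  nrow = 2, byrow = TRUE,
  dimnames = list(Vomit = c("No", "Yes"),
                  Nausea = c("Nausea Absent", "Nausea Present"))
)

# Pearson chi-squared test of independence; for a 2 x 2 table,
# chisq.test() applies Yates' continuity correction by default
vomit_test <- chisq.test(vomit_by_nausea)

vomit_test$expected  # expected counts under independence
vomit_test$p.value   # p-value for the association
```

Checking the expected counts first is a common safeguard, since the chi-square approximation is less reliable when any expected cell count is small.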