This exercise focuses on loading the raw data and cleaning/processing it for further analysis.
The raw data for this exercise comes from the following citation: McKay, Brian et al. (2020), Virulence-mediated infectiousness and activity trade-offs and their impact on transmission potential of patients infected with influenza, Dryad, Dataset, https://doi.org/10.5061/dryad.51c59zw4v.
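The following R packages are required for this exercise: here (project-relative file paths), dplyr (column selection, glimpse(), and the %>% pipe), and skimr (data summaries). Because the code uses %>% without a namespace prefix, dplyr must be attached with library().
Load the raw data downloaded from the DOI link provided above.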
#path to data
#note the use of the here() package and not absolute paths
data_location <- here::here("data","flu","SympAct_Any_Pos.Rda")
#load data.
#although the file has an .Rda extension, it was saved in RDS format, so we read it with the readRDS() function in base R.
#the typical load() function does not work (data is RDS not RDA)
rawdata <- base::readRDS(data_location)
#take a look at the data
dplyr::glimpse(rawdata)
## Rows: 735
## Columns: 63
## $ DxName1 <fct> "Influenza like illness - Clinical Dx", "Acute tonsi~
## $ DxName2 <fct> NA, "Influenza like illness - Clinical Dx", "Acute p~
## $ DxName3 <fct> NA, NA, NA, NA, NA, NA, NA, NA, "Fever, unspecified"~
## $ DxName4 <fct> NA, NA, NA, NA, NA, NA, NA, NA, "Other fatigue", NA,~
## $ DxName5 <fct> NA, NA, NA, NA, NA, NA, NA, NA, "Headache", NA, NA, ~
## $ Unique.Visit <chr> "340_17632125", "340_17794836", "342_17737773", "342~
## $ ActivityLevel <int> 10, 6, 2, 2, 5, 3, 4, 0, 0, 5, 9, 1, 3, 6, 5, 2, 2, ~
## $ ActivityLevelF <fct> 10, 6, 2, 2, 5, 3, 4, 0, 0, 5, 9, 1, 3, 6, 5, 2, 2, ~
## $ SwollenLymphNodes <fct> Yes, Yes, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Y~
## $ ChestCongestion <fct> No, Yes, Yes, Yes, No, No, No, Yes, Yes, Yes, Yes, Y~
## $ ChillsSweats <fct> No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, ~
## $ NasalCongestion <fct> No, Yes, Yes, Yes, No, No, No, Yes, Yes, Yes, Yes, Y~
## $ CoughYN <fct> Yes, Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, No, ~
## $ Sneeze <fct> No, No, Yes, Yes, No, Yes, No, Yes, No, No, No, No, ~
## $ Fatigue <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye~
## $ SubjectiveFever <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Yes~
## $ Headache <fct> Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes~
## $ Weakness <fct> Mild, Severe, Severe, Severe, Moderate, Moderate, Mi~
## $ WeaknessYN <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye~
## $ CoughIntensity <fct> Severe, Severe, Mild, Moderate, None, Moderate, Seve~
## $ CoughYN2 <fct> Yes, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes~
## $ Myalgia <fct> Mild, Severe, Severe, Severe, Mild, Moderate, Mild, ~
## $ MyalgiaYN <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye~
## $ RunnyNose <fct> No, No, Yes, Yes, No, No, Yes, Yes, Yes, Yes, No, No~
## $ AbPain <fct> No, No, Yes, No, No, No, No, No, No, No, Yes, Yes, N~
## $ ChestPain <fct> No, No, Yes, No, No, Yes, Yes, No, No, No, No, Yes, ~
## $ Diarrhea <fct> No, No, No, No, No, Yes, No, No, No, No, No, No, No,~
## $ EyePn <fct> No, No, No, No, Yes, No, No, No, No, No, Yes, No, Ye~
## $ Insomnia <fct> No, No, Yes, Yes, Yes, No, No, Yes, Yes, Yes, Yes, Y~
## $ ItchyEye <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes,~
## $ Nausea <fct> No, No, Yes, Yes, Yes, Yes, No, No, Yes, Yes, Yes, Y~
## $ EarPn <fct> No, Yes, No, Yes, No, No, No, No, No, No, No, Yes, Y~
## $ Hearing <fct> No, Yes, No, No, No, No, No, No, No, No, No, No, No,~
## $ Pharyngitis <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No, No, Yes, ~
## $ Breathless <fct> No, No, Yes, No, No, Yes, No, No, No, Yes, No, Yes, ~
## $ ToothPn <fct> No, No, Yes, No, No, No, No, No, Yes, No, No, Yes, N~
## $ Vision <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, ~
## $ Vomit <fct> No, No, No, No, No, No, Yes, No, No, No, Yes, Yes, N~
## $ Wheeze <fct> No, No, No, Yes, No, Yes, No, No, No, No, No, Yes, N~
## $ BodyTemp <dbl> 98.3, 100.4, 100.8, 98.8, 100.5, 98.4, 102.5, 98.4, ~
## $ RapidFluA <fct> Presumptive Negative For Influenza A, NA, Presumptiv~
## $ RapidFluB <fct> Presumptive Negative For Influenza B, NA, Presumptiv~
## $ PCRFluA <fct> NA, NA, NA, NA, NA, NA, Influenza A Not Detected, N~
## $ PCRFluB <fct> NA, NA, NA, NA, NA, NA, Influenza B Not Detected, N~
## $ TransScore1 <dbl> 1, 3, 4, 5, 0, 2, 2, 5, 4, 4, 2, 3, 2, 5, 3, 5, 1, 5~
## $ TransScore1F <fct> 1, 3, 4, 5, 0, 2, 2, 5, 4, 4, 2, 3, 2, 5, 3, 5, 1, 5~
## $ TransScore2 <dbl> 1, 2, 3, 4, 0, 2, 2, 4, 3, 3, 1, 2, 2, 4, 2, 4, 1, 4~
## $ TransScore2F <fct> 1, 2, 3, 4, 0, 2, 2, 4, 3, 3, 1, 2, 2, 4, 2, 4, 1, 4~
## $ TransScore3 <dbl> 1, 1, 2, 3, 0, 2, 2, 3, 2, 2, 0, 1, 1, 3, 1, 3, 1, 3~
## $ TransScore3F <fct> 1, 1, 2, 3, 0, 2, 2, 3, 2, 2, 0, 1, 1, 3, 1, 3, 1, 3~
## $ TransScore4 <dbl> 0, 2, 4, 4, 0, 1, 1, 4, 3, 3, 2, 2, 2, 4, 3, 4, 0, 4~
## $ TransScore4F <fct> 0, 2, 4, 4, 0, 1, 1, 4, 3, 3, 2, 2, 2, 4, 3, 4, 0, 4~
## $ ImpactScore <int> 7, 8, 14, 12, 11, 12, 8, 7, 10, 7, 13, 17, 11, 13, 9~
## $ ImpactScore2 <int> 6, 7, 13, 11, 10, 11, 7, 6, 9, 6, 12, 16, 10, 12, 8,~
## $ ImpactScore3 <int> 3, 4, 9, 7, 6, 7, 3, 3, 6, 4, 7, 11, 6, 8, 4, 4, 5, ~
## $ ImpactScoreF <fct> 7, 8, 14, 12, 11, 12, 8, 7, 10, 7, 13, 17, 11, 13, 9~
## $ ImpactScore2F <fct> 6, 7, 13, 11, 10, 11, 7, 6, 9, 6, 12, 16, 10, 12, 8,~
## $ ImpactScore3F <fct> 3, 4, 9, 7, 6, 7, 3, 3, 6, 4, 7, 11, 6, 8, 4, 4, 5, ~
## $ ImpactScoreFD <fct> 7, 8, 14, 12, 11, 12, 8, 7, 10, 7, 13, 17, 11, 13, 9~
## $ TotalSymp1 <dbl> 8, 11, 18, 17, 11, 14, 10, 12, 14, 11, 15, 20, 13, 1~
## $ TotalSymp1F <fct> 8, 11, 18, 17, 11, 14, 10, 12, 14, 11, 15, 20, 13, 1~
## $ TotalSymp2 <dbl> 8, 10, 17, 16, 11, 14, 10, 11, 13, 10, 14, 19, 13, 1~
## $ TotalSymp3 <dbl> 8, 9, 16, 15, 11, 14, 10, 10, 12, 9, 13, 18, 12, 16,~
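The difference between the two file formats can be checked on a small stand-in object. This is a minimal sketch using only base R; the data frame `toy` and the temporary file are hypothetical and are not part of the exercise data.

```r
# Toy stand-in for the study data (hypothetical values)
toy <- data.frame(
  BodyTemp = c(98.3, 100.4),
  Fatigue  = factor(c("Yes", "No"))
)

# Save in RDS format, but with a misleading .Rda extension
rds_file <- tempfile(fileext = ".Rda")
saveRDS(toy, rds_file)

# readRDS() returns the object, which must be assigned to a name
restored <- readRDS(rds_file)

# load() expects the format written by save() and fails on an RDS file
load_failed <- inherits(
  suppressWarnings(try(load(rds_file), silent = TRUE)),
  "try-error"
)

print(identical(restored, toy))
print(load_failed)
```

The file extension does not determine the format: what matters is whether the file was written with saveRDS() (a single unnamed object) or save() (one or more named objects).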
The first step is some overall processing to create a dataset that will be used in most of the analyses: remove all variables that have Score or Total or FluA or FluB or Dxname or Activity in the name, remove the variable Unique.Visit, and remove any NA observations.
#this can be accomplished using the select function in dplyr / tidyverse
#while we could pipe this into one operation, separating each line makes de-bugging issues easier
#remove variables containing "Score"
data1 <- rawdata %>% dplyr::select(-contains("Score"))
#remove variables containing "Total"
data2 <- data1 %>% dplyr::select(-contains("Total"))
#remove variables containing "FluA"
data3 <- data2 %>% dplyr::select(-contains("FluA"))
#remove variables containing "FluB"
data4 <- data3 %>% dplyr::select(-contains("FluB"))
#remove variables containing "Dxname"
data5 <- data4 %>% dplyr::select(-contains("Dxname"))
#remove variables containing "Activity"
data6 <- data5 %>% dplyr::select(-contains("Activity"))
#remove variable "Unique.Visit"
data7 <- data6 %>% dplyr::select(-contains("Unique.Visit"))
#check to make sure we have the correct columns remaining
dplyr::glimpse(data7)
## Rows: 735
## Columns: 32
## $ SwollenLymphNodes <fct> Yes, Yes, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Y~
## $ ChestCongestion <fct> No, Yes, Yes, Yes, No, No, No, Yes, Yes, Yes, Yes, Y~
## $ ChillsSweats <fct> No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, ~
## $ NasalCongestion <fct> No, Yes, Yes, Yes, No, No, No, Yes, Yes, Yes, Yes, Y~
## $ CoughYN <fct> Yes, Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, No, ~
## $ Sneeze <fct> No, No, Yes, Yes, No, Yes, No, Yes, No, No, No, No, ~
## $ Fatigue <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye~
## $ SubjectiveFever <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Yes~
## $ Headache <fct> Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes~
## $ Weakness <fct> Mild, Severe, Severe, Severe, Moderate, Moderate, Mi~
## $ WeaknessYN <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye~
## $ CoughIntensity <fct> Severe, Severe, Mild, Moderate, None, Moderate, Seve~
## $ CoughYN2 <fct> Yes, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes~
## $ Myalgia <fct> Mild, Severe, Severe, Severe, Mild, Moderate, Mild, ~
## $ MyalgiaYN <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye~
## $ RunnyNose <fct> No, No, Yes, Yes, No, No, Yes, Yes, Yes, Yes, No, No~
## $ AbPain <fct> No, No, Yes, No, No, No, No, No, No, No, Yes, Yes, N~
## $ ChestPain <fct> No, No, Yes, No, No, Yes, Yes, No, No, No, No, Yes, ~
## $ Diarrhea <fct> No, No, No, No, No, Yes, No, No, No, No, No, No, No,~
## $ EyePn <fct> No, No, No, No, Yes, No, No, No, No, No, Yes, No, Ye~
## $ Insomnia <fct> No, No, Yes, Yes, Yes, No, No, Yes, Yes, Yes, Yes, Y~
## $ ItchyEye <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes,~
## $ Nausea <fct> No, No, Yes, Yes, Yes, Yes, No, No, Yes, Yes, Yes, Y~
## $ EarPn <fct> No, Yes, No, Yes, No, No, No, No, No, No, No, Yes, Y~
## $ Hearing <fct> No, Yes, No, No, No, No, No, No, No, No, No, No, No,~
## $ Pharyngitis <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No, No, Yes, ~
## $ Breathless <fct> No, No, Yes, No, No, Yes, No, No, No, Yes, No, Yes, ~
## $ ToothPn <fct> No, No, Yes, No, No, No, No, No, Yes, No, No, Yes, N~
## $ Vision <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, ~
## $ Vomit <fct> No, No, No, No, No, No, Yes, No, No, No, Yes, Yes, N~
## $ Wheeze <fct> No, No, No, Yes, No, Yes, No, No, No, No, No, Yes, N~
## $ BodyTemp <dbl> 98.3, 100.4, 100.8, 98.8, 100.5, 98.4, 102.5, 98.4, ~
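As an independent cross-check of the column removal, the same name matching can be sketched in base R. The data frame `toy` and the helper `matches_any()` are hypothetical, with column names borrowed from the glimpse output. The comparison is made case-insensitive on purpose: contains() ignores case by default, which is why the pattern "Dxname" also removes DxName1 through DxName5.

```r
# Toy data frame with a few of the column names seen in the raw data
toy <- data.frame(
  DxName1       = "Influenza like illness - Clinical Dx",
  ActivityLevel = 10L,
  TransScore1   = 1,
  TotalSymp1    = 8,
  RapidFluA     = "Presumptive Negative For Influenza A",
  PCRFluB       = NA,
  Unique.Visit  = "340_17632125",
  Fatigue       = "Yes",
  BodyTemp      = 98.3
)

# Patterns named in the processing step
drop_patterns <- c("Score", "Total", "FluA", "FluB",
                   "Dxname", "Activity", "Unique.Visit")

# TRUE if a column name contains any pattern, ignoring case
# (fixed = TRUE treats "." in "Unique.Visit" literally)
matches_any <- function(column_name) {
  any(vapply(
    drop_patterns,
    function(p) grepl(tolower(p), tolower(column_name), fixed = TRUE),
    logical(1)
  ))
}

to_drop <- vapply(names(toy), matches_any, logical(1))
kept <- toy[, !to_drop, drop = FALSE]

print(names(kept))
```

Only the symptom column and BodyTemp survive in this toy example, which mirrors the 32 columns that remain in data7.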
base::summary(data7)
## SwollenLymphNodes ChestCongestion ChillsSweats NasalCongestion CoughYN
## No :421 No :326 No :131 No :170 No : 75
## Yes:314 Yes:409 Yes:604 Yes:565 Yes:660
##
##
##
##
##
## Sneeze Fatigue SubjectiveFever Headache Weakness WeaknessYN
## No :340 No : 64 No :230 No :115 None : 49 No : 49
## Yes:395 Yes:671 Yes:505 Yes:620 Mild :224 Yes:686
## Moderate:341
## Severe :121
##
##
##
## CoughIntensity CoughYN2 Myalgia MyalgiaYN RunnyNose AbPain
## None : 47 No : 47 None : 79 No : 79 No :211 No :642
## Mild :156 Yes:688 Mild :214 Yes:656 Yes:524 Yes: 93
## Moderate:360 Moderate:327
## Severe :172 Severe :115
##
##
##
## ChestPain Diarrhea EyePn Insomnia ItchyEye Nausea EarPn
## No :501 No :636 No :622 No :316 No :553 No :477 No :573
## Yes:234 Yes: 99 Yes:113 Yes:419 Yes:182 Yes:258 Yes:162
##
##
##
##
##
## Hearing Pharyngitis Breathless ToothPn Vision Vomit Wheeze
## No :705 No :121 No :438 No :569 No :716 No :656 No :514
## Yes: 30 Yes:614 Yes:297 Yes:166 Yes: 19 Yes: 79 Yes:221
##
##
##
##
##
## BodyTemp
## Min. : 97.20
## 1st Qu.: 98.20
## Median : 98.50
## Mean : 98.94
## 3rd Qu.: 99.30
## Max. :103.10
## NA's :5
#last step is to remove any NA observations
processed_data <- stats::na.omit(data7)
#summary of processed data using skimr package
skimr::skim(processed_data)
Name | processed_data |
Number of rows | 730 |
Number of columns | 32 |
_______________________ | |
Column type frequency: | |
factor | 31 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
SwollenLymphNodes | 0 | 1 | FALSE | 2 | No: 418, Yes: 312 |
ChestCongestion | 0 | 1 | FALSE | 2 | Yes: 407, No: 323 |
ChillsSweats | 0 | 1 | FALSE | 2 | Yes: 600, No: 130 |
NasalCongestion | 0 | 1 | FALSE | 2 | Yes: 563, No: 167 |
CoughYN | 0 | 1 | FALSE | 2 | Yes: 655, No: 75 |
Sneeze | 0 | 1 | FALSE | 2 | Yes: 391, No: 339 |
Fatigue | 0 | 1 | FALSE | 2 | Yes: 666, No: 64 |
SubjectiveFever | 0 | 1 | FALSE | 2 | Yes: 500, No: 230 |
Headache | 0 | 1 | FALSE | 2 | Yes: 615, No: 115 |
Weakness | 0 | 1 | FALSE | 4 | Mod: 338, Mil: 223, Sev: 120, Non: 49 |
WeaknessYN | 0 | 1 | FALSE | 2 | Yes: 681, No: 49 |
CoughIntensity | 0 | 1 | FALSE | 4 | Mod: 357, Sev: 172, Mil: 154, Non: 47 |
CoughYN2 | 0 | 1 | FALSE | 2 | Yes: 683, No: 47 |
Myalgia | 0 | 1 | FALSE | 4 | Mod: 325, Mil: 213, Sev: 113, Non: 79 |
MyalgiaYN | 0 | 1 | FALSE | 2 | Yes: 651, No: 79 |
RunnyNose | 0 | 1 | FALSE | 2 | Yes: 519, No: 211 |
AbPain | 0 | 1 | FALSE | 2 | No: 639, Yes: 91 |
ChestPain | 0 | 1 | FALSE | 2 | No: 497, Yes: 233 |
Diarrhea | 0 | 1 | FALSE | 2 | No: 631, Yes: 99 |
EyePn | 0 | 1 | FALSE | 2 | No: 617, Yes: 113 |
Insomnia | 0 | 1 | FALSE | 2 | Yes: 415, No: 315 |
ItchyEye | 0 | 1 | FALSE | 2 | No: 551, Yes: 179 |
Nausea | 0 | 1 | FALSE | 2 | No: 475, Yes: 255 |
EarPn | 0 | 1 | FALSE | 2 | No: 568, Yes: 162 |
Hearing | 0 | 1 | FALSE | 2 | No: 700, Yes: 30 |
Pharyngitis | 0 | 1 | FALSE | 2 | Yes: 611, No: 119 |
Breathless | 0 | 1 | FALSE | 2 | No: 436, Yes: 294 |
ToothPn | 0 | 1 | FALSE | 2 | No: 565, Yes: 165 |
Vision | 0 | 1 | FALSE | 2 | No: 711, Yes: 19 |
Vomit | 0 | 1 | FALSE | 2 | No: 652, Yes: 78 |
Wheeze | 0 | 1 | FALSE | 2 | No: 510, Yes: 220 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
BodyTemp | 0 | 1 | 98.94 | 1.2 | 97.2 | 98.2 | 98.5 | 99.3 | 103.1 | ▇▇▂▁▁ |
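The drop from 735 to 730 rows matches the five missing BodyTemp values reported by summary(). A minimal sketch of how na.omit() behaves, using a hypothetical four-row data frame `toy`: any row with at least one NA is removed, and the indices of the removed rows are kept in the "na.action" attribute of the result.

```r
# Toy data with one missing temperature (hypothetical values)
toy <- data.frame(
  Fatigue  = factor(c("Yes", "No", "Yes", "Yes")),
  BodyTemp = c(98.3, NA, 100.4, 99.1)
)

# Remove every row that contains at least one NA
cleaned <- stats::na.omit(toy)

# na.omit() records which rows were removed
dropped_rows <- as.integer(attr(cleaned, "na.action"))

print(nrow(cleaned))
print(dropped_rows)
```

Inspecting the "na.action" attribute is a quick way to confirm that only the expected observations were removed.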
The analysis that applies machine learning models requires data that is further processed. There are two steps involved:
In the output above, three symptoms have both a severity score and a yes/no variable: weakness, cough, and myalgia. Cough actually has two yes/no variables (CoughYN and CoughYN2). Each yes/no variable is strongly correlated with the corresponding severity score, and such redundant predictors can hurt model performance. Solution: remove all yes/no versions of variables for which a severity score exists.
#variable names to remove: WeaknessYN, MyalgiaYN, CoughYN, CoughYN2
featadj_data <- dplyr::select(processed_data, -c(WeaknessYN, MyalgiaYN, CoughYN, CoughYN2))
These severity scores are also ordered, so we need to specify the order: None < Mild < Moderate < Severe.
#levels = sets the order explicitly (labels = would only rename the existing levels in their current order)
#myalgia
featadj_data$Myalgia <- ordered(featadj_data$Myalgia, levels = c("None", "Mild", "Moderate", "Severe"))
#weakness
featadj_data$Weakness <- ordered(featadj_data$Weakness, levels = c("None", "Mild", "Moderate", "Severe"))
#cough
featadj_data$CoughIntensity <- ordered(featadj_data$CoughIntensity, levels = c("None", "Mild", "Moderate", "Severe"))
#double check to confirm code worked
skimr::skim(featadj_data)
Name | featadj_data |
Number of rows | 730 |
Number of columns | 28 |
_______________________ | |
Column type frequency: | |
factor | 27 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
SwollenLymphNodes | 0 | 1 | FALSE | 2 | No: 418, Yes: 312 |
ChestCongestion | 0 | 1 | FALSE | 2 | Yes: 407, No: 323 |
ChillsSweats | 0 | 1 | FALSE | 2 | Yes: 600, No: 130 |
NasalCongestion | 0 | 1 | FALSE | 2 | Yes: 563, No: 167 |
Sneeze | 0 | 1 | FALSE | 2 | Yes: 391, No: 339 |
Fatigue | 0 | 1 | FALSE | 2 | Yes: 666, No: 64 |
SubjectiveFever | 0 | 1 | FALSE | 2 | Yes: 500, No: 230 |
Headache | 0 | 1 | FALSE | 2 | Yes: 615, No: 115 |
Weakness | 0 | 1 | TRUE | 4 | Mod: 338, Mil: 223, Sev: 120, Non: 49 |
CoughIntensity | 0 | 1 | TRUE | 4 | Mod: 357, Sev: 172, Mil: 154, Non: 47 |
Myalgia | 0 | 1 | TRUE | 4 | Mod: 325, Mil: 213, Sev: 113, Non: 79 |
RunnyNose | 0 | 1 | FALSE | 2 | Yes: 519, No: 211 |
AbPain | 0 | 1 | FALSE | 2 | No: 639, Yes: 91 |
ChestPain | 0 | 1 | FALSE | 2 | No: 497, Yes: 233 |
Diarrhea | 0 | 1 | FALSE | 2 | No: 631, Yes: 99 |
EyePn | 0 | 1 | FALSE | 2 | No: 617, Yes: 113 |
Insomnia | 0 | 1 | FALSE | 2 | Yes: 415, No: 315 |
ItchyEye | 0 | 1 | FALSE | 2 | No: 551, Yes: 179 |
Nausea | 0 | 1 | FALSE | 2 | No: 475, Yes: 255 |
EarPn | 0 | 1 | FALSE | 2 | No: 568, Yes: 162 |
Hearing | 0 | 1 | FALSE | 2 | No: 700, Yes: 30 |
Pharyngitis | 0 | 1 | FALSE | 2 | Yes: 611, No: 119 |
Breathless | 0 | 1 | FALSE | 2 | No: 436, Yes: 294 |
ToothPn | 0 | 1 | FALSE | 2 | No: 565, Yes: 165 |
Vision | 0 | 1 | FALSE | 2 | No: 711, Yes: 19 |
Vomit | 0 | 1 | FALSE | 2 | No: 652, Yes: 78 |
Wheeze | 0 | 1 | FALSE | 2 | No: 510, Yes: 220 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
BodyTemp | 0 | 1 | 98.94 | 1.2 | 97.2 | 98.2 | 98.5 | 99.3 | 103.1 | ▇▇▂▁▁ |
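One detail is worth checking when building ordered factors: in factor() and ordered(), the levels argument fixes the order of the categories, while labels only renames the levels in whatever order they already have. The distinction makes no difference when the stored levels are already None, Mild, Moderate, Severe (as the summary() output earlier suggests), but it matters for character input, where the default order is alphabetical. A minimal sketch with a hypothetical vector `x`:

```r
severity <- c("None", "Mild", "Moderate", "Severe")

# Character input: default levels would be alphabetical
# (Mild, Moderate, None, Severe)
x <- c("Mild", "Severe", "None", "Moderate")

# levels = keeps each value and imposes the intended order
by_levels <- factor(x, levels = severity, ordered = TRUE)

# labels = renames the alphabetical levels position by position,
# silently changing the values
by_labels <- ordered(x, labels = severity)

print(as.character(by_levels))
print(as.character(by_labels))
print(by_levels[1] < by_levels[2])  # is Mild below Severe?
```

With levels the values are unchanged and comparisons such as Mild < Severe work as intended; with labels on character input, "Mild" is relabelled as "None" and the data are corrupted without any warning.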
The skimr output shows there are some predictors that are fairly unbalanced, with most patients reporting no and only a few yes. This can be handled automatically in tidymodels with step_nzv(), but it can be better to do it manually to ensure scientific relevance. Here, we will remove binary predictors that have <50 entries in one category. According to the skimr::skim output, there are two: Hearing and Vision.
#drop Hearing and Vision from the dataset to create processed dataset for ML analysis
ML_processed <- dplyr::select(featadj_data, -c(Hearing, Vision))
#summary of data using skimr package
skimr::skim(ML_processed)
Name | ML_processed |
Number of rows | 730 |
Number of columns | 26 |
_______________________ | |
Column type frequency: | |
factor | 25 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
SwollenLymphNodes | 0 | 1 | FALSE | 2 | No: 418, Yes: 312 |
ChestCongestion | 0 | 1 | FALSE | 2 | Yes: 407, No: 323 |
ChillsSweats | 0 | 1 | FALSE | 2 | Yes: 600, No: 130 |
NasalCongestion | 0 | 1 | FALSE | 2 | Yes: 563, No: 167 |
Sneeze | 0 | 1 | FALSE | 2 | Yes: 391, No: 339 |
Fatigue | 0 | 1 | FALSE | 2 | Yes: 666, No: 64 |
SubjectiveFever | 0 | 1 | FALSE | 2 | Yes: 500, No: 230 |
Headache | 0 | 1 | FALSE | 2 | Yes: 615, No: 115 |
Weakness | 0 | 1 | TRUE | 4 | Mod: 338, Mil: 223, Sev: 120, Non: 49 |
CoughIntensity | 0 | 1 | TRUE | 4 | Mod: 357, Sev: 172, Mil: 154, Non: 47 |
Myalgia | 0 | 1 | TRUE | 4 | Mod: 325, Mil: 213, Sev: 113, Non: 79 |
RunnyNose | 0 | 1 | FALSE | 2 | Yes: 519, No: 211 |
AbPain | 0 | 1 | FALSE | 2 | No: 639, Yes: 91 |
ChestPain | 0 | 1 | FALSE | 2 | No: 497, Yes: 233 |
Diarrhea | 0 | 1 | FALSE | 2 | No: 631, Yes: 99 |
EyePn | 0 | 1 | FALSE | 2 | No: 617, Yes: 113 |
Insomnia | 0 | 1 | FALSE | 2 | Yes: 415, No: 315 |
ItchyEye | 0 | 1 | FALSE | 2 | No: 551, Yes: 179 |
Nausea | 0 | 1 | FALSE | 2 | No: 475, Yes: 255 |
EarPn | 0 | 1 | FALSE | 2 | No: 568, Yes: 162 |
Pharyngitis | 0 | 1 | FALSE | 2 | Yes: 611, No: 119 |
Breathless | 0 | 1 | FALSE | 2 | No: 436, Yes: 294 |
ToothPn | 0 | 1 | FALSE | 2 | No: 565, Yes: 165 |
Vomit | 0 | 1 | FALSE | 2 | No: 652, Yes: 78 |
Wheeze | 0 | 1 | FALSE | 2 | No: 510, Yes: 220 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
BodyTemp | 0 | 1 | 98.94 | 1.2 | 97.2 | 98.2 | 98.5 | 99.3 | 103.1 | ▇▇▂▁▁ |
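The manual rule (fewer than 50 entries in one category) can also be checked programmatically rather than by reading the skim table. The sketch below uses base R and a hypothetical data frame `toy` whose category counts are copied from the skim output for Hearing, Vision, and Fatigue; the BodyTemp values are placeholders.

```r
# Toy data reproducing the category counts reported by skim()
toy <- data.frame(
  Hearing  = factor(c(rep("No", 700), rep("Yes", 30))),
  Vision   = factor(c(rep("No", 711), rep("Yes", 19))),
  Fatigue  = factor(c(rep("No", 64),  rep("Yes", 666))),
  BodyTemp = rep(98.6, 730)  # placeholder values
)

threshold <- 50  # cutoff stated in the text

# Restrict the check to two-level factors
is_binary <- vapply(
  toy,
  function(col) is.factor(col) && nlevels(col) == 2,
  logical(1)
)

# Size of the smaller category for each binary predictor
min_count <- vapply(
  toy[is_binary],
  function(col) min(table(col)),
  numeric(1)
)

rare_predictors <- names(min_count)[min_count < threshold]
print(rare_predictors)
```

Flagging candidates this way and then deciding by hand which ones to drop keeps the scientific judgement with the analyst while avoiding miscounts.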
We now have a newly processed dataframe with 730 observations and 26 variables to be used for the machine learning analysis.
Save the processed data to be referenced in subsequent analyses.
#for the overall processed data:
# location to save file
save_data_location <- here::here("data","flu","processeddata.rds")
# save data as RDS
saveRDS(processed_data, file = save_data_location)
#for the machine learning processed data:
# location to save file
save_data_location2 <- here::here("data","flu","ML_data.rds")
# save data as RDS
saveRDS(ML_processed, file = save_data_location2)
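Saving as RDS rather than a text format matters here because the ordered severity factors need to survive the round trip. A minimal sketch comparing RDS with CSV, using a hypothetical data frame `toy` and temporary files:

```r
severity <- c("None", "Mild", "Moderate", "Severe")

# Toy data with one ordered factor (hypothetical values)
toy <- data.frame(
  Myalgia  = factor(c("Mild", "Severe", "None"),
                    levels = severity, ordered = TRUE),
  BodyTemp = c(98.3, 100.4, 99.1)
)

rds_file <- tempfile(fileext = ".rds")
csv_file <- tempfile(fileext = ".csv")

saveRDS(toy, rds_file)
write.csv(toy, csv_file, row.names = FALSE)

from_rds <- readRDS(rds_file)
from_csv <- read.csv(csv_file)

print(is.ordered(from_rds$Myalgia))  # ordering kept
print(is.ordered(from_csv$Myalgia))  # ordering lost
```

After a CSV round trip the ordering would have to be specified again in every downstream script, whereas readRDS() returns the object exactly as it was saved.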