The following exercise is for Module 13 in Dr. Andreas Handel’s MADA Course. We are using the Marble Racing dataset, which is a previous TidyTuesday dataset from the week of 06/02/2020.
This analysis will follow the usual steps of loading, cleaning, exploring, and modeling the data using the `tidymodels` framework. It fits several models (a decision tree, LASSO, random forest, and bagged tree) to the outcome of interest. Each machine learning model follows the same general process: specify the model, build a workflow, tune the hyperparameters with the `tune_grid()` function, and evaluate the best-performing fit.
A few comments about notation and documentation throughout this markdown: intermediate dataframes are named `marbles_#`, the final cleaned dataframe is named `marbles_processed`, and the processed data is stored in the `data` folder in the GitHub repository for this website. Let's get started!
## Required Packages

The following R packages are required for this exercise:
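The `library()` calls themselves are not shown here, so the list below is a sketch inferred from the namespace calls used throughout the analysis (loading the tidyverse meta-package is an assumption; its component packages are what the code actually calls):
#packages implied by the code below (a sketch; load whichever subset you prefer)
library(tidyverse)    #dplyr, ggplot2, readr, stringr, tibble
library(here)         #relative file paths
library(skimr)        #data summaries
library(tidymodels)   #rsample, recipes, parsnip, workflows, tune, dials, yardstick, broom
library(broom.mixed)  #tidy()/augment() methods used on fitted workflows
library(glmnet)       #LASSO engine
library(ranger)       #random forest engine
library(rpart)        #decision tree engine
library(rpart.plot)   #plotting the fitted tree
library(baguette)     #bagged tree model
library(vip)          #variable importance plots
library(ggbump)       #bump chart of team rankings
library(glue)         #axis labels
library(doParallel)   #parallel backend for tuning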
The code to load the data is referenced in the GitHub repository linked above. It is also stored in the data
folder in the GitHub repository for this website (linked in the top right corner of the page).
#using the website link
marbles <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-02/marbles.csv')
## Rows: 256 Columns: 14
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (9): date, race, site, source, marble_name, team_name, pole, host, notes
## dbl (5): time_s, points, track_length_m, number_laps, avg_time_lap
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#save the raw csv to the `data` folder
raw_data_location <- here::here("data", "tidytuesday2", "marbles_raw.csv")
marbles_raw <- utils::write.csv(marbles, raw_data_location)
From the readme.md on GitHub, this week’s data comes from Randy Olson’s compilation of Jelle’s Marble Runs.
The description of the data from Randy Olson's blogpost is as follows:

> "Jelle's Marble Runs started as a quirky YouTube channel back in 2006 and has refined the art of marble racing to the point that many — including sponsor John Oliver from Last Week Tonight — consider marble racing a legitimate contender for the national sports spotlight. Given that Jelle's Marble Runs just completed their popular Marbula One competition last month, I was curious to look at the race results to see if these races were anything more than chaos.
>
> Do some marbles race better than others? Who would I put my money on in season 2 of Marbula One? … If any of these questions interest you, read on and I'll answer some of them.
>
> The first step to answering these questions was to get some data. Thankfully, all of the Marbula One videos are organized in a YouTube playlist available here. From every race, my marble racing analytics team recorded each marble racer's qualifier performance, total race time, average lap time, final rank, and some other statistics. That dataset is available for download on my website here."
The data dictionary is described in the readme.md in the repository, but is also replicated below for reference in the analysis.
#create data dictionary dataframe
variable <- c("date", "race", "site", "source", "marble_name", "team_name", "time_s", "pole", "points", "track_length_m", "number_laps", "avg_time_lap", "host", "notes")
class <- c("character", "character", "character", "character", "character", "character", "double", "character", "double", "double", "double", "double", "character", "character")
description <- c("date of race", "race id", "site of race", "youtube url", "name of marble", "team name", "time in seconds", "pole position", "points gained", "track length in meters", "number of laps", "average lap time", "host of race", "notes (very few, but some notes about potential errors")

data_dictionary <- data.frame(variable, class, description)
print(data_dictionary)
## variable class
## 1 date character
## 2 race character
## 3 site character
## 4 source character
## 5 marble_name character
## 6 team_name character
## 7 time_s double
## 8 pole character
## 9 points double
## 10 track_length_m double
## 11 number_laps double
## 12 avg_time_lap double
## 13 host character
## 14 notes character
## description
## 1 date of race
## 2 race id
## 3 site of race
## 4 youtube url
## 5 name of marble
## 6 team name
## 7 time in seconds
## 8 pole position
## 9 points gained
## 10 track length in meters
## 11 number of laps
## 12 average lap time
## 13 host of race
## 14 notes (very few, but some notes about potential errors
I wanted to better understand the pole position and points gained variables, as I assumed they are unique to the sport of marble racing.
After some digging on the Jelle's Marble Runs Fandom WikiPage, the points system for season 1 is designed to be identical to Formula One racing: 25 points for 1st place down to 1 point for 10th, with no points for 11th through 16th.
Just in case we need it later, let’s create a dataframe for points definitions.
value <- c(25, 18, 15, 12, 10, 8, 6, 4, 2, 1, 0, 0, 0, 0, 0, 0)
place <- c("1st", "2nd", "3rd", "4th", "5th", "6th", "7th", "8th", "9th", "10th", "11th", "12th", "13th", "14th", "15th", "16th")

points_dictionary <- data.frame(value, place)
print(points_dictionary)
## value place
## 1 25 1st
## 2 18 2nd
## 3 15 3rd
## 4 12 4th
## 5 10 5th
## 6 8 6th
## 7 6 7th
## 8 4 8th
## 9 2 9th
## 10 1 10th
## 11 0 11th
## 12 0 12th
## 13 0 13th
## 14 0 14th
## 15 0 15th
## 16 0 16th
In motorsports, the pole position is the position at the front at the start of a racing event. That position is typically given to the vehicle and driver with the best qualifying time in the trials before the race. In this dataset, the `pole` variable appears to correspond to the positioning at the beginning of the race.
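As a quick, minimal check of that reading, we can look at the values the variable actually takes (an illustrative snippet; its output is not shown here):
#qualifying rows carry a pole position P1-P16; race rows leave it as NA
sort(unique(marbles$pole))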
Take an initial look at the raw data. Start by using `utils::str()`.
#data structure
utils::str(marbles)
## spec_tbl_df [256 x 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ date : chr [1:256] "15-Feb-20" "15-Feb-20" "15-Feb-20" "15-Feb-20" ...
## $ race : chr [1:256] "S1Q1" "S1Q1" "S1Q1" "S1Q1" ...
## $ site : chr [1:256] "Savage Speedway" "Savage Speedway" "Savage Speedway" "Savage Speedway" ...
## $ source : chr [1:256] "https://youtu.be/JtsQ_UydjEI?t=356" "https://youtu.be/JtsQ_UydjEI?t=356" "https://youtu.be/JtsQ_UydjEI?t=356" "https://youtu.be/JtsQ_UydjEI?t=356" ...
## $ marble_name : chr [1:256] "Clementin" "Starry" "Momo" "Yellow" ...
## $ team_name : chr [1:256] "O'rangers" "Team Galactic" "Team Momo" "Mellow Yellow" ...
## $ time_s : num [1:256] 28.1 28.4 28.4 28.7 28.7 ...
## $ pole : chr [1:256] "P1" "P2" "P3" "P4" ...
## $ points : num [1:256] NA NA NA NA NA NA NA NA NA NA ...
## $ track_length_m: num [1:256] 12.8 12.8 12.8 12.8 12.8 ...
## $ number_laps : num [1:256] 1 1 1 1 1 1 1 1 1 1 ...
## $ avg_time_lap : num [1:256] 28.1 28.4 28.4 28.7 28.7 ...
## $ host : chr [1:256] "No" "No" "No" "No" ...
## $ notes : chr [1:256] NA NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. date = col_character(),
## .. race = col_character(),
## .. site = col_character(),
## .. source = col_character(),
## .. marble_name = col_character(),
## .. team_name = col_character(),
## .. time_s = col_double(),
## .. pole = col_character(),
## .. points = col_double(),
## .. track_length_m = col_double(),
## .. number_laps = col_double(),
## .. avg_time_lap = col_double(),
## .. host = col_character(),
## .. notes = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
This confirms the imported data classes are correct and match the data dictionary.
Now use the base R `summary()` function.
#data summary
summary(marbles)
## date race site source
## Length:256 Length:256 Length:256 Length:256
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## marble_name team_name time_s pole
## Length:256 Length:256 Min. : 17.76 Length:256
## Class :character Class :character 1st Qu.: 28.40 Class :character
## Mode :character Mode :character Median : 36.28 Mode :character
## Mean :190.84
## 3rd Qu.:338.16
## Max. :492.01
## NA's :3
## points track_length_m number_laps avg_time_lap
## Min. : 0.000 Min. :11.90 Min. : 1.00 Min. :17.76
## 1st Qu.: 0.000 1st Qu.:12.62 1st Qu.: 1.00 1st Qu.:25.94
## Median : 3.000 Median :13.02 Median : 5.00 Median :30.05
## Mean : 6.453 Mean :13.22 Mean : 6.25 Mean :29.70
## 3rd Qu.:11.250 3rd Qu.:14.13 3rd Qu.:10.25 3rd Qu.:33.65
## Max. :26.000 Max. :14.55 Max. :16.00 Max. :41.62
## NA's :128 NA's :3
## host notes
## Length:256 Length:256
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
There are 128 NAs in `points` and 3 NAs in `avg_time_lap`. Perhaps the NAs in `avg_time_lap` correspond to marbles that didn't finish the race? This output also doesn't tell us much about the character variables.
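A quick way to check that hunch is to pull the rows with a missing average lap time and see what else is recorded for them (a minimal sketch; output not shown):
#inspect the three rows with missing avg_time_lap
marbles %>%
  dplyr::filter(is.na(avg_time_lap)) %>%
  dplyr::select(race, marble_name, team_name, time_s, points, notes)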
Now use `skimr::skim()`, as it lets us see the distribution of each variable.
#data summary
skimr::skim(marbles)
Name | marbles |
Number of rows | 256 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
character | 9 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
date | 0 | 1.00 | 8 | 9 | 0 | 16 | 0 |
race | 0 | 1.00 | 4 | 4 | 0 | 16 | 0 |
site | 0 | 1.00 | 7 | 15 | 0 | 8 | 0 |
source | 0 | 1.00 | 34 | 34 | 0 | 16 | 0 |
marble_name | 0 | 1.00 | 4 | 9 | 0 | 32 | 0 |
team_name | 0 | 1.00 | 6 | 16 | 0 | 16 | 0 |
pole | 128 | 0.50 | 2 | 3 | 0 | 16 | 0 |
host | 0 | 1.00 | 2 | 3 | 0 | 2 | 0 |
notes | 249 | 0.03 | 37 | 100 | 0 | 7 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
time_s | 3 | 0.99 | 190.84 | 169.13 | 17.76 | 28.40 | 36.28 | 338.16 | 492.01 | ▇▁▁▇▁ |
points | 128 | 0.50 | 6.45 | 7.74 | 0.00 | 0.00 | 3.00 | 11.25 | 26.00 | ▇▂▂▁▁ |
track_length_m | 0 | 1.00 | 13.22 | 0.95 | 11.90 | 12.62 | 13.02 | 14.13 | 14.55 | ▅▅▂▁▇ |
number_laps | 0 | 1.00 | 6.25 | 5.53 | 1.00 | 1.00 | 5.00 | 10.25 | 16.00 | ▇▁▃▂▂ |
avg_time_lap | 3 | 0.99 | 29.70 | 5.55 | 17.76 | 25.94 | 30.05 | 33.65 | 41.62 | ▃▆▇▇▂ |
A few conclusions from this output: the identical 50% missingness in `pole` and `points` suggests that there are two rows for each race (an opportunity to collapse the rows into one). All of this information together suggests there were 16 teams competing in 8 races for Season 1 of Marbula One. Each race was held at a different track, and there was a host team for each race. The first date for each race is the qualifying race date, and the second date is the actual race.
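Those counts can be confirmed directly from the raw data with a few one-liners (the values match the n_unique column in the skim output above):
#quick checks behind these conclusions
dplyr::n_distinct(marbles$team_name)   #16 teams
dplyr::n_distinct(marbles$marble_name) #32 marbles
dplyr::n_distinct(marbles$site)        #8 race sites
dplyr::n_distinct(marbles$race)        #16 race ids (8 qualifiers + 8 races)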
Based on the examination of the data so far, there are a few immediately obvious steps for cleaning the data:

1. Create new variables to distinguish qualifying times from actual race times.
2. Investigate the values in the `notes` variable.
3. Replace the `race` and `date` variables with a new `race_number` variable.
4. Drop variables that will not be needed for the analysis (`source`, `notes`, `avg_time_lap`, `time_s`).
5. Combine the rows so there is only one row per race/team.

Let's start with step #1: create new variables to specify race vs. qualifying times. There are a number of ways to do this, but I'm going to use the fact that `pole` is NA for actual races and `points` is NA for qualifying races.
#create shadow df to not manipulate the raw df
marbles_1 <- marbles

#for race time in seconds
marbles_1$qual_time <- ifelse(is.na(marbles_1$points), marbles_1$time_s, NA)
marbles_1$race_time <- ifelse(is.na(marbles_1$pole), marbles_1$time_s, NA)

#for average lap length
marbles_1$qual_avg_lap <- ifelse(is.na(marbles_1$points), marbles_1$avg_time_lap, NA)
marbles_1$race_avg_lap <- ifelse(is.na(marbles_1$pole), marbles_1$avg_time_lap, NA)
Moving on to step #2: investigate the 7 `notes` values.
#print the notes
marbles_1$notes[!is.na(marbles_1$notes)]
## [1] "Note: Came to complete stop in Lap 14"
## [2] "*Note: A yellow SAFETY flag is issued due to incident in Lap 1."
## [3] "Shortly after, a red SUSPENDED flag is issued to restart the race, due to major blockage."
## [4] "**Note: Upon the restart, another red flag is issued due to a track invasion incident by a rowdy fan"
## [5] "Race resumed normally after the culprit is escorted by security marbles"
## [6] "*Note: Slight incident between Speedy and Clementin"
## [7] "Ultimately, JMRC reviews and deems no action is necessary"
#are they associated with different races / teams?
marbles_1[!(is.na(marbles_1$notes)), ]
## # A tibble: 7 x 18
## date race site source marble_name team_name time_s pole points
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 22-Mar-20 S1R6 Short Circuit https~ Sublime Limers NA <NA> 0
## 2 29-Mar-20 S1R7 Razzway https~ Smoggy Hazers 331. <NA> 25
## 3 29-Mar-20 S1R7 Razzway https~ Orangin O'rangers 332. <NA> 19
## 4 29-Mar-20 S1R7 Razzway https~ Anarchy Balls of~ 334. <NA> 10
## 5 29-Mar-20 S1R7 Razzway https~ Rapidly Savage S~ 336. <NA> 6
## 6 4-Apr-20 S1Q8 Midnight Bay https~ Speedy Savage S~ 24.5 P1 NA
## 7 4-Apr-20 S1Q8 Midnight Bay https~ Clutter Balls of~ 25.2 P3 NA
## # ... with 9 more variables: track_length_m <dbl>, number_laps <dbl>,
## # avg_time_lap <dbl>, host <chr>, notes <chr>, qual_time <dbl>,
## # race_time <dbl>, qual_avg_lap <dbl>, race_avg_lap <dbl>
#since each note represents a different race / team, we can leave the notes and let them be combined
# I prefer to keep the notes in case we need to explain any anomalies in the data
Next, step #3: replace the `race` and `date` variables with a new `race_number` variable. We can do this by extracting the 4th character of the `race` variable, which gives the race number.
#use the base R `substr()` function to extract the 4th character of the race id
marbles_1$race_number <- substr(marbles_1$race, 4, 4)
Moving on to step #4: drop irrelevant variables. This is to avoid confusing R when we collapse the rows.
#drop date, race, source, time_s, avg_time_lap
marbles_2 <- marbles_1 %>%
  dplyr::select(-c(date, race, source, time_s, avg_time_lap))

#reorder columns
marbles_3 <- marbles_2[, c(14, 1:3, 6:8, 4, 10, 12, 5, 11, 13, 9)]
The big step, #5: combine all the rows such that there is only one row per race / team.
#create filtered subset for qualifying data
marbles_4 <- marbles_3 %>%
  dplyr::select(-c(points, race_time, race_avg_lap)) %>%
  dplyr::filter(!is.na(pole)) %>%
  dplyr::rename(notes_1 = notes)

#create filtered subset for race data
marbles_5 <- marbles_3 %>%
  dplyr::select(-c(pole, qual_time, qual_avg_lap)) %>%
  dplyr::filter(!is.na(points)) %>%
  dplyr::rename(notes_2 = notes)

#bind the columns
marbles_6 <- cbind(marbles_4, marbles_5)

#remove duplicate columns
marbles_7 <- marbles_6 %>%
  dplyr::select(-c(1:7))

#combine notes columns
marbles_8 <- marbles_7 %>%
  dplyr::mutate(notes = dplyr::coalesce(notes_1, notes_2)) %>%
  dplyr::select(-c(notes_1, notes_2))

#reorder columns
marbles_9 <- marbles_8[, c(4:10, 1:3, 12, 13, 11, 14)]
We now have a cleaned dataset that has one row per race/team and includes information about qualifying and actual races. Let’s look at the dataframe summary.
skimr::skim(marbles_9)
Name | marbles_9 |
Number of rows | 128 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
character | 7 |
numeric | 7 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
race_number | 0 | 1.00 | 1 | 1 | 0 | 8 | 0 |
site | 0 | 1.00 | 7 | 15 | 0 | 8 | 0 |
marble_name | 0 | 1.00 | 4 | 9 | 0 | 32 | 0 |
team_name | 0 | 1.00 | 6 | 16 | 0 | 16 | 0 |
host | 0 | 1.00 | 2 | 3 | 0 | 2 | 0 |
pole | 0 | 1.00 | 2 | 3 | 0 | 16 | 0 |
notes | 121 | 0.05 | 37 | 100 | 0 | 7 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
track_length_m | 0 | 1.00 | 13.22 | 0.95 | 11.90 | 12.62 | 13.02 | 14.13 | 14.55 | ▅▅▂▁▇ |
number_laps | 0 | 1.00 | 11.50 | 2.41 | 9.00 | 9.75 | 10.50 | 13.25 | 16.00 | ▇▂▂▂▂ |
qual_time | 0 | 1.00 | 27.52 | 5.16 | 17.76 | 23.93 | 28.55 | 31.00 | 39.49 | ▅▇▇▅▂ |
qual_avg_lap | 0 | 1.00 | 27.52 | 5.16 | 17.76 | 23.93 | 28.55 | 31.00 | 39.49 | ▅▇▇▅▂ |
race_time | 3 | 0.98 | 358.08 | 48.77 | 317.73 | 333.06 | 338.23 | 360.16 | 492.01 | ▇▂▁▁▂ |
race_avg_lap | 3 | 0.98 | 31.92 | 5.03 | 23.39 | 29.66 | 32.44 | 34.48 | 41.62 | ▆▅▇▃▃ |
points | 0 | 1.00 | 6.45 | 7.74 | 0.00 | 0.00 | 3.00 | 11.25 | 26.00 | ▇▂▂▁▁ |
There are a few variables that might be interesting to create based on the data:

1. Total points for the season for each team
2. Final season rankings
3. Number of times a team had the fastest lap in the season
4. Cumulative points throughout the season
5. Current ranking after each race
6. Difference between qualifying and race lap times
7. Speed of the marble in each race

For the season-long characteristics, I'm going to create a new season summary dataframe, and then I will merge it back into the original dataframe at the end of this section.
Let’s start with #1: find total points for the season for each team.
#sum the points by team_name
season_stats <- marbles_9 %>%
  dplyr::filter(!is.na(points)) %>%
  dplyr::group_by(team_name) %>%
  dplyr::summarise(total_points = sum(points))
Next, step #2: final season rankings
#create a variable that is the rank order of the total_points variable
season_stats_1 <- season_stats %>%
  dplyr::mutate(team_rank = rank(desc(total_points),
                                 ties.method = "min"))
Moving on to #3: find the number of times a team had the fastest lap in the season.
#count the number of times each points value occurs for each team
fastest_lap_counts <- marbles_9 %>%
  dplyr::count(team_name, points)

#filter out the values that do not match the points dictionary
#(values not in the dictionary imply a fastest-lap bonus point was earned)
#then count the number of times each team appears in the filtered list
fastest_lap_counts_1 <- fastest_lap_counts %>%
  dplyr::filter(!(points %in% points_dictionary$value)) %>%
  dplyr::count(team_name)

#merge into season stats df
season_stats_2 <- merge(season_stats_1, fastest_lap_counts_1, all = TRUE)

#replace NA with 0
season_stats_2[is.na(season_stats_2)] <- 0

#relabel n
names(season_stats_2)[4] <- "fastest_lap_count"
Now we need to merge the `season_stats` dataframe into the `marbles_9` dataframe.
marbles_10 <- merge(marbles_9, season_stats_2)
Next, #4, let’s calculate the current total number of points at the beginning of each race. In other words, finding the cumulative points throughout the season.
#fix class of race_number
as.numeric(marbles_10$race_number)
## [1] 2 7 6 1 3 4 8 5 7 4 2 6 5 1 3 8 7 5 1 6 8 2 4 3 6 8 7 4 2 1 3 5 6 5 3 1 4
## [38] 2 7 8 8 4 3 5 2 7 6 1 7 2 5 4 6 3 8 1 2 1 4 5 3 8 6 7 4 6 7 5 1 2 3 8 3 2
## [75] 8 7 5 6 4 1 6 5 3 1 2 7 8 4 6 5 1 7 8 4 3 2 1 2 8 7 4 5 6 3 4 3 7 2 8 6 5
## [112] 1 2 1 6 7 3 8 4 5 8 6 4 3 1 5 7 2
#within each race/team group the cumulative sum is just that race's points;
#a second cumulative sum by team then gives the running season total
marbles_11 <- marbles_10 %>%
  dplyr::group_by(race_number, team_name) %>%
  dplyr::arrange(race_number, team_name) %>%
  dplyr::mutate(cumsum = cumsum(points)) %>%
  dplyr::group_by(team_name) %>%
  dplyr::mutate(current_points = cumsum(cumsum)) %>%
  dplyr::ungroup() %>%
  dplyr::select(-cumsum)
Moving on to #5: find the current ranking after each race throughout the season.
#find the rank after each race
marbles_12 <- marbles_11 %>%
  dplyr::group_by(race_number) %>%
  dplyr::mutate(current_rank = rank(desc(current_points),
                                    ties.method = "min"))
Next, #6, find the differences in time between the qualifying and actual races. The qualifying races are only 1 lap each, so it’s really only meaningful to find the difference in average lap times.
marbles_12$time_diff <- marbles_12$race_avg_lap - marbles_12$qual_avg_lap
Lastly, #7: calculate the speed of the marble in each race. We have variables for track length in meters, the number of laps, and race time in seconds, so we can multiply track length by the number of laps and divide by race time to get the speed in meters per second.
marbles_12$speed <- (marbles_12$track_length_m * marbles_12$number_laps) / marbles_12$race_time
We can now remove any unimportant variables and rearrange the columns to make a pretty, organized dataframe.
marbles_processed <- marbles_12 %>%
  dplyr::select(race_number, site, track_length_m, number_laps, team_name,
                host, marble_name, pole, qual_time, race_time, race_avg_lap,
                speed, time_diff, points, current_points, current_rank,
                total_points, team_rank, fastest_lap_count, notes)
For future reference, I’m saving the processed data and the updated the data dictionary as a .Rds in the data
folder in the GitHub repository for this website.
#location to save file
save_data_location <- here::here("data", "tidytuesday2", "processeddata.rds")

#save processed data as .rds
saveRDS(marbles_processed, file = save_data_location)

#create new data dictionary
variable <- c("race_number", "site", "track_length_m", "number_laps", "team_name", "host", "marble_name", "pole", "qual_time", "race_time", "race_avg_lap", "speed", "time_diff", "points", "current points", "current_rank", "total_points", "team_rank", "fastest_lap_count", "notes")
class <- c("character", "character", "number", "number", "character", "character", "character", "character", "number", "number", "number", "number", "number", "number", "number", "integer", "number", "integer", "number", "character")
description <- c("number of the race in the season", "race location", "length of the track in meters", "number of laps for the site", "team name", "whether or not the team hosted the race", "marble name", "starting position determined by qualifying rounds", "qualifying time", "race time", "average time per lap in race", "average speed of marble in the actual race", "difference in average lap time between race and qualifying times", "points earned for the race", "cumulative points at the end of each race", "rank in the league at the end of each race", "total season points for the team", "final team ranking at the end of the season", "number of times the team had the fastest lap in the season", "notes (very few, but some notes about potential errors")

data_dictionary_2 <- data.frame(variable, class, description)
print(data_dictionary_2)
## variable class
## 1 race_number character
## 2 site character
## 3 track_length_m number
## 4 number_laps number
## 5 team_name character
## 6 host character
## 7 marble_name character
## 8 pole character
## 9 qual_time number
## 10 race_time number
## 11 race_avg_lap number
## 12 speed number
## 13 time_diff number
## 14 points number
## 15 current points number
## 16 current_rank integer
## 17 total_points number
## 18 team_rank integer
## 19 fastest_lap_count number
## 20 notes character
## description
## 1 number of the race in the season
## 2 race location
## 3 length of the track in meters
## 4 number of laps for the site
## 5 team name
## 6 whether or not the team hosted the race
## 7 marble name
## 8 starting position determined by qualifying rounds
## 9 qualifying time
## 10 race time
## 11 average time per lap in race
## 12 average speed of marble in the actual race
## 13 difference in average lap time between race and qualifying times
## 14 points earned for the race
## 15 cumulative points at the end of each race
## 16 rank in the league at the end of each race
## 17 total season points for the team
## 18 final team ranking at the end of the season
## 19 number of times the team had the fastest lap in the season
## 20 notes (very few, but some notes about potential errors
#location to save data dictionary
save_dictionary_location <- here::here("data", "tidytuesday2", "datadictionary.rds")

#save data dictionary as .rds
saveRDS(data_dictionary_2, file = save_dictionary_location)
Now that we’ve created a cleaned and processed dataset, let’s do some exploration. In lieu of following a formal exploratory data analysis flow, there are a few figures to create that will illustrate the data:
To start, create a figure that shows marble names assigned to each team.
#we can use a barplot and a facet wrap to accomplish this
marble_names_plot <- marbles_processed %>%
  dplyr::group_by(team_name, marble_name) %>%
  dplyr::count() %>%
  ggplot2::ggplot(aes(x = marble_name,
                      y = n,
                      fill = team_name)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = marble_name),
            vjust = 1.75,
            col = "black",
            size = 4) +
  facet_wrap(~ team_name,
             scales = "free",
             nrow = 8) +
  theme_void() +
  theme(legend.position = "none",
        strip.text.x = element_text(size = 11))

marble_names_plot
Next, plot competition points for each team throughout the season.
team_points_plot <- marbles_processed %>%
  ggplot2::ggplot(aes(x = race_number,
                      y = current_points,
                      group = team_name,
                      color = team_name)) +
  geom_line(size = 1) +
  scale_y_continuous(expand = c(0, 0),
                     breaks = seq(0, 110, by = 10)) +
  scale_x_discrete(expand = c(0, 0),
                   labels = unique(marbles_processed$site)) +
  labs(x = "Race",
       y = "Cumulative Points") +
  theme_classic() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 90,
                                   vjust = 0.5,
                                   hjust = 1))

team_points_plot
Now, plot total points earned by each marble for each team at the end of the season. We don't yet have a points-by-marble variable, but it's easy enough to compute in the pre-processing step before the ggplot call.
marble_points <- marbles_processed %>%
  dplyr::group_by(team_name, marble_name) %>%
  dplyr::summarise(marble_points = sum(points)) %>%
  dplyr::arrange(-marble_points) %>%
  ggplot2::ggplot(aes(x = reorder(marble_name, marble_points),
                      y = marble_points,
                      fill = team_name)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ team_name,
             scales = "free") +
  coord_flip() +
  labs(x = "Marble Name",
       y = "\nTotal Points") +
  theme_minimal() +
  theme(legend.position = "none")
## `summarise()` has grouped output by 'team_name'. You can override using the `.groups` argument.
marble_points
To visualize points at each race location for each team, a grouped bar plot would get chaotic, and a facet_wrap would also be hard to interpret. What if we made a gradient matrix with points in each box?
#using the points color scheme from the marble racing wiki page
race_teams_plot <- marbles_processed %>%
  ggplot2::ggplot(aes(x = site,
                      y = team_name)) +
  geom_tile(aes(fill = points)) +
  scale_fill_gradientn(colours = c("#ffffbf",
                                   "#dfdfdf",
                                   "#ffdf9f",
                                   "#dfffdf",
                                   "#cfcfff"),
                       values = scales::rescale(c(1, 0.7, 0.6, 0.3, 0))) +
  geom_text(aes(label = points)) +
  labs(x = "Race Location",
       y = "Team Name") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90,
                                   vjust = 0.3,
                                   hjust = 1))

race_teams_plot
Next, plot team rankings throughout the season.
#If this were a real exercise, I would spend more time adjusting the formatting to make it prettier.
team_rank_plot <- marbles_processed %>%
  ggplot2::ggplot(aes(x = race_number,
                      y = current_rank,
                      group = team_name,
                      color = team_name)) +
  ggbump::geom_bump(smooth = 7,
                    size = 2.2) +
  geom_point(data = marbles_processed %>% dplyr::filter(race_number == 1),
             size = 5) +
  geom_point(data = marbles_processed %>% dplyr::filter(race_number == 8),
             size = 5,
             shape = 21,
             fill = "black",
             stroke = 2) +
  geom_text(data = marbles_processed %>% dplyr::filter(race_number == 8),
            aes(label = team_rank),
            size = 5,
            hjust = -1) +
  coord_cartesian(clip = "off") +
  scale_x_discrete(expand = c(.001, .001),
                   labels = glue::glue("Race {1:8}")) +
  scale_y_reverse(expand = c(.03, .03),
                  breaks = 1:16) +
  theme_minimal() +
  labs(x = "",
       y = "Team Rank")

team_rank_plot
Moving on to #6: a scatterplot of marble speed and points.
#only plot results where the marble earned more than 1 point
speed_points <- marbles_processed %>%
  dplyr::filter(points > 1) %>%
  ggplot2::ggplot(aes(x = points,
                      y = speed)) +
  geom_point(col = "blue") +
  facet_wrap(~ site,
             scales = 'free') +
  theme_minimal() +
  labs(y = "Speed (m/s)",
       x = "Points")

speed_points
Lastly, before we can think about models, let's look at a pairwise scatterplot. We can only use the continuous variables, so first subset the processed data and then plot it.
marbles_cont <- marbles_processed %>%
  dplyr::select(race_number, track_length_m, number_laps, qual_time,
                race_time, race_avg_lap, speed, time_diff, points,
                current_points, current_rank, total_points, team_rank,
                fastest_lap_count)

marbles_cont$race_number <- as.numeric(marbles_cont$race_number)
marbles_cont$current_rank <- as.numeric(marbles_cont$current_rank)
marbles_cont$team_rank <- as.numeric(marbles_cont$team_rank)

pairs(marbles_cont)
So not super helpful…too many variables.
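One way to get something more digestible than the full pairs plot is to look only at how each continuous variable correlates with the outcome of interest (a small sketch; the exact values are not shown here):
#correlation of each continuous variable with total_points
round(cor(marbles_cont, use = "pairwise.complete.obs")[, "total_points"], 2)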
What are the significant predictors in predicting team score at the end of the season?
Outcome of interest: total team points at the end of the season (`total_points` in this code).
Predictors of interest include the remaining race, qualifying, and season-to-date variables (e.g. `race_number`, `site`, `track_length_m`, `number_laps`, `host`, `pole`, `qual_time`, `race_time`, `race_avg_lap`, `speed`, `time_diff`, `points`, `current_points`, `current_rank`, and `fastest_lap_count`).
I'm choosing not to include team or marble names, as including them would limit the applicability to other seasons or other groups.
We need to subset our processed data and adjust the variable formats before creating the machine learning models.
#drop the season-specific variables
marbles_ML <- marbles_processed %>%
  dplyr::select(-c(team_name,
                   marble_name,
                   team_rank,
                   notes))

#adjust variable formats
marbles_ML$race_number <- as.numeric(marbles_ML$race_number)
marbles_ML$current_rank <- as.numeric(marbles_ML$current_rank)

#remove the "P" from the pole variable
marbles_ML$pole <- as.numeric(stringr::str_sub(marbles_ML$pole, 2))

#LASSO won't handle NA values, so replace NAs with 0 for the marbles that did not finish races
marbles_ML[is.na(marbles_ML)] <- 0
Specify the following parameters:

1. A random seed of 123
2. A 70/30 train/test split, with `total_points` as the stratification variable
3. 5-fold cross-validation, repeated 5 times, stratified on `total_points` for the CV folds

#set random seed to 123
set.seed(123)

#split dataset into 70% training, 30% testing
#use total_points as stratification
data_split <- rsample::initial_split(marbles_ML, prop = 7/10,
                                     strata = total_points)

#create dataframes for the two sets:
train_data <- rsample::training(data_split)
test_data <- rsample::testing(data_split)

#5-fold cross validation, 5 times repeated, stratified on `total_points` for the CV folds
folds <- rsample::vfold_cv(train_data,
                           v = 5,
                           repeats = 5,
                           strata = total_points)

#create recipe that codes categorical variables as dummy variables
marbles_rec <- recipes::recipe(total_points ~ ., data = train_data) %>%
  recipes::step_dummy(all_nominal_predictors())
Determine the performance of a null model (i.e. one with no predictors). For a continuous outcome and RMSE as the metric, a null model is one that predicts the mean of the outcome. Compute the RMSE for both training and test data for such a model.
#create null model
null_mod <- parsnip::null_model() %>%
  parsnip::set_engine("parsnip") %>%
  parsnip::set_mode("regression")

#add recipe and model into workflow
null_wflow <- workflows::workflow() %>%
  workflows::add_recipe(marbles_rec) %>%
  workflows::add_model(null_mod)

#"fit" model to training data
null_train <- null_wflow %>%
  parsnip::fit(data = train_data)

#summary of null model with training data to get the mean (which is what the null model predicts)
null_train_sum <- broom.mixed::tidy(null_train)
null_train_sum
## # A tibble: 1 x 1
## value
## <dbl>
## 1 51.5
#RMSE for training data
null_RMSE_train <- tibble::tibble(
  rmse = rmse_vec(truth = train_data$total_points,
                  estimate = rep(mean(train_data$total_points), nrow(train_data))),
  SE = 0,
  model = "Null - Train")

#RMSE for testing data
null_RMSE_test <- tibble::tibble(
  rmse = rmse_vec(truth = test_data$total_points,
                  estimate = rep(mean(test_data$total_points), nrow(test_data))),
  SE = 0,
  model = "Null - Test")
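As a sanity check on that definition: when every prediction is the training mean, the RMSE is simply the (population) standard deviation of the outcome, so `sd(train_data$total_points)` should land very close to the value stored in `null_RMSE_train` (differing only by the n vs. n-1 divisor):
#the null model's RMSE should be roughly the standard deviation of the outcome
sd(train_data$total_points)
null_RMSE_train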
Most of the code for this section comes from the TidyModels Tutorial for Tuning.
#run parallels to determine number of cores
cores <- parallel::detectCores() - 1
cores
## [1] 19
cl <- makeCluster(cores)
registerDoParallel(cl)
#define the tree model
tree_mod <-
  parsnip::decision_tree(cost_complexity = tune(),
                         tree_depth = tune(),
                         min_n = tune()) %>%
  parsnip::set_engine("rpart") %>%
  parsnip::set_mode("regression")

#use the recipe specified earlier
#define workflow for tree
tree_wflow <- workflows::workflow() %>%
  workflows::add_model(tree_mod) %>%
  workflows::add_recipe(marbles_rec)
#tuning grid specification
tree_grid <- dials::grid_regular(cost_complexity(),
                                 tree_depth(),
                                 min_n(),
                                 levels = 5)

#tree depth
tree_grid %>%
  dplyr::count(tree_depth)
## # A tibble: 5 x 2
## tree_depth n
## <int> <int>
## 1 1 25
## 2 4 25
## 3 8 25
## 4 11 25
## 5 15 25
Tune the model with the `tune_grid()` function.
#tune the model with previously specified cross-validation and RMSE as target metric
tree_res <- tree_wflow %>%
  tune::tune_grid(resamples = folds,
                  grid = tree_grid,
                  control = control_grid(verbose = TRUE),
                  metrics = yardstick::metric_set(rmse))
#collect metrics
tree_res %>% workflowsets::collect_metrics()
## # A tibble: 125 x 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 1 2 rmse standard 17.1 25 0.412
## 2 0.0000000178 1 2 rmse standard 17.1 25 0.412
## 3 0.00000316 1 2 rmse standard 17.1 25 0.412
## 4 0.000562 1 2 rmse standard 17.1 25 0.412
## 5 0.1 1 2 rmse standard 17.1 25 0.412
## 6 0.0000000001 4 2 rmse standard 16.4 25 0.608
## 7 0.0000000178 4 2 rmse standard 16.4 25 0.608
## 8 0.00000316 4 2 rmse standard 16.4 25 0.608
## 9 0.000562 4 2 rmse standard 16.4 25 0.609
## 10 0.1 4 2 rmse standard 16.0 25 0.404
## # ... with 115 more rows, and 1 more variable: .config <chr>
#default visualization
tree_res %>% autoplot()
#more detailed plot
tree_res %>%
  workflowsets::collect_metrics() %>%
  dplyr::mutate(tree_depth = factor(tree_depth)) %>%
  ggplot2::ggplot(aes(cost_complexity, mean, color = tree_depth)) +
  geom_line(size = 1.5, alpha = 0.6) +
  geom_point(size = 2) +
  facet_wrap(~ .metric, scales = "free", nrow = 2) +
  scale_x_log10(labels = scales::label_number()) +
  scale_color_viridis_d(option = "plasma", begin = 0.9, end = 0)
#select the tree model with the lowest rmse
tree_lowest_rmse <- tree_res %>%
  tune::select_best("rmse")

#finalize the workflow by using the selected tree model
best_tree_wflow <- tree_wflow %>%
  tune::finalize_workflow(tree_lowest_rmse)
best_tree_wflow
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: decision_tree()
##
## -- Preprocessor ----------------------------------------------------------------
## 1 Recipe Step
##
## * step_dummy()
##
## -- Model -----------------------------------------------------------------------
## Decision Tree Model Specification (regression)
##
## Main Arguments:
## cost_complexity = 0.0000000001
## tree_depth = 4
## min_n = 11
##
## Computational engine: rpart
#one last fit on the training data
best_tree_fit <- best_tree_wflow %>%
  parsnip::fit(data = train_data)
#plot the tree
rpart.plot::rpart.plot(x = workflowsets::extract_fit_parsnip(best_tree_fit)$fit,
                       roundint = F,
                       type = 5,
                       digits = 5,
                       main = "Selected Tree Model")
#find predictions and residuals
tree_resid <- best_tree_fit %>%
  broom.mixed::augment(new_data = train_data) %>%
  dplyr::select(.pred, total_points) %>%
  dplyr::mutate(.resid = .pred - total_points)

#plot model predictions from tuned model versus actual outcomes
#geom_abline draws a 45 degree line, along which the results should fall
ggplot2::ggplot(tree_resid, aes(x = .pred, y = total_points)) +
  geom_abline(slope = 1, intercept = 0, color = "red", lty = 2) +
  geom_point() +
  labs(title = "Decision Tree Fit: Predicted vs. Actual Total Points",
       x = "Predicted Total Points",
       y = "Actual Total Points")
#plot model with residuals
#the geom_hline plots a straight horizontal line along which the results should fall
ggplot2::ggplot(tree_resid, aes(x = as.numeric(row.names(tree_resid)), y = .resid)) +
  geom_hline(yintercept = 0, color = "red", lty = 2) +
  geom_point() +
  labs(title = "Decision Tree Fit: Residuals",
       x = "Observation Number",
       y = "Residual")
#plot model fit vs residuals
#the geom_hline plots a straight horizontal line along which the results fall
ggplot2::ggplot(tree_resid, aes(x = .pred, y = .resid)) +
  geom_hline(yintercept = 0, color = "red", lty = 2) +
  geom_point() +
  labs(title = "Decision Tree Fit: Residuals vs Fitted Total Points",
       x = "Fitted Total Points",
       y = "Residual")
#print model performance
#print 10 best performing hyperparameter sets
tree_res %>%
  tune::show_best(n = 10) %>%
  dplyr::select(rmse = mean, std_err, cost_complexity) %>%
  dplyr::mutate(rmse = round(rmse, 3),
                std_err = round(std_err, 4),
                cost_complexity = scales::scientific(cost_complexity))
## # A tibble: 10 x 3
## rmse std_err cost_complexity
## <dbl> <dbl> <chr>
## 1 14.5 0.422 1.00e-10
## 2 14.5 0.422 1.78e-08
## 3 14.5 0.422 3.16e-06
## 4 14.5 0.422 5.62e-04
## 5 14.9 0.437 1.00e-10
## 6 14.9 0.437 1.78e-08
## 7 14.9 0.437 3.16e-06
## 8 14.9 0.437 5.62e-04
## 9 14.9 0.427 1.00e-10
## 10 14.9 0.427 1.78e-08
#print the best model performance
tree_performance <- tree_res %>% tune::show_best(n = 1)
print(tree_performance)
## # A tibble: 1 x 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 4 11 rmse standard 14.5 25 0.422
## # ... with 1 more variable: .config <chr>
#compare model performance to null model
tree_RMSE <- tree_res %>%
  tune::show_best(n = 1) %>%
  dplyr::transmute(rmse = round(mean, 3),
                   SE = round(std_err, 4),
                   model = "Tree") %>%
  dplyr::bind_rows(null_RMSE_train)
tree_RMSE
## # A tibble: 2 x 3
## rmse SE model
## <dbl> <dbl> <chr>
## 1 14.5 0.422 Tree
## 2 23.5 0 Null - Train
The identified best performing tree model includes current_rank, fastest_lap_count, current_points, and race_time as key predictor variables. Compared to the null model, this is roughly a 38% reduction in RMSE (see the quick check below), so it's promising. Let's try some other models.
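That reduction figure is just simple arithmetic on the rounded values from the comparison table above:
#relative improvement of the tree model over the null model
round(100 * (1 - 14.5 / 23.5), 1)  #roughly a 38% reduction in RMSE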
Most of the code for this section comes from the TidyModels Tutorial Case Study.
#define the lasso model
#mixture = 1 identifies the model to be a LASSO model
lasso_mod <-
  parsnip::linear_reg(mode = "regression",
                      penalty = tune(),
                      mixture = 1) %>%
  parsnip::set_engine("glmnet")

#use the recipe specified earlier
#define workflow for lasso regression
lasso_wflow <- workflows::workflow() %>%
  workflows::add_model(lasso_mod) %>%
  workflows::add_recipe(marbles_rec)
#tuning grid specification
lasso_grid <- tibble(penalty = 10^seq(-3, 3, length.out = 30))

#5 lowest penalty values
lasso_grid %>%
  dplyr::top_n(-5)
## Selecting by penalty
## # A tibble: 5 x 1
## penalty
## <dbl>
## 1 0.001
## 2 0.00161
## 3 0.00259
## 4 0.00418
## 5 0.00672
#5 highest penalty values
lasso_grid %>%
  dplyr::top_n(5)
## Selecting by penalty
## # A tibble: 5 x 1
## penalty
## <dbl>
## 1 149.
## 2 240.
## 3 386.
## 4 621.
## 5 1000
Tune the model with the `tune_grid()` function.
#tune the model with previously specified cross-validation and RMSE as target metric
lasso_res <- lasso_wflow %>%
  tune::tune_grid(resamples = folds,
                  grid = lasso_grid,
                  control = control_grid(verbose = TRUE,
                                         save_pred = TRUE),
                  metrics = metric_set(rmse))
#look at the 15 models with the lowest RMSEs
top_lasso_models <- lasso_res %>%
  tune::show_best("rmse", n = 15) %>%
  dplyr::arrange(penalty)
top_lasso_models
## # A tibble: 15 x 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.00418 rmse standard 13.3 25 0.436 Preprocessor1_Model04
## 2 0.00672 rmse standard 13.3 25 0.434 Preprocessor1_Model05
## 3 0.0108 rmse standard 13.3 25 0.433 Preprocessor1_Model06
## 4 0.0174 rmse standard 13.3 25 0.430 Preprocessor1_Model07
## 5 0.0281 rmse standard 13.3 25 0.427 Preprocessor1_Model08
## 6 0.0452 rmse standard 13.2 25 0.424 Preprocessor1_Model09
## 7 0.0728 rmse standard 13.2 25 0.420 Preprocessor1_Model10
## 8 0.117 rmse standard 13.1 25 0.418 Preprocessor1_Model11
## 9 0.189 rmse standard 13.1 25 0.418 Preprocessor1_Model12
## 10 0.304 rmse standard 13.0 25 0.418 Preprocessor1_Model13
## 11 0.489 rmse standard 13.0 25 0.416 Preprocessor1_Model14
## 12 0.788 rmse standard 13.0 25 0.406 Preprocessor1_Model15
## 13 1.27 rmse standard 13.0 25 0.387 Preprocessor1_Model16
## 14 2.04 rmse standard 13.0 25 0.373 Preprocessor1_Model17
## 15 3.29 rmse standard 13.2 25 0.375 Preprocessor1_Model18
#default visualization
lasso_res %>% autoplot()

#create a graph to see when there is a significant change in the penalty
#this is the same as above, just a little more detail
lasso_res %>%
  workflowsets::collect_metrics() %>%
  ggplot2::ggplot(aes(penalty, mean, color = .metric)) +
  ggplot2::geom_errorbar(aes(ymin = mean - std_err,
                             ymax = mean + std_err),
                         alpha = 0.5) +
  ggplot2::geom_line(size = 1.5) +
  ggplot2::scale_x_log10()
#select the lasso model with the lowest rmse
lasso_lowest_rmse <- lasso_res %>%
  tune::select_best("rmse")

#finalize the workflow by using the selected lasso model
best_lasso_wflow <- lasso_wflow %>%
  tune::finalize_workflow(lasso_lowest_rmse)
best_lasso_wflow
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: linear_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 1 Recipe Step
##
## * step_dummy()
##
## -- Model -----------------------------------------------------------------------
## Linear Regression Model Specification (regression)
##
## Main Arguments:
## penalty = 2.04335971785694
## mixture = 1
##
## Computational engine: glmnet
#one last fit on the training data
best_lasso_fit <- best_lasso_wflow %>%
  parsnip::fit(data = train_data)

#create a table of model fit that includes the predictors in the model
#i.e. include all non-zero estimates
lasso_tibble <- best_lasso_fit %>%
  workflowsets::extract_fit_parsnip() %>%
  broom::tidy() %>%
  dplyr::filter(estimate != 0) %>%
  dplyr::mutate_if(is.numeric, round, 4)
lasso_tibble
## # A tibble: 5 x 3
## term estimate penalty
## <chr> <dbl> <dbl>
## 1 (Intercept) 66.9 2.04
## 2 current_points 0.112 2.04
## 3 current_rank -2.68 2.04
## 4 fastest_lap_count 10.2 2.04
## 5 host_Yes 1.89 2.04
#extract model from final fit
x_lasso <- best_lasso_fit$fit$fit$fit

#plot how the number of predictors included in the LASSO model changes with the tuning parameter
plot(x_lasso, "lambda")
#the larger the regularization penalty, the fewer the predictors in the model
#find predictions and residuals
lasso_resid <- best_lasso_fit %>%
  broom.mixed::augment(new_data = train_data) %>%
  dplyr::select(.pred, total_points) %>%
  dplyr::mutate(.resid = .pred - total_points)

#plot model predictions from tuned model versus actual outcomes
#geom_abline plots a 45 degree line along which the results should fall
ggplot2::ggplot(lasso_resid, aes(x = .pred, y = total_points)) +
  geom_abline(slope = 1, intercept = 0, color = "red", lty = 2) +
  geom_point() +
  labs(title = "LASSO Fit: Predicted vs. Actual Total Points",
       x = "Predicted Total Points",
       y = "Actual Total Points")
#plot model with residuals
#the geom_hline plots a straight horizontal line along which the results should fall
ggplot2::ggplot(lasso_resid, aes(x = as.numeric(row.names(lasso_resid)), y = .resid)) +
  geom_hline(yintercept = 0, color = "red", lty = 2) +
  geom_point() +
  labs(title = "LASSO Fit: Residuals",
       x = "Observation Number",
       y = "Residual")
#plot model fit vs residuals
#the geom_hline plots a straight horizontal line along which the results fall
ggplot2::ggplot(lasso_resid, aes(x = .pred, y = .resid)) +
  geom_hline(yintercept = 0, color = "red", lty = 2) +
  geom_point() +
  labs(title = "LASSO Fit: Residuals vs Fitted Total Points",
       x = "Fitted Total Points",
       y = "Residual")
#print the 10 best performing hyperparameter sets
lasso_res %>%
  tune::show_best(n = 10) %>%
  dplyr::select(rmse = mean, std_err, penalty) %>%
  dplyr::mutate(rmse = round(rmse, 3),
                std_err = round(std_err, 4),
                `log penalty` = round(log(penalty), 3),
                .keep = "unused")
## # A tibble: 10 x 3
## rmse std_err `log penalty`
## <dbl> <dbl> <dbl>
## 1 13.0 0.373 0.715
## 2 13.0 0.416 -0.715
## 3 13.0 0.418 -1.19
## 4 13.0 0.406 -0.238
## 5 13.0 0.387 0.238
## 6 13.1 0.418 -1.67
## 7 13.1 0.418 -2.14
## 8 13.2 0.420 -2.62
## 9 13.2 0.376 1.19
## 10 13.2 0.424 -3.10
#print best model performance
lasso_performance <- lasso_res %>% tune::show_best(n = 1)
lasso_performance
## # A tibble: 1 x 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2.04 rmse standard 13.0 25 0.373 Preprocessor1_Model17
#compare model performance to null model and tree model
lasso_RMSE <- lasso_res %>%
  tune::show_best(n = 1) %>%
  dplyr::transmute(rmse = round(mean, 3),
                   SE = round(std_err, 4),
                   model = "LASSO") %>%
  dplyr::bind_rows(tree_RMSE)
lasso_RMSE
## # A tibble: 3 x 3
## rmse SE model
## <dbl> <dbl> <chr>
## 1 13.0 0.373 LASSO
## 2 14.5 0.422 Tree
## 3 23.5 0 Null - Train
In examining the results of the model, there is an improvement of the target metric (RMSE) under the LASSO model in comparison to the Decision Tree and Null models. However, the residual plots and observed vs fitted plots suggest the fit still isn’t ideal. Time to try another model!
Most of the code for this section comes from the TidyModels Tutorial Case Study.
#run parallels to determine number of cores
cores <- parallel::detectCores() - 1
cores
## [1] 19
cl <- makeCluster(cores)
registerDoParallel(cl)
#define the RF model
RF_mod <-
  parsnip::rand_forest(mtry = tune(),
                       min_n = tune(),
                       trees = tune()) %>%
  parsnip::set_engine("ranger",
                      importance = "permutation") %>%
  parsnip::set_mode("regression")

#use the recipe specified earlier
#check to make sure the identified parameters will be tuned
RF_mod %>% tune::parameters()
## Collection of 3 parameters for tuning
##
## identifier type object
## mtry mtry nparam[?]
## trees trees nparam[+]
## min_n min_n nparam[+]
##
## Model parameters needing finalization:
## # Randomly Selected Predictors ('mtry')
##
## See `?dials::finalize` or `?dials::update.parameters` for more information.
#define workflow for RF regression
RF_wflow <- workflows::workflow() %>%
  workflows::add_model(RF_mod) %>%
  workflows::add_recipe(marbles_rec)

#tuning grid specification
RF_grid <- expand.grid(mtry = c(3, 4, 5, 6),
                       min_n = c(40, 50, 60),
                       trees = c(500, 1000))
Tune the model with the `tune_grid()` function.
#tune the model with previously specified cross-validation and RMSE as target metric
RF_res <- RF_wflow %>%
  tune::tune_grid(resamples = folds,
                  grid = RF_grid,
                  control = control_grid(verbose = TRUE, save_pred = TRUE),
                  metrics = metric_set(rmse))
#look at the top 5 RF models
top_RF_models <- RF_res %>%
  tune::show_best("rmse", n = 5)
top_RF_models
## # A tibble: 5 x 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 6 1000 40 rmse standard 14.5 25 0.410 Preprocessor1_Model16
## 2 6 500 40 rmse standard 14.5 25 0.412 Preprocessor1_Model04
## 3 6 1000 50 rmse standard 15.0 25 0.421 Preprocessor1_Model20
## 4 5 1000 40 rmse standard 15.0 25 0.419 Preprocessor1_Model15
## 5 6 500 50 rmse standard 15.0 25 0.409 Preprocessor1_Model08
#default visualization
RF_res %>% autoplot()
#in future analyses, might be worth more tuning with a higher minimum node size
#select the RF model with the lowest rmse
RF_lowest_rmse <- RF_res %>%
  tune::select_best("rmse")

#finalize the workflow by using the selected RF model
best_RF_wflow <- RF_wflow %>%
  tune::finalize_workflow(RF_lowest_rmse)
best_RF_wflow
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: rand_forest()
##
## -- Preprocessor ----------------------------------------------------------------
## 1 Recipe Step
##
## * step_dummy()
##
## -- Model -----------------------------------------------------------------------
## Random Forest Model Specification (regression)
##
## Main Arguments:
## mtry = 6
## trees = 1000
## min_n = 40
##
## Engine-Specific Arguments:
## importance = permutation
##
## Computational engine: ranger
#one last fit on the training data
best_RF_fit <- best_RF_wflow %>%
  parsnip::fit(data = train_data)

#extract model from final fit
x_RF <- best_RF_fit$fit$fit$fit

#plot the most important predictors in the model
vip::vip(x_RF, num_features = 20)
#find predictions and residuals
RF_resid <- best_RF_fit %>%
  broom.mixed::augment(new_data = train_data) %>%
  dplyr::select(.pred, total_points) %>%
  dplyr::mutate(.resid = .pred - total_points)

#plot model predictions from tuned model versus actual outcomes
#geom_abline is a 45 degree line, along which the results should fall
ggplot2::ggplot(RF_resid, aes(x = .pred, y = total_points)) +
  geom_abline(slope = 1, intercept = 0, color = "red", lty = 2) +
  geom_point() +
  labs(title = "RF Fit: Actual vs. Predicted Total Points",
       x = "Predicted Total Points",
       y = "Actual Total Points")
#plot model with residuals
#the geom_hline plots a straight horizontal line along which the results should fall
ggplot2::ggplot(RF_resid, aes(x = as.numeric(row.names(RF_resid)), y = .resid)) +
  geom_hline(yintercept = 0, color = "red", lty = 2) +
  geom_point() +
  labs(title = "RF Fit: Residuals",
       x = "Observation Number",
       y = "Residual")
#plot model fit vs residuals
#the geom_hline plots a straight horizontal line along which the results fall
ggplot2::ggplot(RF_resid, aes(x = .pred, y = .resid)) +
  geom_hline(yintercept = 0, color = "red", lty = 2) +
  geom_point() +
  labs(title = "RF Fit: Residuals vs Fitted Total Points",
       x = "Fitted Total Points",
       y = "Residual")
#print the 10 best performing hyperparameter sets
RF_res %>%
  tune::show_best(n = 10) %>%
  dplyr::select(rmse = mean, std_err) %>%
  dplyr::mutate(rmse = round(rmse, 3),
                std_err = round(std_err, 4),
                .keep = "unused")
## # A tibble: 10 x 2
## rmse std_err
## <dbl> <dbl>
## 1 14.5 0.41
## 2 14.5 0.412
## 3 15.0 0.422
## 4 15.0 0.419
## 5 15.0 0.409
## 6 15.1 0.423
## 7 15.6 0.420
## 8 15.7 0.441
## 9 15.8 0.436
## 10 15.9 0.437
#print the best model
RF_performance <- RF_res %>% tune::show_best(n = 1)
RF_performance
## # A tibble: 1 x 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 6 1000 40 rmse standard 14.5 25 0.410 Preprocessor1_Model16
#compare model performance to null model (and other models)
RF_RMSE <- RF_res %>%
  tune::show_best(n = 1) %>%
  dplyr::transmute(rmse = round(mean, 3),
                   SE = round(std_err, 4),
                   model = "RF") %>%
  dplyr::bind_rows(lasso_RMSE)
RF_RMSE
## # A tibble: 4 x 3
## rmse SE model
## <dbl> <dbl> <chr>
## 1 14.5 0.41 RF
## 2 13.0 0.373 LASSO
## 3 14.5 0.422 Tree
## 4 23.5 0 Null - Train
In examining the results of the RF model within the context of RMSE, it does not perform better than the LASSO, but its performance is approximately comparable to the decision tree model. By far, the most important variables are fastest_lap_count, current_rank, current_points, and points obtained during the race.
#run parallels to determine number of cores
cores <- parallel::detectCores() - 1
cores
## [1] 19
cl <- makeCluster(cores)
registerDoParallel(cl)
#start over with random seed for bootstrapping
#set random seed
set.seed(20140102)
#use bootstrap instead of CV
data_bootstrap <- train_data %>%
  rsample::bootstraps(times = 10,
                      strata = total_points)
#define model
bag_tree_mod <- baguette::bag_tree(cost_complexity = tune(),
                                   tree_depth = tune(),
                                   min_n = tune()) %>%
  parsnip::set_engine("rpart",
                      times = 10) %>%
  parsnip::set_mode("regression")

#create recipe that converts string variables to factors
BT_rec <- recipes::recipe(total_points ~ ., data = train_data) %>%
  step_string2factor(all_nominal())
#define workflow for tree
BT_wflow <- workflows::workflow() %>%
  workflows::add_model(bag_tree_mod) %>%
  workflows::add_recipe(BT_rec)

#tuning grid specification
tree_grid <- dials::grid_regular(cost_complexity(),
                                 tree_depth(),
                                 min_n(),
                                 levels = 5)
Tune the model with the `tune_grid()` function.
#tune the model with the bootstrap resamples and RMSE as target metric
BT_res <- baguette::bag_tree(cost_complexity = tune(),
                             tree_depth = tune(),
                             min_n = tune()) %>%
  parsnip::set_engine("rpart",
                      times = 10) %>%
  parsnip::set_mode("regression") %>%
  tune::tune_grid(preprocessor = BT_rec,
                  resamples = data_bootstrap,
                  grid = tree_grid,
                  metrics = metric_set(rmse))
#collect metrics
BT_res %>% workflowsets::collect_metrics()
## # A tibble: 125 x 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 1 2 rmse standard 15.5 10 0.409
## 2 0.0000000178 1 2 rmse standard 15.5 10 0.377
## 3 0.00000316 1 2 rmse standard 15.6 10 0.393
## 4 0.000562 1 2 rmse standard 15.6 10 0.351
## 5 0.1 1 2 rmse standard 15.5 10 0.301
## 6 0.0000000001 4 2 rmse standard 15.2 10 0.726
## 7 0.0000000178 4 2 rmse standard 15.0 10 0.508
## 8 0.00000316 4 2 rmse standard 14.8 10 0.659
## 9 0.000562 4 2 rmse standard 15.1 10 0.541
## 10 0.1 4 2 rmse standard 14.9 10 0.395
## # ... with 115 more rows, and 1 more variable: .config <chr>
#default visualization
BT_auto_plot <- BT_res %>% autoplot()
BT_auto_plot
#more detailed plot
BT_det_plot <- BT_res %>%
  workflowsets::collect_metrics() %>%
  dplyr::mutate(tree_depth = factor(tree_depth)) %>%
  ggplot2::ggplot(aes(cost_complexity, mean, color = tree_depth)) +
  geom_line(size = 1.5, alpha = 0.6) +
  geom_point(size = 2) +
  facet_wrap(~ .metric, scales = "free", nrow = 2) +
  scale_x_log10(labels = scales::label_number()) +
  scale_color_viridis_d(option = "plasma", begin = 0.9, end = 0)
BT_det_plot
#select the BT model with the lowest rmse
BT_lowest_rmse <- BT_res %>%
  tune::select_best("rmse")

#finalize the workflow by using the selected BT model
best_BT_wflow <- BT_wflow %>%
  tune::finalize_workflow(BT_lowest_rmse)
best_BT_wflow
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: bag_tree()
##
## -- Preprocessor ----------------------------------------------------------------
## 1 Recipe Step
##
## * step_string2factor()
##
## -- Model -----------------------------------------------------------------------
## Bagged Decision Tree Model Specification (regression)
##
## Main Arguments:
## cost_complexity = 0.0000000001
## tree_depth = 4
## min_n = 21
##
## Engine-Specific Arguments:
## times = 10
##
## Computational engine: rpart
#one last fit on the training data
best_BT_fit <- best_BT_wflow %>%
  parsnip::fit(data = train_data)

#extract model from final fit
x_BT <- best_BT_fit$fit$fit$fit

#vip package doesn't support BT
#find predictions and residuals
BT_resid <- best_BT_fit %>%
  broom.mixed::augment(new_data = train_data) %>%
  dplyr::select(.pred, total_points) %>%
  dplyr::mutate(.resid = .pred - total_points)

#plot model predictions from tuned model versus actual outcomes
#geom_abline draws a 45 degree line, along which the results should fall
BT_pred_act <- ggplot2::ggplot(BT_resid, aes(x = .pred, y = total_points)) +
  geom_abline(slope = 1, intercept = 0, color = "red", lty = 2) +
  geom_point() +
  labs(title = "Bagged Tree Fit: Predicted vs. Actual Total Points",
       x = "Predicted Total Points",
       y = "Actual Total Points")
BT_pred_act
#plot model with residuals
#the geom_hline plots a straight horizontal line along which the results should fall
BT_resids <- ggplot2::ggplot(BT_resid, aes(x = as.numeric(row.names(BT_resid)), y = .resid)) +
  geom_hline(yintercept = 0, color = "red", lty = 2) +
  geom_point() +
  labs(title = "Bagged Tree Fit: Residuals",
       x = "Observation Number",
       y = "Residual")
BT_resids
#plot model fit vs residuals
#the geom_hline plots a straight horizontal line along which the results fall
BT_pred_resid <- ggplot2::ggplot(BT_resid, aes(x = .pred, y = .resid)) +
  geom_hline(yintercept = 0, color = "red", lty = 2) +
  geom_point() +
  labs(title = "Bagged Tree Fit: Residuals vs Fitted Total Points",
       x = "Predicted Total Points",
       y = "Residual")
BT_pred_resid
#print model performance
#print 10 best performing hyperparameter sets
BT_res %>%
  tune::show_best(n = 10) %>%
  dplyr::select(rmse = mean, std_err, cost_complexity) %>%
  dplyr::mutate(rmse = round(rmse, 3),
                std_err = round(std_err, 4),
                cost_complexity = scales::scientific(cost_complexity))
## # A tibble: 10 x 3
## rmse std_err cost_complexity
## <dbl> <dbl> <chr>
## 1 14.2 0.678 1.00e-10
## 2 14.3 0.746 1.00e-10
## 3 14.3 0.652 3.16e-06
## 4 14.4 0.499 3.16e-06
## 5 14.4 0.666 1.78e-08
## 6 14.4 0.522 1.00e-10
## 7 14.4 0.605 1.78e-08
## 8 14.4 0.666 5.62e-04
## 9 14.4 0.676 5.62e-04
## 10 14.4 0.666 1.00e-10
#print the best model performance
BT_performance <- BT_res %>% tune::show_best(n = 1)
print(BT_performance)
## # A tibble: 1 x 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 4 21 rmse standard 14.2 10 0.678
## # ... with 1 more variable: .config <chr>
#compare model performance to null model
BT_RMSE <- BT_res %>%
  tune::show_best(n = 1) %>%
  dplyr::transmute(rmse = round(mean, 3),
                   SE = round(std_err, 4),
                   model = "Bagged Tree") %>%
  dplyr::bind_rows(RF_RMSE)
BT_RMSE
## # A tibble: 5 x 3
## rmse SE model
## <dbl> <dbl> <chr>
## 1 14.2 0.678 Bagged Tree
## 2 14.5 0.41 RF
## 3 13.0 0.373 LASSO
## 4 14.5 0.422 Tree
## 5 23.5 0 Null - Train
Based on the RMSE output above, the LASSO model has the lowest RMSE and therefore is the most appropriate model in this case. The RF and Tree models are virtually identical in their performance, and all of the models are an improvement over the null model. It is worth noting that none of these models fits the data especially well, suggesting the predictor variables included in this dataset aren't all that useful in predicting the desired outcome (total team points at the end of the season).
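Since all of the candidate models end up in the `BT_RMSE` comparison tibble, the winner can also be pulled out programmatically rather than by eyeballing the table (a minimal sketch):
#select the row with the lowest cross-validated RMSE
BT_RMSE %>%
  dplyr::slice_min(rmse, n = 1)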
#fit the lasso model to the training set but evaluate on the test data
lasso_fit_test <- best_lasso_wflow %>%
  tune::last_fit(split = data_split)

#compare test performance against training performance
lasso_rmse_test <- collect_metrics(lasso_fit_test) %>%
  dplyr::select(rmse = .estimate) %>%
  dplyr::mutate(data = "test")

lasso_comp <- lasso_RMSE %>%
  dplyr::filter(model == "LASSO") %>%
  dplyr::transmute(rmse, data = "train") %>%
  dplyr::bind_rows(lasso_rmse_test) %>%
  dplyr::slice(-3) #drop the third row, which is the R-squared estimate returned by collect_metrics()
lasso_comp
## # A tibble: 2 x 2
## rmse data
## <dbl> <chr>
## 1 13.0 train
## 2 13.7 test
#training and test RMSEs are very close --> what we want (suggests we likely avoided overfitting)
#find predictions and residuals
lasso_resid_fit <- lasso_fit_test %>%
  broom.mixed::augment() %>%
  dplyr::select(.pred, total_points) %>%
  dplyr::mutate(.resid = .pred - total_points)
## Adding missing grouping variables: `race_number`
#plot model predictions from tuned model versus actual outcomes
#geom_abline plots a 45 degree line, along which the results should fall
ggplot2::ggplot(lasso_resid_fit, aes(x = .pred, y = total_points)) +
  geom_abline(slope = 1, intercept = 0, color = "red", lty = 2) +
  geom_point() +
  labs(title = "LASSO Test Fit: Predicted vs. Actual Total Points",
       x = "Predicted Total Points",
       y = "Actual Total Points")
#plot model with residuals
#the geom_hline plots a straight horizontal line along which the results should fall
ggplot2::ggplot(lasso_resid_fit, aes(x = as.numeric(row.names(lasso_resid_fit)), y = .resid)) +
  geom_hline(yintercept = 0, color = "red", lty = 2) +
  geom_point() +
  labs(title = "Lasso Test Fit: Residuals",
       x = "Observation Number",
       y = "Residual")
#plot model fit vs residuals
#the geom_hline plots a straight horizontal line along which the results fall
ggplot2::ggplot(lasso_resid_fit, aes(x = .pred, y = .resid)) +
  geom_hline(yintercept = 0, color = "red", lty = 2) +
  geom_point() +
  labs(title = "LASSO Test Fit: Residuals vs Fitted Total Points",
       x = "Fitted Total Points",
       y = "Residual")
The LASSO model fits the data best: evaluated on both the training and testing data, it performs better than the other candidate models. There are certainly plenty of opportunities for further exploration or tuning. There is also an underlying time-series component to this dataset, so standard cross-validation may not be the most appropriate resampling scheme. I would be interested to come back to this analysis after better understanding time series methods.
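If I were to revisit this with the time ordering in mind, one option would be to swap the repeated CV folds for a rolling-origin (time-ordered) resampling scheme from rsample. The sketch below is only illustrative; the `initial` and `assess` sizes are assumed values, not something tuned for this dataset:
#a possible time-aware alternative to vfold_cv (illustrative only)
time_folds <- rsample::rolling_origin(
  train_data %>% dplyr::arrange(race_number),
  initial = 48,      #assumed: roughly the first few races' worth of rows
  assess = 16,       #assumed: about one race's worth of rows per assessment set
  cumulative = TRUE)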