Machine Learning

Load packages

library(dotwhisker)#for visualizing regression results

Warning: package 'dotwhisker' was built under R version 4.2.2

Loading required package: ggplot2

Warning: package 'ggplot2' was built under R version 4.2.2

library(skimr) #for visualization

Warning: package 'skimr' was built under R version 4.2.2

library(here) #data loading/saving

here() starts at C:/Data/GitHub/MADA23/betelihemgetachew-MADA-portfolio2

library(dplyr)#data cleaning and processing

Warning: package 'dplyr' was built under R version 4.2.2


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(tidyr) #data cleaning and processing
library(tidymodels) #for modeling

Warning: package 'tidymodels' was built under R version 4.2.2

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.1     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tibble       3.1.8
✔ infer        1.0.4     ✔ tune         1.0.1
✔ modeldata    1.1.0     ✔ workflows    1.1.3
✔ parsnip      1.0.4     ✔ workflowsets 1.0.0
✔ purrr        0.3.4     ✔ yardstick    1.1.0
✔ recipes      1.0.5

Warning: package 'dials' was built under R version 4.2.2

Warning: package 'infer' was built under R version 4.2.2

Warning: package 'modeldata' was built under R version 4.2.2

Warning: package 'parsnip' was built under R version 4.2.2

Warning: package 'recipes' was built under R version 4.2.2

Warning: package 'rsample' was built under R version 4.2.2

Warning: package 'tune' was built under R version 4.2.2

Warning: package 'workflows' was built under R version 4.2.2

Warning: package 'workflowsets' was built under R version 4.2.2

Warning: package 'yardstick' was built under R version 4.2.2

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()
• Learn how to get started at https://www.tidymodels.org/start/

library(gmodels)#for tables

Warning: package 'gmodels' was built under R version 4.2.2

library(ggplot2) #for histograms and charts
library(performance)


Attaching package: 'performance'

The following objects are masked from 'package:yardstick':

    mae, rmse

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.2.2

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──

✔ readr   2.1.2     ✔ forcats 0.5.1
✔ stringr 1.5.0

Warning: package 'stringr' was built under R version 4.2.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ readr::col_factor() masks scales::col_factor()
✖ purrr::discard()    masks scales::discard()
✖ dplyr::filter()     masks stats::filter()
✖ stringr::fixed()    masks recipes::fixed()
✖ dplyr::lag()        masks stats::lag()
✖ readr::spec()       masks yardstick::spec()

library(yardstick)
library(dotwhisker) #for visualizing regression results
library(broom.mixed)#for converting bayesian models to tidy tibbles

Warning: package 'broom.mixed' was built under R version 4.2.2

library(readr)
library(vip) #for variable importance plots

Warning: package 'vip' was built under R version 4.2.2


Attaching package: 'vip'

The following object is masked from 'package:utils':

    vi

#library(rpart.plot)#for visualizing decision tree...note: WAS NOT ABLE TO INSTALL PACKAGE 
#library(glmnet)  WAS NOT ABLE TO INSTALL PACKAGE 
#library(ranger) WAS NOT ABLE TO INSTALL PACKAGE 
library(caret)

Warning: package 'caret' was built under R version 4.2.3

Loading required package: lattice

Attaching package: 'caret'

The following objects are masked from 'package:yardstick':

    precision, recall, sensitivity, specificity

The following object is masked from 'package:purrr':

    lift

#install.packages("caret") #run once in the console, not in the rendered document; vip and caret are already loaded above

#Load most recent version of the dataset

data_location <- here::here("fluanalysis","processed_data","Processed_data.RDS")

NewClean_df<- readRDS(data_location)

#Check the variables to make sure none of them duplicate each other. Duplicated variables are strongly correlated, lead to poor model performance, and are a source of the warning “Predictions from a rank deficient fit may be misleading”, so they need to be removed.

view(NewClean_df)

After reviewing the dataset, four variables were identified as duplicates and need to be removed.

NewClean_df <- subset(NewClean_df, select=-c(WeaknessYN,CoughYN,CoughYN2,MyalgiaYN))

#check that the four variables were removed

view(NewClean_df)
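The visual check above can also be done programmatically. The following is a minimal base R sketch on a small toy data frame. The values are made up, and `is_determined_by()` is a hypothetical helper, not a function from any package. It flags columns that are exact copies of an earlier column and columns whose values are fully determined by another column.

```r
# Toy data standing in for the flu data (hypothetical values)
toy <- data.frame(
  Weakness   = factor(c("None", "Mild", "Severe", "None", "Moderate")),
  WeaknessYN = factor(c("No", "Yes", "Yes", "No", "Yes")),
  CoughYN    = factor(c("Yes", "Yes", "No", "No", "Yes")),
  CoughYN2   = factor(c("Yes", "Yes", "No", "No", "Yes"))
)

# Exact duplicates: columns that are identical to an earlier column
dup_cols <- names(toy)[duplicated(as.list(toy))]
dup_cols

# Derived columns: y is fully determined by x when every level of x
# co-occurs with at most one level of y
is_determined_by <- function(y, x) {
  tab <- table(x, y)
  all(rowSums(tab > 0) <= 1)
}

is_determined_by(toy$WeaknessYN, toy$Weakness)
is_determined_by(toy$CoughYN, toy$Weakness)
```

A column flagged by either check adds no new information for the model and is a candidate for removal.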

#Remove unbalanced predictors that are not helpful for fitting or predicting the outcome. The plan is to remove all binary predictors that have fewer than 50 observations in either category. Let's look at the summary so we can identify them manually.

summary(NewClean_df)

 SwollenLymphNodes ChestCongestion ChillsSweats NasalCongestion Sneeze   
 No :418           No :323         No :130      No :167         No :339  
 Yes:312           Yes:407         Yes:600      Yes:563         Yes:391  
                                                                         
                                                                         
                                                                         
                                                                         
 Fatigue   SubjectiveFever Headache      Weakness    CoughIntensity
 No : 64   No :230         No :115   None    : 49   None    : 47   
 Yes:666   Yes:500         Yes:615   Mild    :223   Mild    :154   
                                     Moderate:338   Moderate:357   
                                     Severe  :120   Severe  :172   
                                                                   
                                                                   
     Myalgia    RunnyNose AbPain    ChestPain Diarrhea  EyePn     Insomnia 
 None    : 79   No :211   No :639   No :497   No :631   No :617   No :315  
 Mild    :213   Yes:519   Yes: 91   Yes:233   Yes: 99   Yes:113   Yes:415  
 Moderate:325                                                              
 Severe  :113                                                              
                                                                           
                                                                           
 ItchyEye  Nausea    EarPn     Hearing   Pharyngitis Breathless ToothPn  
 No :551   No :475   No :568   No :700   No :119     No :436    No :565  
 Yes:179   Yes:255   Yes:162   Yes: 30   Yes:611     Yes:294    Yes:165  
                                                                         
                                                                         
                                                                         
                                                                         
 Vision    Vomit     Wheeze       BodyTemp     
 No :711   No :652   No :510   Min.   : 97.20  
 Yes: 19   Yes: 78   Yes:220   1st Qu.: 98.20  
                               Median : 98.50  
                               Mean   : 98.94  
                               3rd Qu.: 99.30  
                               Max.   :103.10

#The summary shows that two variables, Hearing and Vision, have fewer than 50 observations in one of their two categories (30 and 19 "Yes" responses, respectively), so they need to be removed.

NewClean_df <- subset(NewClean_df, select=-c(Hearing,Vision))

#now let's check that the two variables were removed

dim(NewClean_df)

[1] 730  26

#We now have 730 observations and 26 variables.
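The manual scan of `summary()` above could also be automated. Below is a minimal base R sketch, assuming the rule is "a two-level factor with fewer than 50 observations in either level". The toy data reuses a few counts from the summary table, the `BodyTemp` values are placeholders, and `is_sparse_binary()` is a hypothetical helper.

```r
# Toy data built from counts shown in the summary (BodyTemp is a placeholder)
toy <- data.frame(
  Hearing  = factor(rep(c("No", "Yes"), times = c(700, 30))),
  Nausea   = factor(rep(c("No", "Yes"), times = c(475, 255))),
  Weakness = factor(rep(c("None", "Mild", "Moderate", "Severe"),
                        times = c(49, 223, 338, 120))),
  BodyTemp = seq(97, 103, length.out = 730)
)

# TRUE only for two-level factors whose rarer level has fewer than min_n rows
is_sparse_binary <- function(x, min_n = 50) {
  is.factor(x) && nlevels(x) == 2 && min(table(x)) < min_n
}

sparse <- names(toy)[vapply(toy, is_sparse_binary, logical(1))]
sparse

# Drop the flagged predictors
toy_reduced <- toy[, setdiff(names(toy), sparse), drop = FALSE]
dim(toy_reduced)
```

Note that `Weakness` is not flagged even though one of its levels has 49 observations, because the rule only applies to binary predictors.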

Analysis Code

Data Setup

#Split the data, using 70% for training and 30% for testing. Use the outcome BodyTemp for stratification so that the outcome is balanced across the training and testing datasets.

set.seed(123)
#the argument must be lowercase `strata`; R argument names are case-sensitive
data_split <- initial_split(NewClean_df, prop = 0.7, strata = BodyTemp)

#extract the training and testing datasets from the split

train_data <- training(data_split)
test_data <- testing(data_split)
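To illustrate what stratifying on a numeric outcome does, here is a rough base R sketch. It is not the rsample implementation. It assumes the outcome is binned into quartiles (rsample also bins numeric strata) and that 70% of the rows are sampled within each bin. The `body_temp` values are simulated placeholders.

```r
set.seed(123)

# Hypothetical outcome values standing in for BodyTemp
body_temp <- rnorm(730, mean = 98.9, sd = 1.2)

# Bin the numeric outcome into quartiles
bins <- cut(body_temp,
            breaks = quantile(body_temp, probs = seq(0, 1, by = 0.25)),
            include.lowest = TRUE)

# Sample 70% of the row indices within each bin
train_idx <- unlist(lapply(split(seq_along(body_temp), bins), function(idx) {
  idx[sample.int(length(idx), size = floor(0.7 * length(idx)))]
}), use.names = FALSE)
test_idx <- setdiff(seq_along(body_temp), train_idx)

# The outcome distribution should look similar in both sets
summary(body_temp[train_idx])
summary(body_temp[test_idx])
```

Because every quartile contributes the same share of rows, the training and testing sets end up with similar outcome distributions.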

#take the training dataset and divide it into 5 folds using the vfold_cv() function, stratified by BodyTemp

folds<-vfold_cv(data=train_data, v=5, strata=BodyTemp)

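#create a recipe for the data and fitting

fluCV_rec <-
  recipe(BodyTemp ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors()) %>% #convert factors to dummy variables
  step_zv(all_predictors()) %>%            #drop predictors with zero variance
  step_normalize(all_predictors())         #center and scale all predictors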
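For intuition about what the three recipe steps do to the predictors, the following base R sketch performs roughly equivalent operations on a toy data frame. This is an analogy under stated assumptions, not the recipes implementation: the data values are made up, and the column names produced by `model.matrix()` differ from those produced by `step_dummy()`.

```r
# Toy data (hypothetical values); Vision never takes the value "Yes" here
toy <- data.frame(
  Nausea   = factor(c("No", "Yes", "No", "Yes")),
  Weakness = factor(c("Mild", "Severe", "Mild", "None")),
  Vision   = factor(rep("No", 4), levels = c("No", "Yes")),
  BodyTemp = c(98.2, 99.1, 98.5, 100.3)
)

# 1. Dummy-code the factors (drop the intercept column)
X <- model.matrix(BodyTemp ~ ., data = toy)[, -1, drop = FALSE]

# 2. Drop columns with zero variance
X <- X[, apply(X, 2, var) > 0, drop = FALSE]

# 3. Center and scale the remaining columns
X_scaled <- scale(X)

colnames(X_scaled)
round(colMeans(X_scaled), 10)
```

The dummy column for `Vision` is all zeros, so it is dropped in step 2, and every remaining column ends up with mean 0 and standard deviation 1.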

#Create the null model

#The recipe for the null model did not work when rendering, so it is left commented out. The formula needs an intercept only (BodyTemp ~ 1), not BodyTemp ~ NULL.

#null_rec <- recipe(BodyTemp ~ 1, data = train_data)

#create the model specification for the null model (named null_spec so that it does not overwrite the null_model() function)

null_spec <- null_model() %>%
  set_engine("parsnip") %>%
  set_mode("regression")

#Create the workflow for the null model by combining the null recipe and the null model. This also did not work when rendering.

#null_wf <- workflow() %>%
#  add_model(null_spec) %>%
#  add_recipe(null_rec)

#Fit the null model to the folds using the null workflow. None of my code for the null model would render, even though there was no error while running the chunks.

#null_model_performance <- fit_resamples(null_wf, resamples = folds)

#RMSE: let's get a better view of all 5 evaluation results

#null_model_performance %>%
#  collect_metrics(summarize = FALSE)

#Let's get a better view of the summarized result. This code did not work when rendered.

#null_model_performance %>%
#  collect_metrics(summarize = TRUE)

#The average RMSE for the null model is 1.22

#computing RMSE for both the training and test data using the null model
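Since a null model for regression simply predicts the mean of the training outcome, its cross-validated RMSE can be checked by hand. The sketch below does this in base R on simulated values. The `body_temp` vector is a placeholder for the training outcome, and the fold assignment here is random rather than stratified.

```r
set.seed(123)

# Hypothetical outcome values standing in for train_data$BodyTemp
body_temp <- rnorm(510, mean = 98.9, sd = 1.2)

# Root mean squared error
rmse_vec <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

# Assign each row to one of 5 folds
fold_id <- sample(rep(1:5, length.out = length(body_temp)))

# For each fold: predict the held-out rows with the mean of the other rows
fold_rmse <- vapply(1:5, function(k) {
  analysis   <- body_temp[fold_id != k]
  assessment <- body_temp[fold_id == k]
  rmse_vec(assessment, mean(analysis))
}, numeric(1))

fold_rmse
mean(fold_rmse)
```

The averaged value is close to the standard deviation of the outcome, which is the baseline that any real model has to beat.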

#Model Tuning and Fitting

#Decision tree model

tune_spec <- 
  decision_tree(
    cost_complexity = tune(),
    tree_depth = tune()
  ) %>% 
  set_engine("rpart") %>% 
  set_mode("regression")

tree_grid <- grid_regular(cost_complexity(),
                          tree_depth(),
                          levels = 5)
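The grid above contains every combination of 5 values per parameter, 25 candidates in total. As a rough illustration, the base R sketch below rebuilds the same values, assuming the ranges that appear in the tuning output (cost_complexity from 1e-10 to 0.1 on a log10 scale, tree_depth from 1 to 15, truncated to integers). It is a reconstruction for intuition, not the dials implementation.

```r
# Five values evenly spaced on the log10 scale
cost_complexity_values <- 10^seq(-10, -1, length.out = 5)

# Five values evenly spaced between 1 and 15, truncated to integers
tree_depth_values <- as.integer(seq(1, 15, length.out = 5))

# Every combination of the two parameters
grid <- expand.grid(cost_complexity = cost_complexity_values,
                    tree_depth = tree_depth_values)

signif(cost_complexity_values, 3)
tree_depth_values
nrow(grid)
```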

#Model tuning with a grid

tree_wf <- workflow()%>%
  add_model(tune_spec)%>%
  add_recipe(fluCV_rec)

tree_res <- tree_wf %>%
  tune_grid(resamples = folds, grid=tree_grid)

! Fold1: internal:
  There was 1 warning in `dplyr::summarise()`.
  ℹ In argument: `.estimate = metric_fn(truth = BodyTemp, estimate = .pr...
    = na_rm)`.
  ℹ In group 1: `cost_complexity = 0.1`, `tree_depth = 1`.
  Caused by warning:
  ! A correlation computation is required, but `estimate` is constant an...
  ℹ In argument: `.estimate = metric_fn(truth = BodyTemp, estimate = .pr...
    = na_rm)`.
  ℹ In group 1: `cost_complexity = 0.1`, `tree_depth = 4`.
  Caused by warning:
  ! A correlation computation is required, but `estimate` is constant an...
  ℹ In argument: `.estimate = metric_fn(truth = BodyTemp, estimate = .pr...
    = na_rm)`.
  ℹ In group 1: `cost_complexity = 0.1`, `tree_depth = 8`.
  Caused by warning:
  ! A correlation computation is required, but `estimate` is constant an...
  ℹ In argument: `.estimate = metric_fn(truth = BodyTemp, estimate = .pr...
    = na_rm)`.
  ℹ In group 1: `cost_complexity = 0.1`, `tree_depth = 11`.
  Caused by warning:
  ! A correlation computation is required, but `estimate` is constant an...
  ℹ In argument: `.estimate = metric_fn(truth = BodyTemp, estimate = .pr...
    = na_rm)`.
  ℹ In group 1: `cost_complexity = 0.1`, `tree_depth = 15`.
  Caused by warning:
  ! A correlation computation is required, but `estimate` is constant an...

tree_res %>%
collect_metrics(summarize = TRUE)

# A tibble: 50 × 8
   cost_complexity tree_depth .metric .estimator     mean     n std_err .config 
             <dbl>      <int> <chr>   <chr>         <dbl> <int>   <dbl> <chr>   
 1    0.0000000001          1 rmse    standard     1.19       5  0.0315 Preproc…
 2    0.0000000001          1 rsq     standard     0.0575     5  0.0107 Preproc…
 3    0.0000000178          1 rmse    standard     1.19       5  0.0315 Preproc…
 4    0.0000000178          1 rsq     standard     0.0575     5  0.0107 Preproc…
 5    0.00000316            1 rmse    standard     1.19       5  0.0315 Preproc…
 6    0.00000316            1 rsq     standard     0.0575     5  0.0107 Preproc…
 7    0.000562              1 rmse    standard     1.19       5  0.0315 Preproc…
 8    0.000562              1 rsq     standard     0.0575     5  0.0107 Preproc…
 9    0.1                   1 rmse    standard     1.23       5  0.0290 Preproc…
10    0.1                   1 rsq     standard   NaN          0 NA      Preproc…
# … with 40 more rows
# ℹ Use `print(n = ...)` to see more rows
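The `NaN` for rsq at `cost_complexity = 0.1` and the fold warnings have the same cause: with such a strong penalty the tree is pruned back to a single node and predicts one constant value, so the correlation between truth and prediction is undefined. A small base R sketch with made-up temperatures shows the effect.

```r
# Hypothetical observed values
truth <- c(98.2, 99.1, 98.5, 100.3, 97.9)

# A tree with no splits predicts the same value for every row
constant_pred <- rep(mean(truth), length(truth))

# RMSE is still well defined
rmse_value <- sqrt(mean((truth - constant_pred)^2))
rmse_value

# The correlation is not: the predictions have zero variance
r <- suppressWarnings(cor(truth, constant_pred))
r_squared <- r^2
r_squared
```

This is why the RMSE rows for that penalty are complete while the rsq rows are missing, and why the plot drops five rows.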

#The lowest mean RMSE in the rows above is 1.19; at cost_complexity = 0.1 it is 1.23

tree_res %>%
  collect_metrics() %>%
  mutate(tree_depth = factor(tree_depth)) %>%
  ggplot(aes(cost_complexity, mean, color = tree_depth)) +
  geom_line(linewidth = 1.5, alpha = 0.6) +
  geom_point(size = 2) +
  facet_wrap(~ .metric, scales = "free", nrow = 2) +
  scale_x_log10(labels = scales::label_number()) +
  scale_color_viridis_d(option = "plasma", begin = .9, end = 0)

Warning: Removed 5 rows containing missing values (`geom_line()`).

Warning: Removed 5 rows containing missing values (`geom_point()`).

#show best tree model

tree_res %>%
  show_best(metric = "rmse")

# A tibble: 5 × 8
  cost_complexity tree_depth .metric .estimator  mean     n std_err .config     
            <dbl>      <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>       
1    0.0000000001          1 rmse    standard    1.19     5  0.0315 Preprocesso…
2    0.0000000178          1 rmse    standard    1.19     5  0.0315 Preprocesso…
3    0.00000316            1 rmse    standard    1.19     5  0.0315 Preprocesso…
4    0.000562              1 rmse    standard    1.19     5  0.0315 Preprocesso…
5    0.0000000001          4 rmse    standard    1.22     5  0.0225 Preprocesso…

best_tree <- tree_res %>%
  select_best(metric = "rmse")

best_tree

# A tibble: 1 × 3
  cost_complexity tree_depth .config              
            <dbl>      <int> <chr>                
1    0.0000000001          1 Preprocessor1_Model01

#Finalizing the model after tuning

final_wf <- tree_wf %>%
finalize_workflow (best_tree)

final_wf

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (regression)

Main Arguments:
  cost_complexity = 1e-10
  tree_depth = 1

Computational engine: rpart

#Fit the finalized workflow to the training data

final_fit <- final_wf%>%
  fit(train_data)

Fit LASSO Model

#I kept running into errors and deleted a lot of code in order to be able to render. I was not able to install rpart.plot, ranger, or glmnet.

#Conduct LASSO to improve model prediction by reducing the risk of overfitting the model to the training data; it will also help us select the most important predictor variables.

#Create a vector of potential lambda values (these are the tuning parameter values)
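To see why LASSO can select variables, consider the special case of standardized, uncorrelated predictors. There the LASSO estimate is the least-squares coefficient shrunk toward zero by the penalty and set to exactly zero when it is smaller than the penalty (soft thresholding). The sketch below illustrates this with made-up coefficients and an illustrative penalty of 0.05. It is a worked special case, not what glmnet does in general, since glmnet handles correlated predictors with coordinate descent.

```r
# Soft-thresholding operator: shrink toward zero, cut off at zero
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Hypothetical least-squares coefficients for standardized predictors
ols_coefs <- c(Fever = 0.30, Cough = 0.08, Sneeze = -0.05, Nausea = 0.02)

# LASSO coefficients at an illustrative penalty
lasso_coefs <- soft_threshold(ols_coefs, lambda = 0.05)
lasso_coefs

# Predictors that survive the penalty
names(lasso_coefs)[lasso_coefs != 0]
```

Larger penalties zero out more coefficients, which is why the penalty has to be tuned rather than fixed in advance.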

#Build the model

Model1LASSO <- linear_reg (penalty = tune(), mixture=1) %>%
  set_engine("glmnet")

#The recipe was already created in a previous step

#create workflow

wfLASSO <-
  workflow()%>%
  add_model(Model1LASSO) %>%
  add_recipe(fluCV_rec)

#Create the grid for tuning

gridLASSO <- tibble (penalty= 10^seq (-4,-1,length.out=30))

gridLASSO %>% top_n(-5) #the five lowest penalty values

Selecting by penalty

# A tibble: 5 × 1
   penalty
     <dbl>
1 0.0001  
2 0.000127
3 0.000161
4 0.000204
5 0.000259

gridLASSO %>% top_n(5)   #the five highest penalty values

Selecting by penalty

# A tibble: 5 × 1
  penalty
    <dbl>
1  0.0386
2  0.0489
3  0.0621
4  0.0788
5  0.1

#Train and Tune the Model

LASSO_res <- 
  wfLASSO %>% 
  tune_grid(resamples=folds,
            grid = gridLASSO)

Warning: package 'glmnet' was built under R version 4.2.3

LASSO_res %>% 
  collect_metrics()

# A tibble: 60 × 7
    penalty .metric .estimator   mean     n std_err .config              
      <dbl> <chr>   <chr>       <dbl> <int>   <dbl> <chr>                
 1 0.0001   rmse    standard   1.22       5 0.0370  Preprocessor1_Model01
 2 0.0001   rsq     standard   0.0479     5 0.00970 Preprocessor1_Model01
 3 0.000127 rmse    standard   1.22       5 0.0370  Preprocessor1_Model02
 4 0.000127 rsq     standard   0.0479     5 0.00970 Preprocessor1_Model02
 5 0.000161 rmse    standard   1.22       5 0.0370  Preprocessor1_Model03
 6 0.000161 rsq     standard   0.0479     5 0.00970 Preprocessor1_Model03
 7 0.000204 rmse    standard   1.22       5 0.0370  Preprocessor1_Model04
 8 0.000204 rsq     standard   0.0479     5 0.00970 Preprocessor1_Model04
 9 0.000259 rmse    standard   1.22       5 0.0370  Preprocessor1_Model05
10 0.000259 rsq     standard   0.0479     5 0.00970 Preprocessor1_Model05
# … with 50 more rows
# ℹ Use `print(n = ...)` to see more rows

LASSO_res %>% show_best()

Warning: No value of `metric` was given; metric 'rmse' will be used.

# A tibble: 5 × 7
  penalty .metric .estimator  mean     n std_err .config              
    <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1  0.0788 rmse    standard    1.18     5  0.0295 Preprocessor1_Model29
2  0.0621 rmse    standard    1.18     5  0.0300 Preprocessor1_Model28
3  0.1    rmse    standard    1.19     5  0.0289 Preprocessor1_Model30
4  0.0489 rmse    standard    1.19     5  0.0307 Preprocessor1_Model27
5  0.0386 rmse    standard    1.19     5  0.0310 Preprocessor1_Model26

bestmodel <- LASSO_res %>%
  select_best()

Warning: No value of `metric` was given; metric 'rmse' will be used.

bestmodel

# A tibble: 1 × 2
  penalty .config              
    <dbl> <chr>                
1  0.0788 Preprocessor1_Model29

#Final LASSO Model

Final_LASSO_WF <-
wfLASSO %>% finalize_workflow(bestmodel)

Final_LASSO_fit <-
  Final_LASSO_WF %>%
  fit(train_data)

#I could not get the code below to return the RMSE; it kept saying that collect_metrics() does not exist for this type of object. collect_metrics() works on tuning or resampling results, not on a fitted workflow.

#Final_LASSO_fit %>%
#  collect_metrics()
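One possible way to get an RMSE from a fitted workflow is to generate predictions and pass them to a metric function. The sketch below shows the pattern on the built-in `mtcars` data with a plain linear model as a stand-in, assuming the tidymodels packages are installed. `yardstick::rmse()` is namespaced on purpose, because `performance::rmse()` masks it in this session.

```r
library(tidymodels)

# Stand-in for a finalized, fitted workflow (mtcars data, lm engine)
wf_fit <- workflow() %>%
  add_formula(mpg ~ wt + hp) %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  fit(data = mtcars)

# augment() adds a .pred column to the data; rmse() compares it to the truth
rmse_tbl <- augment(wf_fit, new_data = mtcars) %>%
  yardstick::rmse(truth = mpg, estimate = .pred)

rmse_tbl
```

Applied to this analysis, the same pattern would use `Final_LASSO_fit` or `final_fit_rf` with `train_data` and `truth = BodyTemp`. An RMSE computed on the training data is optimistic, so `test_data` would be the better choice once a final model has been selected.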

x <- extract_fit_engine(Final_LASSO_fit)
plot (x,"lambda")

Random Forest

#Build the model and improve training time

cores <- parallel::detectCores()

cores

[1] 8

#Model for Random Forest

modelRF <- 
    rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
  set_engine("ranger", num.threads = cores) %>% 
  set_mode("regression")

#Recipe for Random Forest: I used pretty much the same recipe for all the models except the null model. Is that the correct approach?

#workflow for Random Forest

workflow_rf <-
  workflow() %>%
  add_model(modelRF)%>%
  add_recipe(fluCV_rec)

#Train and Tune Model

modelRF

Random Forest Model Specification (regression)

Main Arguments:
  mtry = tune()
  trees = 1000
  min_n = tune()

Engine-Specific Arguments:
  num.threads = cores

Computational engine: ranger

rf_res <- 
  workflow_rf %>% 
  tune_grid(folds)

i Creating pre-processing data to finalize unknown parameter: mtry

Warning: package 'ranger' was built under R version 4.2.3

rf_res %>%
  show_best()

Warning: No value of `metric` was given; metric 'rmse' will be used.

# A tibble: 5 × 8
   mtry min_n .metric .estimator  mean     n std_err .config              
  <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1     5    34 rmse    standard    1.19     5  0.0320 Preprocessor1_Model01
2    13    39 rmse    standard    1.20     5  0.0315 Preprocessor1_Model05
3     3    15 rmse    standard    1.20     5  0.0318 Preprocessor1_Model10
4     9    19 rmse    standard    1.21     5  0.0323 Preprocessor1_Model07
5    12    22 rmse    standard    1.21     5  0.0323 Preprocessor1_Model02

best_rf <- rf_res %>%
  select_best(metric = "rmse")

best_rf

# A tibble: 1 × 3
   mtry min_n .config              
  <int> <int> <chr>                
1     5    34 Preprocessor1_Model01

final_best_wf <- 
  workflow_rf %>%
  finalize_workflow(best_rf)

#Model Fit

final_fit_rf <-
  final_best_wf %>%
  fit(train_data)

final_fit_rf

══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Ranger result

Call:
 ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~5L,      x), num.trees = ~1000, min.node.size = min_rows(~34L, x),      num.threads = ~cores, verbose = FALSE, seed = sample.int(10^5,          1)) 

Type:                             Regression 
Number of trees:                  1000 
Sample size:                      510 
Number of independent variables:  31 
Mtry:                             5 
Target node size:                 34 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       1.413541 
R squared (OOB):                  0.06279435
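The ranger printout reports the out-of-bag error as a mean squared error, so taking its square root gives a value on the same scale as the cross-validated RMSE values. The short calculation below uses the two numbers from the printout. The second step assumes that ranger computes its OOB R squared as 1 minus MSE divided by the outcome variance.

```r
# Values copied from the ranger printout
oob_mse <- 1.413541
oob_rsq <- 0.06279435

# OOB RMSE, comparable to the cross-validated RMSE (about 1.19)
oob_rmse <- sqrt(oob_mse)
oob_rmse

# Implied standard deviation of the outcome, i.e. roughly the RMSE
# of a model that always predicts the mean (about 1.23)
outcome_sd <- sqrt(oob_mse / (1 - oob_rsq))
outcome_sd
```

Both numbers agree with the tuning results: the random forest improves only slightly on predicting the mean for everyone.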

#Most of my code stopped working for the various models, and I deleted it in order to be able to render. I was also unable to install some of the packages. This exercise was the most challenging one for me. I am not sure how to get the RMSE for the random forest or LASSO, as collect_metrics() did not seem to work for either. I was not able to plot the variable importance for the random forest. Please guide me to the repo of whoever got this exercise right so I can clearly see my mistakes.
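On the variable importance question: the printout above shows `Variable importance mode: none`, which suggests that importance was never requested from ranger. A possible fix, sketched here on the built-in `mtcars` data and assuming the tidymodels and ranger packages are installed, is to pass `importance = "impurity"` to `set_engine()` and then read the scores from the fitted engine object.

```r
library(tidymodels)

set.seed(123)

# Ask ranger to record impurity-based variable importance
rf_spec <- rand_forest(trees = 500) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")

# Stand-in workflow on mtcars (the flu analysis would use fluCV_rec instead)
rf_fit <- workflow() %>%
  add_formula(mpg ~ .) %>%
  add_model(rf_spec) %>%
  fit(data = mtcars)

# Named numeric vector of importance scores, one per predictor
importance <- extract_fit_engine(rf_fit)$variable.importance
sort(importance, decreasing = TRUE)

# With the vip package loaded, a plot could be drawn with:
# vip::vip(extract_fit_parsnip(rf_fit))
```

In this analysis the same engine argument would go into the `modelRF` specification before tuning and refitting.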