5 - Tuning Hyperparameters

Analytical Paleobiology Workshop 2022

Hyperparameters

Some model or preprocessing parameters cannot be estimated directly from your data; they have to be set before training and chosen by comparing candidate values

Choose the best parameter

mod_glm3 <- h2o.glm(x = predictors, y = target,
                    training_frame = ring_train,
                    family = "gaussian", lambda = 0,
                    compute_p_values = TRUE,
                    nfolds = 10, keep_cross_validation_predictions = TRUE,
                    seed = 1234)

How do we know that 0️⃣ is a good value for lambda?

Choose the best parameter

The two main strategies for optimization are:

  • Grid search 💠 which tests a pre-defined set of candidate values

  • Random search 🌀 which tests randomly sampled combinations of candidate values

Choose the best parameter: Grid search
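h2o.grid() searches over a named list of candidate values. The glm_params1 object passed below is not defined on this slide; a cartesian grid over alpha (elastic-net mixing) and lambda (regularization strength) might look like this, with the specific values being illustrative assumptions:

# Illustrative candidate values for the cartesian grid
# (assumed here, not the workshop's exact list)
glm_params1 <- list(alpha  = c(0, 0.5, 1),
                    lambda = c(0, 0.1, 0.5, 1))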

# Train and validate a cartesian grid of GLMs
glm_grid1 <- h2o.grid("glm", x = predictors, y = target,
                      grid_id = "glm_grid1",
                      training_frame = ring_train,
                      seed = 1,
                      hyper_params = glm_params1)

# Get the grid results, sorted by validation RMSE
# (note: RMSE is an error metric, so decreasing = TRUE puts the worst
#  models first; use decreasing = FALSE to list the lowest RMSE at the top)
glm_gridperf1 <- h2o.getGrid(grid_id = "glm_grid1",
                             sort_by = "rmse",
                             decreasing = TRUE)

print(glm_gridperf1)
#> H2O Grid Details
#> ================
#> 
#> Grid ID: glm_grid1 
#> Used hyper parameters: 
#>   -  alpha 
#>   -  lambda 
#> Number of models: 36 
#> Number of failed models: 0 
#> 
#> Hyper-Parameter Search Summary: ordered by decreasing rmse
#>   alpha lambda          model_ids    rmse
#> 1   1.0    1.0 glm_grid1_model_18 2.69280
#> 2   1.0    1.0 glm_grid1_model_27 2.69280
#> 3   1.0    1.0 glm_grid1_model_36 2.69280
#> 4   1.0    1.0  glm_grid1_model_9 2.69280
#> 5   0.5    1.0 glm_grid1_model_17 2.62457
#> 
#> ---
#>    alpha lambda          model_ids    rmse
#> 31   0.5    0.0 glm_grid1_model_20 2.18858
#> 32   1.0    0.0 glm_grid1_model_21 2.18858
#> 33   0.0    0.0 glm_grid1_model_28 2.18858
#> 34   0.5    0.0 glm_grid1_model_29 2.18858
#> 35   1.0    0.0  glm_grid1_model_3 2.18858
#> 36   1.0    0.0 glm_grid1_model_30 2.18858

Choose the best parameter: Grid search

# Grab the first model in the sorted grid -- because the grid was sorted
# with decreasing = TRUE, model_ids[[1]] is the highest-RMSE model here;
# sort with decreasing = FALSE if you want the lowest-RMSE model first
best_glm1 <- h2o.getModel(glm_gridperf1@model_ids[[1]])

# Now let's evaluate this model on a test set
# so we get an honest estimate of its out-of-sample performance
best_glm_perf1 <- h2o.performance(model = best_glm1,
                                  newdata = ring_test)
h2o.rmse(best_glm_perf1)
#> [1] 2.701137

# Look at the hyperparameters for the best model
print(best_glm1@model[["model_summary"]])
#> GLM Model: summary
#>     family     link        regularization number_of_predictors_total
#> 1 gaussian identity Lasso (lambda = 1.0 )                         10
#>   number_of_active_predictors number_of_iterations  training_frame
#> 1                           1                    1 RTMP_sid_9593_3

Choose the best parameter: Random search
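This second search passes a search_criteria list, which switches h2o.grid() from an exhaustive cartesian search to a random one. Neither glm_params2 nor search_criteria is defined on this slide; they might look something like this (the values are illustrative assumptions):

# Finer candidate grids, to be sampled at random rather than exhaustively
glm_params2 <- list(alpha  = seq(0, 1, by = 0.05),
                    lambda = seq(0, 1, by = 0.1))

# Random search: stop after a fixed number of randomly sampled models
search_criteria <- list(strategy = "RandomDiscrete",
                        max_models = 40, seed = 1)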

# Train and validate a random grid of GLMs
glm_grid2 <- h2o.grid("glm", x = predictors, y = target,
                      grid_id = "glm_grid2",
                      training_frame = ring_train,
                      seed = 1,
                      hyper_params = glm_params2,
                      search_criteria = search_criteria)

# Get the grid results, sorted by validation RMSE
# (same caveat as above: decreasing = TRUE puts the worst models first)
glm_gridperf2 <- h2o.getGrid(grid_id = "glm_grid2",
                             sort_by = "rmse",
                             decreasing = TRUE)

print(glm_gridperf2)
#> H2O Grid Details
#> ================
#> 
#> Grid ID: glm_grid2 
#> Used hyper parameters: 
#>   -  alpha 
#>   -  lambda 
#> Number of models: 40 
#> Number of failed models: 0 
#> 
#> Hyper-Parameter Search Summary: ordered by decreasing rmse
#>   alpha lambda          model_ids    rmse
#> 1   0.7    0.9 glm_grid2_model_19 2.62786
#> 2   0.7    0.9 glm_grid2_model_29 2.62786
#> 3   0.7    0.9 glm_grid2_model_39 2.62786
#> 4   0.7    0.9  glm_grid2_model_9 2.62786
#> 5   0.9    0.6  glm_grid2_model_1 2.56506
#> 
#> ---
#>    alpha lambda          model_ids    rmse
#> 35  0.45    0.1 glm_grid2_model_38 2.30685
#> 36  0.45    0.1  glm_grid2_model_8 2.30685
#> 37  0.95    0.0 glm_grid2_model_16 2.18858
#> 38  0.95    0.0 glm_grid2_model_26 2.18858
#> 39  0.95    0.0 glm_grid2_model_36 2.18858
#> 40  0.95    0.0  glm_grid2_model_6 2.18858

Choose the best parameter: Random search

# Grab the first model in the sorted grid (again the highest-RMSE model,
# given the decreasing = TRUE sort)
best_glm2 <- h2o.getModel(glm_gridperf2@model_ids[[1]])

# Now let's evaluate this model on a test set
# so we get an honest estimate of its out-of-sample performance
best_glm_perf2 <- h2o.performance(model = best_glm2,
                                  newdata = ring_test)
h2o.rmse(best_glm_perf2)
#> [1] 2.645909

# Look at the hyperparameters for the best model
print(best_glm2@model[["model_summary"]])
#> GLM Model: summary
#>     family     link                           regularization
#> 1 gaussian identity Elastic Net (alpha = 0.7, lambda = 0.9 )
#>   number_of_predictors_total number_of_active_predictors number_of_iterations
#> 1                         10                           4                    1
#>    training_frame
#> 1 RTMP_sid_9593_3

Your turn

Use either a grid search or a random search to tune your model

How do the results vary with the values you initially chose?

15:00

Optimize tuning parameters

  • Try different values and measure their performance

  • Find good values for these parameters

  • Finalize the model by refitting it with these values on the entire training set (see the sketch below)
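A minimal sketch of that last step, assuming the search favoured alpha = 0.7 and lambda = 0.9 (substitute whatever values your own search selects):

# Refit a single GLM on the full training data using the tuned values
final_glm <- h2o.glm(x = predictors, y = target,
                     training_frame = ring_train,
                     family = "gaussian",
                     alpha = 0.7, lambda = 0.9,  # values chosen by the search
                     seed = 1234)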

Number of trees in a Random Forest?

Yes ✅ (see the sketch at the end of this section)

Number of PCA components to retain?

Yes ✅

Bayesian priors for model parameters?

Hmmmm, probably not ❌

Is the random seed a tuning parameter?

Nope ❌
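Taking the first quiz item above as an example, the number of trees in a random forest enters an h2o.grid() call just like alpha and lambda did earlier; this is a minimal sketch, and the candidate ntrees values are arbitrary:

# Tune the number of trees in a distributed random forest ("drf")
rf_params <- list(ntrees = c(50, 100, 200, 500))
rf_grid <- h2o.grid("drf", x = predictors, y = target,
                    grid_id = "rf_grid",
                    training_frame = ring_train,
                    seed = 1,
                    hyper_params = rf_params)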