Analytical Paleobiology Workshop 2022
Some model or preprocessing parameters cannot be estimated directly from your data
How do we know that 0️⃣ is a good value?
The two main strategies for optimization are:
Grid search 💠 which exhaustively tests every combination in a pre-defined set of candidate values
Random search 🌀 which tests randomly sampled combinations of candidate values (both sketched below)
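The cartesian grid chunk below uses a parameter list glm_params1 that is not defined in this excerpt. A minimal sketch of what it could look like, assuming small hand-picked candidate sets (the values themselves are illustrative):

# Illustrative reconstruction: the original candidate values are not
# shown in this excerpt. alpha mixes ridge (0) and lasso (1) penalties;
# lambda sets the regularization strength.
glm_params1 <- list(alpha  = c(0, 0.5, 1),
                    lambda = c(0, 0.1, 1))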
# Train and validate a cartesian grid of GLMs
glm_grid1 <- h2o.grid("glm", x = predictors, y = target,
                      grid_id = "glm_grid1",
                      training_frame = ring_train,
                      seed = 1,
                      hyper_params = glm_params1)
# Get the grid results, sorted by RMSE
# (note: decreasing = TRUE sorts the highest RMSE first;
# since lower RMSE is better, decreasing = FALSE would put
# the best model at the top of the list)
glm_gridperf1 <- h2o.getGrid(grid_id = "glm_grid1",
                             sort_by = "rmse",
                             decreasing = TRUE)
print(glm_gridperf1)
#> H2O Grid Details
#> ================
#>
#> Grid ID: glm_grid1
#> Used hyper parameters:
#> - alpha
#> - lambda
#> Number of models: 36
#> Number of failed models: 0
#>
#> Hyper-Parameter Search Summary: ordered by decreasing rmse
#> alpha lambda model_ids rmse
#> 1 1.0 1.0 glm_grid1_model_18 2.69280
#> 2 1.0 1.0 glm_grid1_model_27 2.69280
#> 3 1.0 1.0 glm_grid1_model_36 2.69280
#> 4 1.0 1.0 glm_grid1_model_9 2.69280
#> 5 0.5 1.0 glm_grid1_model_17 2.62457
#>
#> ---
#> alpha lambda model_ids rmse
#> 31 0.5 0.0 glm_grid1_model_20 2.18858
#> 32 1.0 0.0 glm_grid1_model_21 2.18858
#> 33 0.0 0.0 glm_grid1_model_28 2.18858
#> 34 0.5 0.0 glm_grid1_model_29 2.18858
#> 35 1.0 0.0 glm_grid1_model_3 2.18858
#> 36 1.0 0.0 glm_grid1_model_30 2.18858
# Grab the first GLM model from the sorted grid
best_glm1 <- h2o.getModel(glm_gridperf1@model_ids[[1]])
# Now let's evaluate its performance on a held-out test set
# so we get an honest estimate of out-of-sample performance
best_glm_perf1 <- h2o.performance(model = best_glm1,
                                  newdata = ring_test)
h2o.rmse(best_glm_perf1)
#> [1] 2.701137
# Look at the hyperparameters for the best model
print(best_glm1@model[["model_summary"]])
#> GLM Model: summary
#> family link regularization number_of_predictors_total
#> 1 gaussian identity Lasso (lambda = 1.0 ) 10
#> number_of_active_predictors number_of_iterations training_frame
#> 1 1 1 RTMP_sid_9593_3
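The random search in the next chunk draws candidate combinations from glm_params2 and stops according to search_criteria; neither object is defined in this excerpt. A minimal sketch, assuming finer candidate grids and a model budget (max_models = 40 is an assumption chosen to match the 40 models reported below):

# Illustrative reconstruction: denser candidate grids for the random
# search, plus stopping criteria. "RandomDiscrete" samples combinations
# at random instead of trying all of them.
glm_params2 <- list(alpha  = seq(0, 1, by = 0.05),
                    lambda = seq(0, 1, by = 0.1))
search_criteria <- list(strategy = "RandomDiscrete",
                        max_models = 40,   # assumed; matches the output below
                        seed = 1)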
# Train and validate a random grid of GLMs
glm_grid2 <- h2o.grid("glm", x = predictors, y = target,
                      grid_id = "glm_grid2",
                      training_frame = ring_train,
                      seed = 1,
                      hyper_params = glm_params2,
                      search_criteria = search_criteria)
# Get the grid results, sorted by RMSE
# (again worst-first because decreasing = TRUE; see the note above)
glm_gridperf2 <- h2o.getGrid(grid_id = "glm_grid2",
                             sort_by = "rmse",
                             decreasing = TRUE)
print(glm_gridperf2)
#> H2O Grid Details
#> ================
#>
#> Grid ID: glm_grid2
#> Used hyper parameters:
#> - alpha
#> - lambda
#> Number of models: 40
#> Number of failed models: 0
#>
#> Hyper-Parameter Search Summary: ordered by decreasing rmse
#> alpha lambda model_ids rmse
#> 1 0.7 0.9 glm_grid2_model_19 2.62786
#> 2 0.7 0.9 glm_grid2_model_29 2.62786
#> 3 0.7 0.9 glm_grid2_model_39 2.62786
#> 4 0.7 0.9 glm_grid2_model_9 2.62786
#> 5 0.9 0.6 glm_grid2_model_1 2.56506
#>
#> ---
#> alpha lambda model_ids rmse
#> 35 0.45 0.1 glm_grid2_model_38 2.30685
#> 36 0.45 0.1 glm_grid2_model_8 2.30685
#> 37 0.95 0.0 glm_grid2_model_16 2.18858
#> 38 0.95 0.0 glm_grid2_model_26 2.18858
#> 39 0.95 0.0 glm_grid2_model_36 2.18858
#> 40 0.95 0.0 glm_grid2_model_6 2.18858
# Grab the first GLM model from the sorted grid
best_glm2 <- h2o.getModel(glm_gridperf2@model_ids[[1]])
# Now let's evaluate its performance on a held-out test set
# so we get an honest estimate of out-of-sample performance
best_glm_perf2 <- h2o.performance(model = best_glm2,
                                  newdata = ring_test)
h2o.rmse(best_glm_perf2)
#> [1] 2.645909
# Look at the hyperparameters for the best model
print(best_glm2@model[["model_summary"]])
#> GLM Model: summary
#> family link regularization
#> 1 gaussian identity Elastic Net (alpha = 0.7, lambda = 0.9 )
#> number_of_predictors_total number_of_active_predictors number_of_iterations
#> 1 10 4 1
#> training_frame
#> 1 RTMP_sid_9593_3
Use either a grid search or a random search to tune your model
How do the results vary with the values you initially chose?
Try different values and measure their performance
Find good values for these parameters
Finalize the model by refitting it with these parameters on the entire training set (minimal sketch below)
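A minimal sketch of that last step, assuming the lowest-RMSE combination from the random search above (alpha = 0.95, lambda = 0); substitute the values your own search found:

# Sketch of the finalization step: refit a single GLM on the entire
# training set with the tuned values (values here are illustrative).
final_glm <- h2o.glm(x = predictors, y = target,
                     training_frame = ring_train,
                     alpha = 0.95, lambda = 0,
                     seed = 1)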
Yes ✅
Yes ✅
Hmmmm, probably not ❌
Nope ❌