4 - Evaluating models

Analytical Paleobiology Workshop 2022

Metrics for model performance

#> H2ORegressionMetrics: glm
#> ** Reported on training data. **
#> MSE:  4.789876
#> RMSE:  2.188578
#> MAE:  1.578086
#> RMSLE:  NaN
#> Mean Residual Deviance :  4.789876
#> R^2 :  0.5362487
#> Null Deviance :30396.91
#> Null D.o.F. :2942
#> Residual Deviance :14096.6
#> Residual D.o.F. :2933
#> AIC :12984.09
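  • RMSE: root mean squared error, the typical size of the difference between the predicted and observed values; lower is better ⬇️
  • \(R^2\): squared correlation between the predicted and observed values; higher is better ⬆️
  • MAE: mean absolute error; like RMSE, but it averages absolute rather than squared differences, so large errors are penalized less; lower is better ⬇️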
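
As a rough illustration of what these three metrics measure, the sketch below computes them by hand in base R on a small made-up pair of vectors. The helper names (`rmse`, `mae`, `rsq`) and the toy values are assumptions for illustration and are not part of h2o. The `rsq` helper follows the squared-correlation definition given above; h2o may compute its reported R^2 differently, so the two need not agree exactly on new data.

```r
# Toy observed and predicted values (made up for illustration only)
observed  <- c(3, 5, 7, 9)
predicted <- c(2, 5, 8, 9)

# Root mean squared error: square the errors, average, take the square root
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Mean absolute error: average of the absolute errors
mae <- function(obs, pred) mean(abs(obs - pred))

# R^2 as the squared correlation between predictions and observations
rsq <- function(obs, pred) cor(obs, pred)^2

metrics <- c(
  RMSE = rmse(observed, predicted),
  MAE  = mae(observed, predicted),
  R2   = rsq(observed, predicted)
)
print(metrics)
```

Because RMSE squares the errors before averaging, a single large miss moves it more than it moves MAE, which is one reason to look at both.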

Metrics for model performance

h2o.performance(mod_glm2, newdata = ring_test)
#> H2ORegressionMetrics: glm
#> MSE:  4.865688
#> RMSE:  2.20583
#> MAE:  1.591277
#> RMSLE:  0.179791
#> Mean Residual Deviance :  4.865688
#> R^2 :  0.538475
#> Null Deviance :13015.45
#> Null D.o.F. :1233
#> Residual Deviance :6004.259
#> Residual D.o.F. :1224
#> AIC :5476.385


Dangers of overfitting ⚠️

h2o.performance(mod_glm2, newdata = ring_train)
#> H2ORegressionMetrics: glm
#> MSE:  4.789876
#> RMSE:  2.188578
#> MAE:  1.578086
#> RMSLE:  NaN
#> Mean Residual Deviance :  4.789876
#> R^2 :  0.5362487
#> Null Deviance :30396.91
#> Null D.o.F. :2942
#> Residual Deviance :14096.6
#> Residual D.o.F. :2933
#> AIC :12984.09

We call this “resubstitution” or “repredicting the training set”: the model is evaluated on the same data it was fitted to.

The resulting values are called “resubstitution estimates”.
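
To see why resubstitution estimates tend to be optimistic, here is a minimal base R sketch on simulated data. Everything in it is an assumption for illustration: the toy data, the 40/20 split, and the deliberately flexible degree-12 polynomial; it does not use the ring data or h2o. It fits the model on one part of the data and compares the RMSE on that same part with the RMSE on the held-out part.

```r
set.seed(1234)

# Simulated toy data: a smooth signal plus noise (not the workshop data)
n   <- 60
toy <- data.frame(x = runif(n, -1, 1))
toy$y <- sin(3 * toy$x) + rnorm(n, sd = 0.3)

# Hold out 20 rows; fit on the remaining 40
in_train  <- sample(n, size = 40)
toy_train <- toy[in_train, ]
toy_test  <- toy[-in_train, ]

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# A deliberately over-flexible model, so it can chase noise in the training rows
fit <- lm(y ~ poly(x, 12), data = toy_train)

# Resubstitution estimate vs. estimate on data the model has not seen
rmse_train <- rmse(toy_train$y, predict(fit, newdata = toy_train))
rmse_test  <- rmse(toy_test$y,  predict(fit, newdata = toy_test))

print(c(train = rmse_train, test = rmse_test))
```

With a flexible model like this, the training RMSE will usually come out lower than the held-out RMSE, although the size of the gap depends on the seed and the data.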

Dangers of overfitting ⚠️

h2o.performance(mod_glm2, newdata = ring_train)
#> H2ORegressionMetrics: glm
#> MSE:  4.789876
#> RMSE:  2.188578
#> MAE:  1.578086
#> RMSLE:  NaN
#> Mean Residual Deviance :  4.789876
#> R^2 :  0.5362487
#> Null Deviance :30396.91
#> Null D.o.F. :2942
#> Residual Deviance :14096.6
#> Residual D.o.F. :2933
#> AIC :12984.09
h2o.performance(mod_glm2, newdata = ring_test)
#> H2ORegressionMetrics: glm
#> MSE:  4.865688
#> RMSE:  2.20583
#> MAE:  1.591277
#> RMSLE:  0.179791
#> Mean Residual Deviance :  4.865688
#> R^2 :  0.538475
#> Null Deviance :13015.45
#> Null D.o.F. :1233
#> Residual Deviance :6004.259
#> Residual D.o.F. :1224
#> AIC :5476.385

⚠️ Don’t use the test set until the end of your modeling analysis

Your turn

Compute the metrics for both training and testing data.

Notice the evidence of overfitting, if any! ⚠️


Dangers of overfitting ⚠️

  • What if we want to compare more models?

  • And/or more model configurations?

  • And we want to understand if these are important differences?

The testing data are precious 💎

How can we use the training data to compare and evaluate different models? 🤔



Your turn

If we use 10 folds, what percent of the training data

  • ends up in analysis
  • ends up in assessment

for each fold?



What is in this?

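mod_glm3 <- h2o.glm(x = predictors, y = target,
                    training_frame = ring_train,  # assumed: the training split used above
                    family = "gaussian", lambda = 0,
                    compute_p_values = TRUE,
                    nfolds = 10,  # 10-fold cross-validation
                    keep_cross_validation_predictions = TRUE,
                    seed = 1234)  # makes the fold assignment reproducible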
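
To make the idea behind `nfolds = 10` concrete, the sketch below runs 10-fold cross-validation by hand in base R. It is only an illustration under stated assumptions: the data are simulated, the model is a plain `lm()`, and the fold assignment is done with `sample()`. h2o handles all of this internally, and its fold assignment may differ from what is shown here.

```r
set.seed(1234)

# Simulated toy data (not the workshop data)
n   <- 200
toy <- data.frame(x = runif(n))
toy$y <- 2 * toy$x + rnorm(n, sd = 0.5)

# Randomly assign each row to one of k folds of equal size
k       <- 10
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_rmse <- numeric(k)
for (i in seq_len(k)) {
  analysis   <- toy[fold_id != i, ]  # rows used to fit the model
  assessment <- toy[fold_id == i, ]  # rows held out to measure performance

  fit  <- lm(y ~ x, data = analysis)
  pred <- predict(fit, newdata = assessment)

  fold_rmse[i] <- sqrt(mean((assessment$y - pred)^2))
}

print(table(fold_id))   # number of assessment rows in each fold
print(mean(fold_rmse))  # cross-validated RMSE estimate
```

Each row is used for assessment exactly once, so the averaged RMSE is based only on predictions for rows the corresponding model did not see.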

Evaluating model performance

h2o.performance(mod_glm3, newdata=ring_train)
#> H2ORegressionMetrics: glm
#> MSE:  4.789876
#> RMSE:  2.188578
#> MAE:  1.578086
#> RMSLE:  NaN
#> Mean Residual Deviance :  4.789876
#> R^2 :  0.5362487
#> Null Deviance :30396.91
#> Null D.o.F. :2942
#> Residual Deviance :14096.6
#> Residual D.o.F. :2933
#> AIC :12984.09

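Cross-validation lets us measure performance reliably using only the training data 🎉

Note that newdata = ring_train re-predicts the training set, which is why the values above match the resubstitution estimates. The cross-validated metrics are returned by h2o.performance(mod_glm3, xval = TRUE).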

Comparing metrics

How do the metrics from resampling compare to the metrics from training and testing?

h2o.performance(mod_glm3, newdata=ring_train)
#> H2ORegressionMetrics: glm
#> MSE:  4.789876
#> RMSE:  2.188578
#> MAE:  1.578086
#> RMSLE:  NaN
#> Mean Residual Deviance :  4.789876
#> R^2 :  0.5362487
#> Null Deviance :30396.91
#> Null D.o.F. :2942
#> Residual Deviance :14096.6
#> Residual D.o.F. :2933
#> AIC :12984.09

Previously, the RMSE was

  • 2.1885785 for the training set
  • 2.2058304 for the test set

Remember that:

⚠️ the training set gives you overly optimistic metrics

⚠️ the test set is precious

Evaluating model performance

h2o.performance(mod_glm3, newdata=ring_test)
#> H2ORegressionMetrics: glm
#> MSE:  4.865688
#> RMSE:  2.20583
#> MAE:  1.591277
#> RMSLE:  0.179791
#> Mean Residual Deviance :  4.865688
#> R^2 :  0.538475
#> Null Deviance :13015.45
#> Null D.o.F. :1233
#> Residual Deviance :6004.259
#> Residual D.o.F. :1224
#> AIC :5476.385

Parallel processing

  • Resampling can involve fitting a lot of models!

  • These models don’t depend on one another and can be run in parallel

We can start h2o with a parallel backend to do this:

h2o.init(nthreads = -1)  # -1 means use all available cores

h2o.shutdown()  # stop the h2o cluster when you are done
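
The point that resamples are independent can also be shown outside h2o. The sketch below is a minimal base R example using the `parallel` package that ships with R; the simulated data, the helper `fit_one_fold()`, and the choice of two workers are assumptions for illustration. Each fold is fitted on a separate worker, and the results are collected at the end.

```r
library(parallel)

set.seed(1234)

# Simulated toy data and fold assignment (not the workshop data)
n   <- 200
toy <- data.frame(x = runif(n))
toy$y <- 2 * toy$x + rnorm(n, sd = 0.5)
fold_id <- sample(rep(1:10, length.out = n))

# Fit on the analysis rows of fold k and return the assessment RMSE.
# No fold needs results from any other fold, so they can run in any order.
fit_one_fold <- function(k, data, fold_id) {
  analysis   <- data[fold_id != k, ]
  assessment <- data[fold_id == k, ]
  fit  <- lm(y ~ x, data = analysis)
  pred <- predict(fit, newdata = assessment)
  sqrt(mean((assessment$y - pred)^2))
}

cl <- makeCluster(2)  # two worker processes; adjust to your machine
fold_rmse <- unlist(
  parLapply(cl, 1:10, fit_one_fold, data = toy, fold_id = fold_id)
)
stopCluster(cl)       # always release the workers

print(mean(fold_rmse))
```

For models this small, the cost of starting workers outweighs the gain; parallel resampling pays off when each fit is expensive.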

Your turn


  • Retrain your model using cross-validation

Don’t forget to set a seed!


Discussion time

Which model do you think you would decide to use?

What surprised you the most?

What is one thing you are looking forward to next?
