XGBoost Multiclass Classification — xgMultiClassif

Runs XGBoost for multiclass classification.

xgMultiClassif(
  gridNumber = 10,
  levelNumber = 3,
  recipe = rec,
  folds = folds,
  train = train_df,
  test = test_df,
  response = response,
  treeNum = 100,
  calcFeatImp = TRUE,
  evalMetric = "roc_auc"
)

Arguments

gridNumber

Numeric. Size of the grid you want XGBoost to explore. Default is 10.

levelNumber

Numeric. How many levels are in your response? Default is 3.

recipe

A recipe object.

folds

An rsample::vfold_cv object.

train

Data frame/tibble. The training data set.

test

Data frame/tibble. The testing data set.

response

Character. The variable that is the response for analysis.

treeNum

Numeric. The number of trees to build your model with. Default is 100.

calcFeatImp

Logical. Do you want to calculate feature importance for your model? If not, set to FALSE. Default is TRUE.

evalMetric

Character. The classification metric used to evaluate the model and select the best tuning parameters. Default is roc_auc. Metrics available to choose from:

bal_accuracy
mn_log_loss
roc_auc
mcc
kap
sens
spec
precision
recall
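
The metric names above match functions exported by the yardstick package. As an illustration only (this page does not state how evalMetric is resolved internally, so the link to yardstick is an assumption), the sketch below scores made-up three-class predictions with a few of these metrics. Note that roc_auc and mn_log_loss need class probabilities rather than hard class predictions, so they are left out of this toy example.

```r
library(yardstick)

# Made-up three-class labels and predictions, for illustration only
toy <- data.frame(
  truth = factor(c("a", "a", "b", "b", "c", "c"), levels = c("a", "b", "c")),
  pred  = factor(c("a", "b", "b", "b", "c", "a"), levels = c("a", "b", "c"))
)

# Bundle several hard-class metrics and evaluate them in one call
multi_metrics <- metric_set(bal_accuracy, mcc, kap)
scores <- multi_metrics(toy, truth = truth, estimate = pred)

# One row per metric, with columns .metric, .estimator and .estimate
scores
```

Comparing several metrics side by side like this can help when deciding which one to pass as evalMetric.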

Value

A list with the following outputs:

Training confusion matrix
Training model metric score
Testing confusion matrix
Testing model metric score
Final model chosen by XGBoost
Tuned model
Feature importance plot
Feature importance variable

Details

What the model tunes:

mtry: The number of predictors that will be randomly sampled at each split when creating the tree models.
min_n: The minimum number of data points in a node that are required for the node to be split further.
tree_depth: The maximum depth of the tree (i.e. number of splits).
learn_rate: The rate at which the boosting algorithm adapts from iteration-to-iteration.
loss_reduction: The reduction in the loss function required to split further.
sample_size: The amount of data exposed to the fitting routine.

What you set specifically:

trees: Default is 100. Sets the number of trees contained in the ensemble. A larger value increases runtime but (ideally) leads to more robust outcomes.
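
The split between tuned and fixed parameters described above can be sketched as a parsnip model specification. This is a hedged illustration of what such a specification typically looks like, not the package's actual internal code; the object name xgb_spec is hypothetical.

```r
library(parsnip)
library(tune)

# Hypothetical specification: the six tuned parameters are marked with tune(),
# while trees is fixed (here at 100, matching the treeNum default)
xgb_spec <- boost_tree(
  trees = 100,
  mtry = tune(),
  min_n = tune(),
  tree_depth = tune(),
  learn_rate = tune(),
  loss_reduction = tune(),
  sample_size = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

xgb_spec
```

Parameters marked with tune() are placeholders that a tuning routine fills in from a candidate grid, which is where gridNumber comes into play.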

Examples

library(easytidymodels)
library(dplyr)
library(recipes)
utils::data(penguins, package = "modeldata")
# Define your response variable and formula object here
resp <- "species"
formula <- stats::as.formula(paste(resp, ".", sep = "~"))
# Split data into training and testing sets
split <- trainTestSplit(penguins, stratifyOnResponse = TRUE,
                        responseVar = resp)
# Create a recipe for feature engineering; the steps vary with the data set.
# step_impute_knn() and step_impute_median() are the current recipes names
# for step_knnimpute() and step_medianimpute().
rec <- recipe(formula, data = split$train) %>%
  step_impute_knn(!!resp) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_impute_median(all_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_corr(all_numeric(), -all_outcomes(), threshold = .8) %>%
  prep()
#> Warning: There are new levels in a factor: NA
train_df <- bake(rec, split$train)
#> Warning: There are new levels in a factor: NA
test_df <- bake(rec, split$test)
#> Warning: There are new levels in a factor: NA
folds <- cvFolds(train_df)
#xgClass <- xgMultiClassif(recipe = rec, response = resp, folds = folds,
#train = train_df, test = test_df, evalMetric = "roc_auc")

#Visualize training data and its predictions
#xgClass$trainConfMat

#View how model metrics look
#xgClass$trainScore

#Visualize testing data and its predictions
#xgClass$testConfMat

#View how model metrics look
#xgClass$testScore

#See the final model chosen by XGBoost based on optimizing for your chosen evaluation metric
#xgClass$final

#See how model fit looks based on another evaluation metric
#xgClass$tune %>% tune::show_best("bal_accuracy")

#Feature importance plot
#xgClass$featImpPlot

#Feature importance variables
#xgClass$featImpVars