The goal of easytidymodels is to make running analyses in R using the tidymodels framework both easier and more reproducible. This is a wrapper for the tidymodels packages so that, after your data pre-processing steps, it all runs in one line of code and automatically tunes all the hyperparameters that are offered.
If you are not familiar with tidymodels, I recommend starting with the official tidymodels documentation and tutorials.
For more details on how the functions work in this package, I recommend checking out the reference page, referencing the vignettes on this site, or calling help on the function of interest in R to learn more. Here I will just give a brief overview of the workflow of this package.
Installation
You can install easytidymodels from GitHub like this:

``` r
# install.packages("devtools")
devtools::install_github("amanda-park/easytidymodels")
```
Preparing Data for Analysis
There are three main functions to prepare your data for analysis:
- `trainTestSplit` lets you split data into training and testing sets, with the ability to stratify on a variable and split based on a point in time.
- `cvFolds` splits your data into cross-validation folds so that the model's hyperparameters can be tuned.
- `createRecipe` does some basic data preprocessing on your dataset. NOTE: I recommend calling `recipe()` and creating a recipe object specific to your dataset's needs, as every dataset will require its own preprocessing prior to analysis.
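As a minimal sketch of the three steps above (the argument names are assumptions for illustration; check `?trainTestSplit`, `?cvFolds`, and `?createRecipe` for the actual signatures):

``` r
library(easytidymodels)

# Hypothetical workflow on the built-in iris data; argument names
# below are assumptions, not the confirmed API.
split <- trainTestSplit(iris, stratifyOnVariable = "Species")
train <- split$train
test  <- split$test

# Cross-validation folds for hyperparameter tuning
folds <- cvFolds(train)

# Basic preprocessing recipe; for real projects, build your own
# recipe with recipes::recipe() tailored to your dataset
rec <- createRecipe(train, responseVar = "Species")
```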
Classification Functions
The binary classification machine learning models available are as follows:
- XGBoost (function xgBinaryClassif)
- Logistic Regression (function logRegBinary)
- K-Nearest Neighbors (function knnClassif)
- Support Vector Machine (function svmClassif)
The multiclass classifications available are as follows:
- XGBoost (function xgMultiClassif)
- Multinomial Regression (function logRegMulti)
- K-Nearest Neighbors (function knnClassif)
- Support Vector Machine (function svmClassif)
Each of these models will automatically tune the hyperparameters appropriate to that model. These models also allow optimizing hyperparameters for a specific evaluation metric; see each function's help page for the metrics it supports.
Save the model output to an object; the model will return the following in a list (can be accessed using $):
- Confusion matrix on training data
- Accuracy evaluation on training data
- Confusion matrix on testing data
- Accuracy evaluation on testing data
- Description of final model chosen
- A tuned version of the model (in case you want to try model stacking or to see the optimal model fit based on a different evaluation metric)
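A sketch of fitting a binary classifier and pulling components out of the returned list (the arguments and list-element names here are assumptions; see `?logRegBinary` for the real interface):

``` r
library(easytidymodels)

# Hypothetical call; consult the function's help page for the
# actual argument names
logReg <- logRegBinary(recipe = rec, response = "Outcome",
                       folds = folds, train = train, test = test)

# Components of the returned list are accessed with $
logReg$trainConfMat   # confusion matrix on training data
logReg$testConfMat    # confusion matrix on testing data
logReg$final          # description of the final model chosen
```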
Regression Functions
The regression functions available are as follows:
- Random Forest (function rfRegress)
- XGBoost (function xgRegress)
- Linear Regression (function linearRegress)
- MARS (function marsRegress)
- K-Nearest Neighbor Regression (function knnRegress)
- Support Vector Machine Regression (function svmRegress)
These models also allow optimizing hyperparameters for a specific evaluation metric. The supported metrics are as follows:

- RMSE (Root Mean Squared Error, call “rmse”)
- MAE (Mean Absolute Error, call “mae”)
- RSQ (R-Squared, call “rsq”)
- MASE (Mean Absolute Scaled Error, call “mase”)
- CCC (Concordance Correlation Coefficient, call “ccc”)
- IIC (Index of Ideality of Correlation, call “iic”)
- Huber Loss (call “huber_loss”)
Save the model output to an object; the model will return the following in a list (can be accessed using $):
- Predictions on training data
- RMSE and MAE evaluation on training data
- Predictions on testing data
- RMSE and MAE evaluation on testing data
- Description of final model chosen
- A tuned version of the model (in case you want to try model stacking or to see the optimal model fit based on a different evaluation metric)
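For example, a regression fit optimized for MAE rather than the default metric might look like this (the argument and list-element names are assumptions; see `?rfRegress` for the real interface):

``` r
library(easytidymodels)

# Hypothetical call passing "mae" as the evaluation metric;
# the evalMetric argument name is an assumption
rf <- rfRegress(recipe = rec, response = "Sepal.Length",
                folds = folds, train = train, test = test,
                evalMetric = "mae")

rf$testPred     # predictions on testing data
rf$testScore    # RMSE and MAE evaluation on testing data
rf$tunedModel   # tuned model object, usable for model stacking
```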