Creates recipes to preprocess data set — createRecipe • easytidymodels

Creates and returns a recipe object.

createRecipe(
  data = data,
  responseVar = "response",
  corrValue = 0.9,
  otherValue = 0.01
)

Arguments

data: The training data set.
responseVar: The name of the response variable for the analysis, given as a character string.
corrValue: The absolute-correlation threshold used to remove highly correlated variables from the data set. The step tries to remove the minimum number of columns needed so that all remaining absolute correlations are below this value. Default is 0.9.
otherValue: The minimum frequency (as a proportion of the training data) that a level of a factor variable needs in order to avoid being pooled into "other". Default is 0.01.
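
The effect of the two thresholds can be illustrated with the recipes package directly. The following is a minimal sketch on made-up data, assuming that corrValue and otherValue are passed on to recipes::step_corr() and recipes::step_other() respectively (this page does not show the function's source); the data frame `toy` and its columns are hypothetical.

```r
library(recipes)

set.seed(1)
n <- 100
x <- rnorm(n)
toy <- data.frame(
  y  = rnorm(n),
  x  = x,
  x2 = x + rnorm(n, sd = 0.01),  # nearly an exact copy of x
  z  = rnorm(n),                 # unrelated to x
  g  = factor(rep(c("a", "b", "c", "d"), times = c(60, 37, 2, 1)))
)

rec <- recipe(y ~ ., data = toy)
# Levels seen in fewer than 5% of rows ("c" and "d") are pooled into "other".
rec <- step_other(rec, g, threshold = 0.05)
# One column of any pair with absolute correlation above 0.9 is dropped.
rec <- step_corr(rec, x, x2, z, threshold = 0.9)
rec <- prep(rec, training = toy)

baked <- bake(rec, new_data = toy)
levels(baked$g)  # the rare levels no longer appear separately
names(baked)     # only one of x and x2 is kept
```

Lowering the correlation threshold removes more columns, and raising the frequency threshold pools more levels into "other".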

Value

A recipes::recipe object that has been prepped.
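
Because the returned recipe is already prepped, the statistics it needs (means, standard deviations, retained factor levels) have been estimated from the training data, and it can be applied to other rows with recipes::bake(). The sketch below shows that pattern on a tiny made-up data set built with recipes alone; `train` and `new_rows` are hypothetical names, and the recipe here is a stand-in for the one createRecipe() returns.

```r
library(recipes)

train    <- data.frame(y = c(1, 2, 3, 4), x = c(10, 20, 30, 40))
new_rows <- data.frame(y = c(5, 6),       x = c(25, 50))

rec <- recipe(y ~ x, data = train)
rec <- step_normalize(rec, x)        # centre and scale x
rec <- prep(rec, training = train)   # estimate mean and sd from train only

# Apply the stored training mean and sd to rows the recipe has not seen.
baked <- bake(rec, new_data = new_rows)
baked$x  # x = 25 equals the training mean, so its normalized value is 0
```

Baking new data with the training statistics, rather than re-estimating them, keeps the test set from leaking into preprocessing.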

Details

The following data transformations are automatically applied to the data:

Normalizes (centers and scales) numeric variables
Pools infrequent levels of categorical variables into an "other" category
Assigns NA values of categorical variables to an "unknown" category
Removes variables with near-zero variance
Removes highly correlated variables
One-hot encodes categorical variables
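
The transformations listed above can be approximated with standard recipes steps. The function below is a sketch and not the package's source: `sketchRecipe` is a hypothetical name, the step order follows the printed recipe in the Examples section, the selectors are guesses based on that output, and `one_hot = TRUE` is an assumption taken from the "one-hot" wording.

```r
library(recipes)

# Hypothetical stand-in for createRecipe(); the real implementation may differ.
sketchRecipe <- function(data, responseVar = "response",
                         corrValue = 0.9, otherValue = 0.01) {
  # Model the response against every other column.
  f <- stats::as.formula(paste(responseVar, "~ ."))
  rec <- recipe(f, data = data)

  # NA values in factor columns become an "unknown" level.
  rec <- step_unknown(rec, all_nominal())
  # Centre and scale numeric predictors.
  rec <- step_normalize(rec, all_numeric(), -all_outcomes())
  # Pool levels rarer than otherValue into "other".
  rec <- step_other(rec, all_nominal(), -all_outcomes(), threshold = otherValue)
  # Drop near-zero-variance predictors.
  rec <- step_nzv(rec, all_predictors())
  # Drop predictors whose absolute correlation exceeds corrValue.
  rec <- step_corr(rec, all_numeric(), -all_outcomes(), threshold = corrValue)
  # Encode categorical predictors as indicator columns (assumed one-hot).
  rec <- step_dummy(rec, all_nominal(), -all_outcomes(), one_hot = TRUE)

  # Estimate every step from the training data and return the prepped recipe.
  prep(rec, training = data)
}
```

Writing the steps out this way also gives a starting point for a custom recipe: individual steps can be removed, reordered, or replaced when the defaults do not suit a data set.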

If a different recipe is needed, I recommend loading the recipes package and building one suited to your data set. The variety of transformations that a specific data set may require makes this function hard to generalize.

Examples

library(easytidymodels)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:testthat':
#> 
#>     matches
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
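# Load the penguins data and use "sex" as the response
utils::data(penguins, package = "modeldata")
resp <- "sex"
# Split into training and testing sets, stratified on the response
split <- trainTestSplit(penguins, stratifyOnResponse = TRUE, responseVar = resp)
# Model formula (sex ~ .); not needed by createRecipe() itself
formula <- stats::as.formula(paste(resp, ".", sep="~"))
# Build and prep the recipe on the training data
rec <- createRecipe(split$train, responseVar = resp)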
#> Warning: The `preserve` argument of `step_dummy()` is deprecated as of recipes 0.1.16.
#> Please use the `keep_original_cols` argument instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
rec
#> Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor          6
#> 
#> Training data contained 274 data points and 8 incomplete rows. 
#> 
#> Operations:
#> 
#> Unknown factor level assignment for species, island, sex [trained]
#> Centering and scaling for bill_length_mm, bill_depth_mm, flipper_length_m... [trained]
#> Collapsing factor levels for species, island [trained]
#> Sparse, unbalanced variable filter removed no terms [trained]
#> Correlation filter removed no terms [trained]
#> Dummy variables from species, island [trained]