Creates and returns a recipe object.

createRecipe(
  data = data,
  responseVar = "response",
  corrValue = 0.9,
  otherValue = 0.01
)

Arguments

data

The training data set.

responseVar

The name of the response variable for the analysis, given as a character string. Default is "response".

corrValue

The absolute correlation threshold used to remove highly correlated variables from the data set. The step tries to remove the minimum number of columns so that all resulting absolute correlations are less than this value. Default is 0.9.

otherValue

The minimum frequency a level of a factor variable must have to avoid being pooled into "other". Default is 0.01.
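
To make the two thresholds concrete, here is a small base R sketch of what they measure. It is only an illustration on made-up data (the `toy` data frame and its columns are hypothetical) and is not the code createRecipe runs internally.

```r
# Hypothetical toy data: b is nearly a copy of a, c is independent noise,
# and level "C" of island occurs in only 1 of 200 rows.
set.seed(1)
toy <- data.frame(a = rnorm(200))
toy$b <- toy$a + rnorm(200, sd = 0.1)
toy$c <- rnorm(200)
toy$island <- factor(c(rep("A", 120), rep("B", 79), "C"))

corr_value  <- 0.9   # same meaning as corrValue
other_value <- 0.01  # same meaning as otherValue

# Pairs of numeric columns whose absolute correlation exceeds corr_value.
abs_cor <- abs(stats::cor(toy[, c("a", "b", "c")]))
abs_cor[upper.tri(abs_cor, diag = TRUE)] <- NA  # keep each pair once
high_pairs <- which(abs_cor > corr_value, arr.ind = TRUE)

# Factor levels whose relative frequency falls below other_value.
level_freq  <- prop.table(table(toy$island))
rare_levels <- names(level_freq)[as.vector(level_freq) < other_value]

high_pairs
rare_levels
```

A correlation filter would need to drop one column from each pair in `high_pairs`, and the levels in `rare_levels` are the ones that would be pooled into "other".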

Value

A recipes::recipe object that has been prepped.

Details

The following data transformations are automatically applied to the data:

  • Normalizes numeric variables

  • Pools infrequent levels of categorical variables into an "other" category

  • Assigns missing (NA) values of categorical variables to an "unknown" category

  • Removes variables with near-zero variance

  • Removes highly correlated variables

  • One-hot encodes your categorical variables

If a different recipe is needed, I recommend building one directly with the recipes package so that it suits your data set. Preprocessing is hard to automate in a single function given the variety of transformations a specific data set may need.
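
As a starting point for a custom recipe, the transformations listed above could be written by hand with the recipes package roughly as follows. This is a sketch under assumptions: the step order follows the printed operations in the example output, the selectors (`all_nominal_predictors()`, `all_numeric_predictors()`) and the `one_hot = TRUE` setting are my choices, and it is not the actual source of createRecipe.

```r
library(recipes)

utils::data(penguins, package = "modeldata")

# Assumed hand-built equivalent; add, drop, or reorder steps as your data requires.
rec_manual <- recipe(sex ~ ., data = penguins)
# NA values in categorical predictors become an "unknown" level
rec_manual <- step_unknown(rec_manual, all_nominal_predictors())
# Center and scale numeric predictors
rec_manual <- step_normalize(rec_manual, all_numeric_predictors())
# Pool levels rarer than 1% into "other" (the role of otherValue)
rec_manual <- step_other(rec_manual, all_nominal_predictors(), threshold = 0.01)
# Drop predictors with near-zero variance
rec_manual <- step_nzv(rec_manual, all_predictors())
# Drop predictors until all absolute correlations are below 0.9 (the role of corrValue)
rec_manual <- step_corr(rec_manual, all_numeric_predictors(), threshold = 0.9)
# One-hot encode the categorical predictors
rec_manual <- step_dummy(rec_manual, all_nominal_predictors(), one_hot = TRUE)

# Estimate the steps on the training data, then apply them
rec_manual <- prep(rec_manual, training = penguins)
baked <- bake(rec_manual, new_data = penguins)
```

Writing the steps out one at a time makes it easy to change a single transformation without touching the rest.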

Examples

library(easytidymodels)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:testthat':
#> 
#>     matches
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
utils::data(penguins, package = "modeldata")
resp <- "sex"
split <- trainTestSplit(penguins, stratifyOnResponse = TRUE, responseVar = resp)
formula <- stats::as.formula(paste(resp, ".", sep="~"))
rec <- createRecipe(split$train, responseVar = resp)
#> Warning: The `preserve` argument of `step_dummy()` is deprecated as of recipes 0.1.16.
#> Please use the `keep_original_cols` argument instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
rec
#> Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor          6
#> 
#> Training data contained 274 data points and 8 incomplete rows. 
#> 
#> Operations:
#> 
#> Unknown factor level assignment for species, island, sex [trained]
#> Centering and scaling for bill_length_mm, bill_depth_mm, flipper_length_m... [trained]
#> Collapsing factor levels for species, island [trained]
#> Sparse, unbalanced variable filter removed no terms [trained]
#> Correlation filter removed no terms [trained]
#> Dummy variables from species, island [trained]
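
Because the returned recipe is already prepped, it can be applied to data with `recipes::bake()`. The sketch below repeats the setup from the example and passes non-default thresholds; the values 0.8 and 0.05 are arbitrary illustrations, and `train_baked` is a hypothetical name.

```r
library(easytidymodels)

utils::data(penguins, package = "modeldata")
resp <- "sex"
split <- trainTestSplit(penguins, stratifyOnResponse = TRUE, responseVar = resp)

# Stricter correlation filter and more aggressive pooling of rare levels
rec <- createRecipe(
  split$train,
  responseVar = resp,
  corrValue = 0.8,
  otherValue = 0.05
)

# Apply the trained preprocessing steps to the training data
train_baked <- recipes::bake(rec, new_data = split$train)
head(train_baked)
```

The same `bake()` call can be used on held-out data, so that it receives exactly the transformations estimated from the training set.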