Split your data into a training and testing set — trainTestSplit • easytidymodels

Splits a data set into a training set and a testing set. Also returns bootstrap resamples of the training set.

trainTestSplit(
  data = df,
  splitAmt = 0.8,
  timeDependent = FALSE,
  responseVar = "nameOfResponseVar",
  stratifyOnResponse = FALSE,
  numberOfBootstrapSamples = 25
)

Arguments

data: The data set of interest.
splitAmt: Numeric. The proportion of the data to place in the training set. Default is 0.8.
timeDependent: Logical. Is your data time-dependent? If so, set TRUE.
responseVar: Character. Name of the response variable in the analysis.
stratifyOnResponse: Logical. Should the training and testing splits be stratified based on the response? If so, set TRUE.
numberOfBootstrapSamples: Numeric. How many bootstrap samples do you want? Default is 25.
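
The arguments above could map onto rsample roughly as in the sketch below. This is an assumption made for illustration, not the package's actual source: sketchSplit is a hypothetical helper, and the choice of initial_time_split(), initial_split() and bootstraps() is inferred from the argument descriptions.

```r
library(rsample)

# Hypothetical helper, NOT part of easytidymodels: one plausible way the
# documented arguments could translate into rsample calls.
sketchSplit <- function(data,
                        splitAmt = 0.8,
                        timeDependent = FALSE,
                        responseVar = NULL,
                        stratifyOnResponse = FALSE,
                        numberOfBootstrapSamples = 25) {
  if (timeDependent) {
    # Preserve row order: the first splitAmt of the rows become training data.
    split <- initial_time_split(data, prop = splitAmt)
  } else if (stratifyOnResponse) {
    # Random split that keeps the response distribution similar in both sets.
    split <- initial_split(data, prop = splitAmt,
                           strata = tidyselect::all_of(responseVar))
  } else {
    # Plain random split.
    split <- initial_split(data, prop = splitAmt)
  }

  train <- training(split)
  test <- testing(split)

  # Bootstrap resamples are drawn from the training set only.
  boot <- bootstraps(train, times = numberOfBootstrapSamples)

  list(train = train, test = test, boot = boot, split = split)
}

set.seed(1)
res <- sketchSplit(mtcars, timeDependent = TRUE)
names(res)
```

With timeDependent = TRUE no shuffling takes place, which is why stratification is not applied in that branch of the sketch.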

Value

A list with four components: train is the training set, test is the testing set, boot is a set of bootstrap resamples of the training set, and split is the rsample split object that records how the original data set was divided.
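
To illustrate how the split component relates to train and test, here is a minimal sketch that assumes split is a standard rsample split object. It uses mtcars and a hand-made split as a stand-in, so it runs without easytidymodels; splitObj is a hypothetical name.

```r
library(rsample)

set.seed(42)
# Stand-in for the `split` component: an rsample split of a built-in data set.
splitObj <- initial_split(mtcars, prop = 0.8)

train <- training(splitObj)  # plays the role of the `train` component
test <- testing(splitObj)    # plays the role of the `test` component

# Every row of the original data lands in exactly one of the two sets.
nrow(train) + nrow(test) == nrow(mtcars)
```

Keeping the split object around is useful because functions that expect an rsample split can rebuild both sets from it.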

Examples

library(easytidymodels)
library(dplyr)
utils::data(penguins, package = "modeldata")
resp <- "sex"
split <- trainTestSplit(penguins, stratifyOnResponse = TRUE, responseVar = resp)
#Training data
split$train
#> # A tibble: 275 x 7
#>    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
#>  1 Adelie  Torgersen           39.5          17.4               186        3800
#>  2 Adelie  Torgersen           40.3          18                 195        3250
#>  3 Adelie  Torgersen           NA            NA                  NA          NA
#>  4 Adelie  Torgersen           36.7          19.3               193        3450
#>  5 Adelie  Torgersen           38.9          17.8               181        3625
#>  6 Adelie  Torgersen           42            20.2               190        4250
#>  7 Adelie  Torgersen           41.1          17.6               182        3200
#>  8 Adelie  Torgersen           36.6          17.8               185        3700
#>  9 Adelie  Torgersen           38.7          19                 195        3450
#> 10 Adelie  Torgersen           34.4          18.4               184        3325
#> # ... with 265 more rows, and 1 more variable: sex <fct>

#Testing data
split$test
#> # A tibble: 69 x 7
#>    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
#>  1 Adelie  Torgersen           39.1          18.7               181        3750
#>  2 Adelie  Torgersen           37.8          17.3               180        3700
#>  3 Adelie  Biscoe              38.8          17.2               180        3800
#>  4 Adelie  Biscoe              40.5          17.9               187        3200
#>  5 Adelie  Dream               39.5          17.8               188        3300
#>  6 Adelie  Dream               39.2          21.1               196        4150
#>  7 Adelie  Dream               38.8          20                 190        3950
#>  8 Adelie  Dream               36.5          18                 182        3150
#>  9 Adelie  Dream               44.1          19.7               196        4400
#> 10 Adelie  Dream               39.6          18.8               190        4600
#> # ... with 59 more rows, and 1 more variable: sex <fct>

#Bootstrapped data
split$boot
#> # Bootstrap sampling 
#> # A tibble: 25 x 2
#>    splits            id         
#>    <list>            <chr>      
#>  1 <split [275/100]> Bootstrap01
#>  2 <split [275/94]>  Bootstrap02
#>  3 <split [275/105]> Bootstrap03
#>  4 <split [275/98]>  Bootstrap04
#>  5 <split [275/108]> Bootstrap05
#>  6 <split [275/98]>  Bootstrap06
#>  7 <split [275/108]> Bootstrap07
#>  8 <split [275/100]> Bootstrap08
#>  9 <split [275/103]> Bootstrap09
#> 10 <split [275/94]>  Bootstrap10
#> # ... with 15 more rows

# Split object (useful to keep if you want to do model stacking later)
split$split
#> <Analysis/Assess/Total>
#> <275/69/344>
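
Each row of the bootstrap object holds one resample. The sketch below shows one way to pull the in-bag and out-of-bag rows out of a single resample with rsample's analysis() and assessment(). It uses mtcars as a stand-in so that it runs on its own; with the example above, boot would be split$boot.

```r
library(rsample)

set.seed(123)
# Stand-in for split$boot: 25 bootstrap resamples of a small built-in data set.
boot <- bootstraps(mtcars, times = 25)

firstResample <- boot$splits[[1]]

# Rows drawn with replacement; same number of rows as the input data.
inBag <- analysis(firstResample)

# Rows that were never drawn in this resample.
outOfBag <- assessment(firstResample)

nrow(inBag)
nrow(outOfBag)
```

The two numbers correspond to the pair shown in each printed split, for example the 275 and 100 in [275/100].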