Use tidymodels with weighted and unweighted data

Use case

The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles. The get started case study helps to take the first steps. Another helpful source is lesson 10 of an R tutorial from a data mining course at George Mason University.

Building on these basics, my next step is to apply frequency weights when estimating a linear regression model in the tidymodels way of coding. However, this blog post shows that this is a feature under development and therefore some of my first attempts to create a reproducible example failed.

The tidymodels how-to add case weights to a workflow gives some examples with code that helps to crack the case. Below I give the code for two reproducible examples, one example of model estimation without using weights and one with using weights.

Data and method

The models that I estimate are linear regression models with a set of predictors and one numeric outcome variable. The parameters of this model are estimated by ordinary least squares.

I use the car_prices data set for the examples and try to predict the car prices with the care brands as predictors. Note that, as a consequence, in my examples the outcome variable is non-negative and the predictors are mutually exclusive (0/1) dummy variables. This makes the examples easy to understand, but the code may apply to a wider range of variables nonetheless. I use mileage as the weighting variable.

Let us start with loading the data into memory.

# Load library for the recipe. parsnip, workflow and hardhat packages, along with the rest of tidymodels
library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.1      ✔ recipes      1.0.1 
✔ dials        1.0.0      ✔ rsample      1.1.0 
✔ dplyr        1.0.10     ✔ tibble       3.1.8 
✔ ggplot2      3.3.6      ✔ tidyr        1.2.0 
✔ infer        1.0.3      ✔ tune         1.0.0 
✔ modeldata    1.0.0      ✔ workflows    1.0.0 
✔ parsnip      1.0.1      ✔ workflowsets 1.0.0 
✔ purrr        0.3.4      ✔ yardstick    1.0.0

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()
• Dig deeper into tidy modeling with R at https://www.tmwr.org

Now we select only the relevant variables. Although the weights are not yet used in the first example, mileage is already defined as the weighting variable in the data set.

# Create a data set with one non-negative continuous variable and uncorrelated dummy variables as predictors
db <- select(car_prices, Price, Buick:Saturn, Mileage) %>% 
  mutate(Mileage = frequency_weights(Mileage))
str(db)

tibble [804 × 8] (S3: tbl_df/tbl/data.frame)
 $ Price   : num [1:804] 22661 21725 29143 30732 33359 ...
 $ Buick   : int [1:804] 1 0 0 0 0 0 0 0 0 0 ...
 $ Cadillac: int [1:804] 0 0 0 0 0 0 0 0 0 0 ...
 $ Chevy   : int [1:804] 0 1 0 0 0 0 0 0 0 0 ...
 $ Pontiac : int [1:804] 0 0 0 0 0 0 0 0 0 0 ...
 $ Saab    : int [1:804] 0 0 1 1 1 1 1 1 1 1 ...
 $ Saturn  : int [1:804] 0 0 0 0 0 0 0 0 0 0 ...
 $ Mileage : freq_wts [1:804] 20105, 13457, 31655, 22479, 17590, 23635, 17381, 2755...

Example 1: linear regression without weights

Now on with the first example. In the code below we define the recipe, define the model and set mode and engine. These are combined into a workflow. Afterwards we look at the properties of these objects to check if these are as expected. Note that Saturn is the reference dummy variable of my choice (i.e. in effect its coefficient is set to zero by default) and is thus excluded from the regression.

# Get data ready for modeling with recipe package
recipe1 <-
  db %>% 
  recipe(Price ~ 1 + Buick + Cadillac + Chevy + Pontiac + Saab) # add all dummy variables but one

# Define model, mode and engine with parsnip package
model1 <-
  linear_reg() %>% # adds the basic model type
  set_engine('lm') %>% # adds the computational engine to estimate the model parameters
  set_mode('regression') # adds the modeling context in which it will be used

# Bundle pre-processing, modeling, and post-processing with workflow package
workflow1 <-
  workflow() %>%
  add_recipe(recipe1) %>%
  add_model(model1)

# View object properties
recipe1

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          5

model1

Linear Regression Model Specification (regression)

Computational engine: lm

workflow1

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm

Now that the objects look alright, the model estimation can be performed and the parameter estimates are printed.

# Now estimate the model via a single call to fit()
fit1 <- fit(workflow1, data = db)

# View fit1 properties
tidy(fit1)

# A tibble: 6 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   13979.      763.     18.3  7.62e- 63
2 Buick          6836.     1009.      6.77 2.46e- 11
3 Cadillac      26958.     1009.     26.7  9.16e-113
4 Chevy          2449.      832.      2.94 3.32e-  3
5 Pontiac        4433.      903.      4.91 1.10e-  6
6 Saab          15516.      943.     16.5  1.26e- 52

glance(fit1)

# A tibble: 1 × 12
  r.squared adj.r.s…¹ sigma stati…²   p.value    df logLik    AIC    BIC devia…³
      <dbl>     <dbl> <dbl>   <dbl>     <dbl> <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
1     0.645     0.642 5911.    290. 1.53e-176     5 -8120. 16254. 16287. 2.79e10
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
#   variable names ¹adj.r.squared, ²statistic, ³deviance

Example 2: linear regression with weights

Then come the weights. The first thought is to update the current workflow with a line of code to make clear that weights should be used. However, this approach does not produce the desired result.

Therefore, an alternative approach is followed. Instead of building upon the blocks of the first example, we start with a new workflow() object and add an add_case_weights line of code to it. Next, one would expect a line of code with an add_recipe command, but for some reason this did not work after a “few” tries. Instead, we use add_formula with the regression formula as an argument. Lastly, surprisingly conventional, an add_model command is added.

workflow2 <-
  workflow() %>%
  add_case_weights(Mileage) %>%
  add_formula(Price ~ 1 + Buick + Cadillac + Chevy + Pontiac + Saab) %>%
  add_model(model1)

workflow2

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
Price ~ 1 + Buick + Cadillac + Chevy + Pontiac + Saab

── Case Weights ────────────────────────────────────────────────────────────────
Mileage

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm

Now the parameters are estimated with one line of code as follows.

fit2 <- fit(workflow2, db)

# View fit2 properties
tidy(fit2)

# A tibble: 6 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   13448.      694.     19.4  8.70e- 69
2 Buick          7006.      918.      7.63 6.52e- 14
3 Cadillac      26152.      933.     28.0  7.96e-121
4 Chevy          2452.      759.      3.23 1.28e-  3
5 Pontiac        4647.      828.      5.61 2.73e-  8
6 Saab          15349.      853.     18.0  5.72e- 61

glance(fit2)

# A tibble: 1 × 12
  r.squared adj.r.…¹  sigma stati…²   p.value    df logLik    AIC    BIC devia…³
      <dbl>    <dbl>  <dbl>   <dbl>     <dbl> <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
1     0.664    0.661 7.67e5    315. 5.90e-186     5 -8111. 16237. 16269. 4.70e14
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
#   variable names ¹adj.r.squared, ²statistic, ³deviance

This is a nice first try! With the two examples above it is possible to experiment further in the hope of alternative/shorter routes to the estimation results. In the mean time, we wait for the tidymodels to include weights in the relevant packages. If you are inspired by these two examples (or not) and have some new ideas for progress, do not hesitate to give feedback to the Tidyverse developers.

Happy coding!

Citation

BibTeX citation:

@online{stam2022,
  author = {Stam, Piet},
  title = {Use Tidymodels with Weighted and Unweighted Data},
  date = {2022-09-19},
  url = {https://www.pietstam.nl/posts/2022-09-19-tidymodels-weighted-data/},
  langid = {en}
}

For attribution, please cite this work as:

Stam, Piet. 2022. “Use Tidymodels with Weighted and Unweighted Data.” September 19, 2022. https://www.pietstam.nl/posts/2022-09-19-tidymodels-weighted-data/.