Machine Learning for Scientific Research, Part 1

UCLA OARC Statistical Methods and Data Analytics

Overview

This two-part workshop introduces usage of machine learning methods for scientific research.

  • aimed at researchers with some background in classical regression modeling
  • no background in machine learning assumed
  • emphasizes methods for scientific inference rather than prediction

Part 1 discusses:

  • machine learning vs classical regression
  • training and test prediction error
  • hyperparameter tuning and cross-validation
  • Regularization, focused on LASSO
  • fitting LASSO models with glmnet
  • inference methods after LASSO variable selection

Part 2 discusses regression and classification trees and their extensions, including bagging, boosting, random forests and causal forests, with an emphasis on inference.

Machine learning vs classical statistical methods

Machine learning

Machine learning (ML) encompasses a wide array of computational methods that identify patterns in data to make accurate predictions and infer relationships among variables.

ML is a subclass of artificial intelligence: the machine learns patterns without human guidance.

Traditionally, ML has been used for accurate prediction in industry and engineering settings:

  • recommending products
  • predicting credit card fraud
  • detecting defects in manufacturing

Machine learning is increasingly used in the sciences

ML usage in the sciences has grown rapidly more recently:

  • bigger data
  • increasing computational power
  • development of inference methods

Below are figures showing increasing usage of ML methods in health sciences (left) and business, social sciences, and economics (right).

Number of articles using or discussing ML has grown rapidly over time across many health subfields

Image is Fig 1 from (Rajula et al., 2020).

Machine learning usage has rapidly grown in social sciences, economics, and business, especially more recently

Blue line is % of abstracts containing ML terms across fields

Image is Fig 1 from (Rahal et al., 2024).

Big data challenges

Data sets continue getting bigger:

  • ubiquity of data-recording devices (e.g. cell phones, sensors) and digital data collection
  • cheap, massive, and remotely accessible storage (e.g. cloud)

Digital tools enable measurement of thousands of variables efficiently.

These wide data sets suffer from the curse of dimensionality: relevant patterns become harder to find as the number of variables grows.

For example in genomics, we might wish to know which few genes out of 25,000 are most strongly related to variation in a trait.

Human vs machine learning

Compared to machines, humans often require much less data to learn.

  1. If we were to show the following pictures to a young child:

  1. Most children would be immediately able to solve this problem:

A computer might require thousands of training pictures to solve the problem.

Instead, computers excel at processing data, especially at large scales.

Thus, for scientific problems, ML is ideally suited for problems where:

  • the number of relationships to probe is overwhelming
  • large amounts of data to train the machine are available

Machine learning spectrum

For many problems (e.g.,genomics) we wish to explore the relationships of thousands of predictors (features in ML) with an outcome (label in ML).

A researcher can be easily overwhelmed deciding which predictors, interactions, non-linear effects, etc., to evaluate.

ML methods transfer the decision-making from humans to computers (algorithms).

Beam & Kohane, 2018, imagined classical regression and ML methods lying on a spectrum defined by how much human vs machine decision-making is required.

They examined and plotted the relationship between this spectrum and the amount of data used in various studies employing various methods.

As expected, studies using methods requiring more machine-driven decisions also used more data.

Image is Figure from (Beam & Kohane, 2018)

Human decisions are often model assumptions

Classical regression generally requires more human decision making, often in the form of assumptions:

  • assuming predictors have no relation to outcome by omission
    • including interactions
  • assuming linear relationships of included predictors
  • assuming outcome or errors follow a known parametric distribution

These assumptions simplify the model, reducing how many parameters are estimated and thus the amount of data required.

Thus, given sufficent data, machine learning methods are especially useful for scientific problems where regression assumptions are untenable or the researcher is overwhelmed by the number of decisions and assumptions to ponder.

Supervised learning and prediction error

Supervised learning

Like regression, supervised ML methods model the relationships between a set of predictors and an outcome.

Two primary goals:

  • prediction: given a set of predictor values, what is the predicted outcome?
    • when earthquakes will happen
    • who will develop a disease
    • traditionally the focus of ML
  • inference: understanding relationship between predictor and outcome
    • which predictors are related to the outcome?
    • if I vary a predictor, how does the outcome vary (e.g., treatment effect)?
    • what is the uncertainty in these relationships?
    • usual focus of scientific research
    • ML methods often leave inference inside the black box

Unsupervised methods find patterns among variables without respect to an outcome (e.g. k-means clustering, principal component analysis)

Prediction

Imagine we are interested in estimating the relationship, \(f\), between and outcome \(Y\) and a set of predictors \(X=X_1,X_2,...,X_p\).

\[Y=f(X)\]

For example, in linear regression, \(f\) is assumed to be a linear combination of a set of coefficients and \(X\):

\[\begin{align} Y&=f_{OLS}(X) + \epsilon \\ &=\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p + \epsilon \\ \end{align}\]

The error term \(\epsilon\) is not included in \(f\) because it does not depend on \(X\).

In many ML approaches, linear relationships are not assumed for \(f\).

With data, we can estimate these relationships, and the estimated \(\hat{f}\) can be used to make predictions of the outcome, \(\hat{Y}\).

\[\hat{Y} = \hat{f}(X)\]

Prediction error

Predictions \(\hat{Y}\) will rarely perfectly match observations \(Y\).

Prediction error quantifies the difference between \(\hat{Y}\) and \(Y\) and comes in many forms.

For continuous \(Y\), squared errors, \((Y-\hat{Y})^2\), are commonly used.

If we wish to predict \(Y\) in future data sets, we might want to know how much prediction error to expect:

\[E((Y-\hat{Y})^2) = [(f(X) - \hat{f}(X))^2] + Var(\epsilon)\]

where

  • \(E((Y-\hat{Y})^2)\) is the expected value of prediction error, or the long-run average squared error over many data sets
  • \([(f(X) - \hat{f}(X))^2]\) is the squared difference between predictions based on the true model, \(f(X)\), and predictions based on the estimated model, \(\hat{f}(X)\)
    • both predictions \(f(X)\) and \(\hat{f}(X)\) exclude the error term \(\epsilon\)
  • \(Var(\epsilon)\) is the variance of the unexplained errors

We will use other measures of prediction error for categorical outcomes.

Reducible error and irreducible error

In the expression for expected squared (prediction) error:

\[\begin{align} E((Y-\hat{Y})^2) &= [(f(X) - \hat{f}(X))^2] + Var(\epsilon) \\ &=\text{reducible error} + \text{irreducible error} \end{align}\]

  • \([(f(X) - \hat{f}(X))^2]\) represents reducible error
    • can be reduced, theoretically to zero, if all relevant predictors correctly modeled
    • omission of relevant predictors generally causes bias
    • including irrelevant predictors causes estimates to become unstable due to overfitting
  • \(Var(\epsilon)\) represents irreducible error, which cannot be reduced because we lack the explanatory predictors or because of random fluctuations.

We can thus optimize predicting the outcome by minimizing the reducible error.

Mean Squared Error

With sample data, we can estimate the expected squared error with Mean Squared Error (MSE):

\[\text{MSE}=\frac{1}{n}\sum_{i=1}^n(y-\hat{f}(x))^2\]

MSE is the averaged squared residual (estimate of error), \((y-\hat{f}(x))^2\).

The MSE for the data used to fit the model is called training MSE, \(\text{MSE}_{\text{train}}\).

Test MSE

A model that predicts well in future data sets will generally be more useful than one that only predicts well for the current sample data.

Thus, we generally care less about \(\text{MSE}_{\text{train}}\) and more about \(\text{MSE}_{\text{test}}\), or prediction error in future data sets:

\[\text{MSE}_{\text{test}}=\frac{1}{\tilde{n}}\sum_{i=1}^\tilde{n}(\tilde{y}-\hat{f}(\tilde{x}))^2\]

\(\hat{f}\) above is the model estimated with the training data.

Above, \(\tilde{y}\) and \(\tilde{x}\) represent outcomes and predictors in future data sets, respectively.

We use test error to refer generally to a measure of prediction error in future data.

Minimizing test MSE

Methods that minimize \(\text{MSE}_{\text{train}}\) (e.g. ordinary least squares) usually do not simultaneously minimize \(\text{MSE}_{\text{test}}\).

Models with more parameters have more flexibility to accommodate complex relationships.

\(\text{MSE}_{\text{train}}\) monotonically decreases with increased model flexibility (analogous to how \(R^2\) always increases with more predictors in linear regression).

However, \(\text{MSE}_{\text{test}}\) has a U-shaped relationship with flexibility, where at some point increasing flexibility will lead to increasing \(\text{MSE}_{\text{test}}\).

Simulation of \(\text{MSE}_{\text{train}}\) and \(\text{MSE}_{\text{test}}\) vs flexibility

The following simulation illustrates how \(\text{MSE}_{\text{train}}\) and \(\text{MSE}_{\text{test}}\) vary with model flexibility:

  • In the population, \(y\) has the following cubic relationship with \(x\):

\[y=x-x^2+.4x^3+\epsilon\]

  • we train linear regression models on a sample of \(n=30\)
  • flexibility of the models is determined by the degree of polynomial of \(x\) added to the model, ranging from 1 (just \(x\)) to 10 (all terms up to \(x^{10}\))
  • \(\text{MSE}_{\text{test}}\) is estimated at each polynomial degree by:
    • use the model fit to the sample data to predict outcomes in 20 future data sets (from the same population and each with \(n=30\))
    • use the predictions to calculate \(\text{MSE}_{\text{test}}\) in each future data set
    • average the 20 \(\text{MSE}_{\text{test}}\) values


The left plot below shows the training data and predictions from 3 different models (degree=1,3,10)

The right plot below shows how \(\text{MSE}_{\text{train}}\) and \(\text{MSE}_{\text{test}}\) vary with flexibility.

The added flexibility of additional polynomial terms allows the model to fit more closely to the training data.

However, the wiggliness of the maximally flexible model (degree=10) suggests overfitting.

\(\text{MSE}_{\text{train}}\) always decreases with flexibility.

\(\text{MSE}_{\text{test}}\) has the expected U-shaped relationship with flexibility, and is at a minimum at the true model (degree=3).

Figure inspired by Fig 2.9 of (James et al., 2013)

Bias-variance tradeoff

Why does \(\text{MSE}_{\text{test}}\) have a U-shaped relationship with flexibility?

The reducible error in future datasets has two components: bias and variance of \(\hat{f}\):

\[\begin{align}E(\text{MSE}_{\text{test}}) &= \text{reducible error} + \text{irreducible error} \\ &=[Bias(\hat{f}(\tilde{x}))^2 + Var(\hat{f}(\tilde{x})] + Var(\epsilon) \end{align}\]

  • \(E(\text{MSE}_{\text{test}})\) is the expected or average test MSE over many future datasets.
  • \(Bias(f(\tilde{x}))^2\) is the (squared) bias of \(\hat{f}\)
    • if the average of \(\hat{f}\) over many samples does not equal \(f\), \(\hat{f}\) is biased
    • more flexible models will tend to have less bias (by including all relevant predictors with correct functional forms)
  • \(Var(\hat{f}(\tilde{x}))\) is the variance of \(\hat{f}\), how much \(\hat{f}\) changes when repeatedly estimated
    • more flexible models will tend to have more variance because the extra parameters will be sensitive to small changes in the data

Thus, as models become more flexible, bias will decrease, but eventually, reductions in bias will be dominated by increases in variance.

The goal, then, is to find the model with the right amount of flexibility that balances bias and variance to minimize test MSE.

An extension of the previous polynomial simulation illustrates how bias and variance of \(\hat{f}\) vary with model flexibility:

  • Our target is bias and variance of the coefficient for the \(x\) linear term, \(\beta_{x}=1\)
  • Generate 500 training data sets of \(n=30\)
  • Train models with polynomial degrees=1,2,…,10 of \(x\) on the 500 data sets
  • Bias is estimated by subtracting the mean of the 500 estimated coefficients for the linear term, \(\overline{\hat{\beta}}_x\), from its true value of \(\beta_{x}=1\)
  • Variance is estimated by the variance of the 500 \(\hat{\beta}_x\)

Flexibility vs interpretability

Statistical approaches that are more flexible can improve prediction accuracy but tend to be less interpretable.

The additional complexity of flexible models obscures their interpretation (e.g. interactions and non-linear effects).

Classical regression models are less flexible but their parameters that have straightforward interpretations of additive changes in the outcome (or transformed outcome) per unit-change in the predictor.

Machine learning models are often non-parametric and predictor effects are often difficult to summarize.

Image inspired by Fig2.7 from (James et al., 2013)

Tuning model flexibility

Hyperparameters

The flexibility of ML methods is typically controlled by hyperparameters.

Hyperparameters are fixed before a model is fit, and the model is then fit conditionally on the hyperparameters’ values.

Across ML models, hyperparameters typically control

  • model complexity (e.g. LASSO, regression trees)
  • learning rate (e.g. boosting, neural networks)

The “best” hyperparameter value that minimizes test error cannot be estimated from a single model fit.

Instead, hyperparameters are tuned or optimized by estimating test error over a range of many hyperparameter values.

Cross-validation

Cross-validation is the most common method used to tune hyperparameters.

Cross-validation efficiently uses a single data set to train a model and estimate its test error (called validation if not using future data).

Cross-validation algorithm:

  • split the data randomly into \(K\) folds
  • then for each fold \(k\)
    • train the model on all folds but fold \(k\)
    • calculate predictions in fold \(k\) using the trained model to estimate test error
  • repeat for all folds
  • estimated test error is average test error across all \(k\) folds

Cross-validation thus allows us to estimate test error at each hyperparameter value and find the value that minimizes test error, with just one data set.

Imagine we wish to tune hyperparameter \(\lambda\) value over a range of 100 values, \(\lambda_1,\lambda_2,...,\lambda_{100}\)

The image below depicts 5-fold cross-validation to estimate test error for the first \(\lambda\) in the sequence, \(\lambda_1\)

How many folds?

With more data, the trained model will be less variable and on average closer to the true model, resulting in lower test error.

Cross-validation uses only a portion of the data to train the model (e.g. 90% with 10 folds) and thus test error will be generally over-estimated compared to using 100% of the data.

To avoid this over-estimation, \(K=n\) folds, where each fold is a single observation, known as leave-one-out cross-validation, might seem appealing. However, the estimated test errors become highly variable because they are based on single observations.

For most applications, \(k=5\) and \(k=10\) have been found to perform well.

Generally, with smaller data, more folds might be preferred so that more data is used to train.

Do not calucate test error on trained data

Statistical methods fit models by minimizing training error.

Thus, test error will be generally underestimated if calculated using any observations used to train the model.

Thus, any sort of training should be avoided before calculating test error at any step, including cross-validation.

“Training” includes not only fitting models but also selecting predictors based on their relationship with the outcome (e.g. using correlation thresholds or a forward stepwise procedure).

LASSO and regularization

Highly variable models

Regression parameter estimates will vary greatly across repeated samples when:

  • number of predictors is large relative to sample size
  • only a few predictors are actually predictive (sparsity)
  • predictors are highly correlated

More variable models will lead to higher test error, so methods less flexible than regression can help.

Regularized models task the machine with using the data to reduce the flexibility of the model, for example, by removing irrelevant predictors.

Regularization

A model’s loss function is minimized to produce the “best” parameter estimates:

  • sum of squared errors (SSE) minimized in linear regression

\[LOSS_{ols}(\beta) = \sum_{i=1}^n\left(y_i -\beta_0 - \sum_{j=1}^pX_{ij}\beta_j \right)^2\]

  • negative likelihood minimized (equivalent to likelihood being maximized) for generalized linear models (e.g logistic and poisson regression)

Regularization methods impose a penalty term on to a model’s loss function to limit complexity and discourage overfitting.

The penalty term:

  • shrinks parameter estimates towards zero
  • induces bias but also reduces variance
  • as a hyperparameter can be tuned via cross-validation

LASSO

Least Absolute Shrinkage and Selection Operator (LASSO) is a popular ML regularization method adapted for several regression models.

The LASSO loss function, \(LOSS_{lasso}(\beta)\), adds a penalty to the sum of the absolute values of the \(p\) parameter estimates:

\[Loss_{lasso}(\beta) = Loss_{reg}(\beta) + \lambda\sum_{j=1}^p|\beta_j| \]

where

  • \(Loss_{reg}(\beta)\) is the loss function of the unregularized regression model
    • e.g. sum of squared errors, negative log likelihood
  • \(\lambda\) is the penalty hyperparameter
    • larger values penalize more
  • \(\sum_{j=1}^p|\beta_j|\) is the sum of the absolute values of all model coefficients
    • each predictor \(X_j\) is standardized so that the magnitude of its coefficient \(\beta_j\) does not depend on \(X_j\)’s scaling

For example, the loss function for LASSO for linear regression adds the penalty term to the sum of squared errors (loss function for linear regression):

\[\begin{align} LOSS_{lasso.ols}(\beta) &= \sum_{i=1}^n\left(y_i -\beta_0 - \sum_{j=1}^pX_{ij}\beta_j \right)^2 + \lambda\sum_{j=1}^p|\beta_j| \\ &= \text{SSE} + \text{penalty} \end{align}\]

For a given \(\lambda\), the “best” set of estimates \(\hat{\beta}\) minimizes \(Loss_{lasso}\)

Note: The loss function for LASSO linear regression can be equivalently minimized with SSE multiplied by \(\frac{1}{2n}\), as in the glmnet package, or by \(\frac{1}{2}\), as in the selectiveInference package. The same set of \(\hat{\beta}\) will be estimated, but the scaling of \(\lambda\) will be different.

LASSO as variable selection

The LASSO penalty can potentially shrink coefficients to zero if \(\lambda\) is large enough.

Predictors with zero coefficients are selected out of the model.

Thus, LASSO is often used for automated variable selection.

Coefficient path plots track how coefficients grow as \(\lambda\) and thus the LASSO penalty decreases.

The plot above tracks 10 coefficients against \(\lambda\)

  • each line represents a coefficient
  • the bottom x-axis is the negative of the log of \(\lambda\), which means \(\lambda\) decreases as we move to right
  • the y-axis the estimate of the coefficient
  • the top x-axis is how many non-zero coefficients are in the LASSO-regularized model
  • we see how coefficients emerge from zero as \(\lambda\) decreases

LASSO-regularized generalized linear models

The LASSO penalty can be added to generalized linear models (GLMs) that do not assume normally distributed errors, such as logistic, Poisson, multinomial, and Cox regression models.

Generally, the loss function to be minimized for LASSO-regularized GLMs will use the negative log likelihood:

\[LOSS_{lasso.glm}(\beta) = \text{negative log likelihood} + \text{LASSO penalty}\]

For example, for logistic regression of binary (0/1) outcome \(y_i\), the loss function can be written as:

\[LOSS_{lasso.logistic}(\beta) = -\frac{1}{n}\sum_{i=1}^n\left(y_i\Big(\beta_0 + \sum_{j=1}^pX_{ij}\beta_j\Big)-log\Big(1+e^{\beta_0 + \sum_{j=1}^pX_{ij}\beta_j}\Big)\right)+\lambda\sum_{j=1}^p|\beta_j|\]

Henceforth, we will only be looking at LASSO-regularized linear regression.

Ridge regression and elastic net regularization

Two regularization methods similar to LASSO are ridge regression and elastic net.

Ridge regression uses a different penalty based on the squares of the coefficients.

\[Loss_{ridge}(\beta) = Loss_{reg}(\beta) + \lambda\sum_{j=1}^p\beta_j^2\] \(\lambda\) again serves as a penalty parameter that can be tuned via cross-validation.

Unlike LASSO, ridge regression does not shrink coefficients to zero (unless \(\lambda=\infty\)), so might be a better choice if all predictors are expected to have non-zero coefficients.

Elastic net combines the LASSO and ridge penalties, with an additional mixing parameter \(\alpha\), \(0\le\alpha\le1\), which controls how much of each penalty to use:

\[Loss_{elastic}(\beta) = Loss_{reg}(\beta) + \lambda\left(\alpha \sum_{j=1}^p|\beta_j| + [1-\alpha] \sum_{j=1}^p\beta_j^2 \right)\]

  • when \(\alpha=1\), only the LASSO penalty is used
  • when \(\alpha=0\), only the ridge penalty is used
  • when \(0<\alpha<1\), both penalties are used, and some coefficients may shrink to zero, though less so than when \(\alpha=1\)

For the remainder of the seminar we focus on LASSO, because most inference methods for regularized methods have been developed for LASSO

The glmnet package for LASSO

glmnet package

The glmnet package (Friedman et al., 2010):

  • performs LASSO and related regularization methods (ridge regression and elastic net)
  • can regularize generalized linear models (e.g. linear, logistic, Poisson, Cox)
  • can tune hyperparameters via cross-validation
  • is optimized for speed, so very fast
#load glmnet package
library(glmnet)

LASSO data

We demonstrate usage of the glmnet package for LASSO-regularized models with the following simulated data:

  • \(n=200\)
  • outcome: \(y\)
  • 110 predictors: \(x1-x10\), \(z1-z100\)
    • each \(x\) has coefficient 1
    • each \(z\) has coefficient 0
    • all 110 predictors are correlated at \(\rho=.2\) with each other
  • error variance \(\sigma_e^2=1\)
# load simulated data to demonstrate glmnet LASSO
d_lasso <- read.csv("https://stats.oarc.ucla.edu/wp-content/uploads/2025/08/lasso_sim.csv")
names(d_lasso) # column names
  [1] "y"    "x1"   "x2"   "x3"   "x4"   "x5"   "x6"   "x7"   "x8"   "x9"  
 [11] "x10"  "z1"   "z2"   "z3"   "z4"   "z5"   "z6"   "z7"   "z8"   "z9"  
 [21] "z10"  "z11"  "z12"  "z13"  "z14"  "z15"  "z16"  "z17"  "z18"  "z19" 
 [31] "z20"  "z21"  "z22"  "z23"  "z24"  "z25"  "z26"  "z27"  "z28"  "z29" 
 [41] "z30"  "z31"  "z32"  "z33"  "z34"  "z35"  "z36"  "z37"  "z38"  "z39" 
 [51] "z40"  "z41"  "z42"  "z43"  "z44"  "z45"  "z46"  "z47"  "z48"  "z49" 
 [61] "z50"  "z51"  "z52"  "z53"  "z54"  "z55"  "z56"  "z57"  "z58"  "z59" 
 [71] "z60"  "z61"  "z62"  "z63"  "z64"  "z65"  "z66"  "z67"  "z68"  "z69" 
 [81] "z70"  "z71"  "z72"  "z73"  "z74"  "z75"  "z76"  "z77"  "z78"  "z79" 
 [91] "z80"  "z81"  "z82"  "z83"  "z84"  "z85"  "z86"  "z87"  "z88"  "z89" 
[101] "z90"  "z91"  "z92"  "z93"  "z94"  "z95"  "z96"  "z97"  "z98"  "z99" 
[111] "z100"

The glmnet package works best with the predictors and outcome stored as separate matrix objects (or vector if a single column):

y <- as.vector(d_lasso[,1]) # convert first column, y, to vector 
xz <- as.matrix(d_lasso[,2:111]) # convert columns 2:111 to matrix, first 10 columns are x vars

glmnet() for LASSO

The glmnet() function fits regularized generalized linear models, by default using LASSO.

Important arguments:

  • x=: predictor matrix
  • y=: outcome vector
  • family=: outcome/error distribution family (defaults to "gaussian")
  • alpha=: elastic net mixing parameter controlling how much of the ridge and LASSO penalties to use; defaults to alpha=1 for LASSO

glmnet() fits LASSO models across a sequence of \(\lambda\) values:

  • max \(\lambda\) is smallest \(\lambda\) that shrinks all coefficients to zero
  • min \(\lambda\) is max \(\lambda\) divided by 10,000 if \(n>p\), or divided by 100 if \(p>n\) (where more regularization is required)
  • 100 evenly spaced \(\lambda\) values chosen between min and max (use nlambda= to change number)
  • use lambda= to specify a custom sequence

We fit LASSO-regularized linear regressions for our simulated data across the sequence of \(\lambda\) values using glmnet():

# first 2 arguments are x= and y=
fit_lasso <- glmnet(xz, y)

Working with glmnet objects

Rather than extracting results directly from the fitted glmnet object, use these extractor functions:

  • plot(): coefficient path plot
  • print(): summary of model fit at each \(\lambda\)
  • coef(): coefficient estimates at specific \(\lambda\) value(s)
  • predict(): model-based predictions

Coefficient path plot

plot() used on a glmnet fitted object creates the coefficient path plot:

# coefficient path plot
plot(fit_lasso)

Remember that \(\lambda\) and the penalty decrease as we move to the right.

  • 10 non-zero coefficients emerge at relatively high penalties on the left, representing the 10 \(x\) variables (and end up near their values of 1)
  • the coefficients for the 100 \(z\) variables begin to emerge at lower penalties (but stay relatively closer the their true values of 0)
  • at the lowest \(\lambda\) value, 108 total coefficients are non-zero

Summary of model fit at each \(\lambda\)

Using print() on the fitted glmnet object (or just typing the object’s name) summarizes the model fit at each \(\lambda\) value.

In the table below,

  • Df: number of non-zero coefficients
  • %Dev: percent of null deviance explained, a measure model fit to the training data
    • null deviance measures difference in model fit between saturated model (perfectly predicts outcome) and null model (intercept-only)
      • \(Deviance_{null} = 2\times(LogLik_{saturated}-LogLik_{null})\)
    • \(\%Dev = \frac{Deviance_{null} - Deviance_{fitted}}{Deviance_{null}}\times100\)
    • %Dev thus measures how much the null deviance has been reduced by fitted model
  • Lambda: \(\lambda\) used to fit model

Note: if the number of rows in the summary table is less than 100 (or whatever nlambda= equals), glmnet() stopped early because %Dev was not changing enough as \(\lambda\) was decreasing

# summary of glmnet path (model fit at each lambda)
# same as print(fit_lasso)
# glmnet stops at 83rd lambda because %Dev did not change enough at 84th
fit_lasso

Call:  glmnet(x = xz, y = y) 

    Df  %Dev Lambda
1    0  0.00 3.6060
2    2  9.58 3.2850
3    7 21.09 2.9930
4    9 33.13 2.7280
5    9 43.64 2.4850
6   10 52.38 2.2640
7   10 60.00 2.0630
8   10 66.32 1.8800
9   10 71.56 1.7130
10  10 75.91 1.5610
11  10 79.53 1.4220
12  10 82.53 1.2960
13  10 85.02 1.1810
14  10 87.09 1.0760
15  10 88.80 0.9802
16  10 90.23 0.8931
17  10 91.41 0.8138
18  10 92.40 0.7415
19  10 93.21 0.6756
20  10 93.89 0.6156
21  10 94.45 0.5609
22  10 94.92 0.5111
23  10 95.30 0.4657
24  10 95.63 0.4243
25  10 95.89 0.3866
26  12 96.12 0.3523
27  13 96.32 0.3210
28  13 96.49 0.2925
29  13 96.63 0.2665
30  13 96.74 0.2428
31  13 96.84 0.2212
32  14 96.92 0.2016
33  16 96.99 0.1837
34  17 97.04 0.1674
35  19 97.14 0.1525
36  19 97.22 0.1389
37  19 97.28 0.1266
38  20 97.34 0.1154
39  21 97.39 0.1051
40  21 97.43 0.0958
41  22 97.46 0.0873
42  26 97.50 0.0795
43  30 97.55 0.0724
44  35 97.64 0.0660
45  37 97.72 0.0602
46  45 97.81 0.0548
47  47 97.90 0.0499
48  52 97.98 0.0455
49  54 98.06 0.0415
50  59 98.14 0.0378
51  61 98.22 0.0344
52  65 98.28 0.0314
53  70 98.34 0.0286
54  72 98.40 0.0260
55  73 98.44 0.0237
56  74 98.48 0.0216
57  78 98.51 0.0197
58  81 98.54 0.0180
59  83 98.57 0.0164
60  84 98.59 0.0149
61  90 98.61 0.0136
62  92 98.63 0.0124
63  94 98.65 0.0113
64  94 98.66 0.0103
65  94 98.68 0.0094
66  95 98.69 0.0085
67  97 98.70 0.0078
68  99 98.71 0.0071
69 100 98.71 0.0064
70 101 98.72 0.0059
71 104 98.72 0.0054
72 104 98.73 0.0049
73 103 98.73 0.0044
74 103 98.74 0.0040
75 104 98.74 0.0037
76 106 98.74 0.0034
77 106 98.75 0.0031
78 106 98.75 0.0028
79 107 98.75 0.0025
80 108 98.75 0.0023
81 108 98.75 0.0021
82 108 98.75 0.0019
83 108 98.76 0.0018

Estimated coefficients

Use coef() on a glmnet fitted object to see the estimated coefficients at specific \(\lambda\) values.

Use s= to specify a specific \(\lambda\) value.

In the previous summary table:

  • 2 coefficients are non-zero at the second \(\lambda\) in the sequence, \(\lambda_2\approx3.285\)
  • 10 coefficients are non-zero at the 10th \(\lambda\), \(\lambda_{10}\approx1.561\)
  • 59 coefficients are non-zero at the 50th \(\lambda\), \(\lambda_{50}\approx0.0378\)

Notice in the output below:

  • dots represent zero coefficients
  • coefficients grow as \(\lambda\) decreases
  • coefficients for the 10 \(x\) predictors are non-zero before coefficients for any of the \(z\) predictors
# estimated coefficients at 2nd, 10th, 50th lambda in sequence
# use lambda in fitted object for exact lambda values
coef(fit_lasso, s=fit_lasso$lambda[c(2,10,50)])
111 x 3 sparse Matrix of class "dgCMatrix"
            s=3.28532713 s=1.56079655  s=0.03777328
(Intercept)   -0.2774453   -0.1540268 -0.0788428868
x1             .            0.5581107  1.0833645939
x2             .            0.4656223  0.9903312049
x3             .            0.5624802  1.0888907504
x4             .            0.4084946  0.8699055576
x5             .            0.5186582  0.9046030148
x6             .            0.6403637  1.0915202716
x7             0.2475144    0.7299907  1.0450388127
x8             .            0.4786788  0.9035429299
x9             0.2443391    0.6795551  0.9668759598
x10            .            0.3266957  0.9809235209
z1             .            .          .           
z2             .            .          .           
z3             .            .         -0.0206654766
z4             .            .          .           
z5             .            .         -0.0066959227
z6             .            .          .           
z7             .            .          .           
z8             .            .          .           
z9             .            .         -0.0244619431
z10            .            .          .           
z11            .            .         -0.0016520651
z12            .            .          .           
z13            .            .         -0.0839619427
z14            .            .          .           
z15            .            .          .           
z16            .            .          .           
z17            .            .          .           
z18            .            .         -0.1062617174
z19            .            .          0.0676161596
z20            .            .          0.0596107857
z21            .            .          0.0839758311
z22            .            .         -0.0552335796
z23            .            .          .           
z24            .            .          .           
z25            .            .          .           
z26            .            .         -0.0209694844
z27            .            .          0.1475372478
z28            .            .          .           
z29            .            .          .           
z30            .            .          .           
z31            .            .         -0.0120957621
z32            .            .          .           
z33            .            .          .           
z34            .            .          0.0345114284
z35            .            .          .           
z36            .            .          .           
z37            .            .         -0.0408881624
z38            .            .          .           
z39            .            .         -0.0023945306
z40            .            .         -0.0135155559
z41            .            .          .           
z42            .            .          .           
z43            .            .          0.0501765464
z44            .            .          0.0578256815
z45            .            .         -0.0682585395
z46            .            .         -0.0033829427
z47            .            .          0.0458979268
z48            .            .          .           
z49            .            .         -0.0235085731
z50            .            .          .           
z51            .            .          .           
z52            .            .         -0.0667957449
z53            .            .          .           
z54            .            .          0.0002443637
z55            .            .          .           
z56            .            .          .           
z57            .            .          0.0242324411
z58            .            .          0.0308628212
z59            .            .          .           
z60            .            .          0.0353664237
z61            .            .          .           
z62            .            .          0.0367126979
z63            .            .          .           
z64            .            .          .           
z65            .            .          0.0902023966
z66            .            .          .           
z67            .            .         -0.1160661948
z68            .            .          .           
z69            .            .          .           
z70            .            .          0.0817966076
z71            .            .          .           
z72            .            .          0.0021532024
z73            .            .          0.0424245272
z74            .            .          .           
z75            .            .          .           
z76            .            .          .           
z77            .            .         -0.0187693362
z78            .            .          .           
z79            .            .          0.1060738780
z80            .            .         -0.2900691069
z81            .            .          .           
z82            .            .          .           
z83            .            .         -0.0577672389
z84            .            .          0.0453429773
z85            .            .          .           
z86            .            .          .           
z87            .            .          0.0229062192
z88            .            .         -0.0539264755
z89            .            .          .           
z90            .            .          0.0169112693
z91            .            .          .           
z92            .            .         -0.0321959400
z93            .            .         -0.0416651462
z94            .            .         -0.0088839342
z95            .            .          0.0443730370
z96            .            .          0.1666781834
z97            .            .          .           
z98            .            .          .           
z99            .            .          0.0393199615
z100           .            .         -0.0778771951

Model predictions

Use predict() on a fitted glmnet object to estimate predicted outcomes at specific \(\lambda\) values.

A data set of predictors must be supplied to predict() with newx=.

# training data outcome predictions at three lambda values
y_pred <- predict(fit_lasso, newx=xz, s=fit_lasso$lambda[c(2,10,50)])

# first 10 rows of predictions at each lambda
head(y_pred, n=10)
      s=3.28532713 s=1.56079655 s=0.03777328
 [1,]  -0.27727004   -2.2292917   -4.5230702
 [2,]  -0.29023088   -2.3982307   -4.1869625
 [3,]  -0.48639430   -0.5071706   -0.9317479
 [4,]  -0.01351316    0.4239464    0.6954105
 [5,]  -0.45602893   -2.2999218   -3.9897567
 [6,]   0.09896796    4.1854244    7.4381806
 [7,]  -0.17889600   -3.0588763   -6.3669035
 [8,]  -0.17626043    0.8637251    1.9385714
 [9,]  -0.24148140   -0.4175899   -1.9661407
[10,]  -0.09918585    2.4069175    4.1820346

Below we plot the observed outcome (black) and the predicted outcome (red) at the 3 \(\lambda\) values.

  • at higher \(\lambda\) penalties, the predictions hover close to the intercept because only 2 coefficients are non-zero and are themselves shrunken towards zero
  • as the \(\lambda\) penalty decreases, the predictions move closer to the outcome
  • the right-most model is overfitted, as only 10 predictors should have non-zero coefficients rather than 59
# plot observed and predicted outcomes for models with 3 lambda values
#  each plot() below plots the observed data in black
#  each points() below plots the predictions
par(mfrow=c(1,3)) # set up plotting area for 3 plots in one row
obs <- 1:200 # observation index (row in data set) is x-axis
plot(obs, y, main=expression(lambda[2]%~~%3.285), xlab="observation index")
points(obs, y_pred[,1], col="red")
plot(obs, y, main=expression(lambda[10]%~~%1.561), xlab="observation index")
points(obs, y_pred[,2], col="red")
plot(obs, y, main=expression(lambda[50]%~~%0.378), xlab="observation index")
points(obs, y_pred[,3], col="red")

par(mfrow=c(1,1)) # restore plotting area to one plot per row

Predictions can be made for new data specified in newx=.

Below we get predictions for \(x1=-3,-2,...,3\), the range of \(x1\) observed in the data, while holding all other 109 predictors at 0, at the same 3 \(\lambda\) values we used above.

# create new predictor matrix
xz_new <- matrix(rep(0, 7*110), nrow=7) # start with a 7X110 matrix of zeroes 
xz_new[,1] <- -3:3 # make first column (x1) -3 through 3

y_pred_x1 <- predict(fit_lasso, newx=xz_new, s=fit_lasso$lambda[c(2,10,50)])
y_pred_x1
     s=3.28532713 s=1.56079655 s=0.03777328
[1,]   -0.2774453   -1.8283590  -3.32893667
[2,]   -0.2774453   -1.2702483  -2.24557207
[3,]   -0.2774453   -0.7121375  -1.16220748
[4,]   -0.2774453   -0.1540268  -0.07884289
[5,]   -0.2774453    0.4040839   1.00452171
[6,]   -0.2774453    0.9621946   2.08788630
[7,]   -0.2774453    1.5203053   3.17125090

A plot of the predictions against \(x1=1,...10\) shows how its coefficient starts at 0 at high \(\lambda\) and grows as \(\lambda\) decreases:

# plot predicted y (red) across x1=1:10, for models with 3 lambda values
#  each plot() below plots the observed y vs observed x1 in black
#  each points() below plots predicted y vs observed x1 in red
par(mfrow=c(1,3)) # set up plotting area for 3 plots in one row
plot(xz[,1], y, main=expression(paste(lambda[2]%~~%3.285, ", ", beta[1]==0)), xlab="x1")
points(xz_new[,1], y_pred_x1[,1], col="red", type="b")
plot(xz[,1], y, main=expression(paste(lambda[10]%~~%1.561, ", ", beta[1]==0.558)), xlab="x1")
points(xz_new[,1], y_pred_x1[,2], col="red", type="b")
plot(xz[,1], y, main=expression(paste(lambda[50]%~~%0.378, ", ", beta[1]==1.083)), xlab="x1")
points(xz_new[,1], y_pred_x1[,3], col="red", type="b")

par(mfrow=c(1,1)) # restore plotting area to one plot per row

Cross-validation with cv.glmnet()

Generally, a good value of \(\lambda\) to use is not known a priori, so it is tuned via \(K\)-fold cross-validation with cv.glmnet().

Before splitting the data, glmnet() is run on all the training data to determine the \(\lambda\) sequence.

Then, cv.glmnet() splits the data into 10 folds (by default, can be changed with nfolds=), and for each fold:

  • trains models across the \(\lambda\) sequence with glmnet() using data in all other folds
  • estimates test error across the \(\lambda\) sequence in the current fold

After progressing through each fold, the test error at each \(\lambda\) is averaged over the \(K\) folds.

For linear regressions, the test error measure is MSE, but can be changed with type=.

Let’s use cv.glmnet() to tune \(\lambda\)

# cross-validate to tune lambda
set.seed(1328) # set random seed to replicate results
cv_fit_lasso <- cv.glmnet(xz, y) 

Working with the cv.glmnet object

Using plot() on a cv.glmnet fitted object displays how the estimated test error varies with \(\lambda\).

# plot of -log(lambda) vs estimated MSE
plot(cv_fit_lasso)

In the plot above:

  • bottom x-axis is again the negative log of lambda, with penalty increasing left-to-right
  • top x-axis is again the number of non-zero coefficients
  • y-axis is MSE averaged across the 10-folds
  • the dotted lines indicate 2 important \(\lambda\) values
    • the right dotted line is "lambda.min", the \(\lambda\) value that produced the lowest MSE
    • the left dotted line is "lambda.1se", the largest \(\lambda\) that produces MSE within one standard deviation (of MSE) of the MSE at "lambda.min"
      • "lambda.1se" might be preferred to "lambda.min" if parsimony is critical
  • we see the typical U-shaped curve for flexibility vs test error, where adding predictors initially causes large reductions in MSE due to decreases in bias but eventually MSE increases as the model overfits


Using print() on a fitted cv.glmnet object provides a summary of the model at "lambda.min" and "lambda.1se".

  • Lambda is the \(\lambda\) value
  • Index is the position of the \(\lambda\) value in the sequence
  • Measure is the model fit statistic (here MSE) used to select "lambda.min" and "lambda.1se"
  • SE is the standard error of Measure
  • Nonzero is the number of non-zero coefficients
# summary of models at lambda.min and lambda.1se
cv_fit_lasso

Call:  cv.glmnet(x = xz, y = y) 

Measure: Mean-Squared Error 

     Lambda Index Measure     SE Nonzero
min 0.09577    40   1.173 0.1404      21
1se 0.20158    32   1.298 0.1591      14


Use coef() to view the non-zero coefficients at "lambda.min" and "lambda.1se" (only one s= value is allowed in coef() for cv.glmnet objects:

cbind(coef(cv_fit_lasso, s="lambda.min"), coef(cv_fit_lasso, s="lambda.1se"))
111 x 2 sparse Matrix of class "dgCMatrix"
              lambda.min   lambda.1se
(Intercept) -0.055875689 -0.059932829
x1           1.014179329  0.976740851
x2           0.972657853  0.941136158
x3           1.080458555  1.044742296
x4           0.868502542  0.850174522
x5           0.865512174  0.830586236
x6           1.054412159  1.049472385
x7           1.009644318  0.999929630
x8           0.899071153  0.897799862
x9           0.923509777  0.921528197
x10          0.955276944  0.920458581
z1           .            .          
z2           .            .          
z3           .            .          
z4           .            .          
z5           .            .          
z6           .            .          
z7           .            .          
z8           .            .          
z9           .            .          
z10          .            .          
z11          .            .          
z12          .            .          
z13          .            .          
z14          .            .          
z15          .            .          
z16          .            .          
z17          .            .          
z18          .            .          
z19          0.006675561  .          
z20          0.035252917  .          
z21          0.019965285  .          
z22          .            .          
z23          .            .          
z24          .            .          
z25          .            .          
z26          .            .          
z27          0.095711793  0.046942148
z28          .            .          
z29          .            .          
z30          .            .          
z31          .            .          
z32          .            .          
z33          .            .          
z34          .            .          
z35          .            .          
z36          .            .          
z37          .            .          
z38          .            .          
z39          .            .          
z40          .            .          
z41          .            .          
z42          .            .          
z43          .            .          
z44          .            .          
z45          .            .          
z46          .            .          
z47          0.021294076  .          
z48          .            .          
z49          .            .          
z50          .            .          
z51          .            .          
z52          .            .          
z53          .            .          
z54          .            .          
z55          .            .          
z56          .            .          
z57          0.006398966  .          
z58          .            .          
z59          .            .          
z60          .            .          
z61          .            .          
z62          .            .          
z63          .            .          
z64          .            .          
z65          0.046140567  0.029272534
z66          .            .          
z67          .            .          
z68          .            .          
z69          .            .          
z70          0.035666459  .          
z71          .            .          
z72          .            .          
z73          0.036471921  0.002720046
z74          .            .          
z75          .            .          
z76          .            .          
z77          .            .          
z78          .            .          
z79          .            .          
z80         -0.157849444  .          
z81          .            .          
z82          .            .          
z83          .            .          
z84          .            .          
z85          .            .          
z86          .            .          
z87          .            .          
z88          .            .          
z89          .            .          
z90          .            .          
z91          .            .          
z92          .            .          
z93          .            .          
z94          .            .          
z95          .            .          
z96          0.115250859  0.068722850
z97          .            .          
z98          .            .          
z99          .            .          
z100         .            .          

Now that we know how to tune \(\lambda\), we can proceed to inference

Inference after LASSO

Inference from machine learning models

Inference is generally more difficult with machine learning

  • many ML methods do not produce directly interpretable parameter estimates
  • complex interactions and non-linear relationships may be in the model
  • ML methods generally do not provide measures of uncertainty, preventing the construction of confidence intervals and test-statistics for p-values
  • inference should generally not be performed on the same data used to select a set of candidate predictors

Inference after LASSO is only generally complicated by the latter 2 concerns.

Several methods have been developed to perform inference for regularized model fits, with most focusing on the LASSO to address both concerns.

Post-selection inference

One often overlooked assumption of regression models is that the set of predictors is fixed before looking at their relationship with the outcome.

If we use data to select a “best” or “relevant” set of predictors and then use the same data for inference, this assumption is violated.

Common selection methods: forward/backward/stepwise selection, predictors whose correlation with the outcome surpasses a threshold, predictors whose coefficients are significantly different from zero when modeled alone, LASSO

Generally, ignoring the uncertainty of which predictors are relevant will cause underestimation of uncertainty in model coefficients, leading to confidence intervals that are too narrow and overly optimistic p-values.

Using the same data for selection and inference has been so widely practiced that it likely contributes to the reproducibility crisis in science (Taylor & Tibshirani, 2015)

Simulating distorted inference after selection

A simple simulation demonstrates this underestimation of uncertainty due to selection:

  • y and x are both normally distributed with no relation, i.e. \(\beta_x=0\)
  • for each of 10,000 simulations:
    • draw 100 observations of y and x
    • calculate the correlation between y and x
    • regress y on x and record the p-value and confidence interval for \(\hat\beta_x\)
# simulation code
set.seed(12345)
n <- 100 # sample size
nsim <- 10000 # number of simulations
rho <- rep(0, n) # correlation(x,y)
b <- rep(0, n) # lm coef
pval <- rep(0, n) # pvalues
ll <- rep(0, n) # CI lower
ul <- rep(0, n) # CI upper
for (i in 1:nsim) {
  ysim <- rnorm(n) # draw ysim
  xsim <- rnorm(n) # draw x independently of y
  rho[i] <- cor(ysim, xsim) # sample correlation
  m <- lm(ysim ~ xsim) # lm model
  b[i] <- coef(m)["xsim"]  # save coefficient
  pval[i] <- summary(m)$coefficients["xsim", "Pr(>|t|)"]  # save pvalue
  ci <- confint(m)["xsim",] # get CI
  ll[i] <- ci[1] # save CI
  ul[i] <- ci[2]
}


sel_inf <- data.frame(rho=rho, pval=pval, ll=ll, ul=ul) # put it all in data frame

For selection, we will report p-values and confidence intervals when the correlation between x and y is greater than .2 (or less than -.2)

# select estimates where the |correlation| (rho) was greater than 0.2
selected <- which(abs(sel_inf$rho)>.2) # indices of selected

Under the null hypothesis that \(\beta_x=0\), which is true, p-values have a uniform distribution (in blue below) and the probability of observing \(p<=0.05\) is \(0.05\).

The p-values for \(\hat{\beta_x}\) after selection (henceforth known as selection p-values) have a much different distribution concentrated near 0 (in read below).

# histograms of all p-values and selection p-values
par(mfrow=c(2,1),
    mar=c(4,4,2,1),
    cex=.8)
mybreaks <- pretty(range(sel_inf$pval), n=50)
plot(hist(sel_inf$pval, plot=F, breaks=mybreaks), col=rgb(0,0,1, alpha=.5), main="all p-values have a uniform distribution", xlab = "p-value")
plot(hist(sel_inf$pval[selected], plot=F, breaks=mybreaks), col=rgb(1,0,0, alpha=.5), main="p-values after selection have a very different distribution", xlab="p-value")

For all p-values, the probability that \(p<.05\) is indeed near 0.05:

# probability of p<0.05 for all p-values
sum(sel_inf$pval < 0.05)/length(sel_inf$pval)
[1] 0.0495

Thus, if we were to always model x without selection, the probability of a false positive is about 0.05.

However, for the selection p-values, the probability of a false positive is 1!

# probability of p<0.05 for selection p-values
sum(sel_inf$pval[selected] < 0.05)/length(sel_inf$pval[selected])
[1] 1

Thus, these p-values calculated only for selected samples certainly do not represent the probability of observing an extreme test statistic (\(|t|\gt t_{crit}\)) under the null hypothesis.

Similarly, confidence intervals inferred from the same data used for selection will not have their desired properties.

95% confidence intervals should contain the true parameter value, \(\beta_x=0\) in 95% of samples, i.e. 95% coverage.

This is true in our simulation for all coefficients:

# proportion of all confidence intervals that contain 0
sum((sel_inf$ll < 0) & (0 < sel_inf$ul))/length(sel_inf$ll)
[1] 0.9505

The selection confidence intervals have 0% coverage, because they never contain 0 (coincides with all selection p-values < 0.05):

# proportion of selection confidence intervals that contain 0
sum(sel_inf$ll[selected] < 0 & 0 < sel_inf$ul[selected])/length(sel_inf$ll[selected])
[1] 0

Selective Inference

A number of methods collectively called selective inference have been developed that address the problem of inference after a selection procedure (Taylor & Tibshirani, 2015).

Selective inference methods can be broadly classifed into 3 categories (Kuchbhotla et al., 2022):

  • sample splitting
  • conditional selective inference approaches, which calculate p-values and confidence intervals conditioned on the selection
  • simulataneous inference approaches, which frame selective inference as a multiple comparison problem

Sample splitting

The simplest method to avoid problems of using the same data to select predictors and then again for inference is sample splitting.

In sample splitting, the data are randomly split into 2 parts (Wasserman, 2009):

  • the first subset is used for predictor selection via, for example, LASSO with \(\lambda\) tuned via cross-validation
  • the second subset is used for inference based on an unpenalized model with the predictors selected in the first subset
  • to determine how much data to assign to each split:
    • more data in the first subset improves tuning \(\lambda\) and may stabilize the set of selected predictors
    • more data in the second subset improves power and inference, e.g. narrower confidence intervals

The major disadvantage of sample splitting is that less data is used for both selection and inference.

We demonstrate by splitting our data d_lasso into halves.

# split data into two halves
set.seed(53257)
ind <- sample(1:200, size=100) # 100 random observation numbers
d_sel <- d_lasso[ind,] # half for selection
d_inf <- d_lasso[-ind,] # half for inference

Then we tune \(\lambda\) in the first half, and use the predictors selected at "lambda.1se" for a more parsimonious model.

set.seed(82945)
xz_sel <- as.matrix(d_sel[,2:111]) # selection data predictor matrix
y_sel <- as.matrix(d_sel[,1]) # selection data outcome vector
cv_fit_sel <- cv.glmnet(xz_sel, y_sel) # cross-validate to tune lambda and select predictors

coef(cv_fit_sel, s="lambda.1se") # non-zero coefficients at lambda.1se
111 x 1 sparse Matrix of class "dgCMatrix"
             lambda.1se
(Intercept) -0.09829234
x1           1.04015614
x2           0.89411437
x3           1.01871272
x4           0.85994520
x5           0.51793015
x6           0.97266368
x7           1.03250572
x8           0.77971793
x9           0.81640041
x10          1.14665334
z1           .         
z2           .         
z3           .         
z4           .         
z5           .         
z6           .         
z7           .         
z8           .         
z9           .         
z10          .         
z11          .         
z12          .         
z13          .         
z14          .         
z15          .         
z16          .         
z17          .         
z18          .         
z19          .         
z20          .         
z21          0.10113821
z22          .         
z23          .         
z24          .         
z25          .         
z26          .         
z27          0.05186625
z28          .         
z29          .         
z30          .         
z31          .         
z32          .         
z33          .         
z34          .         
z35          .         
z36          .         
z37          .         
z38          .         
z39          .         
z40          .         
z41          .         
z42          .         
z43          .         
z44          .         
z45          .         
z46          .         
z47          .         
z48          .         
z49          .         
z50          .         
z51          .         
z52          .         
z53          .         
z54          0.12353828
z55          .         
z56          .         
z57          .         
z58          .         
z59          .         
z60          .         
z61          .         
z62          .         
z63          .         
z64          .         
z65          .         
z66          .         
z67          .         
z68          .         
z69          .         
z70          .         
z71          .         
z72          .         
z73          0.01223998
z74          .         
z75          .         
z76          .         
z77          .         
z78          .         
z79          .         
z80          .         
z81          .         
z82          .         
z83          .         
z84          .         
z85          .         
z86          .         
z87          0.01532838
z88          .         
z89          .         
z90          .         
z91          .         
z92          .         
z93          .         
z94          .         
z95          .         
z96          .         
z97          .         
z98          .         
z99          .         
z100         .         

Above we see 13 non-zero coefficients.

The columns of the predictor training matrix corresponding to the non-zero coefficients are can be accessed in the object returned by coef() with @i

# columns of xz_sel corresponding to non-zero coefficients
xz_sel_ind <- coef(cv_fit_sel, s="lambda.1se")@i # column indices of selected predictors
xz_sel_ind # 0th column is intercept and is always selected
 [1]  0  1  2  3  4  5  6  7  8  9 10 31 37 64 83 97

We can then perform inference on these columns using the second subset. We use lm() for an unpenalized model (i.e. coefficients are unbiased, unlike those in LASSO):

# run linear regression on 2nd subset using predictors selected in 1st subset
f <- paste("y ~", paste(colnames(xz_sel)[xz_sel_ind[-1]], collapse="+")) # create lm formula using column names of xz_sel
f
[1] "y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+z21+z27+z54+z73+z87"
m_inf <- lm(f, data=d_inf) # unpenalized model
summary(m_inf)

Call:
lm(formula = f, data = d_inf)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.77695 -0.68562 -0.01483  0.70274  2.58269 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.04329    0.11871  -0.365    0.716    
x1           0.91117    0.12544   7.264 1.77e-10 ***
x2           0.99319    0.13664   7.268 1.73e-10 ***
x3           1.02833    0.12860   7.997 6.16e-12 ***
x4           0.92140    0.13877   6.640 2.92e-09 ***
x5           1.17323    0.13302   8.820 1.37e-13 ***
x6           1.20620    0.13039   9.251 1.85e-14 ***
x7           1.08318    0.12649   8.563 4.49e-13 ***
x8           0.94158    0.13322   7.068 4.29e-10 ***
x9           0.98493    0.12835   7.674 2.71e-11 ***
x10          0.84850    0.13792   6.152 2.49e-08 ***
z21         -0.08537    0.13673  -0.624    0.534    
z27          0.17788    0.14427   1.233    0.221    
z54         -0.11317    0.13074  -0.866    0.389    
z73          0.10923    0.13178   0.829    0.410    
z87         -0.12212    0.13232  -0.923    0.359    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.11 on 84 degrees of freedom
Multiple R-squared:  0.9683,    Adjusted R-squared:  0.9627 
F-statistic: 171.1 on 15 and 84 DF,  p-value: < 2.2e-16

The confidence intervals for all of the regression coefficients contain the true parameter values (\(\beta_0=0\), \(\beta_{x1}=\beta_{x2}=...=\beta_{x_10}=1\), \(\beta_{z1}=\beta_{z2}=...=\beta_{z100}=0\))

# CIs for unpenalized model run on independent data
m_inf_ci <- confint(m_inf)
m_inf_ci
                 2.5 %    97.5 %
(Intercept) -0.2793542 0.1927652
x1           0.6617064 1.1606250
x2           0.7214565 1.2649211
x3           0.7725980 1.2840567
x4           0.6454536 1.1973541
x5           0.9087133 1.4377542
x6           0.9469135 1.4654957
x7           0.8316320 1.3347250
x8           0.6766531 1.2065017
x9           0.7297027 1.2401619
x10          0.5742381 1.1227606
z21         -0.3572702 0.1865276
z27         -0.1090141 0.4647676
z54         -0.3731493 0.1468177
z73         -0.1528357 0.3712892
z87         -0.3852672 0.1410179

The unpenalized model run on the first data subset used for selection (remember, don’t do this!) results in a false positive test of significance for \(z10\):

# problematic inference if unpenalized model is run on data
#  used for selection
m_sel <- lm(f, data=d_sel) 
summary(m_sel)

Call:
lm(formula = f, data = d_sel)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.35513 -0.52145  0.02935  0.60783  2.12045 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.03296    0.09642  -0.342   0.7333    
x1           1.11606    0.10056  11.099  < 2e-16 ***
x2           0.98712    0.10426   9.468 6.80e-15 ***
x3           1.11358    0.12156   9.160 2.82e-14 ***
x4           0.90027    0.10011   8.993 6.13e-14 ***
x5           0.59005    0.10604   5.564 3.07e-07 ***
x6           1.07014    0.10317  10.372  < 2e-16 ***
x7           1.03268    0.11405   9.055 4.61e-14 ***
x8           0.84885    0.09158   9.269 1.70e-14 ***
x9           0.86572    0.13661   6.337 1.11e-08 ***
x10          1.19627    0.11337  10.552  < 2e-16 ***
z21          0.24648    0.10596   2.326   0.0224 *  
z27          0.10270    0.10197   1.007   0.3168    
z54          0.14421    0.11117   1.297   0.1981    
z73          0.04630    0.09679   0.478   0.6336    
z87          0.05788    0.10884   0.532   0.5962    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8704 on 84 degrees of freedom
Multiple R-squared:  0.9831,    Adjusted R-squared:   0.98 
F-statistic: 325.1 on 15 and 84 DF,  p-value: < 2.2e-16

Stability sampling

A common problem encountered with LASSO selection is that the best set of predictors may be sensitive to how the data are split during cross-validation.

  • correlated predictors may be alternately selected
  • unrelated predictors may be selected in noisy data

Below we re-run the \(\lambda\) cross-validation and predictor selection 3 times and can see that the number of nonzero coefficeints varies at "lambda.min" and "lambda.1se":

# selected predictor set varies across cross-validation runs
set.seed(1290)
cv.glmnet(xz_sel, y_sel)

Call:  cv.glmnet(x = xz_sel, y = y_sel) 

Measure: Mean-Squared Error 

     Lambda Index Measure     SE Nonzero
min 0.09394    82   1.185 0.1192      29
1se 0.20715    65   1.296 0.1600      18
set.seed(2741)
cv.glmnet(xz_sel, y_sel)

Call:  cv.glmnet(x = xz_sel, y = y_sel) 

Measure: Mean-Squared Error 

    Lambda Index Measure     SE Nonzero
min 0.0856    84   1.215 0.1586      31
1se 0.2273    63   1.365 0.1992      17
set.seed(4895)
cv.glmnet(xz_sel, y_sel)

Call:  cv.glmnet(x = xz_sel, y = y_sel) 

Measure: Mean-Squared Error 

    Lambda Index Measure     SE Nonzero
min 0.1428    73   1.188 0.1322      20
1se 0.2614    60   1.316 0.2009      16

Stability sampling (Meinhausen & Buhlmann, 2010) addresses predictor set variability using this general algorithm:

  • resample a subset (e.g. half) of the training data repeatedly \(R\) times
    • simulates new data
    • can reduce predictor correlations
    • \(R=100\) is usually sufficient
  • use the predictor selection method (e.g. LASSO) on each resampled subset and record which predictors are selected
    • \(\lambda\) can be chosen before resampling using cross-validation on the entire training sample
    • or, \(\lambda\) can tuned via cross-validation for each \(R\) subset
  • calculate the probability that each predictor is selected across the \(m\) resampled subsets
  • choose the predictors that surpass some threshold probability, \(\pi_{thr}\)
    • Meinhausen & Buhlmann, 2010, suggest \(0.6 \le \pi_{thr} \le 0.9\) as sensible thresholds and note that the results tend not to be too sensitive to \(\pi_{thr}\)
    • higher values of \(\pi_{thr}\) result in more regularized predictor sets
    • \(\pi_{thr}\) can also be chosen to control the expected number of false positives (predictors with true zero coeffcients, see Meinshausen & Bühlmann, 2010)

Below we use stability sampling on our training data (already a subset of 100 observations due to sampling splitting) to stabilize selection of the predictor set:

# stability selection
set.seed(13459)
lam <- cv_fit_sel$lambda.min # use lambda.min run on entire training sample (after sample splitting)
selected <- rep(0, 110) # counts of being selected for 110 predictors 
R <- 100 # number of simulations
for (i in 1:R) {
  stab_ind <- sample(1:100, size=50) # 50 observations resampled each time
  xz_stab <- xz_sel[stab_ind,] # resampled predictors
  y_stab <- y_sel[stab_ind,] # resampled outcome
  ind_selected <- coef(glmnet(xz_stab, y_stab, lambda=lam))@i[-1] # column indices of selected predictor, -1 removes intercept
  selected[ind_selected] <- selected[ind_selected] + 1 # increase counts for selected predictors
}

# predictor set at 4 different pi thresholds
which(selected >= .6*R) # selected in more than 60% of simulations
 [1]   1   2   3   4   5   6   7   8   9  10  31  97 105
which(selected >= .7*R)
 [1]  1  2  3  4  5  6  7  8  9 10 31
which(selected >= .8*R)
 [1]  1  2  3  4  5  6  7  8  9 10
which(selected >= .9*R)
 [1]  1  2  3  4  5  6  7  8  9 10

All 10 \(x\) predictors (corresponding to 1-10) are selected at each threshold, and are the only predictors selected at threshold \(\pi_{thr}=.8\) and \(\pi_{thr}=.9\).

Stability selection with randomized LASSO

Meinhausen & Buhlmann, 2010, recommend the use of the randomized LASSO to improve stability selection.

With the randomized LASSO, the penalties are perturbed randomly across the coefficients.

Imagine two predictors, \(X_1\) and \(X_2\) are correlated and equally predictive of the outcome \(Y\).

  • if in the sample at hand \(X_1\) happens to be more strongly predictive of \(Y\) due to sampling variability, LASSO may always choose \(X_1\), even though \(X_2\) is actually equally predictive
  • the randomized LASSO penalty perturbations will reduce the influence of \(X_1\) in some of the resampled subsets, simulating samples where \(X_2\) happens to be more strongly predictive of \(Y\) and thus allowing LASSO to select \(X_2\) in some samples

Essentially, the randomized LASSO adds weights to the penalty term:

\[Loss_{lasso}(\beta) = Loss_{reg}(\beta) + \lambda\sum_{j=1}^p\frac{1}{w_j}|\beta_j| \]

These penalty weights, \(w_j\) are drawn from a uniform distribution with bounds \([\alpha, 1]\), \(w_j \sim Uniform(\alpha, 1)\), with \(0.2 \le \alpha \le 0.8\) recommended, with lower values of \(\alpha\) adding more randomness to the penalties.

We will see in part 2 of this workshop how random forests introduce randomness in regression trees to improve prediction.

Below we implement the randomized LASSO in our stability sampling:

  • draw weights from runif() (one for each column of predictor matrix)
  • apply the reciprocal of these weights to LASSO using penalty.factor= in glmnet
set.seed(13769)
lam <- cv_fit_sel$lambda.min # use lambda.min run on entire training sample (after sample splitting)
selected_randomized <- rep(0, 110) # counts of being selected for 110 predictors 
R <- 100 # number of 
for (i in 1:R) {
  stab_ind <- sample(1:100, size=50) # 50 observations resampled each time
  xz_stab <- xz_sel[stab_ind,] # resampled predictors
  y_stab <- y_sel[stab_ind,] # resampled outcome
  pf <- 1/runif(110, .4, 1) # 1/w_j penalty weights
  
  ind_selected <- coef(glmnet(xz_stab, y_stab, lambda=lam, penalty.factor=pf))@i[-1] # apply weights using penalty.factor=
  selected_randomized[ind_selected] <- selected_randomized[ind_selected] + 1
}
# predictor sets at different pi thresholds 
which(selected_randomized >= .6*R)
 [1]   1   2   3   4   5   6   7   8   9  10  64 105
which(selected_randomized >= .7*R)
 [1]  1  2  3  4  5  6  7  8  9 10 64
which(selected_randomized >= .8*R)
 [1]  1  2  3  4  5  6  7  8  9 10
which(selected_randomized >= .9*R)
 [1]  1  2  3  4  5  6  7  8  9 10

For our simulated data, the randomized LASSO only provides a little more stability to the selected predictor set.

Conditional selective inference

Conditional selective inference methods calculate p-values and confidence intervals (CIs) that account for the fact that the set of predictors was chosen by examining their relationships with the outcome.

Conditional selective inference methods do not require the sample to be split, so the same data are used efficiently for selection and inference.

Recall that test-statistics used to evaluate null hypotheses for regression coefficients (and construct CIs) are formed by dividing the estimated coefficient by its standard error, e.g. \(t_\hat{\beta}\) in linear regression:

\[t_\hat{\beta_j} = \frac{\hat{\beta_j}}{\hat{\sigma}_{\hat{\beta_j}}}\]

\(t_\hat{\beta_j}\) will be near 0 if either \(\hat{\beta_j}\) is near 0 or if \(\hat{\sigma}_{\hat{\beta_j}}\) is very large.

In one conditional selective inference approach (Lee et al., 2016), the test statistic distribution is truncated to account for the fact the selection procedure prevents predictors whose coefficients have test-statistics near zero from being entered into the inference model.

For a given test statistic, this approach essentially calculates bounds defining the interval the test statistic must lie due to the selection event.

These bounds truncate the test statistic’s sampling distribution, and CIs and p-values are then calculated based on this truncated distribution.

To understand how a truncated distribution changes inference, imagine we randomly draw \(x\) repeatedly from a standard normal distribution, \(x \sim N(0,1)\):

  • the probability of observing \(x>2.5\) over all draws is quite low (.0062)