Machine Learning for Scientific Research, Part 1

UCLA OARC Statistical Methods and Data Analytics

Overview

This two-part workshop introduces usage of machine learning methods for scientific research.

  • aimed at researchers with some background in classical regression modeling
  • no background in machine learning assumed
  • emphasizes methods for scientific inference rather than prediction

Part 1 discusses:

  • machine learning vs classical regression
  • training and test prediction error
  • hyperparameter tuning and cross-validation
  • regularization, focusing on LASSO
  • fitting LASSO models with glmnet
  • inference methods after LASSO variable selection

Part 2 discusses regression and classification trees and their extensions, including bagging, boosting, random forests and causal forests, with an emphasis on inference.

Machine learning vs classical statistical methods

Machine learning

Machine learning (ML) encompasses a wide array of computational methods that identify patterns in data to make accurate predictions and infer relationships among variables.

ML is a subclass of artificial intelligence: the machine learns patterns without human guidance.

Traditionally, ML has been used for accurate prediction in industry and engineering settings:

  • recommending products
  • predicting credit card fraud
  • detecting defects in manufacturing

Machine learning is increasingly used in the sciences

ML usage in the sciences has grown rapidly in recent years, driven by:

  • bigger data
  • increasing computational power
  • development of inference methods

Below are figures showing increasing usage of ML methods in health sciences (left) and business, social sciences, and economics (right).

Number of articles using or discussing ML has grown rapidly over time across many health subfields

Image is Fig 1 from (Rajula et al., 2020).

Machine learning usage has rapidly grown in social sciences, economics, and business, especially more recently

Blue line is % of abstracts containing ML terms across fields

Image is Fig 1 from (Rahal et al., 2024).

Big data challenges

Data sets continue getting bigger:

  • ubiquity of data-recording devices (e.g. cell phones, sensors) and digital data collection
  • cheap, massive, and remotely accessible storage (e.g. cloud)

Digital tools enable measurement of thousands of variables efficiently.

These wide data sets suffer from the curse of dimensionality: relevant patterns become harder to find as the number of variables grows.

For example, in genomics, we might wish to know which few genes out of 25,000 are most strongly related to variation in a trait.

Human vs machine learning

Compared to machines, humans often require much less data to learn.

  1. If we were to show the following pictures to a young child:

  2. Most children would be immediately able to solve this problem:

A computer might require thousands of training pictures to solve the problem.

Instead, computers excel at processing data, especially at large scales.

Thus, in scientific research, ML is ideally suited to problems where:

  • the number of relationships to probe is overwhelming
  • large amounts of data to train the machine are available

Machine learning spectrum

For many problems (e.g., genomics) we wish to explore the relationships of thousands of predictors (features in ML) with an outcome (label in ML).

A researcher can be easily overwhelmed deciding which predictors, interactions, non-linear effects, etc., to evaluate.

ML methods transfer the decision-making from humans to computers (algorithms).

Beam & Kohane (2018) imagined classical regression and ML methods as lying on a spectrum defined by how much human vs machine decision-making is required.

They then plotted where studies using different methods fall on this spectrum against the amount of data each study used.

As expected, studies using methods requiring more machine-driven decisions also used more data.

Image is Figure from (Beam & Kohane, 2018)

Human decisions are often model assumptions

Classical regression generally requires more human decision making, often in the form of assumptions:

  • assuming omitted predictors have no relation to the outcome
    • this includes omitting interaction terms
  • assuming linear relationships of included predictors
  • assuming outcome or errors follow a known parametric distribution

These assumptions simplify the model, reducing how many parameters are estimated and thus the amount of data required.

Thus, given sufficient data, machine learning methods are especially useful for scientific problems where regression assumptions are untenable or the researcher is overwhelmed by the number of decisions and assumptions to ponder.

Supervised learning and prediction error

Supervised learning

Like regression, supervised ML methods model the relationships between a set of predictors and an outcome.

Two primary goals:

  • prediction: given a set of predictor values, what is the predicted outcome?
    • when earthquakes will happen
    • who will develop a disease
    • traditionally the focus of ML
  • inference: understanding relationship between predictor and outcome
    • which predictors are related to the outcome?
    • if I vary a predictor, how does the outcome vary (e.g., treatment effect)?
    • what is the uncertainty in these relationships?
    • usual focus of scientific research
    • ML methods often leave inference inside the black box

Unsupervised methods find patterns among variables without respect to an outcome (e.g. k-means clustering, principal component analysis)

Prediction

Imagine we are interested in estimating the relationship, \(f\), between an outcome \(Y\) and a set of predictors \(X=X_1,X_2,...,X_p\).

\[Y=f(X)\]

For example, in linear regression, \(f\) is assumed to be a linear combination of a set of coefficients and \(X\):

\[\begin{align} Y&=f_{OLS}(X) + \epsilon \\ &=\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p + \epsilon \\ \end{align}\]

The error term \(\epsilon\) is not included in \(f\) because it does not depend on \(X\).

In many ML approaches, linear relationships are not assumed for \(f\).

With data, we can estimate these relationships, and the estimated \(\hat{f}\) can be used to make predictions of the outcome, \(\hat{Y}\).

\[\hat{Y} = \hat{f}(X)\]
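As a toy illustration (simulated data; the variable names below are our assumptions, not the seminar data), we could estimate \(f\) with OLS and form predictions:

# toy example: estimate f with OLS, then form predictions y-hat = f-hat(x)
set.seed(1)
X1 <- rnorm(50); X2 <- rnorm(50)
Y  <- 1 + 2*X1 - X2 + rnorm(50)   # outcome generated from a linear f plus error
f_hat <- lm(Y ~ X1 + X2)          # estimated f-hat
Y_hat <- predict(f_hat)           # predicted outcomes
head(Y_hat)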

Prediction error

Predictions \(\hat{Y}\) will rarely perfectly match observations \(Y\).

Prediction error quantifies the difference between \(\hat{Y}\) and \(Y\) and comes in many forms.

For continuous \(Y\), squared errors, \((Y-\hat{Y})^2\), are commonly used.

If we wish to predict \(Y\) in future data sets, we might want to know how much prediction error to expect:

\[E((Y-\hat{Y})^2) = [(f(X) - \hat{f}(X))^2] + Var(\epsilon)\]

where

  • \(E((Y-\hat{Y})^2)\) is the expected value of prediction error, or the long-run average squared error over many data sets
  • \([(f(X) - \hat{f}(X))^2]\) is the squared difference between predictions based on the true model, \(f(X)\), and predictions based on the estimated model, \(\hat{f}(X)\)
    • both predictions \(f(X)\) and \(\hat{f}(X)\) exclude the error term \(\epsilon\)
  • \(Var(\epsilon)\) is the variance of the unexplained errors

We will use other measures of prediction error for categorical outcomes.

Reducible error and irreducible error

In the expression for expected squared (prediction) error:

\[\begin{align} E((Y-\hat{Y})^2) &= [(f(X) - \hat{f}(X))^2] + Var(\epsilon) \\ &=\text{reducible error} + \text{irreducible error} \end{align}\]

  • \([(f(X) - \hat{f}(X))^2]\) represents reducible error
    • can be reduced, theoretically to zero, if all relevant predictors are correctly modeled
    • omission of relevant predictors generally causes bias
    • including irrelevant predictors causes estimates to become unstable due to overfitting
  • \(Var(\epsilon)\) represents irreducible error, which cannot be reduced because we lack the explanatory predictors or because of random fluctuations.

We can thus optimize predicting the outcome by minimizing the reducible error.

Mean Squared Error

With sample data, we can estimate the expected squared error with Mean Squared Error (MSE):

\[\text{MSE}=\frac{1}{n}\sum_{i=1}^n(y_i-\hat{f}(x_i))^2\]

MSE is the average squared residual (the residual being an estimate of the error), \((y_i-\hat{f}(x_i))^2\).

The MSE for the data used to fit the model is called training MSE, \(\text{MSE}_{\text{train}}\).

Test MSE

A model that predicts well in future data sets will generally be more useful than one that only predicts well for the current sample data.

Thus, we generally care less about \(\text{MSE}_{\text{train}}\) and more about \(\text{MSE}_{\text{test}}\), or prediction error in future data sets:

\[\text{MSE}_{\text{test}}=\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}(\tilde{y}_i-\hat{f}(\tilde{x}_i))^2\]

\(\hat{f}\) above is the model estimated with the training data.

Above, \(\tilde{y}\) and \(\tilde{x}\) represent outcomes and predictors in future data sets, respectively.

We use test error to refer generally to a measure of prediction error in future data.
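As a minimal sketch (toy simulated data; all names are assumptions), \(\text{MSE}_{\text{train}}\) and \(\text{MSE}_{\text{test}}\) could be computed as:

# compare training MSE and test MSE for a simple OLS fit on toy data
set.seed(2)
train <- data.frame(x=rnorm(100)); train$y <- 2*train$x + rnorm(100)
test  <- data.frame(x=rnorm(100)); test$y  <- 2*test$x  + rnorm(100)
fit <- lm(y ~ x, data=train)
mse_train <- mean((train$y - predict(fit))^2)                 # MSE in training data
mse_test  <- mean((test$y  - predict(fit, newdata=test))^2)   # MSE in "future" data
c(train=mse_train, test=mse_test)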

Minimizing test MSE

Methods that minimize \(\text{MSE}_{\text{train}}\) (e.g. ordinary least squares) usually do not simultaneously minimize \(\text{MSE}_{\text{test}}\).

Models with more parameters have more flexibility to accommodate complex relationships.

\(\text{MSE}_{\text{train}}\) monotonically decreases with increased model flexibility (analogous to how \(R^2\) always increases with more predictors in linear regression).

However, \(\text{MSE}_{\text{test}}\) has a U-shaped relationship with flexibility, where at some point increasing flexibility will lead to increasing \(\text{MSE}_{\text{test}}\).

Simulation of \(\text{MSE}_{\text{train}}\) and \(\text{MSE}_{\text{test}}\) vs flexibility

The following simulation illustrates how \(\text{MSE}_{\text{train}}\) and \(\text{MSE}_{\text{test}}\) vary with model flexibility:

  • In the population, \(y\) has the following cubic relationship with \(x\):

\[y=x-x^2+.4x^3+\epsilon\]

  • we train linear regression models on a sample of \(n=30\)
  • flexibility of the models is determined by the degree of polynomial of \(x\) added to the model, ranging from 1 (just \(x\)) to 10 (all terms up to \(x^{10}\))
  • \(\text{MSE}_{\text{test}}\) is estimated at each polynomial degree by:
    • use the model fit to the sample data to predict outcomes in 20 future data sets (from the same population and each with \(n=30\))
    • use the predictions to calculate \(\text{MSE}_{\text{test}}\) in each future data set
    • average the 20 \(\text{MSE}_{\text{test}}\) values
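A minimal R sketch of such a simulation is below (the \(x\) distribution and error variance are our assumptions; the actual simulation details may differ):

# sketch of the flexibility simulation: training vs test MSE by polynomial degree
set.seed(123)
n <- 30
gen_data <- function(n) {
  x <- runif(n, -2, 2)
  y <- x - x^2 + .4*x^3 + rnorm(n)
  data.frame(x=x, y=y)
}
train <- gen_data(n)
mse_train <- mse_test <- numeric(10)
for (d in 1:10) {
  fit <- lm(y ~ poly(x, d), data=train)
  mse_train[d] <- mean(resid(fit)^2)
  # average test MSE over 20 future data sets from the same population
  mse_test[d] <- mean(replicate(20, {
    future <- gen_data(n)
    mean((future$y - predict(fit, newdata=future))^2)
  }))
}
round(rbind(mse_train, mse_test), 2)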


The left plot below shows the training data and predictions from 3 different models (degree=1,3,10)

The right plot below shows how \(\text{MSE}_{\text{train}}\) and \(\text{MSE}_{\text{test}}\) vary with flexibility.

The added flexibility of additional polynomial terms allows the model to fit more closely to the training data.

However, the wiggliness of the maximally flexible model (degree=10) suggests overfitting.

\(\text{MSE}_{\text{train}}\) always decreases with flexibility.

\(\text{MSE}_{\text{test}}\) has the expected U-shaped relationship with flexibility, and is at a minimum at the true model (degree=3).

Figure inspired by Fig 2.9 of (James et al., 2013)

Bias-variance tradeoff

Why does \(\text{MSE}_{\text{test}}\) have a U-shaped relationship with flexibility?

The reducible error in future datasets has two components: bias and variance of \(\hat{f}\):

\[\begin{align}E(\text{MSE}_{\text{test}}) &= \text{reducible error} + \text{irreducible error} \\ &=[Bias(\hat{f}(\tilde{x}))^2 + Var(\hat{f}(\tilde{x}))] + Var(\epsilon) \end{align}\]

  • \(E(\text{MSE}_{\text{test}})\) is the expected or average test MSE over many future datasets.
  • \(Bias(\hat{f}(\tilde{x}))^2\) is the (squared) bias of \(\hat{f}\)
    • if the average of \(\hat{f}\) over many samples does not equal \(f\), \(\hat{f}\) is biased
    • more flexible models will tend to have less bias (by including all relevant predictors with correct functional forms)
  • \(Var(\hat{f}(\tilde{x}))\) is the variance of \(\hat{f}\), how much \(\hat{f}\) changes when repeatedly estimated
    • more flexible models will tend to have more variance because the extra parameters will be sensitive to small changes in the data

Thus, as models become more flexible, bias will decrease, but eventually, reductions in bias will be dominated by increases in variance.

The goal, then, is to find the model with the right amount of flexibility that balances bias and variance to minimize test MSE.

An extension of the previous polynomial simulation illustrates how bias and variance of \(\hat{f}\) vary with model flexibility:

  • Our target is bias and variance of the coefficient for the \(x\) linear term, \(\beta_{x}=1\)
  • Generate 500 training data sets of \(n=30\)
  • Train models with polynomial degrees=1,2,…,10 of \(x\) on the 500 data sets
  • Bias is estimated by subtracting the mean of the 500 estimated coefficients for the linear term, \(\overline{\hat{\beta}}_x\), from its true value of \(\beta_{x}=1\)
  • Variance is estimated by the variance of the 500 \(\hat{\beta}_x\)
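A sketch of this extension follows (same assumed data-generating details as the sketch above); raw polynomials are used so the linear-term coefficient is directly comparable to \(\beta_x=1\):

# sketch: bias and variance of the linear-term coefficient by polynomial degree
set.seed(456)
nsim <- 500; n <- 30
bias_x <- var_x <- numeric(10)
for (d in 1:10) {
  bhat <- replicate(nsim, {
    x <- runif(n, -2, 2)
    y <- x - x^2 + .4*x^3 + rnorm(n)
    coef(lm(y ~ poly(x, d, raw=TRUE)))[2]  # estimated coefficient on the linear term
  })
  bias_x[d] <- mean(bhat) - 1  # true linear coefficient is 1
  var_x[d]  <- var(bhat)
}
round(cbind(degree=1:10, bias=bias_x, variance=var_x), 3)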

Flexibility vs interpretability

Statistical approaches that are more flexible can improve prediction accuracy but tend to be less interpretable.

The additional complexity of flexible models obscures their interpretation (e.g. interactions and non-linear effects).

Classical regression models are less flexible, but their parameters have straightforward interpretations as additive changes in the outcome (or transformed outcome) per unit change in the predictor.

Machine learning models are often non-parametric and predictor effects are often difficult to summarize.

Image inspired by Fig 2.7 from (James et al., 2013)

Tuning model flexibility

Hyperparameters

The flexibility of ML methods is typically controlled by hyperparameters.

Hyperparameters are fixed before a model is fit, and the model is then fit conditionally on the hyperparameters’ values.

Across ML models, hyperparameters typically control

  • model complexity (e.g. LASSO, regression trees)
  • learning rate (e.g. boosting, neural networks)

The “best” hyperparameter value that minimizes test error cannot be estimated from a single model fit.

Instead, hyperparameters are tuned or optimized by estimating test error over a range of many hyperparameter values.

Cross-validation

Cross-validation is the most common method used to tune hyperparameters.

Cross-validation efficiently uses a single data set both to train a model and to estimate its test error (often called validation error when no truly future data are used).

Cross-validation algorithm:

  • split the data randomly into \(K\) folds
  • then for each fold \(k\)
    • train the model on all folds but fold \(k\)
    • calculate predictions in fold \(k\) using the trained model to estimate test error
  • repeat for all folds
  • the estimated test error is the average of the test errors across all \(K\) folds

Cross-validation thus allows us to estimate test error at each hyperparameter value and find the value that minimizes test error, with just one data set.
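A generic sketch of the algorithm in R (fit_model() and test_error() are hypothetical placeholders for whatever model and error measure are being used):

# sketch of K-fold cross-validation for a single hyperparameter value
k_fold_cv <- function(data, K, fit_model, test_error) {
  folds <- sample(rep(1:K, length.out=nrow(data)))   # randomly assign rows to folds
  errs <- numeric(K)
  for (k in 1:K) {
    fit <- fit_model(data[folds != k, ])             # train on all folds but fold k
    errs[k] <- test_error(fit, data[folds == k, ])   # estimate test error in fold k
  }
  mean(errs)                                         # average test error across folds
}
# example usage with a linear model and MSE:
# k_fold_cv(d, K=5,
#           fit_model = function(tr) lm(y ~ x, data=tr),
#           test_error = function(fit, te) mean((te$y - predict(fit, te))^2))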

Imagine we wish to tune hyperparameter \(\lambda\) value over a range of 100 values, \(\lambda_1,\lambda_2,...,\lambda_{100}\)

The image below depicts 5-fold cross-validation to estimate test error for the first \(\lambda\) in the sequence, \(\lambda_1\)

How many folds?

With more data, the trained model will be less variable and on average closer to the true model, resulting in lower test error.

Cross-validation uses only a portion of the data to train the model (e.g. 90% with 10 folds) and thus test error will be generally over-estimated compared to using 100% of the data.

To avoid this over-estimation, \(K=n\) folds, where each fold is a single observation, known as leave-one-out cross-validation, might seem appealing. However, the estimated test errors become highly variable because they are based on single observations.

For most applications, \(K=5\) and \(K=10\) have been found to perform well.

Generally, with smaller data, more folds might be preferred so that more data is used to train.

Do not calculate test error on training data

Statistical methods fit models by minimizing training error.

Thus, test error will be generally underestimated if calculated using any observations used to train the model.

Therefore, any form of training that uses the held-out observations must be avoided before calculating test error, including within cross-validation.

“Training” includes not only fitting models but also selecting predictors based on their relationship with the outcome (e.g. using correlation thresholds or a forward stepwise procedure).
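A small sketch of why this matters (all data and names below are assumptions): screening predictors on the full data and then cross-validating only the screened predictors yields an optimistic test error estimate, even when every predictor is pure noise.

# WRONG workflow: screen predictors using ALL the data, then cross-validate
set.seed(3)
n <- 50; p <- 500
X <- matrix(rnorm(n*p), n, p)   # pure noise predictors
y <- rnorm(n)                   # outcome unrelated to X
top10 <- order(abs(cor(X, y)), decreasing=TRUE)[1:10]  # screening already used the outcome
Xs <- X[, top10]; colnames(Xs) <- paste0("s", 1:10)
d_wrong <- data.frame(y=y, Xs)
folds <- sample(rep(1:5, length.out=n))
cv_mse <- sapply(1:5, function(k) {
  fit <- lm(y ~ ., data=d_wrong[folds != k, ])
  mean((d_wrong$y[folds == k] - predict(fit, d_wrong[folds == k, ]))^2)
})
mean(cv_mse)  # typically optimistically low; the screening must instead be redone inside each fold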

LASSO and regularization

Highly variable models

Regression parameter estimates will vary greatly across repeated samples when:

  • number of predictors is large relative to sample size
  • only a few predictors are actually predictive (sparsity)
  • predictors are highly correlated

More variable models will lead to higher test error, so methods less flexible than regression can help.

Regularized models task the machine with using the data to reduce the flexibility of the model, for example, by removing irrelevant predictors.

Regularization

A model’s loss function is minimized to produce the “best” parameter estimates:

  • sum of squared errors (SSE) minimized in linear regression

\[LOSS_{ols}(\beta) = \sum_{i=1}^n\left(y_i -\beta_0 - \sum_{j=1}^pX_{ij}\beta_j \right)^2\]

  • negative log likelihood minimized (equivalent to maximizing the likelihood) for generalized linear models (e.g., logistic and Poisson regression)

Regularization methods add a penalty term to a model’s loss function to limit complexity and discourage overfitting.

The penalty term:

  • shrinks parameter estimates towards zero
  • induces bias but also reduces variance
  • as a hyperparameter can be tuned via cross-validation

LASSO

Least Absolute Shrinkage and Selection Operator (LASSO) is a popular ML regularization method adapted for several regression models.

The LASSO loss function, \(LOSS_{lasso}(\beta)\), adds a penalty proportional to the sum of the absolute values of the \(p\) coefficients:

\[Loss_{lasso}(\beta) = Loss_{reg}(\beta) + \lambda\sum_{j=1}^p|\beta_j| \]

where

  • \(Loss_{reg}(\beta)\) is the loss function of the unregularized regression model
    • e.g. sum of squared errors, negative log likelihood
  • \(\lambda\) is the penalty hyperparameter
    • larger values penalize more
  • \(\sum_{j=1}^p|\beta_j|\) is the sum of the absolute values of all model coefficients
    • each predictor \(X_j\) is standardized so that the magnitude of its coefficient \(\beta_j\) does not depend on \(X_j\)’s scaling

For example, LASSO for linear regression adds the penalty term to the sum of squared errors (the loss function for linear regression):

\[\begin{align} LOSS_{lasso.ols}(\beta) &= \sum_{i=1}^n\left(y_i -\beta_0 - \sum_{j=1}^pX_{ij}\beta_j \right)^2 + \lambda\sum_{j=1}^p|\beta_j| \\ &= \text{SSE} + \text{penalty} \end{align}\]

For a given \(\lambda\), the “best” set of estimates \(\hat{\beta}\) minimizes \(Loss_{lasso}\)

Note: The loss function for LASSO linear regression can be equivalently minimized with SSE multiplied by \(\frac{1}{2n}\), as in the glmnet package, or by \(\frac{1}{2}\), as in the selectiveInference package. The same set of \(\hat{\beta}\) will be estimated, but the scaling of \(\lambda\) will be different.
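As a small illustration, the glmnet-scaled LASSO objective for linear regression could be computed directly (a sketch; it assumes the columns of X have already been standardized):

# sketch: LASSO loss for linear regression using glmnet's 1/(2n) scaling of the SSE
lasso_loss <- function(beta0, beta, X, y, lambda) {
  sse <- sum((y - beta0 - X %*% beta)^2)
  sse/(2*length(y)) + lambda*sum(abs(beta))
}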

LASSO as variable selection

The LASSO penalty can potentially shrink coefficients to zero if \(\lambda\) is large enough.

Predictors with zero coefficients are selected out of the model.

Thus, LASSO is often used for automated variable selection.

Coefficient path plots track how coefficients grow as \(\lambda\) and thus the LASSO penalty decreases.

The plot above tracks 10 coefficients against \(\lambda\)

  • each line represents a coefficient
  • the bottom x-axis is the negative of the log of \(\lambda\), which means \(\lambda\) decreases as we move to the right
  • the y-axis is the coefficient estimate
  • the top x-axis is how many non-zero coefficients are in the LASSO-regularized model
  • we see how coefficients emerge from zero as \(\lambda\) decreases

LASSO-regularized generalized linear models

The LASSO penalty can be added to generalized linear models (GLMs) that do not assume normally distributed errors, such as logistic, Poisson, multinomial, and Cox regression models.

Generally, the loss function to be minimized for LASSO-regularized GLMs will use the negative log likelihood:

\[LOSS_{lasso.glm}(\beta) = \text{negative log likelihood} + \text{LASSO penalty}\]

For example, for logistic regression of binary (0/1) outcome \(y_i\), the loss function can be written as:

\[LOSS_{lasso.logistic}(\beta) = -\frac{1}{n}\sum_{i=1}^n\left(y_i\Big(\beta_0 + \sum_{j=1}^pX_{ij}\beta_j\Big)-log\Big(1+e^{\beta_0 + \sum_{j=1}^pX_{ij}\beta_j}\Big)\right)+\lambda\sum_{j=1}^p|\beta_j|\]

Henceforth, we will only be looking at LASSO-regularized linear regression.

Ridge regression and elastic net regularization

Two regularization methods similar to LASSO are ridge regression and elastic net.

Ridge regression uses a different penalty based on the squares of the coefficients.

\[Loss_{ridge}(\beta) = Loss_{reg}(\beta) + \lambda\sum_{j=1}^p\beta_j^2\]

\(\lambda\) again serves as a penalty parameter that can be tuned via cross-validation.

Unlike LASSO, ridge regression does not shrink coefficients to zero (unless \(\lambda=\infty\)), so might be a better choice if all predictors are expected to have non-zero coefficients.

Elastic net combines the LASSO and ridge penalties, with an additional mixing parameter \(\alpha\), \(0\le\alpha\le1\), which controls how much of each penalty to use:

\[Loss_{elastic}(\beta) = Loss_{reg}(\beta) + \lambda\left(\alpha \sum_{j=1}^p|\beta_j| + [1-\alpha] \sum_{j=1}^p\beta_j^2 \right)\]

  • when \(\alpha=1\), only the LASSO penalty is used
  • when \(\alpha=0\), only the ridge penalty is used
  • when \(0<\alpha<1\), both penalties are used, and some coefficients may shrink to zero, though less so than when \(\alpha=1\)

For the remainder of the seminar, we focus on LASSO because most inference methods for regularized models have been developed for it.

The glmnet package for LASSO

glmnet package

The glmnet package (Friedman et al., 2010):

  • performs LASSO and related regularization methods (ridge regression and elastic net)
  • can regularize generalized linear models (e.g. linear, logistic, Poisson, Cox)
  • can tune hyperparameters via cross-validation
  • is highly optimized for speed
#load glmnet package
library(glmnet)

LASSO data

We demonstrate usage of the glmnet package for LASSO-regularized models with the following simulated data:

  • \(n=200\)
  • outcome: \(y\)
  • 110 predictors: \(x1-x10\), \(z1-z100\)
    • each \(x\) has coefficient 1
    • each \(z\) has coefficient 0
    • all 110 predictors are correlated at \(\rho=.2\) with each other
  • error variance \(\sigma_e^2=1\)
# load simulated data to demonstrate glmnet LASSO
d_lasso <- read.csv("https://stats.oarc.ucla.edu/wp-content/uploads/2025/08/lasso_sim.csv")
names(d_lasso) # column names
  [1] "y"    "x1"   "x2"   "x3"   "x4"   "x5"   "x6"   "x7"   "x8"   "x9"  
 [11] "x10"  "z1"   "z2"   "z3"   "z4"   "z5"   "z6"   "z7"   "z8"   "z9"  
 [21] "z10"  "z11"  "z12"  "z13"  "z14"  "z15"  "z16"  "z17"  "z18"  "z19" 
 [31] "z20"  "z21"  "z22"  "z23"  "z24"  "z25"  "z26"  "z27"  "z28"  "z29" 
 [41] "z30"  "z31"  "z32"  "z33"  "z34"  "z35"  "z36"  "z37"  "z38"  "z39" 
 [51] "z40"  "z41"  "z42"  "z43"  "z44"  "z45"  "z46"  "z47"  "z48"  "z49" 
 [61] "z50"  "z51"  "z52"  "z53"  "z54"  "z55"  "z56"  "z57"  "z58"  "z59" 
 [71] "z60"  "z61"  "z62"  "z63"  "z64"  "z65"  "z66"  "z67"  "z68"  "z69" 
 [81] "z70"  "z71"  "z72"  "z73"  "z74"  "z75"  "z76"  "z77"  "z78"  "z79" 
 [91] "z80"  "z81"  "z82"  "z83"  "z84"  "z85"  "z86"  "z87"  "z88"  "z89" 
[101] "z90"  "z91"  "z92"  "z93"  "z94"  "z95"  "z96"  "z97"  "z98"  "z99" 
[111] "z100"

The glmnet package works best with the predictors and outcome stored as separate matrix objects (or vector if a single column):

y <- as.vector(d_lasso[,1]) # convert first column, y, to vector 
xz <- as.matrix(d_lasso[,2:111]) # convert columns 2:111 to matrix, first 10 columns are x vars

glmnet() for LASSO

The glmnet() function fits regularized generalized linear models, by default using LASSO.

Important arguments:

  • x=: predictor matrix
  • y=: outcome vector
  • family=: outcome/error distribution family (defaults to "gaussian")
  • alpha=: elastic net mixing parameter controlling how much of the ridge and LASSO penalties to use; defaults to alpha=1 for LASSO

glmnet() fits LASSO models across a sequence of \(\lambda\) values:

  • max \(\lambda\) is smallest \(\lambda\) that shrinks all coefficients to zero
  • min \(\lambda\) is max \(\lambda\) divided by 10,000 if \(n>p\), or divided by 100 if \(p>n\) (where more regularization is required)
  • 100 evenly spaced \(\lambda\) values chosen between min and max (use nlambda= to change number)
  • use lambda= to specify a custom sequence

We fit LASSO-regularized linear regressions for our simulated data across the sequence of \(\lambda\) values using glmnet():

# first 2 arguments are x= and y=
fit_lasso <- glmnet(xz, y)
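For reference, the call above is equivalent to spelling out the defaults explicitly; changing alpha= would instead fit ridge (alpha=0) or elastic net (0<alpha<1) models:

# equivalent call with the defaults written out explicitly
fit_lasso2 <- glmnet(x=xz, y=y, family="gaussian", alpha=1, nlambda=100)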

Working with glmnet objects

Rather than extracting results directly from the fitted glmnet object, use these extractor functions:

  • plot(): coefficient path plot
  • print(): summary of model fit at each \(\lambda\)
  • coef(): coefficient estimates at specific \(\lambda\) value(s)
  • predict(): model-based predictions

Coefficient path plot

plot() used on a glmnet fitted object creates the coefficient path plot:

# coefficient path plot
plot(fit_lasso)

Remember that \(\lambda\) and the penalty decrease as we move to the right.

  • 10 non-zero coefficients emerge at relatively high penalties on the left, representing the 10 \(x\) variables (and end up near their values of 1)
  • the coefficients for the 100 \(z\) variables begin to emerge at lower penalties (but stay relatively close to their true values of 0)
  • at the lowest \(\lambda\) value, 108 total coefficients are non-zero

Summary of model fit at each \(\lambda\)

Using print() on the fitted glmnet object (or just typing the object’s name) summarizes the model fit at each \(\lambda\) value.

In the table below,

  • Df: number of non-zero coefficients
  • %Dev: percent of null deviance explained, a measure of model fit to the training data
    • null deviance measures difference in model fit between saturated model (perfectly predicts outcome) and null model (intercept-only)
      • \(Deviance_{null} = 2\times(LogLik_{saturated}-LogLik_{null})\)
    • \(\%Dev = \frac{Deviance_{null} - Deviance_{fitted}}{Deviance_{null}}\times100\)
    • %Dev thus measures how much the null deviance has been reduced by fitted model
  • Lambda: \(\lambda\) used to fit model

Note: if the number of rows in the summary table is less than 100 (or whatever nlambda= equals), glmnet() stopped early because %Dev was not changing enough as \(\lambda\) was decreasing

# summary of glmnet path (model fit at each lambda)
# same as print(fit_lasso)
# glmnet stops at 83rd lambda because %Dev did not change enough at 84th
fit_lasso

Call:  glmnet(x = xz, y = y) 

    Df  %Dev Lambda
1    0  0.00 3.6060
2    2  9.58 3.2850
3    7 21.09 2.9930
4    9 33.13 2.7280
5    9 43.64 2.4850
6   10 52.38 2.2640
7   10 60.00 2.0630
8   10 66.32 1.8800
9   10 71.56 1.7130
10  10 75.91 1.5610
11  10 79.53 1.4220
12  10 82.53 1.2960
13  10 85.02 1.1810
14  10 87.09 1.0760
15  10 88.80 0.9802
16  10 90.23 0.8931
17  10 91.41 0.8138
18  10 92.40 0.7415
19  10 93.21 0.6756
20  10 93.89 0.6156
21  10 94.45 0.5609
22  10 94.92 0.5111
23  10 95.30 0.4657
24  10 95.63 0.4243
25  10 95.89 0.3866
26  12 96.12 0.3523
27  13 96.32 0.3210
28  13 96.49 0.2925
29  13 96.63 0.2665
30  13 96.74 0.2428
31  13 96.84 0.2212
32  14 96.92 0.2016
33  16 96.99 0.1837
34  17 97.04 0.1674
35  19 97.14 0.1525
36  19 97.22 0.1389
37  19 97.28 0.1266
38  20 97.34 0.1154
39  21 97.39 0.1051
40  21 97.43 0.0958
41  22 97.46 0.0873
42  26 97.50 0.0795
43  30 97.55 0.0724
44  35 97.64 0.0660
45  37 97.72 0.0602
46  45 97.81 0.0548
47  47 97.90 0.0499
48  52 97.98 0.0455
49  54 98.06 0.0415
50  59 98.14 0.0378
51  61 98.22 0.0344
52  65 98.28 0.0314
53  70 98.34 0.0286
54  72 98.40 0.0260
55  73 98.44 0.0237
56  74 98.48 0.0216
57  78 98.51 0.0197
58  81 98.54 0.0180
59  83 98.57 0.0164
60  84 98.59 0.0149
61  90 98.61 0.0136
62  92 98.63 0.0124
63  94 98.65 0.0113
64  94 98.66 0.0103
65  94 98.68 0.0094
66  95 98.69 0.0085
67  97 98.70 0.0078
68  99 98.71 0.0071
69 100 98.71 0.0064
70 101 98.72 0.0059
71 104 98.72 0.0054
72 104 98.73 0.0049
73 103 98.73 0.0044
74 103 98.74 0.0040
75 104 98.74 0.0037
76 106 98.74 0.0034
77 106 98.75 0.0031
78 106 98.75 0.0028
79 107 98.75 0.0025
80 108 98.75 0.0023
81 108 98.75 0.0021
82 108 98.75 0.0019
83 108 98.76 0.0018

Estimated coefficients

Use coef() on a glmnet fitted object to see the estimated coefficients at specific \(\lambda\) values.

Use s= to specify a specific \(\lambda\) value.

In the previous summary table:

  • 2 coefficients are non-zero at the second \(\lambda\) in the sequence, \(\lambda_2\approx3.285\)
  • 10 coefficients are non-zero at the 10th \(\lambda\), \(\lambda_{10}\approx1.561\)
  • 59 coefficients are non-zero at the 50th \(\lambda\), \(\lambda_{50}\approx0.0378\)

Notice in the output below:

  • dots represent zero coefficients
  • coefficients grow as \(\lambda\) decreases
  • coefficients for the 10 \(x\) predictors are non-zero before coefficients for any of the \(z\) predictors
# estimated coefficients at 2nd, 10th, 50th lambda in sequence
# use lambda in fitted object for exact lambda values
coef(fit_lasso, s=fit_lasso$lambda[c(2,10,50)])
111 x 3 sparse Matrix of class "dgCMatrix"
            s=3.28532713 s=1.56079655  s=0.03777328
(Intercept)   -0.2774453   -0.1540268 -0.0788428868
x1             .            0.5581107  1.0833645939
x2             .            0.4656223  0.9903312049
x3             .            0.5624802  1.0888907504
x4             .            0.4084946  0.8699055576
x5             .            0.5186582  0.9046030148
x6             .            0.6403637  1.0915202716
x7             0.2475144    0.7299907  1.0450388127
x8             .            0.4786788  0.9035429299
x9             0.2443391    0.6795551  0.9668759598
x10            .            0.3266957  0.9809235209
z1             .            .          .           
z2             .            .          .           
z3             .            .         -0.0206654766
z4             .            .          .           
z5             .            .         -0.0066959227
z6             .            .          .           
z7             .            .          .           
z8             .            .          .           
z9             .            .         -0.0244619431
z10            .            .          .           
z11            .            .         -0.0016520651
z12            .            .          .           
z13            .            .         -0.0839619427
z14            .            .          .           
z15            .            .          .           
z16            .            .          .           
z17            .            .          .           
z18            .            .         -0.1062617174
z19            .            .          0.0676161596
z20            .            .          0.0596107857
z21            .            .          0.0839758311
z22            .            .         -0.0552335796
z23            .            .          .           
z24            .            .          .           
z25            .            .          .           
z26            .            .         -0.0209694844
z27            .            .          0.1475372478
z28            .            .          .           
z29            .            .          .           
z30            .            .          .           
z31            .            .         -0.0120957621
z32            .            .          .           
z33            .            .          .           
z34            .            .          0.0345114284
z35            .            .          .           
z36            .            .          .           
z37            .            .         -0.0408881624
z38            .            .          .           
z39            .            .         -0.0023945306
z40            .            .         -0.0135155559
z41            .            .          .           
z42            .            .          .           
z43            .            .          0.0501765464
z44            .            .          0.0578256815
z45            .            .         -0.0682585395
z46            .            .         -0.0033829427
z47            .            .          0.0458979268
z48            .            .          .           
z49            .            .         -0.0235085731
z50            .            .          .           
z51            .            .          .           
z52            .            .         -0.0667957449
z53            .            .          .           
z54            .            .          0.0002443637
z55            .            .          .           
z56            .            .          .           
z57            .            .          0.0242324411
z58            .            .          0.0308628212
z59            .            .          .           
z60            .            .          0.0353664237
z61            .            .          .           
z62            .            .          0.0367126979
z63            .            .          .           
z64            .            .          .           
z65            .            .          0.0902023966
z66            .            .          .           
z67            .            .         -0.1160661948
z68            .            .          .           
z69            .            .          .           
z70            .            .          0.0817966076
z71            .            .          .           
z72            .            .          0.0021532024
z73            .            .          0.0424245272
z74            .            .          .           
z75            .            .          .           
z76            .            .          .           
z77            .            .         -0.0187693362
z78            .            .          .           
z79            .            .          0.1060738780
z80            .            .         -0.2900691069
z81            .            .          .           
z82            .            .          .           
z83            .            .         -0.0577672389
z84            .            .          0.0453429773
z85            .            .          .           
z86            .            .          .           
z87            .            .          0.0229062192
z88            .            .         -0.0539264755
z89            .            .          .           
z90            .            .          0.0169112693
z91            .            .          .           
z92            .            .         -0.0321959400
z93            .            .         -0.0416651462
z94            .            .         -0.0088839342
z95            .            .          0.0443730370
z96            .            .          0.1666781834
z97            .            .          .           
z98            .            .          .           
z99            .            .          0.0393199615
z100           .            .         -0.0778771951

Model predictions

Use predict() on a fitted glmnet object to estimate predicted outcomes at specific \(\lambda\) values.

A data set of predictors must be supplied to predict() with newx=.

# training data outcome predictions at three lambda values
y_pred <- predict(fit_lasso, newx=xz, s=fit_lasso$lambda[c(2,10,50)])

# first 10 rows of predictions at each lambda
head(y_pred, n=10)
      s=3.28532713 s=1.56079655 s=0.03777328
 [1,]  -0.27727004   -2.2292917   -4.5230702
 [2,]  -0.29023088   -2.3982307   -4.1869625
 [3,]  -0.48639430   -0.5071706   -0.9317479
 [4,]  -0.01351316    0.4239464    0.6954105
 [5,]  -0.45602893   -2.2999218   -3.9897567
 [6,]   0.09896796    4.1854244    7.4381806
 [7,]  -0.17889600   -3.0588763   -6.3669035
 [8,]  -0.17626043    0.8637251    1.9385714
 [9,]  -0.24148140   -0.4175899   -1.9661407
[10,]  -0.09918585    2.4069175    4.1820346

Below we plot the observed outcome (black) and the predicted outcome (red) at the 3 \(\lambda\) values.

  • at higher \(\lambda\) penalties, the predictions hover close to the intercept because only 2 coefficients are non-zero and are themselves shrunken towards zero
  • as the \(\lambda\) penalty decreases, the predictions move closer to the outcome
  • the right-most model is overfitted, as only 10 predictors should have non-zero coefficients rather than 59
# plot observed and predicted outcomes for models with 3 lambda values
#  each plot() below plots the observed data in black
#  each points() below plots the predictions
par(mfrow=c(1,3)) # set up plotting area for 3 plots in one row
obs <- 1:200 # observation index (row in data set) is x-axis
plot(obs, y, main=expression(lambda[2]%~~%3.285), xlab="observation index")
points(obs, y_pred[,1], col="red")
plot(obs, y, main=expression(lambda[10]%~~%1.561), xlab="observation index")
points(obs, y_pred[,2], col="red")
plot(obs, y, main=expression(lambda[50]%~~%0.0378), xlab="observation index")
points(obs, y_pred[,3], col="red")

par(mfrow=c(1,1)) # restore plotting area to one plot per row

Predictions can be made for new data specified in newx=.

Below we get predictions for \(x1=-3,-2,...,3\), the range of \(x1\) observed in the data, while holding all other 109 predictors at 0, at the same 3 \(\lambda\) values we used above.

# create new predictor matrix
xz_new <- matrix(rep(0, 7*110), nrow=7) # start with a 7X110 matrix of zeroes 
xz_new[,1] <- -3:3 # make first column (x1) -3 through 3

y_pred_x1 <- predict(fit_lasso, newx=xz_new, s=fit_lasso$lambda[c(2,10,50)])
y_pred_x1
     s=3.28532713 s=1.56079655 s=0.03777328
[1,]   -0.2774453   -1.8283590  -3.32893667
[2,]   -0.2774453   -1.2702483  -2.24557207
[3,]   -0.2774453   -0.7121375  -1.16220748
[4,]   -0.2774453   -0.1540268  -0.07884289
[5,]   -0.2774453    0.4040839   1.00452171
[6,]   -0.2774453    0.9621946   2.08788630
[7,]   -0.2774453    1.5203053   3.17125090

A plot of the predictions against \(x1=-3,...,3\) shows how its coefficient starts at 0 at high \(\lambda\) and grows as \(\lambda\) decreases:

# plot predicted y (red) across x1=-3:3, for models with 3 lambda values
#  each plot() below plots the observed y vs observed x1 in black
#  each points() below plots predicted y vs observed x1 in red
par(mfrow=c(1,3)) # set up plotting area for 3 plots in one row
plot(xz[,1], y, main=expression(paste(lambda[2]%~~%3.285, ", ", beta[1]==0)), xlab="x1")
points(xz_new[,1], y_pred_x1[,1], col="red", type="b")
plot(xz[,1], y, main=expression(paste(lambda[10]%~~%1.561, ", ", beta[1]==0.558)), xlab="x1")
points(xz_new[,1], y_pred_x1[,2], col="red", type="b")
plot(xz[,1], y, main=expression(paste(lambda[50]%~~%0.0378, ", ", beta[1]==1.083)), xlab="x1")
points(xz_new[,1], y_pred_x1[,3], col="red", type="b")

par(mfrow=c(1,1)) # restore plotting area to one plot per row

Cross-validation with cv.glmnet()

Generally, a good value of \(\lambda\) to use is not known a priori, so it is tuned via \(K\)-fold cross-validation with cv.glmnet().

Before splitting the data, glmnet() is run on all the training data to determine the \(\lambda\) sequence.

Then, cv.glmnet() splits the data into 10 folds (by default, can be changed with nfolds=), and for each fold:

  • trains models across the \(\lambda\) sequence with glmnet() using data in all other folds
  • estimates test error across the \(\lambda\) sequence in the current fold

After progressing through each fold, the test error at each \(\lambda\) is averaged over the \(K\) folds.

For linear regressions, the default test error measure is MSE, but it can be changed with the type.measure= argument.

Let’s use cv.glmnet() to tune \(\lambda\)

# cross-validate to tune lambda
set.seed(1328) # set random seed to replicate results
cv_fit_lasso <- cv.glmnet(xz, y) 
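The call above uses the defaults; for reference, the number of folds and the error measure could also be set explicitly (standard cv.glmnet arguments):

# equivalent call with the main defaults written out explicitly (not run)
# cv_fit_lasso2 <- cv.glmnet(xz, y, nfolds=10, type.measure="mse")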

Working with the cv.glmnet object

Using plot() on a cv.glmnet fitted object displays how the estimated test error varies with \(\lambda\).

# plot of -log(lambda) vs estimated MSE
plot(cv_fit_lasso)

In the plot above:

  • bottom x-axis is again the negative log of lambda, so \(\lambda\) and the penalty decrease as we move to the right
  • top x-axis is again the number of non-zero coefficients
  • y-axis is MSE averaged across the 10-folds
  • the dotted lines indicate 2 important \(\lambda\) values
    • the right dotted line is "lambda.min", the \(\lambda\) value that produced the lowest MSE
    • the left dotted line is "lambda.1se", the largest \(\lambda\) that produces an MSE within one standard error of the MSE at "lambda.min"
      • "lambda.1se" might be preferred to "lambda.min" if parsimony is critical
  • we see the typical U-shaped curve for flexibility vs test error, where adding predictors initially causes large reductions in MSE due to decreases in bias but eventually MSE increases as the model overfits


Using print() on a fitted cv.glmnet object provides a summary of the model at "lambda.min" and "lambda.1se".

  • Lambda is the \(\lambda\) value
  • Index is the position of the \(\lambda\) value in the sequence
  • Measure is the model fit statistic (here MSE) used to select "lambda.min" and "lambda.1se"
  • SE is the standard error of Measure
  • Nonzero is the number of non-zero coefficients
# summary of models at lambda.min and lambda.1se
cv_fit_lasso

Call:  cv.glmnet(x = xz, y = y) 

Measure: Mean-Squared Error 

     Lambda Index Measure     SE Nonzero
min 0.09577    40   1.173 0.1404      21
1se 0.20158    32   1.298 0.1591      14
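The two tuned \(\lambda\) values can also be extracted directly from the fitted cv.glmnet object:

# extract the tuned lambda values
cv_fit_lasso$lambda.min  # lambda minimizing cross-validated MSE
cv_fit_lasso$lambda.1se  # largest lambda with MSE within 1 SE of the minimum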


Use coef() to view the non-zero coefficients at "lambda.min" and "lambda.1se" (only one s= value is allowed in coef() for cv.glmnet objects):

cbind(coef(cv_fit_lasso, s="lambda.min"), coef(cv_fit_lasso, s="lambda.1se"))
111 x 2 sparse Matrix of class "dgCMatrix"
              lambda.min   lambda.1se
(Intercept) -0.055875689 -0.059932829
x1           1.014179329  0.976740851
x2           0.972657853  0.941136158
x3           1.080458555  1.044742296
x4           0.868502542  0.850174522
x5           0.865512174  0.830586236
x6           1.054412159  1.049472385
x7           1.009644318  0.999929630
x8           0.899071153  0.897799862
x9           0.923509777  0.921528197
x10          0.955276944  0.920458581
z1           .            .          
z2           .            .          
z3           .            .          
z4           .            .          
z5           .            .          
z6           .            .          
z7           .            .          
z8           .            .          
z9           .            .          
z10          .            .          
z11          .            .          
z12          .            .          
z13          .            .          
z14          .            .          
z15          .            .          
z16          .            .          
z17          .            .          
z18          .            .          
z19          0.006675561  .          
z20          0.035252917  .          
z21          0.019965285  .          
z22          .            .          
z23          .            .          
z24          .            .          
z25          .            .          
z26          .            .          
z27          0.095711793  0.046942148
z28          .            .          
z29          .            .          
z30          .            .          
z31          .            .          
z32          .            .          
z33          .            .          
z34          .            .          
z35          .            .          
z36          .            .          
z37          .            .          
z38          .            .          
z39          .            .          
z40          .            .          
z41          .            .          
z42          .            .          
z43          .            .          
z44          .            .          
z45          .            .          
z46          .            .          
z47          0.021294076  .          
z48          .            .          
z49          .            .          
z50          .            .          
z51          .            .          
z52          .            .          
z53          .            .          
z54          .            .          
z55          .            .          
z56          .            .          
z57          0.006398966  .          
z58          .            .          
z59          .            .          
z60          .            .          
z61          .            .          
z62          .            .          
z63          .            .          
z64          .            .          
z65          0.046140567  0.029272534
z66          .            .          
z67          .            .          
z68          .            .          
z69          .            .          
z70          0.035666459  .          
z71          .            .          
z72          .            .          
z73          0.036471921  0.002720046
z74          .            .          
z75          .            .          
z76          .            .          
z77          .            .          
z78          .            .          
z79          .            .          
z80         -0.157849444  .          
z81          .            .          
z82          .            .          
z83          .            .          
z84          .            .          
z85          .            .          
z86          .            .          
z87          .            .          
z88          .            .          
z89          .            .          
z90          .            .          
z91          .            .          
z92          .            .          
z93          .            .          
z94          .            .          
z95          .            .          
z96          0.115250859  0.068722850
z97          .            .          
z98          .            .          
z99          .            .          
z100         .            .          

Now that we know how to tune \(\lambda\), we can proceed to inference

Inference after LASSO

Inference from machine learning models

Inference is generally more difficult with machine learning

  • many ML methods do not produce directly interpretable parameter estimates
  • complex interactions and non-linear relationships may be in the model
  • ML methods generally do not provide measures of uncertainty, preventing the construction of confidence intervals and test-statistics for p-values
  • inference should generally not be performed on the same data used to select a set of candidate predictors

Inference after LASSO is generally complicated only by the latter two concerns.

Several methods have been developed to perform inference for regularized model fits, with most focusing on the LASSO to address both concerns.

Post-selection inference

One often overlooked assumption of regression models is that the set of predictors is fixed before looking at their relationship with the outcome.

If we use data to select a “best” or “relevant” set of predictors and then use the same data for inference, this assumption is violated.

Common selection methods: forward/backward/stepwise selection, predictors whose correlation with the outcome surpasses a threshold, predictors whose coefficients are significantly different from zero when modeled alone, LASSO

Generally, ignoring the uncertainty of which predictors are relevant will cause underestimation of uncertainty in model coefficients, leading to confidence intervals that are too narrow and overly optimistic p-values.

Using the same data for selection and inference has been so widely practiced that it likely contributes to the reproducibility crisis in science (Taylor & Tibshirani, 2015)

Simulating distorted inference after selection

A simple simulation demonstrates this underestimation of uncertainty due to selection:

  • y and x are both normally distributed with no relation, i.e. \(\beta_x=0\)
  • for each of 10,000 simulations:
    • draw 100 observations of y and x
    • calculate the correlation between y and x
    • regress y on x and record the p-value and confidence interval for \(\hat\beta_x\)
# simulation code
set.seed(12345)
n <- 100 # sample size
nsim <- 10000 # number of simulations
rho <- rep(0, nsim) # correlation(x,y)
b <- rep(0, nsim) # lm coef
pval <- rep(0, nsim) # pvalues
ll <- rep(0, nsim) # CI lower
ul <- rep(0, nsim) # CI upper
for (i in 1:nsim) {
  ysim <- rnorm(n) # draw ysim
  xsim <- rnorm(n) # draw x independently of y
  rho[i] <- cor(ysim, xsim) # sample correlation
  m <- lm(ysim ~ xsim) # lm model
  b[i] <- coef(m)["xsim"]  # save coefficient
  pval[i] <- summary(m)$coefficients["xsim", "Pr(>|t|)"]  # save pvalue
  ci <- confint(m)["xsim",] # get CI
  ll[i] <- ci[1] # save CI
  ul[i] <- ci[2]
}


sel_inf <- data.frame(rho=rho, pval=pval, ll=ll, ul=ul) # put it all in data frame

For selection, we will report p-values and confidence intervals when the correlation between x and y is greater than .2 (or less than -.2)

# select estimates where the |correlation| (rho) was greater than 0.2
selected <- which(abs(sel_inf$rho)>.2) # indices of selected

Under the null hypothesis that \(\beta_x=0\), which is true, p-values have a uniform distribution (in blue below) and the probability of observing \(p<=0.05\) is \(0.05\).

The p-values for \(\hat{\beta_x}\) after selection (henceforth known as selection p-values) have a much different distribution, concentrated near 0 (in red below).

# histograms of all p-values and selection p-values
par(mfrow=c(2,1),
    mar=c(4,4,2,1),
    cex=.8)
mybreaks <- pretty(range(sel_inf$pval), n=50)
plot(hist(sel_inf$pval, plot=F, breaks=mybreaks), col=rgb(0,0,1, alpha=.5), main="all p-values have a uniform distribution", xlab = "p-value")
plot(hist(sel_inf$pval[selected], plot=F, breaks=mybreaks), col=rgb(1,0,0, alpha=.5), main="p-values after selection have a very different distribution", xlab="p-value")

For all p-values, the probability that \(p<.05\) is indeed near 0.05:

# probability of p<0.05 for all p-values
sum(sel_inf$pval < 0.05)/length(sel_inf$pval)
[1] 0.0495

Thus, if we were to always model x without selection, the probability of a false positive is about 0.05.

However, for the selection p-values, the probability of a false positive is 1!

# probability of p<0.05 for selection p-values
sum(sel_inf$pval[selected] < 0.05)/length(sel_inf$pval[selected])
[1] 1

Thus, these p-values calculated only for selected samples certainly do not represent the probability of observing an extreme test statistic (\(|t|\gt t_{crit}\)) under the null hypothesis.

Similarly, confidence intervals inferred from the same data used for selection will not have their desired properties.

95% confidence intervals should contain the true parameter value, \(\beta_x=0\) in 95% of samples, i.e. 95% coverage.

This is true in our simulation for all coefficients:

# proportion of all confidence intervals that contain 0
sum((sel_inf$ll < 0) & (0 < sel_inf$ul))/length(sel_inf$ll)
[1] 0.9505

The selection confidence intervals have 0% coverage, because they never contain 0 (which coincides with all selection p-values being less than 0.05):

# proportion of selection confidence intervals that contain 0
sum(sel_inf$ll[selected] < 0 & 0 < sel_inf$ul[selected])/length(sel_inf$ll[selected])
[1] 0

Selective Inference

A number of methods collectively called selective inference have been developed that address the problem of inference after a selection procedure (Taylor & Tibshirani, 2015).

Selective inference methods can be broadly classified into 3 categories (Kuchibhotla et al., 2022):

  • sample splitting
  • conditional selective inference approaches, which calculate p-values and confidence intervals conditioned on the selection
  • simultaneous inference approaches, which frame selective inference as a multiple comparison problem

Sample splitting

The simplest method to avoid problems of using the same data to select predictors and then again for inference is sample splitting.

In sample splitting, the data are randomly split into 2 parts (Wasserman, 2009):

  • the first subset is used for predictor selection via, for example, LASSO with \(\lambda\) tuned via cross-validation
  • the second subset is used for inference based on an unpenalized model with the predictors selected in the first subset
  • to determine how much data to assign to each split:
    • more data in the first subset improves tuning \(\lambda\) and may stabilize the set of selected predictors
    • more data in the second subset improves power and inference, e.g. narrower confidence intervals

The major disadvantage of sample splitting is that less data is used for both selection and inference.

We demonstrate by splitting our data d_lasso into halves.

# split data into two halves
set.seed(53257)
ind <- sample(1:200, size=100) # 100 random observation numbers
d_sel <- d_lasso[ind,] # half for selection
d_inf <- d_lasso[-ind,] # half for inference

Then we tune \(\lambda\) in the first half, and use the predictors selected at "lambda.1se" for a more parsimonious model.

set.seed(82945)
xz_sel <- as.matrix(d_sel[,2:111]) # selection data predictor matrix
y_sel <- as.matrix(d_sel[,1]) # selection data outcome vector
cv_fit_sel <- cv.glmnet(xz_sel, y_sel) # cross-validate to tune lambda and select predictors

coef(cv_fit_sel, s="lambda.1se") # non-zero coefficients at lambda.1se
111 x 1 sparse Matrix of class "dgCMatrix"
             lambda.1se
(Intercept) -0.09829234
x1           1.04015614
x2           0.89411437
x3           1.01871272
x4           0.85994520
x5           0.51793015
x6           0.97266368
x7           1.03250572
x8           0.77971793
x9           0.81640041
x10          1.14665334
z1           .         
z2           .         
z3           .         
z4           .         
z5           .         
z6           .         
z7           .         
z8           .         
z9           .         
z10          .         
z11          .         
z12          .         
z13          .         
z14          .         
z15          .         
z16          .         
z17          .         
z18          .         
z19          .         
z20          .         
z21          0.10113821
z22          .         
z23          .         
z24          .         
z25          .         
z26          .         
z27          0.05186625
z28          .         
z29          .         
z30          .         
z31          .         
z32          .         
z33          .         
z34          .         
z35          .         
z36          .         
z37          .         
z38          .         
z39          .         
z40          .         
z41          .         
z42          .         
z43          .         
z44          .         
z45          .         
z46          .         
z47          .         
z48          .         
z49          .         
z50          .         
z51          .         
z52          .         
z53          .         
z54          0.12353828
z55          .         
z56          .         
z57          .         
z58          .         
z59          .         
z60          .         
z61          .         
z62          .         
z63          .         
z64          .         
z65          .         
z66          .         
z67          .         
z68          .         
z69          .         
z70          .         
z71          .         
z72          .         
z73          0.01223998
z74          .         
z75          .         
z76          .         
z77          .         
z78          .         
z79          .         
z80          .         
z81          .         
z82          .         
z83          .         
z84          .         
z85          .         
z86          .         
z87          0.01532838
z88          .         
z89          .         
z90          .         
z91          .         
z92          .         
z93          .         
z94          .         
z95          .         
z96          .         
z97          .         
z98          .         
z99          .         
z100         .         

Above we see 15 non-zero coefficients (the 10 \(x\) predictors and 5 \(z\) predictors), plus the intercept.

The columns of the training predictor matrix corresponding to the non-zero coefficients can be accessed in the object returned by coef() with @i.

# columns of xz_sel corresponding to non-zero coefficients
xz_sel_ind <- coef(cv_fit_sel, s="lambda.1se")@i # column indices of selected predictors
xz_sel_ind # 0th column is intercept and is always selected
 [1]  0  1  2  3  4  5  6  7  8  9 10 31 37 64 83 97

We can then perform inference on these columns using the second subset. We use lm() for an unpenalized model (i.e. coefficients are unbiased, unlike those in LASSO):

# run linear regression on 2nd subset using predictors selected in 1st subset
f <- paste("y ~", paste(colnames(xz_sel)[xz_sel_ind[-1]], collapse="+")) # create lm formula using column names of xz_sel
f
[1] "y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+z21+z27+z54+z73+z87"
m_inf <- lm(f, data=d_inf) # unpenalized model
summary(m_inf)

Call:
lm(formula = f, data = d_inf)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.77695 -0.68562 -0.01483  0.70274  2.58269 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.04329    0.11871  -0.365    0.716    
x1           0.91117    0.12544   7.264 1.77e-10 ***
x2           0.99319    0.13664   7.268 1.73e-10 ***
x3           1.02833    0.12860   7.997 6.16e-12 ***
x4           0.92140    0.13877   6.640 2.92e-09 ***
x5           1.17323    0.13302   8.820 1.37e-13 ***
x6           1.20620    0.13039   9.251 1.85e-14 ***
x7           1.08318    0.12649   8.563 4.49e-13 ***
x8           0.94158    0.13322   7.068 4.29e-10 ***
x9           0.98493    0.12835   7.674 2.71e-11 ***
x10          0.84850    0.13792   6.152 2.49e-08 ***
z21         -0.08537    0.13673  -0.624    0.534    
z27          0.17788    0.14427   1.233    0.221    
z54         -0.11317    0.13074  -0.866    0.389    
z73          0.10923    0.13178   0.829    0.410    
z87         -0.12212    0.13232  -0.923    0.359    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.11 on 84 degrees of freedom
Multiple R-squared:  0.9683,    Adjusted R-squared:  0.9627 
F-statistic: 171.1 on 15 and 84 DF,  p-value: < 2.2e-16

The confidence intervals for all of the regression coefficients contain the true parameter values (\(\beta_0=0\), \(\beta_{x1}=\beta_{x2}=...=\beta_{x10}=1\), \(\beta_{z1}=\beta_{z2}=...=\beta_{z100}=0\)):

# CIs for unpenalized model run on independent data
m_inf_ci <- confint(m_inf)
m_inf_ci
                 2.5 %    97.5 %
(Intercept) -0.2793542 0.1927652
x1           0.6617064 1.1606250
x2           0.7214565 1.2649211
x3           0.7725980 1.2840567
x4           0.6454536 1.1973541
x5           0.9087133 1.4377542
x6           0.9469135 1.4654957
x7           0.8316320 1.3347250
x8           0.6766531 1.2065017
x9           0.7297027 1.2401619
x10          0.5742381 1.1227606
z21         -0.3572702 0.1865276
z27         -0.1090141 0.4647676
z54         -0.3731493 0.1468177
z73         -0.1528357 0.3712892
z87         -0.3852672 0.1410179

The unpenalized model run on the first data subset used for selection (remember, don’t do this!) results in a false positive test of significance for \(z21\):

# problematic inference if unpenalized model is run on data
#  used for selection
m_sel <- lm(f, data=d_sel) 
summary(m_sel)

Call:
lm(formula = f, data = d_sel)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.35513 -0.52145  0.02935  0.60783  2.12045 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.03296    0.09642  -0.342   0.7333    
x1           1.11606    0.10056  11.099  < 2e-16 ***
x2           0.98712    0.10426   9.468 6.80e-15 ***
x3           1.11358    0.12156   9.160 2.82e-14 ***
x4           0.90027    0.10011   8.993 6.13e-14 ***
x5           0.59005    0.10604   5.564 3.07e-07 ***
x6           1.07014    0.10317  10.372  < 2e-16 ***
x7           1.03268    0.11405   9.055 4.61e-14 ***
x8           0.84885    0.09158   9.269 1.70e-14 ***
x9           0.86572    0.13661   6.337 1.11e-08 ***
x10          1.19627    0.11337  10.552  < 2e-16 ***
z21          0.24648    0.10596   2.326   0.0224 *  
z27          0.10270    0.10197   1.007   0.3168    
z54          0.14421    0.11117   1.297   0.1981    
z73          0.04630    0.09679   0.478   0.6336    
z87          0.05788    0.10884   0.532   0.5962    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8704 on 84 degrees of freedom
Multiple R-squared:  0.9831,    Adjusted R-squared:   0.98 
F-statistic: 325.1 on 15 and 84 DF,  p-value: < 2.2e-16

Stability sampling

A common problem encountered with LASSO selection is that the best set of predictors may be sensitive to how the data are split during cross-validation.

  • correlated predictors may be alternately selected
  • unrelated predictors may be selected in noisy data

Below we re-run the \(\lambda\) cross-validation and predictor selection 3 times and see that the number of non-zero coefficients varies at "lambda.min" and "lambda.1se":

# selected predictor set varies across cross-validation runs
set.seed(1290)
cv.glmnet(xz_sel, y_sel)

Call:  cv.glmnet(x = xz_sel, y = y_sel) 

Measure: Mean-Squared Error 

     Lambda Index Measure     SE Nonzero
min 0.09394    82   1.185 0.1192      29
1se 0.20715    65   1.296 0.1600      18
set.seed(2741)
cv.glmnet(xz_sel, y_sel)

Call:  cv.glmnet(x = xz_sel, y = y_sel) 

Measure: Mean-Squared Error 

    Lambda Index Measure     SE Nonzero
min 0.0856    84   1.215 0.1586      31
1se 0.2273    63   1.365 0.1992      17
set.seed(4895)
cv.glmnet(xz_sel, y_sel)

Call:  cv.glmnet(x = xz_sel, y = y_sel) 

Measure: Mean-Squared Error 

    Lambda Index Measure     SE Nonzero
min 0.1428    73   1.188 0.1322      20
1se 0.2614    60   1.316 0.2009      16

Stability sampling (Meinshausen & Bühlmann, 2010) addresses predictor set variability using this general algorithm:

  • resample a subset (e.g. half) of the training data repeatedly \(R\) times
    • simulates new data
    • can reduce predictor correlations
    • \(R=100\) is usually sufficient
  • use the predictor selection method (e.g. LASSO) on each resampled subset and record which predictors are selected
    • \(\lambda\) can be chosen before resampling using cross-validation on the entire training sample
    • or, \(\lambda\) can be tuned via cross-validation within each of the \(R\) resampled subsets
  • calculate the probability that each predictor is selected across the \(R\) resampled subsets
  • choose the predictors that surpass some threshold probability, \(\pi_{thr}\)
    • Meinshausen & Bühlmann (2010) suggest \(0.6 \le \pi_{thr} \le 0.9\) as sensible thresholds and note that the results tend not to be too sensitive to \(\pi_{thr}\)
    • higher values of \(\pi_{thr}\) result in more regularized predictor sets
    • \(\pi_{thr}\) can also be chosen to control the expected number of false positives (predictors with true zero coefficients, see Meinshausen & Bühlmann, 2010)

Below we use stability sampling on our training data (already a subset of 100 observations due to sample splitting) to stabilize selection of the predictor set:

# stability selection
set.seed(13459)
lam <- cv_fit_sel$lambda.min # use lambda.min run on entire training sample (after sample splitting)
selected <- rep(0, 110) # counts of being selected for 110 predictors 
R <- 100 # number of simulations
for (i in 1:R) {
  stab_ind <- sample(1:100, size=50) # 50 observations resampled each time
  xz_stab <- xz_sel[stab_ind,] # resampled predictors
  y_stab <- y_sel[stab_ind,] # resampled outcome
  ind_selected <- coef(glmnet(xz_stab, y_stab, lambda=lam))@i[-1] # column indices of selected predictors; -1 removes the intercept
  selected[ind_selected] <- selected[ind_selected] + 1 # increase counts for selected predictors
}

# predictor set at 4 different pi thresholds
which(selected >= .6*R) # selected in at least 60% of simulations
 [1]   1   2   3   4   5   6   7   8   9  10  31  97 105
which(selected >= .7*R)
 [1]  1  2  3  4  5  6  7  8  9 10 31
which(selected >= .8*R)
 [1]  1  2  3  4  5  6  7  8  9 10
which(selected >= .9*R)
 [1]  1  2  3  4  5  6  7  8  9 10

All 10 \(x\) predictors (corresponding to columns 1-10) are selected at every threshold, and they are the only predictors selected at \(\pi_{thr}=.8\) and \(\pi_{thr}=.9\).
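If we want predictor names rather than column indices, we can map the selected indices through the column names of the predictor matrix; a minimal sketch:

# map selected column indices back to predictor names
colnames(xz_sel)[which(selected >= .6*R)] # x1-x10 plus z21, z87, and z95 at the 0.6 threshold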

Stability selection with randomized LASSO

Meinshausen & Bühlmann (2010) recommend the use of the randomized LASSO to improve stability selection.

With the randomized LASSO, the penalties are perturbed randomly across the coefficients.

Imagine two predictors, \(X_1\) and \(X_2\) are correlated and equally predictive of the outcome \(Y\).

  • if in the sample at hand \(X_1\) happens to be more strongly predictive of \(Y\) due to sampling variability, LASSO may always choose \(X_1\), even though \(X_2\) is actually equally predictive
  • the randomized LASSO penalty perturbations will reduce the influence of \(X_1\) in some of the resampled subsets, simulating samples where \(X_2\) happens to be more strongly predictive of \(Y\) and thus allowing LASSO to select \(X_2\) in some samples

Essentially, the randomized LASSO adds weights to the penalty term:

\[Loss_{lasso}(\beta) = Loss_{reg}(\beta) + \lambda\sum_{j=1}^p\frac{1}{w_j}|\beta_j| \]

These penalty weights, \(w_j\), are drawn from a uniform distribution with bounds \([\alpha, 1]\), \(w_j \sim Uniform(\alpha, 1)\); values of \(0.2 \le \alpha \le 0.8\) are recommended, with lower values of \(\alpha\) adding more randomness to the penalties.

We will see in part 2 of this workshop how random forests introduce randomness in regression trees to improve prediction.

Below we implement the randomized LASSO in our stability sampling:

  • draw weights from runif() (one for each column of predictor matrix)
  • apply the reciprocal of these weights to LASSO using penalty.factor= in glmnet
set.seed(13769)
lam <- cv_fit_sel$lambda.min # use lambda.min run on entire training sample (after sample splitting)
selected_randomized <- rep(0, 110) # counts of being selected for 110 predictors 
R <- 100 # number of simulations
for (i in 1:R) {
  stab_ind <- sample(1:100, size=50) # 50 observations resampled each time
  xz_stab <- xz_sel[stab_ind,] # resampled predictors
  y_stab <- y_sel[stab_ind,] # resampled outcome
  pf <- 1/runif(110, .4, 1) # 1/w_j penalty weights
  
  ind_selected <- coef(glmnet(xz_stab, y_stab, lambda=lam, penalty.factor=pf))@i[-1] # apply weights using penalty.factor=
  selected_randomized[ind_selected] <- selected_randomized[ind_selected] + 1
}
# predictor sets at different pi thresholds 
which(selected_randomized >= .6*R)
 [1]   1   2   3   4   5   6   7   8   9  10  64 105
which(selected_randomized >= .7*R)
 [1]  1  2  3  4  5  6  7  8  9 10 64
which(selected_randomized >= .8*R)
 [1]  1  2  3  4  5  6  7  8  9 10
which(selected_randomized >= .9*R)
 [1]  1  2  3  4  5  6  7  8  9 10

For our simulated data, the randomized LASSO only provides a little more stability to the selected predictor set.

Conditional selective inference

Conditional selective inference methods calculate p-values and confidence intervals (CIs) that account for the fact that the set of predictors was chosen by examining their relationships with the outcome.

Conditional selective inference methods do not require the sample to be split, so the same data are used efficiently for selection and inference.

Recall that test-statistics used to evaluate null hypotheses for regression coefficients (and construct CIs) are formed by dividing the estimated coefficient by its standard error, e.g. \(t_\hat{\beta}\) in linear regression:

\[t_\hat{\beta_j} = \frac{\hat{\beta_j}}{\hat{\sigma}_{\hat{\beta_j}}}\]

\(t_\hat{\beta_j}\) will be near 0 if either \(\hat{\beta_j}\) is near 0 or if \(\hat{\sigma}_{\hat{\beta_j}}\) is very large.

In one conditional selective inference approach (Lee et al., 2016), the test statistic distribution is truncated to account for the fact that the selection procedure prevents predictors whose coefficients have test statistics near zero from entering the inference model.

For a given test statistic, this approach essentially calculates bounds defining the interval in which the test statistic must lie, given the selection event.

These bounds truncate the test statistic’s sampling distribution, and CIs and p-values are then calculated based on this truncated distribution.

To understand how a truncated distribution changes inference, imagine we randomly draw \(x\) repeatedly from a standard normal distribution, \(x \sim N(0,1)\):

  • the probability of observing \(x>2.5\) over all draws is quite low (.0062)

  • the probability of observing \(x>2.5\), given that \(x>2\) and \(x<4\), is much higher (0.27)
    • this is analogous to inference after a selection procedure that prevents predictors associated with small test-statistics from being in the model
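
Both probabilities can be computed directly with pnorm(); a quick check:

# probability of drawing x > 2.5 from a standard normal
pnorm(2.5, lower.tail=FALSE)                     # about 0.0062
# probability that x > 2.5, given that 2 < x < 4 (truncated distribution)
(pnorm(4) - pnorm(2.5)) / (pnorm(4) - pnorm(2))  # about 0.27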

Conditional selective inference with the selectiveInference package

The selectiveInference package (Lee et al., 2016) contains conditional selective inference methods for estimating p-values and confidence intervals.

library(selectiveInference)

The fixedLassoInf() function performs selective inference for a fixed \(\lambda\).

Technically, then, it should not be used on data where \(\lambda\) was determined by cross-validation rather than pre-determined. However, the performance of fixedLassoInf() with cross-validated \(\lambda\) has been shown to be “acceptable” via simulation (Kammer et al., 2022).

To use fixedLassoInf(), specify:

  • x=: the training predictor matrix
  • y=: the training outcome vector
  • alpha=: significance level for CIs, .1 by default.
  • beta=: estimated LASSO coefficients
    • important: fixedLassoInf() expects coefficients for unstandardized predictors, so set standardize=F in glmnet() or cv.glmnet()
  • lambda=: lambda value used to compute beta
    • important: the \(\lambda\) used in fixedLassoInf() is the \(\lambda\) used in glmnet multiplied by the sample size

Neglecting the 2 important points above will yield incorrect results.

# REMEMBER: this method may not be theoretically justified if we determine lambda
#           through cross-validation; use with caution!
set.seed(1328) # set random seed to replicate results
# tune lambda using all training data
cv_fit_lasso <- cv.glmnet(xz, y, standardize=F) # get lambda.min via CV on all data
# selective inference for training data
lambda.min <- cv_fit_lasso$lambda.min
n <- nrow(xz) # sample size n
beta <- coef(cv_fit_lasso, x=xz, y=y, s=lambda.min, exact=T)[-1] # coefficients excluding intercept
fLI <- fixedLassoInf(x=xz, y=y, 
                     beta=beta, # coefficient vector
                     alpha=0.05, # significance level
                     lambda=lambda.min*n, # multiply lambda by sample size
                     )
Warning in fixedLassoInf(x = xz, y = y, beta = beta, alpha = 0.05, lambda =
lambda.min * : p > n/2, and sd(y) = 5.945 used as an estimate of sigma; you may
want to use the estimateSigma function
fLI

Call:
fixedLassoInf(x = xz, y = y, beta = beta, lambda = lambda.min * 
    n, alpha = 0.05)

Standard deviation of noise (specified or estimated) sigma = 5.945

Testing results at lambda = 19.532, with alpha = 0.050

 Var   Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
   1  1.058   2.153   0.396   -10.083   13.282       0.025      0.025
   2  1.006   2.048   0.465   -18.039   18.871       0.025      0.025
   3  1.109   2.196   0.004     0.798   37.010       0.025      0.025
   4  0.879   1.785   0.030    -0.106   16.626       0.025      0.025
   5  0.904   1.845   0.447    -5.202    4.482       0.025      0.025
   6  1.046   2.058   0.060    -1.281   30.699       0.025      0.025
   7  1.014   2.046   0.024     0.015   33.124       0.025      0.025
   8  0.882   1.885   0.008     0.683      Inf       0.025      0.000
   9  0.915   1.736   0.015     0.253   23.569       0.025      0.025
  10  0.986   1.875   0.038    -0.306   15.770       0.025      0.025
  29  0.033   0.064   0.877      -Inf    9.958       0.000      0.025
  30  0.071   0.141   0.727   -36.803   12.499       0.025      0.025
  31  0.049   0.095   0.861      -Inf   11.026       0.000      0.025
  37  0.147   0.283   0.351   -10.730   20.363       0.025      0.025
  57  0.048   0.104   0.848   -30.069    4.010       0.025      0.025
  67  0.045   0.091   0.992      -Inf   -1.571       0.000      0.025
  75  0.058   0.117   0.452   -18.612   22.703       0.025      0.025
  80  0.077   0.159   0.763   -27.716    7.433       0.025      0.025
  83  0.078   0.159   0.598   -19.968   12.966       0.025      0.025
  90 -0.374  -0.751   0.736    -1.865    8.176       0.025      0.025
 106  0.156   0.352   0.242    -5.263   17.994       0.025      0.025

Note: coefficients shown are partial regression coefficients

After our first run, we get a warning ("p > n/2, and sd(y) = 5.945 used as an estimate of sigma; you may want to use the estimateSigma function"), so we should be suspicious of the results (the confidence intervals are extremely wide).

This warning indicates that fixedLassoInf() is having trouble estimating the error (residual) standard deviation (via lm()) because of the high-dimensionality of the data.

We can use selectiveInference’s estimateSigma() function to get an estimate of the error standard deviation, which then stabilizes the results and removes the warning.

y_sigma <- estimateSigma(xz, y)$sigmahat # get a better estimate of the residual standard deviation
fLI <- fixedLassoInf(xz, y, 
                     beta=beta,
                     alpha=0.05,
                     lambda=lambda.min*n, # multiply lambda by sample size
                     sigma=y_sigma # supply estimated residual standard deviation to stabilize results
                     )
fLI

Call:
fixedLassoInf(x = xz, y = y, beta = beta, lambda = lambda.min * 
    n, sigma = y_sigma, alpha = 0.05)

Standard deviation of noise (specified or estimated) sigma = 1.011

Testing results at lambda = 19.532, with alpha = 0.050

 Var   Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
   1  1.058  12.659   0.000     0.716    1.430       0.025      0.025
   2  1.006  12.038   0.001     0.442    1.535       0.025      0.025
   3  1.109  12.909   0.000     0.966    2.155       0.025      0.025
   4  0.879  10.493   0.000     0.721    1.349       0.024      0.025
   5  0.904  10.849   0.000     0.689    1.075       0.024      0.024
   6  1.046  12.098   0.000     0.892    1.912       0.024      0.025
   7  1.014  12.027   0.000     0.872    1.951       0.025      0.025
   8  0.882  11.084   0.000     0.764    2.308       0.025      0.025
   9  0.915  10.208   0.000     0.751    1.581       0.024      0.025
  10  0.986  11.025   0.000     0.814    1.429       0.024      0.025
  29  0.033   0.374   0.848    -2.984    0.345       0.025      0.025
  30  0.071   0.827   0.618    -1.002    0.454       0.025      0.025
  31  0.049   0.560   0.821    -2.667    0.389       0.025      0.025
  37  0.147   1.663   0.146    -0.194    0.744       0.025      0.025
  57  0.048   0.610   0.688    -0.830    0.213       0.025      0.025
  67  0.045   0.535   0.954    -7.430    0.069       0.025      0.025
  75  0.058   0.690   0.373    -0.494    0.724       0.025      0.025
  80  0.077   0.933   0.585    -0.735    0.324       0.025      0.024
  83  0.078   0.937   0.444    -0.513    0.470       0.025      0.025
  90 -0.374  -4.414   0.006    -0.540   -0.099       0.024      0.025
 106  0.156   2.068   0.047    -0.039    0.682       0.025      0.025

Note: coefficients shown are partial regression coefficients

The columns above are:

  • Var: column of predictor matrix
  • Coef: estimated coefficient (unregularized)
  • Z-score: test statistic
  • P-value: one-sided selection p-value
  • LowConfPt: lower limit of selection-corrected confidence interval
  • UpConfPt: upper limit of selection-corrected confidence interval
  • LowTailArea: Achieved area in lower tail
  • UpTailArea: Achieved area in upper tail (1-(LowTailArea + UpTailArea) equals the estimated coverage of the confidence interval)

Simultaneous confidence intervals

Another approach to estimating confidence intervals that account for selection is to estimate simultaneous confidence intervals (Kuchibhotla et al., 2022), which control simultaneous coverage, the probability that all of the intervals contain their true parameter values simultaneously.

Simultaneous CIs provide protection against making errors of inference, i.e. inferring a CI contains the true value when it does not, due to multiplicity, similar to procedures that control the family-wise error rate (e.g. Bonferroni).

A description of how Bonferroni simultaneous CIs are constructed illustrates the protection they provide:

Imagine we repeatedly sample \(n=100\) observations from a population with outcome \(Y\) and 20 predictors, \(X=X_1, X_2,...,X_{20}\), all unrelated to the outcome, \(\beta_1=\beta_2=...=\beta_{20}=0\).

If we construct naive (non-simultaneous) 95% CIs, over repeated samples, the probability that all 20 intervals contain their true values (0) simultaneously is \(.95^{20}=0.358\).

The probability that at least one interval does not contain its true value is \(1-0.358=0.642\), which in this scenario is the family-wise error rate (FWER): the probability of at least one false positive among the 20 tests, each conducted at \(\alpha=0.05\).
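
Both quantities are easy to verify with simple arithmetic in R:

# probability that all 20 naive 95% CIs cover their true values
.95^20     # about 0.358
# FWER: probability that at least one CI misses its true value
1 - .95^20 # about 0.642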

Methods such as the Bonferroni procedure control the FWER.

The Bonferroni procedure increases the critical value of the test statistic used to determine both the CI width and the p-value.

For a given estimated coefficient \(\hat\beta\), its confidence interval is determined by its standard error, \(\hat\sigma_{\beta}\), and the critical t-value, \(t_{crit}\):

\[CI_{\hat\beta} = \left[\hat\beta - t_{crit}\hat\sigma_{\beta}, \hat\beta + t_{crit}\hat\sigma_{\beta} \right]\]

For a linear regression model with \(n=100\) observations, 20 predictors, and significance level \(\alpha=0.05\) (corresponding to 95% CIs):

  • for naive CIs, \(t_{crit.naive} = t_{1-\frac{\alpha}{2},df=79} = t_{.975,df=79} = 1.99\)
  • for Bonferroni CIs, we use \(\alpha_{bonf} = \frac{\alpha}{20} = .0025\), and \(t_{crit.bonf} = t_{1-\frac{\alpha_{bonf}}{2},df=79} = t_{.9988,df=79} = 3.09\)

The Bonferroni simultaneous CIs (red) constructed using \(t_{crit.bonf}=3.09\) are about 55% wider than the naive CIs (blue), and they all contain their true values of 0.
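
The two critical values can be computed with qt(), using the residual degrees of freedom \(df = 100 - 20 - 1 = 79\); a quick check:

# naive and Bonferroni critical t-values for n=100, 20 predictors, alpha=0.05
qt(1 - .05/2, df=79)      # naive critical value, about 1.99
qt(1 - .05/(2*20), df=79) # Bonferroni critical value, roughly 3.1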

Simultaneous CIs after LASSO selection with the PosiR package

Kuchibhotla et al. (2022) developed a method to calculate simultaneous CIs after selection.

They define a universe of models \(Q\) as every predictor set considered by the selection procedure.

The simultaneous inference method calculates confidence intervals that guarantee the desired coverage (e.g. 95%) for all models defined by predictor set \(q\) in the universe \(Q\).

  • for each model \(q\), the data are bootstrap sampled to estimate the variance of the coefficients
    • once variances are estimated for all models in \(Q\), an adjustment to all variances, \(\hat{K}_{\alpha}\), is determined to ensure that all confidence intervals in the universe have at minimum the desired coverage

Their method is implemented in the PosiR package.

# load package to calculate simultaneous CIs after selection
library(PosiR)

Below, we request simultaneous inference confidence intervals, using every model in the lasso sequence used during cross-validation as the “universe” of models:

  • first we get the sequence of \(\lambda\) values used during cross-validation
  • then we loop over the \(\lambda\) sequence with sapply() to get the selected predictors at each lasso fitted model
    • simultaneous_ci() adds an intercept column (all 1s) as first column of predictor matrix, so we increment the selected predictor column indices by 1
  • the predictor set at each \(\lambda\) is added as an element of a list, and these elements must be named
  • use simultaneous_ci() to get simultaneous confidence intervals for all lasso fits in the sequence
    • X= is the full predictor matrix, to which simultaneous_ci() adds the intercept column
    • y= is the outcome vector (must be a vector)
    • Q_universe= is the list of predictor sets (columns of predictor matrix in X=) for all lasso fits
lambdas <- cv_fit_lasso$lambda # get lambda sequence (83 lambdas)
all_selected <- sapply(lambdas, function(x) (coef(cv_fit_lasso, s=x)@i + 1)[-1] ) # columns of predictor matrix selected stored as elements of list
names(all_selected) <- paste0("lam", 1:83) # name list elements with "lam" and index in sequence (required by simultaneous_ci)

# get CIs for each LASSO fit
sim_inf_ci <- simultaneous_ci(X=xz, y=as.numeric(y), Q_universe=all_selected) # takes about 80 sec

Now we have simultaneous CIs for all 83 LASSO fits.

Let’s look at the simultaneous CIs at "lambda.min".

# lambda.min is 40th lambda
lam40 <- sim_inf_ci$intervals[sim_inf_ci$intervals$model_id=="lam40",]
knitr::kable(lam40, digits=4)
    model_id coefficient_name estimate   lower   upper psi_hat_nqj se_nqj
435 lam40    (Intercept)       -0.0594 -0.4149  0.2960      1.0785 0.0734
436 lam40    x1                 1.0582  0.6261  1.4903      1.5942 0.0893
437 lam40    x10                0.9855  0.5422  1.4289      1.6782 0.0916
438 lam40    x2                 1.0059  0.6367  1.3751      1.1637 0.0763
439 lam40    x3                 1.1086  0.7455  1.4717      1.1256 0.0750
440 lam40    x4                 0.8794  0.4651  1.2937      1.4654 0.0856
441 lam40    x5                 0.9038  0.4925  1.3150      1.4441 0.0850
442 lam40    x6                 1.0460  0.6198  1.4723      1.5513 0.0881
443 lam40    x7                 1.0145  0.5782  1.4507      1.6250 0.0901
444 lam40    x8                 0.8820  0.5042  1.2598      1.2187 0.0781
445 lam40    x9                 0.9151  0.4839  1.3464      1.5881 0.0891
446 lam40    z19                0.0335 -0.4152  0.4822      1.7189 0.0927
447 lam40    z20                0.0714 -0.2849  0.4277      1.0838 0.0736
448 lam40    z21                0.0486 -0.3688  0.4659      1.4871 0.0862
449 lam40    z27                0.1469 -0.2779  0.5717      1.5409 0.0878
450 lam40    z47                0.0480 -0.2758  0.3719      0.8955 0.0669
451 lam40    z57                0.0448 -0.3187  0.4083      1.1282 0.0751
452 lam40    z65                0.0579 -0.3820  0.4979      1.6528 0.0909
453 lam40    z70                0.0770 -0.3484  0.5025      1.5454 0.0879
454 lam40    z73                0.0777 -0.2927  0.4481      1.1715 0.0765
455 lam40    z80               -0.3739 -0.7467 -0.0011      1.1869 0.0770
456 lam40    z96                0.1557 -0.2284  0.5399      1.2600 0.0794

In the table above:

  • model_id: name of the model as given in the Q_universe= list
  • coefficient_name: predictor name, sorted in alphanumeric order
  • estimate: unregularized estimate
  • lower and upper: limits of simultaneous confidence interval
  • psi_hat_nqj: bootstrap variance estimate, \(\hat{\Psi}_{n,q,j}\)
  • se_nqj: standard error estimate, \(\sqrt{\frac{\hat{\Psi}_{n,q,j}}{n}}\)
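
Because these intervals already account for selection, we can flag predictors simply by checking whether their simultaneous CIs exclude zero; for example, using the lam40 subset created above:

# predictors at lambda.min whose simultaneous CIs exclude zero
lam40[lam40$lower > 0 | lam40$upper < 0, "coefficient_name"] # the 10 x predictors and z80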

Comparison of results across selective inference methods

Below we compare the CIs for the 10 \(x\) predictors across our inference procedures.

  • the oracle model is the true model with just the 10 \(x\) predictors run on the full sample
  • the full model includes all 110 predictors, including 100 non-predictive \(z\) predictors
  • the point estimates differ because the models have different predictor sets, and all predictors are correlated
# oracle model is true model
oracle <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10, data=d_lasso)
oracle_ci <- confint(oracle)

# full model includes all 110 predictors
f_full <- as.formula(paste("y ~ ", paste(colnames(xz), collapse="+")))
full <- lm(f_full, data=d_lasso)
full_ci <- confint(full)
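
One simple way to compare is to place the oracle and full-model intervals for the 10 \(x\) predictors side by side; a minimal sketch (ci_compare is just an illustrative name):

# side-by-side CIs for the x predictors from the oracle and full models
xvars <- paste0("x", 1:10)
ci_compare <- data.frame(oracle = oracle_ci[xvars,], full = full_ci[xvars,])
round(ci_compare, 3)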

False positives

The inference procedures did produce a few false positives:

  • the full model with all 110 predictors produced 6 false positives: z18, z21, z88, z90, z91, z111
  • sample splitting produced no false positives (none of the selected \(z\) predictors was significant in the inference subset)
  • stability sampling produced 3 false positives at \(\pi_{thr}=0.6\) (z21, z87, z95) and 1 false positive at \(\pi_{thr}=0.7\) (z21)
  • stability sampling with the randomized LASSO produced 2 false positives at \(\pi_{thr}=0.6\) (z54, z95) and 1 false positive at \(\pi_{thr}=0.7\) (z54)
  • fixedLassoInf() produced 1 false positive at significance level 0.05: z80
  • simultaneous_ci() produced 1 false positive: z80

Exercise

We will practice fitting LASSO-regularized linear regressions, tuning \(\lambda\), and performing inference after selection.

For the exercise, we will use the QuickStartExample data that comes with the glmnet package.

  • \(n=100\)
  • 20 x predictors
  • outcome y

Use the code below to load the data and set up the matrices for glmnet.

library(glmnet)
data(QuickStartExample) # load data
x <- QuickStartExample$x # predictor matrix
y <- QuickStartExample$y # outcome vector
colnames(x) <- paste0("x", 1:20) # name columns in x predictor matrix
  1. Use cross-validation to find the \(\lambda\) that minimizes test error. Set the random seed to 123.

  2. Create a plot of estimated test MSE vs \(-log(\lambda)\) to see how test error varies with the penalty. Identify "lambda.min" on the plot.

  3. Examine the non-zero coefficients at "lambda.min".

  4. Now we wish to perform inference using sample splitting first.

  • Use the code below to split the data into 2 subsets, one for predictor selection and one for inference.
set.seed(123)
ind <- sample(1:100, size=50) # 50 random row indices
x_sel <- x[ind,] # predictor matrix for selection
y_sel <- y[ind] # outcome vector for selection

d_inf <- data.frame(y=y[-ind], x[-ind,]) # data.frame of subset used for inference
  • Then use the data subset for selection to tune \(\lambda\) and find "lambda.min". Determine which coefficients are non-zero.
  • Run an unpenalized regression using lm() on the inference subset, using the predictors selected in the first subset. Use summary() to get p-values and confint() for confidence intervals.
  5. Now use fixedLassoInf() on the whole training data (x and y) to perform selective inference with 95% CIs on the same data used for selection.
  • First re-run cv.glmnet() with unstandardized predictors.
  • Extract "lambda.min" from this cv.glmnet object.
  • Also extract the non-zero coefficient vector at "lambda.min". Remember to remove the intercept.
  • Use fixedLassoInf() from the selectiveInference package. Remember to multiply "lambda.min" by the sample size 100, and to set alpha= to 0.05.
  6. Evaluate whether the inference using the 2 methods appears similar or different.

Conclusions

Machine learning methods are well suited for scientific problems where the researcher does not want to make many assumptions or decisions about the model.

Machine learning methods perform better with large data.

LASSO regularization can be used to select a predictor set, and the penalty hyperparameter \(\lambda\) that determines the set can be tuned via cross-validation.

Inference should not be performed on the same data used to select the predictor set without accounting for the selection procedure.

References

Beam, A. L., & Kohane, I. S. (2018). Big data and machine learning in health care. JAMA, 319(13), 1317-1318.

Friedman, J. H., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 1-22.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning: with applications in R (Vol. 103). New York: Springer.

Kammer, M., Dunkler, D., Michiels, S., & Heinze, G. (2022). Evaluating methods for Lasso selective inference in biomedical research: a comparative simulation study. BMC Medical Research Methodology, 22(1), 206.

Kuchibhotla, A. K., Kolassa, J. E., & Kuffner, T. A. (2022). Post-selection inference. Annual Review of Statistics and Its Application, 9, 505-527.

Meinshausen, N., & Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society Series B: Statistical Methodology, 72(4), 417-473.

Rahal, C., Verhagen, M., & Kirk, D. (2024). The rise of machine learning in the academic social sciences. AI & SOCIETY, 39(2), 799-801.

Rajula, H. S. R., Verlato, G., Manchia, M., Antonucci, N., & Fanos, V. (2020). Comparison of conventional statistical methods with machine learning in medicine: diagnosis, drug development, and treatment. Medicina, 56(9), 455.

Taylor, J., & Tibshirani, R. J. (2015). Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25), 7629-7634.

Wasserman, L., & Roeder, K. (2009). High dimensional variable selection. Annals of Statistics, 37(5A), 2178.

Solutions to Exercise

# Prepare data
library(glmnet)
data(QuickStartExample)
x <- QuickStartExample$x
y <- QuickStartExample$y
colnames(x) <- paste0("x", 1:20)


# 1.
set.seed(123)
cv_fit_ex1 <- cv.glmnet(x, y)

# 2. 
plot(cv_fit_ex1)

# 3.
coef(cv_fit_ex1, s="lambda.min")
21 x 1 sparse Matrix of class "dgCMatrix"
              lambda.min
(Intercept)  0.145832036
x1           1.340981414
x2           .          
x3           0.708347140
x4           .          
x5          -0.848087765
x6           0.554823782
x7           0.038519738
x8           0.347221319
x9           .          
x10          0.010034050
x11          0.186365265
x12          .          
x13          .          
x14         -1.086006902
x15         -0.004444319
x16          .          
x17          .          
x18          .          
x19          .          
x20         -1.069942845
# 4.
# use the following code to split the sample into two halves
set.seed(123)
ind <- sample(1:100, size=50)
x_sel <- x[ind,]
y_sel <- y[ind]
x_ind <- x[-ind,]
y_ind <- y[-ind]
d_inf <- data.frame(y=y_ind, x_ind)

cv_fit_sel <- cv.glmnet(x_sel, y_sel)
coef(cv_fit_sel, s="lambda.min")
21 x 1 sparse Matrix of class "dgCMatrix"
             lambda.min
(Intercept)  0.25396946
x1           1.11515493
x2           .         
x3           0.91540769
x4           .         
x5          -0.73492270
x6           0.43568066
x7           .         
x8           0.32516158
x9           .         
x10          0.02507769
x11          0.13759627
x12          .         
x13          .         
x14         -0.95243039
x15          .         
x16          0.16301172
x17          .         
x18          .         
x19          .         
x20         -1.27909551
sel_ind <- coef(cv_fit_sel, s="lambda.min")@i # row indices of non-zero coefficients (0 is the intercept)

m_inf <- lm(y ~ x1 + x3 + x5 + x6 + x7 + x8 + x9 + x11 + x14 + x15 + x16 + x20, data=d_inf)
summary(m_inf)

Call:
lm(formula = y ~ x1 + x3 + x5 + x6 + x7 + x8 + x9 + x11 + x14 + 
    x15 + x16 + x20, data = d_inf)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.97434 -0.50037  0.02361  0.64250  1.60826 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.1020     0.1534   0.665   0.5102    
x1            1.4438     0.1461   9.884 6.30e-12 ***
x3            0.6284     0.1341   4.687 3.69e-05 ***
x5           -0.9927     0.1585  -6.265 2.75e-07 ***
x6            0.6491     0.1320   4.917 1.82e-05 ***
x7            0.1474     0.1533   0.962   0.3423    
x8            0.3581     0.1427   2.510   0.0166 *  
x9           -0.0608     0.1647  -0.369   0.7141    
x11           0.3006     0.1788   1.681   0.1011    
x14          -1.1005     0.1577  -6.978 3.03e-08 ***
x15          -0.1685     0.1726  -0.977   0.3351    
x16          -0.1938     0.1462  -1.325   0.1932    
x20          -0.9637     0.1673  -5.759 1.33e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9624 on 37 degrees of freedom
Multiple R-squared:  0.9324,    Adjusted R-squared:  0.9105 
F-statistic: 42.55 on 12 and 37 DF,  p-value: < 2.2e-16
confint(m_inf)
                  2.5 %     97.5 %
(Intercept) -0.20876555  0.4127145
x1           1.14780312  1.7397499
x3           0.35678937  0.9000752
x5          -1.31381835 -0.6716554
x6           0.38163165  0.9166052
x7          -0.16309512  0.4579830
x8           0.06901150  0.6472758
x9          -0.39447756  0.2728753
x11         -0.06165766  0.6628956
x14         -1.41999962 -0.7809310
x15         -0.51823328  0.1811479
x16         -0.49008343  0.1025001
x20         -1.30274946 -0.6246403
# 5. 
set.seed(123)
cv_fit_ex1_un <- cv.glmnet(x, y, standardize=F)
lam.min <- cv_fit_ex1_un$lambda.min
beta <- coef(cv_fit_ex1_un, s="lambda.min")[-1]
fixedLassoInf(x=x, y=y, beta=beta, lambda=lam.min*100, alpha=0.05)

Call:
fixedLassoInf(x = x, y = y, beta = beta, lambda = lam.min * 100, 
    alpha = 0.05)

Standard deviation of noise (specified or estimated) sigma = 0.968

Testing results at lambda = 6.546, with alpha = 0.050

 Var   Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
   1  1.379  13.707   0.000     1.182    1.578       0.025      0.024
   3  0.779   7.992   0.000     0.588    0.972       0.025      0.024
   5 -0.906  -8.442   0.000    -1.118   -0.695       0.024      0.025
   6  0.603   6.464   0.000     0.420    0.787       0.025      0.024
   8  0.398   4.163   0.000     0.208    0.586       0.024      0.024
  11  0.255   2.468   0.029    -0.012    0.459       0.025      0.024
  14 -1.120 -12.173   0.000    -1.303   -0.938       0.024      0.024
  20 -1.156  -9.960   0.000    -1.384   -0.927       0.025      0.024

Note: coefficients shown are partial regression coefficients