This two-part workshop introduces the use of machine learning methods for scientific research.
Part 1 discusses LASSO regularization and inference after selection, demonstrated with the glmnet package.
Part 2 discusses regression and classification trees and their extensions, including bagging, boosting, random forests, and causal forests, with an emphasis on inference.
Machine learning (ML) encompasses a wide array of computational methods that identify patterns in data to make accurate predictions and infer relationships among variables.
ML is a subclass of artificial intelligence: the machine learns patterns without human guidance.
Traditionally, ML has been used for accurate prediction in industry and engineering settings.
More recently, ML usage in the sciences has grown rapidly:
Below are figures showing increasing usage of ML methods in health sciences (left) and business, social sciences, and economics (right).
Image is Fig 1 from (Rajula et al., 2020).
Image is Fig 1 from (Rahal et al., 2024).
Data sets continue getting bigger:
Digital tools enable measurement of thousands of variables efficiently.
These wide data sets suffer from the curse of dimensionality: relevant patterns become harder to find as the number of variables grows.
For example in genomics, we might wish to know which few genes out of 25,000 are most strongly related to variation in a trait.
Compared to machines, humans often require much less data to learn; for example, a person can learn to recognize a new object from a handful of pictures, while a computer might require thousands of training images to solve the same problem.
Computers, however, excel at processing data, especially at large scales.
Thus, ML is ideally suited for scientific problems where:
For many problems (e.g., genomics), we wish to explore the relationships of thousands of predictors (features in ML) with an outcome (label in ML).
A researcher can be easily overwhelmed deciding which predictors, interactions, non-linear effects, etc., to evaluate.
ML methods transfer the decision-making from humans to computers (algorithms).
Beam & Kohane, 2018, imagined classical regression and ML methods lying on a spectrum defined by how much human vs machine decision-making is required.
They examined and plotted the relationship between this spectrum and the amount of data used in various studies employing various methods.
As expected, studies using methods requiring more machine-driven decisions also used more data.
Image is Figure from (Beam & Kohane, 2018)
Classical regression generally requires more human decision making, often in the form of assumptions:
These assumptions simplify the model, reducing how many parameters are estimated and thus the amount of data required.
Thus, given sufficient data, machine learning methods are especially useful for scientific problems where regression assumptions are untenable or the researcher is overwhelmed by the number of decisions and assumptions to ponder.
Like regression, supervised machine learning methods model the relationship between an outcome and a set of predictors.
Two primary goals: prediction of the outcome and inference about the relationships between the predictors and the outcome.
Imagine we are interested in estimating the relationship, \(f\), between an outcome \(Y\) and a set of predictors \(X=X_1,X_2,...,X_p\).
\[Y=f(X)\]
For example, in linear regression, \(f\) is assumed to be a linear combination of a set of coefficients and \(X\):
\[\begin{align} Y&=f_{OLS}(X) + \epsilon \\ &=\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p + \epsilon \\ \end{align}\]
The error term \(\epsilon\) is not included in \(f\) because it does not depend on \(X\).
In many ML approaches, linear relationships are not assumed for \(f\).
With data, we can estimate these relationships, and the estimated \(\hat{f}\) can be used to make predictions of the outcome, \(\hat{Y}\).
\[\hat{Y} = \hat{f}(X)\]
Predictions \(\hat{Y}\) will rarely perfectly match observations \(Y\).
Prediction error quantifies the difference between \(\hat{Y}\) and \(Y\) and comes in many forms.
For continuous \(Y\), squared errors, \((Y-\hat{Y})^2\), are commonly used.
If we wish to predict \(Y\) in future data sets, we might want to know how much prediction error to expect:
\[E((Y-\hat{Y})^2) = [(f(X) - \hat{f}(X))^2] + Var(\epsilon)\]
where the first term, which depends on how well \(\hat{f}\) approximates \(f\), is error we can reduce, and \(Var(\epsilon)\) is error we cannot.
We will use other measures of prediction error for categorical outcomes.
In the expression for expected squared (prediction) error:
\[\begin{align} E((Y-\hat{Y})^2) &= [(f(X) - \hat{f}(X))^2] + Var(\epsilon) \\ &=\text{reducible error} + \text{irreducible error} \end{align}\]
We can thus optimize predicting the outcome by minimizing the reducible error.
With sample data, we can estimate the expected squared error with Mean Squared Error (MSE):
\[\text{MSE}=\frac{1}{n}\sum_{i=1}^n(y_i-\hat{f}(x_i))^2\]
MSE is the average squared residual (estimate of error), \((y_i-\hat{f}(x_i))^2\).
The MSE for the data used to fit the model is called training MSE, \(\text{MSE}_{\text{train}}\).
A model that predicts well in future data sets will generally be more useful than one that only predicts well for the current sample data.
Thus, we generally care less about \(\text{MSE}_{\text{train}}\) and more about \(\text{MSE}_{\text{test}}\), or prediction error in future data sets:
\[\text{MSE}_{\text{test}}=\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}(\tilde{y}_i-\hat{f}(\tilde{x}_i))^2\]
\(\hat{f}\) above is the model estimated with the training data.
Above, \(\tilde{y}\) and \(\tilde{x}\) represent outcomes and predictors in future data sets, respectively.
We use test error to refer generally to a measure of prediction error in future data.
Methods that minimize \(\text{MSE}_{\text{train}}\) (e.g. ordinary least squares) usually do not simultaneously minimize \(\text{MSE}_{\text{test}}\).
Models with more parameters have more flexibility.
\(\text{MSE}_{\text{train}}\) monotonically decreases with increased model flexibility (analogous to how \(R^2\) always increases with more predictors in linear regression).
However, \(\text{MSE}_{\text{test}}\) has a U-shaped relationship with flexibility, where at some point increasing flexibility will lead to increasing \(\text{MSE}_{\text{test}}\).
The following simulation illustrates how \(\text{MSE}_{\text{train}}\) and \(\text{MSE}_{\text{test}}\) vary with model flexibility:
\[y=x-x^2+.4x^3+\epsilon\]
The left plot below shows the training data and predictions from 3 different models (degree=1,3,10)
The right plot below shows how \(\text{MSE}_{\text{train}}\) and \(\text{MSE}_{\text{test}}\) vary with flexibility.
The added flexibility of additional polynomial terms allows the model to fit more closely to the training data.
However, the wiggliness of the maximally flexible model (degree=10) suggests overfitting.
\(\text{MSE}_{\text{train}}\) always decreases with flexibility.
\(\text{MSE}_{\text{test}}\) has the expected U-shaped relationship with flexibility, and is at a minimum at the true model (degree=3).
Figure inspired by Fig 2.9 of (James et al., 2013)
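A minimal sketch of this kind of simulation (the sample size, noise level, and predictor range below are assumptions chosen for illustration):

# simulate training and test data from the true model y = x - x^2 + .4x^3 + e
set.seed(1)
n <- 100
x_train <- runif(n, -3, 3); y_train <- x_train - x_train^2 + .4*x_train^3 + rnorm(n)
x_test  <- runif(n, -3, 3); y_test  <- x_test  - x_test^2  + .4*x_test^3  + rnorm(n)
# fit polynomials of increasing flexibility and compute training and test MSE
degrees <- 1:10
mse <- t(sapply(degrees, function(d) {
  m <- lm(y_train ~ poly(x_train, d, raw=TRUE))                # degree-d polynomial fit
  pred_test <- predict(m, newdata=data.frame(x_train=x_test))  # predictions for the test data
  c(train=mean(resid(m)^2), test=mean((y_test - pred_test)^2)) # training and test MSE
}))
mse # training MSE always decreases with degree; test MSE is typically lowest near the true degree (3)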
Why does \(\text{MSE}_{\text{test}}\) have a U-shaped relationship with flexibility?
The reducible error in future datasets has two components: bias and variance of \(\hat{f}\):
\[\begin{align}E(\text{MSE}_{\text{test}}) &= \text{reducible error} + \text{irreducible error} \\ &=[Bias(\hat{f}(\tilde{x}))^2 + Var(\hat{f}(\tilde{x}))] + Var(\epsilon) \end{align}\]
Thus, as models become more flexible, bias will decrease, but eventually, reductions in bias will be dominated by increases in variance.
The goal, then, is to find the model with the right amount of flexibility that balances bias and variance to minimize test MSE.
An extension of the previous polynomial simulation illustrates how bias and variance of \(\hat{f}\) vary with model flexibility:
Statistical approaches that are more flexible can improve prediction accuracy but tend to be less interpretable.
The additional complexity of flexible models obscures their interpretation (e.g. interactions and non-linear effects).
Classical regression models are less flexible, but their parameters have straightforward interpretations as additive changes in the outcome (or transformed outcome) per unit change in the predictor.
Machine learning models are often non-parametric and predictor effects are often difficult to summarize.
Image inspired by Fig 2.7 from (James et al., 2013)
The flexibility of ML methods is typically controlled by hyperparameters.
Hyperparameters are fixed before a model is fit, and the model is then fit conditionally on the hyperparameters’ values.
Across ML models, hyperparameters typically control the flexibility and complexity of the fitted model.
The “best” hyperparameter value that minimizes test error cannot be estimated from a single model fit.
Instead, hyperparameters are tuned or optimized by estimating test error over a range of many hyperparameter values.
Cross-validation is the most common method used to tune hyperparameters.
Cross-validation efficiently uses a single data set both to train a model and to estimate its test error (often called validation error when no truly future data are used).
Cross-validation algorithm: randomly split the data into \(K\) folds; for each fold, fit the model using the data in all other folds and estimate test error on the held-out fold; then average the \(K\) test error estimates.
Cross-validation thus allows us to estimate test error at each hyperparameter value and find the value that minimizes test error, with just one data set.
Imagine we wish to tune hyperparameter \(\lambda\) value over a range of 100 values, \(\lambda_1,\lambda_2,...,\lambda_{100}\)
The image below depicts 5-fold cross-validation to estimate test error for the first \(\lambda\) in the sequence, \(\lambda_1\)
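A sketch of the algorithm in base R; to keep it self-contained, the tuned hyperparameter here is polynomial degree rather than \(\lambda\), reusing the simulated training data from the sketch above:

K <- 5
fold <- sample(rep(1:K, length.out=length(y_train))) # randomly assign each observation to one of K folds
cv_mse <- function(degree) {
  errs <- sapply(1:K, function(k) {
    train <- data.frame(x=x_train[fold != k], y=y_train[fold != k]) # K-1 folds used for fitting
    test  <- data.frame(x=x_train[fold == k], y=y_train[fold == k]) # held-out fold
    m <- lm(y ~ poly(x, degree, raw=TRUE), data=train)
    mean((test$y - predict(m, newdata=test))^2)                     # test MSE on the held-out fold
  })
  mean(errs) # average test error over the K folds
}
sapply(1:10, cv_mse) # estimated test MSE for each candidate hyperparameter value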
With more data, the trained model will be less variable and on average closer to the true model, resulting in lower test error.
Cross-validation uses only a portion of the data to train the model (e.g. 90% with 10 folds) and thus test error will be generally over-estimated compared to using 100% of the data.
To avoid this over-estimation, \(K=n\) folds, where each fold is a single observation, known as leave-one-out cross-validation, might seem appealing. However, the estimated test errors become highly variable because they are based on single observations.
For most applications, \(K=5\) or \(K=10\) folds have been found to perform well.
Generally, with smaller data, more folds might be preferred so that more data is used to train.
Statistical methods fit models by minimizing training error.
Thus, test error will be generally underestimated if calculated using any observations used to train the model.
Thus, any sort of training should be avoided before calculating test error at any step, including cross-validation.
“Training” includes not only fitting models but also selecting predictors based on their relationship with the outcome (e.g. using correlation thresholds or a forward stepwise procedure).
Regression parameter estimates will vary greatly across repeated samples when:
More variable models will lead to higher test error, so methods less flexible than regression can help.
Regularized models task the machine with using the data to reduce the flexibility of the model, for example, by removing irrelevant predictors.
A model’s loss function is minimized to produce the “best” parameter estimates:
\[LOSS_{ols}(\beta) = \sum_{i=1}^n\left(y_i -\beta_0 - \sum_{j=1}^pX_{ij}\beta_j \right)^2\]
Regularization methods impose a penalty term on to a model’s loss function to limit complexity and discourage overfitting.
The penalty term grows with the magnitude of the coefficients, so minimizing the penalized loss discourages unnecessarily complex models; its influence is controlled by a tuning hyperparameter.
Least Absolute Shrinkage and Selection Operator (LASSO) is a popular ML regularization method adapted for several regression models.
The LASSO loss function, \(LOSS_{lasso}(\beta)\), adds a penalty to the sum of the absolute values of the \(p\) parameter estimates:
\[Loss_{lasso}(\beta) = Loss_{reg}(\beta) + \lambda\sum_{j=1}^p|\beta_j| \]
where \(Loss_{reg}(\beta)\) is the loss function of the corresponding unpenalized regression model, and \(\lambda \ge 0\) is a penalty hyperparameter that controls the strength of the penalty.
For example, the loss function for LASSO for linear regression adds the penalty term to the sum of squared errors (loss function for linear regression):
\[\begin{align} LOSS_{lasso.ols}(\beta) &= \sum_{i=1}^n\left(y_i -\beta_0 - \sum_{j=1}^pX_{ij}\beta_j \right)^2 + \lambda\sum_{j=1}^p|\beta_j| \\ &= \text{SSE} + \text{penalty} \end{align}\]
For a given \(\lambda\), the “best” set of estimates \(\hat{\beta}\) minimizes \(Loss_{lasso}\)
Note: The loss function for LASSO linear regression can be equivalently minimized with SSE multiplied by \(\frac{1}{2n}\), as in the glmnet package, or by \(\frac{1}{2}\), as in the selectiveInference package. The same set of \(\hat{\beta}\) will be estimated, but the scaling of \(\lambda\) will be different.
The LASSO penalty can potentially shrink coefficients to zero if \(\lambda\) is large enough.
Predictors with zero coefficients are selected out of the model.
Thus, LASSO is often used for automated variable selection.
Coefficient path plots track how coefficients grow as \(\lambda\) and thus the LASSO penalty decreases.
The plot above tracks 10 coefficients against \(\lambda\)
The LASSO penalty can be added to generalized linear models (GLMs) that do not assume normally distributed errors, such as logistic, Poisson, multinomial, and Cox regression models.
Generally, the loss function to be minimized for LASSO-regularized GLMs will use the negative log likelihood:
\[LOSS_{lasso.glm}(\beta) = \text{negative log likelihood} + \text{LASSO penalty}\]
For example, for logistic regression of binary (0/1) outcome \(y_i\), the loss function can be written as:
\[LOSS_{lasso.logistic}(\beta) = -\frac{1}{n}\sum_{i=1}^n\left(y_i\Big(\beta_0 + \sum_{j=1}^pX_{ij}\beta_j\Big)-log\Big(1+e^{\beta_0 + \sum_{j=1}^pX_{ij}\beta_j}\Big)\right)+\lambda\sum_{j=1}^p|\beta_j|\]
Henceforth, we will only be looking at LASSO-regularized linear regression.
Two regularization methods similar to LASSO are ridge regression and elastic net.
Ridge regression uses a different penalty based on the squares of the coefficients.
\[Loss_{ridge}(\beta) = Loss_{reg}(\beta) + \lambda\sum_{j=1}^p\beta_j^2\]
\(\lambda\) again serves as a penalty parameter that can be tuned via cross-validation.
Unlike LASSO, ridge regression does not shrink coefficients exactly to zero (unless \(\lambda=\infty\)), so it might be a better choice if all predictors are expected to have non-zero coefficients.
Elastic net combines the LASSO and ridge penalties, with an additional mixing parameter \(\alpha\), \(0\le\alpha\le1\), which controls how much of each penalty to use:
\[Loss_{elastic}(\beta) = Loss_{reg}(\beta) + \lambda\left(\alpha \sum_{j=1}^p|\beta_j| + [1-\alpha] \sum_{j=1}^p\beta_j^2 \right)\]
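A brief sketch of how these penalties are selected in the glmnet package (introduced in the next section) through its alpha= argument; the small simulated x and y here are placeholders, not the workshop data:

library(glmnet)
set.seed(1)
x <- matrix(rnorm(100*5), 100, 5)  # placeholder predictor matrix
y <- x[,1] + rnorm(100)            # placeholder outcome
fit_lasso_ex   <- glmnet(x, y, alpha=1)   # LASSO penalty (the default)
fit_ridge_ex   <- glmnet(x, y, alpha=0)   # ridge penalty
fit_elastic_ex <- glmnet(x, y, alpha=0.5) # elastic net, equal mix of the two penalties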
For the remainder of the seminar we focus on LASSO, because most inference methods for regularized models have been developed for LASSO.
The glmnet package for LASSO

The glmnet package (Friedman et al., 2010):
We demonstrate usage of the glmnet package for LASSO-regularized models with the following simulated data:
# load simulated data to demonstrate glmnet LASSO
d_lasso <- read.csv("https://stats.oarc.ucla.edu/wp-content/uploads/2025/08/lasso_sim.csv")
names(d_lasso) # column names
  [1] "y" "x1" "x2" "x3" "x4" "x5" "x6" "x7" "x8" "x9"
[11] "x10" "z1" "z2" "z3" "z4" "z5" "z6" "z7" "z8" "z9"
[21] "z10" "z11" "z12" "z13" "z14" "z15" "z16" "z17" "z18" "z19"
[31] "z20" "z21" "z22" "z23" "z24" "z25" "z26" "z27" "z28" "z29"
[41] "z30" "z31" "z32" "z33" "z34" "z35" "z36" "z37" "z38" "z39"
[51] "z40" "z41" "z42" "z43" "z44" "z45" "z46" "z47" "z48" "z49"
[61] "z50" "z51" "z52" "z53" "z54" "z55" "z56" "z57" "z58" "z59"
[71] "z60" "z61" "z62" "z63" "z64" "z65" "z66" "z67" "z68" "z69"
[81] "z70" "z71" "z72" "z73" "z74" "z75" "z76" "z77" "z78" "z79"
[91] "z80" "z81" "z82" "z83" "z84" "z85" "z86" "z87" "z88" "z89"
[101] "z90" "z91" "z92" "z93" "z94" "z95" "z96" "z97" "z98" "z99"
[111] "z100"
The glmnet package works best with the predictors and outcome stored as separate matrix objects (or vector if a single column):
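A possible setup (assuming, as in the sample-splitting code later in the seminar, that column 1 of d_lasso is the outcome and columns 2-111 are the predictors; xz and y are the object names used in the code below):

xz <- as.matrix(d_lasso[, 2:111]) # predictor matrix: x1-x10 and z1-z100
y  <- as.matrix(d_lasso[, 1])     # outcome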
glmnet() for LASSO

The glmnet() function fits regularized generalized linear models, by default using LASSO.
Important arguments:
x=: predictor matrix
y=: outcome vector
family=: outcome/error distribution family (defaults to "gaussian")
alpha=: elastic net mixing parameter controlling how much of the ridge and LASSO penalties to use; defaults to alpha=1 for LASSO

glmnet() fits LASSO models across a sequence of \(\lambda\) values:
by default, a sequence of 100 \(\lambda\) values is used (use nlambda= to change the number)
use lambda= to specify a custom sequence

We fit LASSO-regularized linear regressions for our simulated data across the sequence of \(\lambda\) values using glmnet():
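The fitting call is a one-liner; fit_lasso is the object name used in the output below, and the Call line in that output shows glmnet(x = xz, y = y):

library(glmnet)
fit_lasso <- glmnet(x=xz, y=y) # family="gaussian" and alpha=1 (LASSO) are the defaults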
glmnet objects

Rather than extracting results directly from the fitted glmnet object, use these extractor functions:
plot(): coefficient path plot
print(): summary of model fit at each \(\lambda\)
coef(): coefficient estimates at specific \(\lambda\) value(s)
predict(): model-based predictions

plot() used on a glmnet fitted object creates the coefficient path plot:
Remember that \(\lambda\) and the penalty decrease as we move to the right.
Using print() on the fitted glmnet object (or just typing the object’s name) summarizes the model fit at each \(\lambda\) value.
In the table below,
Df: number of non-zero coefficients
%Dev: percent of null deviance explained, a measure of model fit to the training data; %Dev thus measures how much the null deviance has been reduced by the fitted model
Lambda: \(\lambda\) used to fit the model

Note: if the number of rows in the summary table is less than 100 (or whatever nlambda= equals), glmnet() stopped early because %Dev was not changing enough as \(\lambda\) decreased.
# summary of glmnet path (model fit at each lambda)
# same as print(fit_lasso)
# glmnet stops at 83rd lambda because %Dev did not change enough at 84th
fit_lasso
Call: glmnet(x = xz, y = y)
Df %Dev Lambda
1 0 0.00 3.6060
2 2 9.58 3.2850
3 7 21.09 2.9930
4 9 33.13 2.7280
5 9 43.64 2.4850
6 10 52.38 2.2640
7 10 60.00 2.0630
8 10 66.32 1.8800
9 10 71.56 1.7130
10 10 75.91 1.5610
11 10 79.53 1.4220
12 10 82.53 1.2960
13 10 85.02 1.1810
14 10 87.09 1.0760
15 10 88.80 0.9802
16 10 90.23 0.8931
17 10 91.41 0.8138
18 10 92.40 0.7415
19 10 93.21 0.6756
20 10 93.89 0.6156
21 10 94.45 0.5609
22 10 94.92 0.5111
23 10 95.30 0.4657
24 10 95.63 0.4243
25 10 95.89 0.3866
26 12 96.12 0.3523
27 13 96.32 0.3210
28 13 96.49 0.2925
29 13 96.63 0.2665
30 13 96.74 0.2428
31 13 96.84 0.2212
32 14 96.92 0.2016
33 16 96.99 0.1837
34 17 97.04 0.1674
35 19 97.14 0.1525
36 19 97.22 0.1389
37 19 97.28 0.1266
38 20 97.34 0.1154
39 21 97.39 0.1051
40 21 97.43 0.0958
41 22 97.46 0.0873
42 26 97.50 0.0795
43 30 97.55 0.0724
44 35 97.64 0.0660
45 37 97.72 0.0602
46 45 97.81 0.0548
47 47 97.90 0.0499
48 52 97.98 0.0455
49 54 98.06 0.0415
50 59 98.14 0.0378
51 61 98.22 0.0344
52 65 98.28 0.0314
53 70 98.34 0.0286
54 72 98.40 0.0260
55 73 98.44 0.0237
56 74 98.48 0.0216
57 78 98.51 0.0197
58 81 98.54 0.0180
59 83 98.57 0.0164
60 84 98.59 0.0149
61 90 98.61 0.0136
62 92 98.63 0.0124
63 94 98.65 0.0113
64 94 98.66 0.0103
65 94 98.68 0.0094
66 95 98.69 0.0085
67 97 98.70 0.0078
68 99 98.71 0.0071
69 100 98.71 0.0064
70 101 98.72 0.0059
71 104 98.72 0.0054
72 104 98.73 0.0049
73 103 98.73 0.0044
74 103 98.74 0.0040
75 104 98.74 0.0037
76 106 98.74 0.0034
77 106 98.75 0.0031
78 106 98.75 0.0028
79 107 98.75 0.0025
80 108 98.75 0.0023
81 108 98.75 0.0021
82 108 98.75 0.0019
83 108 98.76 0.0018
Use coef() on a glmnet fitted object to see the estimated coefficients at specific \(\lambda\) values.
Use s= to specify a specific \(\lambda\) value.
In the previous summary table:
Notice in the output below:
# estimated coefficients at 2nd, 10th, 50th lambda in sequence
# use lambda in fitted object for exact lambda values
coef(fit_lasso, s=fit_lasso$lambda[c(2,10,50)])
111 x 3 sparse Matrix of class "dgCMatrix"
s=3.28532713 s=1.56079655 s=0.03777328
(Intercept) -0.2774453 -0.1540268 -0.0788428868
x1 . 0.5581107 1.0833645939
x2 . 0.4656223 0.9903312049
x3 . 0.5624802 1.0888907504
x4 . 0.4084946 0.8699055576
x5 . 0.5186582 0.9046030148
x6 . 0.6403637 1.0915202716
x7 0.2475144 0.7299907 1.0450388127
x8 . 0.4786788 0.9035429299
x9 0.2443391 0.6795551 0.9668759598
x10 . 0.3266957 0.9809235209
z1 . . .
z2 . . .
z3 . . -0.0206654766
z4 . . .
z5 . . -0.0066959227
z6 . . .
z7 . . .
z8 . . .
z9 . . -0.0244619431
z10 . . .
z11 . . -0.0016520651
z12 . . .
z13 . . -0.0839619427
z14 . . .
z15 . . .
z16 . . .
z17 . . .
z18 . . -0.1062617174
z19 . . 0.0676161596
z20 . . 0.0596107857
z21 . . 0.0839758311
z22 . . -0.0552335796
z23 . . .
z24 . . .
z25 . . .
z26 . . -0.0209694844
z27 . . 0.1475372478
z28 . . .
z29 . . .
z30 . . .
z31 . . -0.0120957621
z32 . . .
z33 . . .
z34 . . 0.0345114284
z35 . . .
z36 . . .
z37 . . -0.0408881624
z38 . . .
z39 . . -0.0023945306
z40 . . -0.0135155559
z41 . . .
z42 . . .
z43 . . 0.0501765464
z44 . . 0.0578256815
z45 . . -0.0682585395
z46 . . -0.0033829427
z47 . . 0.0458979268
z48 . . .
z49 . . -0.0235085731
z50 . . .
z51 . . .
z52 . . -0.0667957449
z53 . . .
z54 . . 0.0002443637
z55 . . .
z56 . . .
z57 . . 0.0242324411
z58 . . 0.0308628212
z59 . . .
z60 . . 0.0353664237
z61 . . .
z62 . . 0.0367126979
z63 . . .
z64 . . .
z65 . . 0.0902023966
z66 . . .
z67 . . -0.1160661948
z68 . . .
z69 . . .
z70 . . 0.0817966076
z71 . . .
z72 . . 0.0021532024
z73 . . 0.0424245272
z74 . . .
z75 . . .
z76 . . .
z77 . . -0.0187693362
z78 . . .
z79 . . 0.1060738780
z80 . . -0.2900691069
z81 . . .
z82 . . .
z83 . . -0.0577672389
z84 . . 0.0453429773
z85 . . .
z86 . . .
z87 . . 0.0229062192
z88 . . -0.0539264755
z89 . . .
z90 . . 0.0169112693
z91 . . .
z92 . . -0.0321959400
z93 . . -0.0416651462
z94 . . -0.0088839342
z95 . . 0.0443730370
z96 . . 0.1666781834
z97 . . .
z98 . . .
z99 . . 0.0393199615
z100 . . -0.0778771951
Use predict() on a fitted glmnet object to estimate predicted outcomes at specific \(\lambda\) values.
A data set of predictors must be supplied to predict() with newx=.
# training data outcome predictions at three lambda values
y_pred <- predict(fit_lasso, newx=xz, s=fit_lasso$lambda[c(2,10,50)])
# first 10 rows of predictions at each lambda
head(y_pred, n=10)
     s=3.28532713 s=1.56079655 s=0.03777328
[1,] -0.27727004 -2.2292917 -4.5230702
[2,] -0.29023088 -2.3982307 -4.1869625
[3,] -0.48639430 -0.5071706 -0.9317479
[4,] -0.01351316 0.4239464 0.6954105
[5,] -0.45602893 -2.2999218 -3.9897567
[6,] 0.09896796 4.1854244 7.4381806
[7,] -0.17889600 -3.0588763 -6.3669035
[8,] -0.17626043 0.8637251 1.9385714
[9,] -0.24148140 -0.4175899 -1.9661407
[10,] -0.09918585 2.4069175 4.1820346
Below we plot the observed outcome (black) and the predicted outcome (red) at the 3 \(\lambda\) values.
# plot observed and predicted outcomes for models with 3 lambda values
# each plot() below plots the observed data in black
# each points() below plots the predictions
par(mfrow=c(1,3)) # set up plotting area for 3 plots in one row
obs <- 1:200 # observation index (row in data set) is x-axis
plot(obs, y, main=expression(lambda[2]%~~%3.285), xlab="observation index")
points(obs, y_pred[,1], col="red")
plot(obs, y, main=expression(lambda[10]%~~%1.561), xlab="observation index")
points(obs, y_pred[,2], col="red")
plot(obs, y, main=expression(lambda[50]%~~%0.378), xlab="observation index")
points(obs, y_pred[,3], col="red")

Predictions can be made for new data specified in newx=.
Below we get predictions for \(x1=-3,-2,...,3\), the range of \(x1\) observed in the data, while holding all other 109 predictors at 0, at the same 3 \(\lambda\) values we used above.
# create new predictor matrix
xz_new <- matrix(rep(0, 7*110), nrow=7) # start with a 7X110 matrix of zeroes
xz_new[,1] <- -3:3 # make first column (x1) -3 through 3
y_pred_x1 <- predict(fit_lasso, newx=xz_new, s=fit_lasso$lambda[c(2,10,50)])
y_pred_x1
     s=3.28532713 s=1.56079655 s=0.03777328
[1,] -0.2774453 -1.8283590 -3.32893667
[2,] -0.2774453 -1.2702483 -2.24557207
[3,] -0.2774453 -0.7121375 -1.16220748
[4,] -0.2774453 -0.1540268 -0.07884289
[5,] -0.2774453 0.4040839 1.00452171
[6,] -0.2774453 0.9621946 2.08788630
[7,] -0.2774453 1.5203053 3.17125090
A plot of the predictions against \(x1=-3,-2,...,3\) shows how its coefficient starts at 0 at high \(\lambda\) and grows as \(\lambda\) decreases:
# plot predicted y (red) across x1=1:10, for models with 3 lambda values
# each plot() below plots the observed y vs observed x1 in black
# each points() below plots predicted y vs observed x1 in red
par(mfrow=c(1,3)) # set up plotting area for 3 plots in one row
plot(xz[,1], y, main=expression(paste(lambda[2]%~~%3.285, ", ", beta[1]==0)), xlab="x1")
points(xz_new[,1], y_pred_x1[,1], col="red", type="b")
plot(xz[,1], y, main=expression(paste(lambda[10]%~~%1.561, ", ", beta[1]==0.558)), xlab="x1")
points(xz_new[,1], y_pred_x1[,2], col="red", type="b")
plot(xz[,1], y, main=expression(paste(lambda[50]%~~%0.378, ", ", beta[1]==1.083)), xlab="x1")
points(xz_new[,1], y_pred_x1[,3], col="red", type="b")

cv.glmnet()

Generally, a good value of \(\lambda\) to use is not known a priori, so it is tuned via \(K\)-fold cross-validation with cv.glmnet().
Before splitting the data, glmnet() is run on all the training data to determine the \(\lambda\) sequence.
Then, cv.glmnet() splits the data into 10 folds (by default, can be changed with nfolds=), and for each fold:
fits the model with glmnet() using the data in all other folds
estimates test error at each \(\lambda\) using the held-out fold

After progressing through each fold, the test error at each \(\lambda\) is averaged over the \(K\) folds.
For linear regressions, the test error measure is MSE, but it can be changed with type.measure=.
Let's use cv.glmnet() to tune \(\lambda\) for our simulated data.
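A sketch of the call (the object name cv_fit and the seed are assumptions; the Call line in the print output below shows cv.glmnet(x = xz, y = y)):

set.seed(101)                  # assumed seed; CV results depend on the random fold assignment
cv_fit <- cv.glmnet(x=xz, y=y) # 10-fold CV across the lambda sequence, MSE as the test error measure
plot(cv_fit)                   # estimated test MSE vs log(lambda)
print(cv_fit)                  # summary at lambda.min and lambda.1se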
cv.glmnet object

Using plot() on a cv.glmnet fitted object displays how the estimated test error varies with \(\lambda\).
In the plot above:
"lambda.min", the \(\lambda\) value that produced the lowest MSE"lambda.1se", the largest \(\lambda\) that produces MSE within one standard deviation (of MSE) of the MSE at "lambda.min"
"lambda.1se" might be preferred to "lambda.min" if parsimony is criticalUsing print() on a fitted cv.glmnet object provides a summary of the model at "lambda.min" and "lambda.1se".
Lambda is the \(\lambda\) valueIndex is the position of the \(\lambda\) value in the sequenceMeasure is the model fit statistic (here MSE) used to select "lambda.min" and "lambda.1se"SE is the standard error of MeasureNonzero is the number of non-zero coefficients
Call: cv.glmnet(x = xz, y = y)
Measure: Mean-Squared Error
Lambda Index Measure SE Nonzero
min 0.09577 40 1.173 0.1404 21
1se 0.20158 32 1.298 0.1591 14
Use coef() to view the non-zero coefficients at "lambda.min" and "lambda.1se" (only one s= value is allowed in coef() for cv.glmnet objects):
111 x 2 sparse Matrix of class "dgCMatrix"
lambda.min lambda.1se
(Intercept) -0.055875689 -0.059932829
x1 1.014179329 0.976740851
x2 0.972657853 0.941136158
x3 1.080458555 1.044742296
x4 0.868502542 0.850174522
x5 0.865512174 0.830586236
x6 1.054412159 1.049472385
x7 1.009644318 0.999929630
x8 0.899071153 0.897799862
x9 0.923509777 0.921528197
x10 0.955276944 0.920458581
z1 . .
z2 . .
z3 . .
z4 . .
z5 . .
z6 . .
z7 . .
z8 . .
z9 . .
z10 . .
z11 . .
z12 . .
z13 . .
z14 . .
z15 . .
z16 . .
z17 . .
z18 . .
z19 0.006675561 .
z20 0.035252917 .
z21 0.019965285 .
z22 . .
z23 . .
z24 . .
z25 . .
z26 . .
z27 0.095711793 0.046942148
z28 . .
z29 . .
z30 . .
z31 . .
z32 . .
z33 . .
z34 . .
z35 . .
z36 . .
z37 . .
z38 . .
z39 . .
z40 . .
z41 . .
z42 . .
z43 . .
z44 . .
z45 . .
z46 . .
z47 0.021294076 .
z48 . .
z49 . .
z50 . .
z51 . .
z52 . .
z53 . .
z54 . .
z55 . .
z56 . .
z57 0.006398966 .
z58 . .
z59 . .
z60 . .
z61 . .
z62 . .
z63 . .
z64 . .
z65 0.046140567 0.029272534
z66 . .
z67 . .
z68 . .
z69 . .
z70 0.035666459 .
z71 . .
z72 . .
z73 0.036471921 0.002720046
z74 . .
z75 . .
z76 . .
z77 . .
z78 . .
z79 . .
z80 -0.157849444 .
z81 . .
z82 . .
z83 . .
z84 . .
z85 . .
z86 . .
z87 . .
z88 . .
z89 . .
z90 . .
z91 . .
z92 . .
z93 . .
z94 . .
z95 . .
z96 0.115250859 0.068722850
z97 . .
z98 . .
z99 . .
z100 . .
Now that we know how to tune \(\lambda\), we can proceed to inference.
Inference is generally more difficult with machine learning.
Inference after LASSO is generally complicated only by the latter 2 concerns.
Several methods have been developed to perform inference for regularized model fits, with most focusing on the LASSO to address both concerns.
One often overlooked assumption of regression models is that the set of predictors is fixed before looking at their relationship with the outcome.
If we use data to select a “best” or “relevant” set of predictors and then use the same data for inference, this assumption is violated.
Common selection methods: forward/backward/stepwise selection, predictors whose correlation with the outcome surpasses a threshold, predictors whose coefficients are significantly different from zero when modeled alone, LASSO
Generally, ignoring the uncertainty of which predictors are relevant will cause underestimation of uncertainty in model coefficients, leading to confidence intervals that are too narrow and overly optimistic p-values.
Using the same data for selection and inference has been so widely practiced that it likely contributes to the reproducibility crisis in science (Taylor & Tibshirani, 2015)
A simple simulation demonstrates this underestimation of uncertainty due to selection:
# simulation code
set.seed(12345)
n <- 100 # sample size
nsim <- 10000 # number of simulations
rho <- rep(0, nsim) # correlation(x,y)
b <- rep(0, nsim) # lm coef
pval <- rep(0, nsim) # pvalues
ll <- rep(0, nsim) # CI lower
ul <- rep(0, nsim) # CI upper
for (i in 1:nsim) {
ysim <- rnorm(n) # draw ysim
xsim <- rnorm(n) # draw x independently of y
rho[i] <- cor(ysim, xsim) # sample correlation
m <- lm(ysim ~ xsim) # lm model
b[i] <- coef(m)["xsim"] # save coefficient
pval[i] <- summary(m)$coefficients["xsim", "Pr(>|t|)"] # save pvalue
ci <- confint(m)["xsim",] # get CI
ll[i] <- ci[1] # save CI
ul[i] <- ci[2]
}
sel_inf <- data.frame(rho=rho, pval=pval, ll=ll, ul=ul) # put it all in data frame
selected <- abs(sel_inf$rho) > .2 # selection indicator used below

For selection, we will report p-values and confidence intervals when the correlation between x and y is greater than .2 (or less than -.2).
Under the null hypothesis that \(\beta_x=0\), which is true, p-values have a uniform distribution (in blue below) and the probability of observing \(p<=0.05\) is \(0.05\).
The p-values for \(\hat{\beta_x}\) after selection (henceforth known as selection p-values) have a much different distribution concentrated near 0 (in red below).
# histograms of all p-values and selection p-values
par(mfrow=c(2,1),
mar=c(4,4,2,1),
cex=.8)
mybreaks <- pretty(range(sel_inf$pval), n=50)
plot(hist(sel_inf$pval, plot=F, breaks=mybreaks), col=rgb(0,0,1, alpha=.5), main="all p-values have a uniform distribution", xlab = "p-value")
plot(hist(sel_inf$pval[selected], plot=F, breaks=mybreaks), col=rgb(1,0,0, alpha=.5), main="p-values after selection have a very different distribution", xlab="p-value")

For all p-values, the probability that \(p<.05\) is indeed near 0.05:
Thus, if we were to always model x without selection, the probability of a false positive is about 0.05.
However, for the selection p-values, the probability of a false positive is 1!
# probability of p<0.05 for selection p-values
sum(sel_inf$pval[selected] < 0.05)/length(sel_inf$pval[selected])
[1] 1
Thus, these p-values calculated only for selected samples certainly do not represent the probability of observing an extreme test statistic (\(|t|\gt t_{crit}\)) under the null hypothesis.
Similarly, confidence intervals inferred from the same data used for selection will not have their desired properties.
95% confidence intervals should contain the true parameter value, \(\beta_x=0\) in 95% of samples, i.e. 95% coverage.
This is true in our simulation for all coefficients:
# proportion of all confidence intervals that contain 0
sum((sel_inf$ll < 0) & (0 < sel_inf$ul))/length(sel_inf$ll)
[1] 0.9505
The selection confidence intervals have 0% coverage, because they never contain 0 (coincides with all selection p-values < 0.05):
A number of methods collectively called selective inference have been developed that address the problem of inference after a selection procedure (Taylor & Tibshirani, 2015).
Selective inference methods can be broadly classified into 3 categories (Kuchibhotla et al., 2022): sample splitting, conditional selective inference, and simultaneous inference.
The simplest method to avoid problems of using the same data to select predictors and then again for inference is sample splitting.
In sample splitting, the data are randomly split into 2 parts (Wasserman & Roeder, 2009):
The major disadvantage of sample splitting is that less data is used for both selection and inference.
We demonstrate by splitting our data d_lasso into halves.
Then we tune \(\lambda\) in the first half, and use the predictors selected at "lambda.1se" for a more parsimonious model.
set.seed(82945)
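# assumed split (not shown above): halve the 200 rows of d_lasso into the d_sel and d_inf
# data frames used below
split_ind <- sample(1:nrow(d_lasso), size=nrow(d_lasso)/2) # random half of the row indices
d_sel <- d_lasso[split_ind,]  # selection half (n=100)
d_inf <- d_lasso[-split_ind,] # inference half (n=100)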
xz_sel <- as.matrix(d_sel[,2:111]) # selection data predictor matrix
y_sel <- as.matrix(d_sel[,1]) # selection data outcome vector
cv_fit_sel <- cv.glmnet(xz_sel, y_sel) # cross-validate to tune lambda and select predictors
coef(cv_fit_sel, s="lambda.1se") # non-zero coefficients at lambda.1se111 x 1 sparse Matrix of class "dgCMatrix"
lambda.1se
(Intercept) -0.09829234
x1 1.04015614
x2 0.89411437
x3 1.01871272
x4 0.85994520
x5 0.51793015
x6 0.97266368
x7 1.03250572
x8 0.77971793
x9 0.81640041
x10 1.14665334
z1 .
z2 .
z3 .
z4 .
z5 .
z6 .
z7 .
z8 .
z9 .
z10 .
z11 .
z12 .
z13 .
z14 .
z15 .
z16 .
z17 .
z18 .
z19 .
z20 .
z21 0.10113821
z22 .
z23 .
z24 .
z25 .
z26 .
z27 0.05186625
z28 .
z29 .
z30 .
z31 .
z32 .
z33 .
z34 .
z35 .
z36 .
z37 .
z38 .
z39 .
z40 .
z41 .
z42 .
z43 .
z44 .
z45 .
z46 .
z47 .
z48 .
z49 .
z50 .
z51 .
z52 .
z53 .
z54 0.12353828
z55 .
z56 .
z57 .
z58 .
z59 .
z60 .
z61 .
z62 .
z63 .
z64 .
z65 .
z66 .
z67 .
z68 .
z69 .
z70 .
z71 .
z72 .
z73 0.01223998
z74 .
z75 .
z76 .
z77 .
z78 .
z79 .
z80 .
z81 .
z82 .
z83 .
z84 .
z85 .
z86 .
z87 0.01532838
z88 .
z89 .
z90 .
z91 .
z92 .
z93 .
z94 .
z95 .
z96 .
z97 .
z98 .
z99 .
z100 .
Above we see 15 non-zero coefficients (excluding the intercept): the 10 true predictors plus z21, z27, z54, z73, and z87.
The columns of the predictor training matrix corresponding to the non-zero coefficients can be accessed in the object returned by coef() with @i
# columns of xz_sel corresponding to non-zero coefficients
xz_sel_ind <- coef(cv_fit_sel, s="lambda.1se")@i # column indices of selected predictors
xz_sel_ind # index 0 is the intercept, which is always selected
 [1]  0  1  2  3  4  5  6  7  8  9 10 31 37 64 83 97
We can then perform inference on these columns using the second subset. We use lm() for an unpenalized model (i.e. coefficients are unbiased, unlike those in LASSO):
# run linear regression on 2nd subset using predictors selected in 1st subset
f <- paste("y ~", paste(colnames(xz_sel)[xz_sel_ind[-1]], collapse="+")) # create lm formula using column names of xz_sel
f[1] "y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+z21+z27+z54+z73+z87"
Call:
lm(formula = f, data = d_inf)
Residuals:
Min 1Q Median 3Q Max
-2.77695 -0.68562 -0.01483 0.70274 2.58269
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.04329 0.11871 -0.365 0.716
x1 0.91117 0.12544 7.264 1.77e-10 ***
x2 0.99319 0.13664 7.268 1.73e-10 ***
x3 1.02833 0.12860 7.997 6.16e-12 ***
x4 0.92140 0.13877 6.640 2.92e-09 ***
x5 1.17323 0.13302 8.820 1.37e-13 ***
x6 1.20620 0.13039 9.251 1.85e-14 ***
x7 1.08318 0.12649 8.563 4.49e-13 ***
x8 0.94158 0.13322 7.068 4.29e-10 ***
x9 0.98493 0.12835 7.674 2.71e-11 ***
x10 0.84850 0.13792 6.152 2.49e-08 ***
z21 -0.08537 0.13673 -0.624 0.534
z27 0.17788 0.14427 1.233 0.221
z54 -0.11317 0.13074 -0.866 0.389
z73 0.10923 0.13178 0.829 0.410
z87 -0.12212 0.13232 -0.923 0.359
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.11 on 84 degrees of freedom
Multiple R-squared: 0.9683, Adjusted R-squared: 0.9627
F-statistic: 171.1 on 15 and 84 DF, p-value: < 2.2e-16
The confidence intervals for all of the regression coefficients contain the true parameter values (\(\beta_0=0\), \(\beta_{x1}=\beta_{x2}=...=\beta_{x10}=1\), \(\beta_{z1}=\beta_{z2}=...=\beta_{z100}=0\)).
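These intervals can be obtained with confint() applied to the inference-half model (m_inf is the object name assumed in the sketch above):

confint(m_inf) # 95% confidence intervals for the selected predictors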
2.5 % 97.5 %
(Intercept) -0.2793542 0.1927652
x1 0.6617064 1.1606250
x2 0.7214565 1.2649211
x3 0.7725980 1.2840567
x4 0.6454536 1.1973541
x5 0.9087133 1.4377542
x6 0.9469135 1.4654957
x7 0.8316320 1.3347250
x8 0.6766531 1.2065017
x9 0.7297027 1.2401619
x10 0.5742381 1.1227606
z21 -0.3572702 0.1865276
z27 -0.1090141 0.4647676
z54 -0.3731493 0.1468177
z73 -0.1528357 0.3712892
z87 -0.3852672 0.1410179
The unpenalized model run on the first data subset used for selection (remember, don’t do this!) results in a false positive test of significance for \(z21\):
# problematic inference if unpenalized model is run on data
# used for selection
m_sel <- lm(f, data=d_sel)
summary(m_sel)
Call:
lm(formula = f, data = d_sel)
Residuals:
Min 1Q Median 3Q Max
-2.35513 -0.52145 0.02935 0.60783 2.12045
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.03296 0.09642 -0.342 0.7333
x1 1.11606 0.10056 11.099 < 2e-16 ***
x2 0.98712 0.10426 9.468 6.80e-15 ***
x3 1.11358 0.12156 9.160 2.82e-14 ***
x4 0.90027 0.10011 8.993 6.13e-14 ***
x5 0.59005 0.10604 5.564 3.07e-07 ***
x6 1.07014 0.10317 10.372 < 2e-16 ***
x7 1.03268 0.11405 9.055 4.61e-14 ***
x8 0.84885 0.09158 9.269 1.70e-14 ***
x9 0.86572 0.13661 6.337 1.11e-08 ***
x10 1.19627 0.11337 10.552 < 2e-16 ***
z21 0.24648 0.10596 2.326 0.0224 *
z27 0.10270 0.10197 1.007 0.3168
z54 0.14421 0.11117 1.297 0.1981
z73 0.04630 0.09679 0.478 0.6336
z87 0.05788 0.10884 0.532 0.5962
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8704 on 84 degrees of freedom
Multiple R-squared: 0.9831, Adjusted R-squared: 0.98
F-statistic: 325.1 on 15 and 84 DF, p-value: < 2.2e-16
A common problem encountered with LASSO selection is that the best set of predictors may be sensitive to how the data are split during cross-validation.
Below we re-run the \(\lambda\) cross-validation and predictor selection 3 times and can see that the number of nonzero coefficients varies at "lambda.min" and "lambda.1se":
# selected predictor set varies across cross-validation runs
set.seed(1290)
cv.glmnet(xz_sel, y_sel)
Call: cv.glmnet(x = xz_sel, y = y_sel)
Measure: Mean-Squared Error
Lambda Index Measure SE Nonzero
min 0.09394 82 1.185 0.1192 29
1se 0.20715 65 1.296 0.1600 18
Call: cv.glmnet(x = xz_sel, y = y_sel)
Measure: Mean-Squared Error
Lambda Index Measure SE Nonzero
min 0.0856 84 1.215 0.1586 31
1se 0.2273 63 1.365 0.1992 17
Call: cv.glmnet(x = xz_sel, y = y_sel)
Measure: Mean-Squared Error
Lambda Index Measure SE Nonzero
min 0.1428 73 1.188 0.1322 20
1se 0.2614 60 1.316 0.2009 16
Stability sampling (Meinshausen & Bühlmann, 2010) addresses predictor set variability using this general algorithm: repeatedly draw random subsamples of the data, fit the LASSO at a fixed \(\lambda\) to each subsample, and count how often each predictor is selected; predictors selected in at least a proportion \(\pi_{thr}\) of subsamples form the stable set.
Below we use stability sampling on our training data (already a subset of 100 observations due to sampling splitting) to stabilize selection of the predictor set:
# stability selection
set.seed(13459)
lam <- cv_fit_sel$lambda.min # use lambda.min run on entire training sample (after sample splitting)
selected <- rep(0, 110) # counts of being selected for 110 predictors
R <- 100 # number of simulations
for (i in 1:R) {
stab_ind <- sample(1:100, size=50) # 50 observations resampled each time
xz_stab <- xz_sel[stab_ind,] # resampled predictors
y_stab <- y_sel[stab_ind,] # resampled outcome
ind_selected <- coef(glmnet(xz_stab, y_stab, lambda=lam))@i[-1] # column indices of selected predictor, -1 removes intercept
selected[ind_selected] <- selected[ind_selected] + 1 # increase counts for selected predictors
}
# predictor set at 4 different pi thresholds
which(selected >= .6*R) # selected in more than 60% of simulations
 [1]   1   2   3   4   5   6   7   8   9  10  31  97 105
which(selected >= .7*R) # selected in more than 70% of simulations
 [1]  1  2  3  4  5  6  7  8  9 10 31
which(selected >= .8*R) # selected in more than 80% of simulations
 [1]  1  2  3  4  5  6  7  8  9 10
which(selected >= .9*R) # selected in more than 90% of simulations
 [1]  1  2  3  4  5  6  7  8  9 10
All 10 \(x\) predictors (corresponding to 1-10) are selected at each threshold, and are the only predictors selected at threshold \(\pi_{thr}=.8\) and \(\pi_{thr}=.9\).
Meinshausen & Bühlmann (2010) recommend the use of the randomized LASSO to improve stability selection.
With the randomized LASSO, the penalties are perturbed randomly across the coefficients.
Imagine two predictors, \(X_1\) and \(X_2\), that are correlated and equally predictive of the outcome \(Y\); the LASSO tends to select one of them somewhat arbitrarily and drop the other, so randomly perturbing the penalties gives each a chance to be selected across subsamples.
Essentially, the randomized LASSO adds weights to the penalty term:
\[Loss_{lasso}(\beta) = Loss_{reg}(\beta) + \lambda\sum_{j=1}^p\frac{1}{w_j}|\beta_j| \]
These penalty weights, \(w_j\) are drawn from a uniform distribution with bounds \([\alpha, 1]\), \(w_j \sim Uniform(\alpha, 1)\), with \(0.2 \le \alpha \le 0.8\) recommended, with lower values of \(\alpha\) adding more randomness to the penalties.
We will see in part 2 of this workshop how random forests introduce randomness in regression trees to improve prediction.
Below we implement the randomized LASSO in our stability sampling:
draw random penalty weights with runif() (one for each column of the predictor matrix)
apply the weights with penalty.factor= in glmnet()

set.seed(13769)
lam <- cv_fit_sel$lambda.min # use lambda.min run on entire training sample (after sample splitting)
selected_randomized <- rep(0, 110) # counts of being selected for 110 predictors
R <- 100 # number of simulations
for (i in 1:R) {
stab_ind <- sample(1:100, size=50) # 50 observations resampled each time
xz_stab <- xz_sel[stab_ind,] # resampled predictors
y_stab <- y_sel[stab_ind,] # resampled outcome
pf <- 1/runif(110, .4, 1) # 1/w_j penalty weights
ind_selected <- coef(glmnet(xz_stab, y_stab, lambda=lam, penalty.factor=pf))@i[-1] # apply weights using penalty.factor=
selected_randomized[ind_selected] <- selected_randomized[ind_selected] + 1
}
# predictor sets at different pi thresholds
which(selected_randomized >= .6*R) # selected in more than 60% of simulations
 [1]   1   2   3   4   5   6   7   8   9  10  64 105
which(selected_randomized >= .7*R) # selected in more than 70% of simulations
 [1]  1  2  3  4  5  6  7  8  9 10 64
which(selected_randomized >= .8*R) # selected in more than 80% of simulations
 [1]  1  2  3  4  5  6  7  8  9 10
which(selected_randomized >= .9*R) # selected in more than 90% of simulations
 [1]  1  2  3  4  5  6  7  8  9 10
For our simulated data, the randomized LASSO only provides a little more stability to the selected predictor set.
Conditional selective inference methods calculate p-values and confidence intervals (CIs) that account for the fact that the set of predictors was chosen by examining their relationships with the outcome.
Conditional selective inference methods do not require the sample to be split, so the same data are used efficiently for selection and inference.
Recall that test-statistics used to evaluate null hypotheses for regression coefficients (and construct CIs) are formed by dividing the estimated coefficient by its standard error, e.g. \(t_\hat{\beta}\) in linear regression:
\[t_\hat{\beta_j} = \frac{\hat{\beta_j}}{\hat{\sigma}_{\hat{\beta_j}}}\]
\(t_\hat{\beta_j}\) will be near 0 if either \(\hat{\beta_j}\) is near 0 or if \(\hat{\sigma}_{\hat{\beta_j}}\) is very large.
In one conditional selective inference approach (Lee et al., 2016), the test statistic distribution is truncated to account for the fact the selection procedure prevents predictors whose coefficients have test-statistics near zero from being entered into the inference model.
For a given test statistic, this approach essentially calculates bounds defining the interval the test statistic must lie due to the selection event.
These bounds truncate the test statistic’s sampling distribution, and CIs and p-values are then calculated based on this truncated distribution.
To understand how a truncated distribution changes inference, imagine we randomly draw \(x\) repeatedly from a standard normal distribution, \(x \sim N(0,1)\):
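A small numerical sketch of the idea (the selection rule |x| > 1 here is an arbitrary illustration, not the rule used by the method):

set.seed(1)
x <- rnorm(1e5)         # draws from N(0,1)
mean(abs(x) > 1.96)     # about 0.05: probability of an "extreme" value under the full distribution
x_sel <- x[abs(x) > 1]  # keep only draws that survive the selection rule |x| > 1
mean(abs(x_sel) > 1.96) # about 0.16: extreme values are far more common under the truncated distribution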
The selectiveInference package

The selectiveInference package (Lee et al., 2016) contains conditional selective inference methods for estimating p-values and confidence intervals.
The fixedLassoInf() function performs selective inference for a fixed \(\lambda\).
Technically, then, it should not be used on data where \(\lambda\) was determined by cross-validation rather than pre-determined. However, the performance of fixedLassoInf() with cross-validated \(\lambda\) has been shown to be “acceptable” via simulation (Kammer et al., 2022).
To use fixedLassoInf(), specify:
x=: the training predictor matrix
y=: the training outcome vector
alpha=: significance level for CIs, .1 by default
beta=: estimated LASSO coefficients; fixedLassoInf() expects coefficients for unstandardized predictors, so set standardize=F in glmnet() or cv.glmnet()
lambda=: the \(\lambda\) value used to compute beta; the lambda= expected by fixedLassoInf() is the \(\lambda\) used in glmnet multiplied by the sample size

Neglecting the 2 important points above will yield incorrect results.
# REMEMBER: this method may not be theoretically justified if we determine lambda
# through cross-validation; use with caution!
library(selectiveInference) # provides fixedLassoInf() and estimateSigma()
set.seed(1328) # set random seed to replicate results
# tune lambda using all training data
cv_fit_lasso <- cv.glmnet(xz, y, standardize=F) # get lambda.min via CV on all data
# selective inference for training data
lambda.min <- cv_fit_lasso$lambda.min
n <- nrow(xz) # sample size n
beta <- coef(cv_fit_lasso, x=xz, y=y, s=lambda.min, exact=T)[-1] # coefficients excluding intercept
fLI <- fixedLassoInf(x=xz, y=y,
beta=beta, # coefficient vector
alpha=0.05, # significance level
lambda=lambda.min*n, # multiply lambda by sample size
)

Warning in fixedLassoInf(x = xz, y = y, beta = beta, alpha = 0.05, lambda =
lambda.min * : p > n/2, and sd(y) = 5.945 used as an estimate of sigma; you may
want to use the estimateSigma function
Call:
fixedLassoInf(x = xz, y = y, beta = beta, lambda = lambda.min *
n, alpha = 0.05)
Standard deviation of noise (specified or estimated) sigma = 5.945
Testing results at lambda = 19.532, with alpha = 0.050
Var Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
1 1.058 2.153 0.396 -10.083 13.282 0.025 0.025
2 1.006 2.048 0.465 -18.039 18.871 0.025 0.025
3 1.109 2.196 0.004 0.798 37.010 0.025 0.025
4 0.879 1.785 0.030 -0.106 16.626 0.025 0.025
5 0.904 1.845 0.447 -5.202 4.482 0.025 0.025
6 1.046 2.058 0.060 -1.281 30.699 0.025 0.025
7 1.014 2.046 0.024 0.015 33.124 0.025 0.025
8 0.882 1.885 0.008 0.683 Inf 0.025 0.000
9 0.915 1.736 0.015 0.253 23.569 0.025 0.025
10 0.986 1.875 0.038 -0.306 15.770 0.025 0.025
29 0.033 0.064 0.877 -Inf 9.958 0.000 0.025
30 0.071 0.141 0.727 -36.803 12.499 0.025 0.025
31 0.049 0.095 0.861 -Inf 11.026 0.000 0.025
37 0.147 0.283 0.351 -10.730 20.363 0.025 0.025
57 0.048 0.104 0.848 -30.069 4.010 0.025 0.025
67 0.045 0.091 0.992 -Inf -1.571 0.000 0.025
75 0.058 0.117 0.452 -18.612 22.703 0.025 0.025
80 0.077 0.159 0.763 -27.716 7.433 0.025 0.025
83 0.078 0.159 0.598 -19.968 12.966 0.025 0.025
90 -0.374 -0.751 0.736 -1.865 8.176 0.025 0.025
106 0.156 0.352 0.242 -5.263 17.994 0.025 0.025
Note: coefficients shown are partial regression coefficients
After our first run, we get a warning that p>n/2, and sd(y)=5.945 used as an estimate of sigma; you may want to use the estimateSigma function, so we should be suspicious of the results (confidence intervals are extremely wide).
This warning indicates that fixedLassoInf() is having trouble estimating the error (residual) standard deviation (via lm()) because of the high-dimensionality of the data.
We can use selectiveInference’s estimateSigma() function to get an estimate of the error standard deviation, which then stabilizes the results and removes the warning.
y_sigma <- estimateSigma(xz, y)$sigmahat # get a better estimate of the residual standard deviation
fLI <- fixedLassoInf(xz, y,
beta=beta,
alpha=0.05,
lambda=lambda.min*n, # multiply lambda by sample size
sigma=y_sigma # supply the estimated residual standard deviation to stabilize results
)
fLI
Call:
fixedLassoInf(x = xz, y = y, beta = beta, lambda = lambda.min *
n, sigma = y_sigma, alpha = 0.05)
Standard deviation of noise (specified or estimated) sigma = 1.011
Testing results at lambda = 19.532, with alpha = 0.050
Var Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
1 1.058 12.659 0.000 0.716 1.430 0.025 0.025
2 1.006 12.038 0.001 0.442 1.535 0.025 0.025
3 1.109 12.909 0.000 0.966 2.155 0.025 0.025
4 0.879 10.493 0.000 0.721 1.349 0.024 0.025
5 0.904 10.849 0.000 0.689 1.075 0.024 0.024
6 1.046 12.098 0.000 0.892 1.912 0.024 0.025
7 1.014 12.027 0.000 0.872 1.951 0.025 0.025
8 0.882 11.084 0.000 0.764 2.308 0.025 0.025
9 0.915 10.208 0.000 0.751 1.581 0.024 0.025
10 0.986 11.025 0.000 0.814 1.429 0.024 0.025
29 0.033 0.374 0.848 -2.984 0.345 0.025 0.025
30 0.071 0.827 0.618 -1.002 0.454 0.025 0.025
31 0.049 0.560 0.821 -2.667 0.389 0.025 0.025
37 0.147 1.663 0.146 -0.194 0.744 0.025 0.025
57 0.048 0.610 0.688 -0.830 0.213 0.025 0.025
67 0.045 0.535 0.954 -7.430 0.069 0.025 0.025
75 0.058 0.690 0.373 -0.494 0.724 0.025 0.025
80 0.077 0.933 0.585 -0.735 0.324 0.025 0.024
83 0.078 0.937 0.444 -0.513 0.470 0.025 0.025
90 -0.374 -4.414 0.006 -0.540 -0.099 0.024 0.025
106 0.156 2.068 0.047 -0.039 0.682 0.025 0.025
Note: coefficients shown are partial regression coefficients
The columns above are:
Var: column of predictor matrix
Coef: estimated coefficient (unregularized)
Z-score: test statistic
P-value: one-sided selection p-value
LowConfPt: lower limit of selection-corrected confidence interval
UpConfPt: upper limit of selection-corrected confidence interval
LowTailArea: achieved area in lower tail
UpTailArea: achieved area in upper tail (1 - (LowTailArea + UpTailArea) equals the estimated coverage of the confidence interval)

Another approach to estimating confidence intervals that account for selection is to estimate simultaneous confidence intervals (Kuchibhotla et al., 2022), which control simultaneous coverage, the probability that all of the intervals contain their true parameter values simultaneously.
Simultaneous CIs provide protection against making errors of inference, i.e. inferring a CI contains the true value when it does not, due to multiplicity, similar to procedures that control the family-wise error rate (e.g. Bonferroni).
A description of how Bonferroni simultaneous CIs are constructed illustrates the protection they provide:
Imagine we repeatedly sample \(n=100\) observations from a population with outcome \(Y\) and 20 predictors, \(X=X_1, X_2,...,X_{20}\), all unrelated to the outcome, \(\beta_1=\beta_2=...=\beta_{20}=0\).
If we construct naive (non-simultaneous) 95% CIs, over repeated samples, the probability that all 20 intervals contain their true values (0) simultaneously is \(.95^{20}=0.358\).
The probability that at least one interval does not contain its true value is \(1-0.358=0.642\), which in this scenario is the family-wise error rate (FWER), or the probability of at least 1 false positive claim among 20 p-values equal to 0.05.
Methods such as the Bonferroni procedure control the FWER.
The Bonferroni procedure increases the critical-value of the test statistic used to determine both the CI width and the p-value.
For a given estimated coefficient \(\hat\beta\), its confidence interval is determined by its standard error, \(\hat\sigma_{\beta}\), and the critical t-value, \(t_{crit}\):
\[CI_{\hat\beta} = \left[\hat\beta - t_{crit}\hat\sigma_{\beta}, \hat\beta + t_{crit}\hat\sigma_{\beta} \right]\]
For a linear regression model run on \(n=100\) and 20 predictors and significance level \(\alpha=0.05\), corresponding to 95% CIs:
The Bonferroni simultaneous CIs (red) constructed using \(t_{crit.bonf}=3.09\) are about 55% wider than the naive CIs (blue), and they all contain their true values of 0.
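A quick check of these critical values (a sketch; the exact figures depend on the residual degrees of freedom used, here taken as n - p - 1):

n <- 100; p <- 20
df_resid <- n - p - 1                       # 79 residual degrees of freedom
t_crit      <- qt(1 - 0.05/2, df_resid)     # naive 95% CI critical value, about 2.0
t_crit_bonf <- qt(1 - 0.05/(2*p), df_resid) # Bonferroni-adjusted critical value, about 3.1
t_crit_bonf / t_crit                        # ratio of CI widths: the Bonferroni CIs are over 50% wider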
The PosiR package

Kuchibhotla et al., 2022, developed a method to calculate simultaneous CIs after selection.
They define a universe of models \(Q\) as every predictor set considered by the selection procedure.
The simultaneous inference method calculates confidence intervals that guarantee the desired coverage (e.g. 95%) for all models defined by predictor set \(q\) in the universe \(Q\).
Their method is implemented in the PosiR package.
Below, we request simultaneous inference confidence intervals, using every model in the lasso sequence used during cross-validation as the "universe" of models:
use sapply() to get the selected predictors at each lasso fitted model
simultaneous_ci() adds an intercept column (all 1s) as the first column of the predictor matrix, so we increment the selected predictor column indices by 1
use simultaneous_ci() to get simultaneous confidence intervals for all lasso fits in the sequence
X= is the full predictor matrix, to which simultaneous_ci() adds the intercept column
y= is the outcome vector (must be a vector)
Q_universe= is the list of predictor sets (columns of the predictor matrix in X=) for all lasso fits

library(PosiR) # provides simultaneous_ci()
lambdas <- cv_fit_lasso$lambda # get lambda sequence (83 lambdas)
all_selected <- sapply(lambdas, function(x) (coef(cv_fit_lasso, s=x)@i + 1)[-1] ) # columns of predictor matrix selected stored as elements of list
names(all_selected) <- paste0("lam", 1:83) # name list elements with "lam" and index in sequence (required by simultaneous_ci)
# get CIs for each LAASO fit
sim_inf_ci <- simultaneous_ci(X=xz, y=as.numeric(y), Q_universe=all_selected) # takes about 80 sec

Now we have simultaneous CIs for all 83 LASSO fits.
Let’s look at the simultaneous CIs at "lambda.min".
# lambda.min is 40th lambda
lam40 <- sim_inf_ci$intervals[sim_inf_ci$intervals$model_id=="lam40",]
knitr::kable(lam40, digits=4)

|   | model_id | coefficient_name | estimate | lower | upper | psi_hat_nqj | se_nqj |
|---|---|---|---|---|---|---|---|
| 435 | lam40 | (Intercept) | -0.0594 | -0.4149 | 0.2960 | 1.0785 | 0.0734 |
| 436 | lam40 | x1 | 1.0582 | 0.6261 | 1.4903 | 1.5942 | 0.0893 |
| 437 | lam40 | x10 | 0.9855 | 0.5422 | 1.4289 | 1.6782 | 0.0916 |
| 438 | lam40 | x2 | 1.0059 | 0.6367 | 1.3751 | 1.1637 | 0.0763 |
| 439 | lam40 | x3 | 1.1086 | 0.7455 | 1.4717 | 1.1256 | 0.0750 |
| 440 | lam40 | x4 | 0.8794 | 0.4651 | 1.2937 | 1.4654 | 0.0856 |
| 441 | lam40 | x5 | 0.9038 | 0.4925 | 1.3150 | 1.4441 | 0.0850 |
| 442 | lam40 | x6 | 1.0460 | 0.6198 | 1.4723 | 1.5513 | 0.0881 |
| 443 | lam40 | x7 | 1.0145 | 0.5782 | 1.4507 | 1.6250 | 0.0901 |
| 444 | lam40 | x8 | 0.8820 | 0.5042 | 1.2598 | 1.2187 | 0.0781 |
| 445 | lam40 | x9 | 0.9151 | 0.4839 | 1.3464 | 1.5881 | 0.0891 |
| 446 | lam40 | z19 | 0.0335 | -0.4152 | 0.4822 | 1.7189 | 0.0927 |
| 447 | lam40 | z20 | 0.0714 | -0.2849 | 0.4277 | 1.0838 | 0.0736 |
| 448 | lam40 | z21 | 0.0486 | -0.3688 | 0.4659 | 1.4871 | 0.0862 |
| 449 | lam40 | z27 | 0.1469 | -0.2779 | 0.5717 | 1.5409 | 0.0878 |
| 450 | lam40 | z47 | 0.0480 | -0.2758 | 0.3719 | 0.8955 | 0.0669 |
| 451 | lam40 | z57 | 0.0448 | -0.3187 | 0.4083 | 1.1282 | 0.0751 |
| 452 | lam40 | z65 | 0.0579 | -0.3820 | 0.4979 | 1.6528 | 0.0909 |
| 453 | lam40 | z70 | 0.0770 | -0.3484 | 0.5025 | 1.5454 | 0.0879 |
| 454 | lam40 | z73 | 0.0777 | -0.2927 | 0.4481 | 1.1715 | 0.0765 |
| 455 | lam40 | z80 | -0.3739 | -0.7467 | -0.0011 | 1.1869 | 0.0770 |
| 456 | lam40 | z96 | 0.1557 | -0.2284 | 0.5399 | 1.2600 | 0.0794 |
In the table above:
model_id: the name of the model used in the Q_universe= list
coefficient_name: predictor name, sorted in alphanumeric order
estimate: unregularized estimate
lower and upper: limits of the simultaneous confidence interval
psi_hat_nqj: bootstrap variance estimate, \(\hat{\Psi}_{n,q,j}\)
se_nqj: standard error estimate, \(\sqrt{\frac{\hat{\Psi}_{n,q,j}}{n}}\)

Below we compare the CIs for the 10 \(x\) predictors across our inference procedures.
# oracle model is true model
oracle <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10, data=d_lasso)
oracle_ci <- confint(oracle)
# full model includes all 110 predictors
f_full <- as.formula(paste("y ~ ", paste(colnames(xz), collapse="+")))
full <- lm(f_full, data=d_lasso)
full_ci <- confint(full)

False positives
The inference procedures did produce a few false positives:
The full model with all 110 predictors produced 6 false positives: z18, z21, z88, z90, z91, z11.
Sample splitting produced no false positives: none of the z predictors selected in the first half were significant in the inference model.
Stability sampling produced 3 false positives at \(\pi_{thr}=0.6\) (z21, z87, z95) and 1 false positive at \(\pi_{thr}=0.7\) (z21).
Stability sampling with the randomized LASSO produced 2 false positives at \(\pi_{thr}=0.6\) (z54, z95) and 1 false positive at \(\pi_{thr}=0.7\) (z54).
fixedLassoInf() produced 1 false positive at significance level 0.05: z80.
simultaneous_ci() produced 1 false positive: z80
We will practice fitting LASSO-regularized linear regressions, tuning \(\lambda\), and performing inference after selection.
For the exercise, we will use the QuickStartExample data that comes with the glmnet package.
Use the code below to load the data and set up the matrices for glmnet.
Use cross-validation to find the \(\lambda\) that minimizes test error. Set the random seed to 123.
Create a plot of estimated test MSE vs \(-log(\lambda)\) to see how test error varies with the penalty. Identify "lambda.min" on the plot.
Examine the non-zero coefficients at "lambda.min".
Now we wish to perform inference using sample splitting first.
"lambda.min". Determine which coefficients are non-zero.lm() on the inference subset, using the predictors selected in the first subset. Use summary() to get p-values and conf.int() for confidence intervals.fixedLassoInf() on the whole training data (x and y) to perform selective inference with 95% CIs on the same data used for selection.cv.glmnet() with unstandardized predictors."lambda.min" from this cv.glmnet object."lambda.min". Remember to remove the intercept.fixedLassoInf() for selectiveInference. Remember to multiply "lambda.min" by the sample size 100, and to set alpha= to 0.05.Machine learning methods are well suited for scientific problems where the researcher does not want to make many assumptions or decisions about the model.
Machine learning methods perform better with large data.
LASSO regularization can be used to select a predictor set, and the penalty hyperparameter \(\lambda\) that determines the set can be tuned via cross-validation.
Inference should not be performed on the same data used to select the predictor set without accounting for the selection procedure.
Beam, A. L., & Kohane, I. S. (2018). Big data and machine learning in health care. Jama, 319(13), 1317-1318.
Friedman, J. H., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33, 1-22.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning: with applications in R (Vol. 103). New York: springer.
Kammer, M., Dunkler, D., Michiels, S., & Heinze, G. (2022). Evaluating methods for Lasso selective inference in biomedical research: a comparative simulation study. BMC Medical Research Methodology, 22(1), 206.
Kuchibhotla, A. K., Kolassa, J. E., & Kuffner, T. A. (2022). Post-selection inference. Annual Review of Statistics and Its Application, 9, 505-527.
Meinshausen, N., & Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society Series B: Statistical Methodology, 72(4), 417-473.
Rahal, C., Verhagen, M., & Kirk, D. (2024). The rise of machine learning in the academic social sciences. AI & SOCIETY, 39(2), 799-801.
Rajula, H. S. R., Verlato, G., Manchia, M., Antonucci, N., & Fanos, V. (2020). Comparison of conventional statistical methods with machine learning in medicine: diagnosis, drug development, and treatment. Medicina, 56(9), 455.
Taylor, J., & Tibshirani, R. J. (2015). Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25), 7629-7634.
Wasserman, L., & Roeder, K. (2009). High dimensional variable selection. Annals of statistics, 37(5A), 2178.
# Prepare data
library(glmnet)
data(QuickStartExample)
x <- QuickStartExample$x
y <- QuickStartExample$y
colnames(x) <- paste0("x", 1:20)
# 1.
set.seed(123)
cv_fit_ex1 <- cv.glmnet(x, y)
# 2.
plot(cv_fit_ex1)

# 3.
coef(cv_fit_ex1, s="lambda.min")
21 x 1 sparse Matrix of class "dgCMatrix"
lambda.min
(Intercept) 0.145832036
x1 1.340981414
x2 .
x3 0.708347140
x4 .
x5 -0.848087765
x6 0.554823782
x7 0.038519738
x8 0.347221319
x9 .
x10 0.010034050
x11 0.186365265
x12 .
x13 .
x14 -1.086006902
x15 -0.004444319
x16 .
x17 .
x18 .
x19 .
x20 -1.069942845
# 4.
# use the following code to split the sample into two halves
set.seed(123)
ind <- sample(1:100, size=50)
x_sel <- x[ind,]
y_sel <- y[ind]
x_ind <- x[-ind,]
y_ind <- y[-ind]
d_inf <- data.frame(y=y_ind, x_ind)
cv_fit_sel <- cv.glmnet(x_sel, y_sel)
coef(cv_fit_sel, s="lambda.min")21 x 1 sparse Matrix of class "dgCMatrix"
lambda.min
(Intercept) 0.25396946
x1 1.11515493
x2 .
x3 0.91540769
x4 .
x5 -0.73492270
x6 0.43568066
x7 .
x8 0.32516158
x9 .
x10 0.02507769
x11 0.13759627
x12 .
x13 .
x14 -0.95243039
x15 .
x16 0.16301172
x17 .
x18 .
x19 .
x20 -1.27909551
sel_ind <- coef(cv_fit_sel, s="lambda.min")[-1]
m_inf <- lm(y ~ x1 + x3 + x5 + x6 + x7 + x8 + x9 + x11 + x14 + x15 + x16 + x20, data=d_inf)
summary(m_inf)
Call:
lm(formula = y ~ x1 + x3 + x5 + x6 + x7 + x8 + x9 + x11 + x14 +
x15 + x16 + x20, data = d_inf)
Residuals:
Min 1Q Median 3Q Max
-1.97434 -0.50037 0.02361 0.64250 1.60826
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1020 0.1534 0.665 0.5102
x1 1.4438 0.1461 9.884 6.30e-12 ***
x3 0.6284 0.1341 4.687 3.69e-05 ***
x5 -0.9927 0.1585 -6.265 2.75e-07 ***
x6 0.6491 0.1320 4.917 1.82e-05 ***
x7 0.1474 0.1533 0.962 0.3423
x8 0.3581 0.1427 2.510 0.0166 *
x9 -0.0608 0.1647 -0.369 0.7141
x11 0.3006 0.1788 1.681 0.1011
x14 -1.1005 0.1577 -6.978 3.03e-08 ***
x15 -0.1685 0.1726 -0.977 0.3351
x16 -0.1938 0.1462 -1.325 0.1932
x20 -0.9637 0.1673 -5.759 1.33e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9624 on 37 degrees of freedom
Multiple R-squared: 0.9324, Adjusted R-squared: 0.9105
F-statistic: 42.55 on 12 and 37 DF, p-value: < 2.2e-16
2.5 % 97.5 %
(Intercept) -0.20876555 0.4127145
x1 1.14780312 1.7397499
x3 0.35678937 0.9000752
x5 -1.31381835 -0.6716554
x6 0.38163165 0.9166052
x7 -0.16309512 0.4579830
x8 0.06901150 0.6472758
x9 -0.39447756 0.2728753
x11 -0.06165766 0.6628956
x14 -1.41999962 -0.7809310
x15 -0.51823328 0.1811479
x16 -0.49008343 0.1025001
x20 -1.30274946 -0.6246403
# 5.
set.seed(123)
cv_fit_ex1_un <- cv.glmnet(x, y, standardize=F)
lam.min <- cv_fit_ex1_un$lambda.min
beta <- coef(cv_fit_ex1_un, s="lambda.min")[-1]
fixedLassoInf(x=x, y=y, beta=beta, lambda=lam.min*100, alpha=0.05)
Call:
fixedLassoInf(x = x, y = y, beta = beta, lambda = lam.min * 100,
alpha = 0.05)
Standard deviation of noise (specified or estimated) sigma = 0.968
Testing results at lambda = 6.546, with alpha = 0.050
Var Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
1 1.379 13.707 0.000 1.182 1.578 0.025 0.024
3 0.779 7.992 0.000 0.588 0.972 0.025 0.024
5 -0.906 -8.442 0.000 -1.118 -0.695 0.024 0.025
6 0.603 6.464 0.000 0.420 0.787 0.025 0.024
8 0.398 4.163 0.000 0.208 0.586 0.024 0.024
11 0.255 2.468 0.029 -0.012 0.459 0.025 0.024
14 -1.120 -12.173 0.000 -1.303 -0.938 0.024 0.024
20 -1.156 -9.960 0.000 -1.384 -0.927 0.025 0.024
Note: coefficients shown are partial regression coefficients