Linear regression, also called OLS (ordinary least squares) regression, is used to model continuous outcome variables. In the OLS regression model, the outcome is modeled as a linear combination of the predictor variables.
Please note: The purpose of this page is to show how to use various data analysis commands. It does not cover all aspects of the research process which researchers are expected to do. In particular, it does not cover data cleaning and checking, verification of assumptions, model diagnostics and potential follow-up analyses.
Examples of linear regression
Example 1: A researcher is interested in how scores on a math and a science test are associated with scores on a writing test. The outcome variable is the score on the writing test.
Example 2: A research team is interested in motivating people to eat more vegetables by showing subjects videos of simple ways to prepare vegetables for dinner. The outcome variable is the number of ounces of vegetables consumed for dinner for one week.
Example 3: Researchers are interested in the effect of light on sleep quality. They randomly assign subjects to different light conditions and measure sleep quality for one month. The average sleep quality score is the outcome variable.
Description of the data
For our data analysis below, we are going to expand on Example 1 about the association between test scores. We have generated hypothetical data, which can be obtained from our website.
use https://stats.idre.ucla.edu/stat/stata/notes/hsb2, clear
(highschool and beyond (200 cases))
We can obtain descriptive statistics for each of the variables that we will use in our linear regression model. Although the variable female is binary (coded 0 and 1), we can still use it in the summarize command.
summarize science math female socst read

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     science |        200       51.85    9.900891         26         74
        math |        200      52.645    9.368448         33         75
      female |        200        .545    .4992205          0          1
       socst |        200      52.405    10.73579         26         71
        read |        200       52.23    10.25294         28         76
We can use the tabulate command to see the number of males and females.
tabulate female

     female |      Freq.     Percent        Cum.
------------+-----------------------------------
       male |         91       45.50       45.50
     female |        109       54.50      100.00
------------+-----------------------------------
      Total |        200      100.00
Analysis methods you might consider
Below is a list of some analysis methods you may have encountered.
- Linear regression, the focus of this page.
- ANCOVA: ANCOVA will give the same results as linear regression, except with a different parameterization. Linear regression will use dummy coding for categorical predictors, while ANCOVA will use effect coding.
- Robust regression: Robust regression is a type of linear regression used when the assumption of homogeneity of variance may be violated.
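If violation of the homogeneity of variance assumption is the concern, one commonly used option is to refit the same model with heteroskedasticity-robust (sandwich) standard errors; Stata's rreg command is a different kind of robust regression that instead downweights outlying observations. A minimal sketch (not part of the analysis below):

regress science math female socst read, vce(robust)
rreg science math female socst read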
Linear regression
Below we use the regress command to estimate a linear regression model. The predictor variable female can be entered into the model either with or without i. before it. Putting i. before female indicates that female is a factor variable (i.e., a categorical variable), and is necessary only if the margins command will be used after running the linear regression (a sketch of that syntax follows the interpretation notes below).
regress science math female socst read

      Source |       SS           df       MS      Number of obs   =       200
-------------+----------------------------------   F(4, 195)       =     46.69
       Model |  9543.72074         4  2385.93019   Prob > F        =    0.0000
    Residual |  9963.77926       195  51.0963039   R-squared       =    0.4892
-------------+----------------------------------   Adj R-squared   =    0.4788
       Total |      19507.5       199  98.0276382   Root MSE        =    7.1482

------------------------------------------------------------------------------
     science | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        math |   .3893102   .0741243     5.25   0.000      .243122    .5354983
      female |  -2.009765   1.022717    -1.97   0.051    -4.026772    .0072428
       socst |   .0498443    .062232     0.80   0.424    -.0728899    .1725784
        read |   .3352998   .0727788     4.61   0.000     .1917651    .4788345
       _cons |   12.32529   3.193557     3.86   0.000     6.026942    18.62364
------------------------------------------------------------------------------
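As a quick check on how the summary fit statistics follow from the ANOVA table in the output above, the display calculations below (a sketch using the numbers printed above) reproduce the R-squared, the overall F statistic, and the root MSE:

display 9543.72074/19507.5        // R-squared = Model SS / Total SS = 0.4892
display 2385.93019/51.0963039     // F(4, 195) = Model MS / Residual MS = 46.69
display sqrt(51.0963039)          // Root MSE = sqrt(Residual MS) = 7.1482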
- At the top of the output we see that all 200 observations in our data set were used in the analysis (fewer observations would have been used if any of our variables had missing values).
- The F-test with 4 and 195 degrees of freedom and a p-value of 0.0000 tells us that our model as a whole fits significantly better than an empty or null model (i.e., a model with no predictors).
- We see the R-squared value of 0.4892, which indicates the proportion of variance in the outcome accounted for by the model.
- We also see the adjusted R-squared value of 0.4788, which adjusts the R-squared value for the number and quality of the predictors.
- In the table we see the coefficients, their standard errors, the t-statistics, associated p-values, and the 95% confidence intervals of the coefficients. Both math and read are statistically significant.
- For every one-unit increase in math, the expected value of science increases by 0.389, holding the other predictors constant.
- For a one-unit increase in read, the expected value of science increases by 0.335, holding the other predictors constant.
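As noted above, if the margins command will be used after the regression, female can instead be entered as a factor variable. A minimal sketch (the coefficient estimates are the same as above; the margins output is not shown here):

regress science math i.female socst read
margins female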
Things to consider
- The outcome variable in a linear regression is assumed to be continuous, and it should have a reasonable range of values. There is no assumption that the distribution of the outcome itself is normal; the normality assumption applies to the residuals (errors), not to the outcome.
- The assumptions of linear regression should be checked. Please see Stata Web Book: Linear Regression for information on the assumptions of linear regression and how to assess these assumptions in Stata; a few commonly used postestimation diagnostic commands are sketched after this list.
- Clustered data: Sometimes observations are clustered into groups (e.g., people within families, students within classrooms). In such cases, you may want to see our page on non-independence within clusters.
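As noted in the list above, the regression assumptions should be checked after fitting the model. Below is a minimal sketch of a few of Stata's standard postestimation commands (run immediately after the regress command above; interpretation is covered in the Stata Web Book linked above):

rvfplot        // residual-versus-fitted plot, for checking linearity and constant variance
estat hettest  // Breusch-Pagan test for heteroskedasticity
estat vif      // variance inflation factors, for checking collinearity among predictors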
References
- Regression with Graphics: A Second Course in Statistics by Lawrence C. Hamilton
- Regression Analysis: A Constructive Critique by Richard A. Berk
- Interpreting and Visualizing Regression Models Using Stata by Michael N. Mitchell
See also
- Stata Annotated Output: Linear regression
- Stata Web Book: Linear regression
- Seminar: Regression with Stata