Office of Advanced Research and Computing (OARC), Statistical Methods and Data Analysis

The chart below shows examples of Generalized Linear Models (GLMs).

It is not realistic to cover all of the models in the chart in three hours.

In this workshop we will cover the most important aspects of GLMs, focusing on logistic regression, Poisson regression and, briefly, the negative binomial model, with examples in R.

Before we explore generalized linear models, it is a good idea to start with the linear regression model to understand why and how we extend it to generalized linear models.

- Regression analysis is a statistical tool that explains the relationship between a response (or dependent) variable and one or more predictor (independent) variables.

We use \(x\) for the predictor and \(y\) for the response variable.

In a **simple regression model** we observe \(n\) pairs of observations \((x_i, y_i)\), \(i = 1, 2, \cdots, n\), randomly (and most of the time independently) sampled from our population of interest, one pair per individual or case (unit of observation). We would like to investigate the relationship between \(x\) and \(y\).

In a **linear regression model** we assume this relationship is a **linear** function. We consider a simple linear model (a linear model with only one predictor) as:

\[y = \beta_0 + \beta_1 x + \epsilon\]

- The quantities \(\beta_0\) and \(\beta_1\) are called model parameters and are estimated from the data: \(\beta_0\) is the intercept of the line, \(\beta_1\) is the slope of the line, and \(\epsilon\) is the error term that represents unexplained uncertainty.

We can extend the model above to a model with multiple predictors:

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots +\beta_p x_p + \epsilon\]

Here, \(\beta_0\) is the intercept, and each \(\beta_j\) is the slope for predictor \(x_j\) for \(j = 1,2, \cdots, p\).

\(\beta_0, \beta_1, \cdots, \beta_p\) are also called the coefficients of the model.

We use the index \(j\) for predictors and the index \(i\) for observations.

There is one more model parameter that we will estimate: the variance of the error, \(var(\epsilon)\) or \(\sigma^2\).

Under some assumptions, the linear regression model is estimated by the Ordinary Least Squares (OLS) method. In OLS we estimate the parameters of the model by minimizing the sum of squared errors. OLS has an algebraic solution, and the estimated parameters can be derived analytically.

The regression model has two parts: a linear function and an error term.

\[y = \overbrace{\beta_0 + \beta_1 x }^\text{linear function}+ \overbrace{\epsilon}^\text{error}\]

The error part represents the uncertainty in the observation and is a random variable with mean equal to zero and unknown variance \(\sigma^2\).

In OLS we make some statistical assumptions in order to:

1 - Estimate the model parameters (i.e., the intercept and slope of the fitted line and a measure of uncertainty).

2 - Make inferences about the model parameters.

- In summary, the assumptions of OLS are:

1 - Errors have mean equal to zero.

2 - Errors are independent.

3 - Errors have equal or constant variance (Homoscedasticity or Homogeneity of variance).

4 - Errors follow a normal distribution.

Briefly, OLS assumes that the error terms in the model are independent and identically distributed random variables from a normal distribution with mean equal to zero, for each randomly sampled outcome \(y\) at any given value of \(x\).

The linear function predicts the mean or average value of the outcome for a given value of the predictor. The mean value in mathematical statistics is denoted by E, the expected value. We show this by \(\mu_{y|x} = \text{E}(y | x)\), which reads as the expected value of \(y\) given \(x\).

In summary and for simplicity we write:

\[\mu = \text{E}(y |x) = \beta_0 + \beta_1 x\]

- In general, when we have multiple predictors, \(x_1, x_2, \cdots, x_p\), the predicted mean is:

\[\mu = \text{E}(y | x_1, x_2 , \cdots, x_p) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + ... + \beta_p x_p\]

*Note:* The predictor variables may be continuous, meaning they may assume all values within a range (for example, age or height); dichotomous, meaning they may assume only one of two values (for example, 0 or 1); or categorical with more than two levels. There is only one response or dependent variable, and it is assumed to be continuous.

In a regression model, the linear prediction is a linear combination of the predictors.

The linear prediction can take values from \(-\infty\) to \(+\infty\).

Questions that lead us to a more general linear regression model are:

- What should we do if we do not have a continuous outcome and the error does not have equal variance? What if the predicted value from the linear model is not plausible because of the range of the outcome?

For this example I use the elementary school api score data available on the OARC website.

- We want to study the relationship between academic performance (`api00`) and the number of students enrolled in a school (`enroll`). The first 6 observations look like this:

```
#reading the data from OARC website
dat <- read.csv(
"https://stats.oarc.ucla.edu/wp-content/uploads/2019/02/elemapi2v2.csv")
head(dat[,c("api00", "enroll")])
```

```
## api00 enroll
## 1 693 247
## 2 570 463
## 3 546 395
## 4 571 418
## 5 478 520
## 6 858 343
```

We think the relationship between \(y\) (the response) and \(x\) (the predictor) might be explained by a line.

We do not know the true relationship between \(y\) and \(x\).

In R we use the function `lm` to fit (estimate) the linear model, and we use `summary` to extract the results from the `lm` object.
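The summary output below can be reproduced with a call along these lines (the object name `m1` is an assumption, chosen to match the `abline(m1)` call used later):

```r
# read the elementary school api data from the OARC website
dat <- read.csv(
  "https://stats.oarc.ucla.edu/wp-content/uploads/2019/02/elemapi2v2.csv")

# fit the simple linear regression of api00 on enroll
m1 <- lm(api00 ~ enroll, data = dat)
summary(m1)
```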

```
##
## Call:
## lm(formula = api00 ~ enroll, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -285.50 -112.55 -6.70 95.06 389.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 744.25141 15.93308 46.711 < 2e-16 ***
## enroll -0.19987 0.02985 -6.695 7.34e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 135 on 398 degrees of freedom
## Multiple R-squared: 0.1012, Adjusted R-squared: 0.09898
## F-statistic: 44.83 on 1 and 398 DF, p-value: 7.339e-11
```

The estimated intercept is 744.25 and the estimated slope is about -0.2.

The intercept means that for a school with zero enrollment the expected api score is 744.25 on average. Usually we do not interpret the intercept; here, for example, the expected api score for a school with zero students does not make sense. If we want an interpretable intercept, we can center enrollment at its mean.

A slope of -0.2 means that for each additional enrolled student the expected api score decreases by 0.2, or for every 10 additional students, the expected api score decreases by 2. The p-value for the hypothesis that the slope equals zero shows that, if we assume enrollment has no effect on school performance, a decrease of this size is unlikely to occur by chance.

- In R we can use the function `plot` to draw the scatter plot. I used `abline(m1)` after the scatter plot to add the regression line.

The goal is to estimate the parameters \(\beta_0\) and \(\beta_1\), the **intercept** and the **slope** of the line, so that the line fits the data best. The estimated model parameters are denoted \(\hat{\beta}_0\) and \(\hat{\beta}_1\).

The estimated model will be

\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\]

\(\hat{y}\) represents the predicted mean of the outcome for a given value of \(x\). We also call it the fitted value.

The **predicted value** for each observation can be calculated from the fitted line as \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\).

The difference between the observed value \(y_i\) and the predicted value \(\hat{y}_i\) is called the **residual**:

\[ e_i = y_i - \hat{y_i}\]

- In OLS we estimate the parameters of the model (intercept and slope) by minimizing the sum of squared residuals:

\[ \min \sum_{i = 1}^{n} (y_i - \hat{y_i}) ^ 2\]

This method is called Least Squares (LS).
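For the simple linear model this minimization has a closed-form solution, \(\hat{\beta}_1 = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\). A quick sketch on simulated data (all names here are illustrative) confirms it matches `lm`:

```r
set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)   # true intercept 1, true slope 2

# closed-form least squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# they agree with lm() up to floating-point error
rbind(closed_form = c(b0, b1), lm = unname(coef(lm(y ~ x))))
```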

Let’s look at the components of uncertainty in the linear model by plotting the fitted line in the scatter plot.

In the plot above,

\(y_i - \bar{y}\) is the overall or total uncertainty, which we try to explain with the linear model.

\(\hat{y}_i - \bar{y}\) is the uncertainty that can be explained by the linear model.

\(y_i - \hat{y}_i\) is the residual, the uncertainty that cannot be explained by the linear model.

In linear regression, the likelihood function (the probability of observing the outcome given the parameters) can be determined using the *normal distribution*. In fact, we can use another statistical method to estimate the parameters of the model, called the *Maximum Likelihood Method*. Under the OLS assumptions, the maximum likelihood estimates are equivalent to the least squares estimates.

- We will look at model diagnostic plots to evaluate the model assumptions.

This plot shows that there is no strong evidence against the assumptions of zero mean and constant variance of the error.

- Recall the linear regression model, where the outcome is continuous and the error has a normal distribution; the linear prediction was:

\[ \mu = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + ... + \beta_p x_p\]

There are cases where the outcome for a given set of predictors can only take positive values, or values in the interval \((0, 1)\).

If the outcome is from a count distribution taking values \(0, 1, 2, \cdots\), like the Poisson distribution, the mean is always positive (\(\mu > 0\)).

If the outcome is dichotomous and can take only the values 0 and 1 (Bernoulli distribution), the mean is between 0 and 1.

- In these two examples the linear prediction does not work correctly.

In generalized linear models we use a link function to transform the linear prediction so that we get a plausible expected outcome.

For example, if the mean is in the interval \((0, 1)\), then we can use the **logit** function to map a value between 0 and 1 to a value between \(-\infty\) and \(+\infty\). The chart below shows the steps:
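In R the logit and its inverse are available as `qlogis` and `plogis` (equivalently, `log(p/(1 - p))` and `exp(x)/(1 + exp(x))`). A small sketch:

```r
p <- 0.3
qlogis(p)          # logit: log(0.3 / 0.7), about -0.847
log(p / (1 - p))   # the same value computed by hand

plogis(qlogis(p))  # the inverse logit maps it back to 0.3
```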

- The table below (from wikipedia page for GLM) shows link functions for different distributions and regression models.

In the next slides we look at these two functions a little more.

Suppose the probability of getting admitted to a graduate school on average is 0.3.

Then we define the odds of getting admitted to a graduate school to be:

\[ \tt{odds(admit)} = \dfrac{p}{1 - p} = \dfrac{0.3}{1 - 0.3} = 0.4285714 \]

- and odds of not getting admitted to a graduate school will be:

\[ \tt{odds(not \ admit)} = \dfrac{p}{1 - p} = \dfrac{0.7}{1 - 0.7} = 2.333333 \]

As we can see, the odds is a positive value between 0 and infinity, and the inverse transformation, \(p = \text{odds}/(1 + \text{odds})\), maps it back to a number between 0 and 1.

Question: CNN analysts predict that the Los Angeles Lakers have odds of 9 to 1 of making the playoffs. What is the probability that the Lakers make the playoffs?

\(odds = \dfrac{p}{1 - p}\)

Let's calculate the probability that the Lakers make the playoffs. From the odds formula:

\(p = odds \times (1 - p)\), so \(p + p \times odds = odds\), and

\(p = \dfrac{odds} {1 + odds} = 9 / (9 + 1) = 0.9\)
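The same conversion in R:

```r
odds <- 9              # 9-to-1 odds
p <- odds / (1 + odds)
p                      # 0.9
```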

Another useful function is the log. If we have a value between zero and infinity, then the log of that value is between negative and positive infinity.

The inverse of the log function is the exponential function. We usually use the natural log and exponential function.

We can use the log(odds) transformation to map a value between zero and one to a value between negative and positive infinity. This function, log(odds), is called the logit function.

The exponential function plays a big role in generalized linear models, and understanding its mathematical properties is needed to really understand them.

For example, for \(10^3 = 1000\),

\(\log_{10}(1000) = 3\)

We read this as "the logarithm base 10 of 1000 equals 3."

\(\log_{2}(64) = 6\), so

\(2^6 = 64\).

\[\displaystyle e=\sum \limits _{n=0}^{\infty }{\frac {1}{n!}}={\frac {1}{1}}+{\frac {1}{1}}+{\frac {1}{1\times 2}}+{\frac {1}{1\times 2\times 3}}+\cdots \approx 2.718282.\]

Logarithm with the base equal to \(e\) is called the natural logarithm.

If \(e^6 = a\) Then \(\log_e(a) = ?\)

The answer is 6.

\(e^x\) can also be noted as \(\exp(x)\), and \(\log_e(x)\) is also written \(\ln(x)\).

Since \(\exp(x + y) = \exp(x)\exp(y)\), we have

\(\log(xy) = \log(x) + \log(y)\) and

\(\log(x/y) = \log(x) - \log(y)\).

In linear regression models we assumed the errors are distributed from a normal distribution with mean zero and variance \(\sigma^2\).

We can also say the outcome for a given \(x\) (i.e. \(y|x\)) has a normal distribution with mean \(\mu = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + ... + \beta_p x_p\) and variance \(\sigma^2\).

This is not always true; for example, a \(\{0, 1\}\) outcome has a Bernoulli distribution, and its variance depends on its mean.

In generalized linear regression models we assume a transformation (by the link function) of the mean of the outcome has a linear relationship with predictors.

In the next few slides we review some of the special distributions.

For example, the weight of newborn babies.

The graph below shows the PDF of a normal distribution with \(\mu = 2\) and \(\sigma = 1\).

For example, the number of times we flip a coin until we observe a head for the first time; this value is in the set \(\{1, 2, 3, ...\}\).

The sum of the outcomes from rolling two dice is a random variable that takes values in the set \(\{2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12\}\).

The number of patients hospitalized daily for COVID-19 in a healthcare facility.

**Probability Mass Function**

For example, the PMF of the random variable for the outcome of rolling a die is:

\(p(X = x) = 1/6\)

for \(x \in \{1, 2, 3, 4, 5, 6\}\).

A Bernoulli random variable is a random variable that can take only two possible outcomes: 1 with probability \(p\) and 0 with probability \(q = 1 - p\).

The PMF of a Bernoulli random variable with outcomes zero and one is:

\[f(x;p)=\begin{cases} 1 - p & {\text{if }}x=0,\\p & {\text{if }}x=1.\end{cases}\]

The mean of a Bernoulli random variable is:

\(\mu_X = E(X) = 1 \times p + 0 \times (1 - p) = p\)

and the variance is

\(\sigma^2_X = var(X) = E(X- \mu)^2 = (1 - p) ^ 2 \times p + (0 - p) ^ 2 \times (1 - p) \\ = p(1 - p) ( 1 - p + p) = p(1 - p)\)

The PMF of the Binomial(n, p) distribution is (optional):

\[\displaystyle f(x; n,p)=\Pr(x;n,p)=\Pr(X=x)=\frac{n!}{x!(n-x)!} p^{x}(1-p)^{n-x}\]

for \(x = 0, 1, 2, 3, \cdots, n\).

For example, with \(n = 10\) and \(p = 0.2\), the probability of zero successes is:

\(P(X = 0) = [10! / (0!(10 - 0)!)] \, 0.2^0 (1 - 0.2)^{10} = 0.8^{10} \approx 0.107\)
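The same probability in R, using `dbinom` with \(n = 10\) and \(p = 0.2\):

```r
# P(X = 0) for a Binomial(n = 10, p = 0.2) random variable
dbinom(0, size = 10, prob = 0.2)   # equals 0.8^10, about 0.107
```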

- Logistic regression is used when the outcome takes only the values 0 and 1.

We start with an example:

- The data that we are using here is called binary data from OARC stat website.

A researcher is interested in how variables such as GRE (Graduate Record Exam scores), GPA (grade point average), and the prestige of the undergraduate institution affect admission into graduate school. The response variable, admit/don't admit, is a binary variable.

This data set has a binary response (outcome, dependent) variable called admit.

There are three predictor variables: gre, gpa and rank.

We will treat the variables gre and gpa as continuous.

The variable rank takes on the values 1 through 4. Institutions with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest.

We can get basic descriptives for the entire data set by using `summary`. To get the standard deviations, we use `sapply` to apply the `sd` function to each variable in the data set.

```
#reading the data from OARC stat website:
mydata <- read.csv("https://stats.oarc.ucla.edu/stat/data/binary.csv")
#summary statistics of variables
summary(mydata)
```

```
## admit gre gpa rank
## Min. :0.0000 Min. :220.0 Min. :2.260 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:520.0 1st Qu.:3.130 1st Qu.:2.000
## Median :0.0000 Median :580.0 Median :3.395 Median :2.000
## Mean :0.3175 Mean :587.7 Mean :3.390 Mean :2.485
## 3rd Qu.:1.0000 3rd Qu.:660.0 3rd Qu.:3.670 3rd Qu.:3.000
## Max. :1.0000 Max. :800.0 Max. :4.000 Max. :4.000
```

```
## admit gre gpa rank
## 0.4660867 115.5165364 0.3805668 0.9444602
```
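The standard deviations printed above come from applying `sd` to each column with `sapply`, as described:

```r
# reading the data from the OARC stat website
mydata <- read.csv("https://stats.oarc.ucla.edu/stat/data/binary.csv")
# standard deviation of each variable
sapply(mydata, sd)
```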

This can also be seen as an intercept-only regression model.

The variable admit is binary: it has the value 1 if a person was admitted and 0 if not. For the variable admit, the mean is the same as the proportion of 1s, i.e., the estimated probability of getting admitted to grad school.

\[ \hat{p} = \dfrac{X}{n} = \dfrac{\sum_{i=1}^nx_i}{n} \]

where \(X\) is the total number of people admitted, \(n\) is the total number of students, and \(x_i\) is the value of admit for the \(i\)-th student.

We can calculate the standard deviation directly from the mean.

Note that this variable is binary; assuming the binary outcome is the result of a Bernoulli trial, \(\hat\mu_{admit} = \hat p = \sum_{i=1}^{n} x_i /n\).

Since for the Bernoulli distribution the variance is \(var(x) = p(1-p)\), we can use this formula to calculate the variance.

Remember that in OLS the error, and therefore the outcome, is assumed to be normally distributed, and the variance is assumed to be constant; here, however, the variance depends on the predicted mean.

Of course, not in the intercept-only model, because there the expected mean is constant!

Now let's run a regression model of admit predicted by gre, treating admit as a continuous variable.

We are trying to estimate the mean of admit for a given value of gre. Since admit is 0 or 1, this can be seen as the probability of admission.
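The output below appears to come from a call like the following (the object name `lm.mod` is an assumption):

```r
mydata <- read.csv("https://stats.oarc.ucla.edu/stat/data/binary.csv")
# linear (OLS) model treating the binary admit as continuous
lm.mod <- lm(admit ~ gre, data = mydata)
summary(lm.mod)
```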

```
##
## Call:
## lm(formula = admit ~ gre, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4755 -0.3415 -0.2522 0.5989 0.8966
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1198407 0.1190510 -1.007 0.314722
## gre 0.0007442 0.0001988 3.744 0.000208 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4587 on 398 degrees of freedom
## Multiple R-squared: 0.03402, Adjusted R-squared: 0.03159
## F-statistic: 14.02 on 1 and 398 DF, p-value: 0.0002081
```

As we can see, the variance of the outcome is systematically related to the predicted outcome, and thus is not constant.

Another issue with this model is that:

In a linear model we allow the predicted outcome to be any number from negative infinity to positive infinity.

But a probability cannot be a negative number or a number greater than one. So, we need to use a transformation function to relate the outcome to a value between 0 and 1.

To overcome this issue, and also the violation of the equal-variance assumption of OLS, we need to use a different model.

- In logistic regression we model the log odds of the outcome as a linear function of the predictors:

\[ \log(odds \mid x_1, x_2 , \cdots, x_p) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + ... + \beta_p x_p\]

Or,

\[\frac{p}{1-p} = e^{(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + ... + \beta_p x_p)}\]

or,

\[p = \frac{e^{(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + ... + \beta_p x_p)}}{1+e^{(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + ... + \beta_p x_p)}} \]

Or,

\[p = \frac{1} {1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + ... + \beta_p x_p)}}\]

Notice that we do not need an \(\epsilon\) term to indicate the uncertainty of the outcome, because the probability itself is the expected value of the outcome, and the outcome is a single Bernoulli draw with this probability.

In R we use the function `glm` to run a generalized linear model and estimate the coefficients. The distribution of the outcome is defined by the argument `family`, with the option `link = "logit"` for the link function.

```
#to estimate coefficient of the model we use glm with
#the option family = binomial(link = "logit")
glm.mod <- glm(admit ~ gre, family = binomial(link = "logit"), data = mydata)
summary(glm.mod)
```

```
##
## Call:
## glm(formula = admit ~ gre, family = binomial(link = "logit"),
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1623 -0.9052 -0.7547 1.3486 1.9879
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.901344 0.606038 -4.787 1.69e-06 ***
## gre 0.003582 0.000986 3.633 0.00028 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 486.06 on 398 degrees of freedom
## AIC: 490.06
##
## Number of Fisher Scoring iterations: 4
```

```
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.901344270 0.6060376103 -4.787400 1.689561e-06
## gre 0.003582212 0.0009860051 3.633056 2.800845e-04
```

```
## 2.5 % 97.5 %
## (Intercept) -4.119988259 -1.739756286
## gre 0.001679963 0.005552748
```
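The coefficient table and confidence intervals above can be extracted from the fitted object; `confint` on a `glm` uses profile likelihood, which matches the printed interval:

```r
mydata <- read.csv("https://stats.oarc.ucla.edu/stat/data/binary.csv")
glm.mod <- glm(admit ~ gre, family = binomial(link = "logit"), data = mydata)

summary(glm.mod)$coefficients   # the coefficient table
confint(glm.mod)                # profile-likelihood confidence intervals
```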

\[ \log(\text{odds of admit}) = -2.901344 + 0.003582 \times \text{GRE} \]

- The coefficient of GRE (the slope of the model) is interpreted as: for a one-unit increase in GRE, the log odds of admission is expected to increase by 0.003582 on average.

This is not very useful or easy to understand. We can only see that this positive relationship is unlikely to be due to chance (p-value = 0.00028).

The intercept is the expected log odds of someone getting admitted to graduate school when their GRE score is zero.

When we exponentiate the intercept, we get the expected odds for someone with a GRE equal to zero.

\(\exp(-2.901344) = 0.05494932\). Transforming to a probability, \(p = 0.05494932/(1+0.05494932) = 0.052\), we get the expected probability of admission for someone with a GRE of 0.

What if we exponentiate the slope?

Assume someone has a GRE score of 500. Then the predicted odds based on the model will be:

\[odds_{500} = e^{-2.901344 + 0.003582 \times 500}\]

If their GRE score increases by one unit then the odds will be:

\[odds_{501} = e^{-2.901344 + 0.003582 \times (500 + 1)}\]

If we divide the latter by the former we have:

\[\frac{odds_{501}}{odds_{500}} = \frac{e^{-2.901344 + 0.003582 \times (500 + 1)}}{e^{-2.901344 + 0.003582 \times (500)}} = \frac{e^{-2.901344}\times e ^{0.003582\times 500} \times e^{(0.003582)}}{e^{-2.901344}\times e ^{0.003582\times 500}} = e^{0.003582} \]

- This number is constant: it does not depend on the GRE score or the intercept. It is the exponentiated slope.

Thus,

The exponentiated slope is the odds ratio of admission for a one-unit increase in GRE score:

\[\exp(0.003582) = 1.003588\]

In other words, the odds of getting admitted increase on average by about \(0.36\%\) for a one-point improvement in GRE score. This is the old GRE scale, which at the time was scored in 10-point increments. What if someone improves their GRE by 50 points?

- With the same calculations:

\[\frac{odds_{550}}{odds_{500}} = \frac{e^{-2.901344 + 0.003582 \times (500 + 50)}}{e^{-2.901344 + 0.003582 \times (500)}} = \frac{e^{-2.901344}\times e ^{0.003582\times 500} \times e^{50\times(0.003582)}}{e^{-2.901344}\times e ^{0.003582\times 500}} = (e^{0.003582})^{50} \]

The exponential function grows multiplicatively, not additively. Thus, for 50 more GRE points the odds of admission increase by a factor of \((\exp(\beta_1))^{50} = 1.003588^{50} = 1.196115\). In other words, the odds of getting admitted increase by about \(20\%\).
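This multiplicative behavior is easy to verify numerically (the slope value is copied from the model output):

```r
b1 <- 0.003582    # estimated slope for gre
exp(b1)           # odds ratio for a 1-point increase, about 1.003588
exp(b1)^50        # odds ratio for a 50-point increase
exp(50 * b1)      # the same value, since (e^b)^50 = e^(50 b)
```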

- Exercise: Using the same data set, run a logistic regression of admit predicted by gpa.

Calculate the exponentiated slope and interpret it.

If someone increases their gpa by 0.2, how do the expected odds of admission change?

```
glm.mod.3 <- glm(admit ~ gre + gpa, family = binomial(link = "logit"), data = mydata)
summary(glm.mod.3)
```

```
##
## Call:
## glm(formula = admit ~ gre + gpa, family = binomial(link = "logit"),
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2730 -0.8988 -0.7206 1.3013 2.0620
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.949378 1.075093 -4.604 4.15e-06 ***
## gre 0.002691 0.001057 2.544 0.0109 *
## gpa 0.754687 0.319586 2.361 0.0182 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 480.34 on 397 degrees of freedom
## AIC: 486.34
##
## Number of Fisher Scoring iterations: 4
```

```
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.949378063 1.075092882 -4.603675 4.151004e-06
## gre 0.002690684 0.001057491 2.544403 1.094647e-02
## gpa 0.754686856 0.319585595 2.361455 1.820340e-02
```

```
## (Intercept) gre gpa
## 0.007087816 1.002694307 2.126945379
```

```
## 2.5 % 97.5 %
## (Intercept) -7.1092205752 -2.886726963
## gre 0.0006373958 0.004791712
## gpa 0.1347509288 1.390131131
```

The odds of getting admitted increase on average by a factor of \(2.1269\) for a one-unit improvement in GPA, holding GRE score constant.

```
glm.mod.4 <- glm(admit ~ gre * gpa, family = binomial(link = "logit"), data = mydata)
summary(glm.mod.4)
```

```
##
## Call:
## glm(formula = admit ~ gre * gpa, family = binomial(link = "logit"),
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1324 -0.9387 -0.7099 1.2969 2.3519
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -13.818700 5.874816 -2.352 0.0187 *
## gre 0.017464 0.009606 1.818 0.0691 .
## gpa 3.367918 1.722685 1.955 0.0506 .
## gre:gpa -0.004327 0.002787 -1.552 0.1205
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 477.85 on 396 degrees of freedom
## AIC: 485.85
##
## Number of Fisher Scoring iterations: 4
```

```
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -13.818700374 5.874815854 -2.352193 0.01866309
## gre 0.017464312 0.009606206 1.818024 0.06906048
## gpa 3.367918022 1.722685231 1.955040 0.05057838
## gre:gpa -0.004326605 0.002786899 -1.552480 0.12054741
```

```
## (Intercept) gre gpa gre:gpa
## 9.968153e-07 1.017618e+00 2.901805e+01 9.956827e-01
```

```
## 2.5 % 97.5 %
## (Intercept) -2.576285e+01 -2.677081148
## gre -9.057462e-04 0.036855984
## gpa 8.009849e-02 6.850511379
## gre:gpa -9.932612e-03 0.001022485
```

`## [1] 3.335476`

`## [1] 3.335407`

The interaction term acts as a multiplicative factor: for GPA, for example, it increases/decreases (here decreases) the odds of the outcome by a factor of 0.9956827 for every unit of GRE.

For example, if someone has a GRE of 500, then for every 1-unit increase in their GPA the odds of admission change by a factor of \(29.01805 \times 0.9956827^{500} = 3.3354\).

Or we can first calculate \(3.367918022 + 500 \times (-0.004326605)\) and then take the exponential.
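Both routes can be checked numerically (coefficients copied from the model output above); they agree exactly, and they match the printed values up to rounding:

```r
b_gpa <- 3.367918022     # estimated gpa coefficient
b_int <- -0.004326605    # estimated gre:gpa interaction

exp(b_gpa) * exp(b_int)^500   # multiply the exponentiated coefficients
exp(b_gpa + 500 * b_int)      # or sum on the log-odds scale, then exponentiate
```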

There is more to say about logistic regression analysis, but we do not have time to go over it all.

- For the problem of complete or quasi-complete separation in logistic/probit models, see the following page:

To review the assumptions of logistic regression you can visit this page:

The analysis there is in Stata.

The Poisson distribution with mean \(\mu\) has PMF:

\[Pr(Y = y) = \dfrac{e^{-\mu}(\mu)^y}{y!}, \text{ for } y = 0, 1, 2, \dots .\]

It can be shown that a Binomial distribution converges to a Poisson distribution as \(n\) goes to infinity and \(p\) goes to zero such that \(np \rightarrow \mu\).

Suppose the number of accidents on a sunny day on the 405 freeway between Santa Monica and Wilshire Blvd is Poisson with a rate of 2 accidents per day. Then the probabilities of 0, 1, 2, …, 8 accidents can be calculated using the Poisson PMF:

\(P(Y = 0| \mu = 2) = \dfrac{e^{-2}(2)^0}{0!} = 1/e^2 \approx 0.1353\)

\(P(Y = 1| \mu = 2) = \dfrac{e^{-2}(2)^1}{1!} = 2/e^2 \approx 0.2707\)

\(P(Y = 2| \mu = 2) = \dfrac{e^{-2}(2)^2}{2!} = 2/e^2 \approx 0.2707\)

\(P(Y = 3| \mu = 2) = \dfrac{e^{-2}(2)^3}{3!} = 8/(6e^2) \approx 0.1804\)

\(P(Y = 4| \mu = 2) = \dfrac{e^{-2}(2)^4}{4!} = 16/(24e^2) \approx 0.0902\)

\(P(Y = 5| \mu = 2) = \dfrac{e^{-2}(2)^5}{5!} = 32/(120e^2) \approx 0.0361\)

\(P(Y = 6| \mu = 2) \approx 0.0120\)

\(P(Y = 7| \mu = 2) \approx 0.0034\)

\(P(Y = 8| \mu = 2) \approx 0.0009\)
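All nine probabilities can be computed in one call with `dpois`:

```r
# P(Y = y) for y = 0, ..., 8 when the rate is 2 accidents per day
round(dpois(0:8, lambda = 2), 4)
```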

The following graph shows the PMF of a Poisson random variable with \(\mu = 2\).

This looks like a binomial distribution, but it is not exactly the same!

Now assume the number of accidents on a rainy day on the 405 freeway between Santa Monica and Wilshire Blvd is Poisson with a rate of 5.5 accidents per day. Then the probability of each of \(x = 0, 1, 2, \cdots, 8\) accidents, computed in R, is:

```
## [1] 0.004086771 0.022477243 0.061812418 0.113322766 0.155818804 0.171400684
## [7] 0.157117294 0.123449302 0.084871395
```
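The output above is consistent with `dpois` at rate 5.5:

```r
dpois(0:8, lambda = 5.5)
```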

We can also say the mean number of accidents per day on a rainy day is 2.75 times that on a sunny day.

A sample of 20 numbers from (approximately) this distribution will look like this:

```
prob <- c(0.1353, 0.2707, 0.2707, 0.1804, 0.0902, 0.0361, 0.0120, 0.0034, 0.0009)
domain <- seq(0,8,1)
set.seed(1001)
y1 <- sample(x = domain, size = 20, replace = TRUE, prob = prob)
y1
```

`## [1] 6 2 2 2 2 4 1 1 2 0 2 1 4 2 1 4 1 2 1 0`

`## [1] 2`

`## [1] 2.210526`
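The two values printed above appear to be the sample mean and variance of `y1` (an assumption, but hard-coding the printed sample reproduces them exactly):

```r
# the sample exactly as printed above
y1 <- c(6, 2, 2, 2, 2, 4, 1, 1, 2, 0, 2, 1, 4, 2, 1, 4, 1, 2, 1, 0)
mean(y1)   # 2
var(y1)    # 2.210526
```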

Now let's add some zeroes and some larger values to our sample:

```
y2 <- c(y1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 5, 6, 8, 5, 6, 7, 4, 3, 6, 10, 15, 20)
mean(y2)
```

`## [1] 2.213115`

`## [1] 14.07049`
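The value 14.07049 is presumably `var(y2)`. Note how the extra zeroes and large values push the variance far above the mean, a sign of overdispersion relative to the Poisson assumption:

```r
y1 <- c(6, 2, 2, 2, 2, 4, 1, 1, 2, 0, 2, 1, 4, 2, 1, 4, 1, 2, 1, 0)
y2 <- c(y1, rep(0, 29), 5, 6, 8, 5, 6, 7, 4, 3, 6, 10, 15, 20)
c(mean = mean(y2), variance = var(y2))
```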

We can also generate a Poisson sample directly with the `rpois()` function in R; the sample mean and variance below are both close to \(\mu = 2\):
`## [1] 1.958`

`## [1] 2.038274`
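For example (the seed here is arbitrary; the printed values 1.958 and 2.038274 came from a different, unrecorded draw):

```r
set.seed(123)
y <- rpois(1000, lambda = 2)
mean(y)   # close to 2
var(y)    # also close to 2: for a Poisson variable, mean = variance
```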

**Poisson distribution for unit time and time \(t\)**

The basic Poisson distribution assumes that each count occurs over one unit of time. However, we often count events over a period of time or over various areas. For example, if accidents occur on average 2 times per day, then over the course of a month (30 days) the expected number of accidents will be \(2 \times 30 = 60\), while the rate is still 2 accidents per day. The rate parameterization of the Poisson distribution is:

\[Pr(Y = y) = \dfrac{e^{-t\mu}(t\mu)^y}{y!}, \text{ for } y = 0, 1, 2, \dots .\]

where \(t\) represents the length of time. We will come back to this when we talk about exposure in Poisson regression.

Count data are observations that take only non-negative integer values.

Counts can range from zero to some greater, undetermined value.

Theoretically a count can range from zero to infinity, but in practice counts are always limited to some lesser value.

Some types of count data:

**Examples of count data:**

There are two problems with this approach:

**Formalization of the count model**

A count model with one predictor can be formalized as:

\[\text{log of mean response} = \ln(\mu) = \beta_0 + \beta_1 x\]

From this model the expected count or mean response will be:

\[\text{mean response} = \mu = E(y|x) = \exp( \beta_0 + \beta_1 x)\]

This ensures \(\mu\) will be positive for any value of \(\beta_0\), \(\beta_1\), or \(x\).

**The Poisson model**

The Poisson model assumes that the count \(y|x\) has a Poisson distribution with the above mean. For an outcome \(y\) and a single predictor \(x\) we have:

\[ f(y | x; \beta_0, \beta_1) = \dfrac{e^{-\exp( \beta_0 + \beta_1 x)}(\exp( \beta_0 + \beta_1 x))^{y}}{y!}\]

The plot of the distribution of the response for a Poisson regression model looks like this:

- In contrast, the plot below shows the distribution of the responses when the response has a normal distribution for a given predictor.

Back to our toy example of the number of accidents on the 405. Let \(x\) be a binary variable: 0 for a sunny day and 1 for a rainy day.

We have a sample of 40 days, 20 sunny and 20 rainy, drawn from Poisson distributions with rates 2 and 5, respectively.

In fact, we can artificially generate this sample using R:

```
y <- c(rpois(20, lambda = 2), rpois(20, lambda = 5))
x <- rep(c(0,1), each = 20)
print(data.frame(y = y,x = x))
```

```
## y x
## 1 2 0
## 2 3 0
## 3 0 0
## 4 4 0
## 5 2 0
## 6 5 0
## 7 1 0
## 8 3 0
## 9 4 0
## 10 1 0
## 11 3 0
## 12 2 0
## 13 2 0
## 14 0 0
## 15 2 0
## 16 6 0
## 17 2 0
## 18 1 0
## 19 2 0
## 20 1 0
## 21 9 1
## 22 10 1
## 23 1 1
## 24 4 1
## 25 6 1
## 26 5 1
## 27 5 1
## 28 7 1
## 29 8 1
## 30 6 1
## 31 6 1
## 32 7 1
## 33 6 1
## 34 5 1
## 35 5 1
## 36 6 1
## 37 6 1
## 38 4 1
## 39 8 1
## 40 5 1
```
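The summary output below comes from fitting the model with `glm`; the object name `pois_toy` is ours:

```
# fit the Poisson regression of accident counts on the rain indicator
pois_toy <- glm(y ~ x, family = poisson(link = "log"))
summary(pois_toy)
```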

```
##
## Call:
## glm(formula = y ~ x, family = poisson(link = "log"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5166 -0.4006 -0.2024 0.4407 2.0264
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.8329 0.1474 5.649 1.61e-08 ***
## x 0.9505 0.1736 5.475 4.38e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 69.616 on 39 degrees of freedom
## Residual deviance: 36.173 on 38 degrees of freedom
## AIC: 160.7
##
## Number of Fisher Scoring iterations: 5
```

The coefficients are now on the log scale.

- Let’s exponentiate the coefficients:

```
## (Intercept) x
## 2.300000 2.586957
```

- We can also find the rate of accidents for each group (sunny or rainy) directly from the group means; the ratio of the rainy-day mean to the sunny-day mean equals the exponentiated coefficient of \(x\).

`## [1] 2.300000 5.950000 2.586957`
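The three printed values are the sunny-day mean, the rainy-day mean, and their ratio, which can be computed directly from the simulated `y` and `x`:

```
# group means and their ratio; the ratio matches exp(coef) for x
c(mean(y[x == 0]), mean(y[x == 1]), mean(y[x == 1]) / mean(y[x == 0]))
```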

This part is just for your information and is optional.

If we have more than one predictor, then for the \(i\)th observation we have:

\[\mu_i = E(y_i | x_{i1} , x_{i2}, \dots, x_{ip};\beta) = \exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots+ \beta_p x_{ip})\]

For simplicity we write \(\boldsymbol{x}_i'\boldsymbol{\beta} = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots+ \beta_p x_{ip}\), and therefore we have

\[P(Y = y_i|\boldsymbol{x}_i, \boldsymbol{\beta}) = \dfrac{e^{-\exp(\boldsymbol{x}_i'\boldsymbol{\beta})}(\exp(\boldsymbol{x}_i'\boldsymbol{\beta}))^{y_i}}{y_i!}\] and the log-likelihood function will be:

\[\cal L(\boldsymbol{\beta}) = \sum_{i=1}^n [y_i \boldsymbol{x}_i'\boldsymbol{\beta} - \exp(\boldsymbol{x}_i'\boldsymbol{\beta}) - \ln y_i!]\]
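The log-likelihood above can be coded directly and checked against `logLik` on simulated data (all names here are ours):

```
# Poisson log-likelihood: sum of y_i * x_i'beta - exp(x_i'beta) - log(y_i!)
pois_loglik <- function(beta, X, y) {
  eta <- drop(X %*% beta)
  sum(y * eta - exp(eta) - lgamma(y + 1))
}
set.seed(42)
Xs <- cbind(1, rnorm(100))                     # design matrix with intercept
ys <- rpois(100, exp(Xs %*% c(0.5, 0.2)))      # simulated Poisson counts
fit_sim <- glm(ys ~ Xs[, 2], family = poisson)
all.equal(pois_loglik(coef(fit_sim), Xs, ys), as.numeric(logLik(fit_sim)))  # TRUE
```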

For this part we will use the German national health registry data set as an example of Poisson regression. The data set is available in the R package COUNT. We will use only year 1984.

Source: German Health Reform Registry, years pre-reform 1984-1988, From Hilbe and Greene (2008)

The data can be accessed by installing the package `COUNT`. The response variable is `docvis`, the number of doctor visits.

We use `outwork` and `age` as predictors. `outwork` is a binary variable with 1 = not working and 0 = working. `age` is a continuous variable ranging from 25 to 64.

Since age ranges from 25 to 64, and age 0 is not a valid value, we center the age variable so that the intercept is interpretable.

**R code for loading the data and descriptive statistics**
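A sketch of the loading code (the `rwm1984` data are included in the COUNT package):

```
# install.packages("COUNT")  # if not already installed
library(COUNT)
data(rwm1984)
head(rwm1984)
summary(rwm1984)
```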

```
## docvis hospvis edlevel age outwork female married kids hhninc educ self
## 1 1 0 3 54 0 0 1 0 3.050 15.0 0
## 2 0 0 1 44 1 1 1 0 3.050 9.0 0
## 3 0 0 1 58 1 1 0 0 1.434 11.0 0
## 4 7 2 1 64 0 0 0 0 1.500 10.5 0
## 5 6 0 3 30 1 0 0 0 2.400 13.0 0
## 6 9 0 3 26 1 0 0 0 1.050 13.0 0
## edlevel1 edlevel2 edlevel3 edlevel4
## 1 0 0 1 0
## 2 1 0 0 0
## 3 1 0 0 0
## 4 1 0 0 0
## 5 0 0 1 0
## 6 0 0 1 0
```

```
## docvis hospvis edlevel age
## Min. : 0.000 Min. : 0.0000 Min. :1.00 Min. :25
## 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.:1.00 1st Qu.:35
## Median : 1.000 Median : 0.0000 Median :1.00 Median :44
## Mean : 3.163 Mean : 0.1213 Mean :1.38 Mean :44
## 3rd Qu.: 4.000 3rd Qu.: 0.0000 3rd Qu.:1.00 3rd Qu.:54
## Max. :121.000 Max. :17.0000 Max. :4.00 Max. :64
## outwork female married kids
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :1.0000 Median :0.0000
## Mean :0.3665 Mean :0.4793 Mean :0.7891 Mean :0.4491
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## hhninc educ self edlevel1
## Min. : 0.015 Min. : 7.00 Min. :0.00000 Min. :0.0000
## 1st Qu.: 2.000 1st Qu.:10.00 1st Qu.:0.00000 1st Qu.:1.0000
## Median : 2.800 Median :10.50 Median :0.00000 Median :1.0000
## Mean : 2.969 Mean :11.09 Mean :0.06118 Mean :0.8136
## 3rd Qu.: 3.590 3rd Qu.:11.50 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :25.000 Max. :18.00 Max. :1.00000 Max. :1.0000
## edlevel2 edlevel3 edlevel4
## Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.0524 Mean :0.0746 Mean :0.05937
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000
```

- In R we still use `glm` to run a Poisson model. For the `family` option we use `poisson(link = "log")`.
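The centered age variable `cage` used in the model below can be created by subtracting the mean age:

```
# center age at its sample mean so the intercept refers to an average-age person
rwm1984$cage <- rwm1984$age - mean(rwm1984$age)
```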

And if we use centered age:

```
#running glm
pois_m <- glm(docvis ~ outwork + cage, family = poisson(link = "log"), data = rwm1984)
#extract the outcome by summary
summary(pois_m)
```

```
##
## Call:
## glm(formula = docvis ~ outwork + cage, family = poisson(link = "log"),
## data = rwm1984)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4573 -2.2475 -1.2366 0.2863 25.1320
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.9380965 0.0127571 73.53 <2e-16 ***
## outwork 0.4079314 0.0188447 21.65 <2e-16 ***
## cage 0.0220842 0.0008377 26.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 25791 on 3873 degrees of freedom
## Residual deviance: 24190 on 3871 degrees of freedom
## AIC: 31279
##
## Number of Fisher Scoring iterations: 6
```

In order for a change in a predictor value to reflect a change in actual visits to a physician, we must exponentiate the coefficients and their confidence intervals.

- Suppose the predicted mean number of visits when outwork is 0 is \(\mu_0\), and the predicted mean number of visits when outwork is 1 is \(\mu_1\). Then,

\[ \log(\mu_1) - \log(\mu_0) = \log\left(\dfrac{\mu_1}{\mu_0}\right) = \beta_0 + \beta_1 \cdot 1 + \beta_2\, \text{age} - (\beta_0 + \beta_1 \cdot 0 + \beta_2\, \text{age}) = \beta_1\]

So \(\exp(0.4079314) = 1.503704\) means that, on average, we expect the number of doctor visits for a person out of work to be about 50% more than for a working person of the same age.

The standard error can be approximately calculated by the delta method:

\[se(\exp(\beta)) \approx \exp(\beta) \times se(\beta)\]
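The three blocks of output below (exponentiated coefficients, their approximate standard errors, and exponentiated confidence intervals) can be reproduced along these lines (a sketch; `confint.default` gives Wald intervals):

```
# incidence rate ratios (exponentiated coefficients)
exp(coef(pois_m))
# approximate (delta-method) standard errors of the rate ratios
exp(coef(pois_m)) * sqrt(diag(vcov(pois_m)))
# exponentiated Wald confidence intervals
exp(confint.default(pois_m))
```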

```
## (Intercept) outwork cage
## 2.555113 1.503704 1.022330
```

```
## (Intercept) outwork cage
## 0.0325958199 0.0283368154 0.0008564317
```

```
## 2.5 % 97.5 %
## (Intercept) 2.492018 2.619805
## outwork 1.449178 1.560282
## cage 1.020653 1.024010
```

```
# Pearson dispersion statistic for the Poisson model
pr <- sum(residuals(pois_m, type = "pearson")^2)
pr / pois_m$df.residual
```

`## [1] 11.34322`

`## [1] 8.074027`
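The list of fit statistics below (AIC, AICn, BIC, BICqh) matches the output format of the `modelfit` helper in the COUNT package; assuming that function is available, it can be produced by:

```
# AIC/BIC-type fit statistics; modelfit() is from the COUNT package
library(COUNT)
modelfit(pois_m)
```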

```
## $AIC
## [1] 31278.78
##
## $AICn
## [1] 8.074027
##
## $BIC
## [1] 31297.57
##
## $BICqh
## [1] 8.07418
```

```
# marginal effect of age at the mean of the predictors
beta <- coef(pois_m)
mout <- mean(rwm1984$outwork)
mage <- mean(rwm1984$cage)
xb <- sum(beta * c(1, mout, mage))
dfdxb <- exp(xb) * beta[3]
dfdxb
```

```
## cage
## 0.06552846
```
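The value printed below (about 0.0698) is the average marginal effect of age, obtained by averaging \(\mu_i\) over all observations instead of evaluating at the mean; since \(\partial\mu/\partial\text{age} = \mu\,\beta_{\text{age}}\), a sketch is:

```
# average marginal effect of age: mean of the fitted means times the age coefficient
mean(fitted(pois_m)) * coef(pois_m)[3]
```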

```
## cage
## 0.06984968
```

Sometimes the data points do not all occur over the same period of time, area, or volume. In this case we can use an offset in the model to adjust the counts for differing exposures (time periods, areas, or volumes).

\[ \exp(\boldsymbol{x'\beta}) = \mu / t\]

\[ \mu = \exp(\boldsymbol{x'\beta} + \log(t))\] The term \(\log(t)\) is referred to as the offset.

In this data set:

- die: number of deaths
- cases: number of cases
- anterior: whether the patient had a previous anterior myocardial infarction
- hcabg: whether the patient had a history of CABG (coronary artery bypass graft) procedures
- killip: a summary indicator of the cardiovascular health of the patient

**R code for exposure**
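A sketch of the loading code for this example (the `fasttrakg` data ship with the COUNT package):

```
# load the grouped FASTRAK registry data from the COUNT package
library(COUNT)
data(fasttrakg)
head(fasttrakg)
```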

```
## die cases anterior hcabg killip kk1 kk2 kk3 kk4
## 1 5 19 0 0 4 0 0 0 1
## 2 10 83 0 0 3 0 0 1 0
## 3 15 412 0 0 2 0 1 0 0
## 4 28 1864 0 0 1 1 0 0 0
## 5 1 1 0 1 4 0 0 0 1
## 6 0 3 0 1 3 0 0 1 0
```

```
summary(fast <- glm(die ~ anterior + hcabg + factor(killip), family = poisson,
                    offset = log(cases), data = fasttrakg))
```

```
##
## Call:
## glm(formula = die ~ anterior + hcabg + factor(killip), family = poisson,
## data = fasttrakg, offset = log(cases))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4791 -0.5271 -0.1766 0.4337 2.3317
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.0698 0.1459 -27.893 < 2e-16 ***
## anterior 0.6749 0.1596 4.229 2.34e-05 ***
## hcabg 0.6614 0.3267 2.024 0.0429 *
## factor(killip)2 0.9020 0.1724 5.234 1.66e-07 ***
## factor(killip)3 1.1133 0.2513 4.430 9.44e-06 ***
## factor(killip)4 2.5126 0.2743 9.160 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 116.232 on 14 degrees of freedom
## Residual deviance: 10.932 on 9 degrees of freedom
## AIC: 73.992
##
## Number of Fisher Scoring iterations: 5
```
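The rate ratios, delta-method standard errors, and exponentiated intervals printed below can be obtained in the same way as before (a sketch):

```
# incidence rate ratios for the exposure model
exp(coef(fast))
# delta-method standard errors of the rate ratios
exp(coef(fast)) * sqrt(diag(vcov(fast)))
# exponentiated Wald confidence intervals
exp(confint.default(fast))
```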

```
## (Intercept) anterior hcabg factor(killip)2 factor(killip)3
## 0.01708133 1.96376577 1.93746487 2.46463342 3.04434885
## factor(killip)4
## 12.33745552
```

```
## (Intercept) anterior hcabg factor(killip)2 factor(killip)3
## 0.002492295 0.313359487 0.632970849 0.424784164 0.765119601
## factor(killip)4
## 3.384215172
```

```
## 2.5 % 97.5 %
## (Intercept) 0.0128329 0.02273622
## anterior 1.4363585 2.68482829
## hcabg 1.0212823 3.67554592
## factor(killip)2 1.7581105 3.45508317
## factor(killip)3 1.8602297 4.98221265
## factor(killip)4 7.2067174 21.12096274
```

```
## $AIC
## [1] 73.9917
##
## $AICn
## [1] 4.93278
##
## $BIC
## [1] 78.24
##
## $BICqh
## [1] 5.566187
```

```
# deviance goodness-of-fit test for the Poisson model
dev <- deviance(pois_m)
df <- df.residual(pois_m)
p_value <- 1 - pchisq(dev, df)
print(matrix(c("Deviance GOF ", " ",
               "D =", round(dev, 4),
               "df = ", df,
               "P-value = ", p_value), byrow = TRUE, ncol = 2))
```

```
## [,1] [,2]
## [1,] "Deviance GOF " " "
## [2,] "D =" "24190.3581"
## [3,] "df = " "3871"
## [4,] "P-value = " "0"
```

The Poisson model forces the variance to equal the mean, while the negative binomial models allow extra dispersion:

- In Poisson: Mean \(= \mu\), Variance \(= \mu\)
- In Negative Binomial 1 (NB1): Mean \(= \mu\), Variance \(= \mu + \alpha \mu = \mu (1 + \alpha) = \mu \omega\)
- In Negative Binomial 2 (NB2): Mean \(= \mu\), Variance \(= \mu + \alpha \mu^2\)
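The NB2 mean-variance relationship can be checked by simulation (a sketch; note that R's `rnbinom` parameterizes the dispersion as `size` \(= 1/\alpha\)):

```
# simulate NB2 counts with mu = 4, alpha = 0.5: Var = mu + alpha * mu^2 = 12
mu <- 4; alpha <- 0.5
set.seed(7)
y_nb <- rnbinom(1e6, mu = mu, size = 1 / alpha)
c(mean = mean(y_nb), variance = var(y_nb), nb2 = mu + alpha * mu^2)
```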

```
# Poisson Default standard errors (variance equals mean) based on MLE
poiss_mle <- glm(docvis ~ outwork + age, data = rwm1984, family=poisson())
print(summary(poiss_mle),digits=3)
```

```
##
## Call:
## glm(formula = docvis ~ outwork + age, family = poisson(), data = rwm1984)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.457 -2.248 -1.237 0.286 25.132
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.033517 0.039181 -0.86 0.39
## outwork 0.407931 0.018845 21.65 <2e-16 ***
## age 0.022084 0.000838 26.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 25791 on 3873 degrees of freedom
## Residual deviance: 24190 on 3871 degrees of freedom
## AIC: 31279
##
## Number of Fisher Scoring iterations: 6
```

```
# Robust sandwich standard errors based on poiss_mle
library(sandwich)
# robust covariance matrix of the estimated model
cov.robust <- vcovHC(poiss_mle, type = "HC0")
# the diagonal holds the variances
se.robust <- sqrt(diag(cov.robust))
coeffs <- coef(poiss_mle)
z_.025 <- qnorm(.975)
t.robust <- coeffs / se.robust
poiss_mle_robust <- cbind(coeffs, se.robust, t.robust,
                          pvalue = round(2 * (1 - pnorm(abs(coeffs / se.robust))), 5),
                          lower = coeffs - z_.025 * se.robust,
                          upper = coeffs + z_.025 * se.robust)
print(poiss_mle_robust, digits = 3)
```

```
## coeffs se.robust t.robust pvalue lower upper
## (Intercept) -0.0335 0.12333 -0.272 0.786 -0.2752 0.2082
## outwork 0.4079 0.06907 5.906 0.000 0.2726 0.5433
## age 0.0221 0.00283 7.808 0.000 0.0165 0.0276
```

```
# Poisson bootstrap standard errors (variance equals mean)
# install.packages("boot")
library(boot)
boot.poiss <- function(data, indices) {
  data <- data[indices, ]
  model <- glm(docvis ~ outwork + age, family = poisson(), data = data)
  coefficients(model)
}
set.seed(10101)
# To speed things up we use only 400 bootstrap replicates.
summary.poissboot <- boot(rwm1984, boot.poiss, 400)
print(summary.poissboot, digits = 3)
```

```
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = rwm1984, statistic = boot.poiss, R = 400)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -0.0335 0.007078 0.12531
## t2* 0.4079 0.000811 0.07318
## t3* 0.0221 -0.000168 0.00294
```

```
# Poisson with NB1 standard errors (assumes variance is a multiple of the mean) w = theta*mu or w = mu*(1 + alpha)
#also called Poisson Quasi-MLE
poiss_qmle <- glm(docvis ~ outwork + age, data = rwm1984, family=quasipoisson())
print(summary(poiss_qmle),digits=4)
```

```
##
## Call:
## glm(formula = docvis ~ outwork + age, family = quasipoisson(),
## data = rwm1984)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4573 -2.2475 -1.2366 0.2863 25.1320
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.033517 0.131962 -0.254 0.8
## outwork 0.407931 0.063468 6.427 1.46e-10 ***
## age 0.022084 0.002821 7.827 6.39e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 11.34324)
##
## Null deviance: 25791 on 3873 degrees of freedom
## Residual deviance: 24190 on 3871 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 6
```

```
# MASS fits only NB2
# install.packages("MASS")
library(MASS)
# Note: MASS reports theta = 1/alpha as the dispersion parameter, not alpha
# Here theta = 0.4353, so alpha = 1/0.4353 = 2.297
# NB2 with default standard errors
NB2.MASS <- glm.nb(docvis ~ outwork + age, data = rwm1984)
print(summary(NB2.MASS), digits = 3)
```

```
##
## Call:
## glm.nb(formula = docvis ~ outwork + age, data = rwm1984, init.theta = 0.4353451729,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.532 -1.282 -0.489 0.104 5.047
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.0396 0.1067 -0.37 0.71
## outwork 0.4147 0.0554 7.48 7.5e-14 ***
## age 0.0221 0.0024 9.24 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.4353) family taken to be 1)
##
## Null deviance: 4100.8 on 3873 degrees of freedom
## Residual deviance: 3909.6 on 3871 degrees of freedom
## AIC: 16674
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.4353
## Std. Err.: 0.0135
##
## 2 x log-likelihood: -16665.5250
```

The R code below is used for weighted least squares and illustrates the relationship between weighted least squares and generalized linear models.

```
set.seed(101)
x <- sort(rep(seq(1, 3.5, .1), 10))   # price values, 10 observations each
Ey <- 220 - 60 * x                    # true mean number sold
y <- rpois(length(Ey), Ey)            # Poisson counts with mean Ey
m <- lm(y ~ x)                        # ordinary least squares fit
plot(x, y, xlab = "Price", ylab = "Number Sold"); abline(m)
```

```
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.2760 -5.1955 -0.3457 4.6567 24.8014
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 222.4431 1.7288 128.67 <2e-16 ***
## x -60.8978 0.7289 -83.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.815 on 258 degrees of freedom
## Multiple R-squared: 0.9644, Adjusted R-squared: 0.9642
## F-statistic: 6980 on 1 and 258 DF, p-value: < 2.2e-16
```
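The weighted fit reported below uses a weight vector `W` whose construction is not shown; since the Poisson variance equals the mean, one natural choice (an assumption on our part) is the inverse of the OLS fitted means:

```
# weight each observation by the inverse of its estimated variance;
# for Poisson data the variance equals the mean, estimated here by fitted(m)
W <- 1 / fitted(m)
m.wls <- lm(y ~ x, weights = W)
summary(m.wls)
```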

```
##
## Call:
## lm(formula = y ~ x, weights = W)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -2.83173 -0.66938 -0.02412 0.58847 2.97862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 223.055 1.751 127.4 <2e-16 ***
## x -61.170 0.597 -102.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9548 on 258 degrees of freedom
## Multiple R-squared: 0.976, Adjusted R-squared: 0.9759
## F-statistic: 1.05e+04 on 1 and 258 DF, p-value: < 2.2e-16
```
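The output below fits the same linear mean function directly as a Poisson GLM with an identity link, which handles the mean-dependent variance implicitly:

```
# Poisson GLM with identity link: linear mean, Poisson (variance = mean) errors
m.glm <- glm(y ~ x, family = poisson(link = "identity"))
summary(m.glm)
```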

```
##
## Call:
## glm(formula = y ~ x, family = poisson(link = "identity"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.14035 -0.67749 -0.02331 0.58215 2.81612
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 223.0681 1.8305 121.86 <2e-16 ***
## x -61.1756 0.6227 -98.24 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 7382.88 on 259 degrees of freedom
## Residual deviance: 236.76 on 258 degrees of freedom
## AIC: 1817.1
##
## Number of Fisher Scoring iterations: 4
```

```
# mydata: admissions data with binary admit and continuous gre (loaded earlier)
y <- mydata$admit
x <- mydata$gre
mu <- rep(mean(y), nrow(mydata))          # initialize mu
eta <- log(mu / (1 - mu))                 # initialize eta
for (i in 1:10) {                         # loop for 10 iterations
  w <- mu * (1 - mu)                      # weight = variance
  z <- eta + (y - mu) / (mu * (1 - mu))   # working response
  mod <- lm(z ~ x, weights = w)           # weighted regression
  eta <- mod$fit                          # linear predictor
  mu <- 1 / (1 + exp(-eta))               # fitted values
  cat(i, coef(mod), "\n")                 # display iteration log
}
```

```
## 1 -2.783528 0.003434139
## 2 -2.899506 0.003579592
## 3 -2.901344 0.003582211
## 4 -2.901344 0.003582212
## 5 -2.901344 0.003582212
## 6 -2.901344 0.003582212
## 7 -2.901344 0.003582212
## 8 -2.901344 0.003582212
## 9 -2.901344 0.003582212
## 10 -2.901344 0.003582212
```

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.901344270 0.6068541966 -4.780958 2.459924e-06
## x 0.003582212 0.0009873336 3.628167 3.226024e-04
```

```
## 2.5 % 97.5 %
## (Intercept) -4.094384619 -1.708303920
## x 0.001641171 0.005523253
```
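The remaining output comes from fitting the same logistic regression directly with `glm`, whose coefficients agree with the IRLS loop above (a sketch, assuming `mydata` contains `admit` and `gre`):

```
# fit the same logistic regression directly with glm for comparison
logit_m <- glm(admit ~ gre, family = binomial(link = "logit"), data = mydata)
summary(logit_m)$coefficients
confint(logit_m)  # profile-likelihood confidence intervals
```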

```
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.901344270 0.6060376103 -4.787400 1.689561e-06
## gre 0.003582212 0.0009860051 3.633056 2.800845e-04
```

```
## 2.5 % 97.5 %
## (Intercept) -4.119988259 -1.739756286
## gre 0.001679963 0.005552748
```