Generalized Linear Regression Models

Office of Advanced Research and Computing (OARC), Statistical Methods and Data Analysis

1 Introduction

1.1 Overview of Generalized Linear Models

The chart below shows examples of Generalized Linear Models (GLMs):

GLM Chart

In this workshop, we will cover the key concepts of Generalized Linear Models (GLMs) and explore Logistic Regression, Poisson Regression, and briefly, the Negative Binomial model, with examples in R.

1.2 Linear Regression Model

Before exploring generalized regression models, it’s helpful to first understand linear regression. This will provide insight into why and how we extend linear regression to generalized linear models (GLMs).

  • Regression analysis is a statistical tool for explaining the relationship between a response (or dependent) variable and one or more predictor (independent) variables.

Notation Used in This Presentation

  • We use \(x\) for the predictor and \(y\) for the response variable.

  • In a simple regression model we observe \(n\) pairs of data points \((x_i, y_i)\), where \(i = 1, 2, \cdots, n\). These data points are typically randomly and independently sampled from a population of interest. Our goal is to study the relationship between \(x\) and \(y\).

  • In a linear regression model, we assume that this relationship follows a linear function:

\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]

where:

  • \(\beta_0\) (intercept) and \(\beta_1\) (slope) are the model parameters estimated from the data.

  • \(\epsilon\) is the error term, representing unexplained variability.

  • We can extend this to multiple predictors, leading to the multiple linear regression model:

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots +\beta_p x_{ip} + \epsilon_i\]

Here, each \(\beta_j\) represents the effect of predictor \(x_j\) on \(y\).

1.3 Assumptions in OLS

Linear regression is typically estimated using the Ordinary Least Squares (OLS) method, which finds model parameters by minimizing the sum of squared errors. OLS has an algebraic solution, allowing parameters to be computed analytically.

  • The regression model consists of two parts:

\[y = \overbrace{\beta_0 + \beta_1 x }^\text{Linear function}+ \overbrace{\epsilon}^\text{Error term}\]

  • The error term, \(\epsilon\), represents randomness and uncertainty. It is assumed to have:
  1. Mean zero (\(\text{E}(\epsilon)=0\)).

  2. Independence (errors are not correlated).

  3. Constant variance (homoscedasticity).

  4. Normal distribution (\(\epsilon \sim N(0, \sigma ^2)\)).

In other words, we assume that errors are independent and identically distributed (i.i.d.) from a normal distribution with mean zero.

1.4 The linear function as predicted mean

  • The linear function predicts the mean (expected value) of the response variable given a predictor:

\[\mu = \text{E}(y |x) = \beta_0 + \beta_1 x\]

  • For multiple predictors:

\[\mu = \text{E}(y | x_1, x_2 , \cdots x_p) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + ... + \beta_p x_p\]

  • The above equation is also called the linear prediction.

  • The linear prediction can take any value from \(-\infty\) to \(+\infty\).

While linear regression is a powerful tool, it has limitations. This leads us to more general models:

  • What if the outcome is not continuous?

  • What if errors do not have equal variance?

  • What if predicted values fall outside a plausible range?

These limitations motivate the need for generalized linear models (GLMs), which we will explore next.

1.5 Example: Linear Regression Model

For this example, we use elementary school API score data from the OARC website.

  • Our goal is to examine the relationship between academic performance (api00) and school enrollment (enroll).
  • Below are the first six observations from data:
#reading the data from OARC website
dat <- read.csv(
"https://stats.oarc.ucla.edu/wp-content/uploads/2019/02/elemapi2v2.csv")
head(dat[,c("api00", "enroll")])
  api00 enroll
1   693    247
2   570    463
3   546    395
4   571    418
5   478    520
6   858    343
  • We assume the relationship between \(y\) (the response variable) and \(x\) (the predictor) can be approximated by a straight line. However, the true nature of this relationship is unknown.

  • In R, we use the lm function to fit a linear model. To extract and interpret the results, we use the summary function:

#Fitting the linear model
m1 <- lm(api00 ~ enroll, data = dat)
#Extracting the results
summary(m1)

Call:
lm(formula = api00 ~ enroll, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-285.50 -112.55   -6.70   95.06  389.15 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 744.25141   15.93308  46.711  < 2e-16 ***
enroll       -0.19987    0.02985  -6.695 7.34e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 135 on 398 degrees of freedom
Multiple R-squared:  0.1012,    Adjusted R-squared:  0.09898 
F-statistic: 44.83 on 1 and 398 DF,  p-value: 7.339e-11

The estimated intercept is 744.25, and the estimated slope is -0.2.

  • In R, we can visualize this relationship using a scatter plot. The plot function creates the plot, and abline(m1) adds the regression line:
#Plotting the scatter plot
plot(api00 ~ enroll, data = dat, 
     main = "Scatter Plot: School Performance vs. Enrollment", 
     ylab = "API Score (2000)",
     xlab = "Number of Students Enrolled", 
     xlim = c(0, 1000))
abline(m1, col = "blue")

Scatter plot with regression line

This plot provides a clear visual representation of the negative relationship between school enrollment and academic performance.

1.6 Least Square (Optional)

  • The goal is to estimate the parameters \(\beta_0\) and \(\beta_1\)—the intercept and the slope—so that the fitted line best represents the data.

  • The estimated model parameters are denoted as \(\hat{\beta}_0\) and \(\hat{\beta}_1\).

  • The estimated regression model is:

\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\]

  • Here, \(\hat{y}\) represents the predicted mean of the outcome for a given value of \(x\). It is also known as the fitted value.

  • The predicted value for each observation is given by \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\).

  • The difference between the observed value \(y_i\) and the predicted value \(\hat{y_i}\) is called the residual and is given by:

\[e_i = y_i - \hat{y}_i\]

  • In Ordinary Least Squares (OLS), we estimate the model parameters (intercept and slope) by minimizing the sum of squared residuals:

\[ \min \sum_{i = 1}^{n} (y_i - \hat{y_i}) ^ 2\]

  • This is known as the Least Squares method.

  • Let’s visualize the components of uncertainty in the linear model by plotting the fitted line on a scatter plot.

Scatter plot with fitted line and residuals

In the plot above,

\(y_i - \bar{y}\) is the overall (total) uncertainty, which we try to explain with the linear model.

\(\hat{y}_i - \bar{y}\) is the uncertainty that can be explained by the linear model.

\(y_i - \hat{y}_i\) is the residual, the uncertainty that cannot be explained by the linear model.

In linear regression, the likelihood function (the probability of observing the outcome given the parameters) can be written using the normal distribution. This means we can use another statistical method, the Maximum Likelihood Method, to estimate the model parameters. Under the normality assumption, the Maximum Likelihood estimates are equivalent to the Least Squares estimates.
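As a quick check of the least-squares solution, we can compute the slope and intercept by hand and compare them with the lm() fit from the earlier example (a minimal sketch; m1 and dat are the model and data frame defined above):

#closed-form OLS estimates: slope = cov(x, y)/var(x), intercept = mean(y) - slope*mean(x)
b1 <- cov(dat$enroll, dat$api00) / var(dat$enroll)
b0 <- mean(dat$api00) - b1 * mean(dat$enroll)
c(b0, b1)
#compare with the coefficients estimated by lm
coef(m1)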

1.7 Linear regression model diagnostics for the api vs. enrollment model (optional)

par(mfrow = c(2, 2))
plot(m1)

Linear model diagnostics plots

We will look at model diagnostics plots to evaluate the model assumptions.

The plot in the upper-left corner is the Residuals vs. Fitted plot.

  • This plot helps assess whether the assumptions of zero mean and constant variance of the error term hold. Based on the plot, there is no strong evidence to reject these assumptions.

  • If the error has a mean of zero, then as the sample size increases, the estimated parameters will converge to the true values of \(\beta\) (i.e., the estimator is unbiased).

  • The assumption of constant variance (homoscedasticity) ensures that the Ordinary Least Squares (OLS) method provides parameter estimates with minimal variability.

The Normal Q-Q plot helps assess whether the residuals follow a normal distribution.

  • The normality assumption is crucial for statistical tests on model parameters, particularly in small samples.

  • However, for large sample sizes, the normality assumption becomes less critical due to the Central Limit Theorem.

2 Generalized Linear Models (GLM)

2.2 Understanding the odds

  • Suppose the probability of being admitted to graduate school is 0.3.

  • The odds of admission are defined as:

\[ \tt{odds(admit)} = \dfrac{p}{1 - p} = \dfrac{0.3}{1 - 0.3} = 0.4285714 \]

  • Conversely, the odds of not being admitted are:

\[ \tt{odds(not \ admit)} = \dfrac{1 - p}{p} = \dfrac{0.7}{0.3} = 2.333333 \]

  • As we can see, odds are always positive, ranging from 0 to infinity. The inverse transformation, \(p = \text{odds}/(1 + \text{odds})\), maps the odds back to a probability between 0 and 1.

Example: Calculating Probability from Odds

  • A CNN analyst predicts that the Los Angeles Lakers have 9-to-1 odds of making the playoffs (odds = 9). What is the probability that they qualify?

Using the formula for odds:

\(odds = \dfrac{p}{1 - p}\)

Solving for \(p\):

\(p = \dfrac{odds} {1 + odds} = 9 / (9 + 1) = 0.9\)

Thus, the probability of the Lakers making the playoffs is 90%.
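These conversions are easy to check in R; a small sketch:

#probability to odds and back
p <- 0.3
odds <- p / (1 - p)       # 0.4285714
odds / (1 + odds)         # back to 0.3
#9-to-1 odds to probability
9 / (1 + 9)               # 0.9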

2.3 Log Transformation and the Logit Function

  • Another useful transformation is the log function.

  • The log of a positive value (between 0 and \(+\infty\)) maps it to the range \(-\infty\) to \(+\infty\).

  • The inverse log function is the exponential function.

  • We often use the natural log (\(\ln\)) and the exponential function (\(e^x\)) in statistical modeling.

  • Applying the log transformation to odds (i.e., log(odds)) converts a probability (between 0 and 1) into an unrestricted real number (between \(-\infty\) and \(+\infty\)).

This transformation, log(odds), is called the logit function and is a fundamental component of logistic regression.
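In R, the logit and its inverse are available as qlogis() and plogis(); the sketch below maps a probability to the log-odds scale and back:

p <- 0.3
log(p / (1 - p))     # logit by hand
qlogis(p)            # same value from the built-in logistic quantile function
plogis(qlogis(p))    # the inverse logit maps it back to 0.3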

2.4 Natural logarithm and exponential function

The exponential function plays a crucial role in generalized linear models (GLMs). Understanding its mathematical properties is essential for grasping the concepts behind GLMs.

  • The logarithm is the inverse function to exponentiation.

For example,

\[10^3 = 1000 \Rightarrow \log_{10}(1000) = 3\] This means the logarithm base 10 of 1000 is 3.

Similarly,

\[2^6 = 64 \Rightarrow \log_{2}(64) = 6 \]

  • Euler’s number, \(e\), is a fundamental mathematical constant, approximately 2.718.

  • It can be expressed as an infinite series:

\[\displaystyle e=\sum \limits _{n=0}^{\infty }{\frac {1}{n!}}={\frac {1}{1}}+{\frac {1}{1}}+{\frac {1}{1\times 2}}+{\frac {1}{1\times 2\times 3}}+\cdots \approx 2.718282.\]

  • A logarithm with base \(e\) is called the natural logarithm.

If \[e^6 = a\]

Then

\[\log_e(a) = 6\]

  • Logarithm Rules

\[\exp(x + y) = \exp(x)\exp(y)\]

\[\log(xy) = \log(x) + \log(y)\]

\[\log(x/y) = \log(x) - \log(y)\]
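We can verify these rules numerically in R:

#numerical check of the exponent and logarithm rules
all.equal(exp(2 + 3), exp(2) * exp(3))
all.equal(log(6 * 7), log(6) + log(7))
all.equal(log(6 / 7), log(6) - log(7))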

2.5 Distribution of the outcome

  • In linear regression models we assumed the errors are distributed from a normal distribution with mean zero and variance \(\sigma^2\).

  • We can also say the outcome for a given \(x\) (i.e. \(y|x\)) has a normal distribution with mean \(\mu = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + ... + \beta_p x_p\) and variance \(\sigma^2\).

  • This is not always true; for example, for a \(\{0, 1\}\) outcome, the outcome has a Bernoulli distribution and the variance depends on the mean.

  • In generalized linear regression models we assume the outcome has a distribution from the exponential family of distributions.

  • In R, the distribution of the outcome is defined by the argument family in the glm function.

  • In generalized linear regression models we assume that a transformation of the mean of the outcome distribution (given by the link function) has a linear relationship with the predictors.

  • In the next few slides we review some of the special distributions.

2.6 Continuous random variable

  • A continuous random variable is a random variable that can take infinitely many values in an interval.

For example, the weight of newborn babies.

The graph below shows the plot of PDF of a normal distribution with \(\mu = 2\) and \(sd = 1\).

PDF of normal distribution

2.7 Discrete Random variable and Probability Mass Function

  • A discrete random variable is a random variable that can only take values from a countable set.

For example, the number of times we flip a coin until we observe a head for the first time. This value is in the set \(\{1, 2, 3, ...\}\).

The sum of the outcomes from rolling two dice is a random variable that takes values in the set \(\{2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12\}\).

The number of patients hospitalized daily for COVID-19 in a healthcare facility.

  • A dichotomous outcome is a Bernoulli random variable with values of 0 and 1, and is a discrete random variable.

Probability Mass Function

  • A probability mass function (PMF) gives the probability that a discrete random variable takes each of its possible values.

For example, the PMF of the outcome of rolling a fair die, with possible values \(\{1, 2, 3, 4, 5, 6\}\), is:

\(p(X = x) = 1/6\)

for \(x \in \{1, 2, 3, 4, 5, 6\}\).

2.8 Bernoulli distribution

  • A Bernoulli random variable is a random variable that can take only two possible outcomes: 1 with probability \(p\) and 0 with probability \(q = 1 - p\).

  • The PMF of Bernoulli with zero and one outcomes is:

\[f(x;p)=\begin{cases} 1 - p & {\text{if }}x=0,\\p & {\text{if }}x=1.\end{cases}\]

the mean of a Bernoulli random variable is:

\(\mu_X = E(X) = 1 \times p + 0 \times (1 - p) = p\)

and the variance is

\(\sigma^2_X = var(X) = E(X- \mu)^2 = (1 - p) ^ 2 \times p + (0 - p) ^ 2 \times (1 - p) \\ = p(1 - p) ( 1 - p + p) = p(1 - p)\)

PMF of Bernoulli distribution
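We can check the mean and variance formulas by simulating Bernoulli draws with rbinom() (a Bernoulli is a Binomial with size = 1); a small sketch:

set.seed(1)
p <- 0.3
x <- rbinom(100000, size = 1, prob = p)  # 100,000 Bernoulli(0.3) draws
mean(x)                                  # close to p = 0.3
var(x)                                   # close to p * (1 - p) = 0.21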

2.9 Binomial distribution (optional)

  • The Binomial is another popular discrete random variable, defined as the number of successes in \(n\) independent Bernoulli trials, each with probability of success \(p\).

The PMF of Binomial(n, p) is: (optional) \[\displaystyle f(x; n,p)=\Pr(x;n,p)=\Pr(X=x)=\frac{n!}{x!(n-x)!} p^{x}(1-p)^{n-x}\] For \(x = 0, 1, 2, 3, \cdots, n\).

  • The sum of \(n\) independent Bernoulli random variables with the same parameter \(p\) has a Binomial distribution with parameters \(n\) and \(p\).

  • If we know the PMF of a random variable, we can calculate the probability of observing any given value by plugging that value into the PMF.

  • For example, for a Binomial random variable with \(p = 0.2\) and \(n = 10\), we can calculate the probability of each value of the random variable (see the R check after the plot below):

\(P(x = 0) = [10! / 0!(10 - 0)!] 0.2 ^ 0 (1 - 0.2) ^ {10} = 0.8 ^ {10} \approx 0.107\)

  • The mean of binomial is \(np\) and the variance is \(np(1 - p)\)

  • The graph below shows the plot of PMF of a Binomial random variable with \(p = 0.2\) and \(n = 10\).

PMF of Binomial distribution
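We can confirm the hand calculation above with dbinom() (a small sketch):

#P(X = 0) for Binomial(n = 10, p = 0.2); should be about 0.107
dbinom(0, size = 10, prob = 0.2)
#full PMF plus the mean np and variance np(1 - p)
round(dbinom(0:10, size = 10, prob = 0.2), 4)
c(mean = 10 * 0.2, variance = 10 * 0.2 * 0.8)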

2.10 Poisson distribution

The Poisson distribution is one of the most commonly used probability distributions for modeling count data. It describes the probability of a given number of events occurring in a fixed interval of time or space, assuming the events occur independently and at a constant average rate.

Key Characteristics:

  • The outcome is a count (0, 1, 2, …).
  • The mean of the distribution is denoted as \(\lambda\) (lambda), which represents the expected number of occurrences in the given interval.
  • The variance is equal to the mean (i.e., \(\lambda\)).
  • Events occur independently, meaning the occurrence of one event does not affect the probability of another.

Later, in the section on regression for count outcomes, we will explore the Poisson distribution in more detail, including its mathematical formulation and applications.

3 Logistic Regression

3.1 An example of logistic regression

Logistic regression is used when the outcome variable is binary, meaning it can take only two values (e.g., 1 and 0).

Example: Graduate School Admissions

In this example, we analyze binary data from the OARC statistics website.

A researcher is interested in examining how factors such as GRE score (Graduate Record Exam), GPA (Grade Point Average), and undergraduate institution prestige affect the probability of admission into graduate school.

The response variable (admission status) is binary, meaning:

  • 1 = Admitted

  • 0 = Not admitted

3.1.0.1 Variables in the Dataset

  • The outcome variable: admit (binary: 1 = admitted, 0 = not admitted).

  • The predictor variables:

    • gre (Graduate Record Exam score) – continuous
    • gpa (Grade Point Average) – continuous
    • rank (Undergraduate institution prestige) – categorical (values 1 to 4):

      Rank 1: Highest prestige

      Rank 4: Lowest prestige

To summarize the dataset, we can use summary(). If we need the standard deviations for each variable, we can apply the sd function using sapply().
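A sketch of those two calls, assuming the admissions data have already been read into a data frame named mydata (the name used in the model output later in this section); they produce the output shown below:

#summary statistics for all variables
summary(mydata)
#standard deviation of each variable
sapply(mydata, sd)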

     admit             gre             gpa             rank      
 Min.   :0.0000   Min.   :220.0   Min.   :2.260   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:520.0   1st Qu.:3.130   1st Qu.:2.000  
 Median :0.0000   Median :580.0   Median :3.395   Median :2.000  
 Mean   :0.3175   Mean   :587.7   Mean   :3.390   Mean   :2.485  
 3rd Qu.:1.0000   3rd Qu.:660.0   3rd Qu.:3.670   3rd Qu.:3.000  
 Max.   :1.0000   Max.   :800.0   Max.   :4.000   Max.   :4.000  
      admit         gre         gpa        rank 
  0.4660867 115.5165364   0.3805668   0.9444602 

3.2 Estimating the mean of admit without any predictor (optional)

  • This can also be seen as an intercept-only regression model.

  • The variable admit is binary: it takes the value 1 if a person is admitted and 0 otherwise. For a binary variable, the mean is the same as the proportion of 1s, i.e., the estimated probability of being admitted to grad school.

\[ \hat{p} = \dfrac{X}{n} = \dfrac{\sum_{i=1}^nx_i}{n} \]

where \(X\) is the total number of people admitted, \(n\) is the total number of students, and \(x_i\) is the value of admit for the \(i\)-th student.

  • We can calculate the standard deviation directly from the mean.

  • Note that this variable is binary, and assuming each binary outcome is the result of a Bernoulli trial, \(\hat\mu_{admit} = \hat p = \sum_{i=1}^{n} x_i /n\).

  • Since the variance of a Bernoulli distribution is \(var(x) = p(1-p)\), we can use this formula to calculate the variance.

  • Remember that in OLS, where the error (and therefore the outcome) is assumed to be normally distributed, the variance is assumed to be constant; here, the variance depends on the predicted mean.

  • Of course, not in the intercept-only model, because the expected mean is constant!
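A minimal sketch of these calculations in R, again assuming the data frame is named mydata:

#estimated probability of admission (mean of a 0/1 variable)
p_hat <- mean(mydata$admit)
p_hat
#variance implied by the Bernoulli formula p(1 - p)
p_hat * (1 - p_hat)
#compare with the usual sample variance
var(mydata$admit)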

3.3 Linear Regression model of admit predicted by GRE score (optional)

  • Now we run a regression model of admit predicted by gre, treating admit as a continuous variable.

  • We are trying to estimate the mean of admit for a given value of gre. Since admit is 0 or 1, this mean can be interpreted as the probability of admission.
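A sketch of the call that produces the output below (the object name lm_admit is ours):

#linear model of a binary outcome (a "linear probability model")
lm_admit <- lm(admit ~ gre, data = mydata)
summary(lm_admit)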


Call:
lm(formula = admit ~ gre, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.4755 -0.3415 -0.2522  0.5989  0.8966 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.1198407  0.1190510  -1.007 0.314722    
gre          0.0007442  0.0001988   3.744 0.000208 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4587 on 398 degrees of freedom
Multiple R-squared:  0.03402,   Adjusted R-squared:  0.03159 
F-statistic: 14.02 on 1 and 398 DF,  p-value: 0.0002081

As we can see, the variance of the outcome around its predicted value is systematically related to the predicted value and is therefore not constant.

Another issue with this model is that:

  • In a linear model, the predicted outcome can be any number from negative infinity to positive infinity.

  • But a probability cannot be negative or greater than one. So, we need a transformation function that relates the linear predictor to a value between 0 and 1.

To overcome this issue, as well as the issue of non-constant variance in OLS, we need a different model.

3.4 Logistic Regression model of admit predicted by GRE score

  • In logistic regression we model the log odds of the outcome as a linear function of the predictors:

\[ \log(\text{odds} \mid x_1, x_2 , \cdots, x_p) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + ... + \beta_p x_p\]

Or,

\[\frac{p}{1-p} = e^{(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + ... + \beta_p x_p)}\]

or,

\[p = \frac{e^{(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + ... + \beta_p x_p)}}{1+e^{(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + ... + \beta_p x_p)}} \] Or,

\[p = \frac{1} {1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + ... + \beta_p x_p)}}\]

  • Notice that we do not need an \(\epsilon\) term to represent the uncertainty of the outcome, because the probability itself is the expected value of the outcome, and the outcome is a single Bernoulli draw with this probability.

  • In R we use the glm function to fit a generalized linear model and estimate the model coefficients. The distribution of the outcome is defined by the family argument, with link = "logit" specifying the link function.
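A sketch of the call that produces the output below (the object name logit_gre is ours):

#logistic regression of admit on gre
logit_gre <- glm(admit ~ gre, family = binomial(link = "logit"), data = mydata)
summary(logit_gre)
#coefficient table and profile-likelihood confidence intervals
coef(summary(logit_gre))
confint(logit_gre)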


Call:
glm(formula = admit ~ gre, family = binomial(link = "logit"), 
    data = mydata)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.901344   0.606038  -4.787 1.69e-06 ***
gre          0.003582   0.000986   3.633  0.00028 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 486.06  on 398  degrees of freedom
AIC: 490.06

Number of Fisher Scoring iterations: 4
                Estimate   Std. Error   z value     Pr(>|z|)
(Intercept) -2.901344270 0.6060376103 -4.787400 1.689561e-06
gre          0.003582212 0.0009860051  3.633056 2.800845e-04
                   2.5 %       97.5 %
(Intercept) -4.119988259 -1.739756286
gre          0.001679963  0.005552748

\[ \log(\text{odds of admit}) = -2.901344 + 0.003582 \times \text{GRE} \]

  • The coefficient of GRE (the slope of the model) is interpreted as: for a one-unit increase in GRE, the log-odds of admission are expected to increase by 0.003582 on average.

This is not very intuitive or easy to interpret. We can only see that this positive relationship is unlikely to be due to chance alone (p-value = 0.00028).

  • The intercept is the expected log odds of someone getting admitted to graduate school when their GRE score is zero.

  • When we exponentiate the intercept then we get the expected odds of someone with GRE equal to zero.

\(exp(-2.901344) = 0.05494932\). Transforming to probability, \(p = 0.05494932/(1+0.05494932)= 0.052\), we get the expected probability of someone with GRE equal to 0.

What if we exponentiate the slope?

Assume someone has a GRE score of 500. Then the predicted odds based on the model will be:

\[odds_{500} = e^{-2.901344 + 0.003582 \times 500}\]

If their GRE score increases by one unit then the odds will be:

\[odds_{501} = e^{-2.901344 + 0.003582 \times (500 + 1)}\]

If we divide the latter by the former we have:

\[\frac{odds_{501}}{odds_{500}} = \frac{e^{-2.901344 + 0.003582 \times (500 + 1)}}{e^{-2.901344 + 0.003582 \times (500)}} = \frac{e^{-2.901344}\times e ^{0.003582\times 500} \times e^{(0.003582)}}{e^{-2.901344}\times e ^{0.003582\times 500}} = e^{0.003582} \]

  • This number is constant and does not depend on the GRE score or the intercept. It is simply the exponentiated slope.

Thus,

The exponentiated slope is the odds ratio of admission for a one-unit increase in GRE score, which is:

\[exp(0.003582) = 1.003588\] In other words, the odds of getting admitted increase on average by about \(0.36\%\) for a one-unit improvement in GRE score. (This is the old GRE score, which at the time moved in 10-point increments.) What if someone improves their GRE by 50 units?

  • With the same calculations:

\[\frac{odds_{550}}{odds_{500}} = \frac{e^{-2.901344 + 0.003582 \times (500 + 50)}}{e^{-2.901344 + 0.003582 \times (500)}} = \frac{e^{-2.901344}\times e ^{0.003582\times 500} \times e^{50\times(0.003582)}}{e^{-2.901344}\times e ^{0.003582\times 500}} = (e^{0.003582})^{50} \]

The exponential function acts multiplicatively, not additively. Thus, for 50 more GRE points the odds of admission increase by a factor of \((\exp(\beta_1))^{50} = 1.003588 ^{50} = 1.196115\). In other words, the odds of getting admitted increase by about \(20\%\).
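The same odds ratios can be computed directly from the fitted model in R (a sketch, using the logit_gre object defined above):

#odds ratio for a 1-unit and a 50-unit increase in GRE
exp(coef(logit_gre)["gre"])        # about 1.0036
exp(coef(logit_gre)["gre"] * 50)   # about 1.196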

3.5 Logistic Regression model of admit predicted by gpa score

  • Exercise: Using the same data set, run a logistic regression of admit predicted by gpa.
  • Calculate the exponentiated slope and interpret it.

  • If someone increases their gpa by 0.2, how do the expected odds of their admission change?

3.6 Logistic Regression model of admit predicted by gpa and gre score
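A sketch of the call that produces the output below (object name ours):

#logistic regression with two predictors
logit_gre_gpa <- glm(admit ~ gre + gpa, family = binomial(link = "logit"), data = mydata)
summary(logit_gre_gpa)
coef(summary(logit_gre_gpa))
exp(coef(logit_gre_gpa))   # odds ratios
confint(logit_gre_gpa)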


Call:
glm(formula = admit ~ gre + gpa, family = binomial(link = "logit"), 
    data = mydata)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.949378   1.075093  -4.604 4.15e-06 ***
gre          0.002691   0.001057   2.544   0.0109 *  
gpa          0.754687   0.319586   2.361   0.0182 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 480.34  on 397  degrees of freedom
AIC: 486.34

Number of Fisher Scoring iterations: 4
                Estimate  Std. Error   z value     Pr(>|z|)
(Intercept) -4.949378063 1.075092882 -4.603675 4.151004e-06
gre          0.002690684 0.001057491  2.544403 1.094647e-02
gpa          0.754686856 0.319585595  2.361455 1.820340e-02
(Intercept)         gre         gpa 
0.007087816 1.002694307 2.126945379 
                    2.5 %       97.5 %
(Intercept) -7.1092205752 -2.886726963
gre          0.0006373958  0.004791712
gpa          0.1347509288  1.390131131

The odds of getting admitted increase on average by a factor of \(2.1269\) for a one-unit improvement in GPA, holding GRE score constant.

3.7 Logistic Regression model of admit predicted by gpa and gre score and the interaction between gpa and gre score
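A sketch of the call that produces the output below (object name ours):

#logistic regression with a gre-by-gpa interaction
logit_int <- glm(admit ~ gre * gpa, family = binomial(link = "logit"), data = mydata)
summary(logit_int)
coef(summary(logit_int))
exp(coef(logit_int))   # multiplicative effects on the odds
confint(logit_int)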


Call:
glm(formula = admit ~ gre * gpa, family = binomial(link = "logit"), 
    data = mydata)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)  
(Intercept) -13.818700   5.874816  -2.352   0.0187 *
gre           0.017464   0.009606   1.818   0.0691 .
gpa           3.367918   1.722685   1.955   0.0506 .
gre:gpa      -0.004327   0.002787  -1.552   0.1205  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 477.85  on 396  degrees of freedom
AIC: 485.85

Number of Fisher Scoring iterations: 4
                 Estimate  Std. Error   z value   Pr(>|z|)
(Intercept) -13.818700374 5.874815854 -2.352193 0.01866309
gre           0.017464312 0.009606206  1.818024 0.06906048
gpa           3.367918022 1.722685231  1.955040 0.05057838
gre:gpa      -0.004326605 0.002786899 -1.552480 0.12054741
 (Intercept)          gre          gpa      gre:gpa 
9.968153e-07 1.017618e+00 2.901805e+01 9.956827e-01 
                    2.5 %       97.5 %
(Intercept) -2.576285e+01 -2.677081148
gre         -9.057462e-04  0.036855984
gpa          8.009849e-02  6.850511379
gre:gpa     -9.932612e-03  0.001022485
[1] 3.335476
[1] 3.335407
  • The interaction term is a multiplicative factor: for example, for GPA, it increases/decreases (here decreases) the odds ratio by a factor of 0.9956827 for every unit of GRE.

  • For example, if someone has a GRE of 500, then for every 1-unit increase in their GPA their odds of admission change by a factor of \(29.01805 \times 0.9956827 ^ {500} = 3.3354\).

Equivalently, we can first calculate \(3.367918022 + 500 \times (-0.004326605)\) and then exponentiate, as sketched below.
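These are the two numbers printed at the end of the output above; a sketch of the calculations (small rounding differences are expected):

#effect of a one-unit GPA increase on the odds when GRE = 500
29.01805 * 0.9956827 ^ 500
exp(3.367918022 + 500 * (-0.004326605))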

3.8 More on logistic regression

There is more to logistic regression analysis than we have time to cover here.

  • For the problem of complete or quasi-complete separation in logistic/probit see the following page:

FAQ What is complete or quasi-complete separation in logistic/probit regression and how do we deal with them?

  • To review the assumptions of logistic regression, you can visit this page:

Logistic Regression Diagnostics

4 Regression of count data

4.1 Poisson Random variable and Poisson distribution

  • The discrete random variable \(Y\), defined as the number of events recorded during a unit of time, has a Poisson distribution with rate parameter \(\mu > 0\) if \(Y\) has PMF

\[Pr(Y = y) = \dfrac{e^{-\mu}(\mu)^y}{y!}, \text{ for } y = 0, 1, 2, \dots .\]

  • The mean of Poisson distribution is \(\mu\) and the variance of it is also \(\mu\).

  • In Poisson, we only have one parameter. The mean and variance of Poisson distribution are equal.

  • For a large number of Bernoulli trials with a very small probability of success, the total number of successes can be approximated by a Poisson distribution. This is called the law of rare events.

  • It can be shown that a Binomial distribution converges to a Poisson distribution as \(n\) goes to infinity and \(p\) goes to zero such that \(np \rightarrow \mu\) (see the numerical check below).
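A quick numerical illustration of this convergence (a sketch):

#Binomial(n = 1000, p = 0.002) versus Poisson(mu = n*p = 2)
n <- 1000; p <- 0.002
round(dbinom(0:5, size = n, prob = p), 4)
round(dpois(0:5, lambda = n * p), 4)   # very close to the binomial probabilities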

4.2 An Example of Poisson Random variable

Suppose the number of accidents on a sunny day on freeway 405 between Santa Monica and Wilshire Blvd is Poisson with a rate of 2 accidents per day. Then the probabilities of 0, 1, 2, …, 8 accidents can be calculated using the Poisson PMF:

\(P(Y = 0| \mu = 2) = \dfrac{e^{-2}(2)^0}{0!} = 1/e^2 \approx 0.1353\)

\(P(Y = 1| \mu = 2) = \dfrac{e^{-2}(2)^1}{1!} = 2/e^2 \approx 0.2707\)

\(P(Y = 2| \mu = 2) = \dfrac{e^{-2}(2)^2}{2!} = 2/e^2 \approx 0.2707\)

\(P(Y = 3| \mu = 2) = \dfrac{e^{-2}(2)^3}{3!} = 8/(6e^2) \approx 0.1804\)

\(P(Y = 4| \mu = 2) = \dfrac{e^{-2}(2)^4}{4!} = 16/(24e^2) \approx 0.0902\)

\(P(Y = 5| \mu = 2) = \dfrac{e^{-2}(2)^5}{5!} = 32/(120e^2) \approx 0.0361\)

\(P(Y = 6| \mu = 2) \approx 0.0120\)

\(P(Y = 7| \mu = 2) \approx 0.0034\)

\(P(Y = 8| \mu = 2) \approx 0.0009\)

The following graph shows the plot of PMF of a Poisson random variable with \(\mu = 2\)

PMF of Poisson distribution

PMF of Poisson(mu = 2)

This looks similar to the Binomial PMF, but it is not exactly the same!

Now assume the number of accidents on a rainy day on the same stretch of freeway 405 is Poisson with a rate of 5.5 accidents per day. Then the probability of each of \(x = 0, 1, 2, \cdots, 8\) accidents, computed in R, is:

#poisson_with lambda 5.5
 dpois(0:8, lambda = 5.5)
[1] 0.004086771 0.022477243 0.061812418 0.113322766 0.155818804 0.171400684
[7] 0.157117294 0.123449302 0.084871395
  • The ratio of accident rates between rainy days and sunny days is 5.5 / 2 = 2.75.

We can also say the mean number of accidents per day is 2.75 times higher on a rainy day than on a sunny day.

  • We can draw a sample of the numbers 0 to 8 with the probabilities we calculated from Poisson(2).

A sample of 20 numbers will look like this:

 [1] 6 2 2 2 2 4 1 1 2 0 2 1 4 2 1 4 1 2 1 0
[1] 2
[1] 2.210526

Now let's add some zeroes and some larger values to our sample:

[1] 2.213115
[1] 14.07049

Histogram of two samples from Poisson distribution

  • We can see that for sample y2 the variance is larger than the mean. This is called overdispersion.

  • The way we drew our sample is close to Poisson with \(\mu = 2\), but not exactly. If we want to sample from a true Poisson distribution, we can use the rpois() function in R.

y <- rpois(1000, 2)
mean(y)
[1] 1.958
var(y)
[1] 2.038274
hist(y, breaks = 15)

Histogram of a samples from Poisson distribution

Poisson distribution for unit time and time t.

The basic Poisson distribution assumes that each count occurs over one unit of time. However, we often count events over a longer period of time or over different areas. For example, if accidents occur on average 2 times per day, then over the course of a month (30 days) the expected number of accidents will be \(2 \times 30 = 60\), while the rate is still 2 accidents per day. The rate parameterization of the Poisson distribution is:

\[Pr(Y = y) = \dfrac{e^{-t\mu}(t\mu)^y}{y!}, \text{ for } y = 0, 1, 2, \dots .\] where \(t\) represents the length of time (the exposure). We will return to this when we discuss exposure in Poisson regression.
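For instance, with a rate of 2 accidents per day, probabilities for a 30-day month use \(t\mu = 60\); a small sketch:

#probability of exactly 60 accidents in a 30-day month at a rate of 2 per day
dpois(60, lambda = 2 * 30)
#probability of 50 or fewer accidents in that month
ppois(50, lambda = 2 * 30)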

4.3 Count data

Count data consists of observations that take non-negative integer values (0, 1, 2, …).

  • The range of counts extends from zero to some unknown upper limit.

  • Theoretically, counts can range from zero to infinity, but in practice, they are always bounded by a reasonable maximum.

Types of Count Data

  • The number of occurrences of an event within a given time period.

  • The number of occurrences of an event within a specific geographic or spatial area.

  • The count of individuals affected by a particular condition, often adjusted for population size.

Examples of count data:

  • The number of accidents on a highway in a day.

  • The number of patients who die within 48 hours of a myocardial infarction in a hospital.

  • The count of people diagnosed with a specific disease, adjusted for population size.

4.4 Poisson Regression Model

  • A regression model for count data is a regression model in which the response variable is a non-negative integer.

Can we model the count response as a continuous random variable and estimate the parameters using ordinary least squares?

There are two problems with this approach:

  • In a linear OLS model, the predicted mean response can be negative, which is not valid for count data.

  • In a linear OLS model, we assume the variance of the response for any given predictor value is constant. This may not hold for count data, and in fact it almost never does.

  • The Poisson regression model is the standard model for count data.

  • In a Poisson regression model, we assume that the response variable, for a given set of predictor values, follows a Poisson distribution with parameter \(\mu\) that depends on the values of the predictors.

  • Since the parameter \(\mu\) must be greater than zero, we use the log function as the link function.

  • Nearly all count models have the basic structure of the linear model. The difference is that the left-hand side of the model equation is on the log scale.

Formalization of count model

A count model with one predictor is written as:

\[\text{log of mean response} = \ln(\mu) = \beta_0 + \beta_1 x\]

From this model the expected count or mean response will be:

\[\text{mean response} = \mu =E(y|x) = \exp( \beta_0 + \beta_1 x)\] This ensures \(\mu\) will be positive for any value of \(\beta_0\), \(\beta_1\), or \(X\).

The Poisson model

The Poisson model assumes that the count \(y|x\) has a Poisson distribution with the above mean. For an outcome \(y\) and a single predictor \(x\) we have:

\[ f(y | x; \beta_0, \beta_1) = \dfrac{e^{-\exp( \beta_0 + \beta_1 x)}(\exp( \beta_0 + \beta_1 x))^{y}}{y!}\]

The plot of the distribution of the response for a Poisson regression model looks like this:

  • In contrast, the plot below shows the distribution of the responses when the response has a normal distribution for a given predictor.

4.5 Poisson Regression Model (continued)

Back to our toy example of the number of accidents on the 405. Let \(x\) be a binary variable: 0 for a sunny day and 1 for a rainy day.

We have a sample of 40 days, 20 sunny and 20 rainy, drawn from Poisson distributions with rates 2 and 5 respectively.

In fact, we can artificially generate this sample using R:

y <- c(rpois(20, lambda = 2), rpois(20, lambda = 5))
x <- rep(c(0,1), each = 20)
print(data.frame(y = y,x = x))
    y x
1   2 0
2   3 0
3   0 0
4   4 0
5   2 0
6   5 0
7   1 0
8   3 0
9   4 0
10  1 0
11  3 0
12  2 0
13  2 0
14  0 0
15  2 0
16  6 0
17  2 0
18  1 0
19  2 0
20  1 0
21  9 1
22 10 1
23  1 1
24  4 1
25  6 1
26  5 1
27  5 1
28  7 1
29  8 1
30  6 1
31  6 1
32  7 1
33  6 1
34  5 1
35  5 1
36  6 1
37  6 1
38  4 1
39  8 1
40  5 1
accidents_model <- glm(y ~ x, family = poisson(link = "log"))
summary(accidents_model)

Call:
glm(formula = y ~ x, family = poisson(link = "log"))

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.8329     0.1474   5.649 1.61e-08 ***
x             0.9505     0.1736   5.475 4.38e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 69.616  on 39  degrees of freedom
Residual deviance: 36.173  on 38  degrees of freedom
AIC: 160.7

Number of Fisher Scoring iterations: 5

The coefficients are now on the log scale.

  • Let’s exponentiate the coefficients:
(Intercept)           x 
   2.300000    2.586957 
  • We can also recover the accident rate for each group (sunny or rainy) directly from the sample means; the rate ratio is the mean for rainy days divided by the mean for sunny days.
# Mean of sunny days (x == 0)
mu0 <- mean(y[x==0])
# Mean of rainy days (x == 1)
mu1 <- mean(y[x==1])
# Rate ratio
c(mu0, mu1, mu1/mu0)
[1] 2.300000 5.950000 2.586957

4.6 Poisson model log-likelihood function (optional)

This part is just for your information and is optional.

If we have more than one predictor, then for the \(i\)-th observation we have:

\[\mu_i = E(y_i | x_{i1} , x_{i2}, \dots, x_{ip};\beta) = \exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots+ \beta_p x_{ip})\]

For simplicity we use \(\boldsymbol{x'\beta}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots+ \beta_p x_{ip}\) and therefore we have

\[P(Y = y_i|\boldsymbol{x}_i, \boldsymbol{\beta}) = \dfrac{e^{-\exp(\boldsymbol{x'\beta}_i)}(\exp(\boldsymbol{x'\beta}_i))^{y_i}}{y_i!}\] and the log-likelihood function will be:

\[\cal L(\boldsymbol{\beta}) = \sum_{i=1}^n [y_i \boldsymbol{x'\beta}_i - \exp(\boldsymbol{x'\beta}_i) - \ln y_i!]\]
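As a sanity check, we can evaluate this log-likelihood by hand for the toy accidents model fitted earlier and compare it with logLik() (a minimal sketch; y and accidents_model are from the 405-accidents example above):

#linear predictor x'beta for each observation, then the Poisson log-likelihood
eta <- predict(accidents_model, type = "link")
sum(y * eta - exp(eta) - lfactorial(y))   # lfactorial(y) = log(y!)
logLik(accidents_model)                   # should agree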

4.7 Analysis of a count data model

  • For this part, we will use the German national health registry data set as an example of Poisson regression. The data set is available in the R package COUNT. We will use only the year 1984.

  • Source: German Health Reform Registry, pre-reform years 1984-1988, from Hilbe and Greene (2008).

  • The data can be accessed by installing the package COUNT.

  • The response variable is the number of doctor visits (docvis).

  • We use outwork and age as predictors. outwork is a binary variable with 1 = not working and 0 = working. age is a continuous variable ranging from 25 to 64.

  • Since ages range from 25 to 64 and age 0 is not a valid value, we center the variable age so that the intercept is interpretable.

R code for loading the data and descriptive statistics

#you can load the data from package COUNT
library(COUNT)
data("rwm1984")
head(rwm1984)
  docvis hospvis edlevel age outwork female married kids hhninc educ self
1      1       0       3  54       0      0       1    0  3.050 15.0    0
2      0       0       1  44       1      1       1    0  3.050  9.0    0
3      0       0       1  58       1      1       0    0  1.434 11.0    0
4      7       2       1  64       0      0       0    0  1.500 10.5    0
5      6       0       3  30       1      0       0    0  2.400 13.0    0
6      9       0       3  26       1      0       0    0  1.050 13.0    0
  edlevel1 edlevel2 edlevel3 edlevel4
1        0        0        1        0
2        1        0        0        0
3        1        0        0        0
4        1        0        0        0
5        0        0        1        0
6        0        0        1        0
summary(rwm1984)
     docvis           hospvis           edlevel          age    
 Min.   :  0.000   Min.   : 0.0000   Min.   :1.00   Min.   :25  
 1st Qu.:  0.000   1st Qu.: 0.0000   1st Qu.:1.00   1st Qu.:35  
 Median :  1.000   Median : 0.0000   Median :1.00   Median :44  
 Mean   :  3.163   Mean   : 0.1213   Mean   :1.38   Mean   :44  
 3rd Qu.:  4.000   3rd Qu.: 0.0000   3rd Qu.:1.00   3rd Qu.:54  
 Max.   :121.000   Max.   :17.0000   Max.   :4.00   Max.   :64  
    outwork           female          married            kids       
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
 Median :0.0000   Median :0.0000   Median :1.0000   Median :0.0000  
 Mean   :0.3665   Mean   :0.4793   Mean   :0.7891   Mean   :0.4491  
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
     hhninc            educ            self            edlevel1     
 Min.   : 0.015   Min.   : 7.00   Min.   :0.00000   Min.   :0.0000  
 1st Qu.: 2.000   1st Qu.:10.00   1st Qu.:0.00000   1st Qu.:1.0000  
 Median : 2.800   Median :10.50   Median :0.00000   Median :1.0000  
 Mean   : 2.969   Mean   :11.09   Mean   :0.06118   Mean   :0.8136  
 3rd Qu.: 3.590   3rd Qu.:11.50   3rd Qu.:0.00000   3rd Qu.:1.0000  
 Max.   :25.000   Max.   :18.00   Max.   :1.00000   Max.   :1.0000  
    edlevel2         edlevel3         edlevel4      
 Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
 Median :0.0000   Median :0.0000   Median :0.00000  
 Mean   :0.0524   Mean   :0.0746   Mean   :0.05937  
 3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
#centering the age predictor
rwm1984$cage <- rwm1984$age - mean(rwm1984$age)

4.8 Running a count data model

  • In R we still use glm to fit a Poisson model. For the family argument we use poisson(link = "log").

And if we use centered age:

#running glm
pois_m <- glm(docvis ~ outwork + cage, family = poisson(link = "log"), data = rwm1984)
#extract the outcome by summary
summary(pois_m)

Call:
glm(formula = docvis ~ outwork + cage, family = poisson(link = "log"), 
    data = rwm1984)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) 0.9380965  0.0127571   73.53   <2e-16 ***
outwork     0.4079314  0.0188447   21.65   <2e-16 ***
cage        0.0220842  0.0008377   26.36   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 25791  on 3873  degrees of freedom
Residual deviance: 24190  on 3871  degrees of freedom
AIC: 31279

Number of Fisher Scoring iterations: 6

4.9 Interpreting Coefficients and rate ratio

  • Coefficients are on the log scale, and we can interpret them directly in terms of the log response. In general we say: the log mean of the response variable increases by \(\beta\) for a one-unit increase in the corresponding predictor, with the other predictors held constant.

  • In our results, the expected log number of visits for patients who are out of work is 0.4 higher than for working patients of the same age.

  • For each one-year increase in age, the expected log count of doctor visits increases by 0.022, holding outwork constant.

In order to express a change in a predictor in terms of actual visits to a physician, we must exponentiate the coefficients and their confidence intervals.

  • Suppose the predicted mean number of visits when outwork is 0 is \(\mu_0\) and the predicted mean number of visits when outwork is 1 is \(\mu_1\). Then,

\[ \log(\mu_1) -\log(\mu_0) = \log\left(\dfrac{\mu_1}{\mu_0}\right) = \beta_0 + \beta_1 (\text{outwork = 1}) + \beta_2 \text{ age} - (\beta_0 + \beta_1 (\text{outwork = 0}) + \beta_2 \text{ age}) = \beta_1\]

So, \(\exp(0.4079314) = 1.503704\) means that on average we expect the number of doctor visits for a person who is out of work to be about 50% higher than for a working person of the same age.

4.10 Approximation of the standard error of exponentiated coefficients (optional)

The standard error can be approximately calculated by the delta method.

Delta Method:

\[se(\exp(\beta)) \approx \exp(\beta) \times se(\beta)\]

exp(coef(pois_m))
(Intercept)     outwork        cage 
   2.555113    1.503704    1.022330 
exp(coef(pois_m)) * sqrt(diag(vcov(pois_m))) #delta method
 (Intercept)      outwork         cage 
0.0325958199 0.0283368154 0.0008564317 
exp(confint.default(pois_m))
               2.5 %   97.5 %
(Intercept) 2.492018 2.619805
outwork     1.449178 1.560282
cage        1.020653 1.024010
# sum of fit statistics for poisson model  
pr <- sum(residuals(pois_m, type = "pearson") ^ 2)
pr / pois_m$df.resid
[1] 11.34322
pois_m$aic / (pois_m$df.null + 1)
[1] 8.074027
COUNT::modelfit(pois_m)
$AIC
[1] 31278.78

$AICn
[1] 8.074027

$BIC
[1] 31297.57

$BICqh
[1] 8.07418

4.11 Marginal effects (optional)

  • Marginal effect at the mean: the expected number of doctor visits increases by about 0.066 for each additional year of age, with the predictors set at their mean values.

  • Average marginal effect: for each additional year of age, there are on average about 0.07 additional doctor visits.

  • Discrete change (or partial effect) is used to evaluate the change in the predicted response when a binary predictor changes from 0 to 1.

#marginal effect at mean
beta <- coef(pois_m)
mout <- mean(rwm1984$outwork)
mage <- mean(rwm1984$cage)
xb <- sum(beta * c(1,mout,mage))
dfdxb <- exp(xb) * beta[3]
dfdxb
      cage 
0.06552846 
#avarage marginal effect
mean(rwm1984$docvis) * beta[3]
      cage 
0.06984968 

4.12 Exposure (optional)

Sometimes the observations do not all occur over the same time period, area, or volume. In this case we can include an offset in the model to adjust event counts for differing time periods, areas, or volumes.

\[ \exp(\boldsymbol{x'\beta}) = \mu / t\]

\[ \mu = \exp(\boldsymbol{x'\beta} + \log(t))\] The term \(\log(t)\) is referred to as the offset.

  • As an example, we use the Canadian National Cardiovascular Disease registry data set called FASTRAK.
  • In this data set:

    die: number of deaths

    cases: number of cases

    anterior: whether the patient had a previous anterior myocardial infarction.

    hcabg: whether the patient has a history of a CABG procedure.

    killip: a summary indicator of the cardiovascular health of the patient (levels 1-4).

    R code for exposure

    data("fasttrakg")
    head(fasttrakg)
      die cases anterior hcabg killip kk1 kk2 kk3 kk4
    1   5    19        0     0      4   0   0   0   1
    2  10    83        0     0      3   0   0   1   0
    3  15   412        0     0      2   0   1   0   0
    4  28  1864        0     0      1   1   0   0   0
    5   1     1        0     1      4   0   0   0   1
    6   0     3        0     1      3   0   0   1   0
    summary(fast <- glm(die ~ anterior + hcabg +factor(killip) , family = poisson, 
                        offset = log(cases), data = fasttrakg))
    
    Call:
    glm(formula = die ~ anterior + hcabg + factor(killip), family = poisson, 
        data = fasttrakg, offset = log(cases))
    
    Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
    (Intercept)      -4.0698     0.1459 -27.893  < 2e-16 ***
    anterior          0.6749     0.1596   4.229 2.34e-05 ***
    hcabg             0.6614     0.3267   2.024   0.0429 *  
    factor(killip)2   0.9020     0.1724   5.234 1.66e-07 ***
    factor(killip)3   1.1133     0.2513   4.430 9.44e-06 ***
    factor(killip)4   2.5126     0.2743   9.160  < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    (Dispersion parameter for poisson family taken to be 1)
    
        Null deviance: 116.232  on 14  degrees of freedom
    Residual deviance:  10.932  on  9  degrees of freedom
    AIC: 73.992
    
    Number of Fisher Scoring iterations: 5
    exp(coef(fast))
        (Intercept)        anterior           hcabg factor(killip)2 factor(killip)3 
         0.01708133      1.96376577      1.93746487      2.46463342      3.04434885 
    factor(killip)4 
        12.33745552 
    exp(coef(fast)) * sqrt(diag(vcov(fast))) #delta method
        (Intercept)        anterior           hcabg factor(killip)2 factor(killip)3 
        0.002492295     0.313359487     0.632970849     0.424784164     0.765119601 
    factor(killip)4 
        3.384215172 
    exp(confint.default(fast))
                        2.5 %      97.5 %
    (Intercept)     0.0128329  0.02273622
    anterior        1.4363585  2.68482829
    hcabg           1.0212823  3.67554592
    factor(killip)2 1.7581105  3.45508317
    factor(killip)3 1.8602297  4.98221265
    factor(killip)4 7.2067174 21.12096274
    modelfit(fast)
    $AIC
    [1] 73.9917
    
    $AICn
    [1] 4.93278
    
    $BIC
    [1] 78.24
    
    $BICqh
    [1] 5.566187

4.13 Overdispersion (optional)

  • Overdispersion is the key problem to consider in count models.

  • As we noted for the Poisson random variable, overdispersion occurs when the variance is greater than the mean.

  • In a count model, overdispersion means that the variance of the response conditional on the predictors is larger than the conditional mean of the response.

  • There are a variety of tests to determine whether the model fits the data.

  • The first and most natural test used in Poisson regression is the deviance goodness-of-fit test.

  • The deviance is defined as twice the difference between the saturated-model log-likelihood and the fitted-model log-likelihood.

    dev <- deviance(pois_m)
    df <- df.residual(pois_m)
    p_value <- 1 - pchisq(dev, df)
    print(matrix(c("Deviance GOF ", " ", 
                   "D =", round(dev,4) , 
                   "df = ", df, 
                   "P-value =  ", p_value), byrow = T,ncol = 2))
         [,1]            [,2]        
    [1,] "Deviance GOF " " "         
    [2,] "D ="           "24190.3581"
    [3,] "df = "         "3871"      
    [4,] "P-value =  "   "0"         
  • With p < 0.05, the deviance GOF test rejects the hypothesis that the model fits the data well; the Poisson model does not fit these data.

  • The other test statistic is the Pearson goodness-of-fit test, defined as the sum of squared Pearson residuals.

  • Overdispersion can be modeled in different ways:

  • Quasi-likelihood models.

  • Sandwich or robust variance estimators.

  • Bootstrap standard errors.

  • Negative binomial 1 (NB1).

  • Negative binomial 2 (NB2).

In Poisson:
Mean = \(\mu\)
Variance = \(\mu\)

In Negative Binomial 1 (NB1):
Mean = \(\mu\)
Variance = \(\mu + \alpha \mu\), or equivalently \(\mu (1 + \alpha) = \mu \omega\)

In Negative Binomial 2 (NB2):
Mean = \(\mu\)
Variance = \(\mu + \alpha \mu ^2\)

4.14 Negative binomial examples

    # Poisson Default standard errors (variance equals mean) based on MLE
    poiss_mle <- glm(docvis ~ outwork + age, data = rwm1984, family=poisson()) 
    print(summary(poiss_mle),digits=3)
    
    Call:
    glm(formula = docvis ~ outwork + age, family = poisson(), data = rwm1984)
    
    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
    (Intercept) -0.033517   0.039181   -0.86     0.39    
    outwork      0.407931   0.018845   21.65   <2e-16 ***
    age          0.022084   0.000838   26.36   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    (Dispersion parameter for poisson family taken to be 1)
    
        Null deviance: 25791  on 3873  degrees of freedom
    Residual deviance: 24190  on 3871  degrees of freedom
    AIC: 31279
    
    Number of Fisher Scoring iterations: 6
    # Robust sandwich based on poiss_mle
    library(sandwich)
    #this will gives robust covariance matrix of estimated model
    cov.robust <- vcovHC(poiss_mle, type="HC0")
    #diagonal is the variances
    se.robust <- sqrt(diag(cov.robust))
    coeffs <- coef(poiss_mle)
    z_.025 <- qnorm(.975)
    t.robust <- coeffs / se.robust
    poiss_mle_robust <- cbind(coeffs, se.robust, t.robust, pvalue=round(2*(1-pnorm(abs(coeffs/se.robust))),5), 
                              lower=coeffs - z_.025 * se.robust, upper=coeffs+ z_.025 * se.robust)    
    print(poiss_mle_robust,digits=3)
                 coeffs se.robust t.robust pvalue   lower  upper
    (Intercept) -0.0335   0.12333   -0.272  0.786 -0.2752 0.2082
    outwork      0.4079   0.06907    5.906  0.000  0.2726 0.5433
    age          0.0221   0.00283    7.808  0.000  0.0165 0.0276
    # Poisson Bootstrap standard errors (variance equals mean)
    # install.packages("boot")
    library(boot)
    boot.poiss <- function(data, indices) {
      data <- data[indices, ] 
      model <- glm(docvis ~ outwork + age, family=poisson(), data=data)
      coefficients(model)
    }
    set.seed(10101)
    # To speed up, we use only 400 bootstrap replications.
    summary.poissboot <- boot(rwm1984, boot.poiss, 400)
    print(summary.poissboot,digits=3)
    
    ORDINARY NONPARAMETRIC BOOTSTRAP
    
    
    Call:
    boot(data = rwm1984, statistic = boot.poiss, R = 400)
    
    
    Bootstrap Statistics :
        original    bias    std. error
    t1*  -0.0335  0.007078     0.12531
    t2*   0.4079  0.000811     0.07318
    t3*   0.0221 -0.000168     0.00294
    # Poisson with NB1 standard errors (assumes variance is a multiple of the mean) w = theta*mu or w = mu*(1 + alpha)
    #also called Poisson Quasi-MLE
    poiss_qmle <- glm(docvis ~ outwork + age, data = rwm1984, family=quasipoisson()) 
    print(summary(poiss_qmle),digits=4)
    
    Call:
    glm(formula = docvis ~ outwork + age, family = quasipoisson(), 
        data = rwm1984)
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    (Intercept) -0.033517   0.131962  -0.254      0.8    
    outwork      0.407931   0.063468   6.427 1.46e-10 ***
    age          0.022084   0.002821   7.827 6.39e-15 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    (Dispersion parameter for quasipoisson family taken to be 11.34324)
    
        Null deviance: 25791  on 3873  degrees of freedom
    Residual deviance: 24190  on 3871  degrees of freedom
    AIC: NA
    
    Number of Fisher Scoring iterations: 6
    # MASS does just NB2
    # install.packages("MASS")
    library(MASS)
    # Note: MASS reports theta = 1/alpha as the dispersion parameter, not alpha
    # In this example R reports theta = 0.4353, so alpha = 1/0.4353 = 2.297
    # NB2 with default standard errors 
    NB2.MASS <- glm.nb(docvis ~ outwork + age, data = rwm1984) 
    print(summary(NB2.MASS),digits=3)
    
    Call:
    glm.nb(formula = docvis ~ outwork + age, data = rwm1984, init.theta = 0.4353451729, 
        link = log)
    
    Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
    (Intercept)  -0.0396     0.1067   -0.37     0.71    
    outwork       0.4147     0.0554    7.48  7.5e-14 ***
    age           0.0221     0.0024    9.24  < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    (Dispersion parameter for Negative Binomial(0.4353) family taken to be 1)
    
        Null deviance: 4100.8  on 3873  degrees of freedom
    Residual deviance: 3909.6  on 3871  degrees of freedom
    AIC: 16674
    
    Number of Fisher Scoring iterations: 1
    
                  Theta:  0.4353 
              Std. Err.:  0.0135 
    
     2 x log-likelihood:  -16665.5250 

4.15 Weighted least squares and generalized linear models (optional)

The R code below demonstrates the use of Weighted Least Squares (WLS) and explores its relationship with Generalized Linear Models (GLMs).

Weighted least squares

    # Set the random seed for reproducibility
    set.seed(101)
    
    # Create predictor variable x
    x <- seq(1, 3.5, 0.1)
    x <- sort(rep(x, 10))  # Replicate values for simulation
    
    # Define expected values of y
    Ey <- 220 - 60 * x  
    
    # Generate response variable y from a Poisson distribution
    y <- rpois(length(Ey), Ey)
    
    # Fit an Ordinary Least Squares (OLS) model
    m <- lm(y ~ x)
    
    # Plot the data and OLS regression line
    plot(x, y, xlab = "Price", ylab = "Number Sold")
    abline(m)

    # Residual plot for OLS model
    plot(predict(m), resid(m), xlab = "Predicted Values", ylab = "Residuals")
    abline(h = 0)

    # Estimate weights for WLS (inverse of predicted values)
    W <- 1 / predict(m)
    
    # Fit a Weighted Least Squares (WLS) model
    Wm <- lm(y ~ x, weights = W)
    
    # Residual plot for WLS model
    plot(predict(Wm), sqrt(W) * resid(Wm), xlab = "Predicted Values", ylab = "Weighted Residuals")
    abline(h = 0)

    # Fit a Poisson regression model (GLM)
    glp.m <- glm(y ~ x, family = poisson(link = "identity"))
    
    # Compare the models
    summary(m)
    
    Call:
    lm(formula = y ~ x)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -29.2760  -5.1955  -0.3457   4.6567  24.8014 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept) 222.4431     1.7288  128.67   <2e-16 ***
    x           -60.8978     0.7289  -83.55   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 8.815 on 258 degrees of freedom
    Multiple R-squared:  0.9644,    Adjusted R-squared:  0.9642 
    F-statistic:  6980 on 1 and 258 DF,  p-value: < 2.2e-16
    summary(Wm)
    
    Call:
    lm(formula = y ~ x, weights = W)
    
    Weighted Residuals:
         Min       1Q   Median       3Q      Max 
    -2.83173 -0.66938 -0.02412  0.58847  2.97862 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  223.055      1.751   127.4   <2e-16 ***
    x            -61.170      0.597  -102.5   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 0.9548 on 258 degrees of freedom
    Multiple R-squared:  0.976, Adjusted R-squared:  0.9759 
    F-statistic: 1.05e+04 on 1 and 258 DF,  p-value: < 2.2e-16
    summary(glp.m)
    
    Call:
    glm(formula = y ~ x, family = poisson(link = "identity"))
    
    Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
    (Intercept) 223.0681     1.8305  121.86   <2e-16 ***
    x           -61.1756     0.6227  -98.24   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    (Dispersion parameter for poisson family taken to be 1)
    
        Null deviance: 7382.88  on 259  degrees of freedom
    Residual deviance:  236.76  on 258  degrees of freedom
    AIC: 1817.1
    
    Number of Fisher Scoring iterations: 4

The following R code estimates logistic regression coefficients using an iteratively re-weighted least squares (IRLS) approach and compares the result to a standard GLM fit.

    # Response variable (admission outcome)
    y <- mydata$admit  
    # Predictor variable (GRE scores)
    x <- mydata$gre   
    
    # Initialize values
    mu <- rep(mean(y), nrow(mydata))   # Initial fitted values
    eta <- log(mu / (1 - mu))          # Initial linear predictor
    
    # Iterate re-weighted least squares for 10 iterations
    for (i in 1:10) {                   
      w <- mu * (1 - mu)                # Compute weights (variance)
      z <- eta + (y - mu) / w           # Adjusted response
      mod <- lm(z ~ x, weights = w)     # Weighted regression
      eta <- mod$fitted.values          # Update linear predictor
      mu <- 1 / (1 + exp(-eta))         # Update fitted values
      cat(i, coef(mod), "\n")           # Print iteration results
    }
    1 -2.783528 0.003434139 
    2 -2.899506 0.003579592 
    3 -2.901344 0.003582211 
    4 -2.901344 0.003582212 
    5 -2.901344 0.003582212 
    6 -2.901344 0.003582212 
    7 -2.901344 0.003582212 
    8 -2.901344 0.003582212 
    9 -2.901344 0.003582212 
    10 -2.901344 0.003582212 
    # Model summary and confidence intervals
    coef(summary(mod))
                    Estimate   Std. Error   t value     Pr(>|t|)
    (Intercept) -2.901344270 0.6068541966 -4.780958 2.459924e-06
    x            0.003582212 0.0009873336  3.628167 3.226024e-04
    confint(mod)
                       2.5 %       97.5 %
    (Intercept) -4.094384619 -1.708303920
    x            0.001641171  0.005523253
    # Fit standard logistic regression using GLM
    glm.mod <- glm(admit ~ gre, family = binomial, data = mydata)
    
    # Model summary and confidence intervals
    coef(summary(glm.mod))
                    Estimate   Std. Error   z value     Pr(>|z|)
    (Intercept) -2.901344270 0.6060376103 -4.787400 1.689561e-06
    gre          0.003582212 0.0009860051  3.633056 2.800845e-04
    confint(glm.mod)
                       2.5 %       97.5 %
    (Intercept) -4.119988259 -1.739756286
    gre          0.001679963  0.005552748