### Introduction

The purpose of this workshop is to highlight statistical methods that are (relatively) easy to do with unweighted data and are (relatively) more difficult to do with weighted data. There will be (at most) minimal introduction to what the various statistical techniques are; it is assumed that the audience is familiar with them, but it is OK if you are not familiar with all of them.

#### Software

The software packages discussed in this workshop will include Stata, SAS, SUDAAN and Mplus. This is not to suggest that other software packages that I will not be discussing today could not be used to run some of the analyses. Specifically, I will not be covering the use of SPSS, R, MLwiN, IVEware, AM, HLM or LISREL, although all have capacity to analyze complex survey data and may be able to perform some of the analyses shown in this workshop. There is no expectation that audience members are familiar with all of the software packages that will be mentioned in this workshop. Rather, the point is to give audience members a sense of what software packages will do what tasks.

No output from any of the procedures is shown because the point of this workshop is to discuss issues that need to be considered, rather than interpreting the results of the analysis.

#### Topics that will not be covered

- Bayesian
- Running multilevel models after using a data augmentation technique
- Imputing MNAR data
- Geospatial survey data
- IRT
- Multilevel SEM
- Combining datasets with different sampling plans (or combining weighted data with administrative data)

#### Example datasets

NHANES continuous 2015-2016 demographic and bmx (used for most examples), General Social Survey 2018 (used for latent variable examples) and the PISA2000 data (for the weighted multilevel models).

Notice that after the merging of the demographic and bmx NHANES data, some cases have a sampling weight of 0. Some programs simply “ignore” these cases while others generate an error message and won’t run until such cases are removed. In general, removing cases from a weighted dataset is a no-no, but when the sampling weight for the case is 0, it can’t be used in any analysis anyway. If such cases are left in the dataset and the procedure will run without complaining about them, they can contribute to degrees of freedom. This is often not desirable because the cases aren’t used in the analyses, so deleting them is a good option.

There should not be missing values in the PSU or strata variables.

### Weighty issues about sampling weights

#### Normalized sampling weights

Normalized sampling weights have a mean of approximately 1. You can tell when you have them by simply getting the mean of the sampling weight variable (as if it was a regular variable in an unweighted dataset). There are many reasons why a sampling weight might be normalized, such as historical reasons, preventing researchers from making estimation errors (LA FANS), and allowing comparisons across different populations (DHS, GSS). The sum of the normalized weight is the sample size, not the estimated population size. Population totals cannot be estimated with normalized sampling weights.

To calculate a normalized sampling weight, you divide the final sampling weight by the mean of mean of the final sampling weight for the entire sample:

Final sampling weight/ (mean of final sampling weight for entire sample)

Despite being calculated differently than population sampling weights, normalized sampling weights are used in exactly the same way as population sampling weights.

The scaling of the weights in a single-level model doesn’t matter unless you are trying to estimate totals. You can see this for yourself: open your favorite complex dataset, create a new weight variable that is the original final weight variable multiplied by 100, and then take the mean of a variable using the original final sampling weight and the new weight. You will see that the mean of the variable is the same (but the estimate of the population size is not!).

Because normalized weights have little impact on the estimation of point estimates, and the analyst may choose to not use it. However, whether or not the normalized weights are used, robust (or cluster robust) standard errors should be used.

Reviewers may not make the distinction between normalized and non-normalized sampling weights, and you may need to use normalized weights to get your article published.

(Applied Survey Data Analysis, Second Edition (2017) by S. Heeringa, B. T. West and P. A. Berglund, pages 100 and 102, Survey Weights: A Step-by-step Guide to Calculation by Richard Valliant and Jill A. Dever, 2018, pages 133-134)

#### Difference between sampling weight and analytic weight

By now we are familiar with a sampling weight, but there are other types of weights. An analytic weight tries to quantify precision. It is often used with the observations are means. I have seen several different methods for calculating such weights, but in general, the weights are the inverse of the variance of the observation. This addresses the issue of precision because means that are calculated from larger Ns should have smaller variances. When you weight by the inverse of the variances, you are up-weighting observations with smaller variances (those with greater precision) and down-weighted observations with larger variances (those with less precision).

The sampling weight and analytic weight will produce the same point estimates but different standard errors. The reason that this is important is because there are some commands in Stata that you may want to use with weighted data, but the command accepts aweights but not pweights.

For more information, please see Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling, Second Edition by Tom A. B. Snijders and Roel J. Bosker, 2012, pages 219-221.

### Design-based versus model-based

#### The design-based approach

This approach is based on Neyman’s (1934) paper, and it assumes that the primary source of randomness is the probability that comes from the selection of units from the population of interest. In other words, the variable of interest, call it y, is treated as fixed in a finite population, and statistical inference is based on the distribution of estimates repeatedly sampled from the population. This is the approach with which most analysts are familiar. The point estimates, such as means, regression coefficients, etc., are weighted by the sampling weight. The standard error around the point estimate is adjusted with respect to the clustering and/or stratification that was used in the collection of the sample data. In other words, instead of using the formulas that are used when the data have been collected via a simple random sample (SRS), different formulas are used to account for the way the data were sampled. It should be noted that all descriptive statistics must be calculated using design-based methods.

#### The model-based approach

This approach works when the sampling weight is not associated with the outcome variable. When this is true, the sampling weight can be ignored. The clustering and/or stratification variables (or the variables upon which they are based) are used in the model (e.g., a regression model). To be clear, the models are run using the formulas that are used when the data are collected via a simple random sample. However, “The inclusion mechanism is generally difficult to model, and vulnerable to model misspecification.” Little, Statistical Journal of the IAOS, 2015, page 556. In other words, before deciding to use a model-based approach, you must be sure that the variables that are needed are included in the dataset.

The model-based approach assumes a superpopulation and that the population from which the sample was drawn is just one possible realization of the superpopulation.

The article by Tianji Cai, 2013, Investigation of ways to handle sampling Weights for Multilevel Model Analyses (Sociological Methodology) has a very good paragraph on the model-based approach. A good discussion about design-based versus model-based estimation is on pages 221-231 of Multilevel analysis: An Introduction to Basic and Advanced Multilevel Modeling, Second Edition, 2012.

#### The model-assisted approach

The model-assisted approach combines some aspects of both the design-based approach and the model-based approach. It is often used when neither of the traditional approaches is sufficient for the task at hand. For example, when using multilevel modeling with weighted data, the weights at each level of the model are incorporated into the calculation of the point estimates, but the clustering is accounted for by the multilevel model.

“The “status quo” in current official statistics, which might be termed the “design/model compromise” (DMC[39]), favors design-based inference for descriptive statistics like means and totals based on large probability samples, and model-based inference for questions that are not well addressed by the design-based approach, such as small area estimation, survey nonresponse and response errors…. The approach is pragmatic, but entails a degree of “inferential schizophrenia”.” (page 557).

Of course, there are pros and cons to each of the approaches. One of the disadvantages of the design-based approach is that it requires large sample sizes, and that is often a difficult requirement when analyzing subpopulations.

One of the disadvantages of the model-based approach is that there are often dozens of strata and/or clusters, and adding such a large categorical variable

### Weighted correlations

Correlations are simple to run in non-weighted data, so many researchers think that the same should be true when analyzing weighted data, but sadly, it is not so. The difficulty is in the calculation of the standard error (and hence the p-value), not in the calculation of the correlation coefficient itself. Before we discuss what software can be used, let’s think of the different ways a Pearson correlation coefficient can be calculated. There are at least three ways:

1. Use the formula

2. Standardized regression coefficient (from a simple linear regression)

3. Standardized covariance

The current version of SUDAAN can run a weighted correlation. Stata has a few options, including a community-contributed program in Stata (**corr_svy**) and the commands **corr** and **pwcorr**, while will allow the use of an aweight. Posts on the Stata list note that the **sem** command will produce standardized regression coefficients, and such a coefficient is a correlation coefficient in a simple linear regression. Both SAS and Mplus can also produce a standardized linear regression coefficient with weighted data. While the point estimates are the same using these different techniques and software packages, the standard errors are not, and hence the tests of statistical significance are not.

Another issue that arises is the handling of missing data. If you are using SEM in Stata or Mplus, there is the option of using FIML.

Software package/method | Correlation coefficient | Standard error | p-value |

Stata corr with aweight | 0.9423 | ||

Stata pwcorr with aweight | 0.9423 | 0.0000 | |

Stata corr_svy | 0.9423 | 0.0000 | |

Stata svy: sem (bmxwaist <- bmxwt), standardized | .9423236 | 0.000 | |

Stata svy: sem (bmxwt <- bmxwaist), standardized | .9423236 | 0.000 | |

Stata svy: sem (<- bmxwaist bmxwt), standardized | .9423236 | .0029802 | 0.000 |

SAS proc surveyreg bmxwaist = bmxwt | 0.94232 | <.0001 | |

SAS proc surveyreg bmxwt = bmxwaist | 0.94232 | <.0001 | |

SUDAAN proc vargen | 0.942324 | 0.002980 | 0.000000 |

Mplus using “with” | 0.948 | 0.003 | 0.000 |

Mplus using “on” | 0.942 | 0.000 |

### Weighted quantiles

Quantiles are another type of estimate that seems like it should be easy to get but is not. Quantiles are estimated in only a few software packages, including SAS, SUDAAN and WesVar. There is a “hack” in Stata by using aweights with the **summarize, detail** command, but the standard errors will be wrong (although those are not shown in the output). You can also try using the **qreg** command with a pweight, but there is no way to account for the clustering and stratification. The output will contain robust standard errors, but compared to the results from SAS and SUDAAN, those seem to be underestimated. Another point of concern is that the standard errors of the quantiles are often not consistent across software packages (although they are similar). In general, the standard errors tend to be quite large. Some research suggests that model-based standard errors are more realistic than design-based standard errors (see West, Sakshaug and Aurelien (2018) page 733).

### Weighted linear multilevel models

The ability to run weighted multilevel models is relatively new. However, most statistical software (including SAS, Stata, Mplus, R, MLwiN, LISREL and HLM) can now handle weighted multilevel models (both linear and non-linear). However, the distributors of the data are still catching up, meaning that all of the necessary sampling weight variables may not be in the dataset. Also, distributors of some datasets worry that making sampling weights at different levels publicly available may compromise the confidentiality of the respondents. What does this mean?

The “trick” is that there MUST be a sampling weight at each level of the model. This is because the level one sampling weight enters into the equation for the pseudo-likelihood at a different place than the level two weight. This means that you cannot use an aggregated sampling weight in a weighted multilevel model because there is no way to disaggregate the sampling weight into the levels of the model.

The level two sampling weight must follow the same “rules” as any level two variable: namely, it must be a constant within each value of the level two identifier.

In general, the PSU is the level two identifier. Some researchers find this to be very restrictive, because the dataset usually doesn’t contain interesting predictor variables that were collected at the level of the PSU. Some researchers try to create a level two predictor by aggregating a level one variable, but that may create its own set of problems. Also, the PSU is often not an interesting unit of analysis.

When running a linear multilevel model, the scale of the level one sampling weight needs to be considered. Good sources include the entry in the Stata documentation (ME – mixed – Remarks and Examples – Survey Data) and a 2006 article by Tihomir Asparouhov titled General Multilevel Modeling with Sampling Weights. An article by Carle (Fitting Multilevel Models in Complex Survey Data with Design Weights: Recommendations, BMC Medical Research Methodology, 9, 49, 2009) has SAS and Stata code for rescaling the level 1 weights in two ways (two of the ways described in the Stata documentation and implemented in Mplus).

When a single final sampling weight is included in a dataset, it was calculated by multiplying together the inclusion probabilities from each stage of sampling. For a two-level multilevel model, what is needed are two sampling weights. The sampling weight for level 2 is the probability of selecting that cluster or PSU, and the sampling weight for level 1 is the probability of selecting that unit (or individual) given that the cluster has been sampled. Some datasets are released with these conditional level 1 weights; others are released with unconditional sampling weight. For some rescaling options, you can use either the conditional or the conditional level 1 sampling weights; for other rescaling options, only a conditional level 1 sampling weight can be used. Unconditional level 1 sampling weights can be made conditional by dividing by the level 2 sampling weight.

Both Stata’s **mixed** command and Mplus have options for scaling the level 1 weights. Stata offers three options: **size**, **effective** and **gk**. Mplus also offers three options: **unscaled**, **cluster** and **ecluster**. The **size** option in Stata corresponds to the **cluster** option in Mplus; the **effective** option in Stata corresponds to the **ecluster** option in Mplus. These options can be used with either conditional or unconditional level 1 sampling weights; Stata’s **gk** option can only be used with conditional level 1 sampling weights. The **size**/**cluster** options scale the level 1 sampling weight to sum to the cluster size. The **effective**/**ecluster** options include a dependence on the sum of the squared weights so that the level 1 sampling weights sum to the “effective” sample size.

Determining which scaling to use is difficult, and there is no “rule of thumb” to follow. The current suggestion is to try more than one rescaling method and compare the results (AKA do a sensitivity analysis). Much less has been written about the need to scale the level one sampling weight in a weighted non-linear multilevel model. In a three-level model, both the level 1 and the level 2 weights need to be rescaled. Some suggest that even the weights at the highest level need to be scaled.

After all of this, it may be tempting to simply ignore the sampling weights all together. If the sampling weights are normalized, this may be a plausible alternative. After all, rescaling a normalized sampling weight has little effect. However, if the sampling weight is not normalized, then they need to be incorporated into the model. “If the sampling weights are ignored at either level the parameter estimates can be substantially biased.” (Asparouhov, page 439).

Before we move on, please note that SAS’s **proc glimmix** will also run weighted multilevel models. However, there are no options for scaling the level 1 sampling weight, so that will need to be done in a data step before running the multilevel model. Also, there is no option for including the stratification into the model, unless you add that variable as a categorical predictor (by putting it on the **class** statement), and there is no option for specifying a subpopulation. Although **proc plm** was not intended for use after the survey procedures, it should work after running a weighted model in **proc glimmix**. As discussed in the Applied Survey Data Analysis page, Stata’s **margins** command will work after commands run with the **svy** prefix (remember to use the option **vce(unconditional)** and respecify the subpopulation specification if there was one in the model).

Let’s cover a few other limitations of weighted multilevel models. First, because the weighted multilevel model is estimated with a pseudo-likelihood, models cannot be compared the way that unweighted multilevel models can be (i.e., using a likelihood ratio chi-squared test). Also, assessing the model fit of an unweighted multilevel model is difficult, as there are no widely agreed upon methods for this. At this time, there are no methods available for assessing the fit of a weighted multilevel model. There are also no agreed-upon methods for assessing the assumptions of a weighted multilevel model.

#### Informativeness of sampling weight

According to Asparouhov (2006, page 450), “The quality of the MPML [multilevel pseudo maximum likelihood] estimates is driven primarily by two factors, the sample size of the clusters and the degree of informativeness of the selection mechanism. While the sample size is an explicit variable, the informativeness is not directly measurable.”

Informativeness = |weighted parameter est – unweighted parameter est| / square root of unweighted variance of Y

Asparouhov (2006), among many others, suggests that the informativeness of the sampling weight should be checked for all outcome variables. The idea is that if the sampling weight is not informative, it should not be used. However, there is no indication of what value would constitute an informative sampling weight.

### Handling of missing data

In some ways, the handling of missing data in a weighted dataset is just like handling missing data in a non-weighted dataset. Let’s start with the similarities.

First, when taking about data augmentation, or filling in missing values, we are talking about imputing item non-response, meaning that the respondent agreed to complete the survey but then did not answer all of the questions. We are not considering the situation in which the selected respondent declined to participate in the survey; that correction is handled in the calculation of the final sampling weight.

Of course, the best kind of missing data is no missing data. If there are missing data, knowing the mechanism that generates the missing data is critical because the methods available for handling the missing values depends on the mechanism that generated the missing values. There are three mechanisms: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). You can read about them here: https://stats.idre.ucla.edu/sas/seminars/multiple-imputation-in-sas/mi_new_1/ .

As with non-weighted data, a weighted dataset can have all three types of missing data within the dataset. In other words, just because one variable is missing completely at random does not mean that all of the variables in the dataset are missing completely at random. Also, you need to know, usually from reading the documentation for the dataset, what type of mechanism generates the missing data. If that is not clear from the documentation, you may need to contact the distributors of the data.

Now let is discuss what is different about handling missing data in a weighted dataset.

With non-weighted data, the (currently) most commonly method used to impute missing data is multiple imputation. This can be done in Stata with weighted data in two ways. However, when imputing weighted data, the (currently) most popular method is hotdeck. Now remember that we are talking about ways that imputing missing data in a weighted dataset is different from imputing data in a non-weighted dataset, so don’t panic! Hotdeck imputation in weighted data is very different from the single-imputation hotdeck procedure that fell out of favor years ago.

So what is wrong with simply replacing missing values with the mean (or median) of the non-missing values? There are at least two major problems. One is that the variability in the variable being imputed is reduced (because you have filled in all of the missing values with a constant, regardless of how that constant was obtains), and possibly biased estimates if the mean of the non-missing data is different from the mean of the variable in the population. The second problem is the inability to tell the imputed values from the values provided by respondents (AKA not accounting for the imputation variance). The problem of reduced variability in the variable can be addressed by imputing different number for the different cases that have missing values. This is what hotdeck does. However, the problem of not accounting for the imputation variance remains.

One other important difference: the type of imputation method that you can use depends on the software you are using. Only Stata will do multiple imputation with weighted data, and only SAS will do hotdeck that accounts for both the sampling plan variance and the imputation variance. (SUDAAN will do a single-imputation hotdeck, but it does not account for the imputation variance, so this will not be discussed further.)

#### Multiple imputation

Let’s start with a weighted multiple imputation in Stata. This involves every step and consideration that you would have with non-weighted data, including finding a good imputation model. This may be more difficult with complex survey data than with non-weighted data for at least two reasons. The first is that many of the informative variables that you would want to use in the imputation equation are not released with the publicly-available data. Second, many of the variables that are included in the publicly-available data are modified in some way, such as being top-coded or bottom-coded, categorizing a continuous variable, or reducing the number of categories in a categorical variable. These changes to the variables are necessary to preserve the confidentiality of the respondents, but it means that there is less information in these variables because information is necessarily lost when these coding changes are made.

There are two ways that complex survey data can be imputed in Stata, and these correspond to the two different philosophies about that analysis of weighted data. If you want to follow the designed-based philosophy, then you can use the **svy** prefix (after issuing the **svyset** command, of course) when specifying your imputation model. If you want to follow the model-based philosophy, you can omit the **svy** prefix and use the survey elements as predictors in your imputation model, along with the variables from your analysis model and your auxiliary variables. Although there are many statisticians who recommend this approach, two points of caution are warranted. First, putting in the strata and PSU variables as categorical predictors in your imputation model can be problematic if there are lots of levels of one or both of these variables. For example, if you have 80 strata and 160 PSUs, you may find that your model does not converge, especially if you are trying to impute binary variables. If, on the other hand, you decide to include the variables on which the strata and PSU variables were based, you may find that not all of those variables were included in the publicly-available dataset.

#### Hotdeck imputation

Although many researchers are very familiar with multiple imputation from experiences with non-weighted data, among those working with weighted data, the preference is for using hotdeck rather than multiple imputation. Why? Most of the research in this area is being done by the people who need and use these procedures the most: the good folks at the US Census and their colleagues working on other nationally-representative datasets. Multiple imputation has two major downsides: one is that with each additional imputed dataset released to the public, the greater the chance of a disclosure of a respondent’s identity. Secondly, current research suggests that many imputed datasets are necessary to get good variance estimates, and releasing 50+ imputed datasets for each of the data products that these departments and bureaus produce is just unwieldy. Hotdeck offers a single solution to both of these problems. Some forms of hotdeck imputation allows the imputation to be done with a single dataset by adding new variables to that dataset, and because these new variables are replicate weights, they do not increase the disclosure risk for any of the respondents. So, let’s be clear: When doing hotdeck imputation with weighted data, the information regarding the imputation variance needs to be considered. One way to do this is to create a series of replicate weight variables that are added to the dataset (or modifying the replicate weights that were released with the dataset). These replicate weight variables are then used in all analyses of the data. If the original dataset contained PSU and strata variables, those are replaced in the imputed dataset with replicate weights that contain the information for both the sampling variance and the imputation variance; if the original dataset contained replicate weights, they are replaced in the imputed dataset with new replicate weights that contain the information for both the sampling variance and the imputation variance.

So let’s discuss some of the specifics regarding the imputation of complex survey data with hotdeck imputation. In hotdeck imputation, a missing value in one variable is replaced with an observed value from the same variable. The case with the missing value is called the recipient, and the case with the observed value is called the donor. The goal is to find donors who are similar to the recipient. Why is this method called “hotdeck”? The term hearkens back to the days of mainframe computers and punch cards. The donors come from the cards that are currently being processed, so they are “hot”. (Cold deck imputation refers to imputation of missing values in the current dataset with values from an external dataset (called “cold” because it was not currently being processed.)

Analysts like hotdeck imputation for several reasons. First, only plausible values can be imputed (because the values replacing the missing values come from data other respondents provided). Second, hotdeck is less sensitive to model mis-specification than other imputation techniques. Third, hotdeck can be extended to multivariate missingness, thus preserving associations between the variables.

There are four basic steps in hotdeck imputation:

- Create a set of one or more possible donors for each recipient;
- Select a donor for each recipient;
- Use the donor’s observed value to fill in the recipient’s missing value;
- Analyze the complete data, taking special care to handle the variance estimation appropriately (i.e., use the replicate weights).

These steps certainly sound simple, but there are dozens of variations on the hotdeck (steps 1 through 3), and as with multiple imputation, there is not always clear guidance on which methods to use.

When trying to find donors, you can use the naïve approach and assume that every other respondent in the dataset is a potential donor, or you can create a donor pool based on auxiliary information. The potential donors in this pool should be similar to the recipient in important ways. As you might expect, there are many ways to create this donor pool. One of the simplest and most common methods is to create imputation (AKA adjustment) cells based on one or more categorical variables. Of course, these categorical variables must be fully observed (have no missing values). If one of the categorical variables you want to use has no missing data and the others do, you can start with the variable that has no missing values, impute the other categorical variables, and then use those to do the rest of the imputation. Of course, there are some drawbacks to this method.

- You must categorize continuous variables;
- You can quickly end up with too many cells, some (or many) of which contain too few or no donors;
- Ad hoc methods of collapsing cells when the aforementioned difficulty arises;
- A sparseness of donors can lead to the over-usage of some donors.

When selecting a donor, there are deterministic and random methods that can be used. The deterministic method means that you impute the missing value from the one closest donor. One or more auxiliary variables are used to determine the closest donor, and this is called the nearest neighbor method. Random selection means that you randomly select a donor from a pool of potential donors, and of course, there are many different ways to make the selection. One option is to assign an equal probability of selection to all donors. Another option is to use a probability proportional to the donor’s weight. A third option is called sequential selection, which uses both the donor’s and the recipient’s weights. The first two options are available in SAS’s **proc surveyimpute**.

Within the topic of donor selection, there are a variety of methods, including weighted hotdeck and fractional imputation, among others. We will not be covering these methods in detail here, but it is important to understand that there is no one method you should always prefer.

Let’s assume that we have selected a donor for our recipient with missing data. Now what do we do? Again, there are options. You could use direct substitution and simply copy the donor’s observed value over the recipient’s missing value. There are also ratio methods that use a covariate in addition to the variable being imputed.

So to recap, we have two options for donor selection: nearest neighbor and random hotdeck, and we have two options for what to impute: direct substitution or ratio. Hence, there are four possible combinations, and all are legitimate.

With respect to variance estimation, there are some issues to be considered, but the bottom line is that the variance introduced by the data augmentation technique needs to be taken into account in some way. With hotdeck imputation, one of the ways to do this is with replicate weights that are created (or modified if they already exist in the dataset). The replicate weights are easy to use in SAS or any other software, except SPSS.

We will not be discussing the details of a popular method of hotdeck imputation called fully efficient fractional imputation (fefi).

We will not be talking about imputing missing data in the multilevel modeling context, as that is far more complicated. However, if your analysis plan includes running a multilevel model, then you need to make sure that the software you plan to use for the multilevel can also handle the type of imputation you have done. For example, if you use SAS’s **proc surveyimpute** and create replicate weights, you will not be able to use those replicate weights with **proc glimmix**.

Finally, in general, the imputed dataset is more appropriate for describing relationships between variables rather than obtaining exact parameter estimates.

### Propensity scores with weighted data

As with other topics in this workshop, we assume that you are familiar with the creation and use of propensity scores when working with non-weighted data. Below are a few of necessary definitions and the general steps needed when creating propensity scores. This is not intended to be a thorough or complete discussion, but just a quick refresher for those who may find it useful.

Before we get to the definitions, let’s consider the reasons propensity scores could be used with complex survey data. One reason is to try to make causal inferences with such data. While this is being done more frequently, a number of statisticians point out that this is a stretch with observational data collected at a single time point. Although this is a valid and reasonable point, there are datasets that may be conducive to a causal type of analysis with the help of propensity scores. Another reason that propensity scores may be calculated with complex survey data is to address non-response bias. We will not be discussing the use of propensity scores in this context.

#### Definitions

Propensity score: The probability that an individual received the treatment (based on observed characteristics)

ATE: The average effect for all individuals (treated and control) (Average Treatment Effect)

ATT: The average effect for individuals who received the treatment (Average Treatment of the Treated)

SATE: Sample ATE

SATT: Sample ATT

PATE: Population ATE

PATT: Population ATT

#### Assumptions: Strong ignorability

- Positivity (sufficient overlap): any person could be assigned to either the control or treatment condition, and the distribution of covariates at baseline is unique to the condition.
- Unconfounded treatment (no unobserved confounders): all covariates that may impact both the treatment status and the outcomes are measured at baseline.

#### Estimation

Logistic regression is the most commonly-used method to create propensity scores (which are the predicted probabilities from the logistic regression model), although non-parametric methods exist, such as random forests or generalized boosting methods. Design-based methods are available only for logistic regression.

#### Steps

- Decide which value you want to estimate (sample or population ATE or ATT)
- Decide which propensity score method you will use (for example, propensity score matching, propensity score subclassification, propensity score weighting or covariate adjustment using the propensity score)
- Decide if/how to use the survey elements in the propensity score model and/or the analysis model.

The question with complex survey data is “Should the sampling elements should be used in this process, and if so, when and how?” This is a relatively new area of research, but there are a few papers that address this topic. The recommendations are not consistent across the papers, which is to be expected given the early stages of research into this area of statistics. There is debate about whether to use a design-based or model-based approach when creating the propensity scores. The only real consensus is that a design-based model should be employed when running the analysis model.

### Weighted latent variable models

Again, we will not be discussing the specifics of any type of latent variable model, just the more specific issue of these types of models run on complex survey data. Stata and Mplus are the two software packages most widely used for these types of analyses, and fortunately, they have done a good job of making it a fairly straight forward process. As a matter of workflow, I suggest that you start with a simple model that does not contain any of the survey elements. Ensure that the model will run without error, but don’t bother looking at the results. Next, add in the sampling weight and run the model. If the model runs without error, then include the PSU and/or strata variable(s). I suggest this process because if there is a problem with convergence, you want to know what caused the problem. A model that won’t converge without the sampling weight is very unlikely to converge with the sampling weight. Sometimes it is the PSU and/or strata variable(s) that cause an estimation problem. In any event, it is important to know what is causing the problem so that reasonable steps can be taken to address the issue.

In general, most articles that discuss weighted latent variable models use a design-based approach.

One of the main points here is that, like the regression models, the syntax is very much like that of models using non-weighted data. If you are using Stata, you can simply add the **svy** prefix to your **sem** or **gsem** code and run it. However, many of the postestimation commands do not work after models run with the **svy** prefix. If you are using Mplus, you can add the weight, cluster and stratification statements to your code to incorporate the survey elements into the analysis. As with Stata, some of the output that you would have seen with non-weighted data is not given with weighted data.

### References

#### General

Heeringa, Steven G., West, Brady T. and Berglund, Patricia A. (2017). Applied Survey Data Analysis, Second Edition. CRC Press: New York.

Lehtonen, Risto and Pahkinen Erkki. (2004). Practical Methods for Design and Analysis of Complex Surveys, Second Edition. Wiley: England.

Sampling weights (especially normalized weights):

Valliant, Richard and Dever, Jill A. (2018). Survey Weights: A Step-by-step Guide to Calculation. Stata Press: College Station, TX.

#### Weights

Snijders, Tom A. B. and Bosker, Roel J. (2012). Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling. London: Sage.

Design-based v. model-based:

Little, Roderick J. (2015). Calibrated Bayes, An Inferential Paradigm for Official Statistics in the Era of Big Data. Statistical Journal of the IAOS; 31: 555-563.

Skinner, Chris and Wakefield, Jon. (2017). Introduction to the Design and Analysis of Complex Survey Data. Statistical Science; 32(2): 165-175.

West, Brady T., Sakshaug, Joseph W. and Aurelien, Guy Alain S. (2018). Accounting for Complex Sampling in Survey Estimation: A Review of Current Software Tools. Journal of Official Statistics; 34(3): 721-752.

#### Multilevel models

Asparouhov, Tihomir. (2006). General Multilevel Modeling with Sampling Weights. Communications in Statistics: Theory and Methods; 35(3): 439-460.

Cai, Tianji. (2013). Investigation of Ways to Handle Sampling Weights for Multilevel Model Analyses. Sociological Methodology; 43(1): 178-219.

Carle, Adam C. (2009). Fitting Multilevel Models in Complex Survey Data with Design Weights: Recommendations. BMC Medical Research Methodology; 9(49).

Laukaityte, Inga and Wiberg, Marie. (2018). Importance of Sampling Weights in Multilevel Modeling of International Large-scale Assessment Data. Communications in Statistics: Theory and Methods; 47(20): 4991-5012.

Pfeffermann, Danny, Moura, Fernando Antonio Da Silva, Silva, Pedro Luis Do Nascimento. (2006). Multi-level Modelling Under Informative Sampling. Biometrika; 93(4): 943-959.

Rabe-Hesketh, Sophia and Skrondal, Anders. (2006). Multilevel Modelling of Complex Survey Data. Journal of the Royal Statistical Society; 169(4): 805-827.

#### Imputation methods

Andridge Rebecca R. and Roderick J. Little. (2009). The Use of Sample Weights in Hot Deck Imputation. Journal of Official Statistics; 25(1): 21-36.

Bell, Bethany A., Kromrey, Jeffrey D., and Ferron, John M. (2009). Section on Survey Research Methods, JSM 2009.

#### Propensity scores

Austin, Peter C., Jembere, Nathaniel, and Chiu, Maria. (2018). Propensity Score Matching and Complex Surveys. Statistical Methods in Medical Research; 27(4): 1240-1257.

DuGoff, Schuler, Megan, and Stuart, Elizabeth A. (2014). Generalizing Observational Study Results: Applying Propensity Score Methods to Complex Surveys. Health Services Research; 49(1), 284-303.

Lenis, David, Nguyen, Trang Quynh, Dong, Nianbo, Stuart, Elizabeth A. (2019). It’s All About Balance: Propensity Score Matching in the Context of Complex Survey Data. Biostatistics; 20(1): 147-163.