What to expect from this workshop
Definition of systematic review and meta-analysis
Information on how to collect data for a systematic review and/or meta-analysis
How to organize data for a meta-analysis
How to run a meta-analysis and interpret the results
How to make some useful graphs
A brief discussion of the different types of biases that may compromise the results of a meta-analysis
Resources for further information
What not to expect
This is not a hands-on workshop; we will not be running the commands today
This is not an introduction to the use of Stata software
How to access and/or use electronic databases
Will not cover any of the more advanced topics, such as multiple imputation of missing data or multilevel meta-analysis
Assumptions
I assume that you are familiar with basic descriptive statistics, such as means and standard deviations; what an odds ratio is; what heterogeneity means.
Remember that this is a brief introduction to this topic, so we will only briefly touch on many important topics that might be worthy of an entire workshop unto themselves.
Introduction
Here’s an experience we can all relate to: you read about a study on your favorite news website, or hear about it on TV or the radio. Some new drug, or treatment, or something, has been shown to do something. Almost always, there is a quote from one of the study’s authors saying that “this research needs to be replicated with more subjects before anyone should act on the results.” And we all nod our heads, because we know that replication is an important part of the foundation of the scientific method. So let’s say that this topic is something that you really care about, and you wait to hear more results from the replication studies. You find that some of the studies replicated the results while others did not (i.e., failure to replicate results). Now what do you do? You may decide that you are going to find all of the studies on this research question and count up how many found significant results and how many did not. So you do this, but then you notice something that seems a little odd: the greater the number of participants in a study, the more likely the study was to find a significant result. Now to a statistician, this is not surprising at all. In statistics (at least as it is practiced today) “statistical significance” means that the p-value associated with the test statistic is less than 0.05. And most statisticians realize that this p-value is closely related to the number of participants, which is usually called N. So, in general, as N goes up, the p-value goes down, holding everything else (e.g., alpha, effect size) constant.
But this doesn’t really answer your question. Adding more participants to a study doesn’t make a treatment more or less effective. What you want to know is if the treatment matters in the real world. When you ask a statistician that question, what the statistician hears is “I want to know the effect size.” According to Wikipedia, “in statistics, an effect size is a quantitative measure of the magnitude of a phenomenon.” (https://en.wikipedia.org/wiki/Effect_size ). So for example, let’s say that I invented a new drug to lower blood pressure. But if my drug only reduces blood pressure by one point, you say “so what, your effect size is too small to matter in the real world.” And you would be correct, even if I ran a huge study with 1000s of participants to show a statistically significant effect. Later on in this presentation we will cover many different types of effect sizes, but the point here is that you are interested in the size of the effect, not in the statistical significance.
What you did by collecting information from many studies that tried to answer the same research question was a type of meta-analysis. So a meta-analysis is an analysis in which the observations are effect sizes reported in other research, usually published research. Of course, to have “an apples to apples” comparison, you want each of the studies to be addressing a similar, if not the same, research question. You want the outcome measures used to be similar or the same, and the comparison group to be the same. Other things, such as the number of participants, need not be similar across studies.
A systematic review is very similar to a meta-analysis, except the effect sizes (and other values) are not collected from the articles, and hence there is no statistical analysis of these data. Rather, the goal is to give a descriptive summary of the articles. I think of a systematic review as a qualitative version of a meta-analysis.
The dataset that will be used for the examples in this workshop is fictional and contains the means, standard deviations and sample sizes for both an experimental group and a control group. These fictional data are based on real data that were used in a meta-analysis in a published paper. The research involved the use of supportive interviewing techniques; more details regarding this are given a little later. Although the data in the examples shown in this workshop are fictional, I am using the research question that was used in the published work. I decided to do this because it provides a realistic example of the workflow and issues that arise when doing a meta-analysis for publication.
Four related quantities
We need to pause briefly to have a quick discussion about power. There are four related quantities:
 Alpha: the probability of rejecting the null hypothesis when it is true; usually set at 0.05
 Power: the probability of detecting an effect, given that the effect really does exist; either sought (when conducting an a priori power analysis) or observed (after the data have been collected)
 Effect size: quantitative measure of the magnitude of a phenomenon; estimated (when conducting an a priori power analysis) or observed (after the data have been collected)
 N: the number of subjects/participants/observations who participated in a primary research study or are needed for such a study
You need to know or estimate three of the four quantities, and the software will calculate the fourth for you. Throughout this workshop, we will be discussing the relationship between these four quantities. Let’s quickly look at a few examples:
 Hold alpha and effect size constant: As N increases, power will increase
 Hold alpha and power constant: As effect size increases, N decreases
 Hold alpha and N constant: As effect size increases, probability of finding a statistically significant effect increases
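These relationships can be checked numerically. The following is a minimal Python sketch, not a Stata command. It assumes a two-sided, two-sample comparison of means with equal group sizes and a normal approximation to the test statistic. The function names approx_power and n_per_group are illustrative only.

```python
from math import ceil, sqrt
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    d is the standardized mean difference; both groups are assumed
    to contain n_per_group participants.
    """
    z_crit = _Z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)
    # Probability of landing in either rejection region.
    return _Z.cdf(shift - z_crit) + _Z.cdf(-shift - z_crit)


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate number of participants needed per group."""
    z_crit = _Z.inv_cdf(1 - alpha / 2)
    z_power = _Z.inv_cdf(power)
    return ceil(2 * ((z_crit + z_power) / d) ** 2)


# Hold alpha and effect size constant: power rises with N.
for n in (20, 50, 100):
    print(n, round(approx_power(0.5, n), 3))

# Hold alpha and power constant: required N falls as effect size grows.
for d in (0.2, 0.5, 0.8):
    print(d, n_per_group(d))
```

Dedicated power software uses the t distribution rather than the normal approximation, so its answers will differ slightly from this sketch, especially for small N.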
Guidelines
In many ways, meta-analysis is just like any other type of research. The key to both good research and good meta-analyses is planning. To help with that planning, there are published guidelines on how to conduct good systematic reviews and meta-analyses. It is very important that you review these guidelines before you get started, because you need to collect specific information during the data collection process, and you need to know what that information is. Also, many journals will not publish systematic reviews or meta-analyses if the relevant guidelines were not followed.
Below is a list of five of the most common guidelines.
 MOOSE: Meta-analysis Of Observational Studies in Epidemiology (http://statswrite.eu/pdf/MOOSE%20Statement.pdf and http://www.ijo.in/documents/14MOOSE_SS.pdf)
 STROBE: Strengthening The Reporting of OBservational studies in Epidemiology (https://www.strobestatement.org/index.php?id=strobehome and https://www.strobestatement.org/index.php?id=availablechecklists)
 CONSORT: CONsolidated Standards Of Reporting Trials (http://www.consortstatement.org/ and http://www.equatornetwork.org/reportingguidelines/consort/)
 QUOROM: QUality Of Reporting Of Meta-analyses (https://journals.plos.org/plosntds/article/file?type=supplementary&id=info:doi/10.1371/journal.pntd.0000381.s002)
 PRISMA: Preferred Reporting Items for Systematic reviews and Meta-Analyses (http://www.prismastatement.org/)
There are two organizations that do lots of research and publish many meta-analyses. Because these organizations publish many meta-analyses (and meta-analyses of meta-analyses), they help to set the standards for good meta-analyses. These organizations are the Cochrane Collaboration and the Campbell Collaboration. The Cochrane Collaboration is an organization that collects data related to medicine and health care topics. Because their focus is on high-quality data for international clients, they do a lot of meta-analyses. Not surprisingly, they wrote a whole book of guidelines on how to conduct both systematic reviews and meta-analyses.
The Campbell Collaboration was founded in 1999 and is named after Donald T. Campbell (the same Don Campbell who co-authored Campbell and Stanley and Cook and Campbell). This organization is like the Cochrane Collaboration, only for social issues. Their site has links to the Cochrane Collaboration guidelines, as well as to other sets of guidelines. Another useful website is from the Countway Library of Medicine: https://guides.library.harvard.edu/metaanalysis/guides .
This sounds like a lot of guidelines, but in truth, they are all very similar. Reading through some of them will give you a good idea of what information is needed in your write-up. You will want to know this so that you can collect this information as you move through the data collection process.
Quality checklists
Almost all of the meta-analysis guidelines require that all of the studies included in the meta-analysis be rated on a quality checklist. There are hundreds of quality checklists that you can use. Some have been validated; many have not. You may find that you get different results when you use different quality checklists. The purpose of the quality checklist is to identify studies that are potentially not-so-good and ensure that such studies are not having an undue influence on the results of the meta-analysis. For example, if, according to the quality checklist being used, one study was found to be of much lower quality than all of the others in the meta-analysis, you might do a sensitivity analysis in which you omit this study from the meta-analysis and then compare those results to those obtained when it is included in the meta-analysis.
Keep in mind that reporting standards have changed over time; witness the evolution of the APA manuals. This became a real issue in our meta-analysis with respect to the reporting of p-values. Back in the 1990s, it was standard practice to report p-values as being above or below 0.05, but not the exact p-value itself. Also, the requirement to report effect sizes is relatively new, so older studies typically did not report them. The quality checklist that we used, the Downs and Black (1998, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1756728/pdf/v052p00377.pdf ), considered these omissions a mark of a lower-quality study.
To give you a sense of what a quality checklist is like, here are a few example items:
1. Is the hypothesis/aim/objective of the study clearly described? yes = 1; no = 0
20. Were the main outcome measures used accurate (valid and reliable)? yes = 1; no = 0; unable to determine = 0
23. Were study subjects randomised to intervention groups? yes = 1; no = 0; unable to determine = 0
Data collection
Once you are familiar with the guidelines, you are ready to start the data collection process. The first step is to clearly state the research question. This is critical, because it will inform the criteria for which studies will be included in your meta-analysis and which will be excluded. With respect to the example that will be used in this workshop, the lead author on the paper was an expert in issues surrounding the gathering of information from children in foster care who had experienced traumatic events. These kids were interviewed as part of legal proceedings related to the traumatic event. For this meta-analysis, we were interested in the effects of interviewer support on children’s memory and suggestibility. Our research question was: Does interviewer supportiveness improve interview outcomes? This is a very specific question, but we actually started with a more general question which was refined several times until we settled on this question.
Grey literature
When deciding where you will search for articles, you will need to consider if you will include any of the so-called grey literature in your search. Grey literature includes articles that are either not published or published in not-so-easy-to-access journals. There is often a question as to whether these articles have undergone the peer-review process. Master’s theses and dissertations are often considered part of the grey literature. Other examples of grey literature include government reports, NGO reports, research/reports from businesses, white papers, conference proceedings/papers, statements by professional organizations, etc. There is some debate as to whether such research should be included in a meta-analysis. On the one hand, because these articles have not been through the same type of peer-review process that articles published in journals have, they may be of substantially lower quality. Also, it is difficult to access this type of literature, as there is no database to search, and so there may be selection bias with respect to those articles that are discovered and included. On the other hand, some argue that omitting such articles may lead to an inflated estimate of the summary effect size, as the effect sizes reported by grey literature articles will likely be smaller than those reported in published articles.
Inclusion and exclusion criteria
Once the research question has been sufficiently refined, you need to think about how to determine which studies will be included in your meta-analysis. In practice, instead of developing both inclusion and exclusion criteria, you may just develop exclusion criteria. So you will include all of the research articles you find, unless there is a reason to exclude the article. Here are some examples of exclusion criteria that we used in our meta-analysis.
 Language: Articles not written in English.
 Date of publication: Articles published before 1980 or after 2016.
 Publication source: Article not published in a peer-reviewed journal.
 Type of research: Did not use an experimental design.
Next, you need to decide where you are going to search for the articles. This is a very important step in the process, because your meta-analysis may be criticized if you don’t search in all possible places and therefore fail to find and include studies that really should be included in the meta-analysis. Your decision regarding the inclusion or exclusion of grey literature becomes very important at this stage, because it will affect where you search for articles.
The next step is to start doing the search of the literature. While ideally you would have a complete list of the exclusion criteria before you start the literature search, the reality is that this may be an iterative process, as you find that items need to be added to the exclusion list as you do the searches. This may mean that some searches need to be done more than once. Remember as you do this to keep track of your search terms for each search, and all of the necessary information for writing up the results. Depending on your research question and your search parameters, you may find very few results or tens of thousands of results. Either of these may lead you to revise your exclusion criteria and/or your search terms.
Sensitivity and specificity
This brings us to the discussion of sensitivity versus specificity. In terms of a meta-analysis, sensitivity means that you get all of what you want. In other words, your search results include all of the articles that should be included in your meta-analysis; nothing is missing. Specificity means that you get none of what you don’t want. In other words, you don’t have articles that shouldn’t be included in your meta-analysis. In practice, when doing the search of the literature, most researchers tend to err on the side of sensitivity, to ensure that no relevant study was missed. However, this means more work to sort through the possible studies to eliminate those that should not be included.
Once you have a list of all possible articles to include in the meta-analysis, you need to determine what part of the article you will read in order to determine if the article should be included in the meta-analysis. You will likely have hundreds or thousands of possible articles, so reading each one in its entirety isn’t realistic. Rather, you might read just the abstract, or just the methods section, for example.
Sorting through all of the possible studies takes a lot of time and effort. You need to be very organized so that you don’t end up evaluating the same article multiple times. Usually, this job is done by more than one person. In this situation, some articles need to be evaluated by everyone doing the evaluation task, to ensure that everyone would make the same decision regarding that article (to include it in the study or not). This should be done early on, in case more training is needed. Consistency between evaluators is critical and inter-rater agreement needs to be reported.
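One way to quantify agreement between two evaluators is Cohen's kappa, which corrects the raw percentage of agreement for the agreement expected by chance. The workshop does not prescribe a particular agreement statistic, so treat the choice of kappa here as an assumption. The following Python sketch uses hypothetical include/exclude decisions.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' decisions on the same articles."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of articles")
    n = len(rater_a)
    # Proportion of articles on which the raters made the same decision.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Agreement expected by chance, from each rater's marginal proportions.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
    if expected == 1:
        return 1.0  # both raters used a single, identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical screening decisions for ten abstracts.
rater_1 = ["include", "exclude", "exclude", "include", "exclude",
           "exclude", "include", "exclude", "exclude", "exclude"]
rater_2 = ["include", "exclude", "include", "include", "exclude",
           "exclude", "exclude", "exclude", "exclude", "exclude"]
print(round(cohens_kappa(rater_1, rater_2), 3))
```

In this made-up example the raters agree on 8 of 10 abstracts, yet kappa is noticeably lower than 0.8 because most of that agreement would be expected by chance when both raters exclude most articles.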
What information to collect
Before you start collecting the actual data for the meta-analysis, decide which statistical software package will be used to analyze the data. Look at the help file for the command that you will be using. For this workshop, we will be using the meta-analysis commands that were introduced in Stata 16. Looking at the help file for meta, you can see that there are several different ways that the data file could be structured, depending on the type of data that are available. The meta set command can be used if your dataset contains effect sizes and their standard errors, or effect sizes and their confidence intervals. The meta esize command can be used if your dataset contains the means, standard deviations and sample sizes for two groups (usually an experimental group and a control group). The meta esize command can also be used if your dataset contains the Ns for successes and failures for two groups (for a total of four sample sizes).
With the meta set command, your dataset could be formatted in one of two ways:
esvar sevar
or
esvar cilvar ciuvar
where “esvar” means “effect size”, “sevar” means “standard error of effect size”, “cilvar” means “confidence interval lower”, and “ciuvar” means “confidence interval upper”.
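The two layouts carry the same information when the confidence interval is symmetric and based on the normal distribution. As a rough illustration, the Python sketch below backs out a standard error from a reported interval. The function name se_from_ci and the example values are hypothetical, and the 95% level is an assumption that must match what the article actually reports.

```python
from statistics import NormalDist


def se_from_ci(lower, upper, level=0.95):
    """Back out a standard error from a symmetric, normal-based CI."""
    if upper <= lower:
        raise ValueError("upper limit must exceed lower limit")
    # Critical value for the stated confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (upper - lower) / (2 * z)


# Hypothetical study reporting an effect size of 0.40 with 95% CI [0.10, 0.70].
print(round(se_from_ci(0.10, 0.70), 4))
```

For ratio measures such as odds ratios, the interval is usually symmetric on the log scale, so both limits would need to be log-transformed before applying this calculation.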
With the meta esize command, your dataset could be formatted in one of two ways:
n1 mean1 sd1 n2 mean2 sd2
or
n11 n12 n21 n22
where “n1” means the sample size for the experimental group, “mean1” means the mean for the experimental group, “sd1” means the standard deviation for the experimental group, “n2” means the sample size for the control group, “mean2” means the mean for the control group, “sd2” means the standard deviation for the control group, “n11” means the number of successes for the experimental group, “n12” means the number of failures for the experimental group, “n21” means the number of successes for the control group, and “n22” means the number of failures for the control group.
As you can see, there are several different ways you can enter data. You want to know which of these you will be using before you start collecting data. For example, in the social sciences, you are likely to find Ns, means and standard deviations in the articles to be included in the meta-analysis. In medical fields, you may be more likely to find odds ratios. Either way, you need to know what information is needed by the software so that you know what to collect. That said, data collection is rarely that straightforward. Because of that, I find that you often end up with two datasets. The first one is the one into which you collect the information that you can find, and the second one is the “cleaned up” one that you use for the analysis. I will describe both of these in more detail a little later on.
Another thing you need to do before starting to collect the data is to determine how you are going to mark the data that you have found. For example, are you going to print out all of the articles and use a highlighter to highlight the needed values? Are you going to have a file of PDFs of the articles and use an electronic highlighter? As before, if more than one person is going to do this task, you will need to get the inter-rater reliability established before data collection begins.
Another point to consider is the possibility of running a meta-regression. We will discuss the topic of meta-regressions in more detail later, but the point now is to think about possible covariates that might be included in such a model. You want to think about that before reading through the selected sections of the articles, because you want to collect all of the needed data in as few readings as possible.
Your first dataset may look rather messy and may not be in a form that is ready to be imported into Stata for analysis. This is OK; it is still a good start. There are two major issues that need to be addressed: one is the missing data, and the other is the fact that you will likely have different types of effect sizes.
Missing data
As with complex survey data, there are two types of missing data in meta-analyses. The first type is missing studies (akin to unit nonresponse in survey data). The second type is missing values from a particular study (akin to item nonresponse). We will deal with missing studies later on when we discuss various types of bias. Here we are going to talk about missing values from the articles. To be clear, how missing data are handled in a meta-analysis can be very different from how missing data are handled in almost any other type of research. The “cost” of missing data in a meta-analysis is often high, because there is no way to get more data (because you have already included all of the relevant studies in your meta-analysis), and the number of studies may be rather low.
Before trying to use a more “traditional” form of imputation, such as multiple imputation, in a meta-analysis you can try to find an acceptable method to replace the missing value using the information that is available. One of the best places I have found for information on making these substitutions is the manual for the ES program written by the late Will Shadish. This document, which you can download from his ES webpage (http://faculty.ucmerced.edu/wshadish/software/escomputerprogram), documents the formulas used to calculate the different types of effect sizes and describes the consequences of substituting known information for unknown information.
Let’s say that you are collecting Ns, means and standard deviations for your meta-analysis. You must find the N, mean and standard deviation for both the experimental and control group, for a total of six values. One of the articles gives only five values; the N for one of the groups is missing. Now you may be able to figure out what the N is by subtracting the N in one group from the known total sample size, but if you can’t do that, you could just assume that each group had the same N. It turns out the effect size that is calculated with this substitution is very similar to what would have been calculated if the real N had been known. Likewise, if you had every value except the standard deviation for one group, you might assume that the standard deviations in the two groups were equal. This is a little more of a compromise than assuming equal Ns, but it isn’t too bad. However, you should be hesitant to assume that the means of the two groups were the same.
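To see why the equal-N substitution is usually harmless for the effect size itself, consider the following Python sketch. All of the numbers are hypothetical. The effect size is Cohen's d with a pooled standard deviation, which is only one of several choices.

```python
from math import sqrt


def cohens_d(n1, mean1, sd1, n2, mean2, sd2):
    """Cohen's d using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / sqrt(pooled_var)


# Hypothetical study: the experimental group has 40 participants and the
# control group really has 60, but the control N is not reported.
d_true = cohens_d(40, 10.0, 4.0, 60, 8.0, 5.0)

# Substitution: assume the control group has the same N as the experimental group.
d_substituted = cohens_d(40, 10.0, 4.0, 40, 8.0, 5.0)

print(round(d_true, 3), round(d_substituted, 3))
```

The Ns enter Cohen's d only through the pooling of the two standard deviations, which is why the two values are close. The Ns matter more for the standard error of the effect size, and hence for the weight the study receives, so the substitution is not entirely free.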
Let’s say that in another article, the only information that you can find is the value of the F-statistic, the degrees of freedom, and the p-value. You can calculate an effect size from this information. If you can find only the p-value, you can still estimate the effect size. For example, if you have a p-value and the degrees of freedom, you can figure out the t-value, and then calculate an effect size from there. To do this, however, you need to have the exact p-value.
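As a rough illustration of recovering an effect size from a test statistic, the Python sketch below converts an F-statistic to a standardized mean difference. It assumes a comparison of exactly two independent groups, so that the numerator degrees of freedom equal 1 and F is the square of t. The function name d_from_f and the example values are hypothetical.

```python
from math import sqrt


def d_from_f(f_stat, n1, n2):
    """Standardized mean difference from an F-statistic for two groups.

    Valid only when the numerator degrees of freedom equal 1, in which
    case F = t**2. The sign of the effect is lost and must be taken
    from the direction of the group means.
    """
    if f_stat < 0:
        raise ValueError("F cannot be negative")
    t = sqrt(f_stat)
    return t * sqrt((n1 + n2) / (n1 * n2))


# Hypothetical article reporting F(1, 98) = 9.0 with 50 participants per group.
print(round(d_from_f(9.0, 50, 50), 3))
```

Starting from an exact p-value works the same way once the t-value has been recovered from the inverse t distribution, which is available in most statistical software.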
One of the major concerns with a meta-analysis is collecting all of the relevant research articles, because as we know, not all research is published. Of particular concern are the results of high-quality research that found non-significant results. Finding such research results can be difficult, as you can’t find them in electronic databases of published research. So what are your options? You can try to contact researchers who have published articles in this particular area of research to ask about non-published papers. You can search for dissertations or Master’s theses. You can talk to people at academic conferences. You can post inquiries on relevant mailing lists. However, caution must be exercised, because some of these works may be of substantially poorer quality than the work that is published. There could be flaws in the design, instrument construction, data collection techniques, etc. In other words, finding non-significant results isn’t the only reason research is not published.
Another method of handling missing data is to contact the author(s) of the study with the missing data and ask for the value or values needed. In my experience, some authors were very understanding and provided the requested values, while others simply ignored our request. One replied to our email request and said that she would be willing to provide the missing value (which was a mean), but the data were stored on a tape that was no longer readable. There is also the possibility of doing a multiple imputation, but we will not discuss that in this workshop.
Once the missing data have been handled and all of the effect sizes have been converted into the same metric, we have our second dataset, which is ready for preliminary analysis.
Different types of effect sizes
There are many different types of effect sizes, some for continuous outcome variables and others for binary outcome variables. The effect sizes for continuous outcome variables belong to one of two families of effect sizes: the d class and the r class. The d class effect sizes are usually calculated from the mean and standard deviation. They are a scaled difference between the means of two groups. Glass’ delta, Cohen’s d and Hedges’ g are examples of this type of effect size. Glass’ delta is the difference between the means of the experimental and control groups divided by the standard deviation of the control group. Cohen’s d is the difference between the means of the experimental and control groups divided by the pooled standard deviation (i.e., pooled across the experimental and control groups). Hedges’ g is a correction to Cohen’s d because Cohen’s d tends to overestimate the effect size in small samples (< 10-15 total).
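The three d class effect sizes described above can be sketched in a few lines of Python. The data are hypothetical. The small-sample correction used for Hedges' g is the common approximation 1 - 3/(4*df - 1), which is an assumption here; software may use an exact correction instead.

```python
from math import sqrt


def glass_delta(mean_e, mean_c, sd_c):
    """Mean difference scaled by the control group's standard deviation."""
    return (mean_e - mean_c) / sd_c


def cohens_d(n_e, mean_e, sd_e, n_c, mean_c, sd_c):
    """Mean difference scaled by the pooled standard deviation."""
    pooled_var = ((n_e - 1) * sd_e**2 + (n_c - 1) * sd_c**2) / (n_e + n_c - 2)
    return (mean_e - mean_c) / sqrt(pooled_var)


def hedges_g(n_e, mean_e, sd_e, n_c, mean_c, sd_c):
    """Cohen's d multiplied by an approximate small-sample correction."""
    df = n_e + n_c - 2
    correction = 1 - 3 / (4 * df - 1)
    return correction * cohens_d(n_e, mean_e, sd_e, n_c, mean_c, sd_c)


# Hypothetical small study: 8 participants per group.
study = dict(n_e=8, mean_e=12.0, sd_e=4.0, n_c=8, mean_c=9.0, sd_c=5.0)
print(round(glass_delta(study["mean_e"], study["mean_c"], study["sd_c"]), 3))
print(round(cohens_d(**study), 3))
print(round(hedges_g(**study), 3))
```

With only 16 participants in total, the correction shrinks d by roughly 5%; as the sample grows, the correction factor approaches 1 and g and d become practically identical.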
The r class effect sizes are also a ratio, but they are the ratio of the variance attributable to an effect to the total variance, or more simply, the proportion of variance explained. Examples of this type of effect size include eta-squared and omega-squared.
When dealing with binary data, it is common to have a 2×2 table: event/non-event and treated/control.
           event   non-event
treated    A       B
control    C       D
From such a table, three different types of effect sizes can be calculated. These include the risk ratio, the odds ratio and the risk difference.
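Risk ratio = (A/n_{1})/(C/n_{2})
Odds ratio = (AD)/(BC)
Risk difference = (A/n_{1}) - (C/n_{2})
where n_{1} = A + B is the number of treated participants and n_{2} = C + D is the number of control participants.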
The risk ratio and the odds ratio are relative measures, and as such they are relatively insensitive to the number of baseline events. The risk difference is an absolute measure, so it is very sensitive to the number of baseline events.
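The three formulas, and the point about baseline events, can be illustrated with a short Python sketch. The cell counts are hypothetical, and the function name binary_effect_sizes is illustrative only. No continuity correction is applied, so the sketch fails if a cell needed in a denominator is zero.

```python
def binary_effect_sizes(a, b, c, d):
    """Risk ratio, odds ratio and risk difference from a 2x2 table.

    a, b: events and non-events in the treated group
    c, d: events and non-events in the control group
    """
    n1 = a + b  # treated group size
    n2 = c + d  # control group size
    risk_treated = a / n1
    risk_control = c / n2
    return {
        "risk_ratio": risk_treated / risk_control,
        "odds_ratio": (a * d) / (b * c),
        "risk_difference": risk_treated - risk_control,
    }


# Hypothetical tables with the same risk ratio but different baseline risks.
common_event = binary_effect_sizes(20, 80, 10, 90)  # baseline risk 10%
rare_event = binary_effect_sizes(2, 98, 1, 99)      # baseline risk 1%
print(common_event)
print(rare_event)
```

Both tables double the risk, so the risk ratio is the same, but the risk difference is ten times smaller when the event is rare.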
Converting between different types of effect sizes
While you will need to collect information necessary to calculate an effect size from some articles, other articles will provide the effect size. However, there are dozens of different types of effect sizes, so you may need to convert the effect size given in the paper into the type of effect size that you need for your meta-analysis. There are several good online effect size calculators/converters. One of my favorites is: https://www.psychometrica.de/effect_size.html .
You need to be careful when using effect size converters, because some conversions make more sense than others. For example, you can easily convert a Cohen’s d to an odds ratio, but the reverse is not recommended. Why is that? A Cohen’s d is based on data from continuous variables, while an odds ratio is based on data from dichotomous variables. It is easy to make a continuous variable dichotomous, but you can’t make a dichotomous variable continuous (because the dichotomous variable contains less information than the continuous variable). However, Cox (1970) suggested that d(Cox) = LOR/1.65 (where LOR = log odds ratio), and Sánchez-Meca et al. (2003) showed that this approximation works well. Bonett (2007) and Kraemer (2004) have good summaries of issues regarding fourfold tables. Another point to keep in mind is the effect of rounding error when converting between different types of effect sizes.
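The Cox approximation is simple enough to compute by hand, but a short Python sketch makes the steps explicit. The function name d_cox and the example odds ratio are hypothetical; the divisor 1.65 is the value given in the text.

```python
from math import log


def d_cox(odds_ratio):
    """Approximate a standardized mean difference from an odds ratio.

    Uses the Cox approximation: natural log of the odds ratio divided by 1.65.
    """
    if odds_ratio <= 0:
        raise ValueError("an odds ratio must be positive")
    return log(odds_ratio) / 1.65


# Hypothetical article reporting an odds ratio of 2.25.
print(round(d_cox(2.25), 3))
```

Carrying full precision through the conversion, and rounding only at the end, helps limit the rounding error mentioned above.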
Like any other quantity calculated from sampled data, an effect size is an estimate. Because it is an estimate, we want to calculate the standard error (or confidence interval) around that estimate. If you give the statistical software the information necessary to calculate the effect size, it will also calculate the standard error for that estimate. However, if you supply the estimate of the effect size, you will also need to supply either the standard error for the estimate or the confidence interval. This can be a real problem when an article reports an effect size but not its standard error, because it may be difficult to find a way to derive that information from what is given in the article.
Despite the large number of effect sizes available, there are still some situations in which there is no agreed-upon measure of effect. Two examples are count models and multilevel models.
Data inspection and descriptive statistics
In the preliminary analysis, we do all of the data checking that you would do with any other dataset, such as looking for errors in the data, getting to know the variables, etc. Of particular interest is looking at the estimates of the effect sizes for outliers. Of course, what counts as an outlier depends on the context, but you still want to identify any extreme effect sizes. If the dataset is small, you can simply look through the data, but if the dataset is large, you may need to use a test to identify outliers. You could use the chi-square test for outliers, the Dixon Q test (1953, 1957) for outliers or the Grubbs (1950, 1969, 1972) test for outliers. We used a method proposed by Viechtbauer and Cheung (2010), and there are others that you could use as well. One of the problems that we had was that one test identified a given data point as an outlier, but another test didn’t identify any points as outliers, or would identify a different data point. The other problem was that if one outlier was removed and the test rerun, a different point would be identified as an outlier. Given that there were only eleven data points in our dataset, I wasn’t eager to lose any data point for any reason. In the end, we didn’t exclude any of the effect sizes identified by any of the techniques because we couldn’t find any compelling reason to do so (e.g., the effect size had not been miscalculated, the effect size didn’t come from a very different type of study or from measures that were very different from those used in other studies included in the analysis), and the value for the data point was not too far beyond the cutoff point for calling the value an outlier. I did do sensitivity analyses with and without the “outlier”, and the results weren’t too different.
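A sensitivity analysis of this kind can be sketched as a leave-one-out loop. The Python example below is only an illustration: the effect sizes and standard errors are hypothetical, and the pooled estimate is a simple inverse-variance (common-effect) average, which is an assumption and not necessarily the model used in the published analysis.

```python
def pooled_estimate(effects, std_errors):
    """Inverse-variance weighted average of the effect sizes."""
    weights = [1 / se**2 for se in std_errors]
    return sum(w * es for w, es in zip(weights, effects)) / sum(weights)


def leave_one_out(effects, std_errors):
    """Pooled estimate recomputed with each study omitted in turn."""
    results = []
    for i in range(len(effects)):
        kept_effects = effects[:i] + effects[i + 1:]
        kept_errors = std_errors[:i] + std_errors[i + 1:]
        results.append(pooled_estimate(kept_effects, kept_errors))
    return results


# Hypothetical data: the fourth study has a much larger effect size.
effects = [0.30, 0.45, 0.25, 1.40, 0.35]
std_errors = [0.15, 0.20, 0.10, 0.30, 0.12]

print(round(pooled_estimate(effects, std_errors), 3))
for i, estimate in enumerate(leave_one_out(effects, std_errors), start=1):
    print(f"without study {i}: {estimate:.3f}")
```

If omitting a suspected outlier changes the pooled estimate only modestly, as it does for the imprecise fourth study in this made-up example, that supports keeping the study in the analysis.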
After you have done the descriptive statistics and considered potential outliers, it is finally time to do the analysis! Before running the meta-analysis, we should discuss two important topics that will be shown in the output. The first is weighting, and the second is measures of heterogeneity.
Weighting
As we know, some of the studies had more subjects than others. In general, the larger the N, the lower the sampling variability and hence the more precise the estimate. Because of this, the studies with larger Ns are given more weight in a metaanalysis than studies with smaller Ns. This is called “inverse variance weighting”, or in Stata speak, “analytic weighting”. These weights are relative weights and should sum to 100. You do not need to calculate these weights yourself; rather, the software will calculate and use them, and they will be shown in the output.
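As a rough illustration of inverse variance weighting (a Python sketch, not what Stata runs internally), the snippet below takes the nine effect sizes and 95% confidence limits from the fixed-effects output shown later in this workshop, backs out each standard error from the width of the interval, and converts 1/variance into percentage weights. The variable names are made up for this example, and the lower limits that cross zero are entered as negative numbers.

```python
# (Hedges's g, lower 95% limit, upper 95% limit) for Studies 1-9,
# copied from the workshop's fixed-effects output table.
studies = [
    (0.147, -0.347, 0.640), (0.302, -0.101, 0.705), (0.000, -0.437, 0.437),
    (0.325, -0.225, 0.874), (0.496, 0.090, 0.902), (0.182, -0.229, 0.593),
    (0.220, -0.233, 0.672), (0.350, -0.085, 0.785), (0.602, 0.168, 1.035),
]

Z_95 = 1.959964  # normal quantile behind a 95% confidence interval

# Assumption: the intervals are symmetric normal-theory intervals,
# so SE = (upper - lower) / (2 * 1.96) and the variance is SE squared.
variances = [((hi - lo) / (2 * Z_95)) ** 2 for _, lo, hi in studies]

raw_weights = [1 / v for v in variances]              # inverse variance weights
total = sum(raw_weights)
pct_weights = [100 * w / total for w in raw_weights]  # relative weights (sum to 100)

for i, w in enumerate(pct_weights, start=1):
    print(f"Study {i}: {w:5.2f}%")
```

Because the inputs are rounded to three decimals, the percentages should agree with the % Weight column only to within a few hundredths.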
Heterogeneity
Up to this point, we have focused on finding effect sizes and have considered the variability around these effect sizes, measured by the standard error and/or the confidence interval. This variability is actually made up of two components: the variation in the true effect sizes, which is called heterogeneity, and spurious variability, which is just random error (i.e., sampling error). When conducting a metaanalysis, we want to get a measure of this heterogeneity, or the variation in the true effect sizes. There are several measures of this, and we will discuss each in turn. Please note that the following material is adapted from Chapter 16 of Introduction to Meta-Analysis by Borenstein, Hedges, Higgins and Rothstein (2009). The explanations found there include useful graphs; reading that chapter is highly recommended.
If the heterogeneity were in fact 0, it would mean that all of the studies in the metaanalysis shared the same true effect size. However, we would not expect all of the observed effect sizes to be exactly the same, because there would still be within-study sampling error. Instead, the effect sizes would fall within a particular range around the true effect.
Now suppose that the true effect size does vary between studies. In this scenario, the observed effect sizes vary for two reasons: heterogeneity with respect to the true effect sizes and within-study sampling error. Now we need to separate the heterogeneity from the within-study sampling error. The three steps necessary to do this are:
 Compute the total amount of study-to-study variation actually observed
 Estimate how much the observed effects would be expected to vary from each other if the true effect was actually the same in all studies
 Assume that the excess variation reflects real differences in the effect size (AKA heterogeneity)
Q
Let’s start with the Q statistic, which is a ratio of observed variation to the within-study error.
$$ Q = \sum_{i=1}^k W_i (Y_i - M)^2$$
In the above equation, W_{i} is the study weight (1/V_{i}), Y_{i} is the study effect size, M is the summary effect and k is the number of studies. Alternatively, the formula may be written as
$$ Q = \sum_{i=1}^k \left( \frac{Y_i - M}{S_i} \right)^2$$
Note that you can call Q either a weighted sum of squares (WSS) or a standardized difference (rather like Cohen’s d is a standardized difference).
Looking back at the three steps listed above, the first step is to calculate Q. There is a formula for doing this by hand, but most researchers use a computer program to do this. Once you have Q, the next step is to calculate the expected value of Q, assuming that all studies share a common effect size and hence all of the variation is due to sampling error within studies. Because Q is a standardized measure, the expected value depends only on the degrees of freedom, which is df = k – 1, where k is the number of studies. The third and final step is to find the “excess” variation, which is simply Q – df.
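The three steps can be sketched in a few lines of Python. This is only an illustration with made-up names: it reuses the effect sizes and confidence limits from the workshop's fixed-effects output (shown later), and assumes each study's variance can be recovered from the width of its 95% confidence interval.

```python
def q_statistic(effects, variances):
    """Return (Q, df, Q - df) for effect sizes and their within-study variances."""
    weights = [1 / v for v in variances]                     # W_i = 1 / V_i
    summary = sum(w * y for w, y in zip(weights, effects)) / sum(weights)  # M
    q = sum(w * (y - summary) ** 2 for w, y in zip(weights, effects))      # step 1
    df = len(effects) - 1                                    # step 2: expected Q
    return q, df, q - df                                     # step 3: excess

# (Hedges's g, lower 95% limit, upper 95% limit) for Studies 1-9.
studies = [
    (0.147, -0.347, 0.640), (0.302, -0.101, 0.705), (0.000, -0.437, 0.437),
    (0.325, -0.225, 0.874), (0.496, 0.090, 0.902), (0.182, -0.229, 0.593),
    (0.220, -0.233, 0.672), (0.350, -0.085, 0.785), (0.602, 0.168, 1.035),
]
Z_95 = 1.959964
effects = [g for g, _, _ in studies]
variances = [((hi - lo) / (2 * Z_95)) ** 2 for _, lo, hi in studies]

q, df, excess = q_statistic(effects, variances)
print(f"Q = {q:.2f}, df = {df}, Q - df = {excess:.2f}")
```

For these data Q comes out smaller than df, so the "excess" is negative: there is less observed variation than sampling error alone would lead us to expect.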
If you want to know if the heterogeneity is statistically significant, you can do so with Q and df. Specifically, the null hypothesis is that all studies share a common effect size, and under this null hypothesis, Q will follow a central chi-squared distribution with degrees of freedom equal to k – 1. As you would expect, this test is sensitive to both the magnitude of the effect (i.e., the excess dispersion), and the precision with which the effect is measured (i.e., the number of studies). While a statistically significant p-value is evidence that the true effects vary, the converse is not true. In other words, you should not interpret a nonsignificant result as evidence that the true effects are the same. The result could be nonsignificant because the true effects do not vary, or because there is not enough power to detect the variation, or for some other reason. Also, don’t confuse Q with an estimate of the amount of true variance; other methods can be used for that purpose. Finally, be cautious with Q when you have a small number of studies in your metaanalysis and/or a lot of within-study variance, which is often caused by studies with small Ns.
This has all been pretty easy to calculate, but there are some limitations to Q. First of all, the metric is not intuitive. Also, Q is a sum, not a mean, which means that it is very sensitive to the number of studies included in the metaanalysis. But calculating Q has not been a waste of time, because it is used in the calculation of other measures of heterogeneity that may be more useful. If we take Q, remove the dependence on the number of studies and return it to the original metric, then we have T^{2}, which is an estimate of the variance of the true effects. If we take Q, remove the dependence on the number of studies and express the result as a ratio, we will have I^{2}, which estimates the proportion of the observed variance that is heterogeneity (as opposed to random error).
Tau-squared and T^{2}
Now let’s talk about tau-squared and T^{2}. Tau-squared is defined as the variance of the true effect sizes. To know this, we would need to have an infinite number of studies in our metaanalysis, and each of those studies would need to have an infinite number of subjects. In other words, we aren’t going to be able to calculate this value. Rather, we can estimate tau-squared by calculating T^{2}. To do this, we start with (Q – df) and divide this quantity by C.
$$T^2 = \frac{Q - df}{C} $$
where
$$ C = \sum W_i - \frac{\sum W_i^2}{\sum W_i} $$
This puts T^{2} back into the original metric and makes T^{2} an average of squared deviations. If tau-squared is the actual value of the variance and T^{2} is the estimate of that actual value, then you can probably guess that tau is the actual standard deviation and T is the estimate of this parameter.
While tau-squared can never be less than 0 (because the actual variance of the true effects cannot be less than 0), T^{2} can be less than 0 if the observed variance is less than expected based on the within-study variance (i.e., Q < df). When this happens, T^{2} should be set to 0.
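A small Python sketch of this calculation follows. The function name and both datasets are hypothetical, chosen so the arithmetic is easy to follow. Note that this method-of-moments estimate is not the REML estimate that appears in the Stata output later in the workshop, so the two will not always agree.

```python
def tau_squared_estimate(effects, variances):
    """Method-of-moments estimate T^2 = (Q - df) / C, set to 0 when Q < df."""
    w = [1 / v for v in variances]                          # W_i = 1 / V_i
    m = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - m) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)          # C from the formula above
    return max(0.0, (q - df) / c)

# Hypothetical effect sizes; every study has within-study variance 0.01.
spread_out = tau_squared_estimate([0.1, 0.3, 0.5, 0.9], [0.01] * 4)
tight = tau_squared_estimate([0.30, 0.31, 0.29, 0.30], [0.01] * 4)

print(f"spread-out effects: T^2 = {spread_out:.4f}, T = {spread_out ** 0.5:.4f}")
print(f"tightly clustered effects: T^2 = {tight:.4f}")  # Q < df, so truncated to 0
```

In the first dataset Q = 35 on 3 degrees of freedom and C = 300, so T^{2} = 32/300; in the second, Q is far below df and the estimate is set to 0.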
I^{2}
Notice that T^{2} and T are absolute measures, meaning that they quantify deviation on the same scale as the effect size index. While this is often useful, it is also useful to have a measure of the proportion of observed variance, so that you can ask questions like “What proportion of the observed variance reflects real differences in effect size?” In their 2003 paper, “Measuring inconsistency in meta-analyses”, Higgins et al. proposed I^{2}. I^{2} can be thought of as a type of signal-to-noise ratio.
$$ I^2 = \left( \frac{Q - df}{Q} \right) \times 100\%$$
Alternatively,
$$I^2= \left( \frac{Variance_{bet}}{Variance_{total}} \right) \times 100\% = \left( \frac{\tau^2}{\tau^2+V_Y} \right) \times 100\% $$
In words, I^{2} is the ratio of excess dispersion to total dispersion. I^{2} is a descriptive statistic and not an estimate of any underlying quantity. Borenstein et al. note that: “I^{2} reflects the extent of overlap of confidence intervals, which is not dependent on the actual location or spread of the true effects. As such it is convenient to view I^{2} as a measure of inconsistency across the findings of the studies, and not as a measure of the real variation across the underlying true effects.” (page 118).
Let’s give some examples of interpreting I^{2}. An I^{2} value near 0 means that most of the observed variance is random; it does not mean that the effects are clustered in a narrow range. For example, the observed effects could vary widely because the studies had a lot of sampling error. On the other hand, an I^{2} value near 100% indicates that most of the observed variability is real, not that the effects have a wide range. Instead, they could have a very narrow range and be estimated with great precision. The point here is to stress that I^{2} is a measure of proportion of variability, not a measure of the amount of true variability.
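To make the "proportion, not amount" point concrete, here is a hedged Python sketch using two hypothetical datasets. The second is simply the first with every effect size multiplied by 10 (and every variance by 100), as if the outcome had been measured on a different scale. The function name is made up, and T^{2} is the method-of-moments estimate described above.

```python
def heterogeneity(effects, variances):
    """Return (Q, T^2, I^2 in percent) using the formulas from this section."""
    w = [1 / v for v in variances]
    m = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - m) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    t2 = max(0.0, (q - df) / c)                          # absolute measure
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0  # relative measure
    return q, t2, i2

# Hypothetical data: same pattern of results on two different scales.
small_scale = heterogeneity([0.1, 0.3, 0.5, 0.9], [0.01] * 4)
large_scale = heterogeneity([1.0, 3.0, 5.0, 9.0], [1.0] * 4)

for label, (q, t2, i2) in (("small scale", small_scale), ("large scale", large_scale)):
    print(f"{label}: Q = {q:.1f}, T^2 = {t2:.4f}, I^2 = {i2:.1f}%")
```

Q and I^{2} are identical for the two datasets, while T^{2} differs by a factor of 100, which is why I^{2} alone cannot tell you how widely the true effects are spread.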
There are several advantages to using I^{2}. One is that the range is from 0 to 100%, which is independent of the scale of the effect sizes. It can be interpreted as a ratio, similar to indices used in regression and psychometrics. Finally, I^{2} is not directly influenced by the number of studies included in the metaanalysis.
Because I^{2} is on a relative scale, you should look at it first to decide if there is enough variation to warrant speculation about the source or cause of the variation. In other words, before jumping into a metaregression or subgroup analysis, you want to look at I^{2}. If it is really low, then there is no point to doing a metaregression or subgroup analysis.
Running the metaanalysis and interpreting the results (including the forest plot)
Let’s return to our example. The cleaned dataset looks like this:
clear
input id sup_pos_n sup_pos_mean sup_pos_sd sup_neg_n sup_neg_mean sup_neg_sd db pubdate
1 33 .97 .13 29 .95 .14 17 2008
2 33 .90 .31 82 .79 .38 19 2013
3 39 .90 .04 40 .90 .02 20 1999
4 24 .71 .08 26 .68 .10 23 2001
5 42 .75 .10 54 .70 .10 20 2015
6 39 .69 .18 53 .66 .15 21 2015
7 37 .59 .14 37 .56 .13 23 2011
8 42 .65 .16 39 .59 .18 22 2010
9 41 .67 .11 43 .60 .12 20 2011
end
Because our data came from mostly psychology journals, we collected Ns, means and standard deviations, and these values were used in the Stata code, as shown below. The N, mean and standard deviation for the experimental group are given first, and then those values for the control group. The effect size is Hedges’s g (listed as “Type: hedgesg” in the output). We used it because some of the studies included in our metaanalysis had fairly low Ns, and because there is no penalty for using it if it is not needed. The meta esize command declares the data and computes the effect sizes, meta summarize produces the table of results, and meta forestplot produces the forest plot. Let’s look at the commands and then their output.
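For readers who want to see where the effect sizes come from, here is a minimal Python sketch that computes Hedges's g from the Ns, means and standard deviations in the dataset above. It assumes the usual pooled-standard-deviation form of the standardized mean difference and the approximate small-sample correction 1 - 3/(4df - 1); the function name is made up for this example.

```python
import math

# (n, mean, sd) for the experimental group, then (n, mean, sd) for the
# control group, copied from the workshop dataset (Studies 1-9).
rows = [
    (33, .97, .13, 29, .95, .14), (33, .90, .31, 82, .79, .38),
    (39, .90, .04, 40, .90, .02), (24, .71, .08, 26, .68, .10),
    (42, .75, .10, 54, .70, .10), (39, .69, .18, 53, .66, .15),
    (37, .59, .14, 37, .56, .13), (42, .65, .16, 39, .59, .18),
    (41, .67, .11, 43, .60, .12),
]

def hedges_g(n1, m1, sd1, n2, m2, sd2):
    """Standardized mean difference with an approximate small-sample correction."""
    df = n1 + n2 - 2
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    d = (m1 - m2) / pooled_sd     # Cohen's d
    j = 1 - 3 / (4 * df - 1)      # correction factor (assumed approximate form)
    return j * d

g_values = [hedges_g(*row) for row in rows]
for i, g in enumerate(g_values, start=1):
    print(f"Study {i}: g = {g:.3f}")
```

The correction factor is always a little below 1, so g is slightly smaller in magnitude than d, and the difference shrinks as the sample sizes grow.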
meta esize sup_pos_n sup_pos_mean sup_pos_sd sup_neg_n sup_neg_mean sup_neg_sd

Meta-analysis setting information

 Study information
    No. of studies:  9
       Study label:  Generic
        Study size:  _meta_studysize
      Summary data:  sup_pos_n sup_pos_mean sup_pos_sd
                     sup_neg_n sup_neg_mean sup_neg_sd

       Effect size
              Type:  hedgesg
             Label:  Hedges's g
          Variable:  _meta_es
   Bias correction:  Approximate

         Precision
         Std. Err.:  _meta_se
    Std. Err. adj.:  None
                CI:  [_meta_cil, _meta_ciu]
          CI level:  95%

  Model and method
             Model:  Random-effects
            Method:  REML

meta summarize, fixed

  Effect-size label:  Hedges's g
        Effect size:  _meta_es
          Std. Err.:  _meta_se

Meta-analysis summary                     Number of studies =      9
Fixed-effects model                       Heterogeneity:
Method: Inverse-variance                             I2 (%) =   0.00
                                                         H2 =   0.68

--------------------------------------------------------------------
            Study |  Hedges's g    [95% Conf. Interval]    % Weight
------------------+-------------------------------------------------
          Study 1 |       0.147      -0.347       0.640        8.88
          Study 2 |       0.302      -0.101       0.705       13.29
          Study 3 |       0.000      -0.437       0.437       11.33
          Study 4 |       0.325      -0.225       0.874        7.15
          Study 5 |       0.496       0.090       0.902       13.10
          Study 6 |       0.182      -0.229       0.593       12.80
          Study 7 |       0.220      -0.233       0.672       10.56
          Study 8 |       0.350      -0.085       0.785       11.41
          Study 9 |       0.602       0.168       1.035       11.49
------------------+-------------------------------------------------
            theta |       0.297       0.150       0.444
--------------------------------------------------------------------
Test of theta = 0: z = 3.96                    Prob > |z| = 0.0001
Test of homogeneity: Q = chi2(8) = 5.44          Prob > Q = 0.7102
The first column of the output table gives the study label. The next column is labeled “Hedges’s g”, which is the effect size. The next two columns give the 95% confidence interval around the estimate of the effect size. Notice that in this example, all but two of the confidence intervals include 0. The last column gives the “% Weight”. The sum of the % Weight column is, of course, 100. At the bottom of the table, we see the summary effect size (labeled theta), which is 0.297, with a confidence interval of 0.150 to 0.444. Notice that this confidence interval does not include 0. This means that, based on this metaanalysis, the effect size is 0.297, which is small by most measures, but it is not 0. The Test of theta = 0 is given below the table, and it indicates that the z test statistic equals 3.96 with a p-value of 0.0001, which is statistically significant at the alpha = 0.05 level. The next line of the output below the table gives the test of homogeneity. The heterogeneity chi-squared (Q) equals 5.44 on 8 degrees of freedom, with a p-value of 0.7102, which is not statistically significant. Congruent with that is the estimate of I-squared, which is 0.00% (shown immediately above the table). Having an I-squared equal to 0 is somewhat unusual, and it means that there is no point in running any metaregressions (which are used to explain heterogeneity).
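The summary row can be checked by hand. The Python sketch below is an illustration rather than Stata's code: it pools the nine effect sizes with inverse variance weights, again assuming that each standard error can be recovered from the width of the reported 95% confidence interval, and then forms the confidence interval and z test for the summary effect.

```python
import math

# (Hedges's g, lower 95% limit, upper 95% limit) for Studies 1-9.
studies = [
    (0.147, -0.347, 0.640), (0.302, -0.101, 0.705), (0.000, -0.437, 0.437),
    (0.325, -0.225, 0.874), (0.496, 0.090, 0.902), (0.182, -0.229, 0.593),
    (0.220, -0.233, 0.672), (0.350, -0.085, 0.785), (0.602, 0.168, 1.035),
]
Z_95 = 1.959964

weights = [1 / ((hi - lo) / (2 * Z_95)) ** 2 for _, lo, hi in studies]

theta = sum(w * g for w, (g, _, _) in zip(weights, studies)) / sum(weights)
se_theta = math.sqrt(1 / sum(weights))      # SE of the fixed-effects summary
ci_low = theta - Z_95 * se_theta
ci_high = theta + Z_95 * se_theta
z = theta / se_theta                        # test of theta = 0
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(f"theta = {theta:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
print(f"z = {z:.2f}, p = {p_two_sided:.4f}")
```

Small differences from the Stata output are expected because the inputs here are rounded to three decimals.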
meta forestplot, fixed title("Fixedeffects metaanalysis")
The forest plot shows essentially the same information as the table. The summary effect of 0.297 is shown with the red dotted line and the blue rectangle at the bottom of the graph. For each study, a square shows its place on the scale and the confidence interval is represented by the line on either side of the square. For Study 6, there is an arrow on the right side of the confidence interval, which indicates that the confidence interval is wider on that side than the highest value on the scale (but that is difficult to see because of the rounding of the values for the CIs). Lots of options can be added to make the forest plot ready for publication, and almost all published metaanalyses include a forest plot.
Fixed versus random effects
Quoting from page 183 of Borenstein et al.:
“In this volume, we have been using the term fixed effect to mean that the effect is identical (fixed) across all relevant studies (within the full population, or within a subgroup).
In fact the use of the term fixed effect in connection with metaanalysis is at odds with the usual meaning of fixed effects in statistics. A more suitable term for the fixed-effect metaanalysis might be a common-effect metaanalysis. The term fixed effects is traditionally used in another context with a different meaning. Concretely, we can talk about the subgroups as being fixed in the sense of fixed rather than random. This has important implications when talking about subgroup analyses, because in that context, a mixed effect metaanalysis means a fixed effect subgroup across groups while a random-effects model was used for the within-group analysis.”
The difference between a fixed and a random effects metaanalysis is an important one, and it is one of the few decisions we have not yet mentioned. Fixed effect models and random effect models make different assumptions, and you should choose between these options based on your assessment of how well your data meet the assumptions of these types of models.
Fixed effect models are appropriate if two conditions are satisfied. The first is that all of the studies included in the metaanalysis are identical in all important aspects. Secondly, the purpose of the analysis is to compute the effect size for a given population, not to generalize the results to other populations. Let’s talk about these two points. How reasonable do they seem to you?
On the other hand, you may think that because the research studies were conducted by independent researchers, there is no reason to believe that the studies are functionally equivalent. In other words, given that the studies gathered data from different subjects, used different interventions and/or different measures, it might not make sense to assume that there is a common effect size. Also, given the differences between the studies, you might want to generalize your results to a range of similar (but not identical) situations or scenarios.
In an ideal world, you choose between a fixed and random effects metaanalysis based on the assumptions that you are willing to make. In reality, though, other considerations are also important. One of those considerations is the size of your dataset. Just as a study with a small N is unlikely to capture the true amount of variability in the population, so a metaanalysis with few studies is unlikely to produce a precise estimate of the between-studies variance. This means that even if we preferred the random-effects model, our dataset may not contain the necessary information. You have some options in this situation, but each option comes with a down side. One option is to report the effect sizes for each study but omit the summary effect size. The down side is that some readers (and possibly journal reviewers) will not understand that conclusions should not be drawn from the summary effect and its confidence interval, and will wonder why it has been omitted. They may even resort to “vote counting” and possibly come to the wrong conclusion. Conducting a fixed-effects analysis is another option, but the down side to this is that you really wanted to generalize your results to a larger population, so this analysis doesn’t serve your purpose well. Another possible down side is that readers may generalize your results, even if you state that such a generalization is not warranted. A third, and possibly the best option, is to run the analysis as a Bayesian metaanalysis. In this approach, the estimate of tau-squared is based on data from beyond the studies included in the current analysis. The down side to this option is that most analysts have little or no experience with this type of analysis, not to speak of using the software and/or procedures to run the analysis.
Here is the syntax for a randomeffects metaanalysis using our example data.
meta summarize, random

  Effect-size label:  Hedges's g
        Effect size:  _meta_es
          Std. Err.:  _meta_se

Meta-analysis summary                     Number of studies =      9
Random-effects model                      Heterogeneity:
Method: REML                                           tau2 = 0.0000
                                                     I2 (%) =   0.00
                                                         H2 =   1.00

--------------------------------------------------------------------
            Study |  Hedges's g    [95% Conf. Interval]    % Weight
------------------+-------------------------------------------------
          Study 1 |       0.147      -0.347       0.640        8.88
          Study 2 |       0.302      -0.101       0.705       13.29
          Study 3 |       0.000      -0.437       0.437       11.33
          Study 4 |       0.325      -0.225       0.874        7.15
          Study 5 |       0.496       0.090       0.902       13.10
          Study 6 |       0.182      -0.229       0.593       12.80
          Study 7 |       0.220      -0.233       0.672       10.56
          Study 8 |       0.350      -0.085       0.785       11.41
          Study 9 |       0.602       0.168       1.035       11.49
------------------+-------------------------------------------------
            theta |       0.297       0.150       0.444
--------------------------------------------------------------------
Test of theta = 0: z = 3.96                    Prob > |z| = 0.0001
Test of homogeneity: Q = chi2(8) = 5.44          Prob > Q = 0.7102
meta forestplot, random title("Randomeffects metaanalysis")
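In the example data the estimate of tau-squared is 0, so the random-effects table is identical to the fixed-effects one. To show what changes when tau-squared is greater than 0, here is a Python sketch on hypothetical data. It uses the method-of-moments estimate T^{2} = (Q – df)/C from the heterogeneity section rather than REML, so it illustrates the idea and is not a reproduction of Stata's method.

```python
def pooled(effects, weights):
    """Weighted mean of the effect sizes."""
    return sum(w * y for w, y in zip(weights, effects)) / sum(weights)

def moment_tau2(effects, variances):
    """Method-of-moments estimate of tau-squared, truncated at 0."""
    w = [1 / v for v in variances]
    m = pooled(effects, w)
    q = sum(wi * (yi - m) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    return max(0.0, (q - (len(effects) - 1)) / c)

# Hypothetical studies: the most precise study has the smallest effect.
effects = [0.1, 0.3, 0.5, 0.9]
variances = [0.01, 0.02, 0.04, 0.08]

tau2 = moment_tau2(effects, variances)
fixed_w = [1 / v for v in variances]            # within-study variance only
random_w = [1 / (v + tau2) for v in variances]  # within + between variance

fixed_pct = [100 * w / sum(fixed_w) for w in fixed_w]
random_pct = [100 * w / sum(random_w) for w in random_w]
fixed_summary = pooled(effects, fixed_w)
random_summary = pooled(effects, random_w)

print("fixed  % weights:", [round(p, 1) for p in fixed_pct], round(fixed_summary, 3))
print("random % weights:", [round(p, 1) for p in random_pct], round(random_summary, 3))
```

Adding the same tau-squared to every study's variance makes the weights more nearly equal, so small studies count for relatively more under the random-effects model and the summary effect moves toward them.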
Potential sources of bias
As mentioned previously, it is really important to include all of the relevant studies in a metaanalysis. However, some may be unobtainable because of publication bias. Publication bias is the bias by publishers of academic journals to prefer to publish studies reporting statistically significant results rather than studies reporting statistically nonsignificant results. In a similar vein, researchers may be loath to write up a paper reporting statistically nonsignificant results on the belief that the paper is more likely to be rejected. The effect on a metaanalysis is that there could be missing data (i.e., unit nonresponse), and these missing data bias the sample of studies included in the metaanalysis. This, of course, leads to a biased estimate of the summary effect. One other point to keep in mind: For any given sample size, the result is more likely to be statistically significant if the effect size is large. Hence, publication bias refers to both statistically significant results and large effect sizes. There are other types of bias that should also be considered. These include:

 Language bias: English-language databases and journals are more likely to be searched (does someone on your research team speak/read another language, and do you have access to journals in that language?)

 Availability bias: including those studies that are easiest for the metaanalyst to access (To which journals does your university subscribe?)

 Cost bias: including those studies that are freely available or lowest cost (To which journals does your university subscribe? Elsevier debate)

 Familiarity bias: including studies from one’s own field of research (an advantage to an interdisciplinary research team)

 Duplication bias: multiple similar studies reporting statistical significance are more likely to be published (checking reference sections for articles)

 Citation bias: studies with statistically significant results are more likely to be cited and hence easier to find (checking reference sections for articles)
Tests for publication bias
Two tests are often used to test for bias. They were proposed by Begg et al. (1994) and Egger et al. (1997). However, both of these tests suffer from several limitations. First, the tests (and the funnel plot itself) may yield different results simply by changing the metric of the effect size. Second, a reasonable number of studies must be included in the analysis, and those studies must have a reasonable amount of dispersion. Finally, these tests are often underpowered; therefore, a nonsignificant result does not necessarily mean that there is no bias.
Tests and graphs to detect publication bias
The meta bias command can be used to assess bias. The effect size and its standard error have already been declared (they are listed at the top of the output), and the test is chosen with an option such as egger, harbord or peters. The harbord and peters options can only be used with binary data. In the example below, the egger option is used. The results indicate that bias may not be a problem, because the test is nonsignificant (p = 0.6813).
meta bias, egger

  Effect-size label:  Hedges's g
        Effect size:  _meta_es
          Std. Err.:  _meta_se

Regression-based Egger test for small-study effects
Random-effects model
Method: REML

H0: beta1 = 0; no small-study effects
            beta1 =     -1.52
      SE of beta1 =     3.703
                z =     -0.41
       Prob > |z| =    0.6813
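The idea behind the regression-based test can be sketched in Python. This is not Stata's implementation: it assumes the test amounts to a weighted regression of the effect sizes on their standard errors with inverse variance weights, which seems reasonable here only because tau-squared was estimated as 0. The standard errors are again backed out of the reported confidence intervals, and the variable names are made up.

```python
import math

# (Hedges's g, lower 95% limit, upper 95% limit) for Studies 1-9.
studies = [
    (0.147, -0.347, 0.640), (0.302, -0.101, 0.705), (0.000, -0.437, 0.437),
    (0.325, -0.225, 0.874), (0.496, 0.090, 0.902), (0.182, -0.229, 0.593),
    (0.220, -0.233, 0.672), (0.350, -0.085, 0.785), (0.602, 0.168, 1.035),
]
Z_95 = 1.959964

y = [g for g, _, _ in studies]
se = [(hi - lo) / (2 * Z_95) for _, lo, hi in studies]   # predictor
w = [1 / s ** 2 for s in se]                             # inverse variance weights

# Weighted least squares of y on se, written out via the normal equations.
sw = sum(w)
swx = sum(wi * xi for wi, xi in zip(w, se))
swy = sum(wi * yi for wi, yi in zip(w, y))
swxx = sum(wi * xi * xi for wi, xi in zip(w, se))
swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, se, y))
det = sw * swxx - swx ** 2

beta1 = (sw * swxy - swx * swy) / det      # slope: do effects grow with SE?
se_beta1 = math.sqrt(sw / det)             # SE treating the variances as known
z = beta1 / se_beta1
p = math.erfc(abs(z) / math.sqrt(2))       # two-sided normal p-value

print(f"beta1 = {beta1:.2f}, SE = {se_beta1:.3f}, z = {z:.2f}, p = {p:.4f}")
```

A slope near 0 means the effect sizes do not change systematically with study precision. The magnitudes printed here should land close to those in the Stata output, with small differences due to rounding of the inputs.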
A funnel plot can also be used as a method for investigating publication bias. Remember that the effect size is usually on the x-axis and the sample size or variance on the y-axis with the largest sample size or smallest variance at the top. If there is no publication bias, then the studies will be distributed evenly around the mean effect size. Smaller studies will appear near the bottom because they will have more variance than the larger studies (which are at the top of the graph). If there is publication bias, then there will seem to be a few studies missing from the middle left of the graph, and very few, if any, studies in the lower left of the graph. (The lower left being where small studies reporting small effect sizes would be.)
meta funnelplot, fixed
We can also create a funnel plot for the metaanalysis with random effects.
meta funnelplot, random
A contour-enhanced funnel plot can also be made using the contours() option. The numbers in parentheses give the significance levels (in percent) that define the contour regions.
meta funnelplot, random contours(1 5 10)
Other approaches to publication bias
Rosenthal’s fail-safe N
In the output from our metaanalysis, we saw the summary effect and a p-value which indicated if the effect was statistically significantly different from 0. In the presence of publication bias, this summary effect would be larger than it should be. If the missing studies were included in the analysis (with no publication bias), the summary effect might no longer be statistically significant. Rosenthal’s idea was to calculate how many studies would need to be added to the metaanalysis in order to render the summary effect nonsignificant. If only a few studies were needed to render our statistically significant summary effect nonsignificant, then we should be quite worried about our observed result. However, if it took a large number of studies to make our summary effect nonsignificant, then we wouldn’t be too worried about the possible publication bias. There are some drawbacks to Rosenthal’s approach. First, it focuses on statistical significance rather than practical, or real world, significance. As we have seen, there can be quite a difference between these two. Second, it assumes that the mean of the missing effect sizes is 0, but it could be negative or slightly positive. If it was negative, then fewer studies would be needed to render our summary effect nonsignificant. On a more technical note, Rosenthal’s fail-safe N is calculated using a methodology that was acceptable when he proposed his measure, but isn’t considered acceptable today.
Orwin’s fail-safe N
Orwin proposed a modification of Rosenthal’s fail-safe N that addresses the first two limitations mentioned above. Orwin’s fail-safe N allows researchers to specify the lowest summary effect size that would still be meaningful, and it allows researchers to specify the mean effect size of the missing studies.
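The workshop does not give Rosenthal's formula, so the Python sketch below relies on the commonly cited form, N = (sum of z)^2 / z_alpha^2 – k, with a one-tailed alpha of .05 (z_alpha = 1.645); treat that choice as an assumption. Each study's z is its effect size divided by its standard error, with the standard errors backed out of the confidence intervals in the workshop's output table.

```python
# (Hedges's g, lower 95% limit, upper 95% limit) for Studies 1-9.
studies = [
    (0.147, -0.347, 0.640), (0.302, -0.101, 0.705), (0.000, -0.437, 0.437),
    (0.325, -0.225, 0.874), (0.496, 0.090, 0.902), (0.182, -0.229, 0.593),
    (0.220, -0.233, 0.672), (0.350, -0.085, 0.785), (0.602, 0.168, 1.035),
]
Z_95 = 1.959964     # used only to recover standard errors from the 95% CIs
Z_ALPHA = 1.645     # assumed one-tailed critical value for alpha = .05

# z for each study = effect size / standard error
z_scores = [g / ((hi - lo) / (2 * Z_95)) for g, lo, hi in studies]
k = len(z_scores)

# Number of additional studies averaging z = 0 needed to pull the
# combined (Stouffer) z down to the critical value.
fail_safe_n = sum(z_scores) ** 2 / Z_ALPHA ** 2 - k

print(f"Rosenthal's fail-safe N is about {fail_safe_n:.0f} studies")
```

Under these assumptions, roughly forty null studies would have to be sitting in file drawers before the combined result stopped being significant, which is several times the nine studies actually found.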
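Orwin's version is usually written as N = k (d_obs – d_crit) / (d_crit – d_missing). The Python sketch below assumes that form; the function name and the "smallest meaningful effect" of 0.10 are hypothetical choices made only for illustration, while k = 9 and the observed summary effect of 0.297 come from the example output.

```python
def orwin_fail_safe_n(k, observed_mean, criterion, missing_mean=0.0):
    """Studies averaging `missing_mean` needed to pull the mean effect down to `criterion`."""
    if criterion <= missing_mean:
        raise ValueError("criterion must be larger than the mean of the missing studies")
    return k * (observed_mean - criterion) / (criterion - missing_mean)

# Hypothetical criterion: summary effects below 0.10 are treated as trivial.
n_if_missing_are_null = orwin_fail_safe_n(9, 0.297, 0.10, missing_mean=0.0)
n_if_missing_are_negative = orwin_fail_safe_n(9, 0.297, 0.10, missing_mean=-0.05)

print(f"missing studies average  0.00: {n_if_missing_are_null:.1f} studies")
print(f"missing studies average -0.05: {n_if_missing_are_negative:.1f} studies")
```

The second call shows that when the missing studies are assumed to have a slightly negative mean effect, fewer of them are needed to drag the summary effect below the criterion.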
Duval and Tweedie’s trim and fill
Duval and Tweedie’s trim and fill method is an iterative procedure that tries to estimate what the summary effect size would be if there was no publication bias. To understand how this is done, think of a funnel plot that shows publication bias, meaning that there are studies in the lower right of the plot but few, if any, on the lower left. To estimate the new summary effect, the procedure “trims” the most extreme study from the lower right of the plot and recalculates the summary effect size. In theory, once all of the extreme effect sizes have been “trimmed,” an unbiased summary effect can be calculated. When trying to calculate the standard error around the new summary effect, this “trimming” has caused a substantial decrease in the variance. To account for this, the studies that were “trimmed” are added back (a process called “filling”) so that a more reasonable standard error can be calculated. To be clear, the “trimming” process is used only in the calculation of the new, unbiased, summary effect size, while the “filling” process is used only in the calculation of the standard error around the new, unbiased, summary effect size. The advantages to this approach are that it gives an estimate of the unbiased effect size, and there is usually a graph associated with it that is easy to understand (it usually includes the imputed studies). The disadvantages include a strong assumption about why the missing studies are missing and that one or two really aberrant studies can have substantial influence on the results.
The meta trimfill command can be used.
meta trimfill, random

  Effect-size label:  Hedges's g
        Effect size:  _meta_es
          Std. Err.:  _meta_se

Nonparametric trim-and-fill analysis of publication bias
Linear estimator, imputing on the right

Iteration                            Number of studies =      9
  Model: Random-effects                       observed =      9
 Method: REML                                  imputed =      0

Pooling
  Model: Random-effects
 Method: REML

---------------------------------------------------------------
             Studies |  Hedges's g    [95% Conf. Interval]
---------------------+-----------------------------------------
            Observed |       0.297       0.150       0.444
  Observed + Imputed |       0.297       0.150       0.444
---------------------------------------------------------------
Cumulative metaanalysis and forest plot
A cumulative metaanalysis is an iterative process in which the metaanalysis is run with the first study only, and then with the first and second studies only, and so on. The same is true for the creation of the forest plot. The first line in the table of the cumulative metaanalysis shows the summary effect based on only the first study. The second line in the table shows the summary effect based on only the first two studies, and so on. Of course, the final summary effect will be the same as from the regular metaanalysis, because both are based on all of the studies. The studies can be sorted in different ways to address different questions. For example, if you want to look at only the largest studies to see when the estimate of the summary effect size stabilizes, you sort the studies based on N. Or you might be interested in sorting the studies by year of publication. In this scenario, let’s say that we are interested in a surgical technique used with folks who have experienced a heart attack. It is really important to know if this surgical technique increases life expectancy, so many studies are done. The question is, at what point in time have enough studies been done to answer this question? A third use for a cumulative metaanalysis is as a method to detect publication bias. For this, you would sort the studies from most to least precise. You might suspect publication bias if the effects in the most precise studies were small but increased as the less precise studies were added. The forest plot would show not only whether there was a shift, but also the magnitude of the shift.
Cumulative metaanalysis can be done with the meta summarize command with the cumulative option. In the example below, the random option was used to specify a randomeffects metaanalysis. In the parentheses after the cumulative option is the name of the variable by which the metaanalysis is cumulative. In this example, it is the publication date.
meta summarize, random cumulative(pubdate)

  Effect-size label:  Hedges's g
        Effect size:  _meta_es
          Std. Err.:  _meta_se

Cumulative meta-analysis summary          Number of studies =      9
Random-effects model
Method: REML
Order variable: pubdate

-------------------------------------------------------------------------
            Study |  Hedges's g    [95% Conf. Interval]  P-value  pubdate
------------------+------------------------------------------------------
          Study 3 |       0.000      -0.437       0.437    1.000     1999
          Study 4 |       0.126      -0.216       0.468    0.472     2001
          Study 1 |       0.132      -0.149       0.413    0.356     2008
          Study 8 |       0.196      -0.040       0.432    0.103     2010
          Study 7 |       0.201      -0.008       0.411    0.059     2011
          Study 9 |       0.277       0.089       0.466    0.004     2011
          Study 2 |       0.282       0.111       0.452    0.001     2013
          Study 5 |       0.314       0.156       0.471    0.000     2015
          Study 6 |       0.297       0.150       0.444    0.000     2015
-------------------------------------------------------------------------

meta forestplot, random cumulative(pubdate)
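The logic of the cumulative table can be sketched in Python as a running inverse-variance pooled estimate. This is an illustration under assumptions, not Stata's procedure: it uses fixed-effects pooling at every step (Stata refits a random-effects REML model each time) and recovers the standard errors from the 95% confidence intervals in the earlier summary table.

```python
# (study id, pubdate, Hedges's g, lower 95% limit, upper 95% limit)
studies = [
    (1, 2008, 0.147, -0.347, 0.640), (2, 2013, 0.302, -0.101, 0.705),
    (3, 1999, 0.000, -0.437, 0.437), (4, 2001, 0.325, -0.225, 0.874),
    (5, 2015, 0.496, 0.090, 0.902), (6, 2015, 0.182, -0.229, 0.593),
    (7, 2011, 0.220, -0.233, 0.672), (8, 2010, 0.350, -0.085, 0.785),
    (9, 2011, 0.602, 0.168, 1.035),
]
Z_95 = 1.959964

# sorted() is stable, so studies with the same pubdate keep their id order.
ordered = sorted(studies, key=lambda s: s[1])

sum_w = 0.0
sum_wg = 0.0
cumulative = []   # (study id, pubdate, pooled estimate so far)
for study_id, year, g, lo, hi in ordered:
    w = 1 / ((hi - lo) / (2 * Z_95)) ** 2   # inverse variance weight
    sum_w += w
    sum_wg += w * g
    cumulative.append((study_id, year, sum_wg / sum_w))

for study_id, year, estimate in cumulative:
    print(f"through Study {study_id} ({year}): {estimate:.3f}")
```

Because the heterogeneity in these data is essentially 0, the running estimates should track the Hedges's g column of the cumulative table closely, ending at the same summary effect as the regular metaanalysis.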
Small-study effect
Sterne et al. (2001) coined the term “small study effect” to describe the phenomenon that smaller studies tend to have larger effect sizes. They were very careful to point out that there is no way to know why this is true. It could be publication bias, or it could be that the smaller studies, especially if they were the first studies done, included subjects who were more ill, more motivated, more something, than the later-conducted studies that included more subjects. It is also possible that the smaller studies had better quality control. In the end, any one of these reasons, other reasons, or any combination thereof may explain why the smaller studies reported larger effects. This is important to remember when writing up results.
Subgroup analysis and metaregression
Let’s stop and think about the variability in our metaanalysis dataset. Up until now, we have assumed that the variability was caused by random error (sampling error in the individual studies) or some other, as yet undiscussed, source of variability. These other sources of variability could be that some studies in the metaanalysis compared drug A with dosage X to placebo, while the other studies compared drug A with dosage 2X to placebo. Or perhaps some of the studies included only females, while the rest included only males. In other words, there is some possibility that perhaps all of the studies shouldn’t be combined into a single analysis, but rather analyzed separately. This is called a subgroup analysis. In primary studies, you would be considering t-tests and ANOVAs for such an analysis. In metaanalysis, you do something similar, but you just have more options of types of models. One option is to use a fixed-effects model. A second option is to use a random-effects model using an estimate of tau-squared from each group. A third option is to use a random-effects model using a pooled estimate of tau-squared. In addition to these options, you also need to choose your method: a Z-test, a Q-test based on an analysis of variance, or a Q-test for heterogeneity. Any of the methods can be used with any of the models, creating nine different possibilities. Your decision regarding choice of model should be based on what you know about your dataset. Your choice of method should be based on the type of information you are seeking about your data and the type of conclusion you wish to draw. Another decision you need to make is if a summary effect (from all groups combined) should be reported, or if only the summary effects for each group should be reported. Again, this will depend on your data and your purpose. Please see Chapter 19 in Borenstein et al. for a complete discussion and examples.
When analyzing primary data, a regression model is used when there are one or more predictors to be associated with an outcome. With a meta-regression, the predictors are at the level of the study, and the outcome is the effect size. The purpose of the meta-regression is to explain the heterogeneity. Of course, this presumes that there is heterogeneity to be explained. In my meta-analysis, there was no heterogeneity, so a meta-regression wasn’t even an option. You also need to consider the size of your meta-analysis dataset. In my dataset, there were only eleven studies, so even if there were some heterogeneity to be explained, I would have had difficulty running a meta-regression, and at most one predictor could be included. If there is heterogeneity to be explained and the dataset is large enough, though, almost all of the regression techniques are available, including the use of interaction terms, quadratic terms, logistic regression, etc. Of course, the limitations encountered when analyzing primary data also apply to meta-regressions, such as alpha inflation (from multiple comparisons) and low power.
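For readers who find notation helpful, a random-effects meta-regression with a single study-level predictor is commonly written in the form below. This is a generic textbook form rather than anything specific to the workshop data, and the symbols are labels chosen here for illustration: y_j is the observed effect size for study j, x_j is the study-level predictor, v_j is the within-study sampling variance (the squared standard error), and \tau^2 is the between-study variance left unexplained by the predictor.

```latex
y_j = \beta_0 + \beta_1 x_j + u_j + \varepsilon_j, \qquad
u_j \sim N(0, \tau^2), \qquad
\varepsilon_j \sim N(0, v_j)
```

Written this way, it is easier to see that a predictor can only reduce the u_j part of the variability; the sampling error \varepsilon_j remains no matter which predictors are included.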
Be aware that R^{2} does not have the same meaning in a meta-analysis that it does in a primary analysis. In a primary analysis, R^{2} could equal 1 if a predictor or a group of predictors could account for all of the variability in the outcome variable. Such a model is extremely unlikely, but it is possible. In a meta-analysis, the predictors can never account for all of the observed variance, because no model, not even a theoretical one, could do so. This is because at least some of the variance is sampling variance (i.e., random error), and this cannot be explained by predictors in the model. In other words, if R^{2} were defined in terms of the total observed variance, its upper limit would be something less than 1, and that limit would not be the same from one meta-analysis to the next. For this reason, in a meta-analysis R^{2} is defined using only the true variance: R^{2} = T^{2}_{explained} / T^{2}_{total}, where T^{2} = true variance.
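The ratio above can be worked through with a small made-up example. The sketch assumes that the explained true variance is the total true variance minus the true variance that remains after the predictors are included; the numbers 0.08 and 0.02 are hypothetical and do not come from the workshop data.

```latex
R^2 = \frac{T^2_{\text{explained}}}{T^2_{\text{total}}}
    = \frac{T^2_{\text{total}} - T^2_{\text{unexplained}}}{T^2_{\text{total}}}
    = 1 - \frac{T^2_{\text{unexplained}}}{T^2_{\text{total}}}

% Hypothetical values: T^2_total = 0.08, T^2_unexplained = 0.02
R^2 = 1 - \frac{0.02}{0.08} = 0.75
```

In this hypothetical case the predictors account for 75% of the true (between-study) variance, which says nothing about the sampling variance within the studies.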
A meta-regression can be run in Stata 16 with the meta regress command. Although there is no heterogeneity in these data for a meta-regression to explain, an example of the command and its output is given below, followed by the estat bubbleplot command, which graphs the effect sizes against the predictor. The command meta summarize, subgroup() can be used to do a subgroup analysis. An example using these data is not given, as the dataset has no appropriate variable for a subgroup analysis.
meta regress db

  Effect-size label:  Hedges's g
        Effect size:  _meta_es
          Std. Err.:  _meta_se

Random-effects meta-regression                      Number of obs  =          9
Method: REML                                        Residual heterogeneity:
                                                                tau2 =  2.2e-07
                                                              I2 (%) =     0.00
                                                                  H2 =     1.00
                                                       R-squared (%) =    72.19
                                                    Wald chi2(1)   =       0.02
                                                    Prob > chi2    =     0.8861
------------------------------------------------------------------------------
    _meta_es |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          db |   .0063368   .0442535     0.14   0.886    -.0803984     .093072
       _cons |   .1671164   .9097733     0.18   0.854    -1.616006    1.950239
------------------------------------------------------------------------------
Test of residual homogeneity: Q_res = chi2(7) = 5.41   Prob > Q_res = 0.6095

estat bubbleplot
Concluding remarks
Clearly, the purpose of today’s discussion has been to introduce you to the what, why, and how of meta-analysis. But I hope that it also makes you think about how you report your primary research. For example, many researchers conduct an a priori power analysis. When you do so, you need to guess at your effect size. When your study is complete and the data have been analyzed, perhaps you should calculate the effect size (you might need to include this in your paper anyway). Compare your calculated effect size with the guess you made when running the power analysis. How accurate was your guess? If you do this with each of your primary studies, you might find that you tend to under- or overestimate effect sizes when running power analyses. Also, when you report an effect size, be sure to include its standard error. You may also want to include a few more descriptive statistics in your paper, as a future meta-analyst may be looking at something slightly different than you were.
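The comparison between a planned and an observed effect size might be sketched in Stata as follows. Everything here is hypothetical: the planned standardized difference of 0.5, the simulated two-group data, and the variable names y and group are assumptions made only for illustration.

```stata
* A priori power analysis: sample size needed to detect a guessed
* standardized mean difference of 0.5 (means 0 and 0.5, SD 1) with 80% power
power twomeans 0 0.5, sd(1) power(0.8)

* Simulated stand-in for the completed study (hypothetical data)
clear
set seed 12345
set obs 100
generate group = mod(_n, 2)
generate y = rnormal(0.5*group, 1)

* Observed effect sizes (Cohen's d, Hedges's g, and others) with
* confidence intervals, to compare against the guess used above
esize twosample y, by(group) all
```

Note that the sign of the reported effect size depends on which group is subtracted from which, so check the direction before comparing it with the value used in the power analysis.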
Two final points. First, try to be responsive to requests for additional information about studies that you have published. Second, make sure that older data and syntax/command/script files remain readable with current technology. For example, I still have ZIP disks. Finding a ZIP drive isn’t that difficult, but finding drivers is a little more challenging. My point is that you need to pay attention to both hardware and software when maintaining older files.
While we have discussed many topics, there are still quite a few that we have not discussed, such as multiple imputation for missing data, multilevel modeling, psychometric meta-analysis, network meta-analysis, SEM, generalized SEM, and power analyses. These are more advanced topics that are beyond the scope of an introductory workshop. On a final note, meta-analysis is an area of statistics that is evolving rapidly. This means that reviewing the current guidelines for conducting and reporting a meta-analysis is especially important, even if you have published meta-analyses in the past.
References
Borenstein, M., Hedges, L. V., Higgins, J. P. T., and Rothstein, H. R. (2009). Introduction to Meta-Analysis. Wiley: United Kingdom.
Borenstein, M. (2019). Common Mistakes in Meta-Analysis and How to Avoid Them. Biostat, Inc.: Englewood, New Jersey.
Cleophas, T. J. and Zwinderman, A. H. (2017). Modern Meta-Analysis: Review and Update of Methodologies. Springer International: Switzerland.
CONSORT: CONsolidated Standards Of Reporting Trials (http://www.consort-statement.org/ and http://www.equator-network.org/reporting-guidelines/consort/ )
Countway Library of Medicine: https://guides.library.harvard.edu/meta-analysis/guides
Dixon (1953), “Processing Data for Outliers,” Biometrics, Vol. 9, No. 1, pp. 74-89.
Dixon and Massey (1957), “Introduction to Statistical Analysis,” Second Edition, McGraw-Hill, pp. 275-278.
Downs, S. H. and Black, N. (1998). The feasibility of creating a checklist for the assessment of the methodological quality both of randomised and non-randomised studies of health care interventions. Journal of Epidemiology and Community Health. Vol. 52, pages 377-384.
Duval, S. and Tweedie, R. (2000). Trim and fill: A simple funnel-plot-based method of testing and adjusting for publication bias in meta-analysis. Biometrics (June). Vol. 56(2): 455-463. (https://www.ncbi.nlm.nih.gov/pubmed/10877304 )
Egger, M., Davey Smith, G., Schneider, M., and Minder, C. (1997). Bias in Meta-Analysis Detected by a Simple, Graphical Test. British Medical Journal (Sept. 13). Vol. 315(7109): 629-634. (https://www.ncbi.nlm.nih.gov/pubmed/9310563 )
Grubbs, F. E. (1950). Sample criteria for testing outlying observations. Ann. Math. Stat. 21, 27-58.
Grubbs, F. E. (1969). Procedures for detecting outlying observations in samples. Technometrics, 11(1), 1-21.
Grubbs, F. E. and Beck, G. (1972). Extension of sample sizes and percentage points for significance tests of outlying observations. Technometrics, 14, 847-854.
Hedges, L. V. and Pigott, T. D. (2001). The Power of Statistical Tests in Meta-Analysis. Psychological Methods. Vol. 6(3): 203-217. (https://www.ncbi.nlm.nih.gov/pubmed/11570228 )
MOOSE: Meta-analysis Of Observational Studies in Epidemiology (http://statswrite.eu/pdf/MOOSE%20Statement.pdf and http://www.ijo.in/documents/14MOOSE_SS.pdf )
Orwin, R. G. (1983). A fail-safe N for effect size in meta-analysis. Journal of Educational Statistics. Vol. 8(2) Summer, 1983, 157-159.
Palmer, T. M. and Sterne, J. A. C. (2009). Meta-Analysis in Stata: An Updated Collection from the Stata Journal, Second Edition. Stata Press: College Station, TX.
PRISMA: Preferred Reporting Items for Systematic reviews and Meta-Analyses (http://www.prisma-statement.org/ )
QUOROM: QUality Of Reporting Of Meta-analyses (https://journals.plos.org/plosntds/article/file?type=supplementary&id=info:doi/10.1371/journal.pntd.0000381.s002 )
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin. Vol. 86(3) May 1979, 638-641.
Stata blog: https://blog.stata.com/2013/09/05/measures-of-effect-size-in-stata-13/
Sterne, J. A. and Davey Smith, G. (2001). Sifting the evidence: what’s wrong with significance tests? British Medical Journal (Jan. 27). Vol. 322(7280): 226-231. (https://www.ncbi.nlm.nih.gov/pubmed/11159626 )
STROBE: Strengthening The Reporting of OBservational studies in Epidemiology (https://www.strobe-statement.org/index.php?id=strobe-home and https://www.strobe-statement.org/index.php?id=available-checklists )
https://www.psychometrica.de/effect_size.html .
http://faculty.ucmerced.edu/wshadish/software/escomputerprogram
Wikipedia: https://en.wikipedia.org/wiki/Effect_size .