
Good review practice: a researcher guide to systematic review methodology in the sciences of food and health


Data synthesis and summary


Data synthesis involves combining the findings of the primary studies and, where possible and appropriate, applying statistical analysis to numerical data. Synthesis methods vary with the nature of the evidence (quantitative, qualitative, or mixed), the aim of the review, and the types and designs of the included studies. Reviewers should select the method of analysis, based on the review question, at the protocol development stage.

Synthesis Methods

Narrative summary: a summary of the review results used when meta-analysis is not possible. Narrative summaries describe the results of the review, although some take a more interpretive approach. [8] These interpretive summaries are known as "evidence statements" and can incorporate the results of quality appraisal and weighting processes and report the ratings of the studies.

Meta-analysis: a quantitative synthesis of the results of the included studies, using statistical methods that extend those used in primary studies. [9] Meta-analysis can provide a more precise estimate of the outcomes by measuring and accounting for the uncertainty of the outcomes in the individual studies. However, statistical analysis is not always feasible, for reasons including inadequate data, heterogeneous data, poor quality of the included studies and the level of complexity. [10]
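As an illustration only (a minimal sketch with made-up study names and numbers, not part of this guide), the following Python snippet shows one common way such pooling is done: fixed-effect, inverse-variance weighting of log odds ratios.

```python
import math

# Hypothetical per-study effect estimates (log odds ratios) and standard errors.
studies = [
    ("Study A", 0.40, 0.15),
    ("Study B", 0.25, 0.20),
    ("Study C", 0.55, 0.10),
]

# Fixed-effect (inverse-variance) pooling: each study is weighted by 1/SE^2.
weights = [1 / se**2 for _, _, se in studies]
pooled = sum(w * est for (_, est, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

# 95% confidence interval on the log odds ratio scale, then back-transformed.
lo, hi = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"Pooled OR = {math.exp(pooled):.2f} "
      f"(95% CI {math.exp(lo):.2f} to {math.exp(hi):.2f})")
```

A random-effects model extends this idea by adding an estimate of the between-study variance to each study's weight.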

Qualitative Data Synthesis (QDS): a method of identifying common themes across qualitative studies, achieving a greater degree of conceptual development than narrative reviews. Key concepts are identified through a staged process: the interpretations reported by participants in the primary studies (first order) are interpreted by the primary researchers (second order), and these are in turn interpreted by the reviewers into explanations and new hypotheses (third order). [11]

Mixed methods synthesis: an advanced method of data synthesis developed by the EPPI-Centre to better understand the meaning of quantitative findings. A review of user evaluations is conducted in parallel with the traditional systematic review, and the findings of the two syntheses are combined to identify and provide clear directions for practice. [11]




Data synthesis

Data synthesis overview

Now that you have extracted your data, the next step is to synthesise it.


Forest plot example

If you have conducted a meta-analysis, you can present a summary of each study and your overall findings in a forest plot.
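As a rough, hypothetical sketch (invented study labels and numbers, and matplotlib rather than dedicated meta-analysis software), a basic forest plot can be drawn like this:

```python
import matplotlib.pyplot as plt

# Hypothetical study estimates (odds ratios) with 95% confidence intervals.
labels = ["Study A", "Study B", "Study C", "Pooled estimate"]
estimates = [1.49, 1.28, 1.73, 1.52]
ci_low = [1.11, 0.87, 1.42, 1.30]
ci_high = [2.00, 1.90, 2.12, 1.78]

fig, ax = plt.subplots(figsize=(6, 3))
positions = list(range(len(labels), 0, -1))  # Study A at the top, pooled estimate at the bottom
for y, est, lo, hi in zip(positions, estimates, ci_low, ci_high):
    ax.plot([lo, hi], [y, y], color="black")   # horizontal bar = confidence interval
    ax.plot(est, y, "s", color="black")        # square marker = point estimate
ax.axvline(1.0, linestyle="--", color="grey")  # vertical line of no effect (OR = 1)
ax.set_yticks(positions)
ax.set_yticklabels(labels)
ax.set_xlabel("Odds ratio")
plt.tight_layout()
plt.show()
```

Each horizontal bar spans a study's confidence interval, and the dashed vertical line marks "no effect"; intervals crossing that line indicate results that are not statistically significant.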


Guidelines and standards


  • Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) website

Items 13a - 13f of the PRISMA 2020 Checklist address synthesis methods and are described further in the Explanation and Elaboration document.

Other standards

  • Overview of systematic reviews See the overview page (of this guide) for additional guidelines and standards.
  • Interpreting and understanding meta-analysis graphs: a practical guide (2006) Provides a practical guide for appraising systematic reviews for relevance to clinical practice and interpreting meta-analysis graphs.
  • What is a meta-analysis? (CEBI, University of Oxford) Provides additional information about meta-analyses.
  • How to read a forest plot (CEBI, University of Oxford) Additional detailed information about how to read a forest plot.

Additional readings:

  • Modern meta-analysis review and update of methodologies (2017) Comprehensive details about conducting meta-analyses.
  • Meta-synthesis and evidence-based health care - a method for systematic review (2012)  This article describes the process of systematic review of qualitative studies.
  • Lessons from comparing narrative synthesis and meta-analysis in a systematic review (2015). Investigates the contribution and implications of narrative synthesis and meta-analysis in a systematic review.
  • Speech-language pathologist interventions for communication in moderate-severe dementia: a systematic review (2018) An example of a systematic review without a meta-analysis.
Research article · Open access · Published: 16 March 2013

Overview of data-synthesis in systematic reviews of studies on outcome prediction models

Tobias van den Berg 1, Martijn W Heymans 1, Stephanie S Leone 2, David Vergouw 2, Jill A Hayden 3, Arianne P Verhagen 4 & Henrica CW de Vet 1

BMC Medical Research Methodology volume 13, Article number: 42 (2013)


Abstract

Background

Many prognostic models have been developed. Different types of models, i.e. prognostic factor and outcome prediction studies, serve different purposes, which should be reflected in how the results are summarized in reviews. Therefore we set out to investigate how authors of reviews synthesize and report the results of primary outcome prediction studies.

Methods

Outcome prediction reviews published in MEDLINE between October 2005 and March 2011 were eligible, and 127 English-language systematic reviews aiming to summarize outcome prediction studies were identified for inclusion.

Characteristics of the reviews and the primary studies that were included were independently assessed by 2 review authors, using standardized forms.

Results

After consensus meetings, a total of 50 systematic reviews that met the inclusion criteria were included. The type of primary studies included (prognostic factor or outcome prediction) was unclear in two-thirds of the reviews. A minority of the reviews reported univariable or multivariable point estimates and measures of dispersion from the primary studies. Moreover, the variables considered for outcome prediction model development were often not reported, or were unclear. In most reviews there was no information about model performance. Quantitative analysis was performed in 10 reviews, and 49 reviews assessed the primary studies qualitatively. In both types of analysis, a range of different methods was used to present the results of the outcome prediction studies.

Conclusions

Different methods are applied to synthesize primary study results, but quantitative analysis is rarely performed. The description of review objectives and of the primary studies is suboptimal, and performance parameters of the outcome prediction models are rarely mentioned. The poor reporting and the wide variety of data-synthesis strategies are prone to influence the conclusions of outcome prediction reviews. Therefore, there is much room for improvement in reviews of outcome prediction studies.


Background

The methodology for prognosis research is still under development. Although there is abundant literature to help researchers perform this type of research [1–5], there is still no widely agreed approach to building a multivariable prediction model [6]. An important distinction in prognosis research is made between prognostic factor models, also called explanatory models, and outcome prediction models [7, 8]. Prognostic factor studies investigate causal relationships, or pathways, between a single (prognostic) factor and an outcome, and focus on the effect size (e.g. relative risk) of this prognostic factor, which ideally is adjusted for potential confounders. Outcome prediction studies, on the other hand, combine multiple factors (e.g. clinical and non-clinical patient characteristics) in order to predict future events in individuals, and therefore focus on absolute risks, i.e. predicted probabilities in logistic regression analysis. Methods for summarizing data from prognostic factor studies in a meta-analysis can easily be found in the literature [9, 10], but this is not the case for outcome prediction studies. Therefore, in the present study we focus on how authors of published reviews have synthesized outcome prediction models. The nomenclature used to indicate the various types of prognosis research is not standardized. We use 'prognosis research' as an umbrella term for all research that might explain or predict a future outcome, and 'prognostic factor' and 'outcome prediction' for the specific types of prognosis studies.
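Purely as an illustration of this distinction, in standard logistic regression notation (not notation taken from the reviews discussed in this article):

```latex
% Prognostic factor study: the quantity of interest is the adjusted effect of one
% factor x_1, e.g. the odds ratio from a logistic model that includes confounders.
\[
  \mathrm{OR}_{x_1} = \exp(\beta_1)
\]

% Outcome prediction study: all predictors x_1, ..., x_p are combined to give an
% absolute risk (predicted probability) for an individual.
\[
  \hat{p} = \frac{1}{1 + \exp\!\left(-(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)\right)}
\]
```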

In 2006, Hayden et al. showed that in systematic reviews of prognosis studies, different methods are used to assess the quality of primary studies [ 11 ]. Moreover, when quality is assessed, integration of these quality scores in the synthesis of the review is not guaranteed. For reviews of outcome prediction models, additional characteristics are important in the synthesis of models to reflect choices made in the primary studies, such as which variables are included in statistical models and how this selection was made. These choices therefore also reflect the internal and external validity of a model and influence the predictive performance of the model. In systematic reviews the researchers synthesize results across primary outcome prediction studies which include different variables and show methodological diversity. Moreover, relevant information is not always available, due to poor reporting in the studies. For example, several researchers have found that current knowledge about the recommended number of events per variable, and the coding and selection of variables, among other features, are not always reported in primary outcome prediction research [ 12 – 14 ]. Although improvement in primary studies themselves is needed, reviews that summarize outcome prediction evidence need to consider the current diversity in methodology in primary studies.

In this meta-review we focus on reviews of outcome prediction studies, and on how they summarize the characteristics of design and analysis, and the results, of the primary studies. As there is no guideline or agreement on how primary outcome prediction models in medical research and epidemiology should be summarized in systematic reviews, an overview of current methods helps researchers to improve and develop these methods. Moreover, the methods currently used in outcome prediction reviews are largely unknown to the research community. Therefore, the aim of this review was to provide an overview of how published reviews of outcome prediction studies describe and summarize the characteristics of the analyses in primary studies, and how the data are synthesized.

Methods

Literature search and selection of studies

We searched for systematic reviews and meta-analyses of outcome prediction models published between October 2005 and March 2011. We were only interested in reviews that included multivariable outcome prediction studies. In collaboration with a medical information specialist, we developed a search strategy in MEDLINE, extending the strategy used by Hayden [11] by adding other recommended search terms for predictive and prognostic research [15, 16]. The full search strategy is presented in Appendix 1.

Based on title and abstract, potentially eligible reviews were selected by one author (TvdB), who in case of any doubt included the review. Another author (MH) checked the set of potentially eligible reviews. Ineligible reviews were excluded after consensus between both authors. The full texts of the included reviews were read, and if there was any doubt about eligibility a third review author (HdV) was consulted. The inclusion criteria were met if the study design was a systematic review with or without a meta-analysis, multiple variables were studied in an outcome prediction model, and the review was written in English. Reviews were excluded if they were based on individual patient data only, or if the topic was genetic profiling.

Data-extraction

A data-extraction form was developed, based on items important to prognosis research [1, 2, 12, 13, 17], to assess the characteristics of the reviews and primary studies; it is available from the first author on request. The items on this data-extraction form are shown in Appendix 2. Before the form was finalized it was pilot-tested by all review authors, and minor adjustments were made after discussion of differences in scores. One review author (TvdB) scored all reviews, while the other review authors (MH, AV, DV, and SL) together scored all reviews as second raters. Consensus meetings were held within 2 weeks after a review had been scored to resolve disagreements. If consensus was not reached, a third reviewer (MH or HdV) was consulted to make a final decision.

An item was scored ‘yes’ if positive information was found about that specific methodological item, e.g. if it was clear that sensitivity analyses were conducted. If it was clear that a specific methodological requirement was not fulfilled, a ‘no’ was scored, e.g. no sensitivity analyses were conducted. In case of doubt or uncertainty, ‘unclear’ was scored. Sometimes, a methodological item could be scored as ‘not applicable’. The number of reviews within a specific answer category was reported, as well as the proportion.

Results

Literature search and selection process

The search strategy yielded 7889 references and, based on title and abstract, 216 were selected to be read in full text (see the flowchart in Figure 1). Of these, 89 were excluded and 127 remained. Exclusions after reading the full text were mainly due to a focus on a single variable and an outcome (prognostic factor study), analysis based on individual patient data only, or a narrative overview design. After completing the data-extraction, the objectives and methods of 44 reviews indicated summaries of prognostic factor studies, and 33 reviews had an unclear approach. Therefore, a total of 50 reviews on outcome prediction studies were analyzed [18–67].

[Figure 1. Flowchart of the search and selection process.]

After completing the data-extraction form for all of the included reviews, most disagreements between review authors were found on items concerning the review objectives, the type of primary studies included, and the method of qualitative data-synthesis. Unclear reporting and, to a lesser degree, reading errors contributed to the disagreements. After consensus meetings only a small proportion of items needed to be discussed with a third reviewer.

Objective and design of the review

Table 1, section 1 shows the items concerning information about the reviews. Of the 50 reviews rated as summaries of outcome prediction studies, less than one third included only outcome prediction studies [23, 27, 28, 32, 35, 39, 44, 48, 50, 52, 55, 58, 60, 66]. In about two thirds, the type of primary studies that were included was unclear, and the remaining reviews included a combination of prognostic factor and outcome prediction studies. Most reviews clearly described their outcome of interest. Information about the assessment of the methodological quality of the primary studies, i.e. risk of bias, was also provided in most reviews. Of those that did, two thirds described the basic design of the primary studies in addition to a list of methodological criteria (defined in our study as a list consisting of at least four quality items). In some reviews an established criteria list was used or adapted, or a new criteria list was developed. Of the reviews that assessed methodological quality, less than half actually used this information to account for differences in study quality, mainly by performing a 'levels of evidence' analysis, subgroup analyses, or sensitivity analyses.

Information about the design and results of the primary studies

Table 1, section 2 shows the information provided about the included primary studies. The outcome measures used in the included studies were reported in most of the reviews. Only 2 reviews [28, 52] described the statistical methods that were used in the primary studies to select variables for inclusion in the final prediction model, e.g. forward or backward selection procedures, and only 6 others described whether and how patients were treated.

A minority of reviews [23, 24, 27, 28] described, for all studies, the variables that were considered for inclusion in the outcome prediction model, and only 5 reviews [36, 37, 39, 48, 55] reported univariable point estimates (i.e. regression coefficients or odds ratios) and estimates of dispersion (e.g. standard errors) for all studies. Similarly, multivariable point estimates and estimates of dispersion were reported in 11 and 10 of the reviews, respectively [21, 26, 27, 31, 33, 37, 44, 52, 55, 64, 65].

With regard to the presentation of univariable and multivariable point estimates, 2 reviews presented both types of results [ 37 , 55 ], 31 did not report any estimates, and 17 reviews were unclear or reported only univariable or multivariable results [not shown in the table]. Lastly, model performance and number of events per variable were reported in 7 reviews [ 32 , 39 , 41 , 60 , 61 , 65 , 66 ] and 4 reviews [ 40 , 48 , 58 , 61 ], respectively.

Data-analysis and synthesis in the reviews

Table 1, section 3 illustrates how the results of the primary studies were summarized in the reviews. Heterogeneity was described in almost all reviews by reporting differences in the study design and the characteristics of the study population. All but one review [57] summarized the results of the included studies in a qualitative manner; the methods mainly used for that purpose were the number of statistically significant results, the consistency of findings, or a combination of these. Quantitative analysis, i.e. statistical pooling, was performed in 10 of the 50 reviews [25, 28, 31, 36, 37, 44, 45, 57–59]. The quantitative methods used included random-effects and fixed-effects models of regression coefficients, odds ratios or hazard ratios. Of these quantitative summaries, 40% assessed the presence of statistical heterogeneity using I², Chi², or the Q statistic. In two reviews [25, 59], statistical heterogeneity was found to be present, and subgroup analysis was performed to determine its source [results not shown]. In 8 of the reviews there was a graphical presentation of the results, in which a forest plot per single predictor [25, 28, 36–38, 52, 59] was the most frequently used method. Other studies used a bar plot [57] or a scatterplot [38]. In 6 reviews [25, 26, 32, 43, 46, 58] a sensitivity analysis was performed to test the robustness of choices made, such as changing the cut-off value for classifying a primary study as high or low quality.
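For orientation only, the following Python sketch (with hypothetical log odds ratios and standard errors, not data from any of the included reviews) shows how Cochran's Q, I² and a DerSimonian-Laird random-effects pooled estimate for a single predictor are conventionally computed:

```python
import math

# Hypothetical log odds ratios and standard errors for one predictor across studies.
effects = [0.35, 0.10, 0.62, 0.28]
ses = [0.12, 0.20, 0.15, 0.18]

# Fixed-effect weights and Cochran's Q.
w = [1 / s**2 for s in ses]
fixed = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))
df = len(effects) - 1
i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0   # I^2: % of variation due to heterogeneity

# DerSimonian-Laird estimate of the between-study variance (tau^2).
c = sum(w) - sum(wi**2 for wi in w) / sum(w)
tau2 = max(0.0, (q - df) / c)

# Random-effects pooling re-weights each study by 1 / (SE^2 + tau^2).
w_re = [1 / (s**2 + tau2) for s in ses]
pooled = sum(wi * e for wi, e in zip(w_re, effects)) / sum(w_re)
se_re = math.sqrt(1 / sum(w_re))
print(f"Q = {q:.2f}, I^2 = {i2:.0f}%, pooled OR = {math.exp(pooled):.2f} "
      f"(95% CI {math.exp(pooled - 1.96*se_re):.2f} to {math.exp(pooled + 1.96*se_re):.2f})")
```

In practice, dedicated tools (for example the R package metafor) implement these and related estimators; the sketch above only illustrates the arithmetic.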

Discussion

We made an overview of how systematic reviews summarize and report the results of primary outcome prediction studies. Specifically, we extracted information on how the data-synthesis was performed in the reviews, since outcome prediction models may consider different potential predictors, include a dissimilar set of variables in the final prediction model, and use a variety of statistical methods to obtain an outcome prediction model.

Currently, in prognosis studies a distinction is made between outcome prediction models and prognostic factor models. The methodology of data synthesis in a review of the latter type is comparable to the methodology of aetiological reviews. For that reason, in the present study we focused only on reviews of outcome prediction studies. Nonetheless, we found it difficult to distinguish between the two review types. Less than half of the reviews that we initially selected for data-extraction in fact seemed to serve an outcome prediction purpose; the other reviews summarized prognostic factor studies only, or their objective was unclear. In particular, prognostic factor reviews that investigated more than one variable, combined with non-specific objectives, made it difficult to determine the purpose of a review. As a consequence, we might have misclassified some of the 44 excluded reviews rated as prognostic factor reviews.

The objective of a review should also include information about the type of study that is included, in this case outcome prediction studies. However, in reviews aimed at outcome prediction the type of primary study was unclear in two-thirds of cases. One review we encountered stated that its purpose was "to identify preoperative predictive factors for acute post-operative pain and analgesic consumption", although the review authors included any study that identified one or more potential risk factors or predictive factors. The risk of combining both types of studies, i.e. risk factor or prognostic factor studies and predictive factor studies, is that in the former the inclusion of potential covariables is based on the change in the regression coefficient of the risk factor, while in the latter all potential predictor variables are included based on their predictive ability for the outcome. Combining them may lead to: 1) biased results in a meta-analysis or other form of evidence synthesis, because a risk factor is not always predictive for an outcome, and 2) biased regression coefficients, because risk factor studies (if adjusted for potential confounders at all) use a slightly different method to obtain a multivariable model than outcome prediction studies.

The distinction between prognostic factor and outcome prediction studies was already emphasized in 1983 by Copas [68], who stated that "a method for achieving a good predictor may be quite inappropriate for other questions in regression analysis such as the interpretation of individual regression coefficients". In other words, the methodology of outcome prediction modelling differs from that of prognostic factor modelling, and combining both types of research in one review to reflect current evidence should therefore be discouraged. Hemingway et al. [2] appealed for standard nomenclature in prognosis research, and the results of our study underline their plea. Authors of reviews and primary studies should clarify their type of research, for example by using the terms applied by Hayden et al. [8], 'prognostic factor modelling' and 'outcome prediction modelling', and give a clear description of their objective.

Studies included in outcome prediction reviews are rarely similar in design and methodology, and this is often neglected when summarizing the evidence. Differences, for instance in the variables studied and the method of analysis used for variable selection, might explain heterogeneity in results, and should therefore be reported and reflected on when striving to summarize the evidence in the most appropriate way. There is no doubt that the methodological quality of the primary studies included in reviews is related to the concept of bias [69, 70], and it is therefore important to assess it [11, 69, 70]. Dissemination bias concerns whether publication bias is likely to be present, how this is handled, and what is done to correct for it [71]. To our knowledge, dissemination bias, and especially its consequences in reviews of outcome prediction models, has not yet been studied. Most likely, testimation bias [5], i.e. bias arising from the predictors considered and the number of predictors in relation to the effective sample size, influences results more than publication bias. Therefore, we did not study dissemination bias at the review level.

With regard to the reporting of primary study characteristics in the systematic reviews, there is much room for improvement. We found that the methods of model development (e.g. the variables considered and the variable selection methods used) in the primary studies were not reported, or only vaguely reported, in the included reviews. These methods are important, because variable selection procedures can affect the composition of the multivariable model through estimation bias, or may increase model uncertainty [72–74]. Furthermore, the predictive performance of the model can be biased by these methods [74].

We also found that only 5 of the reviews reported what kind of treatment the patients received in the primary studies. Although prescribed treatment is often not considered as a candidate predictor, it is likely to have a considerable impact on prognosis. Moreover, treatment may vary in relation to predictive variables [75], and although randomized controlled trials provide patients with similar treatment strategies, in cohort studies, which are the most common design in prognosis research, this is often not the case. Regardless of the difficulties in defining groups that receive the same treatment, it is imperative to consider treatment in outcome prediction models.

To ensure correct data-synthesis of the results, the primary studies should provide point estimates and estimates of dispersion for all included variables, including non-significant findings. Whereas positive or favourable findings are more often reported [75–78], the effects of predictive factors that do not reach statistical significance also need to be compared and summarized in a review. Imagine a variable that is statistically significant in one article but not reported in others because of non-significance: it is likely that this one significant result is a spurious finding, or that the other studies were underpowered. Without information about the non-significant findings in other studies, biased or even incorrect conclusions might be drawn. Reporting of the evidence of primary studies should therefore be accompanied by the results of univariable and multivariable associations, regardless of their level of significance. Moreover, confidence intervals or other estimates of dispersion are also needed in the review; unfortunately, these were not presented in most of the reviews in our study. Some reviews considered differences in unadjusted and adjusted results, and the results of one review were sensibly stratified according to univariable and multivariable effects [38]. Other reviews merely reported multivariable results [31], or only univariable results if multivariable results were unavailable [58].

In addition to the multivariable results of a final prediction model, the predictive performance of these models is important for the assessment of clinical usefulness [79]. A prediction model in itself does not indicate how much of the variance in outcome is explained by the included variables. Unfortunately, in addition to the non-reporting of several primary study characteristics, the performance of the models was rarely reported in the reviews included in our overview.

Different stages can be distinguished in outcome prediction research [80]. Most outcome prediction models evaluated in the systematic reviews appeared to be in a developmental phase. Before implementation in daily practice, confirmation of the results in other studies is needed. With such validation studies underway, future reviews should acknowledge the difference between externally validated models and models from developmental studies, and analyze them separately.

In systematic reviews, data can be combined quantitatively, i.e. a meta-analysis can be performed. This was done in 10 of the reviews. All of them combined point estimates (mostly odds ratios, but also a mix of odds ratios, hazard ratios and relative risks) and confidence intervals for single outcome prediction variables. This made it possible to calculate a pooled point estimate, often complemented with a confidence interval [81]. However, in outcome prediction research we are interested in the estimates of a combination of predictive factors, which makes it possible to calculate absolute risks or probabilities to predict an outcome in individuals [82]. Even if the relative risk of a variable is statistically significant, it does not provide information about the extent to which this variable is predictive of a particular outcome. The distribution of predictor values, the outcome prevalence, and the correlations between variables also influence the predictive value of variables within a model [83]. Effect sizes likewise provide no information about the amount of variation in outcomes that is explained. In summary: the current quantitative methods appear to be an explanatory way of summarizing the available evidence, rather than a quantitative summary of complete outcome prediction models.

MEDLINE was the only database searched for relevant reviews; our intention was to provide an overview of recently published reviews, not to include all relevant outcome prediction reviews. Within MEDLINE, some eligible reviews may have been missed if their titles and abstracts did not include relevant terms and information. An extensive search strategy was applied, and abstracts were screened thoroughly and discussed in case of disagreement. Data-extraction was performed in pairs to prevent reading and interpretation errors. Disagreements mainly occurred when deciding on the objective of a review and the type of primary studies included, due to poor reporting in most of the reviews. This indicates a lack of clarity, explanation and reporting within reviews. Therefore, screening in pairs is a necessity, and standardized criteria should be developed and applied in future studies focusing on such reviews. Consistency in rating on the data-extraction form was enhanced by one review author rating all reviews, with one of the other review authors as second rater. Several items were scored as "no", but we did not know whether this was a true negative (i.e. leading to bias) or whether no information was reported about that particular item. It is especially difficult for review authors to summarize information about primary studies when there is a lack of information in the studies themselves [13, 14, 84].

Implications

There is still no established methodological procedure for a meta-analysis of regression coefficients of multivariable outcome prediction models. Some authors, such as Riley et al. and Altman [81, 84], are of the opinion that it remains practically impossible, due to poor reporting, publication bias, and heterogeneity across studies. However, a considerable number of outcome prediction studies have been published, and it would be useful to integrate this body of evidence into one summary result. Moreover, the number of reviews being published is increasing. Therefore, there is a need to find the best strategy to integrate the results of primary outcome prediction studies. Consequently, until a method to quantitatively synthesize results has been developed, a sensible qualitative data-synthesis, which takes methodological differences between primary studies into account, is indicated. In summarizing the evidence, differences in methodological items and model-building strategies should be described and taken into account when assessing the overall evidence for outcome prediction. For example, univariable and multivariable results should be described separately, or subgroup analyses should be performed when they are combined. Other items that, in our opinion, should be taken into consideration in the data-synthesis are: study quality, the variables used for model development, the statistical methods used for variable selection, the performance of the models, and whether there were sufficient cases and non-cases to guarantee adequate study power. Regardless of whether these items are taken into consideration in the data-synthesis, we strongly recommend that reviews describe them for all included primary studies so that readers can also take them into consideration.

In conclusion, poor reporting of relevant information and differences in methodology occur in primary outcome prediction research. Even the predictive ability of the models was rarely reported. This, together with our current inability to pool multivariable outcome prediction models, challenges review authors to make informative reviews of outcome prediction models.

Appendix 1

Search strategy: 01-03-2011

Database: MEDLINE

((“systematic review”[tiab] OR “systematic reviews”[tiab] OR “Meta-Analysis as Topic”[Mesh] OR meta-analysis[tiab] OR “Meta-Analysis”[Publication Type]) AND (“2005/11/01”[EDat] : “3000”[EDat]) AND ((“Incidence”[Mesh] OR “Models, Statistical”[Mesh] OR “Mortality”[Mesh] OR “mortality ”[Subheading] OR “Follow-Up Studies”[Mesh] OR “Prognosis”[Mesh:noexp] OR “Disease-Free Survival”[Mesh] OR “Disease Progression”[Mesh:noexp] OR “Natural History”[Mesh] OR “Prospective Studies”[Mesh]) OR ((cohort*[tw] OR course*[tw] OR first episode*[tw] OR predict*[tw] OR predictor*[tw] OR prognos*[tw] OR follow-up stud*[tw] OR inciden*[tw]) NOT medline[sb]))) NOT ((“addresses”[Publication Type] OR “biography”[Publication Type] OR “case reports”[Publication Type] OR “comment”[Publication Type] OR “directory”[Publication Type] OR “editorial”[Publication Type] OR “festschrift”[Publication Type] OR “interview”[Publication Type] OR “lectures”[Publication Type] OR “legal cases”[Publication Type] OR “legislation”[Publication Type] OR “letter”[Publication Type] OR “news”[Publication Type] OR “newspaper article”[Publication Type] OR “patient education handout”[Publication Type] OR “popular works”[Publication Type] OR “congresses”[Publication Type] OR “consensus development conference”[Publication Type] OR “consensus development conference, nih”[Publication Type] OR “practice guideline”[Publication Type]) OR (“Animals”[Mesh] NOT (“Animals”[Mesh] AND “Humans”[Mesh]))).

Appendix 2

Items used to assess the characteristics of analyses in outcome prediction primary studies and reviews:

Information about the review:

  • What type of studies are included?
  • Is(/are) the outcome(s) of interest clearly described?
  • Is information about the quality assessment method provided? What method was used?
  • Did the review account for quality?

Information about the analysis of the primary studies:

  • Are the outcome measures clearly described?
  • Is the statistical method used for variable selection described?
  • Is a description of the treatments received provided?

Information about the results of the primary studies:

  • Are crude univariable associations and estimates of dispersion presented for all the variables of the primary studies?
  • Are all variables that were used for model development described?
  • Are the multivariable associations and estimates of dispersion presented?
  • Is model performance assessed and reported?
  • Is the number of predictors relative to the number of outcome events described?

Data-analysis and synthesis of the review:

  • Is the heterogeneity of the primary studies described?
  • Is a qualitative synthesis presented?
  • Are methods for quantitative analysis described?
  • Is statistical heterogeneity assessed? What method is used to assess it?
  • If statistical heterogeneity exists, are the sources of heterogeneity investigated? What method is used to investigate potential sources?
  • Is a graphical presentation of the results provided?
  • Are sensitivity analyses performed? At which level?

References

Harrell FEJ, Lee KL, Mark DB: Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996, 15: 361-387. 10.1002/(SICI)1097-0258(19960229)15:4<361::AID-SIM168>3.0.CO;2-4.


Hemingway H, Riley RD, Altman DG: Ten steps towards improving prognosis research. BMJ. 2009, 339: b4184-10.1136/bmj.b4184.

Moons KGM, Donders AR, Steyerberg EW, Harrell FE: Penalized maximum likelihood estimation to directly adjust diagnostic and prognostic prediction models for overoptimism: a clinical example. J Clin Epidemiol. 2004, 57: 1262-1270. 10.1016/j.jclinepi.2004.01.020.


Royston P, Altman DG, Sauerbrei W: Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med. 2006, 25: 127-141. 10.1002/sim.2331.

Steyerberg EW: Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. 2009, New York: Springer


Royston P, Moons KGM, Altman DG, Vergouwe Y: Prognosis and prognostic research: Developing a prognostic model. BMJ. 2009, 338: b604-10.1136/bmj.b604.

Moons KGM, Royston P, Vergouwe Y, Grobbee DE, Altman DG: Prognosis and prognostic research: what, why, and how?. BMJ. 2009, 338: b375-10.1136/bmj.b375.

Hayden JA, Dunn KM, van der Windt DA, Shaw WS: What is the prognosis of back pain?. Best Pract Res Clin Rheumatol. 2010, 24: 167-179. 10.1016/j.berh.2009.12.005.

Hayden JA, Chou R, Hogg-Johnson S, Bombardier C: Systematic reviews of low back pain prognosis had variable methods and results: guidance for future prognosis reviews. J Clin Epidemiol. 2009, 62: 781-796. 10.1016/j.jclinepi.2008.09.004.

Krasopoulos G, Brister SJ, Beattie WS, Buchanan MR: Aspirin “resistance” and risk of cardiovascular morbidity: systematic review and meta-analysis. BMJ. 2008, 336: 195-198. 10.1136/bmj.39430.529549.BE.


Hayden JA, Cote P, Bombardier C: Evaluation of the quality of prognosis studies in systematic reviews. Ann Intern Med. 2006, 144: 427-437. 10.7326/0003-4819-144-6-200603210-00010.

Mallett S, Timmer A, Sauerbrei W, Altman DG: Reporting of prognostic studies of tumour markers: a review of published articles in relation to REMARK guidelines. Br J Cancer. 2010, 102: 173-180. 10.1038/sj.bjc.6605462.

Mallett S, Royston P, Waters R, Dutton S, Altman DG: Reporting performance of prognostic models in cancer: a review. BMC Med. 2010, 8: 21-10.1186/1741-7015-8-21.

Mallett S, Royston P, Dutton S, Waters R, Altman DG: Reporting methods in studies developing prognostic models in cancer: a review. BMC Med. 2010, 8: 20-10.1186/1741-7015-8-20.

Ingui BJ, Rogers MA: Searching for clinical prediction rules in MEDLINE. J Am Med Inform Assoc. 2001, 8: 391-397. 10.1136/jamia.2001.0080391.


Wilczynski NL: Natural History and Prognosis. PDQ, Evidence-Based Principles and Practice. Edited by: McKibbon KA, Wilczynski NL, Eady A, Marks S. 2009, Shelton, Connecticut: People’s Medical Publishing House


Austin PC, Tu JV: Automated variable selection methods for logistic regression produced unstable models for predicting acute myocardial infarction mortality. J Clin Epidemiol. 2004, 57: 1138-1146. 10.1016/j.jclinepi.2004.04.003.

Lee M, Chodosh J: Dementia and life expectancy: what do we know?. J Am Med Dir Assoc. 2009, 10: 466-471. 10.1016/j.jamda.2009.03.014.

Gravante G, Garcea G, Ong S: Prediction of Mortality in Acute Pancreatitis: A Systematic Review of the Published Evidence. Pancreatology. 2009, 9: 601-614. 10.1159/000212097.

Celestin J, Edwards RR, Jamison RN: Pretreatment psychosocial variables as predictors of outcomes following lumbar surgery and spinal cord stimulation: a systematic review and literature synthesis. Pain Med. 2009, 10: 639-653. 10.1111/j.1526-4637.2009.00632.x.

Wright AA, Cook C, Abbott JH: Variables associated with the progression of hip osteoarthritis: a systematic review. Arthritis Rheum. 2009, 61: 925-936. 10.1002/art.24641.

Heitz C, Hilfiker R, Bachmann L: Comparison of risk factors predicting return to work between patients with subacute and chronic non-specific low back pain: systematic review. Eur Spine J. 2009, 18: 1829-35. 10.1007/s00586-009-1083-9.

Sansam K, Neumann V, O’Connor R, Bhakta B: Predicting walking ability following lower limb amputation: a systematic review of the literature. J Rehabil Med. 2009, 41: 593-603. 10.2340/16501977-0393.

Detaille SI, Heerkens YF, Engels JA, van der Gulden JWJ, van Dijk FJH: Common prognostic factors of work disability among employees with a chronic somatic disease: a systematic review of cohort studies. Scand J Work Environ Health. 2009, 35: 261-281. 10.5271/sjweh.1337.

Walton DM, Pretty J, MacDermid JC, Teasell RW: Risk factors for persistent problems following whiplash injury: results of a systematic review and meta-analysis. J Orthop Sports Phys Ther. 2009, 39: 334-350.

van Velzen JM, van Bennekom CAM, Edelaar MJA, Sluiter JK, Frings-Dresen MHW: Prognostic factors of return to work after acquired brain injury: a systematic review. Brain Inj. 2009, 23: 385-395. 10.1080/02699050902838165.

Borghuis MS, Lucassen PLBJ, van de Laar FA, Speckens AE, van Weel C, olde Hartman TC: Medically unexplained symptoms, somatisation disorder and hypochondriasis: course and prognosis. A systematic review. J Psychosom Res. 2009, 66: 363-377. 10.1016/j.jpsychores.2008.09.018.

Bramer JAM, van Linge JH, Grimer RJ, Scholten RJPM: Prognostic factors in localized extremity osteosarcoma: a systematic review. Eur J Surg Oncol. 2009, 35: 1030-1036. 10.1016/j.ejso.2009.01.011.

Tandon P, Garcia-Tsao G: Prognostic indicators in hepatocellular carcinoma: a systematic review of 72 studies. Liver Int. 2009, 29: 502-510. 10.1111/j.1478-3231.2008.01957.x.

Santaguida PL, Hawker GA, Hudak PL: Patient characteristics affecting the prognosis of total hip and knee joint arthroplasty: a systematic review. Can J Surg. 2008, 51: 428-436.


Elmunzer BJ, Young SD, Inadomi JM, Schoenfeld P, Laine L: Systematic review of the predictors of recurrent hemorrhage after endoscopic hemostatic therapy for bleeding peptic ulcers. Am J Gastroenterol. 2008, 103: 2625-2632. 10.1111/j.1572-0241.2008.02070.x.

Adamson SJ, Sellman JD, Frampton CMA: Patient predictors of alcohol treatment outcome: a systematic review. J Subst Abuse Treat. 2009, 36: 75-86. 10.1016/j.jsat.2008.05.007.

Paez JIG, Costa SF: Risk factors associated with mortality of infections caused by Stenotrophomonas maltophilia: a systematic review. J Hosp Infect. 2008, 70: 101-108. 10.1016/j.jhin.2008.05.020.

Johnson SR, Swiston JR, Granton JT: Prognostic factors for survival in scleroderma associated pulmonary arterial hypertension. J Rheumatol. 2008, 35: 1584-1590.


Clarke SA, Eiser C, Skinner R: Health-related quality of life in survivors of BMT for paediatric malignancy: a systematic review of the literature. Bone Marrow Transplant. 2008, 42: 73-82. 10.1038/bmt.2008.156.

Kok M, Cnossen J, Gravendeel L, van der Post J, Opmeer B, Mol BW: Clinical factors to predict the outcome of external cephalic version: a metaanalysis. Am J Obstet Gynecol. 2008, 199: 630-637.

Stuart-Harris R, Caldas C, Pinder SE, Pharoah P: Proliferation markers and survival in early breast cancer: a systematic review and meta-analysis of 85 studies in 32,825 patients. Breast. 2008, 17: 323-334. 10.1016/j.breast.2008.02.002.

Kamper SJ, Rebbeck TJ, Maher CG, McAuley JH, Sterling M: Course and prognostic factors of whiplash: a systematic review and meta-analysis. Pain. 2008, 138: 617-629. 10.1016/j.pain.2008.02.019.

Nijrolder I, van der Horst H, van der Windt D: Prognosis of fatigue. A systematic review. J Psychosom Res. 2008, 64: 335-349. 10.1016/j.jpsychores.2007.11.001.

Williams M, Williamson E, Gates S, Lamb S, Cooke M: A systematic literature review of physical prognostic factors for the development of Late Whiplash Syndrome. Spine (Phila Pa 1976). 2007, 32: E764-E780. 10.1097/BRS.0b013e31815b6565.


Willemse-van Son AHP, Ribbers GM, Verhagen AP, Stam HJ: Prognostic factors of long-term functioning and productivity after traumatic brain injury: a systematic review of prospective cohort studies. Clin Rehabil. 2007, 21: 1024-1037. 10.1177/0269215507077603.

Alvarez J, Wilkinson J, Lipshultz S: Outcome Predictors for Pediatric Dilated Cardiomyopathy: A Systematic Review. Prog Pediatr Cardiol. 2007, 23: 25-32. 10.1016/j.ppedcard.2007.05.009.

Mallen CD, Peat G, Thomas E, Dunn KM, Croft PR: Prognostic factors for musculoskeletal pain in primary care: a systematic review. Br J Gen Pract. 2007, 57: 655-661.

Stroke Risk in Atrial Fibrillation Working Group: Independent predictors of stroke in patients with atrial fibrillation: a systematic review. Neurology. 2007, 69: 546-554.

Kent PM, Keating JL: Can we predict poor recovery from recent-onset nonspecific low back pain? A systematic review. Man Ther. 2008, 13: 12-28. 10.1016/j.math.2007.05.009.

Tjang YS, van Hees Y, Korfer R, Grobbee DE, van der Heijden GJMG: Predictors of mortality after aortic valve replacement. Eur J Cardiothorac Surg. 2007, 32: 469-474. 10.1016/j.ejcts.2007.06.012.

Pfannschmidt J, Dienemann H, Hoffmann H: Surgical resection of pulmonary metastases from colorectal cancer: a systematic review of published series. Ann Thorac Surg. 2007, 84: 324-338. 10.1016/j.athoracsur.2007.02.093.

Williamson E, Williams M, Gates S, Lamb SE: A systematic literature review of psychological factors and the development of late whiplash syndrome. Pain. 2008, 135: 20-30. 10.1016/j.pain.2007.04.035.

Tas U, Verhagen AP, Bierma-Zeinstra SMA, Odding E, Koes BW: Prognostic factors of disability in older people: a systematic review. Br J Gen Pract. 2007, 57: 319-323.

Rassi AJ, Rassi A, Rassi SG: Predictors of mortality in chronic Chagas disease: a systematic review of observational studies. Circulation. 2007, 115: 1101-1108. 10.1161/CIRCULATIONAHA.106.627265.

Belo JN, Berger MY, Reijman M, Koes BW, Bierma-Zeinstra SMA: Prognostic factors of progression of osteoarthritis of the knee: a systematic review of observational studies. Arthritis Rheum. 2007, 57: 13-26. 10.1002/art.22475.

Langer-Gould A, Popat RA, Huang SM: Clinical and demographic predictors of long-term disability in patients with relapsing-remitting multiple sclerosis: a systematic review. Arch Neurol. 2006, 63: 1686-1691. 10.1001/archneur.63.12.1686.

Lamme B, Mahler CW, van Ruler O, Gouma DJ, Reitsma JB, Boermeester MA: Clinical predictors of ongoing infection in secondary peritonitis: systematic review. World J Surg. 2006, 30: 2170-2181. 10.1007/s00268-005-0333-1.

van Dijk GM, Dekker J, Veenhof C, van den Ende CHM: Course of functional status and pain in osteoarthritis of the hip or knee: a systematic review of the literature. Arthritis Rheum. 2006, 55: 779-785. 10.1002/art.22244.

Aalto TJ, Malmivaara A, Kovacs F: Preoperative predictors for postoperative clinical outcome in lumbar spinal stenosis: systematic review. Spine (Phila Pa 1976). 2006, 31: E648-E663. 10.1097/01.brs.0000231727.88477.da.

Hauser CA, Stockler MR, Tattersall MHN: Prognostic factors in patients with recently diagnosed incurable cancer: a systematic review. Support Care Cancer. 2006, 14: 999-1011. 10.1007/s00520-006-0079-9.

Bollen CW, Uiterwaal CSPM, van Vught AJ: Systematic review of determinants of mortality in high frequency oscillatory ventilation in acute respiratory distress syndrome. Crit Care. 2006, 10: R34-10.1186/cc4824.

Steenstra IA, Verbeek JH, Heymans MW, Bongers PM: Prognostic factors for duration of sick leave in patients sick listed with acute low back pain: a systematic review of the literature. Occup Environ Med. 2005, 62: 851-860. 10.1136/oem.2004.015842.

Bai M, Qi X, Yang Z: Predictors of hepatic encephalopathy after transjugular intrahepatic portosystemic shunt in cirrhotic patients: a systematic review. J Gastroenterol Hepatol. 2011, 26: 943-51. 10.1111/j.1440-1746.2011.06663.x.

Monteiro-Soares M, Boyko E, Ribeiro J, Ribeiro I, Dinis-Ribeiro M: Risk stratification systems for diabetic foot ulcers: a systematic review. Diabetologia. 2011, 54: 1190-1199. 10.1007/s00125-010-2030-3.

Lichtman JH, Leifheit-Limson EC, Jones SB: Predictors of hospital readmission after stroke: a systematic review. Stroke. 2010, 41: 2525-2533. 10.1161/STROKEAHA.110.599159.

Ronden RA, Houben AJ, Kessels AG, Stehouwer CD, de Leeuw PW, Kroon AA: Predictors of clinical outcome after stent placement in atherosclerotic renal artery stenosis: a systematic review and meta-analysis of prospective studies. J Hypertens. 2010, 28: 2370-2377.

de Jonge RCJ, van Furth AM, Wassenaar M, Gemke RJBJ, Terwee CB: Predicting sequelae and death after bacterial meningitis in childhood: a systematic review of prognostic studies. BMC Infect Dis. 2010, 10: 232-10.1186/1471-2334-10-232.

Colohan SM: Predicting prognosis in thermal burns with associated inhalational injury: a systematic review of prognostic factors in adult burn victims. J Burn Care Res. 2010, 31: 529-539. 10.1097/BCR.0b013e3181e4d680.

Clay FJ, Newstead SV, McClure RJ: A systematic review of early prognostic factors for return to work following acute orthopaedic trauma. Injury. 2010, 41: 787-803. 10.1016/j.injury.2010.04.005.

Brabrand M, Folkestad L, Clausen NG, Knudsen T, Hallas J: Risk scoring systems for adults admitted to the emergency department: a systematic review. Scand J Trauma Resusc Emerg Med. 2010, 18: 8-10.1186/1757-7241-18-8.

Montazeri A: Quality of life data as prognostic indicators of survival in cancer patients: an overview of the literature from 1982 to 2008. Health Qual Life Outcomes. 2009, 7: 102-10.1186/1477-7525-7-102.

Copas JB: Prediction and Shrinkage. J R Stat Soc Ser B (methodological). 1983, 45: 311-354.

Atkins D, Best D, Briss PA: Grading quality of evidence and strength of recommendations. BMJ. 2004, 328: 1490-

Deeks JJ, Dinnes J, D’Amico R: Evaluating non-randomised intervention studies. Health Technol Assess. 2003, 7: iii-173-

Parekh-Bhurke S, Kwok CS, Pang C: Uptake of methods to deal with publication bias in systematic reviews has increased over time, but there is still much scope for improvement. J Clin Epidemiol. 2011, 64: 349-57. 10.1016/j.jclinepi.2010.04.022.

Steyerberg EW: Selection of main effects. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. 2009, New York: Springer

Chatfield C: Model Uncertainty, Data Mining and Statistical Inference. J R Stat Soc Ser A. 1995, 158: 419-466. 10.2307/2983440.

Steyerberg EW, Eijkemans MJ, Habbema JD: Stepwise selection in small data sets: a simulation study of bias in logistic regression analysis. J Clin Epidemiol. 1999, 52: 935-942. 10.1016/S0895-4356(99)00103-1.

Altman DG: Systematic reviews of evaluations of prognostic variables. BMJ. 2001, 323: 224-228. 10.1136/bmj.323.7306.224.

Kyzas PA, Ioannidis JPA, axa-Kyza D: Quality of reporting of cancer prognostic marker studies: association with reported prognostic effect. J Natl Cancer Inst. 2007, 99: 236-243. 10.1093/jnci/djk032.

Kyzas PA, Ioannidis JPA, axa-Kyza D: Almost all articles on cancer prognostic markers report statistically significant results. Eur J Cancer. 2007, 43: 2559-2579. 10.1016/j.ejca.2007.08.030.

Rifai N, Altman DG, Bossuyt PM: Reporting bias in diagnostic and prognostic studies: time for action. Clin Chem. 2008, 54: 1101-1103. 10.1373/clinchem.2008.108993.

Vergouwe Y, Steyerberg EW, Eijkemans MJC, Habbema JD: Validity of prognostic models: when is a model clinically useful?. Semin Urol Oncol. 2002, 20: 96-107. 10.1053/suro.2002.32521.

Altman DG, Vergouwe Y, Royston P, Moons KGM: Prognosis and prognostic research: validating a prognostic model. BMJ. 2009, 338: b605-10.1136/bmj.b605.

Altman DG: Systematic reviews of evaluations of prognostic variables. Systematic Reviews in Health Care. Edited by: Egger M, Smith GD, Altman DG. 2001, London: BMJ Publishing Group, 228-47.


Ware JH: The limitations of risk factors as prognostic tools. N Engl J Med. 2006, 355: 2615-2617. 10.1056/NEJMp068249.

Harrell FE: Multivariable modeling strategies. Regression modeling strategies with applications to linear models, logistic regression, and survival analysis. 2001, New York: Springer,

Riley RD, Abrams KR, Sutton AJ: Reporting of prognostic markers: current problems and development of guidelines for evidence-based practice in the future. Br J Cancer. 2003, 88: 1191-1198. 10.1038/sj.bjc.6600886.


Acknowledgment

We thank Ilse Jansma, MSc, for her contributions as a medical information specialist regarding the Medline search strategy. No compensation was received for her contribution.

No external funding was received for this study.

Author information

Authors and affiliations

Department of Epidemiology and Biostatistics and the EMGO Institute for Health and Care Research, VU University Medical Centre, Amsterdam, The Netherlands

Tobias van den Berg, Martijn W Heymans & Henrica CW de Vet

Department of General Practice and the EMGO Institute for Health and Care Research, VU University Medical Centre, Amsterdam, The Netherlands

Stephanie S Leone & David Vergouw

Department of Community Health and Epidemiology, Dalhousie University, Halifax, Nova Scotia, Canada

Jill A Hayden

Department of General Practice, Erasmus Medical Centre, Rotterdam, The Netherlands

Arianne P Verhagen


Corresponding author

Correspondence to Tobias van den Berg .

Additional information

Competing interests.

All authors report no conflicts of interests.

Authors’ contributions

TvdB had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. Study concept and design: TvdB, MH, JH, AV, HdV. Acquisition of data: TvdB, MH, SL, DV, AV, HdV. Analysis and interpretation of data: TvdB, MH, HdV. Drafting of the manuscript: TvdB, MH, HdV. Critical revision of the manuscript for important intellectual content: TvdB, MH, SL, DV, JH, AV, HdV. Statistical analysis: TvdB. Study supervision: MH, HdV. All authors read and approved the final manuscript.


Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

van den Berg, T., Heymans, M.W., Leone, S.S. et al. Overview of data-synthesis in systematic reviews of studies on outcome prediction models. BMC Med Res Methodol 13, 42 (2013). https://doi.org/10.1186/1471-2288-13-42


Received: 26 September 2012

Accepted: 04 March 2013

Published: 16 March 2013

DOI: https://doi.org/10.1186/1471-2288-13-42


Keywords: Meta-analysis; Forecasting



RMIT University

Teaching and Research guides

Systematic reviews

What is synthesis?


Synthesis is a stage in the systematic review process where extracted data (findings of individual studies) are combined and evaluated. The synthesis part of a systematic review will determine the outcomes of the review.

There are two commonly accepted methods of synthesis in systematic reviews:

  • Quantitative data synthesis
  • Qualitative data synthesis

The way the data is extracted from your studies, synthesised, and presented depends on the type of data being handled.

If you have quantitative information, some of the more common tools used to summarise data include:

  • grouping of similar data, i.e. presenting the results in tables
  • charts, e.g. pie-charts
  • graphical displays such as forest plots

If you have qualitative information, some of the more common tools used to summarise data include:

  • textual descriptions, i.e. written words
  • thematic or content analysis

Whatever tool/s you use, the general purpose of extracting and synthesising data is to show the outcomes and effects of various studies and identify issues with methodology and quality. This means that your synthesis might reveal a number of elements, including:

  • overall level of evidence
  • the degree of consistency in the findings
  • what the positive effects of a drug or treatment are, and what these effects are based on
  • how many studies found a relationship or association between two things

Quantitative synthesis (meta-analysis)

In a quantitative systematic review, data is presented statistically. Typically, this is referred to as a meta-analysis.

The usual method is to combine and evaluate data from multiple studies. This is normally done in order to draw conclusions about outcomes, effects, shortcomings of studies and/or applicability of findings.

Remember, the data you synthesise should relate to your research question and protocol (plan). In the case of quantitative analysis, the data extracted and synthesised will relate to whatever method was used to generate the research question (e.g. PICO method), and whatever quality appraisals were undertaken in the analysis stage.

One way of accurately representing all of your data is in the form of a forest plot. A forest plot is a way of combining results of multiple clinical trials in order to show point estimates arising from different studies of the same condition or treatment.

It comprises a graphical representation and often also a table. The graphical display shows the mean value for each trial, usually with a confidence interval (the horizontal bars). Each mean is plotted relative to the vertical line of no difference.
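To make the idea concrete, the sketch below draws a very simple forest plot with matplotlib. The study names, odds ratios and confidence intervals are made up purely for illustration; a real plot would normally also show study weights and, if a meta-analysis has been done, a pooled-estimate diamond.

```python
# Minimal forest plot sketch (all study data are hypothetical).
import matplotlib.pyplot as plt
import numpy as np

studies = ["Study A", "Study B", "Study C", "Study D"]   # hypothetical studies
or_point = np.array([1.20, 0.90, 1.40, 1.55])            # hypothetical odds ratios
ci_low = np.array([0.80, 0.60, 1.15, 1.05])
ci_high = np.array([1.80, 1.35, 1.70, 2.30])

y = np.arange(len(studies))[::-1]                         # first study plotted at the top
fig, ax = plt.subplots(figsize=(6, 3))
ax.errorbar(or_point, y,
            xerr=[or_point - ci_low, ci_high - or_point], # horizontal bars = 95% CIs
            fmt="s", color="black", capsize=3)            # squares = point estimates
ax.axvline(1.0, linestyle="--", color="grey")             # vertical line of no difference
ax.set_xscale("log")                                      # ratio measures go on a log scale
ax.set_yticks(y)
ax.set_yticklabels(studies)
ax.set_xlabel("Odds ratio (log scale)")
plt.tight_layout()
plt.show()
```

The videos below walk through how to read and build forest plots in more detail.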

  • Forest Plots - Understanding a Meta-Analysis in 5 Minutes or Less (5:38 min) In this video, Dr. Maureen Dobbins, Scientific Director of the National Collaborating Centre for Methods and Tools, uses an example from social health to explain how to construct a forest plot graphic.
  • How to interpret a forest plot (5:32 min) In this video, Terry Shaneyfelt, Clinician-educator at UAB School of Medicine, talks about how to interpret information contained in a typical forest plot, including table data.
  • An introduction to meta-analysis (13 mins) Dr Christopher J. Carpenter introduces the concept of meta-analysis, a statistical approach to finding patterns and trends among research studies on the same topic. Meta-analysis allows the researcher to weight study results based on size, moderating variables, and other factors.

Journal articles

  • Neyeloff, J. L., Fuchs, S. C., & Moreira, L. B. (2012). Meta-analyses and Forest plots using a microsoft excel spreadsheet: step-by-step guide focusing on descriptive data analysis. BMC Research Notes, 5(1), 52-57. https://doi.org/10.1186/1756-0500-5-52 Provides a step-by-step guide on how to use Excel to perform a meta-analysis and generate forest plots.
  • Ried, K. (2006). Interpreting and understanding meta-analysis graphs: a practical guide. Australian Family Physician, 35(8), 635- 638. This article provides a practical guide to appraisal of meta-analysis graphs, and has been developed as part of the Primary Health Care Research Evaluation Development (PHCRED) capacity building program for training general practitioners and other primary health care professionals in research methodology.

Qualitative synthesis

In a qualitative systematic review, data can be presented in a number of different ways. A typical procedure in the health sciences is thematic analysis.

As explained by James Thomas and Angela Harden (2008) in an article for  BMC Medical Research Methodology : 

"Thematic synthesis has three stages:

  • the coding of text 'line-by-line'
  • the development of 'descriptive themes'
  • and the generation of 'analytical themes'

While the development of descriptive themes remains 'close' to the primary studies, the analytical themes represent a stage of interpretation whereby the reviewers 'go beyond' the primary studies and generate new interpretive constructs, explanations or hypotheses" (p. 45).

A good example of how to conduct a thematic analysis in a systematic review is the following journal article by Jørgensen et al. (2018) on cancer patients. In it, the authors go through the process of:

(a) identifying and coding information about the selected studies' methodologies and findings on patient care

(b) organising these codes into subheadings and descriptive categories

(c) developing these categories into analytical themes

Jørgensen, C. R., Thomsen, T. G., Ross, L., Dietz, S. M., Therkildsen, S., Groenvold, M., Rasmussen, C. L., & Johnsen, A. T. (2018). What facilitates “patient empowerment” in cancer patients during follow-up: A qualitative systematic review of the literature. Qualitative Health Research, 28(2), 292-304. https://doi.org/10.1177/1049732317721477

Thomas, J., & Harden, A. (2008). Methods for the thematic synthesis of qualitative research in systematic reviews. BMC Medical Research Methodology, 8(1), 45-54. https://doi.org/10.1186/1471-2288-8-45



The Ohio State University


Health Sciences Library

Systematic Reviews


Synthesize Your Results


Your collected data must be combined into a coherent whole and accompanied by an analysis that conveys a deeper understanding of the body of evidence. All reviews should include a qualitative synthesis, and may or may not include a quantitative synthesis (also known as a meta-analysis).

Qualitative synthesis

A qualitative synthesis is a narrative, textual approach to summarizing, analyzing, and assessing the body of evidence included in your review. It is a necessary part of all systematic reviews, even those with a focus on quantitative data.

Use the qualitative synthesis to:

  • Provide a general summary of the characteristics and findings of the included studies.
  • Analyze the relationships between studies, exploring patterns and investigating heterogeneity.
  • Discuss the applicability of the body of evidence to the review's question within the PICO structure.
  • Explain the meta-analysis (if one is conducted) and interpret and analyze the robustness of its results.
  • Critique the strengths and weaknesses of the body of evidence as a whole, including a cumulative assessment of the risk of bias across studies.
  • Discuss any gaps in the evidence, such as patient populations that have been inadequately studied or for whom results differ.
  • Compare the review's findings with current conventional wisdom when appropriate.

Quantitative synthesis (meta-analysis)

A quantitative synthesis, or meta-analysis, uses statistical techniques to combine and analyze the results of multiple studies. The feasibility and sensibility of including a meta-analysis as part of your systematic review will depend on the data available.
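As a small illustration of the statistical side, the sketch below computes a fixed-effect, inverse-variance pooled estimate and the usual heterogeneity statistics (Cochran's Q and I²) for a handful of hypothetical log odds ratios. It is a sketch only; real reviews would normally use dedicated meta-analysis software.

```python
# Sketch: inverse-variance pooling plus Cochran's Q and I^2 (hypothetical data).
import numpy as np
from scipy import stats

log_or = np.array([0.19, -0.11, 0.32, 0.43, 0.05])   # hypothetical study log odds ratios
se = np.array([0.20, 0.22, 0.09, 0.18, 0.25])        # hypothetical standard errors

w = 1.0 / se**2                                       # inverse-variance weights
pooled = np.sum(w * log_or) / np.sum(w)               # fixed-effect pooled log odds ratio

q = np.sum(w * (log_or - pooled)**2)                  # Cochran's Q statistic
df = len(log_or) - 1
p_het = stats.chi2.sf(q, df)                          # heterogeneity test P value
i2 = max(0.0, (q - df) / q) * 100                     # I^2: % of variability beyond chance

print(f"Pooled OR = {np.exp(pooled):.2f}, Q = {q:.2f} (p = {p_het:.3f}), I^2 = {i2:.0f}%")
```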

Requirements for quantitative synthesis:

  • Clinical and methodological similarity between compared studies
  • Consistent study quality among compared studies
  • Statistical expertise from a review team member or consultant
Mayo Clinic Libraries

Evidence Synthesis Guide: Synthesis & Meta-Analysis


Bringing It All Together


Synthesis involves pooling the extracted data from the included studies and summarizing the findings based on the overall strength of the evidence and consistency of observed effects. All reviews should include a qualitative synthesis and may also include a quantitative synthesis (i.e. meta-analysis). In a meta-analysis, data from sufficiently comparable and reliable studies are weighted and evaluated to determine the cumulative outcome. Tabulation and graphical display of the results (e.g. a forest plot showing the mean, range and variance from each study, visually aligned) are typically included for most forms of synthesis. Generally, conclusions are drawn about the usefulness of an intervention or the relevant body of literature, with suggestions for future research directions.
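As a rough illustration of how weighting works, the sketch below pools a few hypothetical study effects with inverse-variance weights and a DerSimonian-Laird estimate of between-study variance. The effects and standard errors are invented; this is not a substitute for the methods described in the resources below.

```python
# Sketch: inverse-variance weighting with a DerSimonian-Laird random-effects model
# (all study effects and standard errors are hypothetical log odds ratios).
import numpy as np

effect = np.array([0.22, -0.05, 0.35, 0.41, 0.10])
se = np.array([0.21, 0.19, 0.10, 0.17, 0.26])

w = 1.0 / se**2
fixed = np.sum(w * effect) / np.sum(w)                 # fixed-effect pooled estimate

q = np.sum(w * (effect - fixed)**2)                    # Cochran's Q
c = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (q - (len(effect) - 1)) / c)           # between-study variance (DL)

w_re = 1.0 / (se**2 + tau2)                            # random-effects weights
pooled = np.sum(w_re * effect) / np.sum(w_re)
se_pooled = np.sqrt(1.0 / np.sum(w_re))

low = np.exp(pooled - 1.96 * se_pooled)
high = np.exp(pooled + 1.96 * se_pooled)
print(f"Random-effects OR = {np.exp(pooled):.2f} (95% CI {low:.2f} to {high:.2f})")
```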

An AHRQ guide and Chapters 9, 10, 11, and 12 of the Cochrane Handbook further address meta-analyses and other synthesis methods.

Consult Cochrane Interactive Learning Module 6: Analyzing the Data and Module 7: Interpreting the Findings for further information. *Please note you will need to register for a Cochrane account while initially on the Mayo network. You'll receive an email message containing a link to create a password and activate your account.*

References & Recommended Reading

  • Morton SC, Murad MH, O’Connor E, Lee CS, Booth M, Vandermeer BW, Snowden JM, D’Anci KE, Fu R, Gartlehner G, Wang Z, Steele DW. Quantitative Synthesis—An Update . 2018 Feb 23. In: Methods Guide for Effectiveness and Comparative Effectiveness Reviews [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US).
  • McKenzie, JE, et al. Chapter 9: Summarizing study characteristics and preparing for synthesis. In: Higgins JPT, et al. (editors). Cochrane Handbook for Systematic Reviews of Interventions version 6.2. Cochrane, 2021. Available from  www.training.cochrane.org/handbook See - Section 9 
  • Deeks, JJ, et al. Chapter 10: Analysing data and undertaking meta-analyses. In: Higgins JPT, et al. (editors). Cochrane Handbook for Systematic Reviews of Interventions version 6.2. Cochrane, 2021. Available from  www.training.cochrane.org/handbook See - Section 10 
  • Chaimani, A, et al. Chapter 11: Undertaking network meta-analyses. In: Higgins JPT, et al. (editors). Cochrane Handbook for Systematic Reviews of Interventions version 6.2. Cochrane, 2021. Available from  www.training.cochrane.org/handbook See - Section 11 
  • McKenzie, JE, Brennan SE. Chapter 12: Synthesizing and presenting findings using other methods. In: Higgins JPT, et al. (editors). Cochrane Handbook for Systematic Reviews of Interventions version 6.2. Cochrane, 2021. Available from  www.training.cochrane.org/handbook See - Section 12  
  • Campbell M, McKenzie JE, Sowden A, et al.  Synthesis without meta-analysis (SWiM) in systematic reviews: reporting guideline.  BMJ (Clinical research ed). 2020;368:l6890.
  • Alavi M, Hunt GE, Visentin DC, Watson R, Thapa DK, Cleary M.  Seeing the forest for the trees: How to interpret a meta-analysis forest plot.  Journal of advanced nursing. 2021;77(3):1097-1101. doi:https://dx.doi.org/10.1111/jan.14721

University of Texas


Systematic Reviews & Evidence Synthesis Methods


Once you have completed your analysis, you will want to both summarize and synthesize those results. You may have a qualitative synthesis, a quantitative synthesis, or both.

Qualitative Synthesis

In a qualitative synthesis, you describe for readers how the pieces of your work fit together. You will summarize, compare, and contrast the characteristics and findings, exploring the relationships between them. Further, you will discuss the relevance and applicability of the evidence to your research question. You will also analyze the strengths and weaknesses of the body of evidence. Focus on where the gaps are in the evidence and provide recommendations for further research.

Quantitative Synthesis

Whether or not your Systematic Review includes a full meta-analysis, there is typically some element of data analysis. The quantitative synthesis combines and analyzes the evidence using statistical techniques. This includes comparing methodological similarities and differences and potentially the quality of the studies conducted.

Summarizing vs. Synthesizing

In a systematic review, researchers do more than summarize findings from identified articles. You will synthesize the information you want to include.

While a summary is a way of concisely relating important themes and elements from a larger work or works in a condensed form, a synthesis takes the information from a variety of works and combines them together to create something new.

Synthesis:

"The goal of a systematic synthesis of qualitative research is to integrate or compare the results across studies in order to increase understanding of a particular phenomenon, not to add studies together. Typically the aim is to identify broader themes or new theories – qualitative syntheses usually result in a narrative summary of cross-cutting or emerging themes or constructs, and/or conceptual models."

Denner, J., Marsh, E. & Campe, S. (2017). Approaches to reviewing research in education. In D. Wyse, N. Selwyn, & E. Smith (Eds.), The BERA/SAGE Handbook of educational research (Vol. 2, pp. 143-164). doi: 10.4135/9781473983953.n7

  • Approaches to Reviewing Research in Education from Sage Knowledge

Data synthesis  (Collaboration for Environmental Evidence Guidebook)

Interpreting findings and reporting conduct (Collaboration for Environmental Evidence Guidebook)

Interpreting results and drawing conclusions  (Cochrane Handbook, Chapter 15)

Guidance on the conduct of narrative synthesis in systematic reviews  (ESRC Methods Programme)



Cochrane Training

Chapter 12: Synthesizing and presenting findings using other methods

Joanne E McKenzie, Sue E Brennan

Key Points:

  • Meta-analysis of effect estimates has many advantages, but other synthesis methods may need to be considered in the circumstance where there is incompletely reported data in the primary studies.
  • Alternative synthesis methods differ in the completeness of the data they require, the hypotheses they address, and the conclusions and recommendations that can be drawn from their findings.
  • These methods provide more limited information for healthcare decision making than meta-analysis, but may be superior to a narrative description where some results are privileged above others without appropriate justification.
  • Tabulation and visual display of the results should always be presented alongside any synthesis, and are especially important for transparent reporting in reviews without meta-analysis.
  • Alternative synthesis and visual display methods should be planned and specified in the protocol. When writing the review, details of the synthesis methods should be described.
  • Synthesis methods that involve vote counting based on statistical significance have serious limitations and are unacceptable.

Cite this chapter as: McKenzie JE, Brennan SE. Chapter 12: Synthesizing and presenting findings using other methods [last updated October 2019]. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA (editors). Cochrane Handbook for Systematic Reviews of Interventions version 6.5. Cochrane, 2024. Available from www.training.cochrane.org/handbook .

12.1 Why a meta-analysis of effect estimates may not be possible

Meta-analysis of effect estimates has many potential advantages (see Chapter 10 and Chapter 11 ). However, there are circumstances where it may not be possible to undertake a meta-analysis and other statistical synthesis methods may be considered (McKenzie and Brennan 2014).

Some common reasons why it may not be possible to undertake a meta-analysis are outlined in Table 12.1.a . Legitimate reasons include limited evidence; incompletely reported outcome/effect estimates, or different effect measures used across studies; and bias in the evidence. Other commonly cited reasons for not using meta-analysis are because of too much clinical or methodological diversity, or statistical heterogeneity (Achana et al 2014). However, meta-analysis methods should be considered in these circumstances, as they may provide important insights if undertaken and interpreted appropriately.

Table 12.1.a Scenarios that may preclude meta-analysis, with possible solutions

Limited evidence for a pre-specified comparison

Meta-analysis is not possible with no studies, or only one study. This circumstance may reflect the infancy of research in a particular area, or that the review addresses a narrow question.

  • Possible solution: build contingencies into the analysis plan to group one or more of the PICO elements at a broader level.

Incompletely reported outcome or effect estimate

Within a study, the intervention effects may be incompletely reported (e.g. effect estimate with no measure of precision; direction of effect with P value or statement of statistical significance; only the direction of effect).

  • Possible solutions: calculate the effect estimate and measure of precision from the available statistics if possible; impute missing statistics (e.g. standard deviations) where possible.

Different effect measures

Across studies, the same outcome could be treated differently (e.g. a time-to-event outcome has been dichotomized in some studies) or analysed using different methods. Both scenarios could lead to different effect measures (e.g. hazard ratios and odds ratios).

  • Possible solutions: calculate the effect estimate and measure of precision for the same effect measure from the available statistics if possible; transform effect measures (e.g. convert a standardized mean difference to an odds ratio) where possible.

Bias in the evidence

Concerns about missing studies, missing outcomes within the studies, or bias in the studies are legitimate reasons for not undertaking a meta-analysis. These concerns apply similarly to other synthesis methods, although incompletely reported outcomes/effects may bias meta-analyses but not necessarily other synthesis methods.

Clinical and methodological diversity

Concerns about diversity in the populations, interventions, outcomes and study designs are often-cited reasons for not using meta-analysis (Ioannidis et al 2008). Arguments against using meta-analysis because of too much diversity apply equally to the other synthesis methods (Valentine et al 2010).

  • Possible solution: modify planned comparisons, providing a rationale for post-hoc changes.

Statistical heterogeneity

Statistical heterogeneity is an often-cited reason for not reporting the meta-analysis result (Ioannidis et al 2008). Presentation of an average combined effect in this circumstance can be misleading, particularly if the estimated effects across the studies are both harmful and beneficial.

  • Possible solutions: attempt to reduce heterogeneity (e.g. checking the data, correcting an inappropriate choice of effect measure); attempt to explain heterogeneity (e.g. using subgroup analysis); consider (if possible) presenting a prediction interval, which provides a predicted range for the true intervention effect in an individual study (Riley et al 2011), thus clearly demonstrating the uncertainty in the intervention effects.

Possible solutions are discussed in more detail in the remainder of this chapter.
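The final scenario above suggests presenting a prediction interval to show the uncertainty in intervention effects across studies. A minimal sketch of the usual approximate calculation (Riley et al 2011) is given below; the pooled estimate, its standard error, the between-study variance and the number of studies are all hypothetical inputs.

```python
# Sketch: approximate 95% prediction interval for the effect in a new study (Riley et al 2011).
import numpy as np
from scipy import stats

k = 8          # number of studies (hypothetical)
mu = 0.25      # random-effects pooled log odds ratio (hypothetical)
se_mu = 0.09   # standard error of the pooled estimate (hypothetical)
tau2 = 0.04    # estimated between-study variance (hypothetical)

t_crit = stats.t.ppf(0.975, df=k - 2)                 # t distribution with k - 2 df
half_width = t_crit * np.sqrt(tau2 + se_mu**2)        # combines between- and within-study uncertainty

low, high = np.exp(mu - half_width), np.exp(mu + half_width)
print(f"95% prediction interval (odds ratio scale): {low:.2f} to {high:.2f}")
```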

12.2 Statistical synthesis when meta-analysis of effect estimates is not possible

A range of statistical synthesis methods are available, and these may be divided into three categories based on their preferability ( Table 12.2.a ). Preferable methods are the meta-analysis methods outlined in Chapter 10 and Chapter 11 , and are not discussed in detail here. This chapter focuses on methods that might be considered when a meta-analysis of effect estimates is not possible due to incompletely reported data in the primary studies. These methods divide into those that are ‘acceptable’ and ‘unacceptable’. The ‘acceptable’ methods differ in the data they require, the hypotheses they address, limitations around their use, and the conclusions and recommendations that can be drawn (see Section 12.2.1 ). The ‘unacceptable’ methods in common use are described (see Section 12.2.2 ), along with the reasons for why they are problematic.

Compared with meta-analysis methods, the ‘acceptable’ synthesis methods provide more limited information for healthcare decision making. However, these ‘acceptable’ methods may be superior to a narrative that describes results study by study, which comes with the risk that some studies or findings are privileged above others without appropriate justification. Further, in reviews with little or no synthesis, readers are left to make sense of the research themselves, which may result in the use of seemingly simple yet problematic synthesis methods such as vote counting based on statistical significance (see Section 12.2.2.1 ).

All methods first involve calculation of a ‘standardized metric’, followed by application of a synthesis method. In applying any of the following synthesis methods, it is important that only one outcome per study (or other independent unit, for example one comparison from a trial with multiple intervention groups) contributes to the synthesis. Chapter 9 outlines approaches for selecting an outcome when multiple have been measured. Similar to meta-analysis, sensitivity analyses can be undertaken to examine if the findings of the synthesis are robust to potentially influential decisions (see Chapter 10, Section 10.14 and Section 12.4 for examples).

Authors should report the specific methods used in lieu of meta-analysis (including approaches used for presentation and visual display), rather than stating that they have conducted a ‘narrative synthesis’ or ‘narrative summary’ without elaboration. The limitations of the chosen methods must be described, and conclusions worded with appropriate caution. The aim of reporting this detail is to make the synthesis process more transparent and reproducible, and help ensure use of appropriate methods and interpretation.

Table 12.2.a Summary of preferable and acceptable synthesis methods

Preferable

Meta-analysis of effect estimates and extensions (Chapter 10 and Chapter 11)

  • Questions addressed: What is the common intervention effect? What is the average intervention effect? Which intervention, of multiple, is most effective? What factors modify the magnitude of the intervention effects?
  • Use and output: can be used to synthesize results when effect estimates and their variances are reported (or can be calculated); provides a combined estimate of the average intervention effect (random effects) and the precision of this estimate (95% CI); can be used to synthesize evidence from multiple interventions, with the ability to rank them (network meta-analysis); can be used to detect, quantify and investigate heterogeneity (meta-regression/subgroup analysis).
  • Visual displays: forest plot, funnel plot, network diagram, rankogram plot.
  • Limitations: requires effect estimates and their variances; extensions (network meta-analysis, meta-regression/subgroup analysis) require a reasonably large number of studies; meta-regression/subgroup analysis involves observational comparisons, requires careful interpretation and carries a high risk of false positive conclusions for sources of heterogeneity; network meta-analysis is more complicated to undertake and requires careful assessment of the assumptions.

Acceptable

Summarizing effect estimates

  • Question addressed: What is the range and distribution of observed effects?
  • Use and output: can be used to synthesize results when it is difficult to undertake a meta-analysis (e.g. missing variances of effects, unit of analysis errors); provides information on the magnitude and range of effects (median, interquartile range, range).
  • Visual displays: box-and-whisker plot, bubble plot.
  • Limitations: does not account for differences in the relative sizes of the studies; the performance of these statistics applied in the context of summarizing effect estimates has not been evaluated.

Combining P values

  • Question addressed: Is there evidence that there is an effect in at least one study?
  • Use and output: can be used to synthesize results when studies report only P values and the direction of effect, when different outcomes or statistical tests are used across studies, or when results come from non-parametric tests (see Section 12.2.1.2).
  • Visual display: albatross plot.
  • Limitations: provides no information on the magnitude of effects; does not distinguish between evidence from large studies with small effects and small studies with large effects; difficult to interpret the test results when statistically significant, since the null hypothesis can be rejected on the basis of an effect in only one study (Jones 1995); when combining P values from few, small studies, failure to reject the null hypothesis should not be interpreted as evidence of no effect in all studies.

Vote counting based on direction of effect

  • Question addressed: Is there any evidence of an effect?
  • Use and output: can be used to synthesize results when only the direction of effect is reported, or there is inconsistency in the effect measures or data reported across studies.
  • Visual displays: harvest plot, effect direction plot.
  • Limitations: provides no information on the magnitude of effects (Borenstein et al 2009); does not account for differences in the relative sizes of the studies (Borenstein et al 2009); less powerful than methods used to combine P values.

12.2.1 Acceptable synthesis methods

12.2.1.1 Summarizing effect estimates

Description of method: Summarizing effect estimates might be considered in the circumstance where estimates of intervention effect are available (or can be calculated), but the variances of the effects are not reported or are incorrect (and cannot be calculated from other statistics, or reasonably imputed) (Grimshaw et al 2003). Incorrect calculation of variances arises more commonly in non-standard study designs that involve clustering or matching (Chapter 23). While missing variances may limit the possibility of meta-analysis, the (standardized) effects can be summarized using descriptive statistics such as the median, interquartile range, and the range. Calculating these statistics addresses the question ‘What is the range and distribution of observed effects?’

Reporting of methods and results: The statistics that will be used to summarize the effects (e.g. median, interquartile range) should be reported. Box-and-whisker or bubble plots will complement reporting of the summary statistics by providing a visual display of the distribution of observed effects (Section 12.3.3). Tabulation of the available effect estimates will provide transparency for readers by linking the effects to the studies (Section 12.3.1). Limitations of the method should be acknowledged (Table 12.2.a).
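A minimal sketch of this kind of descriptive summary is shown below. The odds ratios are hypothetical, and the summaries are calculated on the log scale and back-transformed for reporting.

```python
# Sketch: median, interquartile range and range of (standardized) effect estimates
# when variances are unavailable and meta-analysis is not possible (hypothetical ORs).
import numpy as np

odds_ratios = np.array([0.79, 0.85, 0.90, 1.02, 1.11, 1.21, 1.32, 1.38, 1.54, 1.80])
log_or = np.log(odds_ratios)                      # summarize on the log scale

median_or = np.exp(np.median(log_or))
q1, q3 = np.exp(np.percentile(log_or, [25, 75]))  # interquartile range, back-transformed

print(f"Median OR {median_or:.2f}; IQR {q1:.2f} to {q3:.2f}; "
      f"range {odds_ratios.min():.2f} to {odds_ratios.max():.2f} ({len(odds_ratios)} studies)")
```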

12.2.1.2 Combining P values

Description of method: Combining P values can be considered in the circumstance where there is no, or minimal, information reported beyond P values and the direction of effect; the types of outcomes and statistical tests differ across the studies; or results from non-parametric tests are reported (Borenstein et al 2009). Combining P values addresses the question ‘Is there evidence that there is an effect in at least one study?’ There are several methods available (Loughin 2004), with the method proposed by Fisher outlined here (Becker 1994).

Fisher’s method combines the P values from statistical tests across k studies using the formula:

$$X^2 = -2 \sum_{i=1}^{k} \ln(P_i)$$

which, when there is no effect in every study, follows a chi-squared distribution with 2k degrees of freedom.

One-sided P values are used, since these contain information about the direction of effect. However, these P values must reflect the same directional hypothesis (e.g. all testing if intervention A is more effective than intervention B). This is analogous to standardizing the direction of effects before undertaking a meta-analysis. Two-sided P values, which do not contain information about the direction, must first be converted to one-sided P values. If the effect is consistent with the directional hypothesis (e.g. intervention A is beneficial compared with B), then the one-sided P value is calculated as

$$P_{\text{one-sided}} = \frac{P_{\text{two-sided}}}{2};$$

otherwise, the one-sided P value is calculated as $1 - P_{\text{two-sided}}/2$.

In studies that do not report an exact P value but report a conventional level of significance (e.g. P<0.05), a conservative option is to use the threshold (e.g. 0.05). The P values must have been computed from statistical tests that appropriately account for the features of the design, such as clustering or matching, otherwise they will likely be incorrect.
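A minimal sketch of Fisher's method is shown below. The P values and directions of effect are hypothetical, and scipy.stats.combine_pvalues offers an equivalent built-in calculation.

```python
# Sketch: Fisher's method for combining one-sided P values (hypothetical inputs).
import numpy as np
from scipy import stats

two_sided_p = np.array([0.04, 0.20, 0.38, 0.05])                # reported two-sided P values
favours_intervention = np.array([True, True, False, True])      # direction of each effect

# Convert to one-sided P values testing the same directional hypothesis.
one_sided_p = np.where(favours_intervention, two_sided_p / 2, 1 - two_sided_p / 2)

fisher_x2 = -2 * np.sum(np.log(one_sided_p))                    # Fisher's statistic
df = 2 * len(one_sided_p)                                       # 2k degrees of freedom
p_combined = stats.chi2.sf(fisher_x2, df)

print(f"Chi-squared = {fisher_x2:.2f} on {df} df, combined P = {p_combined:.3f}")
```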


Reporting of methods and results: There are several methods for combining P values (Loughin 2004), so the chosen method should be reported, along with details of sensitivity analyses that examine if the results are sensitive to the choice of method. The results from the test should be reported alongside any available effect estimates (either individual results or meta-analysis results of a subset of studies) using text, tabulation and appropriate visual displays (Section 12.3). The albatross plot is likely to complement the analysis (Section 12.3.4). Limitations of the method should be acknowledged (Table 12.2.a).

12.2.1.3 Vote counting based on the direction of effect

Description of method: Vote counting based on the direction of effect might be considered in the circumstance where the direction of effect is reported (with no further information), or there is no consistent effect measure or data reported across studies. The essence of vote counting is to compare the number of effects showing benefit to the number of effects showing harm for a particular outcome. However, there is wide variation in the implementation of the method due to differences in how ‘benefit’ and ‘harm’ are defined. Rules based on subjective decisions or statistical significance are problematic and should be avoided (see Section 12.2.2).

To undertake vote counting properly, each effect estimate is first categorized as showing benefit or harm based on the observed direction of effect alone, thereby creating a standardized binary metric. A count of the number of effects showing benefit is then compared with the number showing harm. Neither statistical significance nor the size of the effect are considered in the categorization. A sign test can be used to answer the question ‘is there any evidence of an effect?’ If there is no effect, the study effects will be distributed evenly around the null hypothesis of no difference. This is equivalent to testing if the true proportion of effects favouring the intervention (or comparator) is equal to 0.5 (Bushman and Wang 2009) (see Section 12.4.2.3 for guidance on implementing the sign test). An estimate of the proportion of effects favouring the intervention can be calculated ( p = u / n , where u = number of effects favouring the intervention, and n = number of studies) along with a confidence interval (e.g. using the Wilson or Jeffreys interval methods (Brown et al 2001)). Unless there are many studies contributing effects to the analysis, there will be large uncertainty in this estimated proportion.
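A minimal sketch of this calculation is given below, using hypothetical counts. The sign test uses scipy, and the Wilson confidence interval is computed directly from its standard formula.

```python
# Sketch: vote counting based on direction of effect (hypothetical counts).
import numpy as np
from scipy import stats

u, n = 9, 12                                     # 9 of 12 effects favour the intervention
p_sign = stats.binomtest(u, n, p=0.5).pvalue     # sign test against a true proportion of 0.5

# Wilson 95% confidence interval for the proportion u/n.
z = 1.96
p_hat = u / n
centre = (p_hat + z**2 / (2 * n)) / (1 + z**2 / n)
half = z * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)

print(f"Proportion favouring intervention = {p_hat:.2f} "
      f"(95% Wilson CI {centre - half:.2f} to {centre + half:.2f}); sign test P = {p_sign:.3f}")
```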

Reporting of methods and results: The vote counting method should be reported in the ‘Data synthesis’ section of the review. Failure to recognize vote counting as a synthesis method has led to it being applied informally (and perhaps unintentionally) to summarize results (e.g. through the use of wording such as ‘3 of 10 studies showed improvement in the outcome with intervention compared to control’; ‘most studies found’; ‘the majority of studies’; ‘few studies’ etc). In such instances, the method is rarely reported, and it may not be possible to determine whether an unacceptable (invalid) rule has been used to define benefit and harm (Section 12.2.2). The results from vote counting should be reported alongside any available effect estimates (either individual results or meta-analysis results of a subset of studies) using text, tabulation and appropriate visual displays (Section 12.3). The number of studies contributing to a synthesis based on vote counting may be larger than a meta-analysis, because only minimal statistical information (i.e. direction of effect) is required from each study to vote count. Vote counting results are used to derive the harvest and effect direction plots, although often using unacceptable methods of vote counting (see Section 12.3.5). Limitations of the method should be acknowledged (Table 12.2.a).

12.2.2 Unacceptable synthesis methods

12.2.2.1 Vote counting based on statistical significance

Conventional forms of vote counting use rules based on statistical significance and direction to categorize effects. For example, effects may be categorized into three groups: those that favour the intervention and are statistically significant (based on some predefined P value), those that favour the comparator and are statistically significant, and those that are statistically non-significant (Hedges and Vevea 1998). In a simpler formulation, effects may be categorized into two groups: those that favour the intervention and are statistically significant, and all others (Friedman 2001). Regardless of the specific formulation, when based on statistical significance, all have serious limitations and can lead to the wrong conclusion.

The conventional vote counting method fails because underpowered studies that do not rule out clinically important effects are counted as not showing benefit. Suppose, for example, the effect sizes estimated in two studies were identical. However, only one of the studies was adequately powered, and the effect in this study was statistically significant. Only this one effect (of the two identical effects) would be counted as showing ‘benefit’. Paradoxically, Hedges and Vevea showed that as the number of studies increases, the power of conventional vote counting tends to zero, except with large studies and at least moderate intervention effects (Hedges and Vevea 1998). Further, conventional vote counting suffers the same disadvantages as vote counting based on direction of effect, namely, that it does not provide information on the magnitude of effects and does not account for differences in the relative sizes of the studies.

12.2.2.2 Vote counting based on subjective rules

Subjective rules, involving a combination of direction, statistical significance and magnitude of effect, are sometimes used to categorize effects. For example, in a review examining the effectiveness of interventions for teaching quality improvement to clinicians, the authors categorized results as ‘beneficial effects’, ‘no effects’ or ‘detrimental effects’ (Boonyasai et al 2007). Categorization was based on direction of effect and statistical significance (using a predefined P value of 0.05) when available. If statistical significance was not reported, effects greater than 10% were categorized as ‘beneficial’ or ‘detrimental’, depending on their direction. These subjective rules often vary in the elements, cut-offs and algorithms used to categorize effects, and while detailed descriptions of the rules may provide a veneer of legitimacy, such rules have poor performance validity (Ioannidis et al 2008).

A further problem occurs when the rules are not described in sufficient detail for the results to be reproduced (e.g. ter Wee et al 2012, Thornicroft et al 2016). This lack of transparency does not allow determination of whether an acceptable or unacceptable vote counting method has been used (Valentine et al 2010).

12.3 Visual display and presentation of the data

Visual display and presentation of data is especially important for transparent reporting in reviews without meta-analysis, and should be considered irrespective of whether synthesis is undertaken (see Table 12.2.a for a summary of plots associated with each synthesis method). Tables and plots structure information to show patterns in the data and convey detailed information more efficiently than text. This aids interpretation and helps readers assess the veracity of the review findings.

12.3.1 Structured tabulation of results across studies

Ordering studies alphabetically by study ID is the simplest approach to tabulation; however, more information can be conveyed when studies are grouped in subpanels or ordered by a characteristic important for interpreting findings. The grouping of studies in tables should generally follow the structure of the synthesis presented in the text, which should closely reflect the review questions. This grouping should help readers identify the data on which findings are based and verify the review authors’ interpretation.

If the purpose of the table is comparative, grouping studies by any of the following characteristics might be informative:

  • comparisons considered in the review, or outcome domains (according to the structure of the synthesis);
  • study characteristics that may reveal patterns in the data, for example potential effect modifiers including population subgroups, settings or intervention components.

If the purpose of the table is complete and transparent reporting of data, then ordering the studies to increase the prominence of the most relevant and trustworthy evidence should be considered. Possibilities include:

  • certainty of the evidence (synthesized result or individual studies if no synthesis);
  • risk of bias, study size or study design characteristics; and
  • characteristics that determine how directly a study addresses the review question, for example relevance and validity of the outcome measures.

One disadvantage of grouping by study characteristics is that it can be harder to locate specific studies than when tables are ordered by study ID alone, for example when cross-referencing between the text and tables. Ordering by study ID within categories may partly address this.

The value of standardizing intervention and outcome labels is discussed in Chapter 3 (Section 3.2.2 and Section 3.2.4), while the importance of, and methods for, standardizing effect estimates are described in Chapter 6. These practices can aid readers’ interpretation of tabulated data, especially when the purpose of a table is comparative.

12.3.2 Forest plots

Forest plots and methods for preparing them are described elsewhere ( Chapter 10, Section 10.2 ). Some mention is warranted here of their importance for displaying study results when meta-analysis is not undertaken (i.e. without the summary diamond). Forest plots can aid interpretation of individual study results and convey overall patterns in the data, especially when studies are ordered by a characteristic important for interpreting results (e.g. dose and effect size, sample size). Similarly, grouping studies in subpanels based on characteristics thought to modify effects, such as population subgroups, variants of an intervention, or risk of bias, may help explore and explain differences across studies (Schriger et al 2010). These approaches to ordering provide important techniques for informally exploring heterogeneity in reviews without meta-analysis, and should be considered in preference to alphabetical ordering by study ID alone (Schriger et al 2010).

12.3.3 Box-and-whisker plots and bubble plots

Box-and-whisker plots (see Figure 12.4.a , Panel A) provide a visual display of the distribution of effect estimates (Section 12.2.1.1 ). The plot conventionally depicts five values. The upper and lower limits (or ‘hinges’) of the box, represent the 75th and 25th percentiles, respectively. The line within the box represents the 50th percentile (median), and the whiskers represent the extreme values (McGill et al 1978). Multiple box plots can be juxtaposed, providing a visual comparison of the distributions of effect estimates (Schriger et al 2006). For example, in a review examining the effects of audit and feedback on professional practice, the format of the feedback (verbal, written, both verbal and written) was hypothesized to be an effect modifier (Ivers et al 2012). Box-and-whisker plots of the risk differences were presented separately by the format of feedback, to allow visual comparison of the impact of format on the distribution of effects. When presenting multiple box-and-whisker plots, the width of the box can be varied to indicate the number of studies contributing to each. The plot’s common usage facilitates rapid and correct interpretation by readers (Schriger et al 2010). The individual studies contributing to the plot are not identified (as in a forest plot), however, and the plot is not appropriate when there are few studies (Schriger et al 2006).

A bubble plot (see Figure 12.4.a , Panel B) can also be used to provide a visual display of the distribution of effects, and is more suited than the box-and-whisker plot when there are few studies (Schriger et al 2006). The plot is a scatter plot that can display multiple dimensions through the location, size and colour of the bubbles. In a review examining the effects of educational outreach visits on professional practice, a bubble plot was used to examine visually whether the distribution of effects was modified by the targeted behaviour (O’Brien et al 2007). Each bubble represented the effect size (y-axis) and whether the study targeted a prescribing or other behaviour (x-axis). The size of the bubbles reflected the number of study participants. However, different formulations of the bubble plot can display other characteristics of the data (e.g. precision, risk-of-bias assessments).
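The sketch below juxtaposes box-and-whisker plots for two hypothetical subgroups and adds a simple bubble plot in which bubble area reflects study sample size. All values are simulated purely for illustration.

```python
# Sketch: box-and-whisker plots by subgroup plus a simple bubble plot (simulated data).
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
verbal = rng.normal(0.06, 0.05, 8)      # risk differences, feedback delivered verbally
written = rng.normal(0.02, 0.04, 6)     # risk differences, feedback delivered in writing

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.boxplot([verbal, written])                       # distribution of effects per subgroup
ax1.set_xticks([1, 2])
ax1.set_xticklabels(["Verbal", "Written"])
ax1.axhline(0, linestyle="--", color="grey")
ax1.set_ylabel("Risk difference")

effects = np.concatenate([verbal, written])
group = np.array([0] * len(verbal) + [1] * len(written))
sizes = rng.integers(50, 500, len(effects))          # hypothetical study sample sizes
ax2.scatter(group, effects, s=sizes / 2, alpha=0.5)  # bubble area reflects sample size
ax2.set_xticks([0, 1])
ax2.set_xticklabels(["Verbal", "Written"])
ax2.axhline(0, linestyle="--", color="grey")
plt.tight_layout()
plt.show()
```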

12.3.4 Albatross plot

The albatross plot (see Figure 12.4.a , Panel C) allows approximate examination of the underlying intervention effect sizes where there is minimal reporting of results within studies (Harrison et al 2017). The plot only requires a two-sided P value, sample size and direction of effect (or equivalently, a one-sided P value and a sample size) for each result. The plot is a scatter plot of the study sample sizes against two-sided P values, where the results are separated by the direction of effect. Superimposed on the plot are ‘effect size contours’ (inspiring the plot’s name). These contours are specific to the type of data (e.g. continuous, binary) and statistical methods used to calculate the P values. The contours allow interpretation of the approximate effect sizes of the studies, which would otherwise not be possible due to the limited reporting of the results. Characteristics of studies (e.g. type of study design) can be identified using different colours or symbols, allowing informal comparison of subgroups.

The plot is likely to be more inclusive of the available studies than meta-analysis, because of its minimal data requirements. However, the plot should complement the results from a statistical synthesis, ideally a meta-analysis of available effects.
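A stripped-down sketch of the underlying scatter is shown below: study sample size plotted against the two-sided P value, with results separated by direction of effect. The effect-size contours that give the real albatross plot its value are omitted here, and all data are hypothetical.

```python
# Sketch: albatross-style scatter of sample size against P value, split by direction
# of effect (effect-size contours omitted; all data hypothetical).
import matplotlib.pyplot as plt
import numpy as np

p = np.array([0.03, 0.20, 0.008, 0.45, 0.06, 0.12, 0.70, 0.04])
n = np.array([120, 80, 400, 60, 250, 150, 90, 300])
direction = np.array([1, 1, 1, -1, 1, -1, -1, 1])   # +1 favours intervention, -1 comparator

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(direction * p, n, color="black")          # results mirrored around the centre line
ax.axvline(0, color="grey")
ax.set_xlabel("Two-sided P value (left: favours comparator; right: favours intervention)")
ax.set_ylabel("Study sample size")
plt.tight_layout()
plt.show()
```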

12.3.5 Harvest and effect direction plots

Harvest plots (see Figure 12.4.a , Panel D) provide a visual extension of vote counting results (Ogilvie et al 2008). In the plot, studies based on the categorization of their effects (e.g. ‘beneficial effects’, ‘no effects’ or ‘detrimental effects’) are grouped together. Each study is represented by a bar positioned according to its categorization. The bars can be ‘visually weighted’ (by height or width) and annotated to highlight study and outcome characteristics (e.g. risk-of-bias domains, proximal or distal outcomes, study design, sample size) (Ogilvie et al 2008, Crowther et al 2011). Annotation can also be used to identify the studies. A series of plots may be combined in a matrix that displays, for example, the vote counting results from different interventions or outcome domains.

The methods papers describing harvest plots have employed vote counting based on statistical significance (Ogilvie et al 2008, Crowther et al 2011). For the reasons outlined in Section 12.2.2.1 , this can be misleading. However, an acceptable approach would be to display the results based on direction of effect.

The effect direction plot is similar in concept to the harvest plot in the sense that both display information on the direction of effects (Thomson and Thomas 2013). In the first version of the effect direction plot, the direction of effects for each outcome within a single study are displayed, while the second version displays the direction of the effects for outcome domains across studies . In this second version, an algorithm is first applied to ‘synthesize’ the directions of effect for all outcomes within a domain (e.g. outcomes ‘sleep disturbed by wheeze’, ‘wheeze limits speech’, ‘wheeze during exercise’ in the outcome domain ‘respiratory’). This algorithm is based on the proportion of effects that are in a consistent direction and statistical significance. Arrows are used to indicate the reported direction of effect (for either outcomes or outcome domains). Features such as statistical significance, study design and sample size are denoted using size and colour. While this version of the plot conveys a large amount of information, it requires further development before its use can be recommended since the algorithm underlying the plot is likely to have poor performance validity.

12.4 Worked example

The example that follows uses four scenarios to illustrate methods for presentation and synthesis when meta-analysis is not possible. The first scenario contrasts a common approach to tabulation with alternative presentations that may enhance the transparency of reporting and interpretation of findings. Subsequent scenarios show the application of the synthesis approaches outlined in preceding sections of the chapter. Box 12.4.a summarizes the review comparisons and outcomes, and decisions taken by the review authors in planning their synthesis. While the example is loosely based on an actual review, the review description, scenarios and data are fabricated for illustration.

Box 12.4.a The review

The review used in this example examines the effects of midwife-led continuity models versus other models of care for childbearing women. One of the outcomes considered in the review, and of interest to many women choosing a care option, is maternal satisfaction with care. The review included 15 randomized trials, all of which reported a measure of satisfaction. Overall, 32 satisfaction outcomes were reported, with between one and 11 outcomes reported per study. There were differences in the concepts measured (e.g. global satisfaction; specific domains such as of satisfaction with information), the measurement period (i.e. antenatal, intrapartum, postpartum care), and the measurement tools (different scales; variable evidence of validity and reliability).

 

Before conducting their synthesis, the review authors did the following.

  • Grouped the outcomes. Five types of satisfaction outcomes were defined (global measures, satisfaction with information, satisfaction with decisions, satisfaction with care, sense of control), any of which would be grouped for synthesis since they all broadly reflect satisfaction with care. The review authors hypothesized that the period of care (antenatal, intrapartum, postpartum) might influence satisfaction with a model of care, so planned to analyse outcomes for each period separately. The review authors specified that outcomes would be synthesized across periods if data were sparse.
  • Set decision rules for selecting one outcome per study. For studies that reported multiple satisfaction outcomes per period, one outcome would be chosen by (i) selecting the most relevant outcome (a global measure > satisfaction with care > sense of control > satisfaction with decisions > satisfaction with information), and if there were two or more equally relevant outcomes, then (ii) selecting the measurement tool with the best evidence of validity and reliability.
  • Categorized the outcomes and applied the decision rules. All studies had similar models of care as a comparator. Satisfaction outcomes from each study were categorized into one of the five pre-specified categories, and then the decision rules were applied to select the most relevant outcome for synthesis.
  • Selected the synthesis method based on the available data. All measures of satisfaction were ordinal; however, outcomes were treated differently across studies. In some studies, the outcome was dichotomized, while in others it was treated as ordinal or continuous. Based on their pre-specified synthesis methods, the review authors selected the preferred method for the available data. In this example, four scenarios, with progressively fewer data, are used to illustrate the application of alternative synthesis methods.
  • No changes were required to comparisons or outcome groupings.

12.4.1 Scenario 1: structured reporting of effects

We first address a scenario in which review authors have decided that the tools used to measure satisfaction measured concepts that were too dissimilar across studies for synthesis to be appropriate. Setting aside three of the 15 studies that reported on the birth partner’s satisfaction with care, a structured summary of effects is sought of the remaining 12 studies. To keep the example table short, only one outcome is shown per study for each of the measurement periods (antenatal, intrapartum or postpartum).

Table 12.4.a depicts a common yet suboptimal approach to presenting results. Note two features.

  • Studies are ordered by study ID, rather than grouped by characteristics that might enhance interpretation (e.g. risk of bias, study size, validity of the measures, certainty of the evidence (GRADE)).
  • Data reported are as extracted from each study; effect estimates were not calculated by the review authors and, where reported, were not standardized across studies (although data were available to do both).

Table 12.4.b shows an improved presentation of the same results. In line with best practice, here effect estimates have been calculated by the review authors for all outcomes, and a common metric computed to aid interpretation (in this case an odds ratio; see Chapter 6 for guidance on conversion of statistics to the desired format). Redundant information has been removed (‘statistical test’ and ‘P value’ columns). The studies have been re-ordered, first to group outcomes by period of care (intrapartum outcomes are shown here), and then by risk of bias. This re-ordering serves two purposes. Grouping by period of care aligns with the plan to consider outcomes for each period separately and ensures the table structure matches the order in which results are described in the text. Re-ordering by risk of bias increases the prominence of studies at lowest risk of bias, focusing attention on the results that should most influence conclusions. Had the review authors determined that a synthesis would be informative, then ordering to facilitate comparison across studies would be appropriate; for example, ordering by the type of satisfaction outcome (as pre-defined in the protocol, starting with global measures of satisfaction), or the comparisons made in the studies.

The results may also be presented in a forest plot, as shown in Figure 12.4.b. In both the table and figure, studies are grouped by risk of bias to focus attention on the most trustworthy evidence. The pattern of effects across studies is immediately apparent in Figure 12.4.b and can be described efficiently without having to interpret each estimate (e.g. differences between studies at low and high risk of bias emerge), although these results should be interpreted with caution in the absence of a formal test for subgroup differences (see Chapter 10, Section 10.11). Only outcomes measured during the intrapartum period are displayed, although outcomes from other periods could be added, maximizing the information conveyed.

An example description of the results from Scenario 1 is provided in Box 12.4.b . It shows that describing results study by study becomes unwieldy with more than a few studies, highlighting the importance of tables and plots. It also brings into focus the risk of presenting results without any synthesis, since it seems likely that the reader will try to make sense of the results by drawing inferences across studies. Since a synthesis was considered inappropriate, GRADE was applied to individual studies and then used to prioritize the reporting of results, focusing attention on the most relevant and trustworthy evidence. An alternative might be to report results at low risk of bias, an approach analogous to limiting a meta-analysis to studies at low risk of bias. Where possible, these and other approaches to prioritizing (or ordering) results from individual studies in text and tables should be pre-specified at the protocol stage.

Table 12.4.a Scenario 1: table ordered by study ID, data as reported by study authors

| Study | Outcome* | Midwife-led continuity | Other models of care | Effect estimate | 95% CI | Statistical test | P value |
|---|---|---|---|---|---|---|---|
| Barry 2005 | Experience of labour | 37% (246) | 32% (223) | 5% (RD) |  |  | P > 0.05 |
| Biro 2000 | Perception of care: labour/birth | 260/344 | 192/287 | 1.13 (RR) | 1.02 to 1.25 | z = 2.36 | 0.018 |
| Crowe 2010 | Experience of antenatal care (0 to 24 points) | 21.0 (5.6) 182 | 19.7 (7.3) 186 | 1.3 (MD) | –0.1 to 2.7 | t = 1.88 | 0.061 |
| Crowe 2010 | Experience of labour/birth (0 to 18 points) | 9.8 (3.1) 182 | 9.3 (3.3) 186 | 0.5 (MD) | –0.2 to 1.2 | t = 1.50 | 0.135 |
| Crowe 2010 | Experience of postpartum care (0 to 18 points) | 11.7 (2.9) 182 | 10.9 (4.2) 186 | 0.8 (MD) | 0.1 to 1.5 | t = 2.12 | 0.035 |
| Flint 1989 | Care from staff during labour | 240/275 | 208/256 | 1.07 (RR) | 1.00 to 1.16 | z = 1.89 | 0.059 |
| Frances 2000 | Communication: labour/birth |  |  | 0.90 (OR) | 0.61 to 1.33 | z = –0.52 | 0.606 |
| Harvey 1996 | Labour & Delivery Satisfaction Index (37 to 222 points) | 182 (14.2) 101 | 185 (30) 93 |  |  | t = –0.90 for MD | 0.369 for MD |
| Johns 2004 | Satisfaction with intrapartum care | 605/1163 | 363/826 | 8.1% (RD) | 3.6 to 12.5 |  | < 0.001 |
| Mac Vicar 1993 | Birth satisfaction | 849/1163 | 496/826 | 13.0% (RD) | 8.8 to 17.2 | z = 6.04 | 0.000 |
| Parr 2002 | Experience of childbirth |  |  | 0.85 (OR) | 0.39 to 1.86 | z = –0.41 | 0.685 |
| Rowley 1995 | Encouraged to ask questions |  |  | 1.02 (OR) | 0.66 to 1.58 | z = 0.09 | 0.930 |
| Turnbull 1996 | Intrapartum care rating (–2 to 2 points) | 1.2 (0.57) 35 | 0.93 (0.62) 30 |  |  |  | P > 0.05 |
| Zhang 2011 | Perception of antenatal care | 359 | 322 | 1.23 (POR) | 0.68 to 2.21 | z = 0.69 | 0.490 |
| Zhang 2011 | Perception of care: labour/birth | 355 | 320 | 1.10 (POR) | 0.91 to 1.34 | z = 0.95 | 0.341 |

Group data are as reported by each study: % (number analysed), events/total (n/N), mean (SD) number analysed, or total number analysed only.

* All scales operate in the same direction; higher scores indicate greater satisfaction. CI = confidence interval; MD = mean difference; OR = odds ratio; POR = proportional odds ratio; RD = risk difference; RR = risk ratio.

Table 12.4.b Scenario 1: intrapartum outcome table ordered by risk of bias, standardized effect estimates calculated for all studies


 

       

| Study ID / Outcome* | Intervention | Control | MD (95% CI)** | OR (95% CI)† |
| --- | --- | --- | --- | --- |
| Barry 2005 | n/N | n/N | | |
| Experience of labour | 90/246 | 72/223 | | 1.21 (0.82 to 1.79) |
| Frances 2000 | n/N | n/N | | |
| Communication: labour/birth | | | | 0.90 (0.61 to 1.34) |
| Rowley 1995 | n/N | n/N | | |
| Encouraged to ask questions [during labour/birth] | | | | 1.02 (0.66 to 1.58) |
| Biro 2000 | n/N | n/N | | |
| Perception of care: labour/birth | 260/344 | 192/287 | | 1.54 (1.08 to 2.19) |
| Crowe 2010 | Mean (SD) N | Mean (SD) N | | |
| Experience of labour/birth (0 to 18 points) | 9.8 (3.1) 182 | 9.3 (3.3) 186 | 0.5 (–0.15 to 1.15) | 1.32 (0.91 to 1.92) |
| Harvey 1996 | Mean (SD) N | Mean (SD) N | | |
| Labour & Delivery Satisfaction Index (37 to 222 points) | 182 (14.2) 101 | 185 (30) 93 | –3 (–10 to 4) | 0.79 (0.48 to 1.32) |
| Johns 2004 | n/N | n/N | | |
| Satisfaction with intrapartum care | 605/1163 | 363/826 | | 1.38 (1.15 to 1.64) |
| Parr 2002 | n/N | n/N | | |
| Experience of childbirth | | | | 0.85 (0.39 to 1.87) |
| Zhang 2011 | n/N | n/N | | |
| Perception of care: labour and birth | N = 355 | N = 320 | | POR 1.11 (0.91 to 1.34) |
| Flint 1989 | n/N | n/N | | |
| Care from staff during labour | 240/275 | 208/256 | | 1.58 (0.99 to 2.54) |
| Mac Vicar 1993 | n/N | n/N | | |
| Birth satisfaction | 849/1163 | 496/826 | | 1.80 (1.48 to 2.19) |
| Turnbull 1996 | Mean (SD) N | Mean (SD) N | | |
| Intrapartum care rating (–2 to 2 points) | 1.2 (0.57) 35 | 0.93 (0.62) 30 | 0.27 (–0.03 to 0.57) | 2.27 (0.92 to 5.59) |

* Outcomes operate in the same direction. A higher score, or an event, indicates greater satisfaction. ** Mean difference calculated for studies reporting continuous outcomes. † For binary outcomes, odds ratios were calculated from the reported summary statistics or were directly extracted from the study. For continuous outcomes, standardized mean differences were calculated and converted to odds ratios (see Chapter 6 ). CI = confidence interval; POR = proportional odds ratio.

Figure 12.4.b Forest plot depicting standardized effect estimates (odds ratios) for satisfaction


Box 12.4.b How to describe the results from this structured summary

Structured reporting of effects (no synthesis)

 

‘[Review] Table X and Figure Y present results for the 12 included studies that reported a measure of maternal satisfaction with care during labour and birth (hereafter ‘satisfaction’). Results from these studies were not synthesized for the reasons reported in the data synthesis methods. Here, we summarize results from studies providing high or moderate certainty evidence (based on GRADE) for which results from a valid measure of global satisfaction were available. Barry 2005 found a small increase in satisfaction with midwife-led care compared to obstetrician-led care (4 more women per 100 were satisfied with care; 95% CI 4 fewer to 15 more per 100 women; 469 participants, 1 study; moderate certainty evidence). Harvey 1996 found a small, possibly unimportant, decrease in satisfaction with midwife-led care compared with obstetrician-led care (3-point reduction on the 185-point LADSI scale, on which higher scores indicate greater satisfaction; 95% CI 10 points lower to 4 higher; 194 participants, 1 study; moderate certainty evidence). The remaining 10 studies reported specific aspects of satisfaction (Frances 2000, Rowley 1995, …), used tools with little or no evidence of validity and reliability (Parr 2002, …) or provided low or very low certainty evidence (Turnbull 1996, …).’

12.4.2 Overview of scenarios 2–4: synthesis approaches

We now address three scenarios in which review authors have decided that the outcomes reported in the 15 studies all broadly reflect satisfaction with care. While the measures were quite diverse, a synthesis is sought to help decision makers understand whether women and their birth partners were generally more satisfied with the care received in midwife-led continuity models compared with other models. The three scenarios differ according to the data available (see Table 12.4.c ), with each reflecting progressively less complete reporting of the effect estimates. The data available determine the synthesis method that can be applied.

  • Scenario 2: effect estimates available without measures of precision (illustrating synthesis of summary statistics).
  • Scenario 3: P values available (illustrating synthesis of P values).
  • Scenario 4: directions of effect available (illustrating synthesis using vote-counting based on direction of effect).

For studies that reported multiple satisfaction outcomes, one result is selected for synthesis using the decision rules in Box 12.4.a (point 2).

Table 12.4.c Scenarios 2, 3 and 4: available data for the selected outcome from each study

     

| Study ID | Outcome (scale details*) | Overall RoB judgement | Summary statistics: available data** | Summary statistics: stand. metric, OR (SMD) | Combining P values: available data** (2-sided P value) | Combining P values: stand. metric (1-sided P value) | Vote counting: available data** | Vote counting: stand. metric |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Continuous | | | Mean (SD) | | | | | |
| Crowe 2010 | Expectation of labour/birth (0 to 18 points) | Some concerns | Intervention 9.8 (3.1); Control 9.3 (3.3) | 1.3 (0.16) | Favours intervention, P = 0.135, N = 368 | 0.068 | NS | |
| Finn 1997 | Experience of labour/birth (0 to 24 points) | Some concerns | Intervention 21 (5.6); Control 19.7 (7.3) | 1.4 (0.20) | Favours intervention, P = 0.061, N = 351 | 0.030 | MD 1.3, NS | 1 |
| Harvey 1996 | Labour & Delivery Satisfaction Index (37 to 222 points) | Some concerns | Intervention 182 (14.2); Control 185 (30) | 0.8 (–0.13) | MD –3, P = 0.368, N = 194 | 0.816 | MD –3, NS | 0 |
| Kidman 2007 | Control during labour/birth (0 to 18 points) | High | Intervention 11.7 (2.9); Control 10.9 (4.2) | 1.5 (0.22) | MD 0.8, P = 0.035, N = 368 | 0.017 | MD 0.8 (95% CI 0.1 to 1.5) | 1 |
| Turnbull 1996 | Intrapartum care rating (–2 to 2 points) | High | Intervention 1.2 (0.57); Control 0.93 (0.62) | 2.3 (0.45) | MD 0.27, P = 0.072, N = 65 | 0.036 | MD 0.27 (95% CI –0.03 to 0.57) | 1 |
| Binary | | | | | | | | |
| Barry 2005 | Experience of labour | Low | Intervention 90/246; Control 72/223 | 1.21 | NS | | RR 1.13, NS | 1 |
| Biro 2000 | Perception of care: labour/birth | Some concerns | Intervention 260/344; Control 192/287 | 1.53 | RR 1.13, P = 0.018 | 0.009 | RR 1.13, P < 0.05 | 1 |
| Flint 1989 | Care from staff during labour | High | Intervention 240/275; Control 208/256 | 1.58 | Favours intervention, P = 0.059 | 0.029 | RR 1.07 (95% CI 1.00 to 1.16) | 1 |
| Frances 2000 | Communication: labour/birth | Low | OR 0.90 | 0.90 | Favours control, P = 0.606 | 0.697 | Favours control, NS | 0 |
| Johns 2004 | Satisfaction with intrapartum care | Some concerns | Intervention 605/1163; Control 363/826 | 1.38 | Favours intervention, P < 0.001 | 0.0005 | RD 8.1% (95% CI 3.6% to 12.5%) | 1 |
| Mac Vicar 1993 | Birth satisfaction | High | OR 1.80, P < 0.001 | 1.80 | Favours intervention, P < 0.001 | 0.0005 | RD 13.0% (95% CI 8.8% to 17.2%) | 1 |
| Parr 2002 | Experience of childbirth | Some concerns | OR 0.85 | 0.85 | OR 0.85, P = 0.685 | 0.658 | NS | |
| Rowley 1995 | Encouraged to ask questions | Low | OR 1.02, NS | 1.02 | P = 0.685 | | NS | |
| Ordinal | | | | | | | | |
| Waldenstrom 2001 | Perception of intrapartum care | Low | POR 1.23, P = 0.490 | 1.23 | POR 1.23, P = 0.490 | 0.245 | POR 1.23, NS | 1 |
| Zhang 2011 | Perception of care: labour/birth | Low | POR 1.10, P > 0.05 | 1.10 | POR 1.1, P = 0.341 | 0.170 | Favours intervention | 1 |

* All scales operate in the same direction. Higher scores indicate greater satisfaction. ** For a particular scenario, the ‘available data’ column indicates the data that were directly reported, or were calculated from the reported statistics, in terms of: effect estimate, direction of effect, confidence interval, precise P value, or statement regarding statistical significance (either statistically significant, or not). CI = confidence interval; direction = direction of effect reported or can be calculated; MD = mean difference; NS = not statistically significant; OR = odds ratio; RD = risk difference; RoB = risk of bias; RR = risk ratio; sig. = statistically significant; SMD = standardized mean difference; Stand. = standardized.

12.4.2.1 Scenario 2: summarizing effect estimates

In Scenario 2, effect estimates are available for all outcomes. However, for most studies, a measure of variance is not reported, or cannot be calculated from the available data. We illustrate how the effect estimates may be summarized using descriptive statistics. In this scenario, it is possible to calculate odds ratios for all studies. For the continuous outcomes, this involves first calculating a standardized mean difference, and then converting this to an odds ratio (Chapter 10, Section 10.6). The median odds ratio is 1.32 with an interquartile range of 1.02 to 1.53 (15 studies). Box-and-whisker plots may be used to display these results and examine informally whether the distribution of effects differs by the overall risk-of-bias assessment (Figure 12.4.a, Panel A). However, because there are relatively few effects, a reasonable alternative would be to present bubble plots (Figure 12.4.a, Panel B).
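The calculations for this scenario are easily scripted. The sketch below is a minimal illustration (not code from the Handbook): it converts continuous results to odds ratios via the standardized mean difference using the common logistic-distribution approximation, then summarizes the odds ratios with a median and interquartile range. The input values are illustrative only.

```python
# Illustrative sketch of the Scenario 2 summary: convert each study's result to
# an odds ratio, then report the median and interquartile range across studies.
import math
import statistics

def smd_from_means(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference from group means, SDs and sample sizes."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

def or_from_smd(smd):
    """Approximate conversion of an SMD to an odds ratio: ln(OR) = SMD * pi / sqrt(3)."""
    return math.exp(smd * math.pi / math.sqrt(3))

def or_from_counts(e1, n1, e2, n2):
    """Odds ratio from event counts in intervention (e1/n1) and control (e2/n2)."""
    return (e1 / (n1 - e1)) / (e2 / (n2 - e2))

# Illustrative inputs: one continuous outcome, one binary outcome, plus some
# odds ratios reported directly by study authors.
odds_ratios = [
    or_from_smd(smd_from_means(9.8, 3.1, 182, 9.3, 3.3, 186)),
    or_from_counts(260, 344, 192, 287),
    0.90, 1.02, 1.38,
]

median_or = statistics.median(odds_ratios)
q1, _, q3 = statistics.quantiles(odds_ratios, n=4)  # quartile cut points
print(f"Median OR = {median_or:.2f}, IQR {q1:.2f} to {q3:.2f} ({len(odds_ratios)} studies)")
```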

An example description of the results from the synthesis is provided in Box 12.4.c .

Box 12.4.c How to describe the results from this synthesis

Synthesis of summary statistics

 

‘The median odds ratio of satisfaction was 1.32 for midwife-led models of care compared with other models (interquartile range 1.02 to 1.53; 15 studies). Only five of the 15 effects were judged to be at a low risk of bias, and informal visual examination suggested the size of the odds ratios may be smaller in this group.’

12.4.2.2 Scenario 3: combining P values

In Scenario 3, there is minimal reporting of the data, and the type of data and statistical methods and tests vary. However, 11 of the 15 studies provide a precise P value and direction of effect, and a further two report a P value less than a threshold (<0.001) and direction. We use this scenario to illustrate a synthesis of P values. Since the reported P values are two-sided ( Table 12.4.c , column 6), they must first be converted to one-sided P values, which incorporate the direction of effect ( Table 12.4.c , column 7).

Fisher’s method for combining P values involves calculating the statistic

$$X^2 = -2\sum_{i=1}^{k}\ln(p_i)$$

where $p_i$ is the one-sided P value from study $i$ and $k$ is the number of studies contributing to the synthesis. Under the null hypothesis of no benefit in any study, $X^2$ follows a chi-squared distribution with $2k$ degrees of freedom.

The combination of P values suggests there is strong evidence of benefit of midwife-led models of care in at least one study (P < 0.001 from a chi-squared test, 13 studies). Restricting this analysis to those studies judged to be at an overall low risk of bias (a sensitivity analysis), there is no longer evidence to reject the null hypothesis of no benefit of midwife-led models of care in any study (P = 0.314, 3 studies). For the five studies reporting continuous satisfaction outcomes, sufficient data (precise P value, direction, total sample size) are reported to construct an albatross plot (Figure 12.4.a, Panel C). The location of the points relative to the standardized mean difference contours indicates that the likely effects of the intervention in these studies are small.
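A minimal sketch of this calculation follows (illustrative only, not Handbook code; scipy is assumed to be available, and the one-sided P values are those shown in Table 12.4.c, column 7).

```python
# Convert two-sided P values plus direction of effect to one-sided P values,
# then combine them with Fisher's method. Illustrative sketch only.
from math import log
from scipy import stats

def one_sided_p(two_sided_p, favours_intervention):
    """One-sided P value from a two-sided P value and the direction of effect."""
    return two_sided_p / 2 if favours_intervention else 1 - two_sided_p / 2

# One-sided P values for the 13 contributing studies (Table 12.4.c, column 7)
p_one_sided = [0.068, 0.030, 0.816, 0.017, 0.036, 0.009, 0.029, 0.697,
               0.0005, 0.0005, 0.658, 0.245, 0.170]

# Fisher's statistic X^2 = -2 * sum(ln p_i), compared with chi-squared on 2k df
x2 = -2 * sum(log(p) for p in p_one_sided)
df = 2 * len(p_one_sided)
print(f"X2 = {x2:.1f} on {df} df, combined P = {stats.chi2.sf(x2, df):.2g}")

# scipy provides the same test directly
statistic, p_value = stats.combine_pvalues(p_one_sided, method="fisher")
```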

An example description of the results from the synthesis is provided in Box 12.4.d .

Box 12.4.d How to describe the results from this synthesis

Synthesis of P values

 

‘There was strong evidence of benefit of midwife-led models of care in at least one study (P < 0.001, 13 studies). However, a sensitivity analysis restricted to studies with an overall low risk of bias suggested there was no longer evidence of an effect of midwife-led models of care in any of the trials (P = 0.314, 3 studies). Estimated standardized mean differences for five of the outcomes were small (ranging from –0.13 to 0.45) (Figure 12.4.a, Panel C).’

12.4.2.3 Scenario 4: vote counting based on direction of effect

In Scenario 4, there is minimal reporting of the data, and the type of effect measure (when used) varies across the studies (e.g. mean difference, proportional odds ratio). Of the 15 results, only five report data suitable for meta-analysis (effect estimate and measure of precision; Table 12.4.c , column 8), and no studies reported precise P values. We use this scenario to illustrate vote counting based on direction of effect. For each study, the effect is categorized as beneficial or harmful based on the direction of effect (indicated as a binary metric; Table 12.4.c , column 9).

Of the 15 studies, we exclude three because they do not provide information on the direction of effect, leaving 12 studies to contribute to the synthesis. Of these 12, 10 effects favour midwife-led models of care (83%). The probability of observing this result if midwife-led models of care are truly ineffective is 0.039 (from a binomial probability test, or equivalently, the sign test). The 95% confidence interval for the percentage of effects favouring midwife-led care is wide (55% to 95%).

The binomial test can be implemented using standard spreadsheet or statistical packages. For example, the two-sided P value from the binomial probability test presented above can be obtained in Microsoft Excel by typing =2*BINOM.DIST(2, 12, 0.5, TRUE) into any cell in the spreadsheet. The syntax requires the smaller of the ‘number of effects favouring the intervention’ or ‘the number of effects favouring the control’ (here, the smaller of these counts is 2), the number of effects (here 12), and the null value (true proportion of effects favouring the intervention = 0.5). In Stata, the bitest command could be used (e.g. bitesti 12 10 0.5).
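The same calculation can also be scripted. The sketch below (illustrative, not from the Handbook; scipy is assumed) uses an exact binomial test, with a Wilson score interval for the proportion of effects favouring the intervention, an interval that behaves well with small numbers of effects (Brown et al 2001).

```python
# Vote counting based on direction of effect: 10 of 12 effects favour the
# intervention; test against a null proportion of 0.5. Illustrative sketch.
from scipy.stats import binomtest

result = binomtest(k=10, n=12, p=0.5, alternative="two-sided")
print(f"Two-sided P = {result.pvalue:.3f}")  # approximately 0.039

# 95% Wilson score interval for the proportion of effects favouring the intervention
ci = result.proportion_ci(confidence_level=0.95, method="wilson")
print(f"Proportion = {10/12:.0%}, 95% CI {ci.low:.0%} to {ci.high:.0%}")
```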

A harvest plot can be used to display the results ( Figure 12.4.a , Panel D), with characteristics of the studies represented using different heights and shading. A sensitivity analysis might be considered, restricting the analysis to those studies judged to be at an overall low risk of bias. However, only four studies were judged to be at a low risk of bias (of which, three favoured midwife-led models of care), precluding reasonable interpretation of the count.

An example description of the results from the synthesis is provided in Box 12.4.e .

Box 12.4.e How to describe the results from this synthesis

Synthesis using vote counting based on direction of effects

 

‘There was evidence that midwife-led models of care had an effect on satisfaction, with 10 of 12 studies favouring the intervention (83% (95% CI 55% to 95%), P = 0.039) (Figure 12.4.a, Panel D). Four of the 12 studies were judged to be at a low risk of bias, and three of these favoured the intervention. The available effect estimates are presented in [review] Table X.’

Figure 12.4.a Possible graphical displays of different types of data. (A) Box-and-whisker plots of odds ratios for all outcomes and separately by overall risk of bias. (B) Bubble plot of odds ratios for all outcomes and separately by the model of care. The colours of the bubbles represent the overall risk of bias judgement (green = low risk of bias; yellow = some concerns; red = high risk of bias). (C) Albatross plot of the study sample size against P values (for the five continuous outcomes in Table 12.4.c , column 6). The effect contours represent standardized mean differences. (D) Harvest plot (height depicts overall risk of bias judgement (tall = low risk of bias; medium = some concerns; short = high risk of bias), shading depicts model of care (light grey = caseload; dark grey = team), alphabet characters represent the studies)


12.5 Chapter information

Authors: Joanne E McKenzie, Sue E Brennan

Acknowledgements: Sections of this chapter build on chapter 9 of version 5.1 of the Handbook , with editors Jonathan J Deeks, Julian PT Higgins and Douglas G Altman.

We are grateful to the following for commenting helpfully on earlier drafts: Miranda Cumpston, Jamie Hartmann-Boyce, Tianjing Li, Rebecca Ryan and Hilary Thomson.

Funding: JEM is supported by an Australian National Health and Medical Research Council (NHMRC) Career Development Fellowship (1143429). SEB’s position is supported by the NHMRC Cochrane Collaboration Funding Program.

12.6 References

Achana F, Hubbard S, Sutton A, Kendrick D, Cooper N. An exploration of synthesis methods in public health evaluations of interventions concludes that the use of modern statistical methods would be beneficial. Journal of Clinical Epidemiology 2014; 67 : 376–390.

Becker BJ. Combining significance levels. In: Cooper H, Hedges LV, editors. A handbook of research synthesis . New York (NY): Russell Sage; 1994. p. 215–235.

Boonyasai RT, Windish DM, Chakraborti C, Feldman LS, Rubin HR, Bass EB. Effectiveness of teaching quality improvement to clinicians: a systematic review. JAMA 2007; 298 : 1023–1037.

Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. Meta-Analysis methods based on direction and p-values. Introduction to Meta-Analysis . Chichester (UK): John Wiley & Sons, Ltd; 2009. pp. 325–330.

Brown LD, Cai TT, DasGupta A. Interval estimation for a binomial proportion. Statistical Science 2001; 16 : 101–117.

Bushman BJ, Wang MC. Vote-counting procedures in meta-analysis. In: Cooper H, Hedges LV, Valentine JC, editors. Handbook of Research Synthesis and Meta-Analysis . 2nd ed. New York (NY): Russell Sage Foundation; 2009. p. 207–220.

Crowther M, Avenell A, MacLennan G, Mowatt G. A further use for the Harvest plot: a novel method for the presentation of data synthesis. Research Synthesis Methods 2011; 2 : 79–83.

Friedman L. Why vote-count reviews don’t count. Biological Psychiatry 2001; 49 : 161–162.

Grimshaw J, McAuley LM, Bero LA, Grilli R, Oxman AD, Ramsay C, Vale L, Zwarenstein M. Systematic reviews of the effectiveness of quality improvement strategies and programmes. Quality and Safety in Health Care 2003; 12 : 298–303.

Harrison S, Jones HE, Martin RM, Lewis SJ, Higgins JPT. The albatross plot: a novel graphical tool for presenting results of diversely reported studies in a systematic review. Research Synthesis Methods 2017; 8 : 281–289.

Hedges L, Vevea J. Fixed- and random-effects models in meta-analysis. Psychological Methods 1998; 3 : 486–504.

Ioannidis JP, Patsopoulos NA, Rothstein HR. Reasons or excuses for avoiding meta-analysis in forest plots. BMJ 2008; 336 : 1413–1415.

Ivers N, Jamtvedt G, Flottorp S, Young JM, Odgaard-Jensen J, French SD, O’Brien MA, Johansen M, Grimshaw J, Oxman AD. Audit and feedback: effects on professional practice and healthcare outcomes. Cochrane Database of Systematic Reviews 2012; 6 : CD000259.

Jones DR. Meta-analysis: weighing the evidence. Statistics in Medicine 1995; 14 : 137–149.

Loughin TM. A systematic comparison of methods for combining p-values from independent tests. Computational Statistics & Data Analysis 2004; 47 : 467–485.

McGill R, Tukey JW, Larsen WA. Variations of box plots. The American Statistician 1978; 32 : 12–16.

McKenzie JE, Brennan SE. Complex reviews: methods and considerations for summarising and synthesising results in systematic reviews with complexity. Report to the Australian National Health and Medical Research Council. 2014.

O’Brien MA, Rogers S, Jamtvedt G, Oxman AD, Odgaard-Jensen J, Kristoffersen DT, Forsetlund L, Bainbridge D, Freemantle N, Davis DA, Haynes RB, Harvey EL. Educational outreach visits: effects on professional practice and health care outcomes. Cochrane Database of Systematic Reviews 2007; 4 : CD000409.

Ogilvie D, Fayter D, Petticrew M, Sowden A, Thomas S, Whitehead M, Worthy G. The harvest plot: a method for synthesising evidence about the differential effects of interventions. BMC Medical Research Methodology 2008; 8 : 8.

Riley RD, Higgins JP, Deeks JJ. Interpretation of random effects meta-analyses. BMJ 2011; 342 : d549.

Schriger DL, Sinha R, Schroter S, Liu PY, Altman DG. From submission to publication: a retrospective review of the tables and figures in a cohort of randomized controlled trials submitted to the British Medical Journal. Annals of Emergency Medicine 2006; 48 : 750–756, 756 e751–721.

Schriger DL, Altman DG, Vetter JA, Heafner T, Moher D. Forest plots in reports of systematic reviews: a cross-sectional study reviewing current practice. International Journal of Epidemiology 2010; 39 : 421–429.

ter Wee MM, Lems WF, Usan H, Gulpen A, Boonen A. The effect of biological agents on work participation in rheumatoid arthritis patients: a systematic review. Annals of the Rheumatic Diseases 2012; 71 : 161–171.

Thomson HJ, Thomas S. The effect direction plot: visual display of non-standardised effects across multiple outcome domains. Research Synthesis Methods 2013; 4 : 95–101.

Thornicroft G, Mehta N, Clement S, Evans-Lacko S, Doherty M, Rose D, Koschorke M, Shidhaye R, O’Reilly C, Henderson C. Evidence for effective interventions to reduce mental-health-related stigma and discrimination. Lancet 2016; 387 : 1123–1132.

Valentine JC, Pigott TD, Rothstein HR. How many studies do you need?: a primer on statistical power for meta-analysis. Journal of Educational and Behavioral Statistics 2010; 35 : 215–247.


Synthesis and systematic maps


Types of synthesis


Synthesis is the process of combining the findings of research studies; the term also refers to the product of that process. The output may be a written narrative, a table, or graphical plots, including a statistical meta-analysis. The process of combining studies and the way the output is reported vary according to the research question of the review.

In primary research there are many research questions and many different methods to address them. The same is true of systematic reviews. Two common and distinct types of review are those asking about the evidence of impact (effectiveness) of an intervention and those asking about ways of understanding a social phenomenon.

If a systematic review question is about the effectiveness of an intervention, then the included studies are likely to be experimental studies that test whether an intervention is effective or not. These studies report evidence of the relative effect of an intervention compared to control conditions.

A synthesis of these types of studies aggregates the findings of the studies. This produces an overall measure of the effect of the intervention (after taking into account the sample sizes of the studies), as illustrated in the sketch below. This is a type of quantitative synthesis that tests a hypothesis (that an intervention is effective), and the review methods are described in advance (using a deductive, a priori paradigm).
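As a minimal illustration of this aggregation (a sketch only, with made-up numbers, not a method prescribed by this guide), a fixed-effect meta-analysis weights each study's effect estimate by the inverse of its variance, so that larger, more precise studies contribute more to the pooled result.

```python
# Fixed-effect, inverse-variance weighted average of study effect estimates,
# here on the log odds ratio scale. Values are hypothetical and illustrative.
import math

# (log odds ratio, standard error) pairs for three hypothetical studies
studies = [(math.log(1.30), 0.20), (math.log(1.10), 0.10), (math.log(0.95), 0.25)]

weights = [1 / se**2 for _, se in studies]               # inverse-variance weights
pooled_log_or = sum(w * est for (est, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

lo, hi = pooled_log_or - 1.96 * pooled_se, pooled_log_or + 1.96 * pooled_se
print(f"Pooled OR = {math.exp(pooled_log_or):.2f} "
      f"(95% CI {math.exp(lo):.2f} to {math.exp(hi):.2f})")
```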

  • Ongoing developments in meta-analytic and quantitative synthesis methods: Broadening the types of research questions that can be addressed O'Mara-Eves, A. and Thomas, J. (2016). This paper discusses different types of quantitative synthesis in education research.

If a systematic review question is about ways of understanding a social phenomenon, it iteratively analyses the findings of studies to develop overarching concepts, theories or themes. The included studies are likely to provide theories, concepts or insights about the phenomenon. This might, for example, be studies trying to explain why patients do not always take the medicines provided to them by doctors.

A synthesis of these types of studies is an arrangement or configuration of the concepts from individual studies. It provides overall ‘meta’ concepts to help understand the phenomenon under study. This type of qualitative or conceptual synthesis is more exploratory, and some of the detailed methods may develop during the process of the review (using an inductive, iterative paradigm).

  • Methods for the synthesis of qualitative research: a critical review ​Barnett-Page and Thomas, (2009). This paper summarises some of the different approaches to qualitative synthesis.

There are also multi-component reviews that ask a broad question with sub-questions addressed using different review methods.

  • Teenage pregnancy and social disadvantage: systematic review integrating controlled trials and qualitative studies. Harden et al (2009). An example of a review that combines two types of synthesis. It develops: 1) a statistical meta-analysis of controlled trials on interventions for early parenthood; and 2) a thematic synthesis of qualitative studies of young people's views of early parenthood.

Systematic evidence maps

Systematic evidence maps are products that describe the nature of research in an area. This is in contrast to a synthesis, which uses research findings to make a statement about an evidence base. A 'systematic map' can both describe what has been studied and indicate what has not been studied, showing where there are gaps in the research (gap maps). Maps can be useful for comparing trends and differences across sets of studies.

Systematic maps can be a standalone finished product of research, without a synthesis, or may be a component of a systematic review that will synthesise studies.

A systematic map can help to plan a synthesis. It may be that the map shows that the studies to be synthesised are very different from each other, and it may be more appropriate to use a subset of the studies. Where a subset of studies is used in the synthesis, the review question and the boundaries of the review will need to be narrowed in order to provide a rigorous approach for selecting the subset of studies from the map. The studies in the map that are not synthesised can help with interpreting the synthesis and drawing conclusions. Please note that, confusingly, the term 'scoping review' is sometimes used to describe systematic evidence maps and at other times to refer to reviews that are quick, selective scopes of the nature and size of the literature in an area.

A systematic map may be published in different formats, such as a written report or database. Increasingly, maps are published as databases with interactive visualisations to enable the user to investigate and visualise different parts of the map. Living systematic maps are regularly updated so the evidence stays current.

Some examples of different maps are shown here:

  • Women in Wage Labour: An evidence map of what works to increase female wage labour market participation in LMICs. Example of a systematic evidence map from the Africa Centre for Evidence.
  • Acceptability and uptake of vaccines: Rapid map of systematic reviews Example of a map of systematic reviews.
  • COVID-19: a living systematic map of the evidence Example of a living map of health research on COVID-19.

Meta-analysis

  • What is a meta-analysis? Helpful resource from the University of Nottingham.
  • MetaLight: software for teaching and learning meta-analysis Software tool that can help in learning about meta-analysis.
  • KTDRR Research Evidence Training: An Overview of Effect Sizes and Meta-analysis Webcast video (56 mins). Overview of effect sizes and meta-analysis.


How to Do a Systematic Review: A Best Practice Guide for Conducting and Reporting Narrative Reviews, Meta-Analyses, and Meta-Syntheses

  • PMID: 30089228
  • DOI: 10.1146/annurev-psych-010418-102803

Systematic reviews are characterized by a methodical and replicable methodology and presentation. They involve a comprehensive search to locate all relevant published and unpublished work on a subject; a systematic integration of search results; and a critique of the extent, nature, and quality of evidence in relation to a particular research question. The best reviews synthesize studies to draw broad theoretical conclusions about what a literature means, linking theory to evidence and evidence to theory. This guide describes how to plan, conduct, organize, and present a systematic review of quantitative (meta-analysis) or qualitative (narrative review, meta-synthesis) information. We outline core standards and principles and describe commonly encountered problems. Although this guide targets psychological scientists, its high level of abstraction makes it potentially relevant to any subject area or discipline. We argue that systematic reviews are a key methodology for clarifying whether and how research findings replicate and for explaining possible inconsistencies, and we call for researchers to conduct systematic reviews to help elucidate whether there is a replication crisis.

Keywords: evidence; guide; meta-analysis; meta-synthesis; narrative; systematic review; theory.



Synthesis without meta-analysis (SWiM) in systematic reviews: reporting guideline

  • Mhairi Campbell 1 ,
  • Joanne E McKenzie , associate professor 2 ,
  • Amanda Sowden , professor 3 ,
  • Srinivasa Vittal Katikireddi , clinical senior research fellow 1 ,
  • Sue E Brennan , research fellow 2 ,
  • Simon Ellis , associate director 4 ,
  • Jamie Hartmann-Boyce , senior researcher 5 ,
  • Rebecca Ryan , senior research fellow 6 ,
  • Sasha Shepperd , professor 7 ,
  • James Thomas , professor 8 ,
  • Vivian Welch , associate professor 9 ,
  • Hilary Thomson , senior research fellow 1
  • 1 MRC/CSO Social and Public Health Sciences Unit, University of Glasgow, UK
  • 2 School of Public Health and Preventive Medicine, Monash University, Melbourne, Australia
  • 3 Centre for Reviews and Dissemination, University of York, York, UK
  • 4 Centre for Guidelines, National Institute for Health and Care Excellence, London, UK
  • 5 Nuffield Department of Primary Care Health Sciences, University of Oxford, Oxford, UK
  • 6 School of Psychology and Public Health, La Trobe University, Melbourne, Australia
  • 7 Nuffield Department of Population Health, University of Oxford, Oxford, UK
  • 8 Evidence for Policy and Practice Information and Coordinating Centre, University College London, London, UK
  • 9 Bruyere Research Institute, Ottawa, Canada
  • Correspondence to: M Campbell Mhairi.Campbell{at}glasgow.ac.uk
  • Accepted 8 October 2019

In systematic reviews that lack data amenable to meta-analysis, alternative synthesis methods are commonly used, but these methods are rarely reported. This lack of transparency in the methods can cast doubt on the validity of the review findings. The Synthesis Without Meta-analysis (SWiM) guideline has been developed to guide clear reporting in reviews of interventions in which alternative synthesis methods to meta-analysis of effect estimates are used. This article describes the development of the SWiM guideline for the synthesis of quantitative data of intervention effects and presents the nine SWiM reporting items with accompanying explanations and examples.

Summary points

Systematic reviews of health related interventions often use alternative methods of synthesis to meta-analysis of effect estimates, methods often described as “narrative synthesis”

Serious shortcomings in reviews that use “narrative synthesis” have been identified, including a lack of description of the methods used; unclear links between the included data, the synthesis, and the conclusions; and inadequate reporting of the limitations of the synthesis

The Synthesis Without Meta-analysis (SWiM) guideline is a nine item checklist to promote transparent reporting for reviews of interventions that use alternative synthesis methods

The SWiM items prompt users to report how studies are grouped, the standardised metric used for the synthesis, the synthesis method, how data are presented, a summary of the synthesis findings, and limitations of the synthesis

The SWiM guideline has been developed using a best practice approach, involving extensive consultation and formal consensus

Decision makers consider systematic reviews to be an essential source of evidence. 1 Complete and transparent reporting of the methods and results of reviews allows users to assess the validity of review findings. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA; http://www.prisma-statement.org/ ) statement, consisting of a 27 item checklist, was developed to facilitate improved reporting of systematic reviews. 2 Extensions are available for different approaches to conducting reviews (for example, scoping reviews 3 ), reviews with a particular focus (for example, harms 4 ), and reviews that use specific methods (for example, network meta-analysis 5 ). However, PRISMA provides limited guidance on reporting certain aspects of the review, such as the methods for presentation and synthesis, and no reporting guideline exists for synthesis without meta-analysis of effect estimates. We estimate that 32% of health related systematic reviews of interventions do not do meta-analysis, 6 7 8 instead using alternative approaches to synthesis that typically rely on textual description of effects and are often referred to as narrative synthesis. 9 Recent work highlights serious shortcomings in the reporting of narrative synthesis, including a lack of description of the methods used, lack of transparent links between study level data and the text reporting the synthesis and its conclusions, and inadequate reporting of the limitations of the synthesis. 7 This suggests widespread lack of familiarity and misunderstanding around the requirements for transparent reporting of synthesis when meta-analysis is not used and indicates the need for a reporting guideline.

Scope of SWiM reporting guideline

This paper presents the Synthesis Without Meta-analysis (SWiM) reporting guideline. The SWiM guideline is intended for use in systematic reviews examining the quantitative effects of interventions for which meta-analysis of effect estimates is not possible, or not appropriate, for at least some outcomes. 10 Such situations may arise when effect estimates are incompletely reported or because characteristics of studies (such as study designs, intervention types, or outcomes) are too diverse to yield a meaningful summary estimate of effect. 11 In these reviews, alternative presentation and synthesis methods may be adopted (for example, calculating summary statistics of intervention effect estimates, vote counting based on direction of effect, and combining P values), and SWiM provides guidance for reporting these methods and results. 11 Specifically, the SWiM guideline expands guidance on “synthesis of results” items currently available, such as PRISMA (items 14 and 21) and RAMESES (items 11, 14, and 15). 2 12 13 SWiM covers reporting of the key features of synthesis including how studies are grouped, synthesis methods used, presentation of data and summary text, and limitations of the synthesis.

SWiM is not intended for use in reviews that synthesise qualitative data, for which reporting guidelines are already available, including ENTREQ for qualitative evidence synthesis and eMERGe for meta-ethnography. 14 15

Development of SWiM reporting guideline

A protocol for the project is available, 10 and the guideline development was registered with the EQUATOR Network, after confirmation that no similar guideline was in development. All of the SWiM project team are experienced systematic reviewers, and one was a co-author on guidance on the conduct of narrative synthesis (AS). 9 A project advisory group was convened to provide greater diversity in expertise. The project advisory group included representatives from collaborating Cochrane review groups, the Campbell Collaboration, and the UK National Institute for Health and Care Excellence (see supplementary file 1).

The project was informed by recommendations for developing guidelines for reporting of health research. 16 We assessed current practice in reporting synthesis of effect estimates without meta-analysis and used the findings to devise an initial checklist of reporting items in consultation with the project advisory group. We invited 91 people, all systematic review methodologists or authors of reviews that synthesised results from studies without using meta-analysis, to participate in a three round Delphi exercise, with a response rate of 48% (n=44/91) in round one, 54% (n=37/68) in round two, and 82% (n=32/39) in round three. The results were discussed at a consensus meeting of an expert panel (the project advisory group plus one additional methodological expert) (see supplementary file 1). After the meeting, we piloted the revised guideline to assess ease of use and face validity. Eight systematic reviewers with varying levels of experience, who had not been involved in the Delphi exercise, were asked to read and apply the guideline. We conducted short interviews with the pilot participants to identify any clarification needed in the items or their explanations. We subsequently revised the items and circulated them for comment among the expert panel, before finalising them. Full methodological details of the SWiM guideline development process are provided in supplementary file 1.

Synthesis without meta-analysis reporting items

We identified nine items to guide the reporting of synthesis without meta-analysis. Table 1 shows these SWiM reporting items. An online version is available at www.equator-network.org/reporting-guidelines . An explanation and elaboration for each of the reporting items is provided below. Examples to illustrate the reporting items and explanations are provided in supplementary file 2.

Table 1 | Synthesis Without Meta-analysis (SWiM) items: SWiM is intended to complement and be used as an extension to PRISMA

Item 1: grouping studies for synthesis

1a) Description

Provide a description of, and rationale for, the groups used in the synthesis (for example, groupings of interventions, population, outcomes, study design).

1a) Explanation

Methodological and clinical or conceptual diversity may occur (for example, owing to inclusion of diverse study designs, outcomes, interventions, contexts, populations), and it is necessary to clearly report how these study characteristics are grouped for the synthesis, along with the rationale for the groups (see Cochrane Handbook Chapter 3 17 ). Although reporting the grouping of study characteristics in all reviews is important, it is particularly important in reviews without meta-analysis, as the groupings may be less evident than when meta-analysis is used.

Providing the rationale, or theory of change, for how the intervention is expected to work and affect the outcome(s) will inform authors’ and review users’ decisions about the appropriateness and usefulness of the groupings. A diagram, or logic model, 18 19 can be used to visually articulate the underlying theory of change used in the review. If the theory of change for the intervention is provided in full elsewhere (for example, in the protocol), this should be referenced. In Cochrane reviews, the rationale for the groups can be outlined in the section “How the intervention is expected to work.”

1b) Description

Detail and provide rationale for any changes made subsequent to the protocol in the groups used in the synthesis.

1b) Explanation

Decisions about the planned groups for the syntheses may need to be changed following study selection and data extraction. This may occur as a result of important variations in the population, intervention, comparison, and/or outcomes identified after the data are collected, or where limited data are available for the pre-specified groupings, and the groupings may need to be modified to facilitate synthesis (Cochrane Handbook Chapter 2 20 ). Reporting changes to the planned groups, and the reason(s) for these, is important for transparency, as this allows readers to assess whether the changes may have been influenced by study findings. Furthermore, grouping at a broader level of (any or multiple) intervention, population, or outcome will have implications for the interpretation of the synthesis findings (see item 8).

Item 2: describe the standardised metric and transformation method used

Description.

Describe the standardised metric for each outcome. Explain why the metric(s) was chosen, and describe any methods used to transform the intervention effects, as reported in the study, to the standardised metric, citing any methodological guidance used.

Explanation

The term “standardised metric” refers to the metric that is used to present intervention effects across the studies for the purpose of synthesis or interpretation, or both. Examples of standardised metrics include measures of intervention effect (for example, risk ratios, odds ratios, risk differences, mean differences, standardised mean differences, ratio of means), direction of effect, or P values. An example of a statistical method to convert an odds ratio to a standardised mean difference is that proposed by Chinn (2000). 21 For other methods and metrics, see Cochrane Handbook Chapter 6. 22
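As a concrete illustration of such a transformation (a sketch of the commonly cited approximation, not a SWiM requirement), Chinn's method rescales the log odds ratio by the standard deviation of the logistic distribution:

$$\ln(\text{OR}) \approx \frac{\pi}{\sqrt{3}}\,\text{SMD} \approx 1.81 \times \text{SMD}, \qquad \text{SMD} \approx \frac{\sqrt{3}}{\pi}\,\ln(\text{OR})$$

so, for example, an odds ratio of 1.5 corresponds to a standardised mean difference of roughly ln(1.5)/1.81 ≈ 0.22.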

Item 3: describe the synthesis methods

Describe and justify the methods used to synthesise the effects for each outcome when it was not possible to undertake a meta-analysis of effect estimates.

For various reasons, it may not be possible to do a meta-analysis of effect estimates. In these circumstances, other synthesis methods need to be considered and specified. Examples include combining P values, calculating summary statistics of intervention effect estimates (for example, median, interquartile range) or vote counting based on direction of effect. See table 2 for a summary of possible synthesis methods (for further details, see McKenzie and Brennan 2019 11 ). Justification should be provided for the chosen synthesis method.

Table 2 | Questions answered according to types of synthesis methods and types of data used

Item 4: criteria used to prioritise results for summary and synthesis

Where applicable, provide the criteria used, with supporting justification, to select particular studies, or a particular study, for the main synthesis or to draw conclusions from the synthesis (for example, based on study design, risk of bias assessments, directness in relation to the review question).

Criteria may be used to prioritise the reporting of some study findings over others or to restrict the synthesis to a subset of studies. Examples of criteria include the type of study design (for example, only randomised trials), risk of bias assessment (for example, only studies at a low risk of bias), sample size, the relevance of the evidence (outcome, population/context, or intervention) pertaining to the review question, or the certainty of the evidence. Pre-specification of these criteria provides transparency as to why certain studies are prioritised and limits the risk of selective reporting of study findings.

Item 5: investigation of heterogeneity in reported effects

State the method(s) used to examine heterogeneity in reported effects when it is not possible to do a meta-analysis of effect estimates and its extensions to investigate heterogeneity.

Informal methods to investigate heterogeneity in the findings may be considered when a formal statistical investigation using methods such as subgroup analysis and meta-regression is not possible. Informal methods could involve ordering tables or structuring figures by hypothesised modifiers such as methodological characteristics (for example, study design), subpopulations (for example, sex, age), intervention components, and/or contextual/setting factors (see Cochrane Handbook Chapter 12 11 ). The methods used and justification for the chosen methods should be reported. Investigations of heterogeneity should be limited, as they are rarely definitive; this is more likely to be the case when informal methods are used. It should also be noted if the investigation of heterogeneity was not pre-specified.

Item 6: certainty of evidence

Describe the methods used to assess the certainty of the synthesis findings.

The assessment of the certainty of the evidence should aim to take into consideration the precision of the synthesis finding (confidence interval if available), the number of studies and participants, the consistency of effects across studies, the risk of bias of the studies, how directly the included studies address the planned question (directness), and the risk of publication bias. GRADE (Grading of Recommendations, Assessment, Development and Evaluations) is the most widely used framework for assessing certainty (Cochrane Handbook Chapter 14 23 ). However, depending on the synthesis method used, assessing some domains (for example, consistency of effects when vote counting is undertaken) may be difficult.

Item 7: data presentation methods

Describe the graphical and tabular methods used to present the effects (for example, tables, forest plots, harvest plots).

Specify key study characteristics (for example, study design, risk of bias) used to order the studies, in the text and any tables or graphs, clearly referencing the studies included.

Study findings presented in tables or graphs should be ordered in the same way as the syntheses are reported in the narrative text to facilitate the comparison of findings from each included study. Key characteristics, such as study design, sample size, and risk of bias, which may affect interpretation of the data, should also be presented. Examples of visual displays include forest plots, 24 harvest plots, 25 effect direction plots, 26 albatross plots, 27 bubble plots, 28 and box and whisker plots. 29 McKenzie and Brennan (2019) provide a description of these plots, when they should be used, and their pros and cons. 11

Item 8: reporting results

For each comparison and outcome, provide a description of the synthesised findings and the certainty of the findings. Describe the result in language that is consistent with the question the synthesis addresses and indicate which studies contribute to the synthesis.

For each comparison and outcome, a description of the synthesis findings should be provided, making clear which studies contribute to each synthesis (for example, listed in the text or tabulated). In describing these findings, authors should be clear about the nature of the question(s) addressed (see table 2, column 1), the metric and synthesis method used, the number of studies and participants, and the key characteristics of the included studies (population/settings, interventions, outcomes). When possible, the synthesis finding should be accompanied by a confidence interval. An assessment of the certainty of the effect should be reported.

Results of any investigation of heterogeneity should be described, noting if it was not pre-planned and avoiding over-interpretation of the findings.

If a pre-specified logic model was used, authors may report any changes made to the logic model during the review or as a result of the review findings. 30

Item 9: limitations of the synthesis

Report the limitations of the synthesis methods used and/or the groupings used in the synthesis and how these affect the conclusions that can be drawn in relation to the original review question.

When reporting limitations of the synthesis, factors to consider are the standardised metric(s) used, the synthesis method used, and any reconfiguration of the groups used to structure the synthesis (comparison, intervention, population, outcome).

The choice of metric and synthesis method will affect the question addressed (see table 2 ). For example, if the standardised metric is direction of effect, and vote counting is used, the question will ask “is there any evidence of an effect?” rather than “what is the average intervention effect?” had a random effects meta-analysis been used.

Limitations of the synthesis might arise from post-protocol changes in how the synthesis was structured and the synthesis method selected. These changes may occur because of limited evidence, or incompletely reported outcome or effect estimates, or if different effect measures are used across the included studies. These limitations may affect the ability of the synthesis to answer the planned review question—for example, when a meta-analysis of effect estimates was planned but was not possible.

The SWiM reporting guideline is intended to facilitate transparent reporting of the synthesis of effect estimates when meta-analysis is not used. The guideline relates specifically to transparently reporting synthesis and presentation methods and results, and it is likely to be of greatest relevance to reviews that incorporate diverse sources of data that are not amenable to meta-analysis. The SWiM guideline should be used in conjunction with other reporting guidelines that cover other aspects of the conduct of reviews, such as PRISMA. 31 We intend SWiM to be a resource for authors of reviews and to support journal editors and readers in assessing the conduct of a review and the validity of its findings.

The SWiM reporting items are intended to cover aspects of presentation and synthesis of study findings that are often left unreported when methods other than meta-analysis have been used. 7 These include reporting of the synthesis structure and comparison groupings (items 1, 4, 5, and 6), the standardised metric used for the synthesis (item 2), the synthesis method (items 3 and 9), presentation of data (item 7), and a summary of the synthesis findings that is clearly linked to supporting data (item 8). Although the SWiM items have been developed specifically for the many reviews that do not include meta-analysis, SWiM promotes the core principles needed for transparent reporting of all synthesis methods including meta-analysis. Therefore, the SWiM items are relevant when reporting synthesis of quantitative effect data regardless of the method used.

Reporting guidelines are sometimes interpreted as providing guidance on conduct or used to assess the quality of a study or review; this is not an appropriate application of a reporting guideline, and SWiM should not be used to guide the conduct of the synthesis. For guidance on how to conduct synthesis using the methods referred to in SWiM, we direct readers to the second edition of the Cochrane Handbook for Systematic Reviews of Interventions, specifically chapter 12. 11 Although an overlap inevitably exists between reporting and conduct, the SWiM reporting guideline is not intended to be prescriptive about choice of methods, and the level of detail for each item should be appropriate. For example, investigation of heterogeneity (item 5) may not always be necessary or useful. In relation to SWiM, we anticipate that the forthcoming update of PRISMA will include new items covering a broader range of synthesis methods, 32 but it will not provide detailed guidance and examples on synthesis without meta-analysis.

The SWiM reporting guideline emerged from a project aiming to improve the transparency and conduct of narrative synthesis (ICONS-Quant: Improving the CONduct and reporting of Narrative Synthesis). 10 Avoidance of the term “narrative synthesis” in SWiM is a deliberate move to promote clarity in the methods used in reviews in which the synthesis does not rely on meta-analysis. The use of narrative is ubiquitous across all research and can serve a valuable purpose in the development of a coherent story from diverse data. 33 34 However, within the field of evidence synthesis, narrative approaches to synthesis of quantitative effect estimates are characterised by a lack of transparency, making assessment of the validity of their findings difficult. 7 Together with the recently published guidance on conduct of alternative methods of synthesis, 11 the SWiM guideline aims to improve the transparency of, and subsequently trust in, the many reviews that synthesise quantitative data without meta-analysis, particularly for reviews of intervention effects.

Acknowledgments

We thank the participants of the Delphi survey and colleagues who informally piloted the guideline.

Contributors: All authors contributed to the development of SWiM. HT had the idea for the study. HT, SVK, AS, JEM, and MC designed the study methods. JT, JHB, RR, SB, SE, SS, and VW contributed to the consensus meeting and finalising the guideline items. MC prepared the first draft of the manuscript, and all authors critically reviewed and approved the final manuscript. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted. HT is the guarantor.

Project advisory group members: Simon Ellis, Jamie Hartmann-Boyce, Mark Petticrew, Rebecca Ryan, Sasha Shepperd, James Thomas, Vivian Welch.

Expert panel members: Sue Brennan, Simon Ellis, Jamie Hartmann-Boyce, Rebecca Ryan, Sasha Shepperd, James Thomas, Vivian Welch.

Funding: This project was supported by funds provided by the Cochrane Methods Innovation Fund. MC, HT, and SVK receive funding from the UK Medical Research Council (MC_UU_12017-13 and MC_UU_12017-15) and the Scottish Government Chief Scientist Office (SPHSU13 and SPHSU15). SVK is supported by an NHS Research Scotland senior clinical fellowship (SCAF/15/02). JEM is supported by an NHMRC career development fellowship (1143429). RR’s position is funded by the NHMRC Cochrane Collaboration Funding Program (2017-2010). The views expressed in this article are those of the authors and not necessarily those of their employer/host organisations or of Cochrane or its registered entities, committees, or working groups.

Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf and declare: funding for the project as described above; HT is co-ordinating editor for Cochrane Public Health; SVK, SE, JHB, RR, and SS are Cochrane editors; JEM is co-convenor of the Cochrane Statistical Methods Group; JT is a senior editor of the second edition of the Cochrane Handbook; VW is editor in chief of the Campbell Collaboration and an associate scientific editor of the second edition of the Cochrane Handbook; SB is a research fellow at Cochrane Australia; no other relationships or activities that could appear to have influenced the submitted work.

Ethical approval: Ethical approval was obtained from the University of Glasgow College of Social Sciences Ethics Committee (reference number 400170060).

Transparency: The lead author affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned (and, if relevant, registered) have been explained.

Patient and public involvement: This research was done without patient involvement. Patients were not invited to comment on the study design and were not consulted to develop outcomes or interpret the results.

Dissemination to participants and related patient and public communities: The authors plan to disseminate the research through peer reviewed publications, national and international conferences, webinars, and an online training module and by establishing an email discussion group.

This is an Open Access article distributed in accordance with the terms of the Creative Commons Attribution (CC BY 4.0) license, which permits others to distribute, remix, adapt and build upon this work, for commercial use, provided the original work is properly cited. See: http://creativecommons.org/licenses/by/4.0/ .

Systematic reviews & evidence synthesis methods


Data Extraction

Whether you plan to perform a meta-analysis or not, you will need to establish a regimented approach to extracting data. Researchers often use a form or table to capture the data they will then summarize or analyze. The amount and types of data you collect, as well as the number of collaborators who will be extracting it, will dictate which extraction tools are best for your project. Programs like Excel or Google Sheets may be the best option for smaller or more straightforward projects, while systematic review software platforms can provide more robust support for larger or more complicated data.

It is recommended that you pilot your data extraction tool, especially if you will code your data, to determine if fields should be added or clarified, or if the review team needs guidance in collecting and coding data.
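As a minimal illustration of piloting an extraction form (a sketch only, with hypothetical field names and values, and assuming the pandas library is available), a short script can define the agreed fields, record a few pilot extractions, and report which fields are frequently left blank so the team can refine its coding instructions:

```python
import pandas as pd

# Hypothetical extraction template: one row per included study
FIELDS = [
    "study_id", "first_author", "year", "country", "design",
    "population", "intervention", "comparator", "outcome",
    "effect_estimate", "ci_lower", "ci_upper", "rob_rating", "notes",
]

def new_extraction_form() -> pd.DataFrame:
    """Return an empty extraction table with the agreed fields."""
    return pd.DataFrame(columns=FIELDS)

def pilot_check(form: pd.DataFrame) -> pd.Series:
    """During piloting, count how often each field was left blank so the team
    can clarify coding instructions, add fields, or drop unused ones."""
    return form.isna().sum().sort_values(ascending=False)

# Two hypothetical pilot extractions
form = new_extraction_form()
form.loc[0] = ["S01", "Smith", 2019, "UK", "RCT", "adults", "diet A", "diet B",
               "LDL change", -0.30, -0.55, -0.05, "low", None]
form.loc[1] = ["S02", "Lee", 2021, "US", "cohort", "adults", "diet A", None,
               "LDL change", None, None, None, "moderate", "effect not reported"]
print(pilot_check(form))
form.to_csv("extraction_pilot.csv", index=False)  # share the pilot results with the review team
```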

Data Extraction Tools

  • Excel: the most basic tool for managing the screening and data extraction stages of the systematic review process. Customized workbooks and spreadsheets can be designed for the review process.
  • Covidence: a software platform built specifically for managing each step of a systematic review project, including data extraction. Read more about how Covidence can help you customize extraction tables and export your extracted data.
  • RevMan: free software used to manage Cochrane reviews. For more information on RevMan, including an explanation of how it may be used to extract and analyze data, watch Introduction to RevMan - a guided tour.
  • SRDR (Systematic Review Data Repository): a web-based tool for the extraction and management of data for systematic review or meta-analysis. It is also an open and searchable archive of systematic reviews and their data. Access the help page for more information.
  • DistillerSR: a systematic review management software program, similar to Covidence. It guides reviewers in creating project-specific forms, extracting, and analyzing data.
  • JBI Sumari (the Joanna Briggs Institute System for the Unified Management, Assessment and Review of Information): a systematic review software platform geared toward fields such as health, social sciences, and humanities. Among the other steps of a review project, it facilitates data extraction and data synthesis. View their short introductions to data extraction and analysis for more information.
  • The Systematic Review Toolbox: a community-driven, searchable, web-based catalogue of tools that support the systematic review process across multiple domains. Use the advanced search option to restrict to tools specific to data extraction.

Additional Information

These resources offer additional information and examples of data extraction forms:​

Brown, S. A., Upchurch, S. L., & Acton, G. J. (2003). A framework for developing a coding scheme for meta-analysis.  Western Journal of Nursing Research ,  25 (2), 205–222. https://doi.org/10.1177/0193945902250038

Elamin, M. B., Flynn, D. N., Bassler, D., Briel, M., Alonso-Coello, P., Karanicolas, P. J., … Montori, V. M. (2009). Choice of data extraction tools for systematic reviews depends on resources and review complexity.  Journal of Clinical Epidemiology ,  62 (5), 506–510. https://doi.org/10.1016/j.jclinepi.2008.10.016

Higgins, J.P.T., & Thomas, J. (Eds.) (2022). Cochrane handbook for systematic reviews of interventions, Version 6.3. The Cochrane Collaboration. Available from https://training.cochrane.org/handbook/current (see Part 2: Core Methods, Chapters 4, 5)

Research guide from the George Washington University Himmelfarb Health Sciences Library.


Guidance to best tools and practices for systematic reviews

Kat Kolaski

1 Departments of Orthopaedic Surgery, Pediatrics, and Neurology, Wake Forest School of Medicine, Winston-Salem, NC USA

Lynne Romeiser Logan

2 Department of Physical Medicine and Rehabilitation, SUNY Upstate Medical University, Syracuse, NY USA

John P. A. Ioannidis

3 Departments of Medicine, of Epidemiology and Population Health, of Biomedical Data Science, and of Statistics, and Meta-Research Innovation Center at Stanford (METRICS), Stanford University School of Medicine, Stanford, CA USA

Abstract

Data continue to accumulate indicating that many systematic reviews are methodologically flawed, biased, redundant, or uninformative. Some improvements have occurred in recent years based on empirical methods research and standardization of appraisal tools; however, many authors do not routinely or consistently apply these updated methods. In addition, guideline developers, peer reviewers, and journal editors often disregard current methodological standards. Although these issues are extensively acknowledged and explored in the methodological literature, most clinicians seem unaware of them and may automatically accept evidence syntheses (and clinical practice guidelines based on their conclusions) as trustworthy.

A plethora of methods and tools are recommended for the development and evaluation of evidence syntheses. It is important to understand what these are intended to do (and cannot do) and how they can be utilized. Our objective is to distill this sprawling information into a format that is understandable and readily accessible to authors, peer reviewers, and editors. In doing so, we aim to promote appreciation and understanding of the demanding science of evidence synthesis among stakeholders. We focus on well-documented deficiencies in key components of evidence syntheses to elucidate the rationale for current standards. The constructs underlying the tools developed to assess reporting, risk of bias, and methodological quality of evidence syntheses are distinguished from those involved in determining overall certainty of a body of evidence. Another important distinction is made between those tools used by authors to develop their syntheses as opposed to those used to ultimately judge their work.

Exemplar methods and research practices are described, complemented by novel pragmatic strategies to improve evidence syntheses. The latter include preferred terminology and a scheme to characterize types of research evidence. We organize best practice resources in a Concise Guide that can be widely adopted and adapted for routine implementation by authors and journals. Appropriate, informed use of these is encouraged, but we caution against their superficial application and emphasize their endorsement does not substitute for in-depth methodological training. By highlighting best practices with their rationale, we hope this guidance will inspire further evolution of methods and tools that can advance the field.

Supplementary Information

The online version contains supplementary material available at 10.1186/s13643-023-02255-9.

Part 1. The state of evidence synthesis

Evidence syntheses are commonly regarded as the foundation of evidence-based medicine (EBM). They are widely accredited for providing reliable evidence and, as such, they have significantly influenced medical research and clinical practice. Despite their uptake throughout health care and ubiquity in contemporary medical literature, some important aspects of evidence syntheses are generally overlooked or not well recognized. Evidence syntheses are mostly retrospective exercises, they often depend on weak or irreparably flawed data, and they may use tools that have acknowledged or yet unrecognized limitations. They are complicated and time-consuming undertakings prone to bias and errors. Production of a good evidence synthesis requires careful preparation and high levels of organization in order to limit potential pitfalls [ 1 ]. Many authors do not recognize the complexity of such an endeavor and the many methodological challenges they may encounter. Failure to do so is likely to result in research and resource waste.

Given their potential impact on people’s lives, it is crucial for evidence syntheses to correctly report on the current knowledge base. In order to be perceived as trustworthy, reliable demonstration of the accuracy of evidence syntheses is equally imperative [ 2 ]. Concerns about the trustworthiness of evidence syntheses are not recent developments. From the early years when EBM first began to gain traction until recent times when thousands of systematic reviews are published monthly [ 3 ], the rigor of evidence syntheses has always varied. Many systematic reviews and meta-analyses had obvious deficiencies because original methods and processes had gaps, lacked precision, and/or were not widely known. The situation has improved with empirical research concerning which methods to use and standardization of appraisal tools. However, given the geometrical increase in the number of evidence syntheses being published, a relatively larger pool of unreliable evidence syntheses is being published today.

Publication of methodological studies that critically appraise the methods used in evidence syntheses is increasing at a fast pace. This reflects the availability of tools specifically developed for this purpose [ 4 – 6 ]. Yet many clinical specialties report that alarming numbers of evidence syntheses fail on these assessments. The syntheses identified report on a broad range of common conditions including, but not limited to, cancer, [ 7 ] chronic obstructive pulmonary disease, [ 8 ] osteoporosis, [ 9 ] stroke, [ 10 ] cerebral palsy, [ 11 ] chronic low back pain, [ 12 ] refractive error, [ 13 ] major depression, [ 14 ] pain, [ 15 ] and obesity [ 16 , 17 ]. The situation is even more concerning with regard to evidence syntheses included in clinical practice guidelines (CPGs) [ 18 – 20 ]. Astonishingly, in a sample of CPGs published in 2017–18, more than half did not apply even basic systematic methods in the evidence syntheses used to inform their recommendations [ 21 ].

These reports, while not widely acknowledged, suggest there are pervasive problems not limited to evidence syntheses that evaluate specific kinds of interventions or include primary research of a particular study design (eg, randomized versus non-randomized) [ 22 ]. Similar concerns about the reliability of evidence syntheses have been expressed by proponents of EBM in highly circulated medical journals [ 23 – 26 ]. These publications have also raised awareness about redundancy, inadequate input of statistical expertise, and deficient reporting. These issues plague primary research as well; however, there is heightened concern for the impact of these deficiencies given the critical role of evidence syntheses in policy and clinical decision-making.

Methods and guidance to produce a reliable evidence synthesis

Several international consortiums of EBM experts and national health care organizations currently provide detailed guidance (Table 1). They draw criteria from the reporting and methodological standards of currently recommended appraisal tools, and regularly review and update their methods to reflect new information and changing needs. In addition, they endorse the Grading of Recommendations Assessment, Development and Evaluation (GRADE) system for rating the overall quality of a body of evidence [ 27 ]. These groups typically certify or commission systematic reviews that are published in exclusive databases (eg, Cochrane, JBI) or are used to develop government or agency sponsored guidelines or health technology assessments (eg, National Institute for Health and Care Excellence [NICE], Scottish Intercollegiate Guidelines Network [SIGN], Agency for Healthcare Research and Quality [AHRQ]). They offer developers of evidence syntheses various levels of methodological advice, technical and administrative support, and editorial assistance. Use of specific protocols and checklists is required for development teams within these groups, but their online methodological resources are accessible to any potential author.

Guidance for development of evidence syntheses

 Cochrane (formerly Cochrane Collaboration)
 JBI (formerly Joanna Briggs Institute)
 National Institute for Health and Care Excellence (NICE)—United Kingdom
 Scottish Intercollegiate Guidelines Network (SIGN) —Scotland
 Agency for Healthcare Research and Quality (AHRQ)—United States

Notably, Cochrane is the largest single producer of evidence syntheses in biomedical research; however, these only account for 15% of the total [ 28 ]. The World Health Organization requires Cochrane standards be used to develop evidence syntheses that inform their CPGs [ 29 ]. Authors investigating questions of intervention effectiveness in syntheses developed for Cochrane follow the Methodological Expectations of Cochrane Intervention Reviews [ 30 ] and undergo multi-tiered peer review [ 31 , 32 ]. Several empirical evaluations have shown that Cochrane systematic reviews are of higher methodological quality compared with non-Cochrane reviews [ 4 , 7 , 9 , 11 , 14 , 32 – 35 ]. However, some of these assessments have biases: they may be conducted by Cochrane-affiliated authors, and they sometimes use scales and tools developed and used in the Cochrane environment and by its partners. In addition, evidence syntheses published in the Cochrane database are not subject to space or word restrictions, while non-Cochrane syntheses are often limited. As a result, information that may be relevant to the critical appraisal of non-Cochrane reviews is often removed or is relegated to online-only supplements that may not be readily or fully accessible [ 28 ].

Influences on the state of evidence synthesis

Many authors are familiar with the evidence syntheses produced by the leading EBM organizations but can be intimidated by the time and effort necessary to apply their standards. Instead of following their guidance, authors may employ methods that are discouraged or outdated [ 28 ]. Suboptimal methods described in the literature may then be taken up by others. For example, the Newcastle–Ottawa Scale (NOS) is a commonly used tool for appraising non-randomized studies [ 36 ]. Many authors justify their selection of this tool with reference to a publication that describes the unreliability of the NOS and recommends against its use [ 37 ]. Obviously, the authors who cite this report for that purpose have not read it. Authors and peer reviewers have a responsibility to use reliable and accurate methods and not copycat previous citations or substandard work [ 38 , 39 ]. Similar cautions may potentially extend to automation tools. These have concentrated on evidence searching [ 40 ] and selection given how demanding it is for humans to maintain truly up-to-date evidence [ 2 , 41 ]. Cochrane has deployed machine learning to identify randomized controlled trials (RCTs) and studies related to COVID-19, [ 2 , 42 ] but such tools are not yet commonly used [ 43 ]. The routine integration of automation tools in the development of future evidence syntheses should not displace the interpretive part of the process.

Editorials about unreliable or misleading systematic reviews highlight several of the intertwining factors that may contribute to continued publication of unreliable evidence syntheses: shortcomings and inconsistencies of the peer review process, lack of endorsement of current standards on the part of journal editors, the incentive structure of academia, industry influences, publication bias, and the lure of “predatory” journals [ 44 – 48 ]. At this juncture, clarification of the extent to which each of these factors contribute remains speculative, but their impact is likely to be synergistic.

Over time, the generalized acceptance of the conclusions of systematic reviews as incontrovertible has affected trends in the dissemination and uptake of evidence. Reporting of the results of evidence syntheses and recommendations of CPGs has shifted beyond medical journals to press releases and news headlines and, more recently, to the realm of social media and influencers. The lay public and policy makers may depend on these outlets for interpreting evidence syntheses and CPGs. Unfortunately, communication to the general public often reflects intentional or non-intentional misrepresentation or “spin” of the research findings [ 49 – 52 ]. News and social media outlets also tend to reduce conclusions on a body of evidence and recommendations for treatment to binary choices (eg, “do it” versus “don’t do it”) that may be assigned an actionable symbol (eg, red/green traffic lights, smiley/frowning face emoji).

Strategies for improvement

Many authors and peer reviewers are volunteer health care professionals or trainees who lack formal training in evidence synthesis [ 46 , 53 ]. Informing them about research methodology could increase the likelihood they will apply rigorous methods [ 25 , 33 , 45 ]. We tackle this challenge, from both a theoretical and a practical perspective, by offering guidance applicable to any specialty. It is based on recent methodological research that is extensively referenced to promote self-study. However, the information presented is not intended to be a substitute for committed training in evidence synthesis methodology; instead, we hope to inspire our target audience to seek such training. We also hope to inform a broader audience of clinicians and guideline developers influenced by evidence syntheses. Notably, these communities often include the same members who serve in different capacities.

In the following sections, we highlight methodological concepts and practices that may be unfamiliar, problematic, confusing, or controversial. In Part 2, we consider various types of evidence syntheses and the types of research evidence summarized by them. In Part 3, we examine some widely used (and misused) tools for the critical appraisal of systematic reviews and reporting guidelines for evidence syntheses. In Part 4, we discuss how to meet methodological conduct standards applicable to key components of systematic reviews. In Part 5, we describe the merits and caveats of rating the overall certainty of a body of evidence. Finally, in Part 6, we summarize suggested terminology, methods, and tools for development and evaluation of evidence syntheses that reflect current best practices.

Part 2. Types of syntheses and research evidence

A good foundation for the development of evidence syntheses requires an appreciation of their various methodologies and the ability to correctly identify the types of research potentially available for inclusion in the synthesis.

Types of evidence syntheses

Systematic reviews have historically focused on the benefits and harms of interventions; over time, various types of systematic reviews have emerged to address the diverse information needs of clinicians, patients, and policy makers [ 54 ]. Systematic reviews with traditional components have become defined by the different topics they assess (Table 2.1). In addition, other distinctive types of evidence syntheses have evolved, including overviews or umbrella reviews, scoping reviews, rapid reviews, and living reviews. The popularity of these has been increasing in recent years [ 55 – 58 ]. A summary of the development, methods, available guidance, and indications for these unique types of evidence syntheses is available in Additional File 2 A.

Types of traditional systematic reviews

  • Intervention: Benefits and harms of interventions used in healthcare. Elements of the research question: Population, Intervention, Comparator, Outcome (PICO).
  • Diagnostic test accuracy: How well a diagnostic test performs in diagnosing and detecting a particular disease. Elements: Population, Index test(s), and Target condition (PIT).
  • Qualitative (Cochrane): Questions are designed to improve understanding of intervention complexity, contextual variations, implementation, and stakeholder preferences and experiences. Elements: Setting, Perspective, Intervention or Phenomenon of Interest, Comparison, Evaluation (SPICE); Sample, Phenomenon of Interest, Design, Evaluation, Research type (SPIDER); Perspective, Setting, Phenomena of interest/Problem, Environment, Comparison (optional), Time/timing, Findings (PerSPecTIF).
  • Qualitative (JBI): Questions inform meaningfulness and appropriateness of care and the impact of illness through documentation of stakeholder experiences, preferences, and priorities. Elements: Population, Phenomena of Interest, and Context (PICo).
  • Prognostic: Probable course or future outcome(s) of people with a health problem. Elements: Population, Intervention (model), Comparator, Outcomes, Timing, Setting (PICOTS).
  • Etiology and risk: The relationship (association) between certain factors (e.g., genetic, environmental) and the development of a disease or condition or other health outcome. Elements: Population or groups at risk, Exposure(s), associated Outcome(s) (disease, symptom, or health condition of interest), and the context/location or time period where relevant (PEO).
  • Measurement properties: What is the most suitable instrument to measure a construct of interest in a specific study population? Elements: Population, Instrument, Construct, Outcomes (PICO).
  • Prevalence and incidence: The frequency, distribution and determinants of specific factors, health states or conditions in a defined population (eg, how common is a particular disease or condition in a specific group of individuals?). Elements: the factor, disease, symptom or health Condition of interest, the epidemiological indicator used to measure its frequency (prevalence, incidence), the Population or groups at risk, and the Context/location and time period where relevant (CoCoPop).

Both Cochrane [ 30 , 59 ] and JBI [ 60 ] provide methodologies for many types of evidence syntheses; they describe these with different terminology, but there is obvious overlap (Table 2.2 ). The majority of evidence syntheses published by Cochrane (96%) and JBI (62%) are categorized as intervention reviews. This reflects the earlier development and dissemination of their intervention review methodologies; these remain well-established [ 30 , 59 , 61 ] as both organizations continue to focus on topics related to treatment efficacy and harms. In contrast, intervention reviews represent only about half of the total published in the general medical literature, and several non-intervention review types contribute to a significant proportion of the other half.

Evidence syntheses published by Cochrane and JBI

Cochrane (total = 8900; note a):
  • Intervention: 8572 (96.3%)
  • Diagnostic: 176 (1.9%)
  • Overview: 64 (0.7%)
  • Methodology: 41 (0.45%)
  • Qualitative: 17 (0.19%)
  • Prognostic: 11 (0.12%)
  • Rapid: 11 (0.12%)
  • Prototype: 8 (0.08%)

JBI (total = 707; note b):
  • Effectiveness: 435 (61.5%)
  • Qualitative: 159 (22.5%)
  • Scoping: 43 (6.0%)
  • Comprehensive: 32 (4.5%)
  • Diagnostic Test Accuracy: 9 (1.3%)
  • Etiology and Risk: 7 (1.0%)
  • Prevalence and Incidence: 6 (0.8%)
  • Economic: 6 (0.6%)
  • Umbrella: 4 (0.6%)
  • Measurement Properties: 3 (0.4%)
  • Mixed Methods: 2 (0.3%)
  • Text and Opinion: 1 (0.14%)

a Data from https://www.cochranelibrary.com/cdsr/reviews . Accessed 17 Sep 2022

b Data obtained via personal email communication on 18 Sep 2022 with Emilie Francis, editorial assistant, JBI Evidence Synthesis

c Includes the following categories: prevalence, scoping, mixed methods, and realist reviews

d This methodology is not supported in the current version of the JBI Manual for Evidence Synthesis

Types of research evidence

There is consensus on the importance of using multiple study designs in evidence syntheses; at the same time, there is a lack of agreement on methods to identify included study designs. Authors of evidence syntheses may use various taxonomies and associated algorithms to guide selection and/or classification of study designs. These tools differentiate categories of research and apply labels to individual study designs (eg, RCT, cross-sectional). A familiar example is the Design Tree endorsed by the Centre for Evidence-Based Medicine [ 70 ]. Such tools may not be helpful to authors of evidence syntheses for multiple reasons.

Suboptimal levels of agreement and accuracy even among trained methodologists reflect challenges with the application of such tools [ 71 , 72 ]. Problematic distinctions or decision points (eg, experimental or observational, controlled or uncontrolled, prospective or retrospective) and design labels (eg, cohort, case control, uncontrolled trial) have been reported [ 71 ]. The variable application of ambiguous study design labels to non-randomized studies is common, making them especially prone to misclassification [ 73 ]. In addition, study labels do not denote the unique design features that make different types of non-randomized studies susceptible to different biases, including those related to how the data are obtained (eg, clinical trials, disease registries, wearable devices). Given this limitation, it is important to be aware that design labels preclude the accurate assignment of non-randomized studies to a “level of evidence” in traditional hierarchies [ 74 ].

These concerns suggest that available tools and nomenclature used to distinguish types of research evidence may not uniformly apply to biomedical research and non-health fields that utilize evidence syntheses (eg, education, economics) [ 75 , 76 ]. Moreover, primary research reports often do not describe study design or do so incompletely or inaccurately; thus, indexing in PubMed and other databases does not address the potential for misclassification [ 77 ]. Yet proper identification of research evidence has implications for several key components of evidence syntheses. For example, search strategies limited by index terms using design labels or study selection based on labels applied by the authors of primary studies may cause inconsistent or unjustified study inclusions and/or exclusions [ 77 ]. In addition, because risk of bias (RoB) tools consider attributes specific to certain types of studies and study design features, results of these assessments may be invalidated if an inappropriate tool is used. Appropriate classification of studies is also relevant for the selection of a suitable method of synthesis and interpretation of those results.

An alternative to these tools and nomenclature involves application of a few fundamental distinctions that encompass a wide range of research designs and contexts. While these distinctions are not novel, we integrate them into a practical scheme (see Fig. 1) designed to guide authors of evidence syntheses in the basic identification of research evidence. The initial distinction is between primary and secondary studies. Primary studies are then further distinguished by: 1) the type of data reported (qualitative or quantitative); and 2) two defining design features (group or single-case and randomized or non-randomized). The different types of studies and study designs represented in the scheme are described in detail in Additional File 2 B. It is important to conceptualize their methods as complementary as opposed to contrasting or hierarchical [ 78 ]; each offers advantages and disadvantages that determine their appropriateness for answering different kinds of research questions in an evidence synthesis.

Fig. 1 Distinguishing types of research evidence
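As an illustrative sketch only (the field names are hypothetical and this is not an official taxonomy), a review team could record these basic distinctions for each report it screens instead of relying on a single study design label:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StudyRecord:
    """Record of the basic distinctions described above (illustrative only):
    primary vs secondary, type of data, and two defining design features."""
    study_id: str
    is_primary: bool                       # primary study, or secondary (an existing synthesis)
    data_type: Optional[str] = None        # "quantitative" or "qualitative" (primary studies only)
    group_design: Optional[bool] = None    # group-based (True) vs single-case (False)
    randomized: Optional[bool] = None      # randomized (True) vs non-randomized (False)

def describe(study: StudyRecord) -> str:
    if not study.is_primary:
        return f"{study.study_id}: secondary study (existing evidence synthesis)"
    features = [
        study.data_type or "data type unclear",
        "group" if study.group_design else "single-case",
        "randomized" if study.randomized else "non-randomized",
    ]
    return f"{study.study_id}: primary study ({', '.join(features)})"

# Illustrative records: an RCT, a cohort study, and a previously published review
print(describe(StudyRecord("S01", True, "quantitative", True, True)))
print(describe(StudyRecord("S02", True, "quantitative", True, False)))
print(describe(StudyRecord("S03", False)))
```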

Application of these basic distinctions may avoid some of the potential difficulties associated with study design labels and taxonomies. Nevertheless, debatable methodological issues are raised when certain types of research identified in this scheme are included in an evidence synthesis. We briefly highlight those associated with inclusion of non-randomized studies, case reports and series, and a combination of primary and secondary studies.

Non-randomized studies

When investigating an intervention’s effectiveness, it is important for authors to recognize the uncertainty of observed effects reported by studies with high RoB. Results of statistical analyses that include such studies need to be interpreted with caution in order to avoid misleading conclusions [ 74 ]. Review authors may consider excluding randomized studies with high RoB from meta-analyses. Non-randomized studies of intervention (NRSI) are affected by a greater potential range of biases and thus vary more than RCTs in their ability to estimate a causal effect [ 79 ]. If data from NRSI are synthesized in meta-analyses, it is helpful to separately report their summary estimates [ 6 , 74 ].

Nonetheless, certain design features of NRSI (eg, which parts of the study were prospectively designed) may help to distinguish stronger from weaker ones. Cochrane recommends that authors of a review including NRSI focus on relevant study design features when determining eligibility criteria instead of relying on non-informative study design labels [ 79 , 80 ]. This process is facilitated by a study design feature checklist; guidance on using the checklist is included with the developers’ description of the tool [ 73 , 74 ]. Authors collect information about these design features during data extraction and then consider it when making final study selection decisions and when performing RoB assessments of the included NRSI.

Case reports and case series

Correctly identified case reports and case series can contribute evidence not well captured by other designs [ 81 ]; in addition, some topics may be limited to a body of evidence that consists primarily of uncontrolled clinical observations. Murad and colleagues offer a framework for how to include case reports and series in an evidence synthesis [ 82 ]. Distinguishing between cohort studies and case series in these syntheses is important, especially for those that rely on evidence from NRSI. Additional data obtained from studies misclassified as case series can potentially increase the confidence in effect estimates. Mathes and Pieper provide authors of evidence syntheses with specific guidance on distinguishing between cohort studies and case series, but emphasize the increased workload involved [ 77 ].

Primary and secondary studies

Synthesis of combined evidence from primary and secondary studies may provide a broad perspective on the entirety of available literature on a topic. This is, in fact, the recommended strategy for scoping reviews that may include a variety of sources of evidence (eg, CPGs, popular media). However, except for scoping reviews, the synthesis of data from primary and secondary studies is discouraged unless there are strong reasons to justify doing so.

Combining primary and secondary sources of evidence is challenging for authors of other types of evidence syntheses for several reasons [ 83 ]. Assessments of RoB for primary and secondary studies are derived from conceptually different tools, thus obfuscating the ability to make an overall RoB assessment of a combination of these study types. In addition, authors who include primary and secondary studies must devise non-standardized methods for synthesis. Note this contrasts with well-established methods available for updating existing evidence syntheses with additional data from new primary studies [ 84 – 86 ]. However, a new review that synthesizes data from primary and secondary studies raises questions of validity and may unintentionally support a biased conclusion because no existing methodological guidance is currently available [ 87 ].

Recommendations

We suggest that journal editors require authors to identify which type of evidence synthesis they are submitting and reference the specific methodology used for its development. This will clarify the research question and methods for peer reviewers and potentially simplify the editorial process. Editors should announce this practice and include it in the instructions to authors. To decrease bias and apply correct methods, authors must also accurately identify the types of research evidence included in their syntheses.

Part 3. Conduct and reporting

The need to develop criteria to assess the rigor of systematic reviews was recognized soon after the EBM movement began to gain international traction [ 88 , 89 ]. Systematic reviews rapidly became popular, but many were very poorly conceived, conducted, and reported. These problems remain highly prevalent [ 23 ] despite development of guidelines and tools to standardize and improve the performance and reporting of evidence syntheses [ 22 , 28 ]. Table 3.1  provides some historical perspective on the evolution of tools developed specifically for the evaluation of systematic reviews, with or without meta-analysis.

Tools specifying standards for systematic reviews with and without meta-analysis

  • Quality of Reporting of Meta-analyses (QUOROM) Statement: Moher 1999
  • Meta-analyses Of Observational Studies in Epidemiology (MOOSE): Stroup 2000
  • Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA): Moher 2009
  • PRISMA 2020: Page 2021
  • Overview Quality Assessment Questionnaire (OQAQ): Oxman and Guyatt 1991
  • Systematic Review Critical Appraisal Sheet: Centre for Evidence-based Medicine 2005
  • A Measurement Tool to Assess Systematic Reviews (AMSTAR): Shea 2007
  • AMSTAR-2: Shea 2017
  • Risk of Bias in Systematic Reviews (ROBIS): Whiting 2016

a Currently recommended

b Validated tool for systematic reviews of interventions developed for use by authors of overviews or umbrella reviews

These tools are often interchangeably invoked when referring to the “quality” of an evidence synthesis. However, quality is a vague term that is frequently misused and misunderstood; more precisely, these tools specify different standards for evidence syntheses. Methodological standards address how well a systematic review was designed and performed [ 5 ]. RoB assessments refer to systematic flaws or limitations in the design, conduct, or analysis of research that distort the findings of the review [ 4 ]. Reporting standards help systematic review authors describe the methodology they used and the results of their synthesis in sufficient detail [ 92 ]. It is essential to distinguish between these evaluations: a systematic review may be biased, it may fail to report sufficient information on essential features, or it may exhibit both problems; a thoroughly reported evidence synthesis may still be biased and flawed, while an otherwise unbiased one may suffer from deficient documentation.

We direct attention to the currently recommended tools listed in Table 3.1 but concentrate on AMSTAR-2 (update of AMSTAR [A Measurement Tool to Assess Systematic Reviews]) and ROBIS (Risk of Bias in Systematic Reviews), which evaluate methodological quality and RoB, respectively. For comparison and completeness, we include PRISMA 2020 (update of the 2009 Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement), which offers guidance on reporting standards. The exclusive focus on these three tools is by design; it addresses concerns related to the considerable variability in tools used for the evaluation of systematic reviews [ 28 , 88 , 96 , 97 ]. We highlight the underlying constructs these tools were designed to assess, then describe their components and applications. Their known (or potential) uptake and impact and limitations are also discussed.

Evaluation of conduct

Development.

AMSTAR [ 5 ] was in use for a decade prior to the 2017 publication of AMSTAR-2; both provide a broad evaluation of methodological quality of intervention systematic reviews, including flaws arising through poor conduct of the review [ 6 ]. ROBIS, published in 2016, was developed to specifically assess RoB introduced by the conduct of the review; it is applicable to systematic reviews of interventions and several other types of reviews [ 4 ]. Both tools reflect a shift to a domain-based approach as opposed to generic quality checklists. There are a few items unique to each tool; however, similarities between items have been demonstrated [ 98 , 99 ]. AMSTAR-2 and ROBIS are recommended for use by: 1) authors of overviews or umbrella reviews and CPGs to evaluate systematic reviews considered as evidence; 2) authors of methodological research studies to appraise included systematic reviews; and 3) peer reviewers for appraisal of submitted systematic review manuscripts. For authors, these tools may function as teaching aids and inform conduct of their review during its development.

Description

Systematic reviews that include randomized and/or non-randomized studies as evidence can be appraised with AMSTAR-2 and ROBIS. Other characteristics of AMSTAR-2 and ROBIS are summarized in Table 3.2 . Both tools define categories for an overall rating; however, neither tool is intended to generate a total score by simply calculating the number of responses satisfying criteria for individual items [ 4 , 6 ]. AMSTAR-2 focuses on the rigor of a review’s methods irrespective of the specific subject matter. ROBIS places emphasis on a review’s results section— this suggests it may be optimally applied by appraisers with some knowledge of the review’s topic as they may be better equipped to determine if certain procedures (or lack thereof) would impact the validity of a review’s findings [ 98 , 100 ]. Reliability studies show AMSTAR-2 overall confidence ratings strongly correlate with the overall RoB ratings in ROBIS [ 100 , 101 ].

Comparison of AMSTAR-2 and ROBIS

AMSTAR-2:
  • Accompanying guidance: extensive
  • Review types covered: intervention
  • Domains and items: 7 critical items, 9 non-critical items (16 items in total)
  • Response options: items 1, 3, 5, 6, 10, 13, 14, 16 are rated yes or no; items 2, 4, 7, 8, 9 are rated yes, partial yes, or no; items 11, 12, 15 are rated yes, no, or no meta-analysis conducted
  • Construct: confidence, based on weaknesses in critical domains
  • Overall rating categories: high, moderate, low, critically low

ROBIS:
  • Accompanying guidance: extensive
  • Review types covered: intervention, diagnostic, etiology, prognostic
  • Domains and items: 4 domains (29 items in total)
  • Response options: 24 assessment items are rated yes, probably yes, probably no, no, or no information; 5 items regarding level of concern are rated low, high, or unclear
  • Construct: level of concern for risk of bias
  • Overall rating categories: low, high, unclear

a ROBIS includes an optional first phase to assess the applicability of the review to the research question of interest. The tool may be applicable to other review types in addition to the four specified, although modification of this initial phase will be needed (Personal Communication via email, Penny Whiting, 28 Jan 2022)

b AMSTAR-2 item #9 and #11 require separate responses for RCTs and NRSI

Interrater reliability has been shown to be acceptable for AMSTAR-2 [ 6 , 11 , 102 ] and ROBIS [ 4 , 98 , 103 ] but neither tool has been shown to be superior in this regard [ 100 , 101 , 104 , 105 ]. Overall, variability in reliability for both tools has been reported across items, between pairs of raters, and between centers [ 6 , 100 , 101 , 104 ]. The effects of appraiser experience on the results of AMSTAR-2 and ROBIS require further evaluation [ 101 , 105 ]. Updates to both tools should address items shown to be prone to individual appraisers’ subjective biases and opinions [ 11 , 100 ]; this may involve modifications of the current domains and signaling questions as well as incorporation of methods to make an appraiser’s judgments more explicit. Future revisions of these tools may also consider the addition of standards for aspects of systematic review development currently lacking (eg, rating overall certainty of evidence, [ 99 ] methods for synthesis without meta-analysis [ 105 ]) and removal of items that assess aspects of reporting that are thoroughly evaluated by PRISMA 2020.

Application

A good understanding of what is required to satisfy the standards of AMSTAR-2 and ROBIS involves study of the accompanying guidance documents written by the tools’ developers; these contain detailed descriptions of each item’s standards. In addition, accurate appraisal of a systematic review with either tool requires training. Most experts recommend independent assessment by at least two appraisers with a process for resolving discrepancies as well as procedures to establish interrater reliability, such as pilot testing, a calibration phase or exercise, and development of predefined decision rules [ 35 , 99 – 101 , 103 , 104 , 106 ]. These methods may, to some extent, address the challenges associated with the diversity in methodological training, subject matter expertise, and experience using the tools that are likely to exist among appraisers.

The standards of AMSTAR, AMSTAR-2, and ROBIS have been used in many methodological studies and epidemiological investigations. However, the increased publication of overviews or umbrella reviews and CPGs has likely been a greater influence on the widening acceptance of these tools. Critical appraisal of the secondary studies considered evidence is essential to the trustworthiness of both the recommendations of CPGs and the conclusions of overviews. Currently both Cochrane [ 55 ] and JBI [ 107 ] recommend AMSTAR-2 and ROBIS in their guidance for authors of overviews or umbrella reviews. However, ROBIS and AMSTAR-2 were released in 2016 and 2017, respectively; thus, to date, limited data have been reported about the uptake of these tools or which of the two may be preferred [ 21 , 106 ]. Currently, in relation to CPGs, AMSTAR-2 appears to be overwhelmingly popular compared to ROBIS. A Google Scholar search of this topic (search terms “AMSTAR 2 AND clinical practice guidelines,” “ROBIS AND clinical practice guidelines” 13 May 2022) found 12,700 hits for AMSTAR-2 and 1,280 for ROBIS. The apparent greater appeal of AMSTAR-2 may relate to its longer track record given the original version of the tool was in use for 10 years prior to its update in 2017.

Barriers to the uptake of AMSTAR-2 and ROBIS include the real or perceived time and resources necessary to complete the items they include and appraisers’ confidence in their own ratings [ 104 ]. Reports from comparative studies available to date indicate that appraisers find AMSTAR-2 questions, responses, and guidance to be clearer and simpler compared with ROBIS [ 11 , 101 , 104 , 105 ]. This suggests that for appraisal of intervention systematic reviews, AMSTAR-2 may be a more practical tool than ROBIS, especially for novice appraisers [ 101 , 103 – 105 ]. The unique characteristics of each tool, as well as their potential advantages and disadvantages, should be taken into consideration when deciding which tool should be used for an appraisal of a systematic review. In addition, the choice of one or the other may depend on how the results of an appraisal will be used; for example, a peer reviewer’s appraisal of a single manuscript versus an appraisal of multiple systematic reviews in an overview or umbrella review, CPG, or systematic methodological study.

Authors of overviews and CPGs report results of AMSTAR-2 and ROBIS appraisals for each of the systematic reviews they include as evidence. Ideally, an independent judgment of their appraisals can be made by the end users of overviews and CPGs; however, most stakeholders, including clinicians, are unlikely to have a sophisticated understanding of these tools. Nevertheless, they should at least be aware that AMSTAR-2 and ROBIS ratings reported in overviews and CPGs may be inaccurate because the tools are not applied as intended by their developers. This can result from inadequate training of the overview or CPG authors who perform the appraisals, or from modifications of the appraisal tools imposed by them. The potential variability in overall confidence and RoB ratings highlights why appraisers applying these tools need to support their judgments with explicit documentation; this allows readers to judge for themselves whether they agree with the criteria used by appraisers [ 4 , 108 ]. When these judgments are explicit, the underlying rationale used when applying these tools can be assessed [ 109 ].

Theoretically, we would expect an association of AMSTAR-2 with improved methodological rigor and an association of ROBIS with lower RoB in recent systematic reviews compared to those published before 2017. To our knowledge, this has not yet been demonstrated; however, like reports about the actual uptake of these tools, time will tell. Additional data on user experience is also needed to further elucidate the practical challenges and methodological nuances encountered with the application of these tools. This information could potentially inform the creation of unifying criteria to guide and standardize the appraisal of evidence syntheses [ 109 ].

Evaluation of reporting

Complete reporting is essential for users to establish the trustworthiness and applicability of a systematic review’s findings. Efforts to standardize and improve the reporting of systematic reviews resulted in the 2009 publication of the PRISMA statement [ 92 ] with its accompanying explanation and elaboration document [ 110 ]. This guideline was designed to help authors prepare a complete and transparent report of their systematic review. In addition, adherence to PRISMA is often used to evaluate the thoroughness of reporting of published systematic reviews [ 111 ]. The updated version, PRISMA 2020 [ 93 ], and its guidance document [ 112 ] were published in 2021. Items on the original and updated versions of PRISMA are organized by the six basic review components they address (title, abstract, introduction, methods, results, discussion). The PRISMA 2020 update is a considerably expanded version of the original; it includes standards and examples for the 27 original and 13 additional reporting items that capture methodological advances and may enhance the replicability of reviews [ 113 ].

The original PRISMA statement fostered the development of various PRISMA extensions (Table 3.3 ). These include reporting guidance for scoping reviews and reviews of diagnostic test accuracy and for intervention reviews that report on the following: harms outcomes, equity issues, the effects of acupuncture, the results of network meta-analyses and analyses of individual participant data. Detailed reporting guidance for specific systematic review components (abstracts, protocols, literature searches) is also available.

PRISMA extensions

  • PRISMA for systematic reviews with a focus on health equity (PRISMA-E), 2012
  • Reporting systematic reviews in journal and conference abstracts (PRISMA for Abstracts), 2015; 2020
  • PRISMA for systematic review protocols (PRISMA-P), 2015
  • PRISMA for Network Meta-Analyses (PRISMA-NMA), 2015
  • PRISMA for Individual Participant Data (PRISMA-IPD), 2015
  • PRISMA for reviews including harms outcomes (PRISMA-Harms), 2016
  • PRISMA for diagnostic test accuracy (PRISMA-DTA), 2018
  • PRISMA for scoping reviews (PRISMA-ScR), 2018
  • PRISMA for acupuncture (PRISMA-A), 2019
  • PRISMA for reporting literature searches (PRISMA-S), 2021

PRISMA, Preferred Reporting Items for Systematic Reviews and Meta-Analyses

a Note the abstract reporting checklist is now incorporated into PRISMA 2020 [ 93 ]

Uptake and impact

The 2009 PRISMA standards [ 92 ] for reporting have been widely endorsed by authors, journals, and EBM-related organizations. We anticipate the same for PRISMA 2020 [ 93 ] given its co-publication in multiple high-impact journals. However, to date, there is a lack of strong evidence for an association between improved systematic review reporting and endorsement of PRISMA 2009 standards [ 43 , 111 ]. Most journals require a PRISMA checklist accompany submissions of systematic review manuscripts. However, the accuracy of information presented on these self-reported checklists is not necessarily verified. It remains unclear which strategies (eg, authors’ self-report of checklists, peer reviewer checks) might improve adherence to the PRISMA reporting standards; in addition, the feasibility of any potentially effective strategies must be taken into consideration given the structure and limitations of current research and publication practices [ 124 ].

Pitfalls and limitations of PRISMA, AMSTAR-2, and ROBIS

Misunderstanding of the roles of these tools and their misapplication may be widespread problems. PRISMA 2020 is a reporting guideline that is most beneficial if consulted when developing a review as opposed to merely completing a checklist when submitting to a journal; at that point, the review is finished, with good or bad methodological choices. However, PRISMA checklists evaluate how completely an element of review conduct was reported, but do not evaluate the caliber of conduct or performance of a review. Thus, review authors and readers should not think that a rigorous systematic review can be produced by simply following the PRISMA 2020 guidelines. Similarly, it is important to recognize that AMSTAR-2 and ROBIS are tools to evaluate the conduct of a review but do not substitute for conceptual methodological guidance. In addition, they are not intended to be simple checklists. In fact, they have the potential for misuse or abuse if applied as such; for example, by calculating a total score to make a judgment about a review’s overall confidence or RoB. Proper selection of a response for the individual items on AMSTAR-2 and ROBIS requires training or at least reference to their accompanying guidance documents.
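To make the contrast with simple score-counting concrete, the sketch below derives an AMSTAR-2-style overall confidence rating from the pattern of weaknesses in critical versus non-critical items rather than from a total score. The list of critical items and the decision rules shown here are assumptions paraphrased from the published AMSTAR-2 guidance and do not substitute for the tool's own guidance document.

```python
# Illustrative sketch only: an AMSTAR-2-style overall confidence rating derived from
# weaknesses in critical versus non-critical items, not from a total score.
# The critical-item list and decision rules are assumptions paraphrased from the
# published AMSTAR-2 guidance; always consult the tool's own guidance document.

CRITICAL_ITEMS = {2, 4, 7, 9, 11, 13, 15}  # the items commonly treated as "critical domains"

def overall_confidence(weak_items: set) -> str:
    """weak_items: AMSTAR-2 item numbers judged not to meet the standard."""
    critical_flaws = len(weak_items & CRITICAL_ITEMS)
    non_critical_weaknesses = len(weak_items - CRITICAL_ITEMS)
    if critical_flaws > 1:
        return "critically low"
    if critical_flaws == 1:
        return "low"
    if non_critical_weaknesses > 1:
        return "moderate"
    return "high"

print(overall_confidence({3}))         # one non-critical weakness -> "high"
print(overall_confidence({3, 16}))     # two non-critical weaknesses -> "moderate"
print(overall_confidence({3, 16, 7}))  # plus one critical flaw -> "low"
print(overall_confidence({7, 15}))     # more than one critical flaw -> "critically low"
```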

Not surprisingly, it has been shown that compliance with the PRISMA checklist is not necessarily associated with satisfying the standards of ROBIS [ 125 ]. AMSTAR-2 and ROBIS were not available when PRISMA 2009 was developed; however, they were considered in the development of PRISMA 2020 [ 113 ]. Therefore, future studies may show a positive relationship between fulfillment of PRISMA 2020 standards for reporting and meeting the standards of tools evaluating methodological quality and RoB.

Choice of an appropriate tool for the evaluation of a systematic review first involves identification of the underlying construct to be assessed. For systematic reviews of interventions, recommended tools include AMSTAR-2 and ROBIS for appraisal of conduct and PRISMA 2020 for completeness of reporting. All three tools were developed rigorously and provide easily accessible and detailed user guidance, which is necessary for their proper application and interpretation. When considering a manuscript for publication, training in these tools can sensitize peer reviewers and editors to major issues that may affect the review’s trustworthiness and completeness of reporting. Judgment of the overall certainty of a body of evidence and formulation of recommendations rely, in part, on AMSTAR-2 or ROBIS appraisals of systematic reviews. Therefore, training on the application of these tools is essential for authors of overviews and developers of CPGs. Peer reviewers and editors considering an overview or CPG for publication must hold their authors to a high standard of transparency regarding both the conduct and reporting of these appraisals.

Part 4. Meeting conduct standards

Many authors, peer reviewers, and editors erroneously equate fulfillment of the items on the PRISMA checklist with superior methodological rigor. For direction on methodology, we refer them to available resources that provide comprehensive conceptual guidance [ 59 , 60 ] as well as primers with basic step-by-step instructions [ 1 , 126 , 127 ]. This section is intended to complement study of such resources by facilitating use of AMSTAR-2 and ROBIS, tools specifically developed to evaluate methodological rigor of systematic reviews. These tools are widely accepted by methodologists; however, in the general medical literature, they are not uniformly selected for the critical appraisal of systematic reviews [ 88 , 96 ].

To enable their uptake, Table 4.1  links review components to the corresponding appraisal tool items. Expectations of AMSTAR-2 and ROBIS are concisely stated, and reasoning provided.

Systematic review components linked to appraisal with AMSTAR-2 and ROBIS a

  • Methods for study selection (AMSTAR-2 #5; ROBIS #2.5), methods for data extraction (AMSTAR-2 #6; ROBIS #3.1), and methods for RoB assessment (AMSTAR-2 NA; ROBIS #3.5): all three components must be done in duplicate, and methods fully described. Helps to mitigate CoI and bias; also may improve accuracy.
  • Study description (AMSTAR-2 #8; ROBIS #3.2): research design features, components of the research question (eg, PICO), setting, funding sources. Allows readers to understand the individual studies in detail.
  • Sources of funding (AMSTAR-2 #10; ROBIS NA): identified for all included studies. Can reveal CoI or bias.
  • Publication bias (AMSTAR-2 #15*; ROBIS #4.5): explored, diagrammed, and discussed. Publication and other selective reporting biases are major threats to the validity of systematic reviews.
  • Author CoI (AMSTAR-2 #16; ROBIS NA): disclosed, with management strategies described. If CoI is identified, management strategies must be described to ensure confidence in the review.

CoI conflict of interest, MA meta-analysis, NA not addressed, PICO participant, intervention, comparison, outcome, PRISMA-P Preferred Reporting Items for Systematic Review and Meta-Analysis Protocols, RoB risk of bias

a Components shown in bold are chosen for elaboration in Part 4 for one (or both) of two reasons: 1) the component has been identified as potentially problematic for systematic review authors; and/or 2) the component is evaluated by standards of an AMSTAR-2 “critical” domain

b Critical domains of AMSTAR-2 are indicated by *

Issues involved in meeting the standards for seven review components (identified in bold in Table 4.1 ) are addressed in detail. These were chosen for elaboration for one (or both) of two reasons: 1) the component has been identified as potentially problematic for systematic review authors based on consistent reports of their frequent AMSTAR-2 or ROBIS deficiencies [ 9 , 11 , 15 , 88 , 128 , 129 ]; and/or 2) the review component is judged by standards of an AMSTAR-2 “critical” domain. These have the greatest implications for how a systematic review will be appraised: if standards for any one of these critical domains are not met, the review is rated as having “critically low confidence.”

Research question

Specific and unambiguous research questions may have more value for reviews that deal with hypothesis testing. Mnemonics for the various elements of research questions are suggested by JBI and Cochrane (Table 2.1 ). These prompt authors to consider the specialized methods involved in developing different types of systematic reviews; however, while inclusion of the suggested elements may make a question compatible with a particular review type’s methods, it does not necessarily make the research question appropriate. Table 4.2  lists acronyms that may aid in developing the research question. They include overlapping concepts of importance in this time of proliferating reviews of uncertain value [ 130 ]. If these issues are not prospectively contemplated, systematic review authors may establish an overly broad scope, or allow the scope to drift away from predefined choices about key comparisons and outcomes.

Research question development

  • FINER a: feasible, interesting, novel, ethical, and relevant
  • SMART b: specific, measurable, attainable, relevant, timely
  • TOPICS + M c: time, outcomes, population, intervention, context, study design, plus (effect) moderators

a Cummings SR, Browner WS, Hulley SB. Conceiving the research question and developing the study plan. In: Hulley SB, Cummings SR, Browner WS, editors. Designing clinical research: an epidemiological approach; 4th edn. Lippincott Williams & Wilkins; 2007. p. 14–22

b Doran, GT. There’s a S.M.A.R.T. way to write management’s goals and objectives. Manage Rev. 1981;70:35-6.

c Johnson BT, Hennessy EA. Systematic reviews and meta-analyses in the health sciences: best practice methods for research syntheses. Soc Sci Med. 2019;233:237–51

Once a research question is established, searching on registry sites and databases for existing systematic reviews addressing the same or a similar topic is necessary in order to avoid contributing to research waste [ 131 ]. Repeating an existing systematic review must be justified, for example, if previous reviews are out of date or methodologically flawed. A full discussion on replication of intervention systematic reviews, including a consensus checklist, can be found in the work of Tugwell and colleagues [ 84 ].

Protocol development is considered a core component of systematic reviews [ 125 , 126 , 132 ]. Review protocols may allow researchers to plan and anticipate potential issues, assess validity of methods, prevent arbitrary decision-making, and minimize bias that can be introduced by the conduct of the review. Registration of a protocol that allows public access promotes transparency of the systematic review’s methods and processes and reduces the potential for duplication [ 132 ]. Thinking early and carefully about all the steps of a systematic review is pragmatic and logical and may mitigate the influence of the authors’ prior knowledge of the evidence [ 133 ]. In addition, the protocol stage is when the scope of the review can be carefully considered by authors, reviewers, and editors; this may help to avoid production of overly ambitious reviews that include excessive numbers of comparisons and outcomes or are undisciplined in their study selection.

An association with attainment of AMSTAR standards in systematic reviews with published prospective protocols has been reported [ 134 ]. However, completeness of reporting does not seem to be different in reviews with a protocol compared to those without one [ 135 ]. PRISMA-P [ 116 ] and its accompanying elaboration and explanation document [ 136 ] can be used to guide and assess the reporting of protocols. A final version of the review should fully describe any protocol deviations. Peer reviewers may compare the submitted manuscript with any available pre-registered protocol; this is required if AMSTAR-2 or ROBIS are used for critical appraisal.

There are multiple options for the recording of protocols (Table 4.3 ). Some journals will peer review and publish protocols. In addition, many online sites offer date-stamped and publicly accessible protocol registration. Some of these are exclusively for protocols of evidence syntheses; others are less restrictive and offer researchers the capacity for data storage, sharing, and other workflow features. These sites document protocol details to varying extents and have different requirements [ 137 ]. The most popular site for systematic reviews, the International Prospective Register of Systematic Reviews (PROSPERO), for example, only registers reviews that report on an outcome with direct relevance to human health. The PROSPERO record documents protocols for all types of reviews except literature and scoping reviews. Of note, PROSPERO requires that authors register their review protocols prior to any data extraction [ 133 , 138 ]. The electronic records of most of these registry sites allow authors to update their protocols and facilitate transparent tracking of protocol changes, which are not unexpected during the progress of the review [ 139 ].

Options for protocol registration of evidence syntheses

 BMJ Open
 BioMed Central
 JMIR Research Protocols
 World Journal of Meta-analysis
 Cochrane
 JBI
 PROSPERO

 Research Registry - Registry of Systematic Reviews/Meta-Analyses

 International Platform of Registered Systematic Review and Meta-analysis Protocols (INPLASY)
 Center for Open Science
 Protocols.io
 Figshare
 Open Science Framework
 Zenodo

a Authors are advised to contact their target journal regarding submission of systematic review protocols

b Registration is restricted to approved review projects

c The JBI registry lists review projects currently underway by JBI-affiliated entities. These records include a review’s title, primary author, research question, and PICO elements. JBI recommends that authors register eligible protocols with PROSPERO

d See Pieper and Rombey [ 137 ] for detailed characteristics of these five registries

e See Pieper and Rombey [ 137 ] for other systematic review data repository options

Study design inclusion

For most systematic reviews, broad inclusion of study designs is recommended [ 126 ]. This may allow comparison of results between contrasting study design types [ 126 ]. Certain study designs may be considered preferable depending on the type of review and nature of the research question. However, prevailing stereotypes about what each study design does best may not be accurate. For example, in systematic reviews of interventions, randomized designs are typically thought to answer highly specific questions while non-randomized designs often are expected to reveal greater information about harms or real-world evidence [ 126 , 140 , 141 ]. This may be a false distinction; randomized trials may be pragmatic [ 142 ], they may offer important (and less biased) information on harms [ 143 ], and data from non-randomized trials may not necessarily be more real-world-oriented [ 144 ].

Moreover, there may not be any available evidence reported by RCTs for certain research questions; in some cases, there may not be any RCTs or NRSI. When the available evidence is limited to case reports and case series, it is not possible to test hypotheses nor provide descriptive estimates or associations; however, a systematic review of these studies can still offer important insights [ 81 , 145 ]. When authors anticipate that limited evidence of any kind may be available to inform their research questions, a scoping review can be considered. Alternatively, decisions regarding inclusion of indirect as opposed to direct evidence can be addressed during protocol development [ 146 ]. Including indirect evidence at an early stage of intervention systematic review development allows authors to decide if such studies offer any additional and/or different understanding of treatment effects for their population or comparison of interest. Issues of indirectness of included studies are accounted for later in the process, during determination of the overall certainty of evidence (see Part 5 for details).

Evidence search

Both AMSTAR-2 and ROBIS require systematic and comprehensive searches for evidence. This is essential for any systematic review. Both tools discourage search restrictions based on language and publication source. Given increasing globalism in health care, the practice of including English-only literature should be avoided [ 126 ]. There are many examples in which language bias (different results in studies published in different languages) has been documented [ 147 , 148 ]. This does not mean that all literature, in all languages, is equally trustworthy [ 148 ]; however, the only way to formally probe for the potential of such biases is to consider all languages in the initial search. The gray literature and a search of trial registries may also reveal important details about topics that would otherwise be missed [ 149 – 151 ]. Again, inclusiveness will allow review authors to investigate whether results differ in gray literature and trial registries [ 41 , 151 – 153 ].

Authors should make every attempt to complete their review within one year, as that is the likely viable life of a search. If that is not possible, the search should be updated close to the time of completion [ 154 ]. Some research topics warrant even less of a delay; in rapidly changing fields (as in the case of the COVID-19 pandemic), even one month may radically change the available evidence.

Excluded studies

AMSTAR-2 requires authors to provide references for any studies excluded at the full text phase of study selection along with reasons for exclusion; this allows readers to feel confident that all relevant literature has been considered for inclusion and that exclusions are defensible.

Risk of bias assessment of included studies

The design of the studies included in a systematic review (eg, RCT, cohort, case series) should not be equated with appraisal of their RoB. To meet AMSTAR-2 and ROBIS standards, systematic review authors must examine RoB issues specific to the design of each primary study they include as evidence. It is unlikely that a single RoB appraisal tool will be suitable for all research designs. In addition to tools for randomized and non-randomized studies, specific tools are available for evaluation of RoB in case reports and case series [ 82 ] and single-case experimental designs [ 155 , 156 ]. Note the RoB tools selected must meet the standards of the appraisal tool used to judge the conduct of the review. For example, AMSTAR-2 identifies four sources of bias specific to RCTs and NRSI that must be addressed by the RoB tool(s) chosen by the review authors. The Cochrane RoB 2 tool [ 157 ] for RCTs and ROBINS-I [ 158 ] for NRSI meet the AMSTAR-2 standards for RoB assessment. Appraisers on the review team should not modify any RoB tool without complete transparency and acknowledgment that they have invalidated the interpretation of the tool as intended by its developers [ 159 ]. Conduct of RoB assessments is not addressed by AMSTAR-2; to meet ROBIS standards, two independent reviewers should complete RoB assessments of included primary studies.

Implications of the RoB assessments must be explicitly discussed and considered in the conclusions of the review. Discussion of the overall RoB of included studies may consider the weight of the studies at high RoB, the importance of the sources of bias in the studies being summarized, and if their importance differs in relationship to the outcomes reported. If a meta-analysis is performed, serious concerns for RoB of individual studies should be accounted for in these results as well. If the results of the meta-analysis for a specific outcome change when studies at high RoB are excluded, readers will have a more accurate understanding of this body of evidence. However, while investigating the potential impact of specific biases is a useful exercise, it is important to avoid over-interpretation, especially when there are sparse data.

Synthesis methods for quantitative data

Syntheses of quantitative data reported by primary studies are broadly categorized as one of two types: meta-analysis, and synthesis without meta-analysis (Table 4.4 ). Before deciding on one of these methods, authors should seek methodological advice about whether reported data can be transformed or used in other ways to provide a consistent effect measure across studies [ 160 , 161 ].
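For instance, ratio measures (risk ratios, odds ratios) reported with 95% confidence intervals can usually be converted to the log scale, on which inverse-variance weighting operates. The following is a minimal Python sketch of that common transformation, offered only as an illustration; the study values are hypothetical.

```python
import math

def log_effect_and_se(point: float, lower: float, upper: float):
    """Convert a ratio effect estimate (e.g., a risk ratio) reported with a
    95% CI into a log-scale estimate and standard error, the form usually
    needed for inverse-variance meta-analysis."""
    log_point = math.log(point)
    # A 95% CI spans roughly 2 * 1.96 standard errors on the log scale.
    se = (math.log(upper) - math.log(lower)) / (2 * 1.959964)
    return log_point, se

# Hypothetical study reporting RR 0.80 (95% CI 0.65 to 0.98)
print(log_effect_and_se(0.80, 0.65, 0.98))
```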

Common methods for quantitative synthesis

Meta-analysis
  • Aggregate data and individual participant data c: a weighted average of effect estimates; results include pairwise comparisons of effect estimates with CI, the overall effect estimate with CI and P value, and an evaluation of heterogeneity; typically presented as a forest plot b with a summary statistic for the average effect estimate.
  • Network: the interventions are compared directly and indirectly, allowing comparisons of relative effects between any pair of interventions; results include effect estimates for intervention pairings, summary relative effects for pair-wise comparisons with evaluations of inconsistency and heterogeneity, and treatment rankings (ie, the probability that an intervention is among the best options); presented as a network diagram or graph, tabular presentations, forest plots or other methods, and a rankogram plot.

Synthesis without meta-analysis e
  • Summarizing effect estimates from separate studies (without combination that would provide an average effect estimate): the range and distribution of observed effects, such as median, interquartile range, and range; presented as box-and-whisker plots, bubble plots, or forest plots (without a summary effect estimate).
  • Combining P values: a combined P value and the number of studies; presented as an albatross plot (study sample size against P values per outcome).
  • Vote counting by direction of effect (eg, favors intervention over the comparator): the proportion of studies with an effect in the direction of interest, with CI and P value; presented as a harvest plot or effect direction plot.

CI confidence interval (or credible interval, if analysis is done in Bayesian framework)

a See text for descriptions of the types of data combined in each of these approaches

b See Additional File 4  for guidance on the structure and presentation of forest plots

c General approach is similar to aggregate data meta-analysis but there are substantial differences relating to data collection and checking and analysis [ 162 ]. This approach to syntheses is applicable to intervention, diagnostic, and prognostic systematic reviews [ 163 ]

d Examples include meta-regression, hierarchical and multivariate approaches [ 164 ]

e In-depth guidance and illustrations of these methods are provided in Chapter 12 of the Cochrane Handbook [ 160 ]

Meta-analysis

Systematic reviews that employ meta-analysis should not be referred to simply as “meta-analyses.” The term meta-analysis strictly refers to a specific statistical technique used when study effect estimates and their variances are available, yielding a quantitative summary of results. In general, methods for meta-analysis involve use of a weighted average of effect estimates from two or more studies. When applied carefully, meta-analysis increases the precision of the estimated magnitude of effect and can offer useful insights about heterogeneity among study estimates. We refer to standard references for a thorough introduction and formal training [ 165 – 167 ].
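To make the weighted-average idea concrete, the sketch below pools hypothetical log risk ratios with inverse-variance (fixed-effect) weights and computes Cochran’s Q and I² as simple heterogeneity summaries. It illustrates the general principle only and is not a substitute for the formal methods and software cited in this section.

```python
import math

def fixed_effect_meta(estimates, ses):
    """Inverse-variance weighted average of study effect estimates (on an
    additive scale such as log risk ratios), with Cochran's Q and I^2."""
    weights = [1 / se ** 2 for se in ses]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    se_pooled = math.sqrt(1 / sum(weights))
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    ci = (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)
    return pooled, ci, q, i2

# Three hypothetical studies: log risk ratios and their standard errors
print(fixed_effect_meta([-0.22, -0.11, -0.35], [0.10, 0.15, 0.20]))
```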

There are three common approaches to meta-analysis in current health care–related systematic reviews (Table 4.4 ). Aggregate data meta-analysis is the most familiar to authors of evidence syntheses and their end users. This standard meta-analysis combines data on effect estimates reported by studies that investigate similar research questions involving direct comparisons of an intervention and comparator. Results of these analyses provide a single summary intervention effect estimate. If the included studies in a systematic review measure an outcome differently, their reported results may be transformed to make them comparable [ 161 ]. Forest plots visually present essential information about the individual studies and the overall pooled analysis (see Additional File 4  for details).
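As a rough illustration of the forest plot format just described, the sketch below draws per-study estimates and a pooled estimate with matplotlib; all values are hypothetical and the layout is deliberately simplified compared with publication-quality plots.

```python
import matplotlib.pyplot as plt

# Hypothetical log risk ratios with 95% CIs, plus a pooled estimate
studies = ["Study A", "Study B", "Study C", "Pooled"]
estimates = [-0.22, -0.11, -0.35, -0.20]
lower = [-0.42, -0.40, -0.74, -0.34]
upper = [-0.02, 0.18, 0.04, -0.06]

y = list(range(len(studies)))
xerr = [[e - lo for e, lo in zip(estimates, lower)],
        [hi - e for e, hi in zip(estimates, upper)]]
plt.errorbar(estimates, y, xerr=xerr, fmt="s", color="black", capsize=3)
plt.axvline(0, linestyle="--", color="gray")  # line of no effect on the log scale
plt.yticks(y, studies)
plt.gca().invert_yaxis()
plt.xlabel("log risk ratio (95% CI)")
plt.title("Minimal forest plot sketch")
plt.tight_layout()
plt.show()
```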

Less familiar and more challenging meta-analytical approaches used in secondary research include individual participant data (IPD) and network meta-analyses (NMA); PRISMA extensions provide reporting guidelines for both [ 117 , 118 ]. In IPD, the raw data on each participant from each eligible study are re-analyzed as opposed to the study-level data analyzed in aggregate data meta-analyses [ 168 ]. This may offer advantages, including the potential for limiting concerns about bias and allowing more robust analyses [ 163 ]. As suggested by the description in Table 4.4 , NMA is a complex statistical approach. It combines aggregate data [ 169 ] or IPD [ 170 ] for effect estimates from direct and indirect comparisons reported in two or more studies of three or more interventions. This makes it a potentially powerful statistical tool; while multiple interventions are typically available to treat a condition, few have been evaluated in head-to-head trials [ 171 ]. Both IPD and NMA facilitate a broader scope, and potentially provide more reliable and/or detailed results; however, compared with standard aggregate data meta-analyses, their methods are more complicated, time-consuming, and resource-intensive, and they have their own biases, so one needs sufficient funding, technical expertise, and preparation to employ them successfully [ 41 , 172 , 173 ].

Several items in AMSTAR-2 and ROBIS address meta-analysis; thus, understanding the strengths, weaknesses, assumptions, and limitations of methods for meta-analyses is important. According to the standards of both tools, plans for a meta-analysis must be addressed in the review protocol, including reasoning, description of the type of quantitative data to be synthesized, and the methods planned for combining the data. This should not consist of stock statements describing conventional meta-analysis techniques; rather, authors are expected to anticipate issues specific to their research questions. Concern for the lack of training in meta-analysis methods among systematic review authors cannot be overstated. For those with training, the use of popular software (eg, RevMan [ 174 ], MetaXL [ 175 ], JBI SUMARI [ 176 ]) may facilitate exploration of these methods; however, such programs cannot substitute for the accurate interpretation of the results of meta-analyses, especially for more complex meta-analytical approaches.

Synthesis without meta-analysis

There are varied reasons a meta-analysis may not be appropriate or desirable [ 160 , 161 ]. Syntheses that informally use statistical methods other than meta-analysis are variably referred to as descriptive, narrative, or qualitative syntheses or summaries; these terms are also applied to syntheses that make no attempt to statistically combine data from individual studies. However, use of such imprecise terminology is discouraged; in order to fully explore the results of any type of synthesis, some narration or description is needed to supplement the data visually presented in tabular or graphic forms [ 63 , 177 ]. In addition, the term “qualitative synthesis” is easily confused with a synthesis of qualitative data in a qualitative or mixed methods review. “Synthesis without meta-analysis” is currently the preferred description of other ways to combine quantitative data from two or more studies. Use of this specific terminology when referring to these types of syntheses also implies the application of formal methods (Table 4.4 ).

Methods for syntheses without meta-analysis involve structured presentations of the data in any tables and plots. In comparison to narrative descriptions of each study, these are designed to more effectively and transparently show patterns and convey detailed information about the data; they also allow informal exploration of heterogeneity [ 178 ]. In addition, acceptable quantitative statistical methods (Table 4.4 ) are formally applied; however, it is important to recognize these methods have significant limitations for the interpretation of the effectiveness of an intervention [ 160 ]. Nevertheless, when meta-analysis is not possible, the application of these methods is less prone to bias compared with an unstructured narrative description of included studies [ 178 , 179 ].
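As a small, purely illustrative example of summarizing effects without pooling them, the sketch below reports the median, interquartile range, and range of hypothetical standardized effect estimates, the kind of distributional summary listed in Table 4.4.

```python
import statistics

# Hypothetical standardized effect estimates from six included studies
effects = [-0.45, -0.38, -0.30, -0.22, -0.12, 0.05]

q1, q2, q3 = statistics.quantiles(effects, n=4)  # quartiles
print(f"median = {q2:.2f}, IQR = {q1:.2f} to {q3:.2f}, "
      f"range = {min(effects):.2f} to {max(effects):.2f}")
```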

Vote counting is commonly used in systematic reviews and involves a tally of studies reporting results that meet some threshold of importance applied by review authors. Until recently, it has not typically been identified as a method for synthesis without meta-analysis. Guidance on an acceptable vote counting method based on direction of effect is currently available [ 160 ] and should be used instead of narrative descriptions of such results (eg, “more than half the studies showed improvement”; “only a few studies reported adverse effects”; “7 out of 10 studies favored the intervention”). Unacceptable methods include vote counting by statistical significance or magnitude of effect or some subjective rule applied by the authors.
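A minimal sketch of acceptable vote counting by direction of effect follows: it reports the proportion of studies favoring the intervention with a Wilson score confidence interval, using hypothetical counts and no statistical-significance threshold.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.959964):
    """Wilson score 95% CI for a proportion (no external dependencies)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return centre - half, centre + half

# Hypothetical vote count: 7 of 10 studies show an effect favoring the intervention
lo, hi = wilson_ci(7, 10)
print(f"proportion = {7 / 10:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```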

AMSTAR-2 and ROBIS standards do not explicitly address conduct of syntheses without meta-analysis, although AMSTAR-2 items 13 and 14 might be considered relevant. Guidance for the complete reporting of syntheses without meta-analysis for systematic reviews of interventions is available in the Synthesis without Meta-analysis (SWiM) guideline [ 180 ] and methodological guidance is available in the Cochrane Handbook [ 160 , 181 ].

Familiarity with AMSTAR-2 and ROBIS makes sense for authors of systematic reviews as these appraisal tools will be used to judge their work; however, training is necessary for authors to truly appreciate and apply methodological rigor. Moreover, judgment of the potential contribution of a systematic review to the current knowledge base goes beyond meeting the standards of AMSTAR-2 and ROBIS. These tools do not explicitly address some crucial concepts involved in the development of a systematic review; this further emphasizes the need for author training.

We recommend that systematic review authors incorporate specific practices or exercises when formulating a research question at the protocol stage. These should be designed to raise the review team’s awareness of how to prevent research and resource waste [ 84 , 130 ] and to stimulate careful contemplation of the scope of the review [ 30 ]. Authors’ training should also focus on justifiably choosing a formal method for the synthesis of quantitative and/or qualitative data from primary research; both types of data require specific expertise. For typical reviews that involve syntheses of quantitative data, statistical expertise is necessary, initially for decisions about appropriate methods [ 160 , 161 ] and then to inform any meta-analyses [ 167 ] or other statistical methods applied [ 160 ].

Part 5. Rating overall certainty of evidence

Report of an overall certainty of evidence assessment in a systematic review is an important new reporting standard of the updated PRISMA 2020 guidelines [ 93 ]. Systematic review authors are well acquainted with assessing RoB in individual primary studies, but much less familiar with assessment of overall certainty across an entire body of evidence. Yet a reliable way to evaluate this broader concept is now recognized as a vital part of interpreting the evidence.

Historical systems for rating evidence are based on study design and usually involve hierarchical levels or classes of evidence that use numbers and/or letters to designate the level/class. These systems were endorsed by various EBM-related organizations. Professional societies and regulatory groups then widely adopted them, often with modifications for application to the available primary research base in specific clinical areas. In 2002, a report issued by the AHRQ identified 40 systems to rate quality of a body of evidence [ 182 ]. A critical appraisal of systems used by prominent health care organizations published in 2004 revealed limitations in sensibility, reproducibility, applicability to different questions, and usability to different end users [ 183 ]. Persistent use of hierarchical rating schemes to describe overall quality continues to complicate the interpretation of evidence. This is indicated by recent reports of poor interpretability of systematic review results by readers [ 184 – 186 ] and misleading interpretations of the evidence related to the “spin” systematic review authors may put on their conclusions [ 50 , 187 ].

Recognition of the shortcomings of hierarchical rating systems raised concerns that misleading clinical recommendations could result even if based on a rigorous systematic review. In addition, the number and variability of these systems were considered obstacles to quick and accurate interpretations of the evidence by clinicians, patients, and policymakers [ 183 ]. These issues contributed to the development of the GRADE approach. An international working group, that continues to actively evaluate and refine it, first introduced GRADE in 2004 [ 188 ]. Currently more than 110 organizations from 19 countries around the world have endorsed or are using GRADE [ 189 ].

GRADE approach to rating overall certainty

GRADE offers a consistent and sensible approach for two separate processes: rating the overall certainty of a body of evidence and the strength of recommendations. The former is the expected conclusion of a systematic review, while the latter is pertinent to the development of CPGs. As such, GRADE provides a mechanism to bridge the gap from evidence synthesis to application of the evidence for informed clinical decision-making [ 27 , 190 ]. We briefly examine the GRADE approach but only as it applies to rating overall certainty of evidence in systematic reviews.

In GRADE, use of “certainty” of a body of evidence is preferred over the term “quality.” [ 191 ] Certainty refers to the level of confidence systematic review authors have that, for each outcome, an effect estimate represents the true effect. The GRADE approach to rating confidence in estimates begins with identifying the study type (RCT or NRSI) and then systematically considers criteria to rate the certainty of evidence up or down (Table 5.1 ).

GRADE criteria for rating certainty of evidence

  • Criteria for rating certainty down a: risk of bias, imprecision, inconsistency, indirectness, publication bias
  • Criteria for rating certainty up b: large magnitude of effect, dose–response gradient, all residual confounding would decrease the magnitude of effect (in situations with an effect)

a Applies to randomized studies

b Applies to non-randomized studies

This process results in assignment of one of the four GRADE certainty ratings to each outcome; these are clearly conveyed with the use of basic interpretation symbols (Table 5.2 ) [ 192 ]. Notably, when multiple outcomes are reported in a systematic review, each outcome is assigned a unique certainty rating; thus different levels of certainty may exist in the body of evidence being examined.

GRADE certainty ratings and their interpretation symbols a

 ⊕  ⊕  ⊕  ⊕ High: We are very confident that the true effect lies close to that of the estimate of the effect
 ⊕  ⊕  ⊕ Moderate: We are moderately confident in the effect estimate: the true effect is likely to be close to the estimate of the effect, but there is a possibility that it is substantially different
 ⊕  ⊕ Low: Our confidence in the effect estimate is limited: the true effect may be substantially different from the estimate of the effect
 ⊕ Very low: We have very little confidence in the effect estimate: the true effect is likely to be substantially different from the estimate of effect

a From the GRADE Handbook [ 192 ]

GRADE’s developers acknowledge some subjectivity is involved in this process [ 193 ]. In addition, they emphasize that both the criteria for rating evidence up and down (Table 5.1 ) as well as the four overall certainty ratings (Table 5.2 ) reflect a continuum as opposed to discrete categories [ 194 ]. Consequently, deciding whether a study falls above or below the threshold for rating up or down may not be straightforward, and preliminary overall certainty ratings may be intermediate (eg, between low and moderate). Thus, the proper application of GRADE requires systematic review authors to take an overall view of the body of evidence and explicitly describe the rationale for their final ratings.

Advantages of GRADE

Outcomes important to the individuals who experience the problem of interest maintain a prominent role throughout the GRADE process [ 191 ]. These outcomes must inform the research questions (eg, PICO [population, intervention, comparator, outcome]) that are specified a priori in a systematic review protocol. Evidence for these outcomes is then investigated and each critical or important outcome is ultimately assigned a certainty of evidence as the end point of the review. Notably, limitations of the included studies have an impact at the outcome level. Ultimately, the certainty ratings for each outcome reported in a systematic review are considered by guideline panels. They use a different process to formulate recommendations that involves assessment of the evidence across outcomes [ 201 ]. It is beyond our scope to describe the GRADE process for formulating recommendations; however, it is critical to understand how these two outcome-centric concepts of certainty of evidence in the GRADE framework are related and distinguished. An in-depth illustration using examples from recently published evidence syntheses and CPGs is provided in Additional File 5 A (Table AF5A-1).

The GRADE approach is applicable irrespective of whether the certainty of the primary research evidence is high or very low; in some circumstances, indirect evidence of higher certainty may be considered if direct evidence is unavailable or of low certainty [ 27 ]. In fact, most interventions and outcomes in medicine have low or very low certainty of evidence based on GRADE and there seems to be no major improvement over time [ 202 , 203 ]. This is still a very important (even if sobering) realization for calibrating our understanding of medical evidence. A major appeal of the GRADE approach is that it offers a common framework that enables authors of evidence syntheses to make complex judgments about evidence certainty and to convey these with unambiguous terminology. This prevents some common mistakes made by review authors, including overstating results (or under-reporting harms) [ 187 ] and making recommendations for treatment. This is illustrated in Table AF5A-2 (Additional File 5 A), which compares the concluding statements made about overall certainty in a systematic review with and without application of the GRADE approach.

Theoretically, application of GRADE should improve consistency of judgments about certainty of evidence, both between authors and across systematic reviews. In one empirical evaluation conducted by the GRADE Working Group, interrater reliability of two individual raters assessing certainty of the evidence for a specific outcome increased from ~ 0.3 without using GRADE to ~ 0.7 by using GRADE [ 204 ]. However, others report variable agreement among those experienced in GRADE assessments of evidence certainty [ 190 ]. Like any other tool, GRADE requires training in order to be properly applied. The intricacies of the GRADE approach and the necessary subjectivity involved suggest that improving agreement may require strict rules for its application; alternatively, use of general guidance and consensus among review authors may result in less consistency but provide important information for the end user [ 190 ].

GRADE caveats

Simply invoking “the GRADE approach” does not automatically ensure GRADE methods were employed by authors of a systematic review (or developers of a CPG). Table 5.3 lists the criteria the GRADE working group has established for this purpose. These criteria highlight the specific terminology and methods that apply to rating the certainty of evidence for outcomes reported in a systematic review [ 191 ], which is different from rating overall certainty across outcomes considered in the formulation of recommendations [ 205 ]. Modifications of standard GRADE methods and terminology are discouraged as these may detract from GRADE’s objectives to minimize conceptual confusion and maximize clear communication [ 206 ].

Criteria for using GRADE in a systematic review a

1. The certainty in the evidence (also known as quality of evidence or confidence in the estimates) should be defined consistently with the definitions used by the GRADE Working Group.
2. Explicit consideration should be given to each of the GRADE domains for assessing the certainty in the evidence (although different terminology may be used).
3. The overall certainty in the evidence should be assessed for each important outcome using four or three categories (such as high, moderate, low and/or very low) and definitions for each category that are consistent with the definitions used by the GRADE Working Group.
4. Evidence summaries … should be used as the basis for judgments about the certainty in the evidence.

a Adapted from the GRADE working group [ 206 ]; this list does not contain the additional criteria that apply to the development of a clinical practice guideline

Nevertheless, GRADE is prone to misapplications [ 207 , 208 ], which can distort a systematic review’s conclusions about the certainty of evidence. Systematic review authors without proper GRADE training are likely to misinterpret the terms “quality” and “grade” and to misunderstand the constructs assessed by GRADE versus other appraisal tools. For example, review authors may reference the standard GRADE certainty ratings (Table 5.2 ) to describe evidence for their outcome(s) of interest. However, these ratings are invalidated if authors omit or inadequately perform RoB evaluations of each included primary study. Such deficiencies in RoB assessments are unacceptable but not uncommon, as reported in methodological studies of systematic reviews and overviews [ 104 , 186 , 209 , 210 ]. GRADE ratings are also invalidated if review authors do not formally address and report on the other criteria (Table 5.1 ) necessary for a GRADE certainty rating.

Other caveats pertain to application of a GRADE certainty of evidence rating in various types of evidence syntheses. Current adaptations of GRADE are described in Additional File 5 B and included on Table 6.3 , which is introduced in the next section.

Concise Guide to best practices for evidence syntheses, version 1.0 a

The Concise Guide addresses nine types of evidence syntheses developed under Cochrane and/or JBI guidance, ranging from intervention and diagnostic test accuracy reviews through syntheses of qualitative evidence to overviews of reviews and scoping reviews. For each review type it identifies the recommended tools for each methodological construct:

  • Reporting: PRISMA-P for protocols of all review types; PRISMA 2020 for most completed reviews, with PRISMA-DTA for diagnostic test accuracy reviews, eMERGe or ENTREQ for qualitative evidence syntheses, PRIOR for overviews of reviews, and PRISMA-ScR for scoping reviews; SWiM when a synthesis is conducted without meta-analysis.
  • Risk of bias of included primary studies: Cochrane RoB 2 for RCTs and ROBINS-I for NRSI; for other primary research, QUADAS-2 (diagnostic accuracy studies), QUIPS (prognostic factor studies), PROBAST (prediction model studies), the CASP qualitative checklist or the JBI Critical Appraisal Checklist for qualitative research, the JBI checklist for studies reporting prevalence data, and the COSMIN RoB Checklist (measurement properties). For the systematic reviews included in overviews, AMSTAR-2 or ROBIS.
  • Certainty of evidence: GRADE for intervention reviews; GRADE adaptations for diagnostic test accuracy, prognostic, prevalence and incidence, and risk factor reviews; CERQual or ConQual for syntheses of qualitative evidence; not applicable to scoping reviews.
AMSTAR A MeaSurement Tool to Assess Systematic Reviews, CASP Critical Appraisal Skills Programme, CERQual Confidence in the Evidence from Reviews of Qualitative research, ConQual Establishing Confidence in the output of Qualitative research synthesis, COSMIN COnsensus-based Standards for the selection of health Measurement Instruments, DTA diagnostic test accuracy, eMERGe meta-ethnography reporting guidance, ENTREQ enhancing transparency in reporting the synthesis of qualitative research, GRADE Grading of Recommendations Assessment, Development and Evaluation, MA meta-analysis, NRSI non-randomized studies of interventions, P protocol, PRIOR Preferred Reporting Items for Overviews of Reviews, PRISMA Preferred Reporting Items for Systematic Reviews and Meta-Analyses, PROBAST Prediction model Risk Of Bias ASsessment Tool, QUADAS quality assessment of studies of diagnostic accuracy included in systematic reviews, QUIPS Quality In Prognosis Studies, RCT randomized controlled trial, RoB risk of bias, ROBINS-I Risk Of Bias In Non-randomised Studies of Interventions, ROBIS Risk of Bias in Systematic Reviews, ScR scoping review, SWiM systematic review without meta-analysis

a Superscript numbers represent citations provided in the main reference list. Additional File 6 lists links to available online resources for the methods and tools included in the Concise Guide

b The MECIR manual [ 30 ] provides Cochrane’s specific standards for both reporting and conduct of intervention systematic reviews and protocols

c Editorial and peer reviewers can evaluate completeness of reporting in submitted manuscripts using these tools. Authors may be required to submit a self-reported checklist for the applicable tools

d The decision flowchart described by Flemming and colleagues [ 223 ] is recommended for guidance on how to choose the best approach to reporting for qualitative reviews

e SWiM was developed for intervention studies reporting quantitative data. However, if there is not a more directly relevant reporting guideline, SWiM may prompt reviewers to consider the important details to report. (Personal Communication via email, Mhairi Campbell, 14 Dec 2022)

f JBI recommends their own tools for the critical appraisal of various quantitative primary study designs included in systematic reviews of intervention effectiveness, prevalence and incidence, and etiology and risk as well as for the critical appraisal of systematic reviews included in umbrella reviews. However, except for the JBI Checklists for studies reporting prevalence data and qualitative research, the development, validity, and reliability of these tools are not well documented

g Studies that are not RCTs or NRSI require tools developed specifically to evaluate their design features. Examples include single case experimental design [ 155 , 156 ] and case reports and series [ 82 ]

h The evaluation of methodological quality of studies included in a synthesis of qualitative research is debatable [ 224 ]. Authors may select a tool appropriate for the type of qualitative synthesis methodology employed. The CASP Qualitative Checklist [ 218 ] is an example of a published, commonly used tool that focuses on assessment of the methodological strengths and limitations of qualitative studies. The JBI Critical Appraisal Checklist for Qualitative Research [ 219 ] is recommended for reviews using a meta-aggregative approach

i Consider including risk of bias assessment of included studies if this information is relevant to the research question; however, scoping reviews do not include an assessment of the overall certainty of a body of evidence

j Guidance available from the GRADE working group [ 225 , 226 ]; also recommend consultation with the Cochrane diagnostic methods group

k Guidance available from the GRADE working group [ 227 ]; also recommend consultation with Cochrane prognostic methods group

l Used for syntheses in reviews with a meta-aggregative approach [ 224 ]

m Chapter 5 in the JBI Manual offers guidance on how to adapt GRADE to prevalence and incidence reviews [ 69 ]

n Janiaud and colleagues suggest criteria for evaluating evidence certainty for meta-analyses of non-randomized studies evaluating risk factors [ 228 ]

o The COSMIN user manual provides details on how to apply GRADE in systematic reviews of measurement properties [ 229 ]

The expected culmination of a systematic review should be a rating of overall certainty of a body of evidence for each outcome reported. The GRADE approach is recommended for making these judgments for outcomes reported in systematic reviews of interventions and can be adapted for other types of reviews. This represents the initial step in the process of making recommendations based on evidence syntheses. Peer reviewers should ensure authors meet the minimal criteria for supporting the GRADE approach when reviewing any evidence synthesis that reports certainty ratings derived using GRADE. Authors and peer reviewers of evidence syntheses unfamiliar with GRADE are encouraged to seek formal training and take advantage of the resources available on the GRADE website [ 211 , 212 ].

Part 6. Concise Guide to best practices

Accumulating data in recent years suggest that many evidence syntheses (with or without meta-analysis) are not reliable. This relates in part to the fact that their authors, who are often clinicians, can be overwhelmed by the plethora of ways to evaluate evidence. They tend to resort to familiar but often inadequate, inappropriate, or obsolete methods and tools and, as a result, produce unreliable reviews. These manuscripts may not be recognized as such by peer reviewers and journal editors who may disregard current standards. When such a systematic review is published or included in a CPG, clinicians and stakeholders tend to believe that it is trustworthy. A vicious cycle in which inadequate methodology is rewarded and potentially misleading conclusions are accepted is thus supported. There is no quick or easy way to break this cycle; however, increasing awareness of best practices among all these stakeholder groups, who often have minimal (if any) training in methodology, may begin to mitigate it. This is the rationale for inclusion of Parts 2 through 5 in this guidance document. These sections present core concepts and important methodological developments that inform current standards and recommendations. We conclude by taking a direct and practical approach.

Inconsistent and imprecise terminology used in the context of development and evaluation of evidence syntheses is problematic for authors, peer reviewers and editors, and may lead to the application of inappropriate methods and tools. In response, we endorse use of the basic terms (Table 6.1 ) defined in the PRISMA 2020 statement [ 93 ]. In addition, we have identified several problematic expressions and nomenclature. In Table 6.2 , we compile suggestions for preferred terms less likely to be misinterpreted.

Terms relevant to the reporting of health care–related evidence syntheses a

Systematic review: A review that uses explicit, systematic methods to collate and synthesize findings of studies that address a clearly formulated question.
Statistical synthesis: The combination of quantitative results of two or more studies. This encompasses meta-analysis of effect estimates and other methods, such as combining P values, calculating the range and distribution of observed effects, and vote counting based on the direction of effect.
Meta-analysis of effect estimates: A statistical technique used to synthesize results when study effect estimates and their variances are available, yielding a quantitative summary of results.
Outcome: An event or measurement collected for participants in a study (such as quality of life, mortality).
Result: The combination of a point estimate (such as a mean difference, risk ratio or proportion) and a measure of its precision (such as a confidence/credible interval) for a particular outcome.
Report: A document (paper or electronic) supplying information about a particular study. It could be a journal article, preprint, conference abstract, study register entry, clinical study report, dissertation, unpublished manuscript, government report, or any other document providing relevant information.
Record: The title or abstract (or both) of a report indexed in a database or website (such as a title or abstract for an article indexed in Medline). Records that refer to the same report (such as the same journal article) are “duplicates”; however, records that refer to reports that are merely similar (such as a similar abstract submitted to two different conferences) should be considered unique.
Study: An investigation, such as a clinical trial, that includes a defined group of participants and one or more interventions and outcomes. A “study” might have multiple reports. For example, reports could include the protocol, statistical analysis plan, baseline characteristics, results for the primary outcome, results for harms, results for secondary outcomes, and results for additional mediator and moderator analyses.

a Reproduced from Page and colleagues [ 93 ]

Terminology suggestions for health care–related evidence syntheses

Preferred terms and potentially problematic alternatives:

  • Prefer “evidence synthesis with meta-analysis” or “systematic review with meta-analysis” to “meta-analysis” alone
  • Prefer “overview” or “umbrella review” to “systematic review of systematic reviews,” “review of reviews,” or “meta-review”
  • Prefer “randomized” to “experimental”
  • Prefer “non-randomized” to “observational”
  • Prefer “single case experimental design” to “single-subject research” or “N-of-1 design”
  • Prefer “case report” or “case series” to “descriptive study”
  • Prefer “methodological quality” to “quality”
  • Prefer “certainty of evidence” to “quality of evidence,” “grade of evidence,” “level of evidence,” or “strength of evidence”
  • Prefer “qualitative systematic review” or “synthesis of qualitative data” a to “qualitative synthesis”
  • Prefer “synthesis without meta-analysis” to “narrative synthesis,” b “narrative summary,” “qualitative synthesis,” “descriptive synthesis,” or “descriptive summary”

a For example, meta-aggregation, meta-ethnography, critical interpretative synthesis, realist synthesis

b This term may best apply to the synthesis in a mixed methods systematic review in which data from different types of evidence (eg, qualitative, quantitative, economic) are summarized [ 64 ]

We also propose a Concise Guide (Table 6.3 ) that summarizes the methods and tools recommended for the development and evaluation of nine types of evidence syntheses. Suggestions for specific tools are based on the rigor of their development as well as the availability of detailed guidance from their developers to ensure their proper application. The formatting of the Concise Guide addresses a well-known source of confusion by clearly distinguishing the underlying methodological constructs that these tools were designed to assess. Important clarifications and explanations follow in the guide’s footnotes; associated websites, if available, are listed in Additional File 6 .

To encourage uptake of best practices, journal editors may consider adopting or adapting the Concise Guide in their instructions to authors and peer reviewers of evidence syntheses. Given the evolving nature of evidence synthesis methodology, the suggested methods and tools are likely to require regular updates. Authors of evidence syntheses should monitor the literature to ensure they are employing current methods and tools. Some types of evidence syntheses (eg, rapid, economic, methodological) are not included in the Concise Guide; for these, authors are advised to obtain recommendations for acceptable methods by consulting with their target journal.

We encourage the appropriate and informed use of the methods and tools discussed throughout this commentary and summarized in the Concise Guide (Table 6.3 ). However, we caution against their application in a perfunctory or superficial fashion. This is a common pitfall among authors of evidence syntheses, especially as the standards of such tools become associated with acceptance of a manuscript by a journal. Consequently, published evidence syntheses may show improved adherence to the requirements of these tools without necessarily making genuine improvements in their performance.

In line with our main objective, the suggested tools in the Concise Guide address the reliability of evidence syntheses; however, we recognize that the utility of systematic reviews is an equally important concern. An unbiased and thoroughly reported evidence synthesis may still not be highly informative if the evidence itself that is summarized is sparse, weak and/or biased [ 24 ]. Many intervention systematic reviews, including those developed by Cochrane [ 203 ] and those applying GRADE [ 202 ], ultimately find no evidence, or find the evidence to be inconclusive (eg, “weak,” “mixed,” or of “low certainty”). This often reflects the primary research base; however, it is important to know what is known (or not known) about a topic when considering an intervention for patients and discussing treatment options with them.

Alternatively, the frequency of “empty” and inconclusive reviews published in the medical literature may relate to limitations of conventional methods that focus on hypothesis testing; these have emphasized the importance of statistical significance in primary research and effect sizes from aggregate meta-analyses [ 183 ]. It is becoming increasingly apparent that this approach may not be appropriate for all topics [ 130 ]. Development of the GRADE approach has facilitated a better understanding of significant factors (beyond effect size) that contribute to the overall certainty of evidence. Other notable responses include the development of integrative synthesis methods for the evaluation of complex interventions [ 230 , 231 ], the incorporation of crowdsourcing and machine learning into systematic review workflows (eg, the Cochrane Evidence Pipeline) [ 2 ], the shift in paradigm to living systematic review and NMA platforms [ 232 , 233 ], and the proposal of a new evidence ecosystem that fosters bidirectional collaborations and interactions among a global network of evidence synthesis stakeholders [ 234 ]. These evolutions in data sources and methods may ultimately make evidence syntheses more streamlined, less duplicative, and, more importantly, more useful for timely policy and clinical decision-making; however, that will only be the case if they are rigorously reported and conducted.

We look forward to others’ ideas and proposals for the advancement of methods for evidence syntheses. For now, we encourage dissemination and uptake of the currently accepted best tools and practices for their development and evaluation; at the same time, we stress that uptake of appraisal tools, checklists, and software programs cannot substitute for proper education in the methodology of evidence syntheses and meta-analysis. Authors, peer reviewers, and editors must strive to make accurate and reliable contributions to the present evidence knowledge base; online alerts, upcoming technology, and accessible education may make this more feasible than ever before. Our intention is to improve the trustworthiness of evidence syntheses across disciplines, topics, and types of evidence syntheses. All of us must continue to study, teach, and act cooperatively for that to happen.

Acknowledgements

Michelle Oakman Hayes for her assistance with the graphics, Mike Clarke for his willingness to answer our seemingly arbitrary questions, and Bernard Dan for his encouragement of this project.

Authors’ contributions

All authors participated in the development of the ideas, writing, and review of this manuscript. The author(s) read and approved the final manuscript.

The work of John Ioannidis has been supported by an unrestricted gift from Sue and Bob O’Donnell to Stanford University.

Declarations

The authors declare no competing interests.

This article has been published simultaneously in BMC Systematic Reviews, Acta Anaesthesiologica Scandinavica, BMC Infectious Diseases, British Journal of Pharmacology, JBI Evidence Synthesis, the Journal of Bone and Joint Surgery Reviews, and the Journal of Pediatric Rehabilitation Medicine.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Data extraction


Data extraction is the process of collecting relevant information about the findings and characteristics of each study included in a systematic review. This information is usually collected in a data extraction form, the elements of which depend on the review question. This can be done, for example, in a spreadsheet; Covidence also has built-in extraction tools.

Data can mean any information from a study including:

  • Participants
  • Interventions
  • Outcome measures

For more details, see Chapter 5 of the Cochrane Handbook, Section 5.3 'What data to collect'.
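As an illustration only, the sketch below sets up a spreadsheet-style extraction form as a CSV file; the field names are hypothetical examples rather than a prescribed standard and should be adapted to the review question and protocol.

```python
import csv

# Illustrative (not prescriptive) fields for a data extraction form;
# adapt the columns to the review question and protocol.
FIELDS = [
    "study_id", "first_author", "year", "country", "study_design",
    "participants_n", "population_description",
    "intervention", "comparator", "outcome_measures",
    "results_summary", "funding_source", "notes",
]

def create_extraction_form(path="extraction_form.csv"):
    """Write an empty data extraction form with one row per included study."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()

def add_study(path, record):
    """Append one study's extracted data; missing fields are left blank."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writerow({k: record.get(k, "") for k in FIELDS})

if __name__ == "__main__":
    create_extraction_form()
    add_study("extraction_form.csv", {
        "study_id": "S001", "first_author": "Smith", "year": 2021,
        "study_design": "RCT", "participants_n": 120,
        "intervention": "Fortified bread", "comparator": "Standard bread",
        "outcome_measures": "Serum folate at 12 weeks",
    })
```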

Useful references

Moon K, Rao S. Data Extraction from Included Studies . In: Patole S, editor. Principles and Practice of Systematic Reviews and Meta-Analysis. Cham: Springer International Publishing; 2021. p. 65-71.

Li T, Higgins JPT, Deeks JJ, editors. Collecting data . In: Higgins J, Thomas J, Chandler J, Cumpston M, Li T, Page MJ et al, editors. Cochrane handbook for systematic reviews of interventions, Cochrane; 2022.

Synthesis and analysis

The final part of the systematic review is to combine the results to answer the research question. This may be done with a quantitative method using a statistical approach, such as a meta-analysis, or it may rely on other methods of synthesis, such as those used for qualitative topics (for example, meta-ethnography).

The final combination of results will be dependent on the nature of the question and the quality and homogeneity of the research.

Types of synthesis:

Quantitative

  • Meta-analysis

Qualitative

  • Meta-ethnography
  • Thematic synthesis
  • Grounded theory
  • Content analysis
  • Qualitative comparative analysis

NVivo software may be helpful for systematic reviews with qualitative data. Our Using NVivo in systematic reviews library guide has more information.
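Returning to the quantitative route, the sketch below shows minimal fixed-effect, inverse-variance pooling using made-up effect estimates. A real meta-analysis would normally be run in dedicated software (for example RevMan, Stata, or an R or Python statistics package) and would also assess heterogeneity and, where appropriate, use a random-effects model.

```python
import math

# Hypothetical per-study effect estimates (e.g., mean differences) and
# standard errors; values are illustrative only.
studies = [
    ("Study A", 0.30, 0.12),
    ("Study B", 0.10, 0.15),
    ("Study C", 0.25, 0.10),
]

# Fixed-effect inverse-variance pooling: weight each study by 1/SE^2.
weights = [1 / se ** 2 for _, _, se in studies]
pooled = sum(w * est for (_, est, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

ci_low, ci_high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"Pooled effect: {pooled:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```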

References:

Crombie IK, Davies HT. What is meta-analysis?  [Internet]. Bandolier.org.uk; 2009.

Barnett-Page E, Thomas J. Methods for the synthesis of qualitative research: a critical review. BMC Med Res Methodol. 2009;9:59.

Bibliometric analysis

A bibliometric review is a particular type of systematic review which aims to analyse the bibliometric characteristics of a large set of literature on a particular topic. The characteristics analysed might include such things as the overall volume and growth of the literature, the geographical distribution, authorship and collaboration patterns, etc. See more information on bibliometric reviews here, and note that typical systematic and scoping reviews don't do this type of analysis.

Bibliometric analysis can be undertaken using software packages such as:

Bibliometrix -  a freely available R package

UQ Library does not provide support for these software packages. Note also that there are limitations on how much data can be exported from traditional academic databases, so you will need to consider the source of your data in planning your project.  
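By way of illustration, the sketch below computes two simple bibliometric tallies (publications per year and most frequent authors) from a CSV export of database records; the column names "Year" and "Authors" are assumptions and would need to match the export format of whichever database you use.

```python
from collections import Counter
import csv

# Minimal bibliometric tallies from a database export saved as CSV.
# Column names ("Year", "Authors") are assumptions; adjust them to match
# the export format of the source database.
def bibliometric_summary(path):
    years, authors = Counter(), Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row.get("Year"):
                years[row["Year"]] += 1
            for name in row.get("Authors", "").split(";"):
                name = name.strip()
                if name:
                    authors[name] += 1
    return years, authors

if __name__ == "__main__":
    years, authors = bibliometric_summary("export.csv")
    print("Publications per year:", dict(sorted(years.items())))
    print("Most prolific authors:", authors.most_common(5))
```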

AI Resources for Systematic Reviews: Outlining the Benefits of AI and Things to Consider

This post covers search development, citation discovery and management, screening and data extraction, appraisal and synthesis, multipurpose tools, limitations, and conclusions.

By Annie Wescott, Research Librarian

Systematic reviews are known for being both rigorous and time intensive. Artificial intelligence (AI) tools, which can be employed to perform and streamline tasks, have the potential to expedite elements of the review process. There is a balance to be had, however, as AI tools introduce complex limitations that must be considered before introducing them to the evidence synthesis process. The following AI tools have the potential to simplify, aid or streamline steps in the systematic review process. The tools and resources mentioned here are just a sampling of AI resources available. Additional tools, resources and considerations are covered in Galter Library’s AI Resources for Literature Reviews GalterGuide developed by librarian Q. Eileen Wafford. New AI tools continue to be released, and it is always worth checking for new tools that may support the systematic review process.

Several free tools are available to streamline the comprehensive search development process. Generative AI tools like Google Gemini and ChatGPT  can be used to enhance searches by providing candidate terms and building search strings. Tools like WordFreq and PubReMiner aid search development through word frequency analysis of selected relevant citations. Additionally, tools like Polyglot Search Translator can support the search translation process by adapting search syntax and formats for various databases. 
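As a toy illustration of the word-frequency approach used by tools like PubReMiner, the sketch below counts frequent terms in a couple of made-up abstracts from known relevant records; the abstracts and stop-word list are placeholders, and candidate terms would still be reviewed by hand before entering the search strategy.

```python
import re
from collections import Counter

# Illustrative stop words and abstracts only; real runs would use the
# titles/abstracts of a set of known relevant records.
STOP_WORDS = {"the", "and", "of", "in", "to", "a", "with", "for", "on", "was", "were"}

abstracts = [
    "Probiotic supplementation and gut microbiota diversity in adults.",
    "Effects of fermented dairy intake on gut microbiota composition.",
]

counts = Counter(
    word
    for text in abstracts
    for word in re.findall(r"[a-z]+", text.lower())
    if word not in STOP_WORDS and len(word) > 3
)

# Frequent terms are candidate search terms, to be reviewed by the team.
print(counts.most_common(10))
```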

Generative AI tools require strong prompts to maximize their impact. Experts suggest clearly defining your question with concise language. You should include any relevant context in the prompt, and break down the more complex elements into smaller, direct prompts. Providing the AI tool with a persona and specifying the desired style and format of the results can create stronger outputs.
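A hypothetical prompt following this advice might look like the sketch below; the scenario, wording, and requested format are illustrative only, not a recommended template.

```python
# Example prompt skeleton reflecting the advice above: persona, context,
# a concise task, and an explicit output format. All details are placeholders.
prompt = """You are a medical librarian experienced in systematic review searching.
Context: we are reviewing probiotic interventions for irritable bowel syndrome in adults.
Task: suggest synonyms, controlled-vocabulary terms, and free-text keywords for the concept "probiotics".
Format: return a bulleted list grouped by concept, with no explanatory prose."""
print(prompt)
```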

To enhance efficiency in the citation discovery and management stages of the systematic review process there are tools like LitSuggest and Scite , which employ machine learning to generate citation recommendations, and others that use AI-driven search and discovery like Semantic Scholar .

Screening records is one of the most time intensive steps of the systematic review process. AI assisted screening tools include those that detect discrepancies between reviewers like Disputatron , those that employ active learning techniques to assist in the screening process like ASReview , or those that index citations across multiple fields and simplify discovery like Semantic Scholar .
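The sketch below shows, under simplifying assumptions, the core idea behind active-learning screening: a classifier is fitted to the records screened so far, and the unscreened record the model considers most likely to be relevant is surfaced next. It uses scikit-learn with a handful of made-up titles and is not a substitute for a purpose-built tool such as ASReview.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Illustrative titles; the first two have been screened by a human reviewer.
titles = [
    "Probiotics for irritable bowel syndrome: a randomised trial",   # screened: include
    "Steel corrosion in marine environments",                        # screened: exclude
    "Fermented foods and gut health in older adults",                # unscreened
    "Bridge load testing under winter conditions",                   # unscreened
]
labels = [1, 0]          # labels exist only for the screened records
screened = [0, 1]
unscreened = [2, 3]

vec = TfidfVectorizer()
X = vec.fit_transform(titles)

# Fit on screened records, then score the unscreened ones.
model = LogisticRegression().fit(X[screened], labels)
scores = model.predict_proba(X[unscreened])[:, 1]   # probability of "include"

# Present the most promising unscreened record to the human reviewer next.
next_idx = unscreened[int(np.argmax(scores))]
print("Screen next:", titles[next_idx])
```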

At the critical appraisal stage, AI tools can aid in the creation of elements like forest plots and risk-of-bias tables. RevMan is one such tool from the Cochrane Collaboration, which can be used for a fee. OpenMeta is an open-source platform that can be used to support the meta-analysis process.

There are several tools that aim to support multiple steps in the systematic review process. Covidence , Rayyan , and DistillerSR support teams in some manner through every stage of the systematic review process. In addition to other features, these tools perform automatic deduplication, simplify screening and create charts that may be used for reporting. Tools like Systematic Review Accelerator offer a suite of tools that may be used for one or more stages of the systematic review process.
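To illustrate what automatic deduplication involves, the sketch below flags near-identical titles retrieved from different databases using simple string similarity; the records and the 0.9 similarity threshold are arbitrary examples, and dedicated tools use more sophisticated matching on titles, authors, DOIs, and journal data.

```python
from difflib import SequenceMatcher

# A rough deduplication pass over retrieved records, similar in spirit to the
# automatic deduplication offered by review-management tools; the threshold
# of 0.9 is an arbitrary illustration, not a recommended setting.
records = [
    {"title": "Probiotics and gut health: a systematic review", "source": "Medline"},
    {"title": "Probiotics and gut health - a systematic review.", "source": "Embase"},
    {"title": "Vitamin D supplementation in older adults", "source": "Scopus"},
]

def normalise(title):
    """Lowercase and strip punctuation so trivial differences do not block a match."""
    return "".join(ch for ch in title.lower() if ch.isalnum() or ch.isspace()).strip()

unique = []
for rec in records:
    is_dup = any(
        SequenceMatcher(None, normalise(rec["title"]), normalise(kept["title"])).ratio() > 0.9
        for kept in unique
    )
    if not is_dup:
        unique.append(rec)

print(f"{len(records) - len(unique)} duplicate(s) removed; {len(unique)} unique records kept")
```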

Systematic reviews play an important role in patient care and decision making. It is important that the systematic review process is done with intention and rigor, to ensure reliability and quality in the final output. While these AI tools hold promise in streamlining the process, thereby synthesizing evidence for clinical use more efficiently, one must be aware of their limitations. It is important to note that AI tools often focus on pattern recognition and cannot read context in the questions asked. The review process still requires a human element. Additionally, because these tools are trained on existing data, the results are vulnerable to the biases present in that data. Outside of the ethical considerations of bias, accountability, and transparency in results, AI tools can also require an initial commitment of time and resources to learn them properly. Several other limitations should be considered when introducing AI to the systematic review process.

There is great potential for added efficiency in the systematic review process as more AI tools emerge or are improved. With the promise of more efficiency comes risks and limitations that should always be considered at the outset of a systematic review. Researchers should be sure to explore their AI tool options, maintain human oversight and validation, and always cite the tools they use in the systematic review process.

Blaizot A, Veettil SK, Saidoung P, et al. Using artificial intelligence methods for systematic review in health sciences: A systematic review . Res Synth Methods. 2022;13(3):353-362. doi:10.1002/jrsm.1553 

Fabiano N, Gupta A, Bhambra N, et al. How to optimize the systematic review process using AI tools . JCPP Adv. 2024;4(2):e12234. Published 2024 Apr 23. doi:10.1002/jcv2.12234 

van Dijk SHB, Brusse-Keizer MGJ, Bucsán CC, van der Palen J, Doggen CJM, Lenferink A. Artificial intelligence in systematic reviews: promising when appropriately used . BMJ Open. 2023;13(7):e072254. Published 2023 Jul 7. doi:10.1136/bmjopen-2023-072254 

Wafford, QE. AI resources for literature reviews . GalterGuides. Published 2024.

Updated: September 11, 2024


Physicians’ perspectives on clinical indicators: systematic review and thematic synthesis


Ana Renker-Darby, Shanthi Ameratunga, Peter Jones, Corina Grey, Matire Harwood, Roshini Peiris-John, Timothy Tenbensel, Sue Wells, Vanessa Selak, Physicians’ perspectives on clinical indicators: systematic review and thematic synthesis, International Journal for Quality in Health Care , Volume 36, Issue 3, 2024, mzae082, https://doi.org/10.1093/intqhc/mzae082


Clinical indicators are increasingly used to improve the quality of care, particularly with the emergence of ‘big data’, but physicians’ views regarding their utility in practice are unclear. We reviewed the published literature investigating physicians’ perspectives, focusing on the following objectives in relation to quality improvement: (1) the role of clinical indicators, (2) what is needed to strengthen them, (3) their key attributes, and (4) the best tool(s) for assessing their quality. A systematic literature search (up to November 2022) was carried out using Medline, EMBASE, Scopus, CINAHL, PsycInfo, and Web of Science. Articles that met all of the following inclusion criteria were included: reported on physicians’ perspectives on clinical indicators and/or tools for assessing the quality of clinical indicators, addressing at least one of the four review objectives; the clinical indicators related to care at least partially delivered by physicians; and published in a peer-reviewed journal. Data extracted from eligible studies were appraised using the Critical Appraisal Skills Programme tool. A thematic synthesis of data was conducted using NVivo software. Descriptive themes were inductively derived from codes, which were grouped into analytical themes answering each objective. A total of 14 studies were included, with 17 analytical themes identified for objectives 1–3 and no data identified for objective 4. Results showed that indicators can play an important motivating role for physicians to improve the quality of care and show where changes need to be made. For indicators to be effective, physicians should be involved in indicator development, recording relevant data should be straightforward, indicator feedback must be meaningful to physicians, and clinical teams need to be adequately resourced to act on findings. Effective indicators need to focus on the most important areas for quality improvement, be consistent with good medical care, and measure aspects of care within the control of physicians. Studies cautioned against using indicators primarily as punitive measures, and there were concerns that an overreliance on indicators can lead to a narrowed perspective on quality of care. This review identifies facilitators and barriers to meaningfully engaging physicians in developing and using clinical indicators to improve the quality of healthcare.

Clinical indicators are measures designed to assess and improve the quality of health services. When they seek to support quality improvement efforts by clinicians, it is critical to meaningfully engage with clinicians in the development and monitoring of these indicators [ 1 ]. Previous research has found that among clinicians, physicians may be resistant to clinical indicator initiatives, particularly when indicators are designed for the purpose of accountability rather than quality improvement [ 2 ]. If physicians are more engaged with indicator development and use, they are more likely to accept and act upon findings from those indicators [ 1 ].

Traditionally, there has been a focus on manual audits of patient records as a mechanism for supporting quality improvement. The availability of ‘big data’ (high volumes of diverse electronic data [ 3 ]) has the potential to revolutionize clinical engagement with quality improvement activities [ 4 ]. Rather than undertaking static, intermittent audits of a subset of patients, big data makes it possible to continuously generate electronic data on clinical indicators to improve performance [ 5 ]. Further, techniques such as natural language processing can enhance the clinical relevance of identifiable cohorts by enabling free text as well as structured data to be interrogated systematically [ 6 , 7 ]. To maximize the extent to which these advances translate into improvements in the quality of care, it is critical that, where indicators are designed to support quality improvement efforts by physicians, meaningful clinical engagement with physicians is obtained in the development and monitoring of clinical indicators.

Previously Jones et al . developed a Quality Indicator Clinical Appraisal (QICA) tool to appraise indicators based on key attributes identified through a systematic review and survey of quality of care experts [ 8 ]. The QICA tool ‘provides an explicit basis for discussions around indicator selection’ [ 8 ]. However, ensuring that physicians will use and act on clinical indicator data also requires consideration of physicians’ perspectives. The objectives of this study were to determine physicians’ perspectives regarding (1) the role indicators play in supporting quality improvement, (2) what is needed to strengthen the ability of indicators to drive improvements in quality, (3) the ‘key’ attributes of an effective indicator, and (4) the best tool(s) for assessing the quality of indicators.

The systematic review protocol was registered with the International Prospective Register of Systematic Reviews (PROSPERO CRD42020152496).

Search strategy

A systematic literature search was carried out using Medline, EMBASE, Scopus, Cochrane, CINAHL, PsycInfo, and Web of Science (searched up to November 2022). The search strategies are provided in Supplementary Appendix S1 . One reviewer (A.R.) screened all (deduplicated) titles and abstracts using the inclusion and exclusion criteria. Full texts were then screened by A.R., and included texts were discussed with a co-author (V.S.).

Inclusion and exclusion criteria

Articles were included if they: reported on data from physicians, either as a group or subgroup; focused on clinical indicators and/or tools for assessing the quality of clinical indicators; reported on physicians’ perspectives on clinical indicators and/or tools for assessing the quality of clinical indicators; related to clinical care at least partially delivered by physicians; were published in a peer-reviewed journal; and addressed at least one of the four objectives from the perspective of physicians.

Articles were excluded if they: reported on data from health professionals, patients and/or family members without any physicians or without separate reporting for physicians; focused on evaluating quality of care, clinical guidelines, models of care or diagnostic criteria; were in a language other than English; were editorial or opinion pieces; or had insufficient data to adequately interpret the results. No time restrictions or methodological restrictions were applied.

Quality appraisal

As all included citations used qualitative methodologies, we selected the 10-item Critical Appraisal Skills Programme (CASP) qualitative studies tool [9] to assess the quality of each citation and enable it to be factored into the review. A.R. reviewed the studies using the CASP checklist and discussed findings with V.S. until consensus was reached.

Data extraction and thematic synthesis

The method of thematic synthesis was adapted from Thomas and Harden [ 10 ] and guided by that detailed by Braun and Clarke [ 11 ]. A.R. and V.S. independently screened the results and discussion sections of all included articles to extract data consistent with the inclusion and exclusion criteria for this review (including direct quotes from study participants and author interpretations). Differences between A.R. and V.S. in data inclusion decisions were discussed until consensus was reached. A.R. coded each sentence or phrase with one or more codes using NVivo software [ 11 ]. Descriptive themes were inductively derived from codes, which were then grouped into analytical themes answering each of the objectives. Codes were renamed iteratively where relevant or combined if they addressed similar ideas. Themes were discussed between V.S. and A.R. until consensus was reached, and then reviewed and approved by all authors.

Ethics approval was not required as this study is a systematic review of published literature.

Study selection

From 4620 initial citations, a total of 14 studies were included in the review (Fig. 1). These papers were published between 2000 and 2022 and were based in the United Kingdom, United States, Canada, China, and Germany, in teaching hospitals, primary care groups, and an ambulatory care organization (Table 1).

[Figure 1. PRISMA flow diagram of study selection: 4,617 records identified through database searching and 3 through other sources; 2,200 records after deduplication; 1,809 excluded at title and abstract screening; 391 full texts assessed for eligibility, of which 377 were excluded; 14 studies included in the qualitative synthesis.]

Summary of included studies.

  • Ahmed et al. (2019). Aim: to explore the views of clinician–scientists and quality improvement experts regarding proposed domains of PCC, and to gain an understanding of current practices and opportunities for measurement of PCC at a healthcare system level. Method: semi-structured interviews (n = 16). Setting: Canada, USA, UK. Participants: clinician–scientists (n = 4), quality improvement experts (n = 12).
  • Benn et al. (2015), Chapter 6: Qualitative Evaluation.* Aim: to conduct a quasi-experimental evaluation of the feedback initiative and its effect on quality of anaesthetic care and perioperative efficiency. Method: interviews (n = 35). Setting: teaching hospital in London, UK. Participants: consultant anaesthetists (n = 24), surgical nursing leads (n = 6), perioperative service leads (n = 5).
  • Breidenbach et al. (2021). Aim: to identify factors that inhibit or facilitate the usage of PROs for clinical decision-making and monitoring patients in existing structures for oncological care, certified colorectal cancer centres in Germany. Method: semi-structured interviews (n = 12). Setting: cancer centres participating in the EDIUM study in Germany. Participants: physicians (n = 7), psycho-oncologist (n = 1), nurses (n = 3), physician assistant (n = 1).
  • D’Lima et al. (2017).* Aim: to report the experience of anaesthetists participating in a long-term initiative to provide comprehensive personalized feedback to consultants on patient-reported quality of recovery indicators in a large London teaching hospital. Method: semi-structured interviews (n = 21). Setting: teaching hospital in London, UK. Participants: consultant anaesthetists (n = 13), surgical nursing leads (n = 6), theatre manager (n = 1), clinical coordinator for recovery (n = 1).
  • Exworthy et al. (2003). Aim: to review qualitative findings from an empirical study within one English primary care group on the response to a set of clinical performance indicators relating to general practitioners in terms of the effect upon their clinical autonomy. Method: semi-structured interviews (n = 52). Setting: primary care group in southern England, UK. Participants: GPs (n = 29), practice nurses (n = 12), practice managers (n = 11).
  • Gagliardi et al. (2008). Aim: to explore patient, nurse, physician, and manager preferences for cancer care quality indicators. Method: interviews (n = 30). Setting: two teaching hospitals, Canada. Participants: surgeons (n = 2), radiation oncologists (n = 2), medical oncologist (n = 1), nurses (n = 5), managers (n = 5), patients (n = 15).
  • Gill et al. (2012). Aim: to explore the perspectives of general practitioners on the introduction of child-specific quality markers to the UK’s Quality Outcomes Framework. Method: semi-structured interviews (n = 20). Setting: five Primary Care Trusts, England. Participants: GPs (n = 20).
  • Gray et al. (2018). Aim: to explore the role that metrics and measurement play in a wide-reaching ‘Lean’-based continuous quality improvement effort carried out in the primary care departments of a large, ambulatory care healthcare organization. Method: semi-structured interviews (n = 130). Setting: large, multispecialty, ambulatory care organization, USA. Participants: primary care physicians (number of participants not disclosed).
  • Hicks et al. (2021). Aim: to identify all available patient-reported outcome measures relevant to diseases treated by vascular surgeons and to evaluate vascular surgeon perceptions, barriers to widespread implementation, and concerns regarding PROs. Method: focus groups (number of focus groups not disclosed). Setting: Society for Vascular Surgery, USA. Participants: Society for Vascular Surgery members (number of participants not disclosed).
  • Litvin et al. (2015). Aim: to systematically solicit recommendations from Meaningful Use exemplars to inform Stage 3 Meaningful Use clinical quality measure requirements. Method: focus groups (n = 3). Setting: a national Electronic Health Record-based primary care practice-based research network, USA. Participants: general internists (n = 5), internal medicine/paediatric physicians (n = 2), family medicine physicians (n = 16).
  • Maxwell et al. (2002). Aim: to investigate the acceptability among general practitioners of a patient-completed post-consultation measure of outcome and its use in conjunction with two further quality indicators: time spent in consultation and patients reporting knowing the doctor well. Method: focus groups (n = 7). Setting: Oxford, Coventry, London, and Edinburgh, UK. Participants: GPs (n = 46).
  • Rasooly et al. (2022). Aim: to understand the current state of quality and performance measurement in primary diabetes care, and the facilitators and barriers to their implementation. Method: interviews (n = 26). Setting: tertiary hospitals and CHCs in Shanghai, China. Participants: patients (n = 12), family doctors (n = 3), endocrinologists (n = 2), CHC managers (n = 4), policymakers (n = 5).
  • Van den Heuvel et al. (2010). Aim: to describe and explore the views of German general practitioners on the clinical indicators of the Quality and Outcomes Framework. Method: focus groups (n = 7). Setting: north-western part of Germany. Participants: GPs (n = 54).
  • Wilkinson et al. (2000). Aim: to investigate reactions to the use of evidence-based cardiovascular and stroke performance indicators within one primary care group. Method: semi-structured interviews (n = 29). Setting: fifteen practices from a primary care group in southern England. Participants: GPs (n = 29).

CHC, community healthcare centre; GP, general practitioner; PCC, patient-centred care; PRO, patient-reported outcome.

*Articles report on the same study but were retained to incorporate potentially differing interpretations of the data.

Quality assessment

The articles used qualitative methodology (either interviews or focus groups). The CASP quality assessment revealed that most included studies met most quality criteria (Supplementary Appendix S2). All studies provided a clear description of findings. However, there were methodological limitations in several studies. Nine of the 14 articles did not discuss ethical issues, with many failing to report ethics approval for the data collection. The relationship between researchers and participants was not adequately discussed in 11 studies, two articles did not specify the research aim, and several others failed to report the participant recruitment strategy and included only a limited discussion of how data were analysed.

Objectives and themes

Data from included studies addressed the first three objectives, but no articles addressed the fourth objective. The themes for each objective are described below and summarized in Table 2.

Objectives and themes.

Objective: What is the role of clinical indicators in supporting quality improvement?
  • Show where changes need to be made
  • Motivate physicians to improve quality of care
  • Increase physicians’ accountability
  • Can encourage myopic quality improvement
  • Should be used by physicians, not government or the public
  • Should not be used punitively

Objective: What is needed to strengthen the ability of indicators to drive improvements in quality?
  • Support and participation of physicians in their development
  • Recording data should be straightforward
  • Feedback delivered in a way that is helpful for physicians
  • Availability of sufficient resource for quality improvement
  • Quality improvement requires working together
  • Incentives have advantages and disadvantages

Objective: Key attributes of effective indicators
  • Target the most important areas for quality improvement
  • Consistent with good medical care
  • Within physicians’ control
  • Reliable
  • Consider patient-reported measures alongside

What is the role of clinical indicators in supporting quality improvement?

Show where changes need to be made

Physicians noted that a key role of clinical indicators was their ability to illuminate specific areas of care requiring change. In many cases, physicians stated that it was only through clinical indicators that they received regular feedback on the quality of their care. Physicians appreciated the objective assessment of quality that clinical indicators provided, as opposed to intuiting where care may require improvement. Physicians also thought that clinical indicators could facilitate up-to-date, evidence-based care, provided that the indicators were based on best practice.

Motivate physicians to improve quality of care

Physicians commented on two ways in which clinical indicators motivated efforts to improve quality of care: first, seeing the clinical indicator feedback was often a prompt for physicians to take action on quality improvement. Physicians expressed that it was difficult to ignore this type of objective feedback. Second, clinical indicator feedback showing improvements in care motivated physicians, as it demonstrated tangible evidence of how quality improvement could translate into improved outcomes. Many physicians also thought that engaging in quality improvement was part of being a ‘good’ physician.

Increase physicians’ accountability

Physicians thought that measuring quality using clinical indicators would make them more accountable for the quality of their care. Some were concerned however that clinical indicators could be used by their organization for performance management, and they feared a loss of autonomy in their practice.

Can encourage myopic quality improvement

Physicians were concerned that clinical indicators could lead to a myopic view and produce unintended consequences. They commented that many of the ‘softer’ aspects of quality were difficult to quantify using indicators and risked being side-lined in favour of areas of care more easily quantified. Physicians were concerned that using clinical indicators may distract them from providing more holistic, patient-centred care. Overall, physicians stressed that clinical indicators should be a means to good care, not an end in themselves.

Should be used by physicians, not government or the public

Physicians stressed that clinical indicators should be used by physicians for the purpose of quality improvement, not by government or the public. They emphasized the potential for indicators to be misinterpreted by those outside the profession and were worried about being held accountable for measures they could not influence. Physicians also highlighted the tensions between their own priorities for quality improvement and the priorities of government or their organization. They thought that government or organization management were more likely to prioritize productivity and efficiency over the quality of patient care, and were worried that clinical indicators could entrench these priorities.

Should not be used punitively

Physicians thought that clinical indicators could either be employed in a ‘soft’ manner to encourage quality improvement or a ‘hard’ manner where poor performance would be criticized or punished. They stressed that this punitive approach would only isolate physicians and was unlikely to improve the quality of care.

What is needed to strengthen the ability of clinical indicators to drive improvements in quality?

Support and participation of physicians in their development

Physicians thought that clinical indicators were more likely to drive improvements in quality if they had the support of clinicians. Physicians were more inclined to use the indicators to make changes to their practice if they understood their purpose and agreed with the measures. They suggested that one way of ensuring their buy-in was to involve them in the development of clinical indicators.

Recording data should be straightforward

Physicians thought that recording data for clinical indicators could lead to an unmanageable increase in their workload and may require additional support staff. They suggested that recording indicator data should be integrated into their workflow and automated where possible.

Feedback delivered in a way that is helpful for physicians

Physicians had several suggestions for useful ways to deliver clinical indicator feedback. They wanted indicator feedback delivered in a manner that was visually appealing and easy to interpret—most suggested the use of charts rather than tables. Comparison feedback between departments, practices, or individual physicians was also considered useful. Physicians found it helpful to see patterns over time in their feedback. They also highlighted that the timing of feedback was important and should be aligned with appropriate interventions to improve quality.

Availability of sufficient resource for quality improvement

Physicians stated that sufficient resources were required both for the use of clinical indicators and for subsequent improvements in quality. Physicians also emphasized that they needed sufficient time and resources to reflect on their practice and make any changes needed to respond to indicator feedback and improve quality.

Quality improvement requires working together

Physicians emphasized that measuring quality of care was not enough to improve quality—it was also crucial that they had support to translate feedback into quality improvement. Most importantly, physicians wanted clinical indicator feedback to be linked to a clear action for improvement. They also suggested that quality improvement needed to happen as a team.

Using incentives has advantages and disadvantages

Physicians thought that while tying incentives to clinical indicators could accelerate quality improvement, there was also the potential for unintended consequences and ‘gaming’ the system.

What are the key attributes of effective indicators?

Target the most important areas of care for quality improvement

Physicians thought that the number of clinical indicators should be limited and only cover the most important areas of care. In particular, physicians suggested a focus on diseases where improved care can have a substantial impact, or a focus on especially high-risk patients. Technical process indicators were also suggested as an important aspect of care to measure. Physicians were generally resistant to productivity-oriented indicators.

Consistent with good medical care

Physicians thought it was important for clinical indicators to be evidence-based and to reflect best practice. They felt that indicators must be consistent with other policies and guidelines, and indicators should not contradict each other.

Within physicians’ control

Physicians thought it was important that clinical indicators measured aspects of care that were within their control. This was particularly important if indicators were tied to incentives or used punitively. Despite many physicians agreeing that outcome indicators measured what was ultimately important, they also expressed concern that outcomes were often affected by factors outside of physicians’ control.

Reliable

Physicians commented on several important attributes that made a clinical indicator reliable, and hence trustworthy, and that would increase their confidence in indicators being able to drive improvements in quality. They stated that clinical indicators should be of high quality, valid, precise, technically specific, clearly defined, and should only require information that could be measured accurately.

Consider patient-reported measures alongside

Physicians agreed that there was a role for patient-reported outcome measures in driving quality improvement. Patient-reported outcome measures and patient experience indicators were seen as representing one aspect of quality that was important to consider. However, physicians also recommended that such measures should be considered alongside other clinical indicators. They also thought that some aspects of patient experience are subjective and therefore less helpful for quality improvement.

Statement of principal findings

This systematic review found overall agreement that indicators could play a clear role in motivating physicians to improve the quality of care and showing where changes needed to be made. While it was felt that indicators increased physicians’ accountability, it was clear that they should be used by physicians themselves, rather than by the government or the public, and should not be used punitively. There was concern that an overreliance on indicators might lead to myopic quality improvement at the expense of more holistic care. In order to strengthen the ability of indicators to drive improvements in quality, physicians need to support and participate in the process of indicator development, recording relevant data should be straightforward, indicator feedback needs to be meaningful, and physicians and their teams need to be adequately resourced to act on findings.

While it was recognized that incentives might accelerate quality improvement, there was also the risk of unintended consequences and ‘gaming’. Key attributes of effective indicators were a focus on the most important areas for quality improvement, consistency with good medical care, measurement of aspects of care that were within the control of physicians and reliability. While there was support for the use of patient-reported outcome measures alongside clinical indicators, there was a potential disconnect between the supposed subjectivity of these measures and the desire for indicators to be ‘accurate’ or objective.

Strengths and limitations

This thematic synthesis of data identified from a systematic review of the literature was focused on physicians’ views regarding the utility of clinical indicators in practice. This is important to understand given the increasing use of clinical indicators and expectations that physicians will use and act on clinical indicator data. As we did not have access to the raw data from primary studies, our findings represent a synthesis of selected data included in the primary studies as well as the authors’ interpretations of that data. The literature search and coding were performed by one reviewer, which may have resulted in bias in the selection of articles. Lastly, texts in languages other than English were excluded.

There were also several limitations in the literature included in this systematic review. As noted, most (9/14) articles did not discuss ethical issues associated with their research, with many failing to report ethics approval for the data collection. Generalizability of the results to all physicians is difficult to ascertain because most participants were primary care physicians. Generalizability may also depend on when the data were obtained (given that perspectives are likely to change over time) and on the specific health systems examined in each study. Unfortunately, it was not feasible to disaggregate themes according to study context due to the limited number of included studies.

Interpretation within the context of the wider literature

While our literature search did not return results for physicians’ perspectives on the best tools for appraising the quality of clinical indicators, Jones et al. [8] have previously developed the Quality Indicator Clinical Appraisal (QICA) tool to provide an explicit basis for clinical indicator selection. The findings of our review are consistent with key aspects of the QICA tool, including the need for indicators to measure the most important aspects of medical care and to be evidence-based, acceptable, concordant with other measures of the issue, and reliable, and to consider the potential for unintended effects such as bias, as well as the resource implications of measurement itself [12]. Several technical characteristics listed in the QICA tool were not explored in our systematic review, including the need for a well-defined target population, exclusions and measurement systems, the need for indicators to reflect differing cultural values, the power and precision of an indicator to detect clinically important changes beyond random variation, and potential ethical issues involved in data gathering and reporting of results [12].

The final part of the QICA tool addresses the practical implications of indicator implementation in both data collection and data analysis [ 12 ]. There was significant overlap here between the characteristics included in the tool and those that physicians thought were important in the systematic review. Similar findings included the importance of limiting extra work in collecting data for clinical indicators, ensuring that technology is sufficient, ensuring that indicator feedback is actionable and that the results are understandable by physicians so they can be used to improve the quality of care.

Implications for policy, practice, and research

This review found that indicators can play an important motivating role for physicians to improve the quality of care and show where changes need to be made. For indicators to be effective, physicians should be involved in indicator development, recording relevant data should be straightforward, indicator feedback must be meaningful to physicians, and clinical teams need to be adequately resourced to act on findings. Effective indicators need to focus on the most important areas for quality improvement, be consistent with good medical care, and measure aspects of care within the control of physicians. Studies cautioned against using indicators primarily as punitive measures, and there were concerns that an overreliance on indicators could lead to a narrowed perspective on quality of care.

In this systematic review, we found that physicians believe that they should participate in the development of indicators and control the use of those indicators. However, it is worth noting that there are other legitimate groups and stakeholders that also have an interest in the development and use of indicators. Physicians form one professional group among a broader range of multi-disciplinary health providers, as well as patients themselves, whose perspectives need to be engaged in indicator development. It has also been argued that a key impediment faced by collaborative healthcare teams working towards quality improvement is the ‘structured embeddedness of medical dominance’ [ 13 ]. Balancing the perspectives of multiple professional groups as well as patients while avoiding the tendency for physicians to disengage from the process entirely is one of the challenges for the use of clinical indicators for driving quality improvements in policy as well as practice, and would be a valuable area for future research.

This review identified facilitators and barriers to meaningfully engaging physicians in developing and using clinical indicators to improve the quality of healthcare. Such information will help maximize the extent to which the potential of ‘big data’ to revolutionize clinical engagement with quality improvement activities can be realized.

Acknowledgements: not applicable.

Ana Renker-Darby (Data curation, analysis (lead), original draft preparation, reviewing and editing), Shanthi Ameratunga (Analysis, reviewing & editing), Peter Jones (Analysis, reviewing & editing), Corina Grey (Analysis, reviewing & editing), Matire Harwood (Analysis, reviewing & editing), Roshini Peiris-John (Analysis, reviewing & editing), Timothy Tenbensel (Analysis, reviewing & editing), Sue Wells (Analysis, reviewing & editing), Vanessa Selak (Conceptualisation, analysis, reviewing & editing).

Supplementary data is available at IJQHC online.

During the conduct of this research, S.A., C.G., M.H., P.J., R.P.J., V.S., and S.W. received funding for other research projects from the Health Research Council of New Zealand; S.A., C.G., M.H., V.S., and S.W. received funding from the National Heart Foundation of New Zealand and the National Science Challenge (Healthier Lives); V.S. and S.W. received funding from the Auckland Medical Research Foundation, and P.J. received funding from the A+ Trust.

A.R.’s work on this research was funded by a grant from the University of Auckland’s Faculty of Medical and Health Sciences Research Development Fund.

No new data were generated or analysed in support of this research.

1. Raleigh VS, Foot C. Getting the Measure of Quality: Opportunities and Challenges. London: The King’s Fund, 2010.

2. Solberg LI, Asche SE, Margolis KL, et al. Measuring an organization’s ability to manage change: the change process capability questionnaire and its use for improving depression care. Am J Med Qual 2008;23:193–200. https://doi.org/10.1177/1062860608314942

3. Hemingway H, Asselbergs FW, Danesh J, et al. Big data from electronic health records for early and late translational cardiovascular research: challenges and potential. Eur Heart J 2018;39:1481–95. https://doi.org/10.1093/eurheartj/ehx487

4. Roski J, Bo-Linn GW, Andrews TA. Creating value in health care through big data: opportunities and policy implications. Health Affairs 2014;33:1115–22.

5. Patel S, Rajkomar A, Harrison JD, et al. Next-generation audit and feedback for inpatient quality improvement using electronic health record data: a cluster randomised controlled trial. BMJ Qual Saf 2018;27:691–9.

6. Hurrell M, Stein A, MacDonald S. Use of natural language processing to identify significant abnormalities for follow-up in a large accumulation of non-delivered radiology reports. J Health Med Inform 2017;8:2.

7. Liao KP, Cai T, Savova GK, et al. Development of phenotype algorithms using electronic medical records and incorporating natural language processing. BMJ 2015;350:h1885.

8. Jones P, Shepherd M, Wells S, et al. Review article: what makes a good healthcare quality indicator? A systematic review and validation study. Emergency Med Australasia 2014;26:113–24. https://doi.org/10.1111/1742-6723.12195

9. Critical Appraisal Skills Programme. CASP Qualitative Checklist. 2018. https://casp-uk.net/casp-tools-checklists/ (1 November 2019, date last accessed).

10. Thomas J, Harden A. Methods for the thematic synthesis of qualitative research in systematic reviews. BMC Med Res Methodol 2008;8:45. https://doi.org/10.1186/1471-2288-8-45

11. Braun V, Clarke V. Using thematic analysis in psychology. Qual Res Psychol 2006;3:77–101. https://doi.org/10.1191/1478088706qp063oa

12. Jones P. Defining and Validating a Metric for Emergency Department Crowding. Auckland, New Zealand: University of Auckland, 2018.

13. Bourgeault IL, Mulvale G. Collaborative health care teams in Canada and the USA: confronting the structural embeddedness of medical dominance. Health Sociol Rev 2014;15:481–95. https://doi.org/10.5172/hesr.2006.15.5.481


American Journal of Neuroradiology

Risk of Hemorrhagic Transformation after Mechanical Thrombectomy without versus with IV Thrombolysis for Acute Ischemic Stroke: A Systematic Review and Meta-analysis of Randomized Clinical Trials

Seyed Behnam Jazayeri, Sherief Ghozy, Cem Bilgin, Mohamed Elfil, Ramanathan Kadirvel, David F. Kallmes

BACKGROUND: When treating acute ischemic stroke due to large-vessel occlusion, both mechanical thrombectomy and intravenous (IV) thrombolysis carry the risk of intracerebral hemorrhage.

PURPOSE: This study aimed to delve deeper into the risk of intracerebral hemorrhage and its subtypes associated with mechanical thrombectomy with or without IV thrombolysis to contribute to better decision-making in the treatment of acute ischemic stroke due to large-vessel occlusion.

DATA SOURCES: PubMed, EMBASE, and Scopus databases were searched for relevant studies from inception to September 6, 2023.

STUDY SELECTION: The eligibility criteria included randomized clinical trials or post hoc analysis of randomized controlled trials that focused on patients with acute ischemic stroke in the anterior circulation. After screening 4870 retrieved records, we included 9 studies (6 randomized controlled trials and 3 post hoc analyses of randomized controlled trials) with 3241 patients.

DATA ANALYSIS: The interventions compared were mechanical thrombectomy + IV thrombolysis versus mechanical thrombectomy alone, with the outcomes of interest being any intracerebral hemorrhage and symptomatic intracerebral hemorrhage after intervention. A common definition of symptomatic intracerebral hemorrhage was pooled from various classification systems, and subgroup analyses were performed on the basis of different definitions and anatomic descriptions of hemorrhage. The quality of the studies was assessed using the revised Cochrane risk-of-bias tool (RoB 2). Meta-analysis was performed using a random effects model.

DATA SYNTHESIS: Eight studies had some concerns, and 1 study was considered high risk. Overall, the risk of symptomatic intracerebral hemorrhage was comparable between mechanical thrombectomy + IV thrombolysis and mechanical thrombectomy alone (risk ratio, 1.24 [95% CI, 0.89–1.72]; P = .20), with no heterogeneity across studies. Subgroup analysis of symptomatic intracerebral hemorrhage showed a nonsignificant difference between the 2 groups based on the National Institute of Neurological Disorders and Stroke (P = .3), the Heidelberg Bleeding Classification (P = .5), the Safe Implementation of Thrombolysis in Stroke-Monitoring Study (P = .4), and the European Cooperative Acute Stroke Study III (P = .7) criteria. Subgroup analysis of different anatomic descriptions of intracerebral hemorrhage showed no difference between the 2 groups. We also found no difference in the risk of any intracerebral hemorrhage between the 2 groups (risk ratio, 1.10 [95% CI, 1.00–1.21]; P = .052), with no heterogeneity across studies.

LIMITATIONS: There was a potential for performance bias in most studies.

CONCLUSIONS: In this systematic review and meta-analysis, the risk of any intracerebral hemorrhage and symptomatic intracerebral hemorrhage, including its various classifications and anatomic descriptions, was comparable between mechanical thrombectomy + IV thrombolysis and mechanical thrombectomy alone.
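The DATA ANALYSIS and DATA SYNTHESIS sections above describe pooling risk ratios across trials under a random-effects model. As a rough illustration of how such a pooled estimate can be computed, the sketch below implements a DerSimonian-Laird random-effects meta-analysis of risk ratios in Python. This is a minimal sketch, not the authors' analysis code: the study names and event counts are hypothetical placeholders, not data from the trials included in this review.

```python
# Minimal sketch (not the authors' code): DerSimonian-Laird random-effects
# meta-analysis of risk ratios, illustrating the pooling described above.
# Counts below are hypothetical placeholders, not data from this review.
import math

# (events, total) for MT + IV thrombolysis vs. MT alone -- illustrative only
studies = {
    "Trial A": ((12, 100), (10, 102)),
    "Trial B": ((20, 250), (16, 245)),
    "Trial C": ((8, 120), (9, 118)),
}

log_rr, var = [], []
for (e1, n1), (e2, n2) in studies.values():
    rr = (e1 / n1) / (e2 / n2)
    log_rr.append(math.log(rr))
    # Approximate variance of the log risk ratio from the 2x2 counts
    var.append(1 / e1 - 1 / n1 + 1 / e2 - 1 / n2)

# Fixed-effect (inverse-variance) weights, used to estimate between-study variance
w = [1 / v for v in var]
pooled_fe = sum(wi * y for wi, y in zip(w, log_rr)) / sum(w)
q = sum(wi * (y - pooled_fe) ** 2 for wi, y in zip(w, log_rr))
df = len(studies) - 1
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (q - df) / c)  # DerSimonian-Laird estimate of tau^2

# Random-effects weights incorporate the between-study variance tau^2
w_re = [1 / (v + tau2) for v in var]
pooled = sum(wi * y for wi, y in zip(w_re, log_rr)) / sum(w_re)
se = math.sqrt(1 / sum(w_re))

rr = math.exp(pooled)
ci = (math.exp(pooled - 1.96 * se), math.exp(pooled + 1.96 * se))
print(f"Pooled RR = {rr:.2f} (95% CI {ci[0]:.2f}-{ci[1]:.2f}), tau^2 = {tau2:.3f}")
```

In practice, reviewers would typically use established meta-analysis software (e.g., RevMan, or the metafor package in R) rather than hand-rolled code, but the weighting logic shown here is the same one underlying a random-effects pooled risk ratio.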

© 2024 by American Journal of Neuroradiology
