natural experiment
natural experiment, observational study in which an event or a situation that allows for the random or seemingly random assignment of study subjects to different groups is exploited to answer a particular question. Natural experiments are often used to study situations in which controlled experimentation is not possible, such as when an exposure of interest cannot be practically or ethically assigned to research subjects. Situations that may create appropriate circumstances for a natural experiment include policy changes, weather events, and natural disasters. Natural experiments are used most commonly in the fields of epidemiology, political science, psychology, and social science.
Key features of experimental study design include manipulation and control. Manipulation, in this context, means that the experimenter can control which research subjects receive which exposures. For instance, subjects randomized to the treatment arm of an experiment typically receive treatment with the drug or therapy that is the focus of the experiment, while those in the control group receive no treatment or a different treatment. Control is most readily accomplished through random assignment, which means that the procedures by which participants are assigned to a treatment and control condition ensure that each participant has an equal probability of assignment to either group. Random assignment ensures that individual characteristics or experiences that might confound the treatment results are, on average, evenly distributed between the two groups. In this way, at least one variable can be manipulated, and units are randomly assigned to the different levels or categories of the manipulated variable.
In epidemiology, the gold standard in research design generally is considered to be the randomized controlled trial (RCT). RCTs, however, can answer only certain types of epidemiologic questions, and they are not useful in the investigation of questions for which random assignment is either impracticable or unethical. The bulk of epidemiologic research relies on observational data, which raises issues in drawing causal inferences from the results. A core assumption for drawing causal inference is that the average outcome of the group exposed to one treatment regimen represents the average outcome the other group would have had if they had been exposed to the same treatment regimen. If treatment is not randomly assigned, as in the case of observational studies, the two groups cannot be assumed to be exchangeable (on both known and unknown confounders).
As an example, suppose that an investigator is interested in the effect of poor housing on health. Because it is neither practical nor ethical to randomize people to variable housing conditions, this subject is difficult to study using an experimental approach. However, if a housing policy change, such as a lottery for subsidized mortgages, was enacted that enabled some people to move to more desirable housing while leaving other similar people in their previous substandard housing, it might be possible to use that policy change to study the effect of housing change on health outcomes. Another example is a well-known natural experiment in Helena, Montana, in which smoking was banned from all public places for a six-month period. Investigators later reported a 60-percent drop in heart attacks for the study area during the time the ban was in effect.
Because natural experiments do not randomize participants into exposure groups, the assumptions and analytical techniques customarily applied to experimental designs are not valid for them. Rather, natural experiments are quasi experiments and must be thought about and analyzed as such. The lack of random assignment means multiple threats to causal inference, including attrition, history, testing, regression, instrumentation, and maturation, may influence observed study outcomes. For this reason, natural experiments will never unequivocally determine causation in a given situation. Nevertheless, they are a useful method for researchers, and if used with care they can provide additional data that may help with a research question and that may not be obtainable in any other way.
The major limitation in inferring causation from natural experiments is the presence of unmeasured confounding. One class of methods designed to control confounding and measurement error is based on instrumental variables (IV). While useful in a variety of applications, the validity and interpretation of IV estimates depend on strong assumptions, the plausibility of which must be considered with regard to the causal relation in question.
In particular, IV analyses depend on the assumption that subjects were effectively randomized, even if the randomization was accidental (in the case of an administrative policy change or exposure to a natural disaster) and adherence to random assignment was low. IV methods can be used to control for confounding in observational studies, to control for confounding due to noncompliance, and to correct for misclassification.
IV analysis, however, can produce serious biases in effect estimates. It can also be difficult to identify the particular subpopulation to which the causal effect IV estimate applies. Moreover, IV analysis can add considerable imprecision to causal effect estimates. Small sample size poses an additional challenge in applying IV methods.
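To make the IV logic above concrete, the following minimal sketch (not drawn from the article, using simulated data and hypothetical variable names) shows how an "as-if random" instrument can recover a treatment effect that a naive comparison misses because of an unmeasured confounder.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical setup: z is an "as-if random" instrument (e.g. an administrative
# rule), u is an unmeasured confounder of both treatment and outcome.
u = rng.normal(size=n)                                         # unmeasured confounder
z = rng.binomial(1, 0.5, size=n)                               # instrument, independent of u
d = ((0.8 * z + u + rng.normal(size=n)) > 0.5).astype(float)   # treatment uptake, driven by z and u
y = 2.0 * d + 1.5 * u + rng.normal(size=n)                     # outcome; true effect of d is 2.0

# Naive comparison of treated vs untreated is biased because u drives both d and y.
naive = y[d == 1].mean() - y[d == 0].mean()

# Two-stage least squares: first regress d on z, then regress y on the fitted d.
X1 = np.column_stack([np.ones(n), z])
d_hat = X1 @ np.linalg.lstsq(X1, d, rcond=None)[0]
X2 = np.column_stack([np.ones(n), d_hat])
iv_estimate = np.linalg.lstsq(X2, y, rcond=None)[0][1]

print(f"naive difference in means: {naive:.2f}")       # biased upward
print(f"two-stage (IV) estimate:   {iv_estimate:.2f}")  # close to 2.0
```

Under the strong assumptions discussed above (the instrument is independent of the confounder and affects the outcome only through the treatment), the two-stage estimate is close to the true effect, while the naive difference in means is not.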
Causality and natural experiments: the 2021 Nobel Prize in Economic Sciences
The Royal Swedish Academy of Sciences awarded the Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel 2021 to three economists—Joshua Angrist, David Card, and Guido Imbens. Their contributions to the economics literature shaped economists’ understanding of when causal relationships can be established, especially using non-experimental data, and what kinds of methods and assumptions allow us to uncover the true causal effect of one variable on another. Today, businesses, courts and policymakers rely on causal empirical evidence to make their decisions.
The Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel 2021 is shared by three economists.
- David Card received half of the prize ‘for his empirical contributions to labour economics’.
- Joshua D. Angrist and Guido W. Imbens shared the other half of the prize ‘for their methodological contributions to the analysis of causal relationships’. 1
Alan Krueger
Another prominent economist made a great contribution to the literature and research agenda on causal inference alongside the 2021 Nobel Prize winners. We have no doubt that Alan Krueger would have shared this award—however, he passed away in 2019, and Nobel Prizes are not awarded posthumously.
The common theme for the 2021 Nobel Prize is causality and natural experiments .
Why causality matters
Many policy and business decisions require a thorough understanding of causes and effects. They might involve questions such as:
- will a higher minimum wage cause unemployment?
- how much will a person’s income increase if they have one more year of schooling?
- by how much will a company’s sales decrease if it increases its prices?
- what is the damage resulting from a particular cartel agreement?
However, naïve approaches to analysing data in order to answer such questions may result in policy recommendations, or decisions, that are based on a misunderstanding of the effects of a factor on an outcome of interest.
Answering such questions requires an approach that goes beyond using data to explore mere correlations—since relationships observed in data are not necessarily informative of causal effects if they are not collected and analysed using the right approaches. The field of empirical economics is devoted to understanding these approaches, especially in cases where it is difficult to follow the ‘gold standard’.
The gold standard in identifying empirical causal relationships is arguably the randomised controlled trial (RCT). For example, in medicine it is common practice to randomly allocate a medical treatment to some participants in a medical trial and, in parallel, define a comparable control group of non-treated participants who usually receive a placebo treatment. Outcomes for the treated and non-treated participants are then compared to identify the effectiveness of the treatment. Such trials were undertaken in 2020 and 2021 in order to test the efficacy of COVID-19 vaccines. The key factor that allows such trials to identify the causal impact of a treatment on patient health is the randomisation of who receives the treatment and who does not.
For example, Figure 1 gives a simplified representation of the relationship between the height and intelligence of hypothetical individuals in a hypothetical society, and illustrates how it varies depending on the sample selected.
Figure 1 Non-random samples can severely mischaracterise underlying relationships (sampling bias)
Note: The figure compares the line of best fit under three selections of data points: all of society, a non-random selection, and a random selection drawn from the underlying society. It illustrates that the linear relationship estimated using the non-random selection is significantly different from the relationship in the whole society, whereas the relationship estimated using the random selection closely approximates it.
Source: Oxera.
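The interactive figure itself cannot be reproduced here, but its point can be sketched with simulated data (hypothetical values, not Oxera's): a line of best fit estimated on a non-random sample can differ markedly from the line that holds in the full population, while a random sample recovers it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Hypothetical population with a modest positive relationship between x and y.
x = rng.normal(100, 15, size=n)
y = 0.3 * x + rng.normal(0, 10, size=n)

def slope(xs, ys):
    """Slope of the ordinary least-squares line of best fit."""
    return np.polyfit(xs, ys, 1)[0]

# Random selection: every unit has the same chance of being sampled.
random_idx = rng.choice(n, size=500, replace=False)

# Non-random selection: only units with a high combined score are observed,
# i.e. inclusion in the sample depends on both variables (sampling bias).
selected = (x + y) > np.quantile(x + y, 0.8)

print(f"slope in the full population:   {slope(x, y):.2f}")
print(f"slope in the random sample:     {slope(x[random_idx], y[random_idx]):.2f}")
print(f"slope in the non-random sample: {slope(x[selected], y[selected]):.2f}")
```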
Common theme: natural experiments
A natural experiment can be considered an approximation of an RCT when it is impossible or unethical to randomise a treatment across participants. Natural experiments arise when nature, or policies, result in situations where a treatment is automatically assigned to participants in a manner that is ‘almost as good as random’. The term ‘treatment’ here is not limited to medical treatment: it covers anything that might have an effect on an outcome, for instance education on earnings, gender on pay, or a cartel on prices.
Treatments that are used to answer many empirical questions in economics would be impossible to randomise across participants. For example, among many other social issues, Angrist and Krueger analysed the impact of educational attainment on workplace earnings. 2 To study this empirical relationship, and to obtain a dataset with a randomised treatment, the authors would have had to assign different years of schooling to different children randomly and compare their future earnings. Such a research design is unethical, and an RCT is therefore not a suitable research method to answer this question.
Angrist and Krueger instead used a natural experiment arising from US legislation—the compulsory education leaving age. They argued that this legislation results in a natural experiment, as students who are born earlier in the calendar year are slightly older, and therefore reach the compulsory education leaving age earlier than their peers (born later in the calendar year). As such, people born earlier in the calendar year tend to have less education than those born later even though the birth date of an individual is random. The legislation therefore arguably results in an ‘almost as good as random’ allocation of years of schooling.
Card and Krueger showed another example of how natural experiments can be used to identify causal effects—in their case, to understand the impact of a minimum wage on employment in the USA, and specifically whether raising the minimum wage costs jobs. 3 They used data arising from a policy change 4 along with a suitable analytical approach. 5 Specifically, the authors compared the evolution of employment metrics in New Jersey, a state with a change in its minimum wage rate policy, and the evolution of the same employment metrics in neighbouring Pennsylvania, a state without the change in minimum wage. In this case, ‘assignment’ of the treatment can be considered ‘almost as good as random’ across New Jersey and Pennsylvania. The authors found ‘no indication’ of a negative impact on jobs in fast-food chain restaurants in the two states (shown in Figure 2 below). Such innovative use of data has influenced generations of empirical economists who have used natural experiments.
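A hedged sketch of the difference-in-differences logic behind this comparison (also described in footnote 5), using made-up average employment figures rather than Card and Krueger's data:

```python
# Difference-in-differences with illustrative (not actual) average
# full-time-equivalent employment per restaurant, before and after the
# April 1992 minimum wage increase in New Jersey.
employment = {
    ("NJ", "before"): 20.0,  # treated state: minimum wage increased
    ("NJ", "after"):  20.5,
    ("PA", "before"): 23.0,  # control state: no policy change
    ("PA", "after"):  21.5,
}

# Change over time within each state.
change_nj = employment[("NJ", "after")] - employment[("NJ", "before")]
change_pa = employment[("PA", "after")] - employment[("PA", "before")]

# The DiD estimate attributes to the policy the part of New Jersey's change
# that is not explained by the common trend observed in Pennsylvania.
did_estimate = change_nj - change_pa

print(f"change in NJ:                       {change_nj:+.1f}")
print(f"change in PA:                       {change_pa:+.1f}")
print(f"difference-in-differences estimate: {did_estimate:+.1f}")
```

The key identifying assumption is that, absent the policy change, employment in the two states would have followed parallel trends.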
Figure 2 Geographical boundaries are sometimes used to differentiate policy impacts across similar units
Note: The bright green line represents the border between Pennsylvania and New Jersey. Light and dark green circles represent some of the fast-food restaurants near the border. Fast-food restaurants are selected randomly for illustration only and are not necessarily part of Card and Krueger’s data. In 1992, restaurants on the New Jersey (right-hand) side of the border received the treatment (increased minimum wage), and those on the Pennsylvania (left-hand) side of the border did not. Considering how close these groups of restaurants are to each other, it might be reasonable to assume that the group of Pennsylvania restaurants (light green circles) represents a reasonable control group to measure the impact of the treatment on the group of New Jersey restaurants (dark green circles).
A fundamental difference between natural experiments and RCTs is the researcher’s ability to control who receives the treatment. In an RCT, a researcher allocates participants randomly to treatment and control groups without giving them a choice. However, as natural experiments are by their nature uncontrolled, there can be various situations where individual responses to treatment vary. When such situations arise, the interpretation of the differences in outcomes becomes challenging.
For example, in the Angrist and Krueger study on the impact of educational attainment on earnings, one could look at who the legislation on compulsory schooling really affects. We are asked to consider compulsory schooling as an intervention affecting all students; however, it actually affects the behaviour only of those who would have left education absent the legislation on compulsory schooling—other students would have completed their education anyway. The impact of the compulsory schooling treatment varies across participants.
Imbens and Angrist developed the analytical framework to use in such situations, which shaped how researchers use, and think about, natural experiments. 6 They showed that, under certain assumptions, what can be identified is the impact of treatment only on those whose behaviour is altered as a result of the treatment, which they name the ‘local average treatment effect’.
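A minimal sketch of the idea (simulated compliance behaviour and hypothetical numbers, not the authors' analysis): a Wald-type estimator divides the effect of assignment on the outcome by the share of people whose treatment uptake the assignment actually changes, and so recovers the effect for compliers only.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# z = 1 if a person is subject to the rule (e.g. compulsory schooling applies).
z = rng.binomial(1, 0.5, size=n)

# Compliance types: compliers take the treatment only if assigned; always-takers
# take it regardless; never-takers never take it.
types = rng.choice(["complier", "always", "never"], size=n, p=[0.3, 0.6, 0.1])
d = np.where(types == "always", 1, np.where(types == "complier", z, 0))

# Outcome: the treatment raises the outcome by 5 units for those who receive it.
y = 5.0 * d + rng.normal(50, 10, size=n)

# Intention-to-treat effects of assignment on the outcome and on uptake.
itt_y = y[z == 1].mean() - y[z == 0].mean()
itt_d = d[z == 1].mean() - d[z == 0].mean()

# Wald / local average treatment effect estimate.
late = itt_y / itt_d

print(f"effect of assignment on outcome (ITT): {itt_y:.2f}")
print(f"share whose uptake is changed:         {itt_d:.2f}")
print(f"Wald / LATE estimate:                  {late:.2f}")  # close to 5.0
```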
How does regression analysis fit alongside these concepts?
RCTs and natural experiments have the advantage that their random or almost random assignment does much of the work to uncover the effect, and the statistical analysis itself can be simple. However, RCTs and natural experiments (as with neighbouring states subject to different policy changes) are not always available to researchers or informative about the underlying mechanisms. Nonetheless, the idea behind them both—in other words, how randomisation and causal inference are linked—affects how causal research is done by today’s organisations.
If these approaches are unavailable, other statistical approaches can be used instead. Notably, without the benefits of randomisation, regression approaches start by defining a causal model of assumed relationships, thereby giving an explicit assumed structure to how one factor affects another. When these assumed relationships are correct, regression outcomes are informative about the causal effect and the mechanisms through which it takes place. 7 To ensure adherence to the ‘almost as good as random’ principle, regression approaches need to include not just the principal variable of interest but also to ‘control’ for all other relevant factors (the ‘confounders’). 8 If these can all be accounted for, a regression approach allows the researcher to identify and measure the effect of the main variable of interest.
A practical example might be a cartel case in which we might expect prices to increase due to the formation of the cartel. However, prices depend on many factors, of which the cartel conduct is only one. Such an analysis might rely on a comparison of prices during the cartel period with prices after it. As prices might also change over time due to, for instance, input costs, demand, or product characteristics, simply comparing averages between two points in time would not measure the cartel effect but all the effects together, and would be uninformative about the individual components. Only if all other relevant factors are accounted for, and the modelling assumptions hold, is it possible to measure the actual effect of the cartel on prices.
Figure 3 shows observations on price and demand in two hypothetical markets.
Figure 3 Regression approaches control for confounders to assess the relationship
Note: The relationship between prices and demand is simplified for illustrative purposes. If differences in market characteristics are not accounted for, a regression of demand on prices results in a counterintuitive positive relationship. Only after controlling for market characteristics does a regression approach help to uncover the expected negative relationship.
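A minimal sketch of the reversal the note describes (simulated data and hypothetical coefficients): pooling two markets with different characteristics makes price and demand look positively related, while adding a control for the market characteristic recovers the negative relationship that holds within each market.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000

# Two hypothetical markets: the "premium" market has both higher prices and
# higher demand, which confounds the pooled price-demand relationship.
market = rng.binomial(1, 0.5, size=n)               # 0 = standard, 1 = premium
price = 10 + 5 * market + rng.normal(0, 1, size=n)
demand = 100 - 2 * price + 20 * market + rng.normal(0, 2, size=n)

def ols(X, y):
    """Least-squares coefficients for a design matrix X (including intercept)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Naive regression of demand on price only: the sign is misleading.
naive = ols(np.column_stack([np.ones(n), price]), demand)

# Controlling for the market characteristic recovers the true price effect.
adjusted = ols(np.column_stack([np.ones(n), price, market]), demand)

print(f"price coefficient, no controls:            {naive[1]:+.2f}")     # positive
print(f"price coefficient, controlling for market: {adjusted[1]:+.2f}")  # close to -2.0
```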
Natural experiments and causal research in today’s applied economics
An understanding of cause-and-effect relationships is not only relevant in the public policy space but also plays an increasingly important role in business decision-making. For example, companies may want to know how much more profit they can generate from a better customer experience specifically, separately from any impacts of changes in product quality or differences in customer demographics.
In this example, a company could choose the ‘controlled experiment’ route—i.e. running an experiment in one region of the market and comparing changes in its profits against changes in a similar region that does not go through the treatment. Alternatively, if an experiment is not feasible, it could run a regression analysis examining the impact of different levels of customer experience on profits, controlling for all other relevant factors. In both approaches, the experimental design and implementation, as well as the analysis of the impact, would require careful consideration.
Indeed, a growing number of companies are applying economic analysis to understand causal relationships and the impacts of their business decisions in many areas—such as product design, marketing, pricing, and arrangements with suppliers/buyers. Tech companies are perhaps at the forefront of applying economic analysis in the design of platforms, from more high-level questions on sustainable business models to more detailed ones on platform features and the customer journey. 9
Oxera uses natural experiments as part of its analytical toolkit. For example, in the context of a recent merger assessment, our team assessed the closeness of competition between two airlines that intended to merge. If the parties were close competitors, the merger might have led to a lessening of competition and an increase in prices. As part of the assessment, the team undertook a case study that looked at how prices evolved on a route following the grounding of Boeing 737 MAX aircraft due to safety concerns. One of the parties to the merger (airline A) operated 737 MAX aircraft on the route—and was therefore affected by this grounding event—while the other (airline B) did not. This situation provided a natural experiment that the team used for its analysis. Because this negative supply shock (the grounding of one type of aircraft) yielded an unexpected output decline for airline A, it provided useful information about the competitive reactions of other airlines flying on that same route. Oxera’s analysis indicated that the parties impose material competitive constraints on each other: the grounding event resulted in airline B increasing its prices on the route, and when airline A started operating the route using different aircraft, airline B’s prices subsequently fell.
Looking to the future
Generations of researchers have been inspired by the approaches that the 2021 Nobel winners have pioneered. As practitioners of applied economics, we are looking forward to their future contributions to this field, which may expand the economist’s analytical toolkit with various data science approaches.
For example, Imbens has recently worked on understanding how prediction-focused data science approaches can be translated to causal settings and when they can be useful. 10 It is exciting to see developments in these areas as they illustrate the growing ability of economic research designs to capture true underlying effects, which will result in better and more accurate data-driven policy and business decisions in the future.
1 The Nobel Prize (2021), ‘Press release: The Prize in Economic Sciences 2021’ .
2 Angrist, J.D. and Krueger, A.B. (1991), ‘Does compulsory school attendance affect schooling and earnings?’, The Quarterly Journal of Economics , 106 :4, November, pp. 979–1014.
3 Card, D. and Krueger, A.B. (1994), ‘Minimum Wages and Employment: A Case Study of the Fast Food Industry in New Jersey and Pennsylvania’, The American Economic Review , 84 :4, September, pp. 772–93.
4 On 1 April 1992, the minimum wage rate in New Jersey was increased from $4.25 to $5.05 per hour.
5 They compared employment, and other labour metrics such as wage, between New Jersey and its neighbouring state Pennsylvania before and after the policy change. This analytical approach is more widely known as a ‘difference-in-differences’ approach.
6 Imbens, G.W. and Angrist, J.D. (1994), ‘Identification and estimation of local average treatment effects’, Econometrica , 62 :2, March, pp. 467–75.
7 This comes at the cost of various modelling assumptions that are not required when using RCTs or natural experiments.
8 Technically, this means that once all common causes (confounders) of an outcome and treatment assignment mechanism are included in the regression in suitable forms—i.e. once the other effects are controlled for—the assignment to one group or the other is ‘almost as good as random’. Controlling for common causes allows researchers to mimic random assignment when the modelling assumptions hold.
9 For example, see the work by Amazon’s Core AI group: Bajari, P., Cen, Z., Chernozhukov, V., Huerta, R., Li, J., Manukonda, M. and Monokroussos, G. (2020), ‘New Goods, Productivity and the Measurement of Inflation: Using Machine Learning to Improve Quality Adjustments’, American Economic Association , January .
10 See, for instance, Athey, S. and Imbens, G.W. (2019), ‘Machine learning methods that economists should know about’, Annual Review of Economics , 11 , August, pp. 685–725.
Dr Stefan Witte
Contributor.
- Published: 11 February 2021
Conceptualising natural and quasi experiments in public health
- Frank de Vocht ORCID: orcid.org/0000-0003-3631-627X 1,2,3,
- Srinivasa Vittal Katikireddi 4,
- Cheryl McQuire 1,2,
- Kate Tilling 1,5,
- Matthew Hickman 1 &
- Peter Craig 4
BMC Medical Research Methodology, volume 21, Article number: 32 (2021)
Natural or quasi experiments are appealing for public health research because they enable the evaluation of events or interventions that are difficult or impossible to manipulate experimentally, such as many policy and health system reforms. However, there remains ambiguity in the literature about their definition and how they differ from randomized controlled experiments and from other observational designs. We conceptualise natural experiments in the context of public health evaluations and align the study design to the Target Trial Framework.
A literature search was conducted, and key methodological papers were used to develop this work. Peer-reviewed papers were supplemented by grey literature.
Natural experiment studies (NES) combine features of experiments and non-experiments. They differ from planned experiments, such as randomized controlled trials, in that exposure allocation is not controlled by researchers. They differ from other observational designs in that they evaluate the impact of events or processes that lead to differences in exposure. As a result they are, in theory, less susceptible to bias than other observational study designs. Importantly, causal inference relies heavily on the assumption that exposure allocation can be considered ‘as-if randomized’. The target trial framework provides a systematic basis for evaluating this assumption and the other design elements that underpin the causal claims that can be made from NES.
Conclusions
NES should be considered a type of study design rather than a set of tools for analyses of non-randomized interventions. Alignment of NES to the Target Trial framework will clarify the strength of evidence underpinning claims about the effectiveness of public health interventions.
When designing a study to estimate the causal effect of an intervention, the experiment (particularly the randomised controlled trial (RCT)) is generally considered to be the design least susceptible to bias. A defining feature of the experiment is that the researcher controls the assignment of the treatment or exposure. If properly conducted, random assignment balances unmeasured confounders in expectation between the intervention and control groups. In many evaluations of public health interventions, however, it is not possible to conduct randomised experiments. Instead, standard observational epidemiological study designs have traditionally been used. These are known to be susceptible to unmeasured confounding.
Natural experimental studies (NES) have become popular as an alternative evaluation design in public health research, as they have distinct benefits over traditional designs [ 1 ]. In NES, although the allocation and dosage of treatment or exposure are not under the control of the researcher, they are expected to be unrelated to other factors that cause the outcome of interest [ 2 , 3 , 4 , 5 ]. Such studies can provide strong causal information in complex real-world situations, and can generate effect sizes close to the causal estimates from RCTs [ 6 , 7 , 8 ]. The term natural experiment study is sometimes used synonymously with quasi-experiment; a much broader term that can also refer to researcher-led but non-randomised experiments. In this paper we argue for a clearer conceptualisation of natural experiment studies in public health research, and present a framework to improve their design and reporting and facilitate assessment of causal claims.
Natural and quasi-experiments have a long history of use for evaluations of public health interventions. One of the earliest and best-known examples is the case of ‘Dr John Snow and the Broad Street pump’ [ 9 ]. In this study, cholera deaths were significantly lower among residents served by the Lambeth water company, which had moved its intake pipe to an upstream location of the Thames following an earlier outbreak, compared to those served by the Southwark and Vauxhall water company, who did not move their intake pipe. Since houses in the study area were serviced by either company in an essentially random manner, this natural experiment provided strong evidence that cholera was transmitted through water [ 10 ].
Natural and quasi experiments
Natural and quasi experiments are appealing because they enable the evaluation of changes to a system that are difficult or impossible to manipulate experimentally. These include, for example, large events, pandemics and policy changes [ 7 , 11 ]. They also allow for retrospective evaluation when the opportunity for a trial has passed [ 12 ]. They offer benefits over standard observational studies because they exploit variation in exposure that arises from an exogenous (i.e. not caused by other factors in the analytic model [ 1 ]) event or intervention. This aligns them to the ‘do-operator’ in the work of Pearl [ 13 ]. Quasi experiments (QES) and NES thus combine features of experiments (exogenous exposure) and non-experiments (observations without a researcher-controlled intervention). As a result, they are generally less susceptible to confounding than many other observational study designs [ 14 ]. However, a common critique of QES and NES is that because the processes producing variation in exposure are outside the control of the research team, there is uncertainty as to whether confounding has been sufficiently minimized or avoided [ 7 ]. Consider, for example, a QES of the impact of a fast-food chain’s voluntary decision to label its menus with calorie information on the calories subsequently purchased [ 15 ]. Unmeasured differences between the populations that visit that particular chain and those that visit other fast-food outlets could lead to residual confounding.
A distinction is sometimes made between QES and NES. The term ‘natural experiment’ has traditionally referred to the occurrence of an event with a natural cause; a ‘force of nature’ (Fig. 1 a) [ 1 ]. These make for some of the most compelling studies of causation from non-randomised experiments. For example, the Canterbury earthquakes in 2010–2011 have been used to study the causal impact of such disasters because about half of an established birth cohort lived in the affected area with the remainder of the cohort living elsewhere [ 16 ]. More recently, the use of the term ‘natural’ has been understood more broadly as an event which did not involve the deliberate manipulation of exposure for research purposes (for example a policy change), even if human agency was involved [ 17 ]. In contrast with natural experiments, in QES the research team may be able to influence exposure allocation, even if the event or exposure itself is not under their full control; for example in a phased roll-out of a policy [ 18 ]. A well-known example of a natural experiment is the “Dutch Hunger Winter” summarised by Lumey et al. [ 19 ]. During this period in the Second World War the German authorities blocked all food supplies to the occupied West of the Netherlands, which resulted in widespread starvation. Food supplies were restored immediately after the country was liberated, so the exposure was sharply defined by time as well as place. Because there was sufficient food in the occupied and liberated areas of the Netherlands before and after the Hunger Winter, exposure to famine occurred based on an individual’s time and place (of birth) only. Similar examples of such ‘political’ natural experiment studies are the study of the impact of China’s Great Famine [ 20 ] and the ‘special period’ in Cuba’s history following the collapse of the Soviet Union and the imposition of a US blockade [ 21 ]. NES that describe the evaluation of an event which did not involve the deliberate manipulation of an exposure but involved human agency, such as the impact of a new policy, are the mainstay of ‘natural experimental research’ in public health, and the term NES has become increasingly popular to indicate any quasi-experimental design (although it has not completely replaced the term quasi-experiment).
Different conceptualisations of natural and quasi experiments within wider evaluation frameworks
Dunning takes the distinction of a NES further. He defines a NES as a QES where knowledge about the exposure allocation process provides a strong argument that allocation, although not deliberately manipulated by the researcher, is essentially random. This concept is referred to as ‘as-if randomization’ (Fig. 1 b) [ 4 , 8 , 10 ]. Under this definition, NES differ from QES in which the allocation of exposure, whether partly controlled by the researcher or not, does not clearly resemble a random process.
A third distinction between QES and NES has been made that argues that NES describe the study of unplanned events whereas QES describe evaluations of events that are planned (but not controlled by the researcher), such as policies or programmes specifically aimed at influencing an outcome (Fig. 1 c) [ 17 ]. In practice however, the distinction between these can be ambiguous.
When the assignment of exposure is not controlled by the researcher, with rare exceptions (for example lottery-system [ 22 ] or military draft [ 23 ] allocations), it is typically very difficult to prove that true (as-if) randomization occurred. Because of the ambiguity of ‘as-if randomization’ and the fact that the tools to assess this are the same as those used for assessment of internal validity in any observational study [ 12 ], the UK Medical Research Council (MRC) guidance advocates a broader conceptualisation of a NES. Under the MRC guidance, a NES is defined as any study that investigates an event that is not under the control of the research team, and which divides a population into exposed and unexposed groups, or into groups with different levels of exposure (Fig. 1 d).
Here, while acknowledging the remaining ambiguity regarding the precise definition of a NES, in consideration of the definitions above [ 24 ], we argue that:
- what distinguishes NES from RCTs is that allocation is not controlled by the researchers; and
- what distinguishes NES from other observational designs is that they specifically evaluate the impact of a clearly defined event or process which results in differences in exposure between groups.
A detailed assessment of the allocation mechanism (which determines exposure status) is essential. If we can demonstrate that the allocation process approximates a randomization process, any causal claims from NES will be substantially strengthened. The plausibility of the ‘as-if random’ assumption strongly depends on detailed knowledge of why and how individuals or groups of individuals were assigned to conditions and how the assignment process was implemented [ 10 ]. This plausibility can be assessed quantitatively for observed factors using standard tools for assessment of internal validity of a study [ 12 ], and should ideally be supplemented by a qualitative description of the assignment process. Common with contemporary public health practice, we will use the term ‘natural experiment study’, or NES to refer to both NES and QES, from hereon.
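One standard quantitative check on the ‘as-if random’ assumption described here is covariate balance between exposed and unexposed groups. The sketch below (simulated data and hypothetical covariate names, not from any study discussed in this paper) computes standardized mean differences, which should be close to zero if allocation resembles randomization.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000

# Simulated pre-exposure covariates and an exposure indicator. In a real NES
# these would come from the study data; the names here are hypothetical.
age = rng.normal(45, 12, size=n)
income = rng.normal(30_000, 8_000, size=n)
exposed = rng.binomial(1, 0.5, size=n)   # allocation assumed "as-if random"

def standardized_mean_difference(x, group):
    """Difference in group means divided by the pooled standard deviation."""
    a, b = x[group == 1], x[group == 0]
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

for name, covariate in [("age", age), ("income", income)]:
    smd = standardized_mean_difference(covariate, exposed)
    # Values near zero (a common rule of thumb is |SMD| < 0.1) are consistent
    # with, though they do not prove, as-if random allocation.
    print(f"standardized mean difference for {name}: {smd:+.3f}")
```

As noted above, such checks address only observed factors and should be supplemented by a qualitative account of the assignment process.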
Medline, Embase and Google Scholar were searched using search terms including quasi-experiment, natural experiment, policy evaluation and public health evaluation and key methodological papers were used to develop this work. Peer-reviewed papers were supplemented by grey literature.
Part 1. Conceptualisations of natural experiments
An analytic approach
Some conceptualisations of NES place their emphasis on the analytic tools that are used to evaluate natural experiments [ 25 , 26 ]. In this conceptualisation NES are understood as being defined by the way in which they are analysed, rather than by their design. An array of different statistical methods is available to analyse natural experiments, including regression adjustments, propensity scores, difference-in-differences, interrupted time series, regression discontinuity, synthetic controls, and instrumental variables. Overviews including strengths and limitations of the different methods are provided in [ 12 , 27 ]. However, an important drawback of this conceptualisation is that it suggests that there is a distinct set of methods for the analysis of NES.
A study design
The popularity of NES has resulted in some conceptual stretching, where the label is applied to a research design that only implausibly meets the definitional features of a NES [ 10 ]. For example, observational studies exploring variation in exposures (rather than the study of an event or change in exposure) have sometimes also been badged as NES. A more stringent classification of NES as a type of study design, rather than a collection of analytic tools, is important because it prevents attempts to incorrectly cover observational studies with a ‘glow of experimental legitimacy’ [ 10 ]. If the design rather than the statistical methodology defines a NES, this allows an open-ended array of statistical tools. These tools are not necessarily constrained by those mentioned above, but could also, for example, include new methods such as synthetic controls that can be utilised to analyse the natural experiments. The choice of appropriate evaluation method should be based on what is most suitable for each particular study, and then depends on the knowledge about the event, the availability of data, and design elements such as its allocation process.
Dunning argues that it is the overall research design, rather than just the statistical methods, that compels conviction when making causal claims. He proposes an evaluation framework for NES along the three dimensions of (1) the plausibility of as-if randomization of treatment, (2) the credibility of causal and statistical models, and (3) the substantive relevance of the treatment. Here, the first dimension is considered key for distinguishing NES from other QES [ 4 ]. NES can be divided into those where a plausible case for ‘as-if random’ assignment can be made (which he defines as NES), and those where confounding from observed factors is directly adjusted for through statistical means. The validity of the latter (which Dunning defines as ‘other quasi experiments’, and we define as ‘weaker NES’) relies on the assumption that unmeasured confounding is absent [ 8 ], and is considered less credible in theory for making causal claims [ 4 ]. In this framework, the ‘as-if-randomised’ NES can be viewed as offering stronger causal evidence than other quasi-experiments. In principle, they offer an opportunity for direct estimates of effects (akin to RCTs) where control for confounding factors would not necessarily be required [ 4 ], rather than relying on adjustment to derive conditional effect estimates [ 10 ]. Of course, the latter may well reach valid and compelling conclusions as well, but causal claims suffer to a higher degree from the familiar threats of bias and unmeasured confounding.
Part 2. A target trial framework for natural experiment studies
In this section, we provide recommendations for evaluation of the ‘as if random’ assumption and provide a unifying Target Trial Framework for NES, which brings together key sets of criteria that can be used to appraise the strength of causal claims from NES and assist with study design and reporting.
In public health, there is considerable overlap between analytic and design-based uses of the term NES. Nevertheless, we argue that if we consider NES a type of study design, causal inference can be strengthened by clear appraisal of the likelihood of ‘as-if’ random allocation of exposure. This should be demonstrated by both empirical evidence and by knowledge and reasoning about the causal question and substantive domain under question [ 8 , 10 ]. Because the concept of ‘as-if’ randomization is difficult, if not impossible to prove, it should be thought of along a ‘continuum of plausibility’ [ 10 ]. Specifically, for claims of ‘as-if’ randomization to be plausible, it must be demonstrated that the variables that determine treatment assignment are exogenous. This means that they are: i) strongly correlated with treatment status but are not caused by the outcome of interest (i.e. no reverse causality) and ii) independent of any other (measured or unmeasured) causes of the outcome of interest [ 8 ].
Given this additional layer of justification, especially with respect to the qualitative knowledge of the assignment process and domain knowledge from practitioners more broadly, we argue where feasible for the involvement of practitioners. This could, for example, be formalized through co-production in which members of the public and policy makers are involved in the development of the evaluation. If we appraise NES as a type of study design, which distinguish themselves from other designs because i) there is a particular change in exposure that is evaluated and ii) causal claims are supported by an argument of the plausibility of as-if randomization, then we guard against conflating NES with other observational designs [ 10 , 28 ].
There is a range of ways of dealing with the problems of selection on measured and unmeasured confounders in NES [ 8 , 10 ] which can be understood in terms of a ‘target trial’ we are trying to emulate, had randomization been possible [ 29 ]. The protocol of a target trial describes seven components common to RCTs (‘eligibility criteria’, ‘treatment strategies’, ‘assignment procedures’, ‘follow-up period’, ‘outcome’, ‘causal contrasts of interest’, and the ‘analysis plan’), and provides a systematic way of improving, reporting and appraising NES relative to a ‘gold standard’ (but often not feasible in practice) trial. In the design phase of a NES deviations from the target trial in each domain can be used to evaluate where improvements and where concessions will have to be made. This same approach can be used to appraise existing NES. The target trial framework also provides a structured way for reporting NES, which will facilitate evaluation of the strength of NES, improve consistency and completeness of reporting, and benefit evidence syntheses.
In Table 1 , we bring together elements of the Target Trial framework and conceptualisations of NES to derive a framework to describe the Target Trial for NES [ 12 ]. By encouraging researchers to address the questions in Table 1 , the framework provides a structured approach to the design, reporting and evaluation of NES across the seven target trial domains. Table 1 also provides recommendations to improve the strength of causal claims from NES, focussing primarily on sensitivity analyses to improve internal validity.
An illustrative example of a well-developed NES based on the criteria outlined in Table 1 is by Reeves et al. [ 39 ]. The NES evaluates the impact of the introduction of a National Minimum Wage on mental health. The study compared a clearly defined intervention group of recipients of a wage increase up to 110% of pre-intervention wage with clearly defined control groups of (1) people ineligible to the intervention because their wage at baseline was just above (100–110%) minimum wage and (2) people who were eligible, but whose companies did not comply and did not increase minimum wage. This study also included several sensitivity tests to strengthen causal arguments. We have aligned this study to the Target Trial framework in Additional file 1 .
The Target Trial Approach for NES (outlined in Table 1 ) provides a straightforward approach to improve, report, and appraise existing NES and to assist in the design of future studies. It focusses on structural design elements and goes beyond the use of quantitative tools alone to assess internal validity [ 12 ]. This work complements the ROBINS-I tool for assessing risk of bias in non-randomised studies of interventions, which similarly adopted the Target Trial framework [ 40 ]. Our approach focusses on the internal validity of a NES, with issues of construct and external validity being outside of the scope of this work (guidelines for these are provided in for example [ 41 ]). It should be acknowledged that less methodologically robust studies can still reach valid and compelling conclusions, even without resembling the notional target trial. However, we believe that drawing on the target trial framework helps highlight occasions when causal inference can be made more confidently.
And finally, the framework does explicitly exclude observational studies that aim to investigate the effects of changes in behaviour without an externally forced driver of those changes. For example, although a cohort study can in principle be the basis for the evaluation of a NES, a change in diet by some participants (compared with those who did not change their diet) is not an external (i.e. exogenous) cause and does not fall within the definition of an experiment [ 11 ]. However, such studies are likely to be more convincing than those which do not study within-person changes, and we note that the statistical methods used may be similar to those of NES.
Despite their advantages, NES remain based on observational data and thus biases in assignment of the intervention can never be completely excluded (although for plausibly ‘as if randomised’ natural experiments these should be minimal). It is therefore important that a robust assessment of different potential sources of bias is reported. It has additionally been argued that sensitivity analyses are required to assess whether a pattern of small biases could explain away any ostensible effect of the intervention, because confidence intervals and statistical tests do not do this [ 14 ]. Recommendations that would improve the confidence with which we can make causal claims from NES, derived from work by Rosenbaum [ 14 ], have been outlined in Table 1 . Although sensitivity analyses can place plausible limits on the size of the effects of hidden biases, because such analyses are susceptible to assumptions about the maximum size of omitted biases, they cannot completely rule out residual bias [ 34 ]. Of importance for the strength of causal claims therefore, is the triangulation of NES with other evaluations using different data or study designs susceptible to different sources of bias [ 5 , 42 ].
None of the recommendations outlined in Table 1 will by themselves eliminate bias in a NES, but neither is it required to implement all of them to be able to make a causal claim with some confidence. Instead, a continuum of confidence in the causal claims based on the study design and the data is a more appropriate and practical approach [ 43 ]. Each sensitivity analysis aims to minimise ambiguity of a particular potential bias or biases, and as such a combination of selected sensitivity analyses can strengthen causal claims [ 14 ]. We would generally, but not strictly, consider a well conducted RCT as the design where we are most confident about such claims, followed by natural experiments, and then other observational studies; this would be an extension of the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) framework [ 44 ]. GRADE provides a system for rating the quality (or certainty) of a body of evidence and grading the strength of recommendations for use in systematic reviews, health technology assessments (HTAs), and clinical practice guidelines. It typically only distinguishes between trials and observational studies when making these judgments (note however, that recent guidance does not make this explicit distinction when using ROBINS-I [ 45 ]). Given the increased contribution of NES in public health, especially those based on routine data [ 37 ], the specific inclusion of NES in this system might improve the rating of the evidence from these study designs.
Our recommendations are of particular importance for ensuring rigour in the context of (public) health research where natural experiments have become increasingly popular for a variety of reasons, including the availability of large routinely collected datasets [ 37 ]. Such datasets invite the discovery of natural experiments, even where the data may not be particularly applicable to this design, but also these enable many of the sensitivity analyses to be conducted from within the same dataset or through linkage to other routine datasets.
Finally, alignment to the Target Trial Framework also links natural experiment studies directly to other measures of trial validity, including pre-registration, reporting checklists, and evaluation through risk-of-bias-tools [ 40 ]. This aligns with previous recommendations to use established reporting guidelines such as STROBE, TREND [ 12 ], and TIDieR-PHP [ 46 ] for the reporting of natural experiment studies. These reporting guidelines could be customized to specific research areas (for example, as developed for a systematic review of quasi-experimental studies of prenatal alcohol use and birthweight and neurodevelopment [ 47 ]).
We provide a conceptualisation of natural experiment studies as they apply to public health. We argue for the appreciation of natural experiments as a type of study design rather than a set of tools for the analyses of non-randomised interventions. Although there will always remain some ambiguity about the strength of causal claims, there are clear benefits to harnessing NES rather than relying purely on observational studies. This includes the fact that NES can be based on routinely available data and that timely evidence of real-world relevance can be generated. The inclusion of a discussion of the plausibility of as-if randomization of exposure allocation will provide further confidence in the strength of causal claims.
Aligning NES to the Target Trial framework will guard against conceptual stretching of these evaluations and ensure that the causal claims about whether public health interventions ‘work’ are based on evidence that is considered ‘good enough’ to inform public health action within a ‘practice-based evidence’ framework. This framework describes how evaluations can help reducing critical uncertainties and adjust the compass bearing of existing policy (in contrast to the ‘evidence-based practice’ framework in which RCTs are used to generate ‘definitive’ evidence for particular interventions) [ 48 ].
Availability of data and materials
Data sharing is not applicable to this article as no datasets were generated or analysed during the current study.
Abbreviations
RCT: Randomised Controlled Trial
NES: Natural Experiment
SUTVA: Stable Unit Treatment Value Assumption
ITT: Intention-To-Treat
Shadish WR, Cook TD, Campbell DT. Experimental and Quasi-Experimental Designs. 2nd ed. Wadsworth, Cengage Learning: Belmont; 2002.
King G, Keohane RO, Verba S. The importance of research Design in Political Science. Am Polit Sci Rev. 1995;89:475–81.
Meyer BD. Natural and quasi-experiments in economics. J Bus Econ Stat. 1995;13:151–61.
Dunning T. Natural experiments in the social sciences. A design-based approach. 6th edition. Cambridge: Cambridge University Press; 2012.
Craig P, Cooper C, Gunnell D, Haw S, Lawson K, Macintyre S, et al. Using natural experiments to evaluate population health interventions: new medical research council guidance. J Epidemiol Community Health. 2012;66:1182–6.
Cook TD, Shadish WR, Wong VC. Three conditions under which experiments and observational studies produce comparable causal estimates: new findings from within-study comparisons. J Policy Anal Manag. 2008;27:724–50.
Bärnighausen T, Røttingen JA, Rockers P, Shemilt I, Tugwell P. Quasi-experimental study designs series—paper 1: introduction: two historical lineages. J Clin Epidemiol. 2017;89:4–11.
Waddington H, Aloe AM, Becker BJ, Djimeu EW, Hombrados JG, Tugwell P, et al. Quasi-experimental study designs series—paper 6: risk of bias assessment. J Clin Epidemiol. 2017;89:43–52.
Saeed S, Moodie EEM, Strumpf EC, Klein MB. Evaluating the impact of health policies: using a difference-in-differences approach. Int J Public Health. 2019;64:637–42.
Dunning T. Improving causal inference: strengths and limitations of natural experiments. Polit Res Q. 2008;61:282–93.
Bärnighausen T, Tugwell P, Røttingen JA, Shemilt I, Rockers P, Geldsetzer P, et al. Quasi-experimental study designs series—paper 4: uses and value. J Clin Epidemiol. 2017;89:21–9.
Craig P, Katikireddi SV, Leyland A, Popham F. Natural experiments: an overview of methods, approaches, and contributions to public health intervention research. Annu Rev Public Health. 2017;38:39–56.
Pearl J, Mackenzie D. The book of why: the new science of cause and effect. London: Allen Lane; 2018.
Rosenbaum PR. How to see more in observational studies: some new quasi-experimental devices. Annu Rev Stat Its Appl. 2015;2:21–48.
Petimar J, Ramirez M, Rifas-Shiman SL, Linakis S, Mullen J, Roberto CA, et al. Evaluation of the impact of calorie labeling on McDonald’s restaurant menus: a natural experiment. Int J Behav Nutr Phys Act. 2019;16. Article no: 99.
Fergusson DM, Horwood LJ, Boden JM, Mulder RT. Impact of a major disaster on the mental health of a well-studied cohort. JAMA Psychiatry. 2014;71:1025–31.
Remler DK, Van Ryzin GG. Natural and quasi experiments. In: Research methods in practice: strategies for description and causation. 2nd ed. Thousand Oaks: SAGE Publication Inc.; 2014. p. 467–500.
Cook PA, Hargreaves SC, Burns EJ, De Vocht F, Parrott S, Coffey M, et al. Communities in charge of alcohol (CICA): a protocol for a stepped-wedge randomised control trial of an alcohol health champions programme. BMC Public Health. 2018;18. Article no: 522.
Lumey LH, Stein AD, Kahn HS, Van der Pal-de Bruin KM, Blauw GJ, Zybert PA, et al. Cohort profile: the Dutch hunger winter families study. Int J Epidemiol. 2007;36:1196–204.
Meng X, Qian N. The Long Term Consequences of Famine on Survivors: Evidence from a Unique Natural Experiment using China’s Great Famine. Natl Bur Econ Res Work Pap Ser. 2011;NBER Worki.
Franco M, Bilal U, Orduñez P, Benet M, Morejón A, Caballero B, et al. Population-wide weight loss and regain in relation to diabetes burden and cardiovascular mortality in Cuba 1980-2010: repeated cross sectional surveys and ecological comparison of secular trends. BMJ. 2013;346:f1515.
Angrist J, Bettinger E, Bloom E, King E, Kremer M. Vouchers for private schooling in Colombia: evidence from a randomized natural experiment. Am Econ Rev. 2002;92:1535–58.
Angrist JD. Lifetime earnings and the Vietnam era draft lottery: evidence from social security administrative records. Am Econ Rev. 1990;80:313–36.
Dawson A, Sim J. The nature and ethics of natural experiments. J Med Ethics. 2015;41:848–53.
Bärnighausen T, Oldenburg C, Tugwell P, Bommer C, Ebert C, Barreto M, et al. Quasi-experimental study designs series—paper 7: assessing the assumptions. J Clin Epidemiol. 2017;89:53-66.
Tugwell P, Knottnerus JA, McGowan J, Tricco A. Big-5 Quasi-Experimental designs. J Clin Epidemiol. 2017;89:1–3.
Reeves BC, Wells GA, Waddington H. Quasi-experimental study designs series—paper 5: a checklist for classifying studies evaluating the effects on health interventions—a taxonomy without labels. J Clin Epidemiol. 2017;89:30–42.
Rubin DB. For objective causal inference, design trumps analysis. Ann Appl Stat. 2008;2:808–40.
Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. Am J Epidemiol. 2016;183:758–64.
Benjamin-Chung J, Arnold BF, Berger D, Luby SP, Miguel E, Colford JM, et al. Spillover effects in epidemiology: parameters, study designs and methodological considerations. Int J Epidemiol. 2018;47:332–47.
Munafò MR, Tilling K, Taylor AE, Evans DM, Smith GD. Collider scope: when selection bias can substantially influence observed associations. Int J Epidemiol. 2018;47:226–35.
Schwartz S, Gatto NM, Campbell UB. Extending the sufficient component cause model to describe the stable unit treatment value assumption (SUTVA). Epidemiol Perspect Innov. 2012;9:3.
Cawley J, Thow AM, Wen K, Frisvold D. The economics of taxes on sugar-sweetened beverages: a review of the effects on prices, sales, cross-border shopping, and consumption. Annu Rev Nutr. 2019;39:317–38.
Reichardt CS. Nonequivalent Group Designs. In: Quasi-Experimentation. A Guide to Design and Analysis. 1st edition. New York: The Guildford Press; 2019. p. 112–162.
Denzin N. Sociological methods: a sourcebook. 5th ed. New York: Routledge; 2006.
Matthay EC, Hagan E, Gottlieb LM, Tan ML, Vlahov D, Adler NE, et al. Alternative causal inference methods in population health research: evaluating tradeoffs and triangulating evidence. SSM Popul Health. 2020;10:10052.
Leatherdale ST. Natural experiment methodology for research: a review of how different methods can support real-world research. Int J Soc Res Methodol. 2019;22:19–35.
Reichardt CS. Quasi-experimentation. A guide to design and analysis. 1st ed. New York: The Guildford Press; 2019.
Reeves A, McKee M, Mackenbach J, Whitehead M, Stuckler D. Introduction of a National Minimum Wage Reduced Depressive Symptoms in Low-Wage Workers: A Quasi-Natural Experiment in the UK. Health Econ. 2017;26:639–55.
Sterne JA, Hernán MA, Reeves BC, Savović J, Berkman ND, Viswanathan M, et al. ROBINS-I: a tool for assessing risk of bias in non-randomised studies of interventions. BMJ. 2016;355:i4919.
Shadish WR, Cook TD, Campbell DT. Generalized Causal Inference: A Grounded Theory. In: Experimental and Quasi-Experimental Designs for Generalized Causal Inference. 2nd ed. Belmont: Wadsworth, Cengage Learning; 2002. p. 341–73.
Lawlor DA, Tilling K, Smith GD. Triangulation in aetiological epidemiology. Int J Epidemiol. 2016;45:1866–86.
Hernán MA. The C-word: scientific euphemisms do not improve causal inference from observational data. Am J Public Health. 2018;108:616–9.
Guyatt G, Oxman AD, Akl EA, Kunz R, Vist G, Brozek J, et al. GRADE guidelines: 1. Introduction - GRADE evidence profiles and summary of findings tables. J Clin Epidemiol. 2011;64:383–94.
Schünemann HJ, Cuello C, Akl EA, Mustafa RA, Meerpohl JJ, Thayer K, et al. GRADE guidelines: 18. How ROBINS-I and other tools to assess risk of bias in nonrandomized studies should be used to rate the certainty of a body of evidence. J Clin Epidemiol. 2019;111:105–14.
Campbell M, Katikireddi SV, Hoffmann T, Armstrong R, Waters E, Craig P. TIDieR-PHP: a reporting guideline for population health and policy interventions. BMJ. 2018;361:k1079.
Mamluk L, Jones T, Ijaz S, Edwards HB, Savović J, Leach V, et al. Evidence of detrimental effects of prenatal alcohol exposure on offspring birthweight and neurodevelopment from a systematic review of quasi-experimental studies. Int J Epidemiol. 2021;49(6):1972-95.
Ogilvie D, Adams J, Bauman A, Gregg EW, Panter J, Siegel KR, et al. Using natural experimental studies to guide public health action: turning the evidence-based medicine paradigm on its head. J Epidemiol Community Health. 2019;74:203–8.
Acknowledgements
This study is funded by the National Institute for Health Research (NIHR) School for Public Health Research (Grant Reference Number PD-SPH-2015). The views expressed are those of the author(s) and not necessarily those of the NIHR or the Department of Health and Social Care. The funder had no input in the writing of the manuscript or decision to submit for publication. The NIHR School for Public Health Research is a partnership between the Universities of Sheffield; Bristol; Cambridge; Imperial; and University College London; the London School of Hygiene and Tropical Medicine (LSHTM); LiLaC – a collaboration between the Universities of Liverpool and Lancaster; and Fuse – the Centre for Translational Research in Public Health, a collaboration between Newcastle, Durham, Northumbria, Sunderland and Teesside Universities. FdV is partly funded by the National Institute for Health Research Applied Research Collaboration West (NIHR ARC West) at University Hospitals Bristol NHS Foundation Trust. SVK and PC acknowledge funding from the Medical Research Council (MC_UU_12017/13) and the Scottish Government Chief Scientist Office (SPHSU13). SVK acknowledges funding from an NRS Senior Clinical Fellowship (SCAF/15/02). KT works in the MRC Integrative Epidemiology Unit, which is supported by the Medical Research Council (MRC) and the University of Bristol [MC_UU_00011/3].
Author information
Authors and affiliations.
Population Health Sciences, Bristol Medical School, University of Bristol, Canynge Hall, 39 Whatley Road, Bristol, BS8 2PS, UK
Frank de Vocht, Cheryl McQuire, Kate Tilling & Matthew Hickman
NIHR School for Public Health Research, Newcastle, UK
Frank de Vocht & Cheryl McQuire
NIHR Applied Research Collaboration West, Bristol, UK
Frank de Vocht
MRC/CSO Social and Public Health Sciences Unit, University of Glasgow, Glasgow, UK
Srinivasa Vittal Katikireddi & Peter Craig
MRC IEU, University of Bristol, Bristol, UK
Kate Tilling
Contributions
FdV conceived of the study. FdV, SVK, CMQ, KT, MH, and PC interpreted the evidence and theory. FdV wrote the first version of the manuscript. SVK, CMQ, KT, MH, and PC provided substantive revisions to subsequent versions. All authors have read and approved the manuscript. FdV, SVK, CMQ, KT, MH, and PC agreed to be personally accountable for their own contributions and will ensure that questions related to the accuracy or integrity of any part of the work, even ones in which the author was not personally involved, are appropriately investigated, resolved, and the resolution documented in the literature.
Corresponding author
Correspondence to Frank de Vocht .
Ethics declarations
Ethics approval and consent to participate.
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s note.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
Online Supplementary Material. Table 1. The Target Trial for Natural Experiments and Reeves et al. [28]: alignment of Reeves et al. (Introduction of a National Minimum Wage Reduced Depressive Symptoms in Low-Wage Workers: A Quasi-Natural Experiment in the UK. Health Econ. 2017;26:639–55) to the Target Trial framework.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article.
de Vocht, F., Katikireddi, S.V., McQuire, C. et al. Conceptualising natural and quasi experiments in public health. BMC Med Res Methodol 21, 32 (2021). https://doi.org/10.1186/s12874-021-01224-x
Download citation
Received : 14 July 2020
Accepted : 28 January 2021
Published : 11 February 2021
DOI : https://doi.org/10.1186/s12874-021-01224-x
- Public health
- Public health policy
- Natural experiments
- Quasi experiments
- Evaluations
CAREER FEATURE
02 August 2021
Pandemic upheaval offers a huge natural experiment
Julia Rosen
Julia Rosen is a freelance journalist in Portland, Oregon.
Soon after COVID-19 lockdowns began in March 2020, physicians in certain nations noticed something unexpected: the number of premature births seemed to plummet. Preliminary research in one region of Ireland documented a 73% decrease in very-low-birth-weight babies [1]. And scientists in Denmark measured a roughly 90% country-wide drop in extremely premature births compared with the previous five years [2]. In Nepal, however, researchers reported [3] that the risk of preterm birth — before 37 weeks of gestation — jumped by 30% during lockdown, a pandemic trend that scientists expect to find in other economically disadvantaged nations. In some countries, reports of increased numbers of stillbirths further complicated the picture [4].
Nature 596 , 149-151 (2021)
doi: https://doi.org/10.1038/d41586-021-02092-7
Philip, R. K. et al. BMJ Glob. Health 5 , e003075 (2020).
Hedermann, G. Preprint at medRxiv https://doi.org/10.1101/2020.05.22.20109793 (2020).
Ashish, K. C. et al. Lancet Glob. Health 8 , e1273–e1281 (2020).
Chmielewska, B. et al. Lancet Glob. Health 9 , e759–e772 (2021).
Stock, S. J. et al. Wellcome Open Res. 6 , 21 (2021).
Rosales-Rueda, M. J. Health Econ. 62 , 13–44 (2018).
Leon, D. A. et al. Lancet 350 , 383–388 (1997).
Franco, M. et al. BMJ 346 , f1515 (2013).
Thomson, B. Circulation 142 , 14–16 (2020).
Gaubert, B. et al. J. Geophys. Res. Atmos. 126 , e2020JD034213 (2021).
Schmidt, S. C. E. et al. Sci. Rep. 10 , 21780 (2020).
Soto, E. H. et al. Biol. Conserv. 255 , 108972 (2021).
Abe, K., Miyawaki, A., Nakamura, M., Ninomiya, H. & Kobayashi, Y. J. Allergy Clin. Immunol. Pract. 9 , 494–496 (2021).
Kenyon, C. C., Hill, D. A., Henrickson, S. E., Bryant‑Stephens, T. C. & Zorc, J. J. J. Allergy Clin. Immunol. Pract. 8 , 2774–2776 (2021).
Amegah, A. K. Lancet Glob. Health 8 , e1110–e1111 (2020).
Silva-Rodríguez, E. A., Gálvez, N., Swan, G. J. F., Cusack, J. J. & Moreira-Arce, D. Sci. Total Environ. 765 , 142713 (2021).
Bates, A. E., Primack, R. B., Moraga, P. & Duarte, C. M. Biol. Conserv. 248 , 108665 (2020).
Bates, A. E., Primack, R. B., PAN-Environment Working Group & Duarte, C. M. Biol. Conserv. https://doi.org/10.1016/j.biocon.2021.109175 (2021).
How to identify natural experiments?
The motivation of this question is learn what to look for in deciding whether observational data can be used for causal inference (in the sense of Pearl).
Wikipedia describes a natural experiment :
A natural experiment is an empirical study in which individuals (or clusters of individuals) are exposed to the experimental and control conditions that are determined by nature or by other factors outside the control of the investigators. The process governing the exposures arguably resembles random assignment. Thus, natural experiments are observational studies and are not controlled in the traditional sense of a randomized experiment (an intervention study). Natural experiments are most useful when there has been a clearly defined exposure involving a well defined subpopulation (and the absence of exposure in a similar subpopulation) such that changes in outcomes may be plausibly attributed to the exposure. In this sense, the difference between a natural experiment and a non-experimental observational study is that the former includes a comparison of conditions that pave the way for causal inference, but the latter does not.
I find this too vague to translate into actions or decisions.
This post gives a similar description:
The difference is that a natural experiment is an observational study in which it just so happens that there is something that is effectively a randomization.
So the key ingredient seems to be about whether exposure to a condition was randomized. The same post gives an example:
A nice clean example is studying the effects of suddenly having lots of money by comparing lottery winners against a representative random sample of people that bought tickets for the same drawing. Effectively, someone else did the randomization for you and it very clearly had nothing to do with some characteristic of the people whether they won or not.
Indeed, this example has randomization explicitly by construction. Looking through the examples on the Wikipedia page, I find that some of them are similarly obvious (e.g., lotteries by design), while others I am less sure of.
A phrase from the wiki that doesn't seem to entirely match the above descriptions is:
While game shows might seem to be artificial contexts, they can be considered natural experiments due to the fact that the context arises without interference of the scientist.
I wouldn't think that the scientists not interfering would be enough to conclude that an observational study is a natural experiment.
What should we look for in deciding on whether exposure in an observational study was "randomly assigned"?
- observational-study
Randomisation is used because it breaks back-door paths into the effect of interest. The important thing is not randomisation itself - it's the absence of back-door paths . That is, there are no confounders, a.k.a. common causes, that influence both the treatment and the effect. (And no selection bias.)
Causal processes are almost never fully observed, so it's hard to be certain no back-door paths exist unless you randomise. However, in natural experiments we can argue that a back-door path is:
- Unlikely to operate,
- Controlled for (i.e. there is an observed covariate we can condition on to block the path), or
- Too weak to account for the observed effect.
One example given on Wikipedia is John Snow's investigation of cholera mortality among customers of the Southwark and Vauxhall Waterworks Company, vs the Lambeth Waterworks Company. Water service was not assigned randomly, but renters were usually constrained to use whichever company had run pipes to their home, and many had no idea which company they were actually paying. The two distributors serviced slightly different areas, so Snow looked at mortality in the areas where they overlapped (controlling for the possible geographic confounder). There was a striking difference in cholera rates between the customers of each company, too large to explain by other differences between the two groups.
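To make the back-door reasoning concrete, here is a minimal Python sketch (my addition, not from the original answer) that encodes Snow's comparison as a toy causal diagram and lists the back-door paths from treatment to outcome. The node names and edges are illustrative assumptions only, with "district" standing in for the geographic confounder; the path search walks the undirected skeleton and ignores collider structure, which is adequate for a graph this small.

```python
# Toy DAG for Snow's comparison (illustrative assumptions, not historical data)
edges = [
    ("district", "company"),   # where you live constrains which company supplies you
    ("district", "cholera"),   # area-level factors may also affect cholera mortality
    ("company", "cholera"),    # the causal effect of interest
]

def backdoor_paths(treatment, outcome, edges):
    """List paths from treatment to outcome whose first edge points INTO the
    treatment node - the defining feature of a back-door path. Collider
    structure is ignored, which is fine for this tiny example graph."""
    paths = []

    def walk(node, visited, path, starts_into_treatment):
        if node == outcome:
            if starts_into_treatment:
                paths.append(path)
            return
        for a, b in edges:
            if node not in (a, b):
                continue
            nxt = b if node == a else a
            if nxt in visited:
                continue
            first = starts_into_treatment
            if len(path) == 1:              # taking the first step away from the treatment
                first = (b == treatment)    # True only if this edge points into the treatment
            walk(nxt, visited | {nxt}, path + [nxt], first)

    walk(treatment, {treatment}, [treatment], False)
    return paths

print(backdoor_paths("company", "cholera", edges))
# [['company', 'district', 'cholera']]
# Conditioning on "district" (Snow's restriction to the overlapping areas) blocks
# this path, leaving only the direct company -> cholera effect.
```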
If we're not completely certain of a single natural experiment, we can use convergent evidence to strengthen the case. By itself, the Southwark & Vauxhall vs Lambeth experiment might not have sufficed to show that cholera was waterborne. But in combination with other natural experiments - e.g. mortality among people who drank from the Broad St Pump vs other pumps - Snow had a strong argument.
Natural Experiments and Causality in Economic History: On Their Relations to Theory and Temporality
Published online by Cambridge University Press: 21 June 2021
A recent and influential research methodology, mainly endorsed by economists, proposes to renew historical analysis based on the notions of natural experiment and causality. It has the dual ambition of unifying various disciplines around a common understanding of causality in order to tackle major historical questions (such as the role of colonization, political regimes, or religion in economic development) and of making the analysis of history more scientific. The definition of causality it promotes—of the “interventionist” type—tends to liken historical events to laboratory experiments. This is articulated with a neo-institutionalist perspective aimed at measuring the long-term effects of past institutional changes, which are considered exogenous. In the first part of this article, we present the ambitions, contributions, methods, and hypotheses (implicit and explicit) of this approach, showing how it differs from more traditional quantitative economic history and placing it in the context of the recent empirical and neo-institutionalist “turns” of the economic discipline. In a second stage, we consider the criticism—often scathing—voiced by historians or economists against this method and its objectives. Finally, we emphasize the many difficulties posed by this approach when it comes to taking into account the historicity of phenomena, to producing general statements based on particular cases, and to providing a complete and coherent definition of causality in history.
This article was translated from the French by Amy Jacobs-Colas and edited by Robin Emlein, Chloe Morgan, and Nicolas Barreyre.
An earlier version of this text was presented at two conferences, one entitled “Les expériences de pensée” (ENS ULM, 2014), the other “Histoire et causalité” (EHESS, 2015). Our thanks to the conference organizers and participants for their many comments and suggestions. Denis Cogneau, Sophie Cras, Claude Diebolt, Jacques Revel, and Alain Trannoy provided helpful, detailed remarks on earlier versions, and our discussions with Guillaume Calafat and François Keslair were extremely valuable. We alone are responsible for the interpretations and claims put forward here.
1 Maurice Lévy-Leboyer, “La ‘New Economic History,’” Annales ESC 24, no. 5 (1969): 1035 – 69; Jean Heffer, “Une histoire scientifique. La nouvelle histoire économique,” Annales ESC 32, no. 4 (1977): 824 – 42; Claudia Goldin, “Cliometrics and the Nobel,” Journal of Economic Perspectives 9, no. 2 (1995): 191 – 208. According to several recent, more or less critical essays, the natural experiments movement in history represents either the continuation or the potential successor of new economic history. Those texts include Peter Temin, “The Rise and Fall of Economic History at MIT,” History of Political Economy 46, no. 1 (2014): 337 – 50; Temin, “Economic History and Economic Development: New Economic History in Retrospect and Prospect,” in Handbook of Cliometrics, ed. Claude Diebolt and Michael Haupert (Berlin: Springer, 2016), 33 – 51; Francesco Boldizzoni, The Poverty of Clio: Resurrecting Economic History (Princeton: Princeton University Press, 2011). There has also been lively debate on the relevance of dubbing the many applications of natural experiments to African history “the new economic history of Africa.” For an introduction to this issue, see Morten Jerven, “A Clash of Disciplines? Economists and Historians Approaching the African Past,” Economic History of Developing Regions 26, no. 2 (2011): 111 – 24; Denis Cogneau, “The Economic History of Africa: Renaissance or False Dawn?” Annales HSS (English Edition) 71, no. 4 (2016): 539 – 56.
2 Ran Abramitzky, “Economics and the Modern Economic Historian,” Journal of Economic History 75, no. 4 (2015): 1240 – 51, here pp. 1245 – 47.
3 Jared Diamond and James A. Robinson, eds., Natural Experiments of History (Cambridge: Belknap Press of Harvard University Press, 2010).
4 Randolph Roth, “Scientific History and Experimental History,” Journal of Interdisciplinary History 43, no. 3 (2013): 443 – 58.
5 Joel Mokyr, review of Diamond and Robinson, Natural Experiments of History , American Historical Review 116, no. 3 (2011): 752 – 55. Mokyr concludes by stating that if the approach reined in its ambitions, it could in fact lead to better interdisciplinary use of comparative history, as advocated by Marc Bloch.
6 Abramitzky, “Economics.”
7 This obviously does not mean that any economics study based on this type of causality necessarily fits into the neo-institutionalist paradigm. Here we detail the specific problems involved in applying to history natural experiment methods that draw on an underlying theory of institutions, and explain how these problems differ from the ones raised in criticisms of this type of causal approach applied in other fields.
8 Jo Guldi and David Armitage, The History Manifesto (Cambridge: Cambridge University Press, 2014), 110; and the dossier entitled “Debating the Longue Durée ,” Annales HSS (English Edition) 70, no. 2 (2015): 215 – 303.
9 Michel De Vroey and Luca Pensieroso, “The Rise of a Mainstream in Economics” (IRES discussion paper no. 26, Institute for Economic and Social Research, Université catholique de Louvain, 2016), 2 – 27; Matthew T. Panhans and John D. Singleton, “The Empirical Economist’s Toolkit: From Models to Methods” (working paper no. 3, Center for the History of Political Economy, Duke University, Durham, 2015), 1 – 27.
10 Joshua D. Angrist and Alan B. Krueger, “Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments,” Journal of Economic Perspectives 15, no. 4 (2001): 69 – 85; Joshua D. Angrist and Jörn-Steffen Pischke, “The Credibility Revolution in Empirical Economics: How Better Research Design Is Taking the Con out of Econometrics,” Journal of Economic Perspectives 24, no. 2 (2010): 3 – 30. For a discussion of these methods as applied in political science, see Allison J. Sovey and Donald P. Green, “Instrumental Variables Estimation in Political Science: A Readers’ Guide,” American Journal of Political Science 55, no. 1 (2011): 188 – 200; Jasjeet S. Sekhon and Rocío Titiunik, “When Natural Experiments are Neither Natural nor Experiments,” American Political Science Review 106, no. 1 (2012): 35 – 57.
11 Angrist and Pischke, “The Credibility Revolution,” 5.
12 Heffer, “Une histoire scientifique,” 824 – 25.
13 Abramitzky, “Economics.”
14 In Angrist and Pischke’s manifesto “The Credibility Revolution in Empirical Economics,” the possible applications of this method to macroeconomics are largely drawn from studies using natural experiments in economic history.
15 This connection is studied by Boldizzoni in The Poverty of Clio.
16 In Why Nations Fail: The Origins of Power, Prosperity, and Poverty (New York: Crown Business, 2013), Daron Acemoglu and James A. Robinson identify two types of institutions said to be observable in different periods and regions: “extractive” and “inclusive.”
17 For a comprehensive introduction to neo-institutionalist method and how it is applied in economic history, see Avner Greif, Institutions and the Path to the Modern Economy: Lessons from Medieval Trade (Cambridge: Cambridge University Press, 2006); and Robert Boyer’s critical assessment, “Historiens et économistes face à l’émergence des institutions du marché,” Annales HSS 64, no. 3 (2009): 665 – 93. See also Guillaume Calafat, “Familles, réseaux et confiance dans l’économie de l’époque moderne. Diasporas marchandes et commerce interculturel,” Annales HSS 66, no. 2 (2011): 513 – 31. In Why Nations Fail , Acemoglu and Robinson present both their neo-institutionalist notion and the distinction between extractive and inclusive institutions, a distinction which builds on that made by Douglass North in his last writings; see especially Douglass C. North, John Joseph Wallis, and Barry R. Weingast, Violence and Social Orders: A Conceptual Framework for Interpreting Recorded Human History (Cambridge: Cambridge University Press, 2009). For a critical discussion of the filiation between North, Wallis, and Weingast and their references, see Martin Daunton, “Rationality and Institutions: Reflections on Douglass North,” Structural Change and Economic Dynamics 21, no. 2 (2010): 147 – 56.
18 Technically, exogeneity is ensured when the independent variable in an econometric model is not correlated with the residuals. In addition to the two conditions stated above, measurement errors in the independent variable can also lead to correlation with the residual.
19 An experiment is understood to be “controlled” when the researcher designs it and intervenes. An example might be comparing a group of students who benefited from an education reform with another group who were not exposed to it.
20 Angrist and Pischke, “The Credibility Revolution.”
21 Ibid . See also Mark R. Rosenzweig and Kenneth I. Wolpin, “Natural ‘Natural Experiments’ in Economics,” Journal of Economic Literature 38, no. 4 (2000): 827 – 74. For an epistemological discussion of instrumental variables, see Julian Reiss, “Causal Instrumental Variables and Interventions,” Philosophy of Science 72, no. 5 (2005): 964 – 76.
22 Angrist and Pischke, “The Credibility Revolution.”
23 It goes without saying that total control of outside parameters is seldom possible, even in so-called “randomized controlled” experiments: see Agnès Labrousse, “Learning From Randomized Controlled Experiments: The Narrative of Scientificity, Practical Complications, Historical Experience,” La vie des idées 2016, https://booksandideas.net/Learning-from-Randomized-Controlled-Experiments.html ; Angus Deaton and Nancy Cartwright, “Understanding and Misunderstanding Randomized Controlled Trials” (NBER working paper no. 22595, National Bureau of Economic Research, Cambridge, 2016). Experimental economics makes extensive use of laboratory experiments with individuals to test economic theories about behavior but, paradoxically, seldom uses the notion of causality as employed by applied economics in studies on economic policy realized outside of a laboratory setting. More clarifications are needed on the sources of the notion that laboratory experiments are the ideal study situation; here we simply analyze the presuppositions underlying the definition of causality associated with the laboratory reference.
24 Diamond and Robinson, Natural Experiments , 1.
26 Adopting a temporal notion of causality (cause precedes effect) does not mean assuming that researchers can fully account for temporality itself (the succession of events and whole chains of cause and effect). See the second half of the present article.
27 For a deeper analysis of this postulate of the temporal asymmetry of causes and effects and the conceptual possibility of reverse causality, see Bourgeois-Gironde, Temps et causalité (Paris: Presses universitaires de France, 2002).
28 The paradigmatic example is price determination in the neoclassical model.
29 For a particularly well-formulated version of this definition, see James Woodward, Making Things Happen: A Theory of Causal Explanation (Oxford: Oxford University Press, 2005). For a discussion of instrumental variables in that framework, see Reiss, “Causal Instrumental Variables.”
30 Judea Pearl, “Causal Inference,” in “Causality: Objectives and Assessment,” ed. Isabelle Guyon, Dominik Janzing, and Bernhard Schölkopf, Proceedings of Machine Learning Research , vol. 6 (2010): 39 – 58.
31 As can be seen in the articles discussed here, specifically in those by Angrist and Pischke (“design-based studies”) and Acemoglu et al. (“designed institutional change”). The more general notion of “research design” is widely used to describe the methods of applied economics, while “institutional design” is recurrent in the work of neo-institutionalists.
32 Diamond and Robinson, Natural Experiments , 271 – 74.
33 Nancy Cartwright, “Are RCTs the Gold Standard?” BioSocieties 2, no. 1 (2007): 11 – 20; Deaton and Cartwright, “Understanding and Misunderstanding.”
34 François Simiand, “Méthode historique et science sociale. Étude critique d’après les ouvrages récents de M. Lacombe et de M. Seignobos,” Revue de synthèse historique 16 (1903): 1 – 22; Simiand, “La causalité en histoire,” Bulletin de la Société française de philosophie 6 (1906): 245 – 90. On the context of Simiand’s position on causality and the debates it sparked, see Jacques Revel, “Histoire et sciences sociales. Lectures d’un débat français autour de 1900,” Mil neuf cent. Revue d’histoire intellectuelle 25, no. 1 (2007): 101 – 26.
35 Diamond and Robinson, Natural Experiments , 271 – 74. From the outset, the editors explain that they think of “natural experiments” as similar to the “comparative method,” but define them only in reference to laboratory experiments, not to other comparative strategies in the discipline of history.
36 Ibid . So little is offered here on the status of generalization, on relations to theory, and therefore on the question of laws that we do not know, for instance, if the authors would go so far as to conclude that “laws of history” exist and that causal analysis is intended to reveal them.
37 Daron Acemoglu et al., “From Ancien Régime to Capitalism: The French Revolution as a Natural Experiment,” in Diamond and Robinson, Natural Experiments, 221 – 56; Daron Acemoglu et al., “The Consequences of Radical Reform: The French Revolution,” American Economic Review 101, no. 7 (2011): 3286 – 307.
38 Acemoglu et al., “From Ancien Régime to Capitalism,” 249 – 50.
39 Acemoglu et al., “The Consequences of Radical Reform,” 3303.
40 Daron Acemoglu, “Oligarchic versus Democratic Societies,” Journal of the European Economic Association 6, no. 1 (2008): 1 – 44.
41 On applying similar methods and reasoning in other contexts, see Daron Acemoglu, Tarek A. Hassan, and James A. Robinson, “Social Structure and Development: A Legacy of the Holocaust in Russia,” Quarterly Journal of Economics 126, no. 2 (2010): 895 – 946; Sara Lowes et al., “The Evolution of Culture and Institutions: Evidence from the Kuba Kingdom,” Econometrica 85, no. 4 (2017): 1065 – 91. In both these cases, it is again institutionalist theory that links actions, rules, and human behavior and allows for interpreting as causal a correlation between two observations separated by several centuries.
42 Réka Juhász, “Temporary Protection and Technology Adoption: Evidence from the Napoleonic Blockade” (Centre for Economic Performance discussion paper no. 1322, London School of Economics, 2014); Peter Koudijs, “The Boats That Did Not Sail: Asset Price Volatility in a Natural Experiment,” Journal of Finance 71, no. 3 (2016): 1185 – 226. The criticisms presented in the second half of the present article also apply to this type of study.
43 See Daniel M. Hausman, ed., The Philosophy of Economics: An Anthology (Cambridge: Cambridge University Press, 1984; repr. 1994); Julian Reiss, Error in Economics: Towards a More Evidence-Based Methodology (London: Routledge, 2008; repr. 2016).
44 Two renowned articles come to mind here: Rajnish Mehra and Edward C. Prescott, “The Equity Premium: A Puzzle,” Journal of Monetary Economics 15, no. 2 (1985): 145 – 61; Karl E. Case and Robert J. Shiller, “The Efficiency of the Market for Single-Family Homes,” American Economic Review 79, no. 1 (1989): 125 – 37.
45 In addition to the aforementioned texts by Diamond and Robinson, and Acemoglu and Robinson’s Why Nations Fail , the following articles are particularly clear on this point: Nathan Nunn, “The Importance of History for Economic Development,” Annual Review of Economics 1, no. 1 (2009): 65 – 92; James Fenske, “The Causal History of Africa: A Response to Hopkins,” Economic History of Developing Regions 25, no. 2 (2010): 177 – 212.
46 Daron Acemoglu, Simon Johnson, and James A. Robinson, “The Colonial Origins of Comparative Development: An Empirical Investigation,” American Economic Review 91, no. 5 (2001): 1369 – 401; Acemoglu, Johnson, and Robinson, “Reversal of Fortune: Geography and Institutions in the Making of the Modern World Income Distribution,” Quarterly Journal of Economics 117, no. 4 (2002): 1231 – 94.
47 Gareth Austin, “The ‘Reversal of Fortune’ Thesis and the Compression of History: Perspectives from African and Comparative Economic History,” Journal of International Development 20, no. 8 (2008): 996 – 1027; Antony G. Hopkins, “The New Economic History of Africa,” Journal of African History 50, no. 2 (2009): 155 – 77; Hopkins, “Causes and Confusions in African History,” Economic History of Developing Regions 26, no. 2 (2011): 107 – 10; Jerven, “A Clash of Disciplines.”
48 For a recent review of this literature, see Sascha O. Becker, Steven Pfaff, and Jared Rubin, “Causes and Consequences of the Protestant Reformation,” Explorations in Economic History 62, no. 3 (2016): 1 – 25. The authors define the research procedures in this new field as identifying causality using econometric procedures and estimating long-term effects from an institutionalist perspective. An exception is Davide Cantoni’s “test” of the relation Weber established between the Protestant ethic and development; see Cantoni, “The Economic Effects of the Protestant Reformation: Testing the Weber Hypothesis in the German Lands,” Journal of the European Economic Association 13, no. 4 (2015): 561 – 98. Cantoni finds no positive effect on economic growth in regions that converted to Protestantism and concludes that Weber’s theory has been invalidated. Although, strictly speaking, he only assesses links between the differentiated impacts of the Peace of Augsburg (presented as the source of a natural experiment) and population growth in particular German cities, the initial ambition of producing a generalization leads the author to present his work as an econometric test of a major theory. He is ultimately compelled to downsize that claim.
49 Max Weber, The Protestant Ethic and the Spirit of Capitalism [1906], trans. Talcott Parsons (London: Routledge, 1930; repr. 2005), 124: “Since asceticism undertook to remodel the world and to work out its ideals in the world, material goods have gained an increasing and finally inexorable power over the lives of men as at no previous period in history. Today the spirit of religious asceticism—whether finally, who knows?—has escaped from the cage. But victorious capitalism, since it rests on mechanical foundations, needs its support no longer.” To our knowledge, this crucial comment by Weber does not seem to have had any resonance in economic studies of the long-term link between the Protestant religion and economic growth.
50 Greif, Institutions ; Robert D. Putnam, Making Democracy Work: Civic Traditions in Modern Italy (Princeton: Princeton University Press, 1993). For an introduction to how economists have appropriated these two schools of thought, see Guido Tabellini, “Presidential Address: Institutions and Culture,” Journal of the European Economic Association 6, no. 3 (2008): 255 – 94. For a critique of how economists have used Putnam’s work and the historical determinism that follows from it, see Nicolas Delalande, “Is a History of Trust Possible? Remarks on the Historic Imagination of Two Economists” [2008], La vie des idées , 2011, https://booksandideas.net/Is-a-History-of-Trust-Possible.html .
51 Our way of proceeding here resembles that of Quentin Deluermoz and Pierre Singaravélou in Pour une histoire des possibles (Paris: Éd. du Seuil, 2016), 219. These authors call for examining the political uses of counterfactuals in history and criticize the claims to scientificity of mechanical conceptions of causality that fail to apply counterfactual reasoning to their own “analytical framework.”
52 In addition to the references cited above, see Boldizzoni, The Poverty of Clio.
53 Rosenzweig and Wolpin, “Natural ‘Natural Experiments’”; Cartwright, “Are RCTs the Gold Standard?”; Deaton and Cartwright, “Understanding and Misunderstanding”; Sekhon and Titiunik, “When Natural Experiments.”
54 In public policy, too, it can be ill-advised and potentially dangerous to predicate policy implementation on the ability to assess it using statistical methods; see Stephen T. Ziliak and Deirdre N. McCloskey, The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives (Ann Arbor: University of Michigan Press, 2008).
55 Naomi Lamoreaux, “The Future of Economic History Must Be Interdisciplinary,” Journal of Economic History 75, no. 4 (2015): 1251 – 57, here p. 1255. Lamoreaux refers explicitly to Nunn’s studies of African history cited in n. 64 below.
56 This criticism could already have been made of attempts by economists—including the first generation of new economic historians—to use statistical and theoretical tools to study history. However, Lamoreaux is right that, over time, some historian economists have either surrendered or adjusted their neoclassical assumptions and managed to tailor their quantitative procedures to more context-sensitive approaches; see Clément Dherbécourt and Éric Monnet, “Les angles morts de The Poverty of Clio ,” Tracés 16 (2016): 137 – 50.
57 Witold Kula, “Histoire et économie. La longue durée,” Annales ESC 15, no. 2 (1960): 294 – 313; Jean-Yves Grenier and Bernard Lepetit, “L’expérience historique. À propos de C.-E. Labrousse,” Annales ESC 44, no. 6 (1989): 1337 – 60; Claire Lemercier, “A History Without the Social Sciences?” Annales HSS (English Edition) 70, no. 2 (2015): 271 – 83.
58 Nathan Nunn and Nancy Qian, “The Potato’s Contribution to Population and Urbanization: Evidence from an Historical Experiment,” Quarterly Journal of Economics 126, no. 2 (2011): 593 – 650.
59 See Hopkins’ critique of using population data as a proxy for economic growth in Africa in “The New Economic History.”
60 Rafael La Porta et al., “Law and Finance,” Journal of Political Economy 106, no. 6 (1998): 1113 – 55.
61 Claire Lemercier, “Napoléon contre la croissance ? À propos de droit, d’économie et d’histoire,” La vie des idées , 2008, http://www.laviedesidees.fr/Napoleon-contre-la-croissance.html ; Jérôme Sgard, “Do Legal Origins Matter? The Case of Bankruptcy Laws in Europe, 1808 – 1914,” European Review of Economic History 10, no. 3 (2006): 389 – 419; Aldo Musacchio and John D. Turner, “Does the Law and Finance Hypothesis Pass the Test of History?” Business History 55, no. 4 (2013): 524 – 42.
62 See the thematic dossier “The Economics of Contemporary Africa,” in Annales HSS (English Edition) 71, no. 4 (2016): 503 – 79; Jerven, “A Clash of Disciplines”; Denis Cogneau and Yannick Dupraz, “Institutions historiques et développement économique en Afrique,” Histoire et mesure 30, no. 1 (2015): 103 – 34.
63 Austin, “The ‘Reversal of Fortune.’”
64 Nathan Nunn, “The Long-Term Effects of Africa’s Slave Trades,” Quarterly Journal of Economics 123, no. 1 (2008): 139 – 76. See also Nathan Nunn and Leonard Wantchekon, “The Slave Trade and the Origins of Mistrust in Africa,” American Economic Review 101, no. 7 (2011): 3221 – 52.
65 Ewout Frankema and Marlous van Waijenburg, “Structural Impediments to African Growth? New Evidence from Real Wages in British Africa, 1880 – 1965” (Centre for Global Economic History working paper no. 24, Utrecht, 2011).
66 Edward E. Leamer, “Tantalus on the Road to Asymptopia,” Journal of Economic Perspectives 24, no. 2 (2010): 31 – 46, here p. 44.
67 Christopher A. Sims, “But Economics Is Not an Experimental Science,” Journal of Economic Perspectives 24, no. 2 (2010): 59 – 68, here p. 59.
68 It is beyond the scope of this article to present all the alternative conceptions of causality that have been developed in macroeconomics, but we can cite three—while noting that none is without problems or better than the others. Granger’s probabilistic causality is based on much weaker postulates than interventionist-type causality, stating simply that A causes B when A chronologically precedes B and when variations of A allow us to (partially) predict variations of B. Another procedure often used in macroeconomics is to estimate a stylized model of the economy and simulate exogenous shocks in that model in order to discuss how those shocks may correspond to observed variations. This type of causality is both interventionist and structural in that all interactions resulting from an exogenous shock are modeled. The third procedure, closer to process theory, is to document common statistical trends, or regularities, and study them in relation to general historical developments.
69 Wesley C. Salmon, Causality and Explanation (Oxford: Oxford University Press, 1998); Phil Dowe, “Process Causality and Asymmetry,” Erkenntnis 37, no. 2 (1992): 179 – 96.
70 Phil Dowe, “On the Reduction of Process Causality to Statistical Relations,” British Journal for the Philosophy of Science 44, no. 2 (1993): 325 – 27.
71 On the analogy between models and thought experiments, see Mary S. Morgan, The World in the Model: How Economists Work and Think (Cambridge: Cambridge University Press, 2012). This analogy is often used by economists themselves; see Leamer, “Tantalus on the Road to Asymptopia,” 44.
72 This is because microeconomic models are models of partial rather than general equilibrium. See Steven D. Levitt’s test of a standard economic model of crime, which uses an exogenous change in the probability of the crime being identified: “Testing the Economic Model of Crime: The National Hockey League’s Two-Referee Experiment,” Contributions to Economic Analysis and Policy 1, no. 1 (2002): 1 – 21.
73 Hume, “Of the Balance of Trade,” Political Discourses II, here 5.11. Hume argued that the balance of payments would always achieve equilibrium due to the link between money supply, prices, and trade. After constructing a model interlinking these variables, he devised the following thought experiment: “Suppose four-fifths of all the money in GREAT BRITAIN to be annihilated in one night, and the nation reduced to the same condition, with regard to specie, as in the reigns of the HARRYS and EDWARDS, what would be the consequence?” ( Political Discourses II, 5.9). Once he had used this thought experiment to isolate and describe the theoretical mechanisms at work, he clarified that what he imagined could not happen in reality: “Now, it is evident, that the same causes, which would correct these exorbitant inequalities, were they to happen miraculously, must prevent their happening in the common course of nature, and must for ever, in all neighbouring nations, preserve money nearly proportionable to the art and industry of each nation” ( Political Discourses II, 5.11).
74 In “Chronicle of a Deflation Unforetold,” Journal of Political Economy 117, no. 4 (2009): 591 – 634, François R. Velde studied the effect of an arbitrary reduction of the money supply in France in 1724, an event that might seem to correspond to Hume’s thought experiment. But he refused to identify this as a natural experiment because, in eighteenth-century France, it was customary for the king to make this type of decision and economic actors took that into account.
75 Judea Pearl, Causality: Models, Reasoning, and Inference (Cambridge: Cambridge University Press, 2009). Pearl’s conception of causality as it relates to econometrics is based on older works by Trygve Haavelmo. Specifically, the aim is to distinguish clearly between a condition and a cause, the latter involving fixing a parameter in the model (and therefore exogenizing it from the set of equations). By contrast, a condition is defined as a purely statistical concept. Pearl’s approach is a reminder that what statisticians and econometrists estimate is only ever a statistical relation and therefore a condition. To have causality, researchers must make an additional hypothesis; that is, they must explain why the variable can be thought of as “fixed.” For Pearl’s own critique of the confusion economists perpetuate on the statistical handling of causality, see Bryant Chen and Judea Pearl, “Regression and Causation: A Critical Examination of Six Econometrics Textbooks,” Real-World Economics Review 65 (2013): 2 – 20.
76 John Leslie Mackie, “Causes and Conditions,” American Philosophical Quarterly 2, no. 4 (1965): 245 – 64. INUS is the term used in the international literature to designate Mackie’s definition of causality: “Insufficient but Non-redundant parts of a condition which is itself Unnecessary but Sufficient.”
77 Benoît Rihoux and Charles C. Ragin, eds., Configurational Comparative Methods: Qualitative Comparative Analysis (QCA) and Related Techniques (Thousand Oaks: Sage Publications, 2008). We are much beholden to one of the anonymous peer reviewers of this article for suggesting how Ragin’s analyses could help imagine alternative solutions to the type of causality envisioned by proponents of natural experiments.
78 In his review of Diamond and Robinson, Natural Experiments of History (p. 274), Mokyr recalls that it was possible to adopt “progressive” institutions even where there was no French conquest and that Spain instated no reforms despite the French invasion.
79 In “Causation in the Social Sciences: Evidence, Inference, and Purpose,” Philosophy of the Social Sciences 39, no. 1 (2009): 20 – 40, Julian Reiss explains that the natural experiment method is based on an interventionist conception of causality.
80 See the references listed in n. 1 above.
81 This is probably the case for some studies on the effects of Protestantism.
82 Lemercier, “A History Without the Social Sciences?”
Linked content
This is a translation of: Expériences naturelles et causalité en histoire économique: Quels rapports à la théorie et à la temporalité ?
Annales HSS (English Edition), Volume 72, Issue 4
Sacha Bourgeois-Gironde and Éric Monnet
DOI: https://doi.org/10.1017/ahsse.2021.8
Quasi-Experimental Designs for Causal Inference
When randomized experiments are infeasible, quasi-experimental designs can be exploited to evaluate causal treatment effects. The strongest quasi-experimental designs for causal inference are regression discontinuity designs, instrumental variable designs, matching and propensity score designs, and comparative interrupted time series designs. This article introduces for each design the basic rationale, discusses the assumptions required for identifying a causal effect, outlines methods for estimating the effect, and highlights potential validity threats and strategies for dealing with them. Causal estimands and identification results are formalized with the potential outcomes notations of the Rubin causal model.
Causal inference plays a central role in many social and behavioral sciences, including psychology and education. But drawing valid causal conclusions is challenging because they are warranted only if the study design meets a set of strong and frequently untestable assumptions. Thus, studies aiming at causal inference should employ designs and design elements that are able to rule out most plausible threats to validity. Randomized controlled trials (RCTs) are considered the gold standard for causal inference because they rely on the fewest and weakest assumptions. But under certain conditions quasi-experimental designs that lack random assignment can also be as credible as RCTs (Shadish, Cook, & Campbell, 2002).
This article discusses four of the strongest quasi-experimental designs for identifying causal effects: regression discontinuity design, instrumental variable design, matching and propensity score designs, and the comparative interrupted time series design. For each design we outline the strategy and assumptions for identifying a causal effect, address estimation methods, and discuss practical issues and suggestions for strengthening the basic designs. To highlight the design differences, throughout the article we use a hypothetical example with the following causal research question: What is the effect of attending a summer science camp on students’ science achievement?
POTENTIAL OUTCOMES AND RANDOMIZED CONTROLLED TRIAL
Before we discuss the four quasi-experimental designs, we introduce the potential outcomes notation of the Rubin causal model (RCM) and show how it is used in the context of an RCT. The RCM (Holland, 1986) formalizes causal inference in terms of potential outcomes, which allow us to precisely define causal quantities of interest and to explicate the assumptions required for identifying them. RCM considers a potential outcome for each possible treatment condition. For a dichotomous treatment variable (i.e., a treatment and control condition), each subject i has a potential treatment outcome Y_i(1), which we would observe if subject i receives the treatment (Z_i = 1), and a potential control outcome Y_i(0), which we would observe if subject i receives the control condition (Z_i = 0). The difference in the two potential outcomes, Y_i(1) − Y_i(0), represents the individual causal effect.
Suppose we want to evaluate the effect of attending a summer science camp on students’ science achievement score. Then each student has two potential outcomes: a potential control score for not attending the science camp, and the potential treatment score for attending the camp. However, the individual causal effects of attending the camp cannot be inferred from data, because the two potential outcomes are never observed simultaneously. Instead, researchers typically focus on average causal effects. The average treatment effect (ATE) for the entire study population is defined as the difference in the expected potential outcomes, ATE = E [ Y i (1)] − E [ Y i (0)]. Similarly, we can also define the ATE for the treated subjects (ATT), ATT = E [ Y i (1) | Z i = 1] − E [ Y i (0) | Z i = 1]. Although the expectations of the potential outcomes are not directly observable because not all potential outcomes are observed, we nonetheless can identify ATE or ATT under some reasonable assumptions. In an RCT, random assignment establishes independence between the potential outcomes and the treatment status, which allows us to infer ATE. Suppose that students are randomly assigned to the science camp and that all students comply with the assigned condition. Then random assignment guarantees that the camp attendance indicator Z is independent of the potential achievement scores Y i (0) and Y i (1).
The independence assumption allows us to rewrite ATE in terms of observable expectations (i.e., with observed outcomes instead of potential outcomes). First, due to the independence (randomization), the unconditional expectations of the potential outcomes can be expressed as conditional expectations, E [ Y i (1)] = E [ Y i (1) | Z i = 1] and E [ Y i (0)] = E [ Y i (0) | Z i = 0]. Second, because the potential treatment outcomes are actually observed for the treated, we can replace the potential treatment outcome with the observed outcome such that E [ Y i (1) | Z i = 1] = E [ Y i | Z i = 1] and, analogously, E [ Y i (0) | Z i = 0] = E [ Y i | Z i = 0]. Thus, the ATE is expressible in terms of observable quantities rather than potential outcomes, ATE = E [ Y i (1)] − E [ Y i (0)] = E [ Y i | Z i = 1] − E [ Y i | Z i = 0], and we say that ATE is identified.
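To make this concrete, here is a minimal simulation sketch (hypothetical numbers, not from the article; Python with NumPy assumed): potential outcomes are generated with a known average effect of +5 points, and under random assignment the simple difference in observed group means recovers the ATE.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical potential outcomes with a constant camp effect of +5 points
# (illustrative values, not taken from the article).
y0 = rng.normal(70, 10, n)      # potential control scores Y_i(0)
y1 = y0 + 5                     # potential treatment scores Y_i(1)
true_ate = np.mean(y1 - y0)

# Random assignment makes Z independent of the potential outcomes,
# so the observed group-mean difference identifies the ATE.
z = rng.binomial(1, 0.5, n)
y_obs = np.where(z == 1, y1, y0)
ate_hat = y_obs[z == 1].mean() - y_obs[z == 0].mean()

print(true_ate, ate_hat)        # the two values nearly coincide
```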
This derivation also rests on the stable-unit-treatment-value assumption (SUTVA; Imbens & Rubin, 2015 ). SUTVA is required to properly define the potential outcomes, that is, (a) the potential outcomes of a subject depend neither on the assignment mode nor on other subjects’ treatment assignment, and (b) there is only one unique treatment and one unique control condition. Without further mentioning, we assume SUTVA for all quasi-experimental designs discussed in this article.
REGRESSION DISCONTINUITY DESIGN
Due to ethical or budgetary reasons, random assignment is often infeasible in practice. Nonetheless, researchers may sometimes still retain full control over treatment assignment as in a regression discontinuity (RD) design where, based on a continuous assignment variable and a cutoff score, subjects are deterministically assigned to treatment conditions.
Suppose that the science camp is a remedial program and only students whose grade point average (GPA) score is less than or equal to 2.0 are eligible to participate. Figure 1 shows a scatterplot of hypothetical data where the x-axis represents the assignment variable ( GPA ) and the y-axis the outcome ( Science Score ). All subjects with a GPA score at or below the cutoff attend the camp (circles), whereas all subjects scoring above the cutoff do not attend (squares). Because all low-achieving students are in the treatment group and all high-achieving students are in the control group, their respective GPA distributions do not overlap, not even at the cutoff. This lack of overlap complicates the identification of a causal effect because students in the treatment and control group are not comparable at all (i.e., they have completely different distributions of GPA scores).
Figure 1. A hypothetical example of a regression discontinuity design. Note: GPA = grade point average.
One strategy for dealing with the lack of overlap is to rely on the linearity assumption of regression models and to extrapolate into the areas of nonoverlap. However, if the linear models do not correctly specify the functional form, the resulting ATE estimate is biased. A safer strategy is to evaluate the treatment effect only at the cutoff score, where treatment and control cases almost overlap and thus functional form assumptions and extrapolation are barely needed. Consider the treatment and control students who score right at the cutoff or just above it. Students with a GPA score of 2.0 participate in the science camp and students with a GPA score of 2.1 are in the control condition (the status quo condition or a different camp). The two groups of students are essentially equivalent because the difference in their GPA scores is negligibly small (2.1 − 2.0 = .1) and likely due to random chance (measurement error) rather than a real difference in ability. Thus, in the very close neighborhood around the cutoff score, the RD design is equivalent to an RCT; therefore, the ATE at the cutoff (ATEC) is identified.
Causal Estimand and Identification
ATEC is defined as the difference in the expected potential treatment and control outcomes for the subjects scoring exactly at the cutoff: ATEC = E [ Y i (1) | A i = a c ] − E [ Y i (0) | A i = a c ], where A denotes the assignment variable and a c the cutoff score. Because we observe only treatment subjects and no control subjects right at the cutoff, we need two assumptions in order to identify ATEC (Hahn, Todd, & Van der Klaauw, 2001): (a) the conditional expectations of the potential treatment and control outcomes are continuous at the cutoff ( continuity ), and (b) all subjects comply with treatment assignment ( full compliance ).
The continuity assumption can be expressed in terms of limits as $\lim_{a \downarrow a_c} E[Y_i(1) \mid A_i = a] = E[Y_i(1) \mid A_i = a_c] = \lim_{a \uparrow a_c} E[Y_i(1) \mid A_i = a]$ and $\lim_{a \downarrow a_c} E[Y_i(0) \mid A_i = a] = E[Y_i(0) \mid A_i = a_c] = \lim_{a \uparrow a_c} E[Y_i(0) \mid A_i = a]$. Thus, we can rewrite ATEC as a difference in limits, $\mathrm{ATEC} = \lim_{a \uparrow a_c} E[Y_i(1) \mid A_i = a] - \lim_{a \downarrow a_c} E[Y_i(0) \mid A_i = a]$, which solves the issue that no control subjects are observed directly at the cutoff. Then, by the full compliance assumption, the potential treatment and control outcomes can be replaced with the observed outcomes such that $\mathrm{ATEC} = \lim_{a \uparrow a_c} E[Y_i \mid A_i = a] - \lim_{a \downarrow a_c} E[Y_i \mid A_i = a]$ is identified at the cutoff (i.e., ATEC is now expressed in terms of observable quantities). The difference in the limits represents the discontinuity in the mean outcomes exactly at the cutoff ( Figure 1 ).
Estimating ATEC
ATEC can be estimated with parametric or nonparametric regression methods. First, consider the parametric regression of the outcome Y on the treatment Z, the cutoff-centered assignment variable A − a c , and their interaction: $Y = \beta_0 + \beta_1 Z + \beta_2 (A - a_c) + \beta_3 (Z \times (A - a_c)) + e$. If the model correctly specifies the functional form, then $\hat{\beta}_1$ is an unbiased estimator for ATEC. In practice, an appropriate model specification frequently also involves quadratic and cubic terms of the assignment variable plus their interactions with the treatment indicator.
To avoid overly strong functional form assumptions, semiparametric or nonparametric regression methods like generalized additive models or local linear kernel regression can be employed ( Imbens & Lemieux, 2008 ). These methods down-weight or even discard observations that are not in the close neighborhood around the cutoff. The R packages rdd ( Dimmery, 2013 ) and rdrobust ( Calonico, Cattaneo, & Titiunik, 2015 ), or the command rd in STATA ( Nichols, 2007 ) are useful for estimation and diagnostic purposes.
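For illustration, the parametric strategy from the equation above can be sketched as follows (a minimal example with simulated data; the cutoff, the outcome model, and the +4-point camp effect are assumptions made for the sketch, and the statsmodels formula API is used for the OLS fit):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500

# Hypothetical sharp RD: GPA is the assignment variable, cutoff at 2.0,
# students at or below the cutoff attend the camp.
gpa = rng.uniform(1.0, 3.0, n)
z = (gpa <= 2.0).astype(int)
# Assumed outcome model: linear in GPA plus a +4-point camp effect.
science = 60 + 10 * (gpa - 2.0) + 4 * z + rng.normal(0, 3, n)

df = pd.DataFrame({"science": science, "z": z, "a_c": gpa - 2.0})

# Regress the outcome on treatment, the centered assignment variable,
# and their interaction; the coefficient on z estimates ATEC.
fit = smf.ols("science ~ z + a_c + z:a_c", data=df).fit()
print(fit.params["z"])   # close to the simulated effect of 4
```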
Practical Issues
A major validity threat for RD designs is the manipulation of the assignment score around the cutoff, which directly results in a violation of the continuity assumption ( Wong et al., 2012 ). For instance, if a teacher knows the assignment score in advance and he wants all his students to attend the science camp, the teacher could falsely report a GPA score of 2.0 or below for the students whose actual GPA score exceeds the cutoff value.
Another validity threat is noncompliance, meaning that subjects assigned to the control condition may cross over to the treatment and subjects assigned to the treatment do not show up. An RD design with noncompliance is called a fuzzy RD design (instead of a sharp RD design with full compliance). A fuzzy RD design still allows us to identify the intention-to-treat effect or the local average treatment effect at the cutoff (LATEC). The intention-to-treat effect refers to the effect of treatment assignment rather than the actual treatment receipt. LATEC estimates ATEC for the subjects who comply with treatment assignment. LATEC is identified if one uses the assignment status as an instrumental variable for treatment receipt (see the upcoming Instrumental Variable section).
Finally, generalizability and statistical power are often mentioned as major disadvantages of RD designs. Because RD designs identify the treatment effect only at the cutoff, ATEC estimates are not automatically generalizable to subjects scoring further away from the cutoff. Statistical power for detecting a significant effect is an issue because the lack of overlap on the assignment variable results in increased standard errors. With semi- or nonparametric regression methods, power further diminishes.
Strengthening RD Designs
To avoid systematic manipulations of the assignment variable, it is desirable to conceal the assignment rule from study participants and administrators. If the assignment rule is known to them, manipulations can hardly be ruled out, particularly when the stakes are high. Researchers can use the McCrary test ( McCrary, 2008 ) to check for potential manipulations. The test investigates whether there is a discontinuity in the distribution of the assignment variable right at the cutoff. Plotting baseline covariates against the assignment variable, and regressing the covariates on the assignment variable and the treatment indicator also help in detecting potential discontinuities at the cutoff.
The RD design’s validity can be increased by combining the basic RD design with other designs. An example is the tie-breaking RD design, which uses two cutoff scores. Subjects scoring between the two cutoff scores are randomly assigned to treatment conditions, whereas subjects scoring outside the cutoff interval receive the treatment or control condition according to the RD assignment rule ( Black, Galdo & Smith, 2007 ). This design combines an RD design with an RCT and is advantageous with respect to the correct specification of the functional form, generalizability, and statistical power. Similar benefits can be obtained by adding pretest measures of the outcome or nonequivalent comparison groups ( Wing & Cook, 2013 ).
Imbens and Lemieux (2008) and Lee and Lemieux (2010) provided comprehensive introductions to RD designs. Lee and Lemieux also summarized many applications from economics. Angrist and Lavy (1999) applied the design to investigate the effect of class size on student achievement.
INSTRUMENTAL VARIABLE DESIGN
In practice, researchers often have no or only partial control over treatment selection. In addition, they might also lack reliable knowledge of the selection process. Nonetheless, even with limited control and knowledge of the selection process it is still possible to identify a causal treatment effect if an instrumental variable (IV) is available. An IV is an exogenous variable that is related to the treatment but is completely unrelated to the outcome, except via treatment. An IV design requires researchers either to create an IV at the design stage (as in an encouragement design, discussed below) or to find an IV in the data set at hand or in a related database.
Consider the science camp example, but instead of random or deterministic treatment assignment, students decide on their own or together with their parents whether to attend the camp. Many factors may determine the decision, for instance, students’ science ability and motivation, parents’ socioeconomic status, or the availability of public transportation for the daily commute to the camp. Whereas the first three variables are presumably also related to the science outcome, public transportation might be unrelated to the science score (except via camp attendance). Thus, the availability of public transportation may qualify as an IV. Figure 2 illustrates such an IV design: Public transportation (IV) directly affects camp attendance but has no direct or indirect effect on science achievement (outcome) other than through camp attendance (treatment). The question mark represents unknown or unobserved confounders, that is, variables that simultaneously affect both camp attendance and science achievement. The IV design allows us to identify a causal effect even if some or all confounders are unknown or unobserved.
Figure 2. A diagram of an example of an instrumental variable design.
The strategy for identifying a causal effect is based on exploiting the variation in the treatment variable that is explained by the IV. In Figure 2 , the total variation in the treatment consists of (a) the variation induced by the IV and (b) the variation induced by confounders (question mark) and other exogenous variables (not shown in the figure). Identifying the camp’s effect requires us to isolate the treatment variation that is related to public transportation (IV), and then to use the isolated variation to investigate the camp’s effect on the science score. Because we exploit the treatment variation exclusively induced by the IV but ignore the variation induced by unobserved or unknown confounders, the IV design identifies the ATE for the subpopulation of compliers only. In our example, the compliers are the students who attend the camp if public transportation is available and do not attend if it is unavailable. For students whose parents always use their own car to drop them off and pick them up at the camp location, we cannot infer the causal effect, because their camp attendance is completely unrelated to the availability of public transportation.
Causal Estimand and Identification
The complier average treatment effect (CATE) is defined as the expected difference in potential outcomes for the sub-population of compliers: CATE = E [ Y i (1) | Complier ] − E [ Y i (0) | Complier ] = τ C .
Identification requires us to distinguish between four latent groups: compliers (C), who attend the camp if public transportation is available but do not attend if unavailable; always-takers (A), who always attend the camp regardless of whether or not public transportation is available; never-takers (N), who never attend the camp regardless of public transportation; and defiers (D), who do not attend if public transportation is available but attend if unavailable. Because group membership is unknown, it is impossible to directly infer CATE from the data of compliers. However, CATE is identified from the entire data set if (a) the IV is predictive of the treatment ( predictive first stage ), (b) the IV is unrelated to the outcome except via treatment ( exclusion restriction ), and (c) no defiers are present ( monotonicity ; Angrist, Imbens, & Rubin, 1996 ; see Steiner, Kim, Hall, & Su, 2015 , for a graphical explanation).
First, notice that the IV’s effects on the treatment (γ) and the outcome (δ) are directly identified from the observed data because the IV’s relation with the treatment and outcome is unconfounded. In our example ( Figure 2 ), γ denotes the effect of public transportation on camp attendance and δ the indirect effect of public transportation on the science score. Both effects can be written as weighted averages of the corresponding group-specific effects ($\gamma_C, \gamma_A, \gamma_N, \gamma_D$ and $\delta_C, \delta_A, \delta_N, \delta_D$ for compliers, always-takers, never-takers, and defiers, respectively): $\gamma = p(C)\gamma_C + p(A)\gamma_A + p(N)\gamma_N + p(D)\gamma_D$ and $\delta = p(C)\delta_C + p(A)\delta_A + p(N)\delta_N + p(D)\delta_D$, where $p(\cdot)$ represents the proportion of the respective latent group in the population and $p(C) + p(A) + p(N) + p(D) = 1$. Because the treatment choice of always-takers and never-takers is entirely unaffected by the instrument, the IV’s effect on their treatment status is zero, $\gamma_A = \gamma_N = 0$, and together with the exclusion restriction we also know $\delta_A = \delta_N = 0$, that is, the IV has no effect on their outcomes. If no defiers are present, $p(D) = 0$ ( monotonicity ), then the IV’s effects on the treatment and outcome simplify to $\gamma = p(C)\gamma_C$ and $\delta = p(C)\delta_C$, respectively. Because $\delta_C = \gamma_C \tau_C$ and $\gamma \neq 0$ ( predictive first stage ), the ratio of the observable IV effects, δ and γ, identifies CATE: $\delta / \gamma = p(C)\gamma_C\tau_C / (p(C)\gamma_C) = \tau_C$.
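For intuition, consider some hypothetical numbers (not from the article): if the availability of public transportation raises the camp attendance rate by γ = 0.40 (40 percentage points) and raises the average science score by δ = 2 points, then the ratio gives CATE = δ/γ = 2/0.40 = 5 points, the camp’s effect for the compliers.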
Estimating CATE
A two-stage least squares (2SLS) regression is typically used for estimating CATE. In the first stage, the treatment Z is regressed on the IV, $Z = \beta_0 + \beta_1 IV + e$. The linear first-stage model applies even with a dichotomous treatment variable (linear probability model). The second stage then regresses the outcome Y on the predicted values $\hat{Z}$ from the first-stage model, $Y = \pi_0 + \pi_1 \hat{Z} + r$, where $\hat{\pi}_1$ is the CATE estimator. The two stages are automatically performed by the 2SLS procedure, which also provides an appropriate standard error for the effect estimate. The STATA commands ivregress and ivreg2 ( Baum, Schaffer, & Stillman, 2007 ) or the sem package in R ( Fox, 2006 ) perform the 2SLS regression.
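The two stages can also be run by hand, as in this minimal sketch (simulated data with an assumed camp effect of +5 and an unobserved confounder; statsmodels OLS is assumed). In practice a dedicated 2SLS routine is preferable because manually computed second-stage standard errors are not valid.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 2_000

# Hypothetical data: u is an unobserved confounder, iv is the availability of
# public transportation, z is camp attendance, y is the science score.
u = rng.normal(0, 1, n)
iv = rng.binomial(1, 0.5, n)
z = (0.8 * iv + u + rng.normal(0, 1, n) > 0.5).astype(int)
y = 50 + 5 * z + 3 * u + rng.normal(0, 2, n)   # assumed camp effect of +5

# Stage 1: regress the treatment on the instrument, keep the fitted values.
stage1 = sm.OLS(z, sm.add_constant(iv)).fit()
z_hat = stage1.fittedvalues

# Stage 2: regress the outcome on the fitted treatment values;
# the slope is the 2SLS estimate of the complier effect.
stage2 = sm.OLS(y, sm.add_constant(z_hat)).fit()
print(stage2.params[1])   # close to 5; naive OLS of y on z is biased upward
```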
One challenge in implementing an IV design is to find a valid instrument that satisfies the assumptions just discussed. In particular, the exclusion restriction is untestable and frequently hard to defend in practice. In our example, if high-income families live in suburban areas with poor public transportation connections, then the availability of public transportation is likely related to the science score via household income (or socioeconomic status). Although conditioning on the observed household income can turn public transportation into a conditional IV (see below), one can frequently come up with additional scenarios that explain why the IV is related to the outcome and thus violates the exclusion restriction.
Another issue arises from “weak” IVs that are only weakly related to the treatment. Weak IVs cause efficiency problems ( Wooldridge, 2012 ). If the availability of public transportation barely affects camp attendance because most parents give their children a ride anyway, the IV’s effect on the treatment (γ) is close to zero. Because $\hat{\gamma}$ is the denominator in the CATE estimator, $\hat{\tau}_C = \hat{\delta} / \hat{\gamma}$, an imprecisely estimated $\hat{\gamma}$ results in a considerable over- or underestimation of CATE. Moreover, standard errors will be large.
One also needs to keep in mind that the substantive meaning of CATE depends on the chosen IV. Consider two slightly different IVs with respect to public transportation: the availability of (a) a bus service and (b) subway service. For the first IV, the complier population consists of students who choose to (not) attend the camp depending on the availability of a bus service. For the second IV, the complier population refers to the availability of a subway service. Because the two complier populations are very likely different from each other (students who are willing to take the subway might not be willing to take the bus), the corresponding CATEs refer to different subpopulations.
Strengthening IV Designs
Given the challenges in identifying a valid instrument from observed data, researchers should consider creating an IV at the design stage of a study. Although it might be impossible to directly assign subjects to treatment conditions, one might still be able to encourage participants to take the treatment. Subjects are randomly encouraged to sign up for treatment, but whether they actually comply with the encouragement is entirely their own decision ( Imai et al., 2011 ). Random encouragement qualifies as an IV because it very likely meets the exclusion restriction. For example, instead of collecting data on public transportation, researchers may advertise and recommend the science camp in a letter to the parents of a randomly selected sample of students.
With observational data it is hard to identify a valid IV because covariates that strongly predict the treatment are usually also related to the outcome. However, these covariates can still qualify as an IV if they affect the outcome only indirectly via other observed variables. Such covariates can be used as conditional IVs, that is, they meet the IV requirements conditional on the observed variables ( Brito & Pearl, 2002 ). Assume the availability of public transportation (IV) is associated with the science score via household income. Then, controlling for the reliably measured household income in both stages of the 2SLS analysis blocks the IV’s relation to the science score and turns public transportation into a conditional IV. However, controlling for a large set of variables does not guarantee that the exclusion restriction is more likely met. It may even result in more bias as compared to an IV analysis with fewer covariates ( Ding & Miratrix, 2015 ; Steiner & Kim, in press ). The choice of a valid conditional IV requires researchers to carefully select the control variables based on subject-matter theory.
The seminal article by Angrist et al. (1996) provides a thorough discussion of the IV design, and Steiner, Kim, et al. (2015 ) proved the identification result using graphical models. Excellent introductions to IV designs can be found in Angrist and Pischke (2009 , 2015) . Angrist and Krueger (1992) is an example of a creative application of the design with birthday as the IV. For encouragement designs, see Holland (1988) and Imai et al. (2011) .
MATCHING AND PROPENSITY SCORE DESIGN
This section considers quasi-experimental designs in which researchers lack control over treatment selection but have good knowledge about the selection mechanism or at least about the confounders that simultaneously determine treatment selection and the outcome. Due to self-selection or third-person selection of subjects into treatment, the resulting treatment and control groups typically differ in observed but also in unobserved baseline covariates. If we have reliable measures of all confounding covariates, then matching or propensity score (PS) designs balance the groups on the observed baseline covariates and thus enable the identification of causal effects ( Imbens & Rubin, 2015 ). Regression analysis and the analysis of covariance can also remove the confounding bias, but because they rely on functional form assumptions and extrapolation we discuss only nonparametric matching and PS designs.
Suppose that students decide on their own whether to attend the science camp. Although many factors can affect students’ decision, teachers with several years of experience of running the camp may know that selection is mostly driven by students’ science ability, liking of science, and their parents’ socioeconomic status. If all the selection-relevant factors that also affect the outcome are known, the question mark in Figure 2 can be replaced by the known confounding covariates.
Given the set of confounding covariates, causal inference with matching or PS designs is straightforward, at least theoretically. The basic one-to-one matching design matches each treatment subject to a control subject that is equivalent or at least very similar in the observed covariates. To illustrate the idea of matching, consider a camp attendee with baseline measures of 80 on the science pretest, 6 on liking science, and 50 on socioeconomic status. A multivariate matching strategy then tries to find a nonattendee with exactly the same, or at least very similar, baseline measures. If we succeed in finding close matches for all camp attendees, the matched samples of attendees and nonattendees will have almost identical covariate distributions.
Although multivariate matching works well when the number of confounders is small and the pool of control subjects is large relative to the number of treatment subjects, it is usually difficult to find close matches with a large set of covariates or a small pool of control subjects. Matching on the PS helps to overcome this issue because the PS is a univariate score computed from the observed covariates ( Rosenbaum & Rubin, 1983 ). The PS is formally defined as the conditional probability of receiving the treatment given the set of observed covariates X : PS = Pr( Z = 1 | X ).
Matching and PS designs usually investigate ATE = E [ Y i (1)] − E [ Y i (0)] or ATT = E [ Y i (1) | Z i = 1] – E [ Y i (0) | Z i = 1]. Both causal effects are identified if (a) the potential outcomes are statistically independent of the treatment indicator given the set of observed confounders X , { Y (1), Y (0)}⊥ Z | X ( unconfoundedness ; ⊥ denotes independence), and (b) the treatment probability is strictly between zero and one, 0 < Pr( Z = 1 | X ) < 1 ( positivity ).
By the positivity assumption we get E [ Y i (1)] = E X [ E [ Y i (1) | X ]] and E [ Y i (0)] = E X [ E [ Y i (0) | X ]]. If the unconfoundedness assumption holds, we can write the inner expectations as E [ Y i (1) | X ] = E [ Y i (1) | Z i = 1; X ] and E [ Y i (0) | X ] = E [ Y i (0) | Z i = 0; X ]. Finally, because the treatment (control) outcomes of the treatment (control) subjects are actually observed, ATE is identified because it can be expressed in terms of observable quantities: ATE = E X [ E [ Y i | Z i = 1; X ]] − E X [ E [ Y i | Z i = 0; X ]]. The same can be shown for ATT. The unconfoundedness and positivity assumptions are frequently referred to jointly as the strong ignorability assumption. Rosenbaum and Rubin (1983) proved that if the assignment is strongly ignorable given X , then it is also strongly ignorable given the PS alone.
Estimating ATE and ATT
Matching designs use a distance measure for matching each treatment subject to the closest control subject. The Mahalanobis distance is usually used for multivariate matching and the Euclidean distance on the logit of the PS for PS matching. Matching strategies differ with respect to the matching ratio (one-to-one or one-to-many), replacement of matched subjects (with or without replacement), use of a caliper (treatment subjects that do not have a control subject within a certain threshold remain unmatched), and the matching algorithm (greedy, genetic, or optimal matching; Sekhon, 2011 ; Steiner & Cook, 2013 ). Because we try to find at least one control subject for each treatment subject, matching estimators typically estimate ATT. Once treatment and control subjects are matched, ATT is computed as the difference in the mean outcome of the treatment and control group. An alternative matching strategy that allows for estimating ATE is full matching, which stratifies all subjects into the maximum number of strata, where each stratum contains at least one treatment and one control subject ( Hansen, 2004 ).
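A minimal sketch of one-to-one nearest-neighbor matching with replacement follows (covariates are standardized and matched on the Euclidean distance as a simple stand-in for the Mahalanobis metric named above; the data and the +4-point camp effect are hypothetical, and scikit-learn's NearestNeighbors is assumed to be available):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
n_t, n_c = 100, 400

# Hypothetical covariates (science pretest, liking of science, SES);
# attendees score somewhat higher at baseline than nonattendees.
x_t = rng.normal([75, 5, 45], [8, 1, 10], size=(n_t, 3))
x_c = rng.normal([70, 4, 40], [8, 1, 10], size=(n_c, 3))
y_t = 55 + 0.2 * x_t[:, 0] + 4 + rng.normal(0, 3, n_t)   # assumed effect of +4
y_c = 55 + 0.2 * x_c[:, 0] + rng.normal(0, 3, n_c)

# Standardize so the Euclidean distance weights the covariates comparably.
mu, sd = x_c.mean(axis=0), x_c.std(axis=0)
nn = NearestNeighbors(n_neighbors=1).fit((x_c - mu) / sd)
_, idx = nn.kneighbors((x_t - mu) / sd)

# ATT: mean outcome difference between attendees and their matched controls.
att_hat = np.mean(y_t - y_c[idx.ravel()])
print(att_hat)   # close to 4; the raw group difference would be larger
```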
The PS can also be used for PS stratification and inverse-propensity weighting. PS stratification stratifies the treatment and control subjects into at least five strata and estimates the treatment effect within each stratum. ATE or ATT is then obtained as the weighted average of the stratum-specific treatment effects. Inverse-propensity weighting follows the same logic as inverse-probability weighting in survey research ( Horvitz & Thompson, 1952 ) and requires the computation of weights that refer to either the overall population (ATE) or the population of treated subjects only (ATT). Given the inverse-propensity weights, ATE or ATT is usually estimated via weighted least squares regression.
Because the true PSs are unknown, they need to be estimated from the observed data. The most common method for estimating the PS is logistic regression, which regresses the binary treatment indicator Z on the observed covariates. The PS model is specified according to balance criteria (instead of goodness-of-fit criteria), that is, the estimated PSs should remove all baseline differences in the observed covariates ( Imbens & Rubin, 2015 ). The predicted probabilities from the PS model represent the estimated PSs.
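As a sketch of PS estimation and inverse-propensity weighting (the selection and outcome models are hypothetical; scikit-learn's logistic regression is assumed, with a large C so the fit is essentially unpenalized):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 5_000

# Hypothetical confounders: science pretest and socioeconomic status.
pretest = rng.normal(70, 10, n)
ses = rng.normal(50, 10, n)

# Selection into the camp depends on both confounders.
p_true = 1 / (1 + np.exp(-(-8 + 0.08 * pretest + 0.04 * ses)))
z = rng.binomial(1, p_true)

# Outcome model with an assumed camp effect of +4 points.
y = 20 + 0.5 * pretest + 0.2 * ses + 4 * z + rng.normal(0, 5, n)

# Estimate the PS by logistic regression of Z on the observed covariates.
X = np.column_stack([pretest, ses])
ps = LogisticRegression(C=1e6).fit(X, z).predict_proba(X)[:, 1]

# Inverse-propensity weighting for the ATE.
ate_ipw = np.mean(z * y / ps) - np.mean((1 - z) * y / (1 - ps))
naive = y[z == 1].mean() - y[z == 0].mean()
print(naive, ate_ipw)   # the naive difference is biased; IPW is close to 4
```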
All three PS designs—matching, stratification, and weighting—can benefit from additional covariance adjustments in an outcome regression. That is, for the matched, stratified or weighted data, the outcome is regressed on the treatment indicator and the additional covariates. Combining the PS design with a covariance adjustment gives researchers two chances to remove the confounding bias, by correctly specifying either the PS model or the outcome model. These combined methods are said to be doubly robust because they are robust against either the misspecification of the PS model or the misspecification of the outcome model ( Robins & Rotnitzky, 1995 ). The R packages optmatch ( Hansen & Klopfer, 2006 ) and MatchIt ( Ho et al., 2011 ) and the STATA command teffects , in particular teffects psmatch ( StataCorp, 2015 ), can be useful for matching or PS analyses.
The most challenging issue with matching and PS designs is the selection of covariates for establishing unconfoundedness. Ideally, subject-matter theory about the selection process and the outcome-generating model is used for selecting a set of covariates that removes all of the confounding ( Pearl, 2009 ). If strong subject-matter theories are not available, selecting the right covariates is difficult. In the hope of removing a major part of the confounding bias (if not all of it), a frequently applied strategy is to match on as many covariates as possible. However, recent literature shows that the thoughtless inclusion of covariates may increase rather than reduce the confounding bias ( Pearl, 2010 ; Steiner & Kim, in press ). The risk of increasing bias can be reduced if the observed covariates cover a broad range of heterogeneous construct domains, including at least one reliable pretest measure of the outcome ( Steiner, Cook, et al., 2015 ). Besides having the right covariates, they also need to be reliably measured. The unreliable measurement of confounding covariates has a similar effect as the omission of a confounder: It results in a violation of the unconfoundedness assumption and thus in a biased effect estimate ( Steiner, Cook, & Shadish, 2011 ; Steiner & Kim, in press ).
Even if the set of reliably measured covariates establishes unconfoundedness, we still need to correctly specify the functional form of the PS model. Although parametric models like logistic regression, including higher order terms, might frequently approximate the correct functional form, they still rely on the linearity assumption. The linearity assumption can be relaxed if one estimates the PS with statistical learning algorithms like classification trees, neural networks, or the LASSO ( Keller, Kim, & Steiner, 2015 ; McCaffrey, Ridgeway, & Morral, 2004 ).
Strengthening Matching and PS Designs
The credibility of matching and PS designs relies heavily on the unconfoundedness assumption. Although it is empirically untestable, there are indirect ways of assessing unconfoundedness. First, one can use unaffected (nonequivalent) outcomes, that is, outcomes known to be unaffected by the treatment ( Shadish et al., 2002 ). For instance, we may expect that attendance in the science camp does not significantly affect the reading score. Thus, if we observe a significant group difference in the reading score after the PS adjustment, bias due to unobserved confounders (e.g., general intelligence) is still likely. Second, adding a second but conceptually different control group allows for a similar test as with the unaffected outcome ( Rosenbaum, 2002 ).
Because researchers rarely know whether the unconfoundedness assumption is actually met with the data at hand, it is important to assess the effect estimate’s sensitivity to potentially unobserved confounders. Sensitivity analyses investigate how strongly an estimate’s magnitude and significance would change if a confounder of a certain strength had been omitted from the analysis. Causal conclusions are much more credible if the effect’s direction, magnitude, and significance are rather insensitive to omitted confounders ( Rosenbaum, 2002 ). However, despite the value of sensitivity analyses, they are not informative about whether hidden bias is actually present.
Schafer and Kang (2008) and Steiner and Cook (2013) provided comprehensive introductions to matching and PS designs. Rigorous formalization and technical details of PS designs can be found in Imbens and Rubin (2015) . Rosenbaum (2002) discussed many important design issues.
COMPARATIVE INTERRUPTED TIME SERIES DESIGN
The designs discussed so far require researchers to have either full control over treatment assignment or reliable knowledge of the exogenous (IV) or endogenous part of the selection mechanism (i.e., the confounders). If none of these requirements are met, a comparative interrupted time series (CITS) design might be a viable alternative if (a) multiple measurements of the outcome ( time series ) are available for both the treatment and a comparison group and (b) the treatment group’s time series has been interrupted by an intervention.
Suppose that all students of one class in a school (say, an advanced science class) attend the camp, whereas all students of another class in the same school do not attend. Also assume that monthly measures of science achievement before and after the science camp are available. Figure 3 illustrates such a scenario where the x -axis represents time in Months and the y -axis the Science Score (aggregated at the class level). The filled symbols indicate the treatment group (science camp), open symbols the comparison group (no science camp). The science camp intervention divides both time series into a preintervention time series (circles) and a postintervention time series (squares). The changes in the levels and slopes of the pre- and postintervention regression lines represent the camp’s impact but possibly also the effect of other events that co-occur with the intervention. The dashed lines extrapolate the preintervention growth curves into the postintervention period, and thus represent the counterfactual situation where the intervention but also other co-occurring events are absent.
Figure 3. A hypothetical example of a comparative interrupted time series design.
The strength of a CITS design is its ability to discriminate between the intervention’s effect and the effects of co-occurring events. Such events might be other potentially competing interventions (history effects) or changes in the measurement of the outcome (instrumentation), for instance. If the co-occurring events affect the treatment and comparison group to the same extent, then subtracting the changes in the comparison group’s growth curve from the changes in the treatment group’s growth curve provides a valid estimate of the intervention’s impact. Because we investigate the difference in the changes (= differences) of the two growth curves, the CITS design is a special case of the difference-in-differences design ( Somers et al., 2013 ).
Assume that a daily TV series about Albert Einstein was broadcast in the evenings of the science camp week and that students of both classes were exposed to the same extent to the TV series. It follows that the comparison group’s change in the growth curve represents the TV series’ impact. The comparison group’s time series in Figure 3 indicates that the TV series might have had an immediate impact on the growth curve’s level but almost no effect on the slope. On the other hand, the treatment group’s change in the growth curve is due to both the science camp and the TV series. Thus, in differencing out the TV series’ effect (estimated from the comparison group) we can identify the camp effect.
Let $t_c$ denote the time point of the intervention; then the intervention’s effect on the treated (ATT) at a postintervention time point $t \geq t_c$ is defined as $\tau_t = E[Y_{it}^T(1)] - E[Y_{it}^T(0)]$, where $Y_{it}^T(0)$ and $Y_{it}^T(1)$ are the potential control and treatment outcomes of subject $i$ in the treatment group ($T$) at time point $t$. The time series of the expected potential outcomes can be formalized as a sum of nonparametric but additive time-dependent functions. The treatment group’s expected potential control outcome can be represented as $E[Y_{it}^T(0)] = f_0^T(t) + f_E^T(t)$, where the control function $f_0^T(t)$ generates the expected potential control outcomes in the absence of any interventions ($I$) or co-occurring events ($E$), and the event function $f_E^T(t)$ adds the effects of co-occurring events. Similarly, the expected potential treatment outcome can be written as $E[Y_{it}^T(1)] = f_0^T(t) + f_E^T(t) + f_I^T(t)$, which adds the intervention’s effect $\tau_t = f_I^T(t)$ to the control and event functions. In the absence of a comparison group, we can try to identify the impact of the intervention by comparing the observable postintervention outcomes to the outcomes extrapolated from the preintervention time series (dashed line in Figure 3 ). Extrapolation is necessary because we do not observe any potential control outcomes in the postintervention period (only potential treatment outcomes are observed). Let $\hat{f}_0^T(t)$ denote the parametric extrapolation of the preintervention control function $f_0^T(t)$; then the observable pre-post-intervention difference ($PP_t^T$) in the expected outcome is $PP_t^T = f_0^T(t) + f_E^T(t) + f_I^T(t) - \hat{f}_0^T(t) = f_I^T(t) + (f_0^T(t) - \hat{f}_0^T(t)) + f_E^T(t)$. Thus, in the absence of a comparison group, ATT is identified (i.e., $PP_t^T = f_I^T(t) = \tau_t$) only if the control function is correctly specified ($f_0^T(t) = \hat{f}_0^T(t)$) and if no co-occurring events are present ($f_E^T(t) = 0$).
The comparison group in a CITS design allows us to relax both of these identifying assumptions. To see this, we first define the expected control outcomes of the comparison group ($C$) as a sum of two time-dependent functions as before: $E[Y_{it}^C(0)] = f_0^C(t) + f_E^C(t)$. Then, extrapolating the comparison group’s preintervention function into the postintervention period, $\hat{f}_0^C(t)$, we can compute the pre-post-intervention difference for the comparison group: $PP_t^C = f_0^C(t) + f_E^C(t) - \hat{f}_0^C(t) = f_E^C(t) + (f_0^C(t) - \hat{f}_0^C(t))$. If the control function is correctly specified, $f_0^C(t) = \hat{f}_0^C(t)$, the effect of co-occurring events is identified, $PP_t^C = f_E^C(t)$. However, we do not necessarily need a correctly specified control function, because in a CITS design we focus on the difference between the treatment and comparison group’s pre-post-intervention differences, that is, $PP_t^T - PP_t^C = f_I^T(t) + \{(f_0^T(t) - \hat{f}_0^T(t)) - (f_0^C(t) - \hat{f}_0^C(t))\} + \{f_E^T(t) - f_E^C(t)\}$. Thus, ATT is identified, $PP_t^T - PP_t^C = f_I^T(t) = \tau_t$, if (a) both control functions are either correctly specified or misspecified to the same additive extent such that $(f_0^T(t) - \hat{f}_0^T(t)) = (f_0^C(t) - \hat{f}_0^C(t))$ ( no differential misspecification ) and (b) the effect of co-occurring events is identical in the treatment and comparison group, $f_E^T(t) = f_E^C(t)$ ( no differential event effects ).
Estimating ATT
CITS designs are typically analyzed with linear regression models that regress the outcome Y on the centered time variable ( T − t c ), the intervention indicator Z ( Z = 0 if t < t c , otherwise Z = 1), the group indicator G ( G = 1 for the treatment group and G = 0 for the control group), and the corresponding two-way and three-way interactions, for instance $Y = \beta_0 + \beta_1 (T - t_c) + \beta_2 Z + \beta_3 G + \beta_4 (T - t_c) \times Z + \beta_5 (Z \times G) + \beta_6 (T - t_c) \times G + \beta_7 (T - t_c) \times Z \times G + e$.
Depending on the number of subjects in each group, fixed or random effects for the subjects are included as well (time fixed or random effects can also be considered). $\hat{\beta}_5$ estimates the intervention’s immediate effect at the onset of the intervention (change in intercept) and $\hat{\beta}_7$ the intervention’s effect on the growth rate (change in slope). The inclusion of dummy variables for each postintervention time point (plus their interactions with the intervention and group indicators) would allow for a direct estimation of the time-specific effects. If the time series are long enough (at least 100 time points), then a more careful modeling of the autocorrelation structure via time series models should be considered.
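As an illustration of this regression, the following sketch simulates two short class-level series and fits the fully interacted model with statsmodels (the 12-month design, the co-occurring event, and the camp effects are hypothetical; with so few time points the sketch ignores the serial-dependence issues discussed next):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
t_c = 6                                   # month of the intervention
rows = []
for g in (0, 1):                          # 0 = comparison class, 1 = camp class
    for t in range(1, 13):                # 12 monthly measurements
        z = int(t > t_c)                  # postintervention indicator
        score = (60 + 2 * g               # baseline level difference
                 + 0.5 * (t - t_c)        # common preintervention slope
                 + 1.0 * z                # co-occurring event (e.g., TV series)
                 + g * z * (3.0 + 0.8 * (t - t_c))   # assumed camp effect
                 + rng.normal(0, 0.5))
        rows.append({"score": score, "time_c": t - t_c, "z": z, "g": g})
df = pd.DataFrame(rows)

# Fully interacted CITS regression: z:g is the immediate effect (level change),
# time_c:z:g is the change in the growth rate (slope change).
fit = smf.ols("score ~ time_c * z * g", data=df).fit()
print(fit.params[["z:g", "time_c:z:g"]])
```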
Compared to other designs, CITS designs heavily rely on extrapolation and thus on functional form assumptions. Therefore, it is crucial that the functional forms of the pre- and postintervention time series (including their extrapolations) are correctly specified or at least not differentially misspecified. With short time series or measurement points that inadequately capture periodical variations, the correct specification of the functional form is very challenging. Another specification aspect concerns serial dependencies among the data points. Failing to model serial dependencies can bias effect estimates and their standard errors such that significance tests might be misleading. Accounting for serial dependencies requires autoregressive models (e.g., ARIMA models), but the time series should have at least 100 time points ( West, Biesanz, & Pitts, 2000 ). Standard fixed effects or random effects models deal at least partially with the dependence structure. Robust standard errors (e.g., Huber-White corrected ones) or the bootstrap can also be used to account for dependency structures.
Events that co-occur with the intervention of interest, like history or instrumentation effects, are a major threat to the time series designs that lack a comparison group ( Shadish et al., 2002 ). CITS designs are rather robust to co-occurring events as long as the treatment and comparison groups are affected to the same additive extent. However, there is no guarantee that both groups are exposed to the same events and affected to the same extent. For example, if students who do not attend the camp are less likely to watch the TV series, its effect cannot be completely differenced out (unless the exposure to the TV series is measured). If one uses aggregated data like class or school averages of achievement scores, then differential compositional shifts over time can also invalidate the CITS design. Compositional shifts occur due to dropouts or incoming subjects over time.
Strengthening CITS Designs
If the treatment and comparison group’s preintervention time series are very different (different levels and slopes), then the assumption that history or instrumentation threats affect both groups to the same additive extent may not hold. Matching treatment and comparison subjects prior to the analysis can increase the plausibility of this assumption. Instead of using all nonparticipating students of the comparison class, we may select only those students who have a similar level and growth in the preintervention science scores as the students participating in the camp. We can also match on additional covariates like socioeconomic status or motivation levels. Multivariate or PS matching can be used for this purpose. If the two groups are similar, it is more likely that they are affected by co-occurring events to the same extent.
As with the matching and PS designs, using an unaffected outcome in CITS designs helps to probe the untestable assumptions ( Coryn & Hobson, 2011 ; Shadish et al., 2002 ). For instance, we might expect that attending the science camp does not affect students’ reading scores but that some validity threats (e.g., attrition) operate on both the reading and science outcome. If we find a significant camp effect on the reading score, the validity of the CITS design for evaluating the camp’s impact on the science score is in doubt.
Another strategy for avoiding validity threats is to control the time point of the intervention, if possible. Researchers can postpone implementing the treatment until they have enough preintervention measures for reliably estimating the functional form. They can also choose to intervene when threats to validity are less likely (e.g., avoiding the week of the TV series). Control over the intervention also allows researchers to introduce and remove the treatment in subsequent time intervals, maybe even with switching replications between two (or more) groups. If the treatment is effective, we expect that the pattern of the intervention scheme is directly reflected in the time series of the outcome (for more details, see Shadish et al., 2002 ; for the literature on single-case designs, see Kazdin, 2011 ).
A comprehensive introduction to CITS design can be found in Shadish et al. (2002) , which also addresses many classical applications. For more technical details of its identification, refer to Lechner (2011) . Wong, Cook, and Steiner (2009) evaluated the effect of No Child Left Behind using a CITS design.
CONCLUDING REMARKS
This article discussed four of the strongest quasi-experimental designs for causal inference when randomized experiments are not feasible. For each design we highlighted the identification strategies and the required assumptions. In practice, it is crucial that the design assumptions are met; otherwise, biased effect estimates result. Because the most important assumptions, like the exclusion restriction or the unconfoundedness assumption, are not directly testable, researchers should always try to assess their plausibility via indirect tests and investigate the effect estimates’ sensitivity to violations of these assumptions.
Our discussion of RD, IV, PS, and CITS designs also made it very clear that, in comparison to RCTs, quasi-experimental designs rely on more or stronger assumptions. With perfect control over treatment assignment and treatment implementation (as in an RCT), causal inference is warranted by a minimal set of assumptions. But with limited control over and knowledge about treatment assignment and implementation, stronger assumptions are required and causal effects might be identifiable only for local subpopulations. Nonetheless, observational data sometimes meet the assumptions of a quasi-experimental design, at least approximately, such that causal conclusions are credible. If so, the estimates from quasi-experimental designs—which exploit naturally occurring selection processes and real-world implementations of the treatment—frequently generalize better than the results of a controlled laboratory experiment. Thus, if external validity is a major concern, the results of randomized experiments should always be complemented by findings from valid quasi-experiments.
REFERENCES
- Angrist JD, Imbens GW, & Rubin DB (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91, 444–455.
- Angrist JD, & Krueger AB (1992). The effect of age at school entry on educational attainment: An application of instrumental variables with moments from two samples. Journal of the American Statistical Association, 87, 328–336.
- Angrist JD, & Lavy V (1999). Using Maimonides' rule to estimate the effect of class size on scholastic achievement. Quarterly Journal of Economics, 114, 533–575.
- Angrist JD, & Pischke JS (2009). Mostly harmless econometrics: An empiricist's companion. Princeton, NJ: Princeton University Press.
- Angrist JD, & Pischke JS (2015). Mastering 'metrics: The path from cause to effect. Princeton, NJ: Princeton University Press.
- Baum CF, Schaffer ME, & Stillman S (2007). Enhanced routines for instrumental variables/generalized method of moments estimation and testing. The Stata Journal, 7, 465–506.
- Black D, Galdo J, & Smith JA (2007). Evaluating the bias of the regression discontinuity design using experimental data (Working paper). Chicago, IL: University of Chicago.
- Brito C, & Pearl J (2002). Generalized instrumental variables. In Darwiche A & Friedman N (Eds.), Uncertainty in artificial intelligence (pp. 85–93). San Francisco, CA: Morgan Kaufmann.
- Calonico S, Cattaneo MD, & Titiunik R (2015). rdrobust: Robust data-driven statistical inference in regression-discontinuity designs (R package ver. 0.80). Retrieved from http://CRAN.R-project.org/package=rdrobust
- Coryn CLS, & Hobson KA (2011). Using nonequivalent dependent variables to reduce internal validity threats in quasi-experiments: Rationale, history, and examples from practice. New Directions for Evaluation, 131, 31–39.
- Dimmery D (2013). rdd: Regression discontinuity estimation (R package ver. 0.56). Retrieved from http://CRAN.R-project.org/package=rdd
- Ding P, & Miratrix LW (2015). To adjust or not to adjust? Sensitivity analysis of M-bias and butterfly-bias. Journal of Causal Inference, 3(1), 41–57.
- Fox J (2006). Structural equation modeling with the sem package in R. Structural Equation Modeling, 13, 465–486.
- Hahn J, Todd P, & Van der Klaauw W (2001). Identification and estimation of treatment effects with a regression-discontinuity design. Econometrica, 69(1), 201–209.
- Hansen BB (2004). Full matching in an observational study of coaching for the SAT. Journal of the American Statistical Association, 99, 609–618.
- Hansen BB, & Klopfer SO (2006). Optimal full matching and related designs via network flows. Journal of Computational and Graphical Statistics, 15, 609–627.
- Ho D, Imai K, King G, & Stuart EA (2011). MatchIt: Nonparametric preprocessing for parametric causal inference. Journal of Statistical Software, 42(8), 1–28. Retrieved from http://www.jstatsoft.org/v42/i08/
- Holland PW (1986). Statistics and causal inference. Journal of the American Statistical Association, 81, 945–960.
- Holland PW (1988). Causal inference, path analysis and recursive structural equations models. ETS Research Report Series. doi:10.1002/j.2330-8516.1988.tb00270.x
- Horvitz DG, & Thompson DJ (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47, 663–685.
- Imai K, Keele L, Tingley D, & Yamamoto T (2011). Unpacking the black box of causality: Learning about causal mechanisms from experimental and observational studies. American Political Science Review, 105, 765–789.
- Imbens GW, & Lemieux T (2008). Regression discontinuity designs: A guide to practice. Journal of Econometrics, 142, 615–635.
- Imbens GW, & Rubin DB (2015). Causal inference in statistics, social, and biomedical sciences. New York, NY: Cambridge University Press.
- Kazdin AE (2011). Single-case research designs: Methods for clinical and applied settings. New York, NY: Oxford University Press.
- Keller B, Kim JS, & Steiner PM (2015). Neural networks for propensity score estimation: Simulation results and recommendations. In van der Ark LA, Bolt DM, Chow S-M, Douglas JA, & Wang W-C (Eds.), Quantitative psychology research (pp. 279–291). New York, NY: Springer.
- Lechner M (2011). The estimation of causal effects by difference-in-difference methods. Foundations and Trends in Econometrics, 4, 165–224.
- Lee DS, & Lemieux T (2010). Regression discontinuity designs in economics. Journal of Economic Literature, 48, 281–355.
- McCaffrey DF, Ridgeway G, & Morral AR (2004). Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods, 9, 403–425.
- McCrary J (2008). Manipulation of the running variable in the regression discontinuity design: A density test. Journal of Econometrics, 142, 698–714.
- Nichols A (2007). rd: Stata modules for regression discontinuity estimation. Retrieved from http://ideas.repec.org/c/boc/bocode/s456888.html
- Pearl J (2009). Causality: Models, reasoning, and inference (2nd ed.). New York, NY: Cambridge University Press.
- Pearl J (2010). On a class of bias-amplifying variables that endanger effect estimates. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (pp. 425–432). Corvallis, OR: Association for Uncertainty in Artificial Intelligence.
- Robins JM, & Rotnitzky A (1995). Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90(429), 122–129.
- Rosenbaum PR (2002). Observational studies. New York, NY: Springer.
- Rosenbaum PR, & Rubin DB (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55.
- Schafer JL, & Kang J (2008). Average causal effects from nonrandomized studies: A practical guide and simulated example. Psychological Methods, 13, 279–313.
- Sekhon JS (2011). Multivariate and propensity score matching software with automated balance optimization: The Matching package for R. Journal of Statistical Software, 42(7), 1–52.
- Shadish WR, Cook TD, & Campbell DT (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.
- Somers M, Zhu P, Jacob R, & Bloom H (2013). The validity and precision of the comparative interrupted time series design and the difference-in-difference design in educational evaluation (MDRC working paper in research methodology). New York, NY: MDRC.
- StataCorp. (2015). Stata treatment-effects reference manual: Potential outcomes/counterfactual outcomes. College Station, TX: Stata Press. Retrieved from http://www.stata.com/manuals14/te.pdf
- Steiner PM, & Cook D (2013). Matching and propensity scores. In Little T (Ed.), The Oxford handbook of quantitative methods in psychology (Vol. 1, pp. 237–259). New York, NY: Oxford University Press.
- Steiner PM, Cook TD, Li W, & Clark MH (2015). Bias reduction in quasi-experiments with little selection theory but many covariates. Journal of Research on Educational Effectiveness, 8, 552–576.
- Steiner PM, Cook TD, & Shadish WR (2011). On the importance of reliable covariate measurement in selection bias adjustments using propensity scores. Journal of Educational and Behavioral Statistics, 36, 213–236.
- Steiner PM, & Kim Y (in press). The mechanics of omitted variable bias: Bias amplification and cancellation of offsetting biases. Journal of Causal Inference.
- Steiner PM, Kim Y, Hall CE, & Su D (2015). Graphical models for quasi-experimental designs. Sociological Methods & Research. Advance online publication. doi:10.1177/0049124115582272
- West SG, Biesanz JC, & Pitts SC (2000). Causal inference and generalization in field settings: Experimental and quasi-experimental designs. In Reis HT & Judd CM (Eds.), Handbook of research methods in social and personality psychology (pp. 40–84). New York, NY: Cambridge University Press.
- Wing C, & Cook TD (2013). Strengthening the regression discontinuity design using additional design elements: A within-study comparison. Journal of Policy Analysis and Management, 32, 853–877.
- Wong M, Cook TD, & Steiner PM (2009). No Child Left Behind: An interim evaluation of its effects on learning using two interrupted time series each with its own non-equivalent comparison series (Working Paper No. WP-09-11). Evanston, IL: Institute for Policy Research, Northwestern University.
- Wong VC, Wing C, Steiner PM, Wong M, & Cook TD (2012). Research designs for program evaluation. Handbook of Psychology, 2, 316–341.
- Wooldridge J (2012). Introductory econometrics: A modern approach (5th ed.). Mason, OH: South-Western Cengage Learning.