
Getting beyond the mirage of external validity

This post is coauthored with Eliana Carranza
 
No thoughtful technocrat would copy a program in every detail from one context into her or his own country. That's because technocrats know (among other things) that economics is not a natural science but a social (or even dismal) science, and so replication in the fashion of chemistry isn't an option. For economics, external validity in the strict scientific sense is a mirage.
 
What technocrats and politicians are interested in, though, is what has worked elsewhere, so they can draw on that when designing their own program. So here we posit three ways of organizing this knowledge that get us as close as we can to external validity.
 
  1. Adaptive replications:  This is the case where the sense and spirit of the program design is kept the same (and yes, that is somewhat vague) but, far from adopting a cookie-cutter approach, the program is replicated in a different context with significant local variation.  This type of replication is best illustrated by the recent multi-country graduation-from-poverty study in Science that Berk blogged about on Monday.  Banerjee et al. look at a set of programs characterized by a core set of interventions (give people livestock, training, consumption support, etc.), but the local implementations differ in who actually gets the asset, how the program is targeted, what the consumption support looks like, and other factors.  The variation in program design isn't mind-blowingly huge, but some of it is large enough that evaluations could ask (or have asked) questions that focus on that particular variation (e.g. how long and when to give consumption support to poor households).  But to us, wearing our social science hats (and not our lab coats), this kind of variation is exactly what we want, because the next policymaker to adopt this kind of program is probably going to tweak things along the same dimensions.  In the end, for a sufficiently diverse set of cases, this approach lets us see whether fairly similar (not identical) programs have similar effects across a number of contexts, and infer the probability that the program will work in a new context.  And that's really useful.
      
  2. Evolutionary learning:  In this case, the next iteration of a program draws on previous implementation and impact evaluation results to inform and try a significant design variation in the same or a different context. When the program results are null or negative, thoughtful evaluation tries to figure out why this was the case (was some other constraint missed? was there an implementation failure? was the "dosage" too weak? were impacts heterogeneous?), giving the next version of the program something to build on.  The same holds true for positive program impacts, where a tweak is tried or a complementary intervention is added to the next iteration of the program.  And the learning occurs when this next version of the program is evaluated.  The literature on business training is a good example of this.  Most early evaluations of training were not so promising, so folks tried other interventions in the same vein (e.g. providing management consulting, as in Bloom et al.). They have tried, and are trying, training combined with capital, and training that takes a very different approach to what it aims to teach.  Evolution: the natural order of things (even in economics).
     
  3. Learning at scale:  Here an initial positive result (or results) leads to scaling up the implementation and evaluation of a program design in the same or a different context. This type of experimentation seeks to answer how small-scale programs (and their evaluation results) hold up when a lot of folks participate.  General equilibrium effects, less tailored implementation, and a host of other considerations might mean that program impacts are different at scale, so this is obviously an important way to learn.
 
Taking stock of these three approaches, what is clearly lacking is more adaptive replication studies and more learning at scale.
 
There are a couple of reasons why this is the case.  First, for learning at scale, convincing policymakers to do an evaluation that large and complex is orders of magnitude harder than doing one for a pilot program with a few thousand beneficiaries.
 
Second (and this applies to both approaches) is how publication and policymaking incentives play out.  Once the initial program is implemented and/or its evaluation results are published, the visibility of each additional iteration of program X (whether an adaptive replication or a scaled-up version) drops significantly. In the publication space this may mean moving down the ladder of journal rankings or not being able to publish the results at all. In the policy and evaluation funding spaces, it may become more difficult to justify and secure funding for another evaluation of program X when there is already evidence from the same or a similar context.
 
Banerjee et al. show us a way out of the challenge faced by adaptive replications: they can be made more appealing by setting up all the evaluations to take place at the same time, although this requires a massive amount of coordination (and no small amount of funding).  For learning at scale, though, it's still an uphill battle.  Two things that might help get more of it are capitalizing on cases where groups within the government or implementing organization have viable, competing visions of a project, and finding evidence-minded policymakers who are senior enough to push for it.
 
Here's hoping we get more of both in the future.
 

Authors

Markus Goldstein

Lead Economist, Africa Gender Innovation Lab and Chief Economists Office
