Why Self-Testing Works

Self-testing is the study strategy that 89% of students ignore. More than a century of research shows it outperforms rereading and most other common study techniques for long-term memory.

Introduction

Close the book. Put the notes away. Now write down everything you remember.

That simple act may be the most powerful study technique ever documented. It is called self-testing, and its track record stretches back more than a hundred years. In 1909, a graduate student at the University of Illinois named Edwina Abbott showed that students who tried to recall poetry from memory retained more than students who simply reread it [1]. In 2017, a team of researchers at Washington State University analyzed 272 separate experiments and confirmed what Abbott had glimpsed: self-testing produces learning gains roughly half a standard deviation above passive review, an effect size that, in educational terms, can mean the difference between a B and an A [2].

And yet almost nobody does it.

When Jeffrey Karpicke surveyed 177 undergraduates at Washington University in St. Louis, he found that only 11% spontaneously used any form of self-testing. Eighty-four percent relied on rereading their notes [3]. The most effective study strategy sits in plain sight, backed by hundreds of studies and thousands of participants, and the overwhelming majority of learners walk right past it.

This article tells the story of how we came to know this: the researchers who discovered it, the arguments that delayed its acceptance, the brain scans that revealed why it works, and the stubborn metacognitive illusion that keeps students from using it.

The Experiment Nobody Noticed

The story of self-testing begins in a place no one would expect: an early-twentieth-century schoolhouse.

Arthur Gates was a young psychologist at Columbia University. In 1917, he recruited children aged eight to sixteen and gave them a straightforward task: memorize biographical passages and lists of nonsense syllables [4]. Some children spent all their time reading the material. Others spent part of the time reading and part of it trying to recite what they had read, with their eyes closed. Gates varied the proportion: some children spent 20% of their time reciting, some 40%, some 60%, some 80%.

The results were unambiguous. Children who spent 60% to 80% of their time actively reciting remembered far more than children who spent all their time reading. The effect was especially large for nonsense syllables, where no existing knowledge could carry the learner through. For meaningful prose, the sweet spot was closer to 40% recitation. But in every case, the children who tested themselves outperformed the passive readers.

Twenty-two years later, Herbert Spitzer took the idea out of the laboratory and into the field. In 1939, he ran what would remain for decades the largest study of the testing effect ever conducted [5]. His participants: 3,605 sixth-graders in 91 elementary schools across Iowa. Each student read a 600-word article and then either took a test immediately or waited for varying periods before the first test. Spitzer tracked retention over two months.

The students who were tested immediately after reading showed dramatically less forgetting over the following weeks. No extra study. No rereading. Just one test, right after the initial exposure, and the forgetting curve bent.

1909: Abbott shows recall beats rereading for poetry
1917: Gates finds 60-80% recitation time is optimal
1939: Spitzer tests 3,605 Iowa students in 91 schools
1967: Tulving shows retrieval strengthens memory traces
2006: Roediger and Karpicke publish landmark paper in Psychological Science
2008: Karpicke and Roediger confirm continued retrieval is key
2011: Karpicke and Blunt show testing beats concept mapping
2013: Dunlosky rates practice testing as highest-utility strategy
2021: Wiklund-Hörnqvist maps testing effect in hippocampus
2025: Zhang et al. link retrieval to sleep consolidation via EEG

And then something strange happened. The field forgot.

Behaviorism swept through psychology. Interest shifted to stimulus-response associations and reinforcement schedules. The testing effect sat in journals, gathering dust, for more than half a century. It would take until the early 2000s for cognitive psychologists to rediscover what Gates and Spitzer had known all along.

The Paper That Changed Everything

In 2006, Henry Roediger III and Jeffrey Karpicke at Washington University published a study that would redirect an entire field [6].

The design was elegant. Undergraduates read short prose passages about scientific topics. One group read the passage four times. Another group read it once and then took three recall tests. Five minutes later, the rereaders actually scored somewhat higher on a final test. But a week later, the pattern had reversed and the gap was enormous. The students who had been tested three times retained roughly 50% more than the students who had read the passage four times.

Here was the cruel irony. After the study session, the repeated-readers felt more confident. They believed they had learned the material better. The tested students, who had struggled through three recall attempts, felt less confident. The strategy that felt worse was the strategy that worked better.

Karpicke pushed the finding further. In 2008, he and Roediger published in Science [7]. Students learned Swahili-English word pairs and were assigned to different conditions. The critical comparison: students who continued retrieving words they had already recalled versus students who dropped those words from further practice. Continued retrieval produced massive long-term gains. Dropping items after one successful recall produced almost none.

The message was blunt. Retrieval is not just a way to check what you know. Retrieval is itself a learning event. Every act of pulling information from memory changes that memory, making it stronger and easier to find the next time.

Then came 2011. Karpicke and Janell Blunt published in Science again [8], this time with a result that startled even advocates of retrieval practice. Students studied a science text using one of four methods: reading once, reading four times, creating a concept map, or practicing retrieval. On a delayed test that required drawing inferences, the retrieval group outperformed the concept-mapping group by an effect size of d = 1.50. Not a small edge. A chasm.
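
For readers unfamiliar with the statistic, an effect size such as d expresses the gap between two group means in units of their pooled standard deviation. The standard textbook definition of Cohen's d is given below. The subscripts and group labels are chosen here for illustration and are not taken from the cited paper.

```latex
d = \frac{\bar{x}_{\text{retrieval}} - \bar{x}_{\text{comparison}}}{s_{\text{pooled}}},
\qquad
s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}
```

Read this way, d = 1.50 means the retrieval group's average sat one and a half pooled standard deviations above the concept-mapping group's average.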

Two years later, John Dunlosky and his colleagues at Kent State published a 47-page review in Psychological Science in the Public Interest [9]. They evaluated ten common study techniques. Only two earned the highest rating: practice testing and distributed practice. Highlighting, rereading, and summarization all received the lowest rating. The evidence was now overwhelming, and it was sitting in one of the most widely read psychology journals in the world.

The Brain That Rebuilds Itself

For most of its history, the testing effect was purely behavioral. Students tested themselves, and they remembered more. But nobody knew why.

The first clue came from a lab at Duke University. In 2013, Elizabeth Wing, Elizabeth Marsh, and Roberto Cabeza scanned participants with functional MRI as they studied word pairs [10]. Some participants restudied the pairs. Others took a cued-recall test. The next day, everyone took a final memory test inside the scanner.

The results split along a clear anatomical line. Items that had been tested and later remembered activated the anterior hippocampus, the lateral temporal cortex, and the medial prefrontal cortex. Items that had merely been restudied and later remembered activated a different, less distinctive pattern. Retrieval practice was engaging a broader, more interconnected brain network.

Eight years later, Carola Wiklund-Hörnqvist and her team in Sweden went deeper [11]. They scanned fifty high-school students learning Swahili-Swedish word pairs and found something striking along the length of the hippocampus. The posterior hippocampus, the back end, increased activity proportionally with the number of successful retrievals. It coded individual episodes. But the anterior hippocampus, the front end, only kicked in after many successful retrievals. It appeared to build something more abstract, a generalized representation stripped of episodic detail. Two systems, working in tandem. One for the specific memory. One for the gist.

The behavioral effect in that study was enormous: Cohen's d = 1.22.

In the same year, Marin-Garcia, Mattfeld, and Gabrieli at MIT found yet another piece of the puzzle [12]. After one week, the brains of tested participants showed a unique network linking the left putamen, a structure deep in the basal ganglia, and the left inferior parietal cortex near the supramarginal gyrus. The restudy group showed no such network. The putamen is part of the brain's reward and reinforcement circuitry. The fact that successful retrieval engaged this circuit suggested a mechanism: each time you successfully pull a memory from storage, a small dopamine-mediated reinforcement signal stamps that memory as worth keeping.

A 2025 EEG study by Zhang and colleagues added the final piece: self-testing changes how the brain consolidates memories during sleep [13]. Participants who had practiced retrieval before sleep showed different patterns of overnight consolidation than those who had merely restudied. The testing effect does not end when you close the book. It follows you into sleep.

What does this mean for you? When you close your textbook and try to recall what you just read, you are not merely checking a box. You are triggering a cascade of neural events: hippocampal encoding, prefrontal context restoration, striatal reinforcement, and cortical trace strengthening. Rereading triggers far less of this. The brain appears to treat passive review as background. It treats retrieval as a signal worth remembering.

Why the Hardest Strategy Feels Wrong

In 1994, Robert Bjork at UCLA coined a term that would become one of the most cited ideas in learning science: desirable difficulties [14].

The concept is simple but counterintuitive. Some study strategies feel easy and produce quick results that vanish within days. Other strategies feel hard, slow, even frustrating, and produce results that last months or years. Self-testing belongs firmly in the second category.

When you reread your notes, the material feels familiar. Familiar feels like learned. But familiarity is a liar. Benjamin, Bjork, and Schwartz showed in 1998 that people systematically confuse the ease of processing information right now with the probability of remembering it later [15]. They called this the fluency illusion. Items that come to mind easily during study are judged as well-learned, even when the ease is entirely due to recency, not durable storage.

Self-testing shatters this illusion. When you close the book and try to recall the causes of World War I, you will find gaps. You will struggle. You will get things wrong. This feels bad. It feels like evidence that you haven't learned the material. Students experience this feeling and reach the obvious but incorrect conclusion: self-testing doesn't work. They return to rereading.

Bjork's framework explains why the struggle is precisely the point. The effortful search through memory, even when it fails partially, strengthens the pathways that lead to that information. The harder the retrieval, provided it eventually succeeds, the larger the long-term benefit. Rowland's 2014 meta-analysis of the entire testing-effect literature confirmed this directly: recall tests, which require more effort, produced significantly larger effects than recognition tests, which require less [16].

Think of it like exercise. Nobody enjoys the burn of a hard workout while it is happening. But the burn is the adaptation signal. Remove the difficulty, and you remove the growth.

The 11% Problem

If self-testing works so well, why does almost nobody use it?

Karpicke's 2009 survey gives the clearest answer. When 177 undergraduates were asked, in an open-ended question, to describe their primary study strategy, only 11% mentioned any form of self-testing. The dominant strategy, by a huge margin, was rereading [3]. When given a forced-choice scenario, most students who did choose testing said they did so to identify gaps for further study, not because they believed retrieval itself was a learning event.

The problem has a name: metacognition. Or more precisely, bad metacognition. Students are poor judges of their own learning. They overestimate what they know, underestimate what they have forgotten, and systematically choose the strategy that feels most productive over the strategy that is most productive.

Nate Kornell and Robert Bjork showed in a series of studies that even after students experience the benefits of testing, they often revert to passive strategies under real exam pressure [17]. Teaching students about the testing effect helps briefly. But the pull of fluency is strong. Rereading makes you feel safe. Self-testing forces you to confront what you don't know.

This is not stupidity. It is a deep feature of how human metacognition works. The same systems that allow us to monitor our own thinking also mislead us about the relationship between effort and learning. The connection to the Dunning-Kruger effect is direct: those who know the least are often the most confident in what they know, precisely because they have never tested themselves and discovered the gaps.

Not All Self-Tests Are Equal

Self-testing is not one technique. It is a family of techniques, and they differ in how much effort they demand and how much learning they produce.

At one end sits free recall. Close the book. Take a blank page. Write everything you remember. No cues. No prompts. Just you and your memory. This is the hardest form of self-testing, and both Rowland's meta-analysis [16] and Adesope's meta-analysis [2] agree: it produces the largest effects.

Next comes cued recall. A question, a keyword, or a partial prompt helps direct the search. Flashcards belong here. They work well when used correctly, but Kornell and Bjork showed in 2008 that students often misuse them, dropping cards from the stack as soon as they get the answer right once [18]. Karpicke and Roediger's 2008 Science paper showed that dropping items after one correct recall eliminates most of the long-term benefit [7].

Short-answer and essay questions sit in the middle of the effort spectrum. Multiple-choice and true-false tests sit near the bottom. Recognition-based tests produce smaller benefits than recall-based tests, though Little, Bjork, Bjork, and Angello showed in 2012 that multiple-choice tests with well-constructed distractors can still produce meaningful learning [19].

Then there are hybrid formats. Self-generated questions, where you write your own quiz before answering it. Concept mapping from memory, which Karpicke and Blunt showed outperforms concept mapping from notes [8]. The teach-back method, where you explain a concept as if teaching it to someone else. Each combines retrieval with elaboration, and each produces strong results.

Self-Testing Format	Effort Level	Effect Size	Best For
Free recall (brain dump)	Very high	Large (g = 0.70)	Prose, conceptual material
Cued recall (flashcards)	High	Moderate-large (g = 0.55)	Vocabulary, paired associates
Short-answer questions	High	Moderate-large (g = 0.50)	Factual and applied knowledge
Concept mapping from memory	High	Large (estimated)	Complex relationships
Teach-back / Feynman method	High	Large (estimated)	Deep conceptual understanding
Multiple-choice with distractors	Low-moderate	Small-moderate (g = 0.30)	Broad content coverage
Recognition / true-false	Low	Small (g = 0.20)	Quick review only

The practical lesson is clear. If you want the largest return on your study time, choose formats that demand the most from your memory. A blank page and a closed book will usually beat a highlighted textbook.

Testing Today Helps You Learn Tomorrow

The testing effect has a lesser-known cousin. It is called the forward testing effect, and it may be even more surprising.

In 2008, Karl Szpunar, Kathleen McDermott, and Henry Roediger ran a clever experiment [20]. Participants studied five lists of words. Between lists, some participants took a short recall test on the material they had just studied. Others did nothing. At the end, all participants were tested on the fifth list, the one that everyone had studied identically.

The result was startling. Participants who had been tested between lists recalled roughly twice as much of the fifth list as participants who had not been tested: 39% versus 19% in one experiment, 54% versus 24% in another. Testing on old material had improved learning of new material. And intrusion errors, memories from earlier lists bleeding into the current one, were reduced by a factor of ten.

Five years later, Szpunar teamed up with Daniel Schacter at Harvard for a study published in Proceedings of the National Academy of Sciences [21]. They showed participants lecture videos, with or without brief quizzes between segments. When quizzed, mind-wandering dropped by 50%. Note-taking tripled. Final-test performance improved significantly. The quizzes were not graded. They carried no stakes. Their only function was to trigger retrieval, and that single act reset the learner's attention and primed the brain for what came next.

Bernhard Pastötter and Karl-Heinz Bäuml formalized the mechanism in a 2014 review [22]. They proposed that interim testing reduces proactive interference, the tendency of old memories to block new learning. It also appears to reset the encoding process, giving subsequent material a fresh start in working memory.

What does this mean in practice? It means that quizzing yourself on chapter one before reading chapter two does not just strengthen chapter one. It makes chapter two easier to learn. The testing effect is not just backward-looking. It reaches forward.

When Self-Testing Meets Spacing

Self-testing alone works. Self-testing distributed across time is transformational.

Katherine Rawson and John Dunlosky at Kent State developed a protocol they called successive relearning [23]. The procedure is simple: study a set of items, test yourself, get feedback, then return to the same items on a different day and test yourself again. Repeat until each item has been successfully recalled in at least three spaced sessions.

The results were staggering. In one classroom study, students who completed three spaced relearning sessions retained 78% of the material at delay. The control group retained 20%. That is not a marginal improvement. That is the difference between passing and failing.

In 2022, Rawson and Dunlosky summarized a decade of evidence in Current Directions in Psychological Science [24]. Their conclusion: successive relearning produces "more than a letter-grade boost" regardless of whether students use the technique in a lab or on their own. The combination of retrieval and spacing exploits both the testing effect and the spacing effect simultaneously. Each spaced retrieval is harder than a massed one, because the memory has partially faded, and that added difficulty drives deeper consolidation.
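
The successive relearning procedure described above (test, get feedback, return on a different day, repeat until each item has been recalled in three spaced sessions) can be sketched as a small tracker. This is a minimal illustration, not the authors' software: the `Item` class, the function names, the exact-match scoring, and the Swahili example words are all assumptions made here.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One thing to learn, plus the days on which it was recalled correctly."""
    prompt: str
    answer: str
    successful_days: set = field(default_factory=set)


def record_attempt(item: Item, day: int, response: str) -> bool:
    """Score one retrieval attempt and return whether it was correct.

    Scoring is a simple case-insensitive exact match (an assumption).
    After a miss, the caller can show item.answer as feedback.
    """
    correct = response.strip().lower() == item.answer.strip().lower()
    if correct:
        # A set means several correct recalls on the same day count once,
        # so only recalls on different days move the item toward mastery.
        item.successful_days.add(day)
    return correct


def is_mastered(item: Item, required_sessions: int = 3) -> bool:
    """True once the item was recalled correctly on enough separate days."""
    return len(item.successful_days) >= required_sessions


def due_items(items, required_sessions: int = 3):
    """Items that still need at least one more spaced session."""
    return [it for it in items if not is_mastered(it, required_sessions)]


if __name__ == "__main__":
    cards = [Item("kitabu", "book"), Item("rafiki", "friend")]
    record_attempt(cards[0], day=1, response="book")
    record_attempt(cards[0], day=1, response="book")  # same day, counts once
    record_attempt(cards[0], day=3, response="Book")
    record_attempt(cards[0], day=6, response="book")
    print([it.prompt for it in due_items(cards)])  # ['rafiki']
```

Storing days in a set is the design choice that enforces spacing: cramming several correct answers into one sitting never satisfies the three-session criterion.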

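The optimal spacing interval is not fixed. Cepeda and colleagues showed in 2008 that the best gap between study sessions is roughly 10 to 20% of the intended retention interval, with the ideal proportion shrinking as the retention interval grows [25]. If you need to remember something for a month, space your retrieval sessions three to six days apart. If you need to remember for a year, a literal 10 to 20% would imply gaps of five to ten weeks, but because the proportion shrinks at long intervals, gaps of a few weeks are a reasonable target.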
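
The proportional rule of thumb in this paragraph reduces to simple arithmetic. The sketch below applies it literally. The function and class names are hypothetical, and the 10% and 20% defaults are just the approximate figures from the text, exposed as parameters so they can be adjusted for other retention intervals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GapRange:
    """Suggested spacing between retrieval sessions, in days."""
    shortest: float
    longest: float


def suggested_gap(retention_days: float,
                  low_fraction: float = 0.10,
                  high_fraction: float = 0.20) -> GapRange:
    """Scale the retention interval by the rule-of-thumb fractions.

    The defaults mirror the approximate 10-20% figure quoted in the
    text; they are a heuristic, not a fitted model.
    """
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    if not 0 < low_fraction <= high_fraction:
        raise ValueError("need 0 < low_fraction <= high_fraction")
    return GapRange(retention_days * low_fraction,
                    retention_days * high_fraction)


if __name__ == "__main__":
    gap = suggested_gap(30)  # an exam roughly one month away
    print(f"{gap.shortest:.0f} to {gap.longest:.0f} days")  # 3 to 6 days
```

For a 30-day retention interval the defaults give gaps of three to six days, matching the one-month example in the text.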

When Self-Testing Doesn't Work

No strategy works always and everywhere. Honesty about limits is what separates science from sales.

Tamara van Gog and John Sweller argued in a 2015 review that the testing effect weakens or disappears as the complexity of the material rises [26]. Their reasoning drew on cognitive load theory. When material has many interacting elements, a novice learner may not have enough in memory to attempt a meaningful retrieval. Trying to recall a proof you have never understood is not a desirable difficulty. It is just difficulty. In these cases, studying worked examples before attempting retrieval may be essential.

Karpicke and Aue responded sharply [27]. They argued that the studies van Gog and Sweller cited had methodological problems and that element interactivity was poorly defined. The debate, published side by side in Educational Psychology Review, remains unresolved. But the practical takeaway is sensible: self-testing works best when the learner has at least a baseline understanding of the material. For entirely novel, highly complex content, initial study should precede retrieval practice, not replace it.

A second boundary condition involves retrieval success. Both Rowland [16] and Adesope [2] found that the benefits of testing depend on actually retrieving something. If retrieval success is very low, the testing effect shrinks. The practical threshold appears to be around 50%. If you are getting fewer than half the items right during practice, the material is too difficult for self-testing. Go back and study first.
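
The 50% rule of thumb above can be expressed as a quick check to run after a practice session. This is a minimal sketch: the function name and the advice strings are hypothetical, and the default threshold simply mirrors the approximate figure in the text.

```python
def practice_advice(correct: int, attempted: int, threshold: float = 0.5) -> str:
    """Suggest a next step based on the retrieval success rate.

    Below the threshold, the text's advice is to study first;
    at or above it, continuing to self-test is reasonable.
    """
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    if not 0 <= correct <= attempted:
        raise ValueError("correct must be between 0 and attempted")
    success_rate = correct / attempted
    if success_rate < threshold:
        return "study first, then retest"
    return "keep self-testing"


if __name__ == "__main__":
    print(practice_advice(correct=4, attempted=10))  # study first, then retest
    print(practice_advice(correct=7, attempted=10))  # keep self-testing
```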

A third concern involves multiple-choice tests with misleading distractors. Roediger and Marsh showed in 2005 that when students choose a wrong answer on a multiple-choice test, they sometimes consolidate the error [28]. Feedback eliminates this problem, but without feedback, poorly designed multiple-choice tests can do more harm than good.

And then there is mathematics. A 2025 meta-analysis by Murray, Horner, and Göbel specifically examined self-testing in mathematics and found a testing-versus-restudy effect of only g = 0.18, with the confidence interval crossing zero [29]. The authors cautioned that the evidence base was small, just seven studies, but the finding suggests that the testing effect may not transfer uniformly to procedural, calculation-heavy domains.

None of this undermines self-testing as a strategy. It simply means that, like every tool, it works best when used with judgment.

From Laboratory to Lecture Hall

Does the testing effect survive outside the lab?

The answer, across more than a dozen controlled classroom studies, is yes.

In medical education, Douglas Larsen, Andrew Butler, and Henry Roediger gave neurology residents either repeated tests or repeated study sessions on emergency topics [30]. Six months later, the tested group outperformed the study group by a wide margin. Azzam and Easteal confirmed similar results in gross anatomy with 248 medical students [31]. A 2023 systematic review catalogued retrieval-practice benefits across nursing, pharmacy, dental, and medical education [32].

In K-12 education, Mark McDaniel, Pooja Agarwal, and colleagues embedded low-stakes quizzes into middle-school science and social studies classes [33]. Tested material was retained better on unit exams, semester exams, and even on standardized state assessments given months later. The quizzes took only a few minutes of class time.

In language learning, the canonical domain of the testing effect, Karpicke and Roediger's Swahili-English studies have been replicated and extended dozens of times across languages from Swedish to Japanese [7]. Barenberg and colleagues showed in 2021 that retrieval practice for English vocabulary in German classrooms transferred to new test formats, not just the practiced format [34].

In 2024, Bates and Shea surveyed 153 teachers in English schools and found that 100% reported using retrieval practice regularly, with 63% using it in every lesson [35]. The most common way they learned about the strategy was from colleagues. After decades in journals, the testing effect is finally reaching classrooms, carried not by policy mandates but by word of mouth.

Beyond Recall: Self-Testing and Transfer

A persistent criticism of self-testing has been that it only helps with verbatim recall of studied facts. Andrew Butler's 2010 series of experiments put that objection to rest [36].

Students studied prose passages and either restudied or were tested. One week later, they took a final test that included three types of questions: the same questions they had practiced, new inference questions within the same domain, and new inference questions in a different domain entirely. Tested students outperformed restudied students in all three conditions, including the cross-domain transfer test.

Shana Carpenter reviewed the growing evidence in 2012 and concluded that retrieval practice supports transfer whenever initial retrieval succeeds [37]. The mechanism she proposed is elaborative retrieval: each time you pull a memory from storage, you activate not just the target information but also its surrounding context, related facts, and alternative interpretations. This enriched network gives you more routes to reach the information later, even in novel situations.

Pan and Rickard's 2018 meta-analysis of transfer studies found a positive but somewhat smaller effect, around d = 0.40, compared to the larger effects seen on same-content tests [38]. Transfer was strongest when retrieval involved free recall, when feedback was provided, and when the material had a clear conceptual structure.

Self-testing, then, does not just help you repeat what you have memorized. It helps you think with what you know.

How to Self-Test Well

The evidence converges on a handful of principles that distinguish effective self-testing from wasted effort.

First: begin soon, but not immediately. A short delay, even fifteen minutes, between initial study and the first retrieval attempt forces slightly more effort and produces a slightly larger benefit [25].

Second: use free recall first. Before reaching for flashcards or practice questions, close the book and write everything you can remember. This uncued, open-ended retrieval activates the broadest memory search and reveals the largest gaps.

Third: do not stop after the first correct recall. Karpicke and Roediger's 2008 Science study showed that the additional retrievals after the first success are where most of the long-term benefit comes from [7].

Fourth: space your sessions. Three retrieval sessions across three days will reliably beat three retrievals in one sitting [24].

Fifth: always check your answers. Feedback is what prevents errors from becoming cemented. Butler, Karpicke, and Roediger found in 2008 that delayed feedback, given after a short interval rather than immediately, produced the largest retention benefits [39].

Sixth: mix your topics. Interleaving forces the brain to discriminate between similar concepts, which strengthens retrieval cues for each one [40].

Seventh: try pretesting. Answering questions before you have studied the material, even when you get everything wrong, improves subsequent learning [41]. The failed retrieval primes the brain to pay closer attention to the answer when it appears.

And eighth: trust the difficulty. The strategies that feel least productive in the moment, self-testing, spacing, interleaving, are precisely the ones that produce the largest long-term gains [14]. If studying feels easy, you are probably not learning.

Conclusion

The arc of self-testing research bends across 116 years and thousands of participants. From Abbott's poetry recitations in 1909 to Zhang's sleep EEG recordings in 2025, the message has not changed: testing yourself, struggling to recall, confronting what you do not know, produces deeper and more durable learning than any form of passive review.

The irony is that the strategy feels worse while it works better. This metacognitive trap, the fluency illusion, has kept generations of students rereading notes that are slowly evaporating from memory. Breaking free requires a small act of trust. Trust that difficulty is the signal of growth. Trust that the blank page, the failed recall, the uncomfortable silence before an answer arrives, these are not signs of failure. They are the sound of a memory being built.

The neuroscience confirms it. The meta-analyses quantify it. And the classroom studies replicate it. Self-testing is not a study hack. It is how memory works.

Frequently Asked Questions

What is self-testing in studying?

Self-testing is the practice of trying to recall information from memory without looking at your notes or textbook. It includes techniques like flashcards, free recall, practice questions, and writing down everything you remember after reading. Research shows it is more effective for long-term retention than rereading or highlighting.

Is self-testing better than rereading?

Yes. Multiple meta-analyses covering hundreds of studies confirm that self-testing outperforms rereading by roughly half a standard deviation over periods of days to weeks. The 2006 study by Roediger and Karpicke is the most cited demonstration: students tested three times retained roughly 50% more after one week than students who read the same passage four times.

Why does self-testing work so well for memory?

Self-testing forces the brain to reconstruct information from scratch, which strengthens the neural pathways used for retrieval. Brain imaging studies show that retrieval activates the hippocampus, prefrontal cortex, and striatum in ways that rereading does not. Each successful retrieval makes the next one easier and more reliable.

How often should I self-test?

Research on successive relearning suggests three spaced retrieval sessions, spread across different days, produces the best results. The optimal gap between sessions is roughly 10 to 20 percent of the time you want to remember the material. For a test in one month, space your self-testing sessions about three to six days apart.

Does self-testing work for complex subjects like math and science?

Self-testing works across most subjects, including science, medicine, history, and language learning. However, a 2025 meta-analysis found weaker effects in mathematics. For very complex material, it helps to study and understand the basics before attempting retrieval practice, rather than testing yourself on material you have never understood.