Introduction
In 1987, a biology student in Poland wrote a formula that would schedule billions of flashcard reviews for the next thirty-five years [1]. That formula, Algorithm SM-2, became the invisible engine behind nearly every open-source spaced repetition system ever built. It worked. But "worked" is a loose word. The formula treated every learner the same. It had no way to adapt to a medical student who forgets anatomy terms faster than pharmacology facts. It had no mechanism for recovering from a bad streak of failed reviews. And it used a mathematical model of forgetting that science had already moved past.
Then in 2022, a new algorithm called FSRS appeared, built on machine learning and trained on roughly seven hundred million reviews [2]. Within two years it replaced SM-2 as the default scheduler in the world's largest open-source review platform. Benchmarks showed it outperformed SM-2 for more than 99 percent of users tested [3].
But what does "outperform" actually mean? What makes one algorithm better than another? This is not a comparison article. It is a question about criteria. What properties should a spaced repetition algorithm have to be genuinely effective? And what does the science of memory tell us about why those properties matter?
The answers turn out to be more subtle than most people expect. For a deeper look at how scheduling algorithms evolved historically, see Spaced Repetition Algorithms: From SM-2 to FSRS.

The Question Nobody Thought to Ask
For decades, the spaced repetition community treated algorithm quality as a settled issue. SM-2 existed. It seemed to work. Users memorized vocabulary, passed exams, and learned languages. Nobody ran controlled experiments comparing one scheduling formula to another, because the gap between using any spaced repetition system and using none at all was so enormous that the question of which algorithm was better seemed almost trivial.
That intuition was not wrong. Meta-analyses confirm the effect is large. Cepeda and colleagues analyzed 254 studies involving over 14,000 participants and found that distributing practice across time consistently produced better retention than massing practice into a single session [4]. A 2025 meta-analysis of spaced repetition specifically in medical education, covering 21,415 learners across 85 studies, reported a standardized mean difference of 0.78 in favor of spaced repetition over conventional study methods [5]. That is a large effect by any standard.
But here is where the story gets interesting. If a medical student does 500 reviews per day for three years, and one algorithm requires 20 percent fewer reviews to maintain the same retention, that difference saves hundreds of hours. The gap between algorithms may be small in percentage terms. At scale, it is enormous in human terms.
The problem was that nobody had the data to measure which algorithm was actually better. Until someone built a benchmark large enough to find out.

Seven Properties That Define an Effective Algorithm
No single number captures whether a spaced repetition algorithm is effective. Effectiveness is a collection of properties, each measuring a different aspect of how well the algorithm does its job. Based on the open-spaced-repetition benchmark methodology [6], the scientific literature, and the practical constraints of real learners, seven properties emerge as the criteria that matter most.
The first is prediction accuracy. An effective algorithm must predict the probability that a learner will recall a specific card at a specific time. If it says there is a 90 percent chance of recall, the learner should indeed recall correctly about 90 percent of the time. This property is called calibration. It is measured by log loss and by a custom metric called RMSE in bins, where reviews are grouped by interval length and the difference between predicted and actual recall rates is computed within each group [3].
The second is discrimination. Even if the absolute probabilities are slightly off, the algorithm should rank cards correctly. A card the learner is likely to forget should always receive a lower predicted recall probability than a card the learner is likely to remember. This is measured by AUC, the area under the receiver operating characteristic curve [3].
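To make these first two properties concrete, here is a minimal sketch of how they might be computed for a single user's review history. The data is simulated and the binning scheme is simplified for illustration; the benchmark's own implementation bins by interval length and weights the bins differently [3].

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

rng = np.random.default_rng(0)
p = rng.uniform(0.5, 0.99, size=10_000)          # stand-in predicted recall probabilities
y = (rng.uniform(size=10_000) < p).astype(int)   # simulated outcomes (1 = recalled)

print("log loss:", log_loss(y, p))   # calibration: penalizes confident wrong predictions
print("AUC:", roc_auc_score(y, p))   # discrimination: ranking quality

# "RMSE in bins": group reviews, then compare mean predicted vs. mean
# observed recall within each bin. (The benchmark bins by interval
# length; binning by predicted probability here is a simplification.)
edges = np.quantile(p, np.linspace(0, 1, 11))
bin_idx = np.clip(np.digitize(p, edges) - 1, 0, 9)
sq_errs, weights = [], []
for b in range(10):
    mask = bin_idx == b
    if mask.any():
        sq_errs.append((p[mask].mean() - y[mask].mean()) ** 2)
        weights.append(mask.sum())
print("RMSE in bins:", np.sqrt(np.average(sq_errs, weights=weights)))
```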
The third is individual adaptivity. Different learners forget at different rates. A person who sleeps eight hours might consolidate memories faster than someone who sleeps five. A card about molecular biology might decay faster for a history student than for a chemistry student. An effective algorithm exposes parameters that can be fitted to each individual learner's review history, rather than applying the same fixed multipliers to everyone.
The fourth is review efficiency. For a fixed retention target, fewer reviews per card per year is strictly better. This is the property learners care about most directly, because it translates into time saved.
The fifth is lapse management. When a learner forgets a card, the algorithm must reset the card's scheduling state proportionally. Not too aggressively, trapping the card in a cycle of unnecessarily frequent reviews. Not too leniently, letting the card slip away again.
The sixth is robustness to irregular study. Real learners miss days, study in bursts, go on vacation, and return after long absences. An effective algorithm must not catastrophically miscalibrate when the learner's actual review timing deviates from the scheduled timing.
The seventh is the tradeoff between computational simplicity and accuracy. A neural network with millions of parameters might predict recall probabilities slightly better than a formula with twenty-one parameters. But if the neural network produces non-monotonic predictions, meaning it sometimes claims a card is more likely to be recalled after a longer delay, or if it requires server-side computation that prevents offline use, the accuracy gain may not be worth the cost.
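The monotonicity concern is simple enough to state in code. Here is a sketch of the kind of sanity check a scheduler might run, using a simplified power-law curve as a stand-in for any model's prediction function:

```python
def predict_recall(stability: float, days: float) -> float:
    # Simplified power-law forgetting curve, a stand-in for any model.
    return (1 + days / (9 * stability)) ** -1

def is_monotonic(predict, stability: float, max_days: int = 365) -> bool:
    # A sane scheduler's predicted recall should never rise as delay grows.
    probs = [predict(stability, d) for d in range(1, max_days + 1)]
    return all(a >= b for a, b in zip(probs, probs[1:]))

print(is_monotonic(predict_recall, stability=10.0))  # True for this curve
```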

The Brain's Case for Adaptivity
Why does individual adaptivity matter so much? The answer comes from neuroscience, not software engineering.
Long-term potentiation, the cellular mechanism widely accepted as the basis for synaptic memory consolidation, shows a striking spacing effect at the neural level. When researchers apply bursts of electrical stimulation to neurons in rapid succession, the resulting potentiation is weak and short-lived. But when the same total stimulation is spread across intervals, the potentiation is dramatically stronger and longer-lasting [7]. The synapse, at the molecular level, has a refractory period. A second bout of stimulation applied less than thirty minutes after the first produces almost no additional strengthening. But the same stimulation applied after sixty minutes produces significantly larger effects.
This refractory period varies between brain regions, between types of synapses, and between individuals. A 2024 study by Comyn, Preat, Pavlowsky, and Plaçais demonstrated that the spacing effect on long-term memory formation is gated by PKCdelta, a protein kinase that activates mitochondrial metabolism in memory-encoding neurons [8]. The biological argument is direct: forming long-term memories is energetically expensive. The brain restricts this investment to information that is repeatedly relevant across time. Spaced repetition exploits exactly this biological filter.
But here is the critical point for algorithm design. The rate at which PKCdelta activates, the speed of synaptic consolidation, and the depth of sleep-dependent memory replay all differ between individuals. Walker and Stickgold showed that sleep, particularly slow-wave sleep and stage N2 with sleep spindles, actively reorganizes memory traces during the night [9]. A learner who consistently gets deep slow-wave sleep will consolidate spaced reviews more effectively than one who does not. An algorithm that treats both learners identically is leaving performance on the table.
This is why individually adaptive algorithms outperform fixed-parameter ones. They do not need to know why one learner forgets faster. They simply observe that learner's review history and fit their model accordingly.

The Memory Model That Changed Everything
Every effective modern algorithm is built on some model of how memory works. The model determines what variables the algorithm tracks, how it updates them after each review, and how it predicts future recall.
The dominant theoretical scaffold behind today's best algorithms is the two-component model of memory, first formalized by Piotr Wozniak and Edward Gorzelanczyk in 1994 and 1995 [10]. The model separates memory into two independent variables. Retrievability is the current probability that a learner can recall a given piece of information right now. It corresponds roughly to what psychologists call retrieval strength. Stability is how slowly retrievability decays over time. It corresponds to storage strength.
After a successful retrieval, retrievability rebounds to nearly 100 percent. Stability increases. Between reviews, retrievability decays while stability stays approximately constant. The crucial insight is the relationship between these two variables at the moment of review. When a learner reviews a card while retrievability is still high, say 95 percent, the stability gain is small. The review was too easy. When a learner reviews at moderate retrievability, say 70 to 80 percent, the stability gain is much larger. The review was effortful but successful.
This is exactly what Robert Bjork calls a "desirable difficulty" [11]. Conditions that slow apparent acquisition, including spacing, interleaving, and effortful retrieval, produce stronger long-term retention. The two-component model gives this principle a mathematical form: schedule reviews when retrievability has dropped enough to make retrieval challenging, but not so far that the learner is likely to fail.
The three-component extension, often called the DSR model for Difficulty, Stability, and Retrievability, adds a per-item difficulty parameter. Some cards are inherently harder than others for a given learner. A card about the Krebs cycle might have high difficulty for a law student but low difficulty for a biochemistry major. The DSR model, used by both SuperMemo's modern algorithms and FSRS, tracks difficulty as a separate variable that influences how much stability increases with each successful review [12].
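A minimal sketch of a DSR-style scheduler shows how the three variables interact. The constants below are illustrative, not FSRS's actual fitted parameters (FSRS-6 has twenty-one of them), and the curve is a simplified power law:

```python
from dataclasses import dataclass

@dataclass
class Card:
    difficulty: float  # 1 (easiest) to 10 (hardest); illustrative scale
    stability: float   # days until retrievability falls to 90 percent

def retrievability(card: Card, elapsed_days: float) -> float:
    # Power-law forgetting curve: equals 0.9 when elapsed_days == stability.
    return (1 + elapsed_days / (9 * card.stability)) ** -1

def review_success(card: Card, elapsed_days: float) -> None:
    r = retrievability(card, elapsed_days)
    # Stability grows more when retrievability is lower (the review was
    # effortful) and less when the card is difficult. The 2.0 is an
    # illustrative constant, not a fitted FSRS parameter.
    card.stability *= 1 + 2.0 * (1 - r) * (11 - card.difficulty) / 10

def next_interval(card: Card, desired_retention: float) -> float:
    # Solve retrievability(t) == desired_retention for t.
    return 9 * card.stability * (1 / desired_retention - 1)

card = Card(difficulty=5.0, stability=10.0)
review_success(card, elapsed_days=20.0)  # reviewed late, at roughly 82% recall
print(round(card.stability, 1), round(next_interval(card, 0.9), 1))
```

The behavior to notice is in review_success: the lower the retrievability at the moment of review, the larger the stability multiplier. That is the desirable-difficulty principle in numeric form.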

Why the Shape of Forgetting Matters
One of the most consequential technical details in algorithm design is the mathematical shape of the forgetting curve. This sounds abstract, but it directly affects every interval the algorithm calculates.
Ebbinghaus published the first forgetting curve in 1885, measuring his own retention of nonsense syllables at delays ranging from twenty minutes to thirty-one days [13]. The curve dropped steeply at first and then gradually leveled off. Murre and Dros replicated this experiment in 2015 with modern methods and confirmed the original data with remarkable fidelity, noting only a small upward bump at the twenty-four-hour mark, likely reflecting overnight sleep consolidation [13].
But what mathematical function best describes this curve? The question matters because the function determines when the algorithm predicts the learner will drop below their target retention. If the function decays too quickly, the algorithm schedules reviews too soon. If it decays too slowly, the learner forgets before the review arrives.
Wixted and Ebbesen demonstrated in 1991 and 1997 that forgetting is better described by a power law than by a simple exponential function [14]. An exponential function loses a constant fraction of its remaining strength in each unit of time. A power-law function decays rapidly at first and then progressively more slowly. When Wixted and Ebbesen analyzed individual subjects' data, not just group averages, the power-law fit consistently outperformed the exponential fit.
This has direct consequences for scheduling. An exponential model would predict that after a certain number of days, the probability of recall drops sharply to near zero. A power-law model predicts a slower, more gradual decline. An algorithm using the wrong curve shape will systematically miscalculate review timing.
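The divergence is easy to see numerically. In the sketch below, both curves are matched to predict 90 percent recall at day ten and then allowed to run; the constants are illustrative, not fitted to any dataset:

```python
import math

STABILITY = 10.0
lam = -math.log(0.9) / STABILITY             # exponential matched to 90% at day 10
for t in (10, 30, 100, 365):
    exp_r = math.exp(-lam * t)               # exponential: constant fractional decay
    pow_r = (1 + t / (9 * STABILITY)) ** -1  # power law: decay slows over time
    print(f"day {t:>3}: exponential {exp_r:.2f}, power law {pow_r:.2f}")
```

In this toy setup the two curves agree at day ten, but by one year the exponential predicts under 3 percent recall while the power law still predicts nearly 20 percent. That gap compounds across every interval the scheduler calculates.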
FSRS through version 5 used a power-law forgetting curve. FSRS-6 introduced an optimizable curve-shape parameter that allows the mathematical form to vary between learners [3]. Some learners' data is better fit by a steeper curve. Others by a flatter one. The algorithm discovers which shape fits each user's actual forgetting pattern, rather than imposing a single mathematical assumption on everyone.

The Benchmark That Settled the Argument
For most of spaced repetition's history, claims about algorithm quality were based on intuition, theory, or small-scale comparisons. The open-spaced-repetition project changed that by building the largest public benchmark of scheduling algorithms ever constructed [6].
The dataset contains 9,999 collections from real users, totaling roughly 350 million review predictions after filtering. The raw data, hosted on Hugging Face as the Anki revlogs 10k dataset, contains about 727 million reviews from 10,000 users [15]. Before benchmarking, same-day reviews, manually rescheduled reviews, filtered-deck reviews, and statistical outliers are removed.
The benchmark uses time-series cross-validation. Each user's reviews are split chronologically into five parts. The algorithm trains on growing prefixes and is evaluated on the next unseen part each time. This prevents leakage from the future: the model is never trained on reviews that occur after the ones it is evaluated on.
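The scheme is the standard growing-prefix pattern. Here is a sketch using scikit-learn's TimeSeriesSplit, with a stand-in array in place of a real review log:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

reviews = np.arange(600).reshape(-1, 1)  # stand-in for a time-sorted review log
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(reviews):
    print(f"train on reviews 0..{train_idx[-1]}, "
          f"evaluate on {test_idx[0]}..{test_idx[-1]}")
```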
The headline results are striking. FSRS-6 with per-user optimization and recency weighting achieves a mean log loss of approximately 0.344. It has 99.6 percent superiority over a probability-converted SM-2, meaning that for 99.6 percent of users, FSRS-6 produces a lower log loss [3]. For a detailed comparison of how these two algorithms differ in their mechanics, see FSRS vs SM-2: Spaced Repetition Algorithm.
But context matters. SM-2 was never designed to predict probabilities. The benchmark team had to add extra formulas to convert SM-2's intervals into recall probabilities. As the benchmark authors themselves note, there is no way to have a truly fair comparison between FSRS and SM-2 because the two algorithms were designed for fundamentally different purposes [3].
The comparison with SM-17, SuperMemo's modern algorithm that does use the two-component model, is more informative but also more limited. Only 18 collections were available for that comparison, and FSRS-6 showed 83.3 percent superiority [16]. The small sample size means this result should be treated as suggestive rather than conclusive.

The Ease Hell Problem and Why Lapse Management Matters
One of the clearest demonstrations of why algorithm design matters comes from a pathology in SM-2 that the community calls "ease hell." Understanding this problem illustrates why lapse management is a critical property of any effective algorithm.
In SM-2, every card has an ease factor that starts at 2.5 and controls how quickly review intervals grow. Each time a learner presses "Again" to indicate a forgotten card, the ease factor drops by 20 percentage points. Each time they press "Hard," it drops by 15 [17]. But only pressing "Easy" raises the ease factor, and by only 15 points. Pressing "Good," which is what most learners press most of the time, leaves the ease factor almost unchanged.
The result is a one-way ratchet. After roughly six lapses, a card's ease factor hits the floor of 1.30 and stays there permanently [18]. From that point on, the card is reviewed at minimum intervals regardless of how many times the learner subsequently recalls it correctly. A card that was difficult three months ago but has since been thoroughly learned remains trapped in frequent review forever.
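The ratchet is simple enough to simulate. A sketch assuming Anki-style SM-2 adjustments, with the ease factor expressed as a multiplier rather than a percentage:

```python
# Ease-factor deltas per grade, assuming Anki-style SM-2 behavior.
EASE_DELTA = {"again": -0.20, "hard": -0.15, "good": 0.00, "easy": +0.15}
EASE_FLOOR = 1.30

def update_ease(ease: float, grade: str) -> float:
    return max(EASE_FLOOR, ease + EASE_DELTA[grade])

ease = 2.50  # every new card starts here
for lapse in range(1, 7):
    ease = update_ease(ease, "again")
    print(f"lapse {lapse}: ease = {ease:.2f}")
# After six lapses the card sits at the 1.30 floor. Pressing "good" on
# every subsequent review leaves it there forever:
print(update_ease(ease, "good"))  # 1.30
```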
This is a structural defect, not a user error. The ease factor is adjusted incrementally based on the most recent review, with no mechanism for considering the full review history. DSR-based algorithms like FSRS avoid this problem entirely because difficulty is re-derived from the entire review history at every optimization step [19]. A card that was difficult initially but has been recalled correctly ten times in a row will have its difficulty parameter reduced accordingly.
What does this mean for learners? If a medical student using SM-2 struggles with a set of pharmacology cards during their first month, those cards may be trapped in ease hell for the remaining three years of study. Hundreds of unnecessary reviews per month, each consuming time that could be spent on new material or on cards that genuinely need reinforcement.

Reconsolidation and the Science of Optimal Timing
The question of when to schedule a review is not just a mathematical optimization problem. It has a biological basis that most algorithm designers have not yet fully exploited.
In 2000, Karim Nader, Glenn Schafe, and Joseph LeDoux published a paper in Nature that shook the foundations of memory science [20]. They demonstrated that consolidated memories, previously thought to be stable once stored, return to a fragile, protein-synthesis-dependent state when reactivated. Retrieving a memory does not merely read it passively. It destabilizes the memory trace and requires active reconsolidation to re-store it.
This has profound implications for spaced repetition. Each retrieval event is not a neutral observation. It is an opportunity to rewrite and strengthen the memory trace. But it is also a moment of vulnerability. If the reconsolidation process is disrupted, the memory can actually be weakened.
The practical question for algorithms is: when does retrieval produce the maximum reconsolidation-driven strengthening? The desirable difficulties framework suggests the answer is when retrieval is effortful but successful [11]. Too easy and the reconsolidation produces minimal change. Too hard and the retrieval fails entirely, providing no reconsolidation opportunity at all.
The DSR model's finding that stability gains are largest when retrievability is moderate is a quantitative restatement of this principle. An algorithm that schedules reviews at the optimal retrievability level is, in effect, timing each review to maximize the reconsolidation-driven strengthening of the memory trace.
No published spaced repetition algorithm explicitly models reconsolidation phases. This is a clear gap between what neuroscience knows and what algorithms do. Future algorithms that incorporate reconsolidation timing could potentially produce even larger gains than the current DSR-based approaches.

What the Benchmark Cannot Tell You
The open-spaced-repetition benchmark is the most rigorous evaluation of scheduling algorithms ever conducted. But it has important limitations that anyone evaluating algorithms should understand.
First, the benchmark measures prediction quality, not learning outcomes. Log loss tells you how accurately the algorithm predicts the probability of recall. It does not directly tell you how much a learner will remember after six months of using one algorithm versus another. The connection between better prediction and better learning is plausible and likely, but it has not been demonstrated in a preregistered randomized controlled trial comparing FSRS head-to-head against SM-2 with matched retention targets [3].
Second, the commonly cited claim that FSRS reduces reviews by 20 to 30 percent compared to SM-2 for the same retention comes from simulation, not from a randomized experiment with real learners [21]. The simulation uses FSRS's own memory model to generate synthetic review histories and then counts how many reviews each algorithm would have required. This is informative but circular: the simulation assumes FSRS's model of memory is correct, which is the very thing being evaluated.
Third, the dataset overrepresents power users. Medical students, language learners, and software engineers who voluntarily export their review logs are not representative of casual learners. Algorithm rankings on this dataset may not generalize identically to someone using flashcards for twenty minutes a week.
Fourth, proprietary algorithms cannot be benchmarked. SuperMemo's modern algorithms, SM-17 and SM-18, are closed source. The 18-collection comparison with SM-17 is the best available evidence, but it is far less robust than the 9,999-collection comparison with SM-2 [16].
Fifth, AUC scores in this domain are intrinsically modest. All spaced repetition algorithms score in the 0.6 to 0.75 range on AUC [3]. This reflects the fundamental noisiness of single-card recall outcomes, not a failure of any particular algorithm. A learner might forget a well-known card because they were distracted, tired, or simply unlucky with the retrieval cue.
These limitations do not invalidate the benchmark. They contextualize it. The benchmark provides strong evidence that DSR-based algorithms predict recall probabilities more accurately than fixed-multiplier schemes. Whether that prediction advantage translates into proportionally better learning outcomes remains an open empirical question.

The Frontier: Content-Aware and Neural Approaches
The next generation of spaced repetition algorithms is moving beyond scheduling based solely on interval lengths and grades. Several research lines are converging on algorithms that understand something about the content of what is being learned.
Shu, Balepur, Feng, and Boyd-Graber published KAR3L in 2024, an algorithm that integrates text embeddings of card content into the recall predictor [22]. The key insight is that semantically similar cards share underlying knowledge state. If a learner forgets a card about mitochondrial function, cards about cellular respiration and ATP synthesis are also more likely to be forgotten. KAR3L uses BERT embeddings to capture these relationships. In a study with 543 learners and over 123,000 study logs, it outperformed FSRS v4 on both AUC and calibration error.
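The underlying idea can be sketched in a few lines, though this is a toy illustration and not KAR3L's actual architecture, which trains a full predictor on top of the embeddings:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def adjusted_recall(base_p: float, lapsed_emb: np.ndarray,
                    card_emb: np.ndarray, strength: float = 0.1) -> float:
    # When a related card lapses, discount this card's predicted recall
    # in proportion to semantic similarity. `strength` is illustrative.
    sim = max(0.0, cosine(lapsed_emb, card_emb))
    return base_p * (1 - strength * sim)

mito = np.array([0.9, 0.1, 0.3])  # stand-in embeddings; a real system
atp = np.array([0.8, 0.2, 0.4])   # would use BERT or similar vectors
print(adjusted_recall(0.92, mito, atp))
```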
In August 2025, Zhao published LECTOR, which uses large language models to detect semantic similarity between flashcards and adjust scheduling accordingly [23]. When two cards are semantically close enough to cause confusion, LECTOR spaces them apart to reduce interference. The reported results are promising: 90.2 percent success rate versus 88.4 percent for the best baseline. But the results come from simulations with 100 synthetic learners over 100 days, not from deployment with real users.
Xiao and Wang took a different approach with DRL-SRS in 2024, framing spaced repetition as a reinforcement learning problem [24]. Instead of deriving scheduling rules from a memory model, they trained a deep Q-network with LSTM components to learn optimal review intervals directly from data. The agent discovers its own scheduling policy by maximizing long-term retention as a reward signal.
The benchmark provides an interesting perspective on neural approaches. RWKV, a general-purpose recurrent neural network, matches or exceeds FSRS-6 on every prediction metric [3]. But it comes with significant practical drawbacks. It requires substantially more computation. It can produce non-monotonic predictions, sometimes claiming a card is more likely to be recalled after a longer delay. And it lacks the smooth, interpretable forgetting curve that learners expect when they examine their review statistics.
This tension between accuracy and interpretability is likely to define the next phase of algorithm development. The ideal algorithm would combine the prediction power of neural approaches with the interpretability and deployability of DSR-style models.

What This Means for Learners
After examining the science, the benchmarks, and the neuroscience, what practical conclusions can a learner draw?
The most important conclusion is also the simplest. Any spaced repetition algorithm is dramatically better than no spaced repetition. The gap between distributed and massed practice is measured in standard deviations. The gap between algorithms is measured in percentage points of log loss. Dunlosky and colleagues rated distributed practice and practice testing as the two most effective learning strategies in their landmark 2013 review of ten popular study techniques [25]. Both received the highest "high utility" rating. No other technique came close.
The second conclusion is that at scale, the algorithm gap matters. A learner maintaining a collection of 5,000 cards over multiple years will accumulate tens of thousands of reviews. If one algorithm requires 20 percent fewer reviews to maintain the same retention, that translates into months of saved study time over a multi-year period.
The third conclusion involves desired retention, the parameter that controls how high the learner wants their recall probability to be. FSRS exposes this as a configurable setting, typically between 0.80 and 0.97, with 0.90 as the recommended default [26]. The relationship between desired retention and review workload is nonlinear. Moving from 0.85 to 0.90 increases workload moderately. Moving from 0.90 to 0.95 roughly doubles it. The Cepeda 2008 ridgeline study, which mapped optimal inter-study intervals across different retention periods, showed that the ideal spacing depends heavily on how long the learner needs to remember the material [27].
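The nonlinearity follows directly from the power-law forgetting curve. A sketch using the simplified curve from earlier, holding stability fixed and treating review frequency as the inverse of the interval:

```python
def interval(retention: float, stability: float = 1.0) -> float:
    # Days until the simplified power-law curve drops to `retention`.
    return 9 * stability * (1 / retention - 1)

base = interval(0.90)
for r in (0.80, 0.85, 0.90, 0.95, 0.97):
    print(f"retention {r:.2f}: relative workload {base / interval(r):.2f}x")
```

Under these assumptions, moving from 0.90 to 0.95 slightly more than doubles the review rate, consistent with the rough doubling described above.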
The fourth conclusion is about training data. Algorithms like FSRS improve as they accumulate more review data from each learner. With fewer than about 400 reviews, the default parameters trained on the global dataset perform about as well as individually optimized ones [28]. After 1,000 reviews, per-user optimization begins to show clear advantages. After 5,000 reviews, the advantage is substantial.

The Debate That Is Not Over
It would be misleading to present the question of algorithm effectiveness as settled. Several important disagreements remain active.
The most prominent is between Piotr Wozniak, the creator of SuperMemo and the two-component model, and the FSRS development team. Wozniak has argued that FSRS reproduces ideas already present in SuperMemo's algorithms and that the benchmark underestimates the quality of SM-17 and SM-18 because the comparison uses only 18 collections. The FSRS team has responded with specific cases where FSRS-6 produces better-calibrated predictions than published SM-17 benchmarks [16]. The disagreement is partly methodological, revolving around what the right universal metric for algorithm comparison should be, and partly about access, since only one side releases its code publicly.
There is also an unresolved theoretical question about the shape of forgetting. Wixted and Ebbesen's power-law finding has been challenged by researchers who argue that a sum of exponentials with heterogeneous decay rates produces power-law-looking aggregate curves [29]. If individual memories each decay exponentially but at different rates, the average across many memories will look like a power law even though no single memory follows one. This is not just an academic distinction. It affects whether the optimal forgetting curve is a property of the individual learner, of the specific card, or of the population.
A broader randomized trial directly comparing learning outcomes, not just prediction accuracy, between FSRS and SM-2 at matched retention targets would substantially advance this debate. As of 2026, that trial has not been published. The evidence is strong that FSRS predicts recall probabilities more accurately. The evidence that this translates proportionally into better learning outcomes, while plausible, remains inferential.

Conclusion
The question of what makes a spaced repetition algorithm effective turns out to have no single answer. Effectiveness is a collection of properties: calibrated predictions, correct discrimination, individual adaptivity, review efficiency, graceful lapse management, robustness to real-world study patterns, and a sensible balance between computational complexity and practical deployability.
The science underneath these properties runs deep. Long-term potentiation gives the spacing effect a cellular substrate. Reconsolidation gives each retrieval event a biological purpose. Sleep-dependent consolidation explains why intervals shorter than a day produce systematically weaker learning. And the mathematics of forgetting, whether power-law or exponential, determines the shape of every interval the algorithm calculates.
The open-spaced-repetition benchmark has provided the first rigorous, large-scale comparison of algorithms, showing that DSR-based approaches like FSRS predict recall probabilities more accurately than fixed-multiplier schemes like SM-2. But the benchmark also has limitations. It measures prediction, not learning. It tests on power users, not casual learners. And it cannot evaluate proprietary algorithms whose code is not released.
The frontier is moving toward content-aware scheduling, where algorithms understand not just when you reviewed a card but what the card contains and how it relates to your other cards. Neural approaches already match hand-crafted algorithms on prediction metrics. The challenge is building systems that combine neural prediction accuracy with the interpretability and offline deployability that learners need.
For the learner making a practical decision today, the evidence points clearly in one direction. Use a spaced repetition system. Choose one that adapts to your individual patterns. Set a desired retention that balances your workload with your goals. And know that the algorithm behind your flashcards has been shaped by a century of research, from Ebbinghaus counting nonsense syllables in 1885 to machine learning models trained on seven hundred million reviews in 2022.
The science is not finished. But the tools it has produced are already remarkable.

Frequently Asked Questions
What is the most accurate spaced repetition algorithm in 2026?
Based on the largest public benchmark covering approximately 350 million reviews from 9,999 users, FSRS-6 with per-user optimization currently achieves the lowest prediction error. It outperforms the probability-converted SM-2 in 99.6 percent of tested collections. However, proprietary algorithms like SM-17 and SM-18 cannot be directly compared because their code is not publicly available.
How does a spaced repetition algorithm predict when I will forget?
Most modern algorithms track two variables for each card. Stability measures how slowly your memory decays. Retrievability measures the current probability that you can recall the card right now. The algorithm uses a mathematical forgetting curve to estimate when retrievability will drop below your target retention, typically 90 percent, and schedules the review just before that point.
Does it matter which spaced repetition algorithm I use?
The difference between using any spaced repetition system and using none is enormous, with meta-analyses showing large effect sizes. The difference between algorithms is smaller but still meaningful at scale. For a learner maintaining thousands of cards over years, an algorithm that requires 20 percent fewer reviews saves hundreds of hours. For casual short-term study, the algorithm choice matters much less.
What is ease hell in spaced repetition?
Ease hell is a pathology specific to the SM-2 algorithm where cards become permanently trapped in frequent review. Each forgotten card loses ease factor points, but correct recalls barely restore them. After roughly six failures, the card hits a minimum ease floor and remains there indefinitely, even if the learner subsequently recalls it correctly dozens of times. Modern DSR-based algorithms avoid this by recalculating difficulty from the full review history.
How many reviews does a spaced repetition algorithm need before it can personalize to me?
Algorithms like FSRS begin with default parameters trained on hundreds of millions of reviews from thousands of users. These defaults are already effective. Per-user optimization starts showing clear advantages after approximately 400 to 1,000 reviews. After several thousand reviews, the personalized model significantly outperforms the defaults for most learners.