Introduction

In 1987, a twenty-three-year-old biology student in Poznań, Poland, wrote a small program in Turbo Pascal that would quietly shape how tens of millions of people study. The program contained a single formula. It told the user when to review a piece of information based on how easily they had recalled it the last time. That formula — known as Algorithm SM-2 — became the most widely deployed scheduling equation in the history of learning technology [1]. For thirty-five years, almost every open-source flashcard system on earth used some version of it. And for thirty-five years, the science of memory moved on while the formula stayed frozen. New models of how the brain stores and retrieves information were published. New mathematical frameworks appeared. But the algorithm that actually scheduled billions of daily reviews remained the one a graduate student had scribbled in a master's thesis.


Then, in 2022, a Chinese undergraduate published a paper at one of the world's top data-mining conferences. His algorithm — FSRS, the Free Spaced Repetition Scheduler — used machine learning to fit a personalized model of memory to each individual learner. Within two years it had replaced SM-2 as the default scheduler in the world's largest open-source review platform. Benchmarks on roughly seven hundred million reviews showed it outperformed SM-2 for more than ninety-nine percent of users tested [2].

This is the story of how that happened. Not just the engineering. The science underneath.

The Forgetting Curve That Started Everything

The story begins a century before any computer was involved. In 1885, a German psychologist named Hermann Ebbinghaus published a slim monograph called Über das Gedächtnis — On Memory. It was one of the most unusual experiments in the history of science, because Ebbinghaus was both the researcher and the only subject [3].

To eliminate the influence of meaning on memory, he invented nonsense syllables — roughly 2,300 consonant-vowel-consonant combinations like ZOF, BOK, and DAX. He memorized lists of them until he could recite each list twice without error, then tested himself again after delays of twenty minutes, one hour, nine hours, one day, two days, six days, and thirty-one days. His measure was elegant: how much time did relearning save compared to original learning? If a list took ten minutes to learn originally and only four minutes to relearn, the savings was sixty percent.
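Expressed as a quick calculation, the savings measure is a one-line formula. The Python sketch below uses the illustrative numbers from the paragraph above, not Ebbinghaus's raw data:

```python
def savings(original_minutes: float, relearning_minutes: float) -> float:
    """Ebbinghaus's savings score: the fraction of learning time saved on relearning."""
    return (original_minutes - relearning_minutes) / original_minutes

# Ten minutes to learn originally, four minutes to relearn: 60 percent savings.
print(f"{savings(10, 4):.0%}")
```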

The result was a curve. It dropped sharply in the first hour — roughly forty-two percent of material was lost after twenty minutes, fifty-six percent after one hour. Then it flattened. After a day, about sixty-six percent was gone. After a month, seventy-nine percent [4].

That curve — steep at first, then gradually leveling — would become the foundation of every scheduling algorithm that followed. In 2015, Jaap Murre and Joeri Dros at the University of Amsterdam replicated Ebbinghaus's experiment with modern methods. One subject memorized seventy hours of nonsense syllables following the original protocol. The resulting curve matched Ebbinghaus's data with striking fidelity, with one interesting exception: a slight upward bump at the twenty-four-hour mark, likely reflecting sleep-dependent memory consolidation [4].

But Ebbinghaus measured forgetting. He did not study spaced repetition directly. The question of when to review — and whether distributing reviews over time actually beats cramming — took another half-century to answer rigorously.


From Iowa Schoolchildren to Expanding Intervals

In 1939, Herbert Spitzer tested 3,605 sixth-graders across Iowa on their retention of factual articles about peanuts and bamboo [5]. He manipulated the timing of recall tests across nine conditions. The results were unambiguous: students who took a test soon after reading, then again at expanding intervals, retained dramatically more than those who simply reread the material. Spitzer had demonstrated the spacing effect in a real classroom at scale — but his paper was largely ignored for thirty years.

The theoretical machinery came later. In 1978, Thomas Landauer and Robert Bjork proposed that expanding intervals — short gaps at first, growing longer as retrieval succeeds — might be the optimal schedule for learning new information [6]. Their "expanding retrieval practice" schedule influenced nearly every digital scheduler that followed.

Robert Bjork and Elizabeth Bjork then introduced a framework that would prove essential for understanding why spacing works. Their "New Theory of Disuse" distinguished between two independent properties of every memory: storage strength — how deeply encoded a memory is, which only increases with use — and retrieval strength — how easily the memory can be accessed right now, which fluctuates with time and context [7]. The critical insight: retrieving a memory when retrieval strength is low produces a disproportionately large gain in storage strength. Struggling to remember something and succeeding makes the memory far more durable than recalling it effortlessly.

This dissociation — storage versus retrieval — would later be independently formalized by the creator of SM-2 as stability versus retrievability. Two different researchers, working in different decades, converged on the same fundamental architecture of human memory.

The quantitative capstone came from Nicholas Cepeda and colleagues. Their 2006 meta-analysis in Psychological Bulletin synthesized 839 effect-size contrasts from 317 experiments across 184 articles, confirming that distributed practice consistently outperforms massed practice for long-term retention [8]. A follow-up study in 2008 tested 1,350 participants and found that the optimal gap between study sessions scales with the desired retention interval — roughly ten to twenty percent of the time you want to remember the material [9]. Want to remember something for a year? Space your reviews about a month apart. Want to remember for a week? A day or two is enough.

The science was settled. Spacing works. The question became: can we automate it?


The Birth of the First Algorithm

Sebastian Leitner, a German science journalist, provided the first practical system in 1972 with his book So lernt man lernen — How to Learn to Learn. His method was mechanical: physical flashcards move through a series of numbered boxes. Get a card right and it advances to the next box, which is reviewed less frequently. Get it wrong and it returns to box one [10]. The intervals attached to each box were fixed — perhaps daily for box one, every three days for box two, weekly for box three. Leitner's system was not truly an algorithm. It had no per-item parameters, no difficulty estimation, no adaptive scheduling. But it encoded the core principle: successful recall should push an item toward less frequent review, and failure should pull it back toward more frequent review.
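In code, the whole system fits in a few lines. The sketch below is a minimal Python rendering, assuming three boxes with intervals of one, three, and seven days; the specific day counts are an assumption for the sketch, since Leitner's physical boxes worked somewhat differently:

```python
from dataclasses import dataclass

# Illustrative intervals per box; the specific day counts are assumptions for this sketch.
BOX_INTERVALS_DAYS = {1: 1, 2: 3, 3: 7}

@dataclass
class Card:
    front: str
    back: str
    box: int = 1  # every new card starts in box one

def leitner_review(card: Card, recalled: bool) -> int:
    """Move the card between boxes and return the number of days until its next review."""
    if recalled:
        card.box = min(card.box + 1, max(BOX_INTERVALS_DAYS))  # advance, but never past the last box
    else:
        card.box = 1  # a failure sends the card all the way back to box one
    return BOX_INTERVALS_DAYS[card.box]
```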

The leap from physical boxes to computational scheduling happened in a university dormitory in Poznań. Piotr Woźniak, a molecular biology student at Adam Mickiewicz University, had calculated in 1982 that at his prevailing rate of forgetting, it would take roughly one hundred twenty years to master the English vocabulary he needed [11]. Desperate for a solution, he ran two experiments in 1985 on his own learning. The first tracked how different spacing patterns affected retention. The second attempted to approximate the optimal intervals between reviews.

On July 31, 1985, Woźniak began applying the results to his own study — the date he later called the birthday of computational spaced repetition. The system was entirely paper-based: pages of forty question-answer pairs reviewed at intervals of roughly one, seven, sixteen, and thirty-five days. He called it SM-0 [12].

Two years later, Woźniak acquired an IBM PC and wrote the first computerized version in Turbo Pascal 3.0 over sixteen evenings in late 1987. The scheduling logic — Algorithm SM-2 — introduced three innovations that became canonical.

First, each item carried its own ease factor, initialized at 2.5. This number controlled how quickly intervals grew. A card with a high ease factor would be reviewed less and less frequently. A card with a low ease factor would stay in frequent rotation.

Second, the user graded each recall on a scale from zero to five — zero meaning total blackout, five meaning perfect recall. The algorithm used this grade to adjust the ease factor after every review. The update equation was simple:

EF' = EF + (0.1 - (5 - q) × (0.08 + (5 - q) × 0.02))

where q is the grade. For a grade of four, the ease factor stays unchanged. For a five, it increases by 0.1. For a three, it decreases by 0.14 [1].

Third, intervals followed a recursive formula: the first review came after one day, the second after six days, and every subsequent interval was the previous interval multiplied by the ease factor. A card with an ease factor of 2.5 would follow a schedule roughly like: 1 day, 6 days, 15 days, 38 days, 94 days.
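Put together, the published algorithm is short enough to sketch in a few lines of Python. This follows Woźniak's description, in which a grade below three restarts the interval sequence without changing the ease factor; real implementations vary in the details:

```python
def sm2_review(quality: int, repetition: int, ease_factor: float, interval: int):
    """One SM-2 update. quality is the 0-5 grade; returns (repetition, ease_factor, interval_days)."""
    if quality < 3:
        return 0, ease_factor, 1  # forgotten: start the sequence over tomorrow

    # Ease factor update (the formula above), clamped at the 1.3 floor.
    ease_factor += 0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02)
    ease_factor = max(ease_factor, 1.3)

    repetition += 1
    if repetition == 1:
        interval = 1
    elif repetition == 2:
        interval = 6
    else:
        interval = round(interval * ease_factor)
    return repetition, ease_factor, interval
```

Starting from the default ease factor of 2.5 and grading every review a four reproduces a schedule very close to the one quoted above, give or take rounding.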

1885: Ebbinghaus publishes the forgetting curve
1939: Spitzer tests the spacing effect on 3,605 Iowa students
1972: Leitner publishes the cardboard-box system
1978: Landauer and Bjork propose expanding retrieval practice
1985: Woźniak creates SM-0, the first paper-based scheduler
1987: SM-2 becomes the first computerized scheduling algorithm
1992: Bjork and Bjork publish the New Theory of Disuse
2006: The Cepeda meta-analysis confirms the spacing effect across 317 experiments

In his 1990 master's thesis, Woźniak reported the results: during the first year of using SM-2, he memorized 10,255 English vocabulary items at forty-one minutes per day, with overall retention of 92 percent when excluding items still in their first three weeks of learning [13].

The decision that sealed SM-2's dominance came in 1990, when Woźniak placed the full algorithm description in the public domain as an appendix to his thesis. Every subsequent algorithm he created remained proprietary. SM-2 was the last one anyone could freely use.

Thirty Years of Invisible Progress

What happened next is one of the strangest stories in the history of applied science. The creator of SM-2 spent the next three decades building better algorithms — dramatically better algorithms — and almost none of them reached the outside world.

Algorithm SM-4 appeared in 1989, the first version in which the function of optimal intervals itself, stored as a matrix, adapted to the user's performance rather than being fixed in advance. SM-5, released the same year, replaced the matrix of optimal intervals with a matrix of optimal factors — the ratio between successive intervals rather than the intervals themselves. This seemingly small change eliminated a mathematical inconsistency in SM-4 and roughly doubled acquisition speed [14].

SM-6 in 1991 was the first to use regression on actual forgetting-curve data. Instead of adjusting intervals through trial and error, it plotted how recall probability decayed over time for each difficulty category and derived optimal intervals directly from the curves [15].

SM-8 in 1995 was the first algorithm designed entirely from previously collected data — millions of repetitions accumulated by users of earlier versions. SM-11 in 2002 added robustness against delayed or early reviews. SM-15 in 2011 extended the system to handle review delays of up to fifteen years [16].

But the real breakthrough came in 2016 with SM-17. This was the first algorithm built explicitly on what its creator called the two-component model of long-term memory [17]. The model proposed that every memory has two measurable properties: stability — how long it takes for the probability of recall to drop from 100 percent to 90 percent — and retrievability — the current probability that the memory can be accessed right now. These two numbers, combined with a third variable representing item difficulty, formed the DSR model (Difficulty, Stability, Retrievability) that would later become the theoretical core of FSRS as well.

SM-17 computed a three-dimensional stability-increase matrix — a function that predicted how much stability would grow when a card was reviewed at a given level of retrievability and difficulty. The mathematics were sophisticated: hill-climbing optimization on a 3D surface fitted to millions of data points [17].

SM-18 in 2019 refined difficulty estimation. SM-19 in 2024 improved post-lapse stability calculations. And SM-20, announced in 2026, is described as the first version where all parameters are computed by machine learning rather than hand-tuned heuristics — a convergence, four years later, with the same methodology that FSRS had introduced from outside [15].

Every one of these algorithms was proprietary. The source code was never published. The mathematical descriptions in the public wiki were detailed enough to understand the concepts but incomplete enough to prevent faithful reimplementation. And so the open-source ecosystem remained stuck on SM-2.

Algorithm | Year | Key Innovation | Published?
SM-0 | 1985 | Paper-based fixed interval schedule | Yes (described in thesis)
SM-2 | 1987 | Per-item ease factor with recursive intervals | Yes (full algorithm public)
SM-5 | 1989 | Matrix of optimal factors, fast convergence | Partial (concept only)
SM-6 | 1991 | Regression on forgetting curves | Partial (concept only)
SM-8 | 1995 | Entirely data-driven design | Proprietary
SM-17 | 2016 | Two-component memory model (DSR) | Concept public, code proprietary
SM-18 | 2019 | Improved difficulty estimation | Proprietary
SM-20 | 2026 | All parameters via machine learning | Proprietary

The Ease Hell Problem

The consequences of freezing on SM-2 were not just theoretical. A well-documented pathological behavior emerged over years of real-world use.

Recall how the ease factor gets updated. In the four-button variant of SM-2 used by most open-source implementations, every time a user presses "Again" — indicating they forgot the answer — the ease factor drops by 0.20. Press "Hard" and it drops by 0.15. But the only way to raise it back is to press "Easy," which increases it by just 0.15. Most learners almost never press "Easy" because it feels dishonest: they know the answer, but not effortlessly [18].

The result is a one-way ratchet. Over months and years, the ease factor of difficult cards drifts steadily downward until it hits the hard floor of 1.3 — the minimum allowed value. Once there, intervals barely grow. A card that should have been scheduled thirty days out gets scheduled for only eight. The user reviews it again, gets it right, but the interval only inches up to ten. The card is trapped.
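The drift is easy to see in a toy simulation. The sketch below assumes the four-button adjustments described above and a learner who presses "Easy" only rarely; the probabilities are invented for illustration, not drawn from real review logs:

```python
import random

def simulate_ease_drift(reviews: int = 200, seed: int = 0) -> float:
    """Toy model of ease drift for one difficult card under Anki-style adjustments."""
    random.seed(seed)
    ease = 2.5
    for _ in range(reviews):
        roll = random.random()
        if roll < 0.10:        # occasional lapse: "Again"
            ease -= 0.20
        elif roll < 0.30:      # effortful recall: "Hard"
            ease -= 0.15
        elif roll < 0.32:      # rare "Easy"
            ease += 0.15
        # otherwise "Good": the ease factor is left unchanged
        ease = max(ease, 1.3)  # the hard floor
    return ease

print(simulate_ease_drift())  # hits 1.3 within a few dozen reviews and never recovers
```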

Users called this ease hell. In mature collections, distributions of ease factors across thousands of cards routinely showed a massive spike at exactly 1.3 — meaning a large fraction of all cards had collapsed to the minimum [18].

The community response was creative but ultimately a collection of patches. Some users manually reset all ease factors to 2.5 using database queries. Others installed add-ons that boosted ease after streaks of correct answers. Some adopted the "Low Key" approach — locking all ease factors permanently at 2.5, effectively disabling difficulty tracking entirely.

None of these solutions addressed the fundamental problem: SM-2 conflates difficulty, stability, and retrievability into a single number. The ease factor is simultaneously a measure of how hard the card is, how stable the memory is, and how the memory responds to review timing. Three separate phenomena, one variable. The proprietary algorithms had solved this by 2016 with the DSR model. The open-source world had to wait until 2022.


The Outsiders Who Tried

During the three decades of SM-2 dominance, several research groups independently attacked the scheduling problem from different angles.

Philip Pavlik and John Anderson at Carnegie Mellon University used the ACT-R cognitive architecture to model memory as a sum of power-function decays from each prior practice event [19]. Their model made a counterintuitive prediction: under certain conditions, contracting intervals — getting shorter over time, not longer — could outperform expanding ones. The prediction was empirically confirmed in some laboratory settings but never deployed at scale.

Burr Settles and Ben Meeder at Duolingo published Half-Life Regression at the 2016 meeting of the Association for Computational Linguistics [20]. Their model predicted each item's exponential half-life as a log-linear function of past performance and lexeme features. Deployed in production at Duolingo, it reduced prediction error by forty-five percent relative to baseline schedulers and improved daily user engagement by twelve percent. But HLR was designed for Duolingo's specific use case — short vocabulary exercises in a gamified app — and was never released as a general-purpose scheduler.

Behzad Tabibian and colleagues published MEMORIZE in the Proceedings of the National Academy of Sciences in 2019, casting spaced repetition as a stochastic optimal-control problem on marked temporal point processes [21]. The mathematical framework was elegant, but its computational demands and lack of open-source tooling limited adoption.

Ahmed Fasih created Ebisu in 2017, a Bayesian model that maintained a Beta distribution over each item's recall probability and updated it analytically from quiz outcomes [22]. Ebisu was computationally cheap and mathematically principled, but it used only a two-component model and lacked the difficulty parameter needed for large heterogeneous collections.

Each of these approaches produced genuine scientific insights. None displaced SM-2 in practice. The reason was not intellectual but sociological: the open-source flashcard ecosystem was enormous, entrenched, and its users were accustomed to SM-2's behavior. Switching algorithms meant potentially disrupting the scheduling of millions of existing cards.


A KDD Paper That Changed the Game

The breakthrough came from an unexpected direction. Jarrett Ye was an undergraduate researcher at MaiMemo, a Chinese vocabulary-learning company. He noticed that the company's internal scheduling model — called DHP (Difficulty, Half-life, Probability) — was essentially a variant of the DSR model that had been described in the proprietary scheduling literature. More importantly, he realized the model's parameters could be trained directly with gradient descent on time-series review data [23].

In August 2022, Ye published a paper at ACM SIGKDD — the International Conference on Knowledge Discovery and Data Mining, one of the top venues in machine learning. The paper proposed what he called SSP-MMC: a Stochastic Shortest Path algorithm to Minimize Memorization Cost. It modeled student memory as a Markov decision process with stability, difficulty, and retrievability as state variables, and used value iteration to find the scheduling policy that minimized total reviews needed to push stability past a target.

On September 18, 2022, Ye released a working implementation as a community add-on. A commenter on Reddit dismissed it as academic work nobody would actually use. That commenter — later known by the pseudonym Expertium — went on to become one of the project's most prolific contributors and its primary public documentarian [24].

FSRS evolved through rapid iterations. Version 3, released in October 2022, was the first widely used release. It modeled forgetting with a pure exponential curve. Version 4, in July 2023, replaced the exponential with a power-function curve — motivated by the same Wickelgren power law that memory scientists had documented decades earlier [25]. Version 4.5 refined the curve to a specific form where retrievability at exactly one stability unit equals 90 percent by construction. Version 5 added handling for same-day reviews. Version 6, released in 2025, introduced a twenty-first trainable parameter that personalized the forgetting curve's decay rate for each individual user [26].

In November 2023, FSRS was integrated natively as an opt-in scheduler in the world's largest open-source flashcard platform — the first time in seventeen years that a non-SM-2 algorithm was offered. By 2025, it had become the default [27].


Three Numbers Instead of One

The mathematical heart of FSRS is the same three-component model that the proprietary algorithms had been using since 2016. But FSRS implements it differently — with fewer parameters, open-source code, and gradient-based optimization.

Every card in FSRS carries three numbers. The first is Difficulty — a value between 1 and 10 representing how inherently hard the card is for the specific learner. Unlike SM-2's ease factor, Difficulty in FSRS uses mean reversion: after consecutive correct answers, it gradually returns toward a baseline rather than staying permanently damaged. This single design choice eliminates ease hell architecturally [28].
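The reversion step itself is simple. A minimal sketch, with an invented baseline and weight standing in for the trainable parameters FSRS actually fits:

```python
def revert_difficulty(difficulty: float, baseline: float = 5.0, weight: float = 0.05) -> float:
    """Nudge difficulty back toward a baseline after each review (illustrative constants).

    In FSRS both the baseline and the reversion weight are trainable parameters,
    so a card can be hard without being condemned to stay hard forever.
    """
    new_d = (1 - weight) * difficulty + weight * baseline
    return min(max(new_d, 1.0), 10.0)  # difficulty stays in the 1-10 range
```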

The second is Stability — measured in days, defined as the time required for retrievability to fall from 100 percent to 90 percent. A card with stability of 30 means the learner has a 90 percent chance of recalling it after 30 days. After 60 days, the probability is lower. After 15 days, higher. Stability grows with each successful review and resets to a smaller value after failure.

The third is Retrievability — the predicted probability of correct recall right now, computed as a function of elapsed time since the last review and the card's current stability. The forgetting curve in FSRS-6 follows a power function:

R(t, S) = (1 + t × F / S) ^ C

where t is elapsed time in days, S is stability, C is a trainable decay exponent, and F is derived from C so that retrievability equals exactly 0.9 when t equals S [26].
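A small sketch makes the construction concrete; the names and default values are illustrative rather than the identifiers used in the FSRS source:

```python
def retrievability(t_days: float, stability: float, decay: float = -0.5) -> float:
    """Power-law forgetting curve: predicted probability of recall after t_days.

    decay is the shape parameter (fixed at -0.5 in FSRS-4.5 and 5, trainable in FSRS-6);
    the factor is chosen so that retrievability is exactly 0.9 when t_days == stability.
    """
    factor = 0.9 ** (1 / decay) - 1  # 19/81 when decay is -0.5
    return (1 + factor * t_days / stability) ** decay

stability = 30.0
print(round(retrievability(30, stability), 3))  # 0.900 by construction
print(round(retrievability(60, stability), 3))  # about 0.82 after twice the stability
print(round(retrievability(15, stability), 3))  # about 0.95 after half the stability
```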

After each review, stability is updated by multiplying it by a stability increase factor — a function of difficulty, current stability, and retrievability at the moment of review. The key insight: harder items receive smaller stability boosts. Items with higher existing stability also receive proportionally smaller boosts — a phenomenon called stabilization decay. And items reviewed at lower retrievability (closer to forgetting) receive larger boosts — precisely what Bjork's desirable-difficulties framework predicts [7].
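The real FSRS-6 update uses trainable weights fit to the user's history; the constants below are invented purely to show the three qualitative effects in one place: harder items grow less, already-stable items grow proportionally less, and reviews made closer to the point of forgetting grow more.

```python
import math

def stability_after_success(s: float, d: float, r: float) -> float:
    """Toy stability update after a successful review (illustrative constants, not FSRS weights).

    s: current stability in days, d: difficulty on the 1-10 scale,
    r: retrievability at the moment of review.
    """
    gain = (
        math.exp(0.5)                      # base growth, a stand-in for a trained weight
        * (11 - d)                         # harder items (higher d) get smaller boosts
        * s ** -0.1                        # stabilization decay: high stability grows proportionally less
        * (math.exp(2.0 * (1 - r)) - 1)    # reviewing closer to forgetting (lower r) gives a bigger boost
    )
    return s * (1 + gain)
```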

[Diagram: the review loop. A new card gets its first review. If recall is correct, D, S, and R are updated; if not, stability is reset. The next review is scheduled, time passes, and R decays along the forgetting curve until the card comes due again.]

FSRS-6 has twenty-one trainable parameters. The default values were trained on roughly seven hundred million reviews from about ten thousand users via the anki-revlogs-10k dataset. But each individual user can optimize FSRS to their own review history using gradient descent — typically after accumulating about one thousand reviews. The per-user optimization accounts for which card types the learner finds harder, how quickly they forget, and how lapses affect long-term stability [29].
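What "optimizing FSRS to your own history" means mechanically is gradient descent on prediction error. The sketch below is a deliberately stripped-down illustration that fits only the forgetting curve's decay exponent to a list of (elapsed-time-over-stability, recalled) pairs; the real optimizer fits all twenty-one parameters jointly over the full review log:

```python
import math

def mean_log_loss(decay: float, examples) -> float:
    """Mean log loss of the power-law forgetting curve with a given decay exponent."""
    factor = 0.9 ** (1 / decay) - 1
    total = 0.0
    for t_over_s, recalled in examples:
        r = (1 + factor * t_over_s) ** decay
        r = min(max(r, 1e-6), 1 - 1e-6)  # clamp for numerical safety
        total += -math.log(r) if recalled else -math.log(1 - r)
    return total / len(examples)

def fit_decay(examples, decay: float = -0.5, lr: float = 0.05, epochs: int = 500) -> float:
    """Fit the decay to one user's review outcomes by simple gradient descent."""
    eps = 1e-4
    for _ in range(epochs):
        grad = (mean_log_loss(decay + eps, examples) - mean_log_loss(decay - eps, examples)) / (2 * eps)
        decay = min(decay - lr * grad, -0.1)  # keep the curve a decaying power law
    return decay
```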

Reference implementations exist in Python, Rust, Go, JavaScript, Swift, Java, Dart, Ruby, Elixir, and C++. The algorithm is fully open-source under the MIT license.


The Benchmark That Settled the Argument

Claims about algorithm superiority need data. The open-spaced-repetition project maintains the largest public benchmark of scheduling algorithms ever constructed [29].

The dataset contains approximately 9,999 collections from real users, totaling roughly 350 million review predictions after filtering. The primary evaluation metric is log loss — a standard machine-learning measure of how well predicted probabilities match actual outcomes. Lower is better. If an algorithm predicts a 90 percent chance of correct recall and the user gets it right, log loss is small. If the algorithm predicts 90 percent and the user gets it wrong, log loss is large.
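Concretely, for a single review the metric looks like this; the benchmark averages it over hundreds of millions of reviews:

```python
import math

def review_log_loss(predicted_p: float, recalled: bool) -> float:
    """Log loss for one review: small when the predicted probability matches what happened."""
    p = min(max(predicted_p, 1e-6), 1 - 1e-6)
    return -math.log(p if recalled else 1 - p)

print(round(review_log_loss(0.9, True), 3))   # 0.105: confident prediction, correct outcome
print(round(review_log_loss(0.9, False), 3))  # 2.303: confident prediction, wrong outcome
```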

The headline result: FSRS-6 with recency weighting achieves a mean log loss of approximately 0.344. SM-2, adapted with additional formulas to produce probability predictions, scores significantly higher. FSRS-6 produces lower log loss than SM-2 in roughly 99.6 percent of evaluated collections [2].

The practical implication, based on simulation studies: students using FSRS need approximately twenty to thirty percent fewer reviews to maintain the same retention rate. For someone reviewing five hundred cards per day, that translates to one hundred to one hundred fifty fewer reviews daily — a meaningful reduction in study time [2].

[Chart: mean log loss by algorithm, lower is better. Bars for SM-2, HLR, FSRS-4.5, FSRS-5, and FSRS-6 on an axis running from 0.30 to 0.44.]

Against a re-implementation of the proprietary SM-17 algorithm — tested on a smaller subset of eighteen collections where SM-17 logs were available — FSRS achieved approximately eighty-three percent superiority [30]. Even a neural network baseline (GRU-P) outperforms FSRS-6 in raw log loss, suggesting room for further improvement.

But these numbers come with important caveats. SM-2 was never designed to predict probabilities — it was designed to schedule reviews. The comparison requires adding formulas to SM-2 that it was never built with. The twenty to thirty percent efficiency claim comes from simulation, not a randomized controlled trial with real students. And the benchmark was developed by the same team that created FSRS, raising legitimate questions about evaluation bias [31].

The Debate That Is Not Over

The creator of SM-2 and its proprietary successors has responded extensively to the FSRS benchmarks. His position is nuanced [31].

He acknowledges that the FSRS creator "truly understood Algorithm SM-17 and the three-component model of memory" and that the design "deserves praise." But he disputes the comparison methodology on three grounds.

First, he argues that standard machine-learning metrics like log loss are inappropriate for evaluating scheduling algorithms. Log loss rewards calibrated probability predictions — but a scheduling algorithm's job is not to predict probabilities. Its job is to schedule reviews at the right time. An algorithm could have poor log loss but excellent scheduling outcomes, or vice versa.

Second, he contends that the benchmark data — drawn from the largest open-source flashcard platform — is biased toward what he calls "crammers and procrastinators." Users of this platform frequently delay reviews by days or weeks, study in irregular bursts, and abandon cards for months before resuming. This behavior pattern distorts the data-generating process, potentially flattering algorithms that handle irregular use well and penalizing those tuned for disciplined daily practice.

Third, he proposes an alternative evaluation metric — the Universal Metric — based on retrievability-binned root-mean-square error computed inside his own proprietary system. He claims that the latest proprietary algorithms, SM-19 and SM-20, outperform FSRS on this metric. The FSRS community counters that these comparisons test an optimized proprietary algorithm against an unoptimized FSRS, using the proprietary system's own categorization scheme [32].

A direct head-to-head test within the same software — the only way to resolve the dispute definitively — has been planned but not yet completed as of mid-2026.

This is a genuine scientific debate, not a settled question. But for practical purposes, the relevant comparison for most learners is not FSRS versus SM-17 or SM-20. It is FSRS versus SM-2 — because SM-2 is what the open-source world used for seventeen years and what millions of existing card collections still run on. And on that comparison, the data is overwhelming.


The Brain Underneath the Math

Every scheduling algorithm, from SM-2 to FSRS-6, ultimately rests on the biology of how the brain converts experience into lasting memory. Understanding the algorithms requires understanding what happens at the synapse.

When two neurons fire together repeatedly, the connection between them grows stronger — a process called long-term potentiation, or LTP, first demonstrated by Terje Lømo in a rabbit hippocampus in 1966 and formalized by Timothy Bliss and Lømo in 1973 [33]. LTP has two phases: early-phase LTP lasts minutes to hours and depends on existing protein modification. Late-phase LTP lasts days to weeks and requires new protein synthesis — literally building new molecular structures at the synapse [34].

This is where spacing enters the picture. Paul Smolen, Douglas Baxter, and John Byrne published a landmark review in Nature Reviews Neuroscience in 2016 explaining why spaced practice works at the cellular level [34]. The signaling cascades that trigger late-phase LTP — particularly the PKA and MAPK pathways — have refractory periods. After one round of stimulation, they need time to reset before the next round can add to the effect. Massed practice triggers the cascade once and then hammers a system that is temporarily unable to respond. Spaced practice gives each round of stimulation time to consolidate before the next one builds on top of it. The result is additive potentiation rather than saturating potentiation.

The biological parallels to the DSR model are striking. Stability maps onto the structural changes at the synapse — new receptor proteins inserted, dendritic spines enlarged, new synaptic contacts formed. Retrievability maps onto the current strength of the neural pathway — which fluctuates with time, interference, and context. Difficulty maps onto the complexity of the synaptic network that encodes the memory — more complex memories require more distributed networks and are therefore harder to fully potentiate [35].

The most recent evidence from applied settings confirms the magnitude of the effect. A 2025 study in Academic Medicine followed approximately 26,000 physicians over nearly three years and found that different spacing strategies measurably altered long-term knowledge retention and transfer [36]. A meta-analysis covering more than 21,000 medical learners reported a standardized mean difference of 0.78 — a large effect by any standard — for spaced repetition over conventional study methods [37]. And John Dunlosky and colleagues, in their influential 2013 review of ten popular study techniques, rated spaced practice and practice testing as the only two methods to receive a "high utility" rating [38].

The science of spacing is not in question. The question is which algorithm spaces most efficiently. And the answer, for the first time in thirty-five years, is changing.


What Comes After FSRS

The convergence between the proprietary and open-source worlds is now unmistakable. Both camps use the DSR model. Both fit their parameters from data. Both acknowledge that difficulty, stability, and retrievability are the minimum necessary state variables for describing a single memory trace. The remaining differences — incremental online updates versus batch gradient descent, exponential versus power forgetting curves, item-difficulty estimation strategies — are differences of degree, not kind.

But neither FSRS nor any existing algorithm accounts for several factors that memory science has identified as important. Sleep-dependent consolidation — documented by Murre and Dros's twenty-four-hour bump and by decades of research on hippocampal replay during slow-wave sleep [39] [40] — is absorbed into noise rather than modeled explicitly. Circadian rhythms, which gate synaptic plasticity at the molecular level, are ignored entirely. Proactive and retroactive interference between similar items — the reason you confuse the Spanish word for "embarrassed" with the English word "embarrassed" — is not captured by any item-level parameter.

Recent work suggests these gaps may start to close. In 2024, researchers introduced KAR³L, a scheduling model that integrates BERT-style content embeddings into the review scheduler, allowing the algorithm to use semantic similarity between cards to inform difficulty estimation [41]. In 2025, a paper on LECTOR used large language models to detect semantic confusion between flashcards and adjust scheduling accordingly, reporting improved success rates in simulation [42].

The trajectory is clear. The next generation of scheduling algorithms will likely move beyond item-level state variables to incorporate content-aware models that understand what is being learned, not just how well it has been recalled. They may incorporate sleep timing, circadian phase, and interference from related material. And they will almost certainly use the same DSR foundation that Woźniak first articulated and that Ye operationalized — because the three-component model of memory is not just a useful abstraction. It appears to map, at a surprisingly deep level, onto what actually happens at the synapse [43].


Conclusion

The forty-year arc from SM-2 to FSRS reveals a pattern that extends far beyond scheduling algorithms. Theoretical science — Ebbinghaus's curve, the spacing effect, Bjork's storage-retrieval distinction — matured decades before the technology that would exploit it. A single elegant formula, published freely, dominated practice for thirty-five years not because it was the best available, but because everything better was locked behind proprietary walls. And the eventual displacement came not from inside the original lineage, but from outside — an undergraduate who recognized that with cheap compute and open data, the right approach was not algebraic cleverness but statistical learning.

The DSR model — Difficulty, Stability, Retrievability — has now been independently validated by both the proprietary and open-source traditions. It is grounded in biological reality: stability corresponds to synaptic structural changes, retrievability to current pathway strength, and difficulty to network complexity. The mathematical convergence between SM-17 and FSRS, despite their independent origins, is perhaps the strongest evidence that the model captures something real about how human memory works.

For learners, the practical message is both simple and profound. The spacing effect is one of the most replicated findings in all of psychology. The only question is how precisely the algorithm can schedule the next review. SM-2 answered that question with one number per card and a fixed formula. FSRS answers it with three numbers per card and a model trained on the learner's own history. The difference — roughly twenty to thirty percent fewer reviews for the same retention — may not sound revolutionary. But compounded over months and years, across thousands of cards, it is the difference between sustainable practice and burnout.

The story is not over. The next frontier — algorithms that understand content, not just performance — is already visible on the horizon. But the foundation is now solid, open, and mathematically clear. A century and a half after Ebbinghaus sat alone in his study memorizing nonsense syllables, the machines we have built to fight forgetting have finally caught up with the science of memory itself.

Frequently Asked Questions

What is the difference between SM-2 and FSRS?

SM-2 uses a single ease factor per card and a fixed update formula developed in 1987. FSRS uses three variables — difficulty, stability, and retrievability — fitted by machine learning to individual review history. Benchmarks show FSRS reduces required reviews by roughly twenty to thirty percent for the same retention rate.

Why did SM-2 remain the standard for so long?

SM-2 was the last scheduling algorithm published with complete mathematical documentation. Every subsequent version from the original creator was proprietary. Open-source projects had no alternative to implement, so SM-2 became the default by necessity rather than optimality.

What is ease hell in spaced repetition?

Ease hell occurs when a card's ease factor drops irreversibly toward the minimum allowed value through repeated difficult recalls. The card becomes trapped in short review intervals that barely grow, causing review overload. FSRS eliminates this through mean-reverting difficulty estimation.

How does the forgetting curve relate to scheduling algorithms?

The forgetting curve describes how recall probability decays over time after learning. Scheduling algorithms use mathematical models of this curve to predict when a card is about to be forgotten and schedule the review just before that point. FSRS models the curve as a power function with trainable parameters.

Can spaced repetition algorithms account for sleep and circadian rhythms?

Current algorithms including SM-2 and FSRS do not explicitly model sleep or circadian effects. These factors are absorbed into general noise or difficulty estimates. Emerging research using content-aware models and biological timing data suggests future algorithms may incorporate these variables directly.