Introduction
In the fall of 1994, a cognitive psychologist at Carnegie Mellon University asked a question that would quietly reshape education technology for the next three decades. Albert Corbett wanted to know: can a computer figure out, in real time, whether a student has actually learned something, or merely guessed the right answer? The mathematics of knowledge tracing was his answer. He built a probabilistic model with just four numbers, and those four numbers turned out to be powerful enough to drive personalized tutoring for hundreds of thousands of students [1].
That original equation has since been challenged, extended, replaced, and resurrected. Researchers at Stanford wrapped it in neural networks. A team in Seoul fed it through transformers trained on 131 million student interactions. A high school student in China rewrote parts of it to schedule flashcard reviews for millions of users worldwide. And in 2025, a group at the University of Vienna plugged it into an architecture inspired by the human brain's own gating mechanisms [2].
This is the story of the equations behind those systems. Not a textbook treatment, but the actual narrative: who built them, why, what broke, what got fixed, and what the math means for anyone trying to learn anything.

A Professor, a Tutor, and a Hidden Variable
The story starts not with Corbett alone, but with his mentor, John Anderson. Anderson had spent the 1980s building ACT-R, a theory of how human cognition works as a set of production rules. Think of it like this: if you see "7 × 8" and automatically think "56," your brain has fired a production rule. Anderson believed that learning a skill meant acquiring the right set of these rules, and that a computer could track which rules a student had acquired by watching their problem-solving behavior [3].
The problem was observation. A student gets a math problem right. Does that mean they know the underlying rule? Maybe. Or maybe they got lucky. A student gets a problem wrong. Does that mean they lack the skill? Maybe. Or maybe they slipped: they knew the answer but made a careless mistake.
Corbett formalized this uncertainty using a Hidden Markov Model, a statistical framework that had already proven itself in speech recognition and genetics. In an HMM, there is a hidden state you cannot observe directly, and an observable output that gives you imperfect clues about that state. For knowledge tracing, the hidden state is binary: the student either knows the skill or does not. The observable output is also binary: the student answers correctly or incorrectly. The relationship between the two is uncertain, governed by probabilities [1].
What made this model elegant was its simplicity. The entire system runs on four numbers.

Four Numbers That Predict Learning
The four parameters of Bayesian Knowledge Tracing, or BKT, are deceptively simple. P(L₀) is the prior, the probability that the student already knows the skill before any practice. P(T) is the transit rate, the probability of transitioning from not knowing to knowing after each practice opportunity. P(G) is the guess rate, the probability of answering correctly despite not knowing. And P(S) is the slip rate, the probability of answering incorrectly despite knowing [4].
Here is how the math works, step by step. Suppose a student is learning single-digit multiplication. The system sets initial parameters: P(L₀) = 0.10 (most students start not knowing), P(T) = 0.20 (a 20% chance of learning per opportunity), P(G) = 0.25 (one-in-four chance of guessing right on a four-option question), and P(S) = 0.05 (a small chance of careless error).
The student tries the first problem and gets it right. What does the system conclude?
It applies Bayes' theorem. The probability that the student knew the skill, given a correct answer, equals the probability of knowing times the probability of not slipping, divided by the total probability of a correct answer. Plugging in: (0.10 × 0.95) / (0.10 × 0.95 + 0.90 × 0.25) = 0.095 / 0.32 = 0.297.
So after one correct answer, the estimated mastery jumps from 10% to about 30%. Still below threshold. The system then propagates forward: the new P(L) = 0.297 + (1 − 0.297) × 0.20 = 0.438. After just one correct response, the model estimates a 44% chance of mastery [5].
If the student gets the next problem right too, the same update runs again with the new prior. After several consecutive correct answers, P(L) climbs toward 0.95, the typical mastery threshold. The system then stops drilling that skill and moves on. If the student gets one wrong along the way, the posterior drops back, reflecting the possibility that earlier correct answers were lucky guesses.
This table shows a student reaching mastery threshold (0.95) after five opportunities, with one error along the way. The slip on opportunity four drops the estimate, but one more correct response recovers it. The standard fitting procedure uses Expectation-Maximization, the same Baum-Welch algorithm used to train speech recognizers, iterating between estimating hidden states and re-estimating parameters until convergence [6].
One assumption baked into the original model deserves attention. BKT assumes that once a student learns a skill, they never forget it. P(L_{n+1} = 1 given L_n = 1) equals exactly 1.0. No forgetting. This made the math tractable in 1995. It also contradicted everything known about human memory since Ebbinghaus mapped the forgetting curve in 1885. That tension would take two decades to resolve.
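The update walked through in this section can be sketched in a few lines of Python. This is a minimal illustration under the parameters quoted above, not the code of any deployed tutor; the function names bkt_posterior and bkt_step are made up for this example.

```python
def bkt_posterior(p_know, correct, guess, slip):
    """Bayes step: probability the skill was known, given the observed answer."""
    if correct:
        num = p_know * (1.0 - slip)                 # knew it and did not slip
        den = num + (1.0 - p_know) * guess          # ... or did not know and guessed
    else:
        num = p_know * slip                         # knew it but slipped
        den = num + (1.0 - p_know) * (1.0 - guess)  # ... or did not know and missed
    return num / den


def bkt_step(p_know, correct, transit, guess, slip):
    """One practice opportunity: Bayes update, then the learning transition.

    There is no forgetting term: a known skill stays known, as in the
    original model described in the text.
    """
    posterior = bkt_posterior(p_know, correct, guess, slip)
    return posterior + (1.0 - posterior) * transit


# Parameters from the multiplication example in the text.
P_L0, P_T, P_G, P_S = 0.10, 0.20, 0.25, 0.05

after_bayes = bkt_posterior(P_L0, True, P_G, P_S)
after_transit = bkt_step(P_L0, True, P_T, P_G, P_S)
print(f"posterior after one correct answer: {after_bayes:.3f}")
print(f"mastery carried into the next opportunity: {after_transit:.2f}")
```

Calling bkt_step repeatedly, feeding each result back in as the next p_know, traces a full practice sequence one answer at a time.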

The Identifiability Trap
Within a decade of BKT's publication, researchers started noticing something troubling. When they fit the model to real student data, the parameters sometimes came out wrong in ways that made no sense.
Joseph Beck and Ka-Yee Chang at Worcester Polytechnic Institute published the problem formally in 2007 [7]. They showed that different sets of the four BKT parameters could produce identical sequences of predicted correct and incorrect answers. The system was, in technical terms, non-identifiable. Multiple parameter configurations fit the data equally well. Worse, some of the configurations that Expectation-Maximization converged to were semantically absurd. A guess rate of 0.6 would mean students who do not know the skill get it right more often than not, which undermines the entire logic of the model. Guess and slip rates that sum to more than 1 would mean that knowing the skill makes you less likely to answer correctly than not knowing it, which is nonsensical [8].
This sparked something close to a crisis in the knowledge tracing community. If the parameters do not mean what they claim to mean, how can anyone trust the model's mastery estimates?
Ten years later, Shayan Doroudi and Emma Brunskill at Carnegie Mellon published a rebuttal that clarified the situation [8]. They showed that under reasonable constraints, specifically that P(G) + P(S) is less than 1.0, the BKT model is actually identifiable. The practical problem was not mathematical non-identifiability but semantic degeneracy. EM can converge to local optima where parameters are technically valid but pedagogically meaningless. The fix was not to abandon the model but to constrain parameter search spaces and use better initialization.
Brett van de Sande at Arizona State University had already shown in 2013 that BKT is simple enough to solve analytically rather than numerically [5]. The learning trajectory follows an exponential function determined by just three parameters: P(S), P(T), and P(L₀). This closed-form solution both simplified implementation and made the identifiability terrain easier to map.
What does this mean for practice? Any system using BKT must constrain its parameter space. Guess rates above 0.40 should trigger warnings. Slip rates above 0.30 are suspect. And fitting procedures should be initialized from multiple starting points to avoid degenerate local optima. The model works. But it requires careful engineering, not blind application.
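A sanity check along these lines might look like the sketch below. The thresholds are the ones suggested in the paragraph above, the P(G) + P(S) < 1 condition is the constraint discussed earlier in this section, and the function name and message wording are assumptions made for illustration.

```python
def check_bkt_params(p_l0, transit, guess, slip, max_guess=0.40, max_slip=0.30):
    """Return a list of warnings for a fitted BKT parameter set (empty if clean)."""
    warnings = []
    named = (("P(L0)", p_l0), ("P(T)", transit), ("P(G)", guess), ("P(S)", slip))
    for name, value in named:
        if not 0.0 <= value <= 1.0:
            warnings.append(f"{name} = {value} is not a probability")
    if guess > max_guess:
        warnings.append(f"P(G) = {guess} exceeds {max_guess}")
    if slip > max_slip:
        warnings.append(f"P(S) = {slip} exceeds {max_slip}")
    if guess + slip >= 1.0:
        # Degenerate region: knowing the skill would not raise accuracy.
        warnings.append("P(G) + P(S) >= 1")
    return warnings


print(check_bkt_params(0.10, 0.20, 0.25, 0.05))  # the worked example from the text
print(check_bkt_params(0.10, 0.20, 0.60, 0.50))  # a degenerate fit
```

Running such a check on every restart of the fitting procedure makes it easy to discard degenerate solutions before comparing likelihoods.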

When Regression Beat the Hidden Markov Model
Not everyone was convinced that hidden states were necessary. In 2006, Hao Cen, Kenneth Koedinger, and Brian Junker at Carnegie Mellon proposed Learning Factor Analysis, a logistic regression approach that sidestepped the HMM entirely [9]. LFA modeled the log-odds of a correct response as a linear combination of student ability, skill difficulty, and the number of prior practice opportunities on each skill. No hidden state. No Bayesian updates. Just regression on observable features.
Three years later, Philip Pavlik, Cen, and Koedinger took this further with Performance Factor Analysis, or PFA [10]. PFA replaced the simple opportunity count with separate counts of prior successes and failures on each knowledge component. This was a meaningful improvement. A student who has attempted a skill ten times and succeeded eight times is in a very different state than a student who attempted it ten times and succeeded twice. PFA captured this distinction.
PFA brought another advantage. Real problems often require multiple skills simultaneously. A word problem might test both fraction arithmetic and reading comprehension. BKT, which models each skill independently with its own HMM, struggled with multi-skill items. PFA handled them naturally through its feature vector: just sum the contributions from each involved skill.
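As a rough sketch of the PFA idea, each skill involved in an item contributes an intercept plus weighted counts of prior successes and failures, and the contributions are summed before passing through a sigmoid. The skill names and weight values below are invented for illustration and do not come from any fitted model.

```python
import math


def pfa_probability(skills, successes, failures, weights):
    """Probability of a correct answer on an item that involves `skills`.

    weights[skill] = (intercept, weight per prior success, weight per prior failure)
    """
    logit = 0.0
    for skill in skills:
        beta, gamma, rho = weights[skill]
        logit += beta + gamma * successes.get(skill, 0) + rho * failures.get(skill, 0)
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical weights for two knowledge components.
WEIGHTS = {
    "fractions": (-0.5, 0.30, 0.05),
    "reading": (0.2, 0.10, -0.05),
}

strong = pfa_probability(["fractions"], {"fractions": 8}, {"fractions": 2}, WEIGHTS)
weak = pfa_probability(["fractions"], {"fractions": 2}, {"fractions": 8}, WEIGHTS)
word_problem = pfa_probability(
    ["fractions", "reading"],
    {"fractions": 8, "reading": 3},
    {"fractions": 2, "reading": 1},
    WEIGHTS,
)
print(f"{strong:.3f} {weak:.3f} {word_problem:.3f}")
```

The two single-skill calls show the distinction from the text: both students have ten attempts, but eight successes and two successes give different predictions.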
In head-to-head comparisons on Carnegie Learning Cognitive Tutor data, PFA slightly outperformed standard BKT in cross-validated prediction accuracy [10]. The logistic approach was simpler, faster, and more naturally extensible.
In 2019, Jill-Jênn Vie and Hisashi Kashima at RIKEN and Kyoto University unified the entire family. Their Knowledge Tracing Machines framework showed that Item Response Theory, LFA, PFA, and multidimensional IRT are all special cases of a single factorization machine model [11]. By encoding student, item, skill, and temporal features into a sparse vector and learning pairwise latent-factor interactions, KTM subsumed decades of separate modeling traditions under one mathematical roof. The paper won Best Paper at EDM 2019.
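To make the "special case" claim concrete, here is a minimal factorization-machine score over a sparse set of active binary features. It is a sketch under simplifying assumptions: features are one-hot (present or absent), a logistic link is used, and all feature names and numbers are hypothetical. With every latent vector set to zero, the score collapses to a purely additive model of the student-plus-item kind.

```python
import math
from itertools import combinations


def fm_logit(active, bias, weights, factors):
    """Factorization-machine score for a set of active one-hot features."""
    logit = bias + sum(weights[name] for name in active)
    # Pairwise interactions: dot product of the two features' latent vectors.
    for a, b in combinations(active, 2):
        logit += sum(x * y for x, y in zip(factors[a], factors[b]))
    return logit


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


ACTIVE = ["student:ana", "item:q7", "skill:fractions"]  # hypothetical encoding
WEIGHTS = {"student:ana": 0.4, "item:q7": -0.9, "skill:fractions": 0.1}

# Zero-dimensional interactions: only the additive terms remain.
no_factors = {name: [0.0, 0.0] for name in ACTIVE}
additive = fm_logit(ACTIVE, 0.2, WEIGHTS, no_factors)

# Non-zero latent vectors add pairwise interaction terms on top.
factors = {"student:ana": [1.0, 0.0], "item:q7": [0.5, 2.0], "skill:fractions": [0.0, 0.0]}
interacting = fm_logit(ACTIVE, 0.2, WEIGHTS, factors)

print(f"{sigmoid(additive):.3f} {sigmoid(interacting):.3f}")
```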

The Stanford Experiment That Started a War
In 2015, Chris Piech and colleagues at Stanford published a paper at NeurIPS that split the knowledge tracing community in two [12]. Their idea was straightforward: replace the hand-engineered HMM with a recurrent neural network. Feed it a sequence of (question, answer) pairs. Let the hidden state learn whatever representation of student knowledge the data supports. No predefined skills. No expert-labeled Q-matrix mapping questions to knowledge components. Just data in, predictions out.
They called it Deep Knowledge Tracing, or DKT. And the reported results were stunning. On the ASSISTments 2009 dataset, a widely used benchmark of middle-school math interactions collected by Neil Heffernan at Worcester Polytechnic Institute [13], DKT achieved an AUC of 0.86, compared to roughly 0.67 for standard BKT. That gap of nearly 20 percentage points suggested that deep learning had made traditional knowledge tracing obsolete.
The field erupted. Dozens of follow-up papers appeared within months. But the celebration was premature.
In 2016, Mohammad Khajah, Robert Lindsey, and Michael Mozer at the University of Colorado published a sharp critique titled "How Deep is Knowledge Tracing?" [14]. They showed that much of DKT's advantage came from the dataset, not the model. The ASSISTments 2009 data contained duplicate rows that inflated prediction scores. When duplicates were removed, DKT's lead shrank dramatically. They also showed that BKT augmented with a forgetting parameter and individualized student priors achieved AUC around 0.83, closing most of the gap.
The same year, Kevin Wilson and colleagues demonstrated that simple Bayesian extensions of Item Response Theory outperformed DKT on several benchmarks [15].
Then came the pyKT bombshell. In 2022, Zitao Liu and colleagues at Guangdong Institute of Intelligent Science and Technology released pyKT, a standardized benchmarking library for knowledge tracing models, presented at NeurIPS [16]. When they re-evaluated DKT under fair conditions, using question-level evaluation without skill-label expansion (which had inflated earlier numbers by up to 8.4% on ASSISTments 2009 and 13.1% on the Algebra 2005 dataset), DKT's AUC on ASSISTments 2009 dropped to 0.7541.
The message from pyKT was sobering. Under controlled conditions, the gap between classical and deep models was real but modest. And the strongest deep baseline was not DKT but AKT, the Attentive Knowledge Tracing model by Aritra Ghosh, Neil Heffernan, and Andrew Lan, which achieved 0.7853 [17].
Meanwhile, DKT itself had serious technical flaws. Chun-Kit Yeung and Dit-Yan Yeung at Hong Kong UST identified two problems in 2018 [18]. First, the reconstruction problem: after a student answers a question correctly, DKT sometimes predicts decreased mastery of that very skill, which is logically incoherent. Second, waviness: predictions oscillate sharply between consecutive time steps, making the inferred knowledge state unreliable for any downstream decision.
Their fix, DKT+, added three regularization terms to the loss function. It helped, but it also revealed how fragile the original formulation was. A model that needs external patches to avoid contradicting itself is harder to trust than one whose structure prevents contradictions by design.

Memory Is Not a Matrix, It Is an Orchestra
The years after DKT saw an explosion of architectural innovation. Each new model tried to solve a specific weakness of what came before.
Jianwen Zhang and colleagues at Hong Kong UST introduced Dynamic Key-Value Memory Networks in 2017 [19]. DKVMN maintained two matrices: a static key matrix storing concept representations, and a dynamic value matrix tracking the student's evolving mastery of each concept. This gave the model something DKT lacked: the ability to trace mastery of individual skills rather than collapsing everything into a single hidden vector.
Shalini Pandey and George Karypis at the University of Minnesota brought transformers to knowledge tracing in 2019 with SAKT, the Self-Attentive Knowledge Tracing model [20]. The attention mechanism allowed the model to look back at the entire interaction history and decide which past responses are most relevant for predicting the next one. No more recurrence bottleneck.
Youngduck Choi and colleagues at Riiid Labs in Seoul extended this with SAINT and SAINT+, full encoder-decoder transformer architectures trained on EdNet, the largest public educational dataset with 131 million interactions from 784,309 students preparing for the Korean TOEIC English exam [21] [22]. SAINT+ added two temporal features: elapsed time on each question and lag time between consecutive interactions. That small addition improved AUC by 1.25 percentage points, demonstrating that time matters.
Chun-Kit Yeung also proposed Deep-IRT in 2019, a hybrid that used DKVMN's memory network to process the interaction sequence but outputted interpretable IRT parameters: student ability and item difficulty [23]. The prediction formula was the classic IRT sigmoid: P(correct) = σ(3.0 × ability − difficulty). This gave the model both the representation power of deep learning and the interpretability of psychometrics.
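The output layer quoted above is small enough to write out directly. In this sketch the ability and difficulty values are plain numbers chosen for illustration; in the actual model they would be produced by the memory network, which is not reproduced here.

```python
import math


def deep_irt_output(ability, difficulty, scale=3.0):
    """IRT-style prediction: sigmoid(scale * ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(scale * ability - difficulty)))


# Hypothetical values: the same item seen by a weaker and a stronger student.
print(f"{deep_irt_output(-0.2, 0.5):.3f}")
print(f"{deep_irt_output(0.6, 0.5):.3f}")
```

Because the two inputs keep their psychometric meaning, a teacher can read them separately: one number describes the student, the other describes the item.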
Perhaps the most surprising result came in 2023. Zitao Liu and colleagues published simpleKT, a stripped-down model that removed most of AKT's complexity while keeping its Rasch-model question embeddings and standard scaled-dot-product attention [24]. simpleKT matched or beat AKT, ATKT, and DTransformer on most pyKT benchmarks. The implication was uncomfortable for researchers who had spent years adding complexity: much of that complexity added nothing.

The Missing Variable: Time
The single biggest blind spot in knowledge tracing has always been forgetting. Corbett's original BKT assumed that once learned, knowledge stays forever. DKT and most of its descendants inherited this blindness. Few of them explicitly model what happens when a student stops practicing for a week, a month, or a year.
This matters because forgetting is not a bug in human memory. It is the central feature. Hermann Ebbinghaus demonstrated in 1885 that recall drops steeply in the hours after learning, then levels off into a long, slow decline [25]. Every spaced repetition system since then, from Sebastian Leitner's cardboard boxes to the algorithms running on millions of phones today, exists to fight that curve.
Mohammad Khajah showed in 2016 that simply adding a forgetting parameter to BKT, allowing P(L_{n+1} = 1 given L_n = 1) to be less than 1.0, boosted its AUC from roughly 0.69 to 0.83 on ASSISTments 2009 [14]. One parameter. That was all it took to close most of the gap between a 1995 model and a 2015 neural network.
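In code, the change amounts to one extra term in the transition step. The sketch below is an illustrative reading of "BKT with a forgetting parameter", not the exact implementation from the cited study, and the forget value of 0.10 is invented for the example.

```python
def bkt_forget_step(p_know, correct, transit, guess, slip, forget):
    """One opportunity of BKT with forgetting: Bayes update, then transition."""
    if correct:
        num = p_know * (1.0 - slip)
        den = num + (1.0 - p_know) * guess
    else:
        num = p_know * slip
        den = num + (1.0 - p_know) * (1.0 - guess)
    posterior = num / den
    # A known skill is retained with probability (1 - forget);
    # an unknown skill is learned with probability transit.
    return posterior * (1.0 - forget) + (1.0 - posterior) * transit


classic = bkt_forget_step(0.10, True, 0.20, 0.25, 0.05, forget=0.0)
with_forgetting = bkt_forget_step(0.10, True, 0.20, 0.25, 0.05, forget=0.10)
print(f"{classic:.3f} {with_forgetting:.3f}")
```

Setting forget to zero recovers the original update, and any positive value keeps the mastery estimate from ever reaching 1.0.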
Meanwhile, the spaced repetition world had been developing its own mathematical framework for forgetting, largely disconnected from the knowledge tracing literature. In 2016, Burr Settles and Ben Meeder at Duolingo published Half-Life Regression, a model where the probability of recall decays exponentially with a learnable half-life: p = 2^(−δ/h), where δ is time since last review and h is a half-life determined by features of the student and the item [26]. Trained on 13 million user-word pairs from Duolingo's platform, HLR reduced prediction error by more than 45% compared to the Leitner system and increased daily user engagement by 12% in an operational A/B test.
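The recall formula is simple enough to try by hand. In the sketch below the half-life is computed as 2 raised to a weighted sum of features, which is how the Half-Life Regression paper parameterizes it; the specific feature values and weights are hypothetical.

```python
def half_life_days(weights, features):
    """Half-life as 2 ** (weights . features)."""
    return 2.0 ** sum(w * x for w, x in zip(weights, features))


def recall_probability(days_since_review, half_life):
    """p = 2 ** (-delta / h): recall halves every `half_life` days."""
    return 2.0 ** (-days_since_review / half_life)


# Hypothetical features: [times seen correctly, bias term].
h = half_life_days([1.0, 0.5], [2.0, 2.0])
for delta in (0.0, h, 2.0 * h):
    print(delta, recall_probability(delta, h))
```

At a delay of exactly one half-life the predicted recall is 50%, which is where the model gets its name.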
The most recent evolution in this line is FSRS, the Free Spaced Repetition Scheduler, created by Jarrett Ye. FSRS models each memory with three continuous state variables: Difficulty (how hard the item is), Stability (the number of days it takes for recall probability to fall to 90%), and Retrievability (the current probability of recall). The forgetting curve follows a power law: R(t, S) = (1 + t / (9 × S))^(−1), so that R = 0.9 when t = S [27].
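The curve above can be evaluated directly, and inverted to ask how long a scheduler could wait before recall falls to a chosen target. This is a sketch of that one formula only, as quoted in the text. It does not cover how FSRS updates Stability or Difficulty after a review, and the function names are made up for this example.

```python
def retrievability(days_elapsed, stability):
    """R(t, S) = (1 + t / (9 * S)) ** -1, the power-law curve quoted above."""
    return (1.0 + days_elapsed / (9.0 * stability)) ** -1


def interval_for_target(stability, target_retention):
    """Days until R falls to target_retention, from solving R(t, S) for t."""
    return 9.0 * stability * (1.0 / target_retention - 1.0)


S = 10.0  # hypothetical stability, in days
for t in (0, 5, 10, 30, 90):
    print(t, f"{retrievability(t, S):.3f}")
print(f"{interval_for_target(S, 0.80):.1f}")
```

Plugging t = S into the formula gives exactly 0.9, so with a 90% target the next interval equals the current stability.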
Here is the connection that almost nobody makes. FSRS is, mathematically, a continuous-state knowledge tracing model with explicit forgetting. Where BKT has a binary latent state (knows or does not know), FSRS has a continuous triple (D, S, R). Where BKT has no time dependence between opportunities, FSRS has R(t, S) decaying continuously. Where BKT updates its state through Bayesian inference on binary observations, FSRS updates Stability and Difficulty through differentiable functions trained by gradient descent on review outcomes [28].
Benchmarked on approximately 350 million reviews from about 10,000 users, FSRS version 6 produces more accurate recall predictions than SM-2 for 99.6% of users tested [29]. The two literatures, knowledge tracing and spaced repetition, started from different questions but converged on the same mathematical structure. Knowledge tracing asks: does the student know this right now? Spaced repetition asks: will the student remember this later? Both require modeling a latent knowledge state that changes with practice and decays with time.

From Equations to Classrooms
Mathematics means nothing if it does not work in practice. The most important question about knowledge tracing is not whether AUC improves by two percentage points on a benchmark, but whether students learn more when an algorithm decides what they practice next.
The oldest and most rigorously tested deployment is Carnegie Learning's MATHia platform, descended directly from the Cognitive Tutors that Corbett and Anderson built at CMU. MATHia uses BKT with model tracing to track student mastery of algebraic skills in real time. A randomized controlled trial conducted by RAND (Pane et al. 2014) across 147 schools found that students using the Cognitive Tutor showed a statistically significant advantage in Year 2, with an effect size of d = 0.20, equivalent to moving from the 50th to the 58th percentile [30]. The platform now serves over 500,000 students annually [31].
ASSISTments, built by Neil and Cristina Heffernan at WPI, took a different path. Rather than embedding a specific KT model, ASSISTments uses mastery-based assignment completion with real-time feedback. The platform produces the datasets that most KT researchers train on. A 2024 replication study by WestEd across 5,991 middle-school students in North Carolina found statistically significant gains on the state End-of-Grade math test [32].
A meta-analysis by James Kulik and J.D. Fletcher in the Review of Educational Research found that intelligent tutoring systems, the category that includes KT-powered platforms, produce a median effect size of +0.66 standard deviations compared to conventional instruction [33]. That is a substantial effect. For comparison, reducing class size from 25 to 15 students produces an effect of about 0.20.
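The percentile conversion quoted for the RAND trial can be checked with the normal distribution. The sketch assumes that outcomes in the comparison group are normally distributed, which is the usual convention for this kind of translation; the function name is hypothetical.

```python
import math


def percentile_from_effect_size(d):
    """Percentile reached by an average student after a shift of d standard deviations."""
    return 100.0 * 0.5 * (1.0 + math.erf(d / math.sqrt(2.0)))


for d in (0.20, 0.66):
    print(d, f"{percentile_from_effect_size(d):.1f}")
```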
The adaptive learning market built on these foundations reached an estimated $4 to $5 billion in 2024 and is projected to grow to $10 to $28 billion by 2030, depending on the analyst [34].

When the Algorithm Gets It Wrong
Every model described in this article makes assumptions that can fail. BKT assumes binary knowledge. DKT assumes no forgetting. PFA assumes skills are independent. Transformer models assume that more data always helps. And all of them assume the training data is representative.
The fairness problem is particularly serious. In 2019, Doroudi and Brunskill showed that BKT systematically underestimates the mastery of above-average learners [35]. Because the model is fit across an entire student population, students who learn faster than average appear to the model as lucky guessers rather than genuine masters. This can cause the system to assign unnecessary extra practice to exactly the students who need it least.
A 2025 study presented at the Educational Data Mining conference examined BKT fairness specifically for math learners and found that prediction accuracy varied significantly by reading ability [36]. Students with lower reading levels received less accurate mastery estimates, likely because their incorrect answers reflected language barriers rather than mathematical misunderstanding.
The pyKT benchmark also exposed a systemic problem in how the field reports results. The same model on the same dataset can produce wildly different AUC numbers depending on preprocessing choices: whether duplicate rows are removed, whether skill labels are expanded or collapsed, whether evaluation is at the question level or the skill level [16]. This means that many of the headline improvements reported in conference papers may not reflect genuine progress.
Théophile Gervet and colleagues at Stanford and École Polytechnique conducted the most careful comparison to date in 2020. Testing DKT, BKT, PFA, and logistic regression baselines across nine datasets with strict five-fold student-level cross-validation, they found that a well-tuned logistic regression model with rich features outperformed DKT on four of nine datasets [37]. Deep learning won on large, sequential datasets. Classical models won on smaller ones. The honest answer to "which model is best?" turned out to be: it depends on your data.

The 2025 Frontier: Bigger Architectures, Harder Questions
The latest chapter of this story is being written in real time. In January 2025, Yiyun Zhou, Zhaoyang Han, and colleagues published DKT2, a knowledge tracing model built on the xLSTM architecture created by Sepp Hochreiter, the inventor of the original LSTM that powered the first generation of DKT [2]. xLSTM adds exponential gating and matrix memory to the classic LSTM cell. DKT2 outperformed 17 baselines across three large-scale datasets, including transformer-based and Mamba-based models. Hochreiter himself endorsed the results publicly.
Mamba4KT, published in 2024, applied selective state-space models to knowledge tracing, achieving linear-time complexity compared to the quadratic scaling of transformer attention [38]. For platforms processing millions of interactions daily, the difference in computational cost matters.
The most talked-about development is the entry of large language models into knowledge tracing. Multiple research groups have tested whether LLMs can predict student performance. Neshaei and colleagues fine-tuned GPT-3 on student interaction data and found it matched BKT but fell short of specialized neural KT models [39]. Lee and colleagues reformulated knowledge tracing as a natural language task, fine-tuning BERT and DeBERTa-v3 on text descriptions of student interactions [40]. Their model beat DKT and AKT on text-rich datasets but underperformed on traditional numerical benchmarks.
Haoxuan Li and colleagues explored few-shot knowledge tracing with LLMs, using a three-stage cognition-guided prompting framework with GPT-4 [41]. The results were competitive with deep KT models using only a handful of student examples, suggesting that LLMs might excel precisely where traditional models struggle most: the cold-start problem, where a new student has too few interactions for reliable estimation.
The honest summary of LLM-based knowledge tracing in 2025 is this: LLMs do not yet beat specialized models on standard benchmarks [42]. They win on cold-start scenarios, cross-domain generalization, and the ability to generate natural-language explanations of their predictions. The future likely lies in hybrid systems that combine the pattern recognition of specialized architectures with the reasoning and language capabilities of large models.

The Equation That Still Needs Writing
Thirty years after Corbett and Anderson published their four-parameter model, the mathematics of knowledge tracing remains an open problem. No existing model simultaneously handles all the things real learning involves: acquiring new knowledge, forgetting old knowledge, transferring skills between domains, accounting for motivation and fatigue, and adapting to individual differences in learning speed.
The research community is converging on several frontiers. Causal knowledge tracing aims to answer not just "will this student get the next question right?" but "what caused them to learn or fail?" [43]. Multimodal knowledge tracing incorporates video, code, and equation data alongside traditional response sequences [44]. Fairness-aware tracing seeks to ensure that prediction accuracy does not vary systematically across demographic groups [45].
And perhaps most fundamentally, the artificial boundary between knowledge tracing and spaced repetition is dissolving. Both fields model the same underlying process: how a learner's memory state changes with practice and time. BKT and FSRS are different parameterizations of the same basic problem. The field is slowly recognizing this, and the next generation of models will likely treat learning, assessment, and scheduling as a single unified optimization problem.
The equation that Corbett wrote in 1994 was never meant to be the final answer. It was a first approximation, a starting point, a proof that the question itself was worth asking. Thirty years and thousands of papers later, the question has only gotten more interesting. And the math, richer.

Frequently Asked Questions
What is knowledge tracing and how does it work?
Knowledge tracing is a set of mathematical models that estimate what a student knows based on their history of correct and incorrect answers. The system maintains a hidden probability of mastery for each skill, updates it after every response using Bayesian inference or machine learning, and predicts future performance. The original model uses four parameters: prior knowledge, learning rate, guess rate, and slip rate.
What is the difference between BKT and deep knowledge tracing?
Bayesian Knowledge Tracing uses a two-state Hidden Markov Model with four interpretable parameters per skill. Deep Knowledge Tracing replaces this with a recurrent neural network that learns hidden representations from data without predefined skill labels. BKT is more interpretable and works well with small datasets. DKT handles larger, more complex interaction sequences but sacrifices transparency and can produce contradictory predictions.
Can knowledge tracing models handle forgetting?
Most standard knowledge tracing models, including the original BKT and DKT, assume that once a skill is learned it is never forgotten. This is a known limitation. Extended versions like BKT-with-forgetting add a forgetting parameter, and spaced repetition algorithms like FSRS explicitly model memory decay over time using a power-law forgetting curve with learnable stability parameters.
How accurate are knowledge tracing models?
Under fair evaluation conditions using student-level cross-validation, the best deep models achieve AUC scores of approximately 0.78 to 0.80 on standard benchmarks like ASSISTments 2009. BKT with extensions such as forgetting and individualized priors has been reported at around 0.83 on the same dataset, though that figure comes from a separate 2016 evaluation and is not directly comparable. Simple logistic regression baselines achieve 0.72 to 0.77. Reported accuracy varies significantly depending on preprocessing and evaluation methodology.
What is the relationship between knowledge tracing and spaced repetition?
Both model the same underlying process: how a learner's memory state changes with practice and time. Knowledge tracing asks whether a student currently knows a skill. Spaced repetition asks when a student will forget a skill and schedules a review before that happens. Mathematically, spaced repetition algorithms like FSRS are continuous-state knowledge tracing models with explicit time-dependent forgetting curves.





