Introduction

In September 2024, a physics professor at Harvard ran an experiment that split his class in two. Half studied with a traditional active-learning format. The other half worked with an AI tutor built on a large language model. The AI group learned more than twice as much in less time [1]. Three months later, a different team in Europe ran the inverse experiment. Students who used ChatGPT to study retained significantly less information after 45 days compared to those who studied without it [2].

Same technology. Opposite results. The difference was not in the model. It was in how the model generated and delivered the content.

This contradiction sits at the center of one of the most urgent questions in education. Large language models can now produce lesson plans, quiz questions, reading materials, and tutoring dialogues at a speed no human team could match. But what are these models actually doing when they produce educational text? How does a machine trained on internet data end up writing a passable explanation of mitosis or the French Revolution? And where does that process break down in ways that matter for learning?

The answers require going deeper than most discussions about AI in education are willing to go. They require understanding the machine itself, from the mathematics of attention to the statistics of token prediction, and then connecting that machinery to what cognitive science knows about how humans actually learn. That connection is what this article builds, step by step.


Before the Transformer: When Machines Could Barely Finish a Sentence

The story of how LLMs generate educational content starts with a problem that haunted computer science for decades: predicting the next word.

In the 1990s, the best language models were built on n-grams. Count how often a specific sequence of words appears in a large body of text, then use those frequencies to guess what comes next. "The cat sat on the ____." If "mat" appeared after that phrase 47 times in the training data and "dog" appeared 3 times, the model would predict "mat." Simple. Fast. And terrible at anything longer than a few words. N-gram models had no memory. They could not connect the beginning of a paragraph to its end. They could not hold a concept across a sentence.
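The counting idea can be sketched in a few lines of Python. This is a minimal illustration, assuming a made-up toy corpus and hypothetical function names (`train_ngram`, `predict_next`); it is not any particular historical system.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n=3):
    """Count which word follows each (n-1)-word context."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def predict_next(counts, context):
    """Return the most frequent continuation, or None for an unseen context."""
    followers = counts.get(tuple(context))
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Toy corpus, invented for illustration only.
corpus = ("the cat sat on the mat . the cat sat on the mat . "
          "the cat sat on the dog .").split()
model = train_ngram(corpus, n=3)

print(predict_next(model, ["on", "the"]))       # mat (seen twice) beats dog (seen once)
print(predict_next(model, ["the", "teacher"]))  # None: this context never appeared
```

The second call shows the weakness described above: the model only sees the last two words, so any context it has not counted leaves it with nothing to say.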

Recurrent neural networks changed this. Introduced in various forms through the 1980s and 1990s, RNNs processed words one at a time in sequence, passing a "hidden state" forward like a baton in a relay race. Each word updated the state, and the state influenced predictions about the next word. The Long Short-Term Memory network, proposed by Sepp Hochreiter and Jürgen Schmidhuber in 1997 [3], added gates that controlled what to remember and what to forget. LSTMs could, in theory, maintain information across hundreds of steps.

In practice, they struggled. Training was slow because each word had to be processed sequentially. Long documents overwhelmed them. And the "hidden state," no matter how sophisticated the gating, was a single compressed vector trying to carry the meaning of everything that came before. Think of it like trying to summarize an entire textbook into a single sentence and then using only that sentence to answer questions. Information gets lost.
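The relay-baton picture, and the way early information fades from it, can be sketched with a deliberately tiny recurrent update. This is a plain RNN step with a single scalar state and hypothetical fixed weights (`w_h`, `w_x`), not an LSTM and not a trained model; it only illustrates how a compressed state can lose track of its first input.

```python
import math

def rnn_step(hidden, x, w_h=0.5, w_x=1.0):
    """One recurrent update: the new state mixes the old state with the current input."""
    return math.tanh(w_h * hidden + w_x * x)

def run_rnn(inputs):
    """Pass the hidden state forward through the whole sequence, one input at a time."""
    hidden = 0.0
    for x in inputs:
        hidden = rnn_step(hidden, x)
    return hidden

# Two sequences that differ only in their first input; the next 20 inputs are identical.
a = run_rnn([1.0] + [0.1] * 20)
b = run_rnn([-1.0] + [0.1] * 20)

# The gap between the final states is tiny: the first input has nearly vanished.
print(abs(a - b))
```

With these assumed weights, each step shrinks the influence of earlier inputs by at least half, which is the fading the textbook-in-one-sentence analogy describes. LSTM gates were designed to slow this decay, not to remove the single-vector bottleneck.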

For educational content, this mattered enormously. A model that cannot hold the thread of an argument across three paragraphs cannot explain how photosynthesis works, or why the Treaty of Versailles failed, or what makes a proof valid. It can finish sentences. It cannot teach.


The Paper That Changed Everything

On June 12, 2017, a team of eight researchers at Google published a paper with a deceptively simple title: "Attention Is All You Need" [4]. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Illia Polosukhin proposed a new architecture called the Transformer. It threw away recurrence entirely. No more processing words one at a time. No more relay batons.

Instead, the Transformer introduced a mechanism called self-attention. The idea: when processing any word in a sentence, the model should be able to look at every other word simultaneously and decide which ones matter most for understanding the current one. In the sentence "The student forgot the answer because the test was too hard," the word "forgot" needs to attend to "student" (who forgot?) and "answer" (what was forgotten?) and "test" (why?). Self-attention computes these relationships in parallel, across the entire input, all at once.

The mathematics behind it are elegant. Each word is converted into three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what information do I carry?). The model multiplies each Query against every Key, scales the result, applies a softmax function to get attention weights, and then uses those weights to blend the Values. The output for each word is a weighted mixture of information from every other word, where the weights reflect relevance.
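The Query, Key, and Value steps above can be sketched in plain Python. This is a minimal illustration with made-up two-dimensional vectors; it skips the learned projection matrices that produce Q, K, and V in a real Transformer, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Multiply this Query against every Key, then scale by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Blend the Values using the attention weights.
        blended = [sum(w * v[j] for w, v in zip(weights, values))
                   for j in range(len(values[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token vectors (invented numbers, not real embeddings).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, weights = self_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])  # each row sums to 1
```

Each row of `weights` says how much one token draws from every other token. Nothing in the loop depends on how far apart two tokens sit in the sequence.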

Multi-head attention runs this process multiple times in parallel (eight heads in the original paper), each with different learned projections. One head might learn to track grammatical relationships. Another might track semantic similarity. Another might track position. The outputs are concatenated and projected back into a single representation.
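In the notation of the original paper, the single-head computation and its multi-head extension can be written compactly. The symbols below follow Vaswani et al. [4]: $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ are the learned projections for head $i$, $W^{O}$ is the output projection, $d_k$ is the key dimension, and $h = 8$ in the base model.

```latex
\begin{align}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \\
\mathrm{head}_i &= \mathrm{Attention}\!\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right) \\
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}
\end{align}
```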

Why did this matter for educational content? Because self-attention solved the distance problem. In an LSTM, connecting "mitochondria" in sentence one to "ATP production" in sentence fifteen required information to survive fifteen sentences' worth of sequential processing steps. In a Transformer, the connection is direct. Every token can attend to every other token regardless of distance. This is the architectural reason modern LLMs can write coherent multi-paragraph explanations. Within its context window, the model always has direct access to the beginning of its own output.

1997: Hochreiter and Schmidhuber publish the LSTM architecture
2003: Bengio introduces neural probabilistic language models
2017: Vaswani et al. publish "Attention Is All You Need"
2018: Devlin et al. release BERT with bidirectional pre-training
2020: Brown et al. release GPT-3 with 175 billion parameters
2022: Ouyang et al. publish InstructGPT using RLHF
2024: Gemini 1.5 reaches 10 million token context windows
2025: Education-specific fine-tuned models emerge

How a Language Model Writes a Paragraph

When a large language model generates an explanation of, say, how vaccines work, it does not retrieve the explanation from a database. It constructs it word by word, or more precisely, token by token.

The core mechanism is autoregressive generation. The model sees everything it has generated so far, passes it through its Transformer layers, and produces a probability distribution over its entire vocabulary for the next token. In GPT-3, that vocabulary contains roughly 50,000 tokens [5]. Each token gets a probability. The model then selects one and appends it to the sequence. Then it repeats the process.
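The generate-append-repeat loop can be sketched with a toy stand-in for the model. Everything here is hypothetical: `TOY_MODEL` is a hand-written lookup table that only looks at the last token, whereas a real Transformer conditions on the whole sequence and scores tens of thousands of tokens at each step.

```python
# Hypothetical toy "model": maps the last token to next-token probabilities.
TOY_MODEL = {
    "<start>": {"vaccines": 1.0},
    "vaccines": {"train": 0.7, "help": 0.3},
    "train": {"the": 0.9, "immune": 0.1},
    "help": {"the": 1.0},
    "the": {"immune": 0.8, "body": 0.2},
    "immune": {"system": 1.0},
    "body": {"<end>": 1.0},
    "system": {"<end>": 1.0},
}

def next_token_distribution(sequence):
    """Stand-in for a Transformer forward pass over everything generated so far."""
    return TOY_MODEL[sequence[-1]]

def generate(max_tokens=10):
    sequence = ["<start>"]
    for _ in range(max_tokens):
        probs = next_token_distribution(sequence)
        token = max(probs, key=probs.get)  # greedy: pick the most probable token
        if token == "<end>":
            break
        sequence.append(token)             # append, then repeat
    return sequence[1:]

print(" ".join(generate()))  # vaccines train the immune system
```

The selection step here is greedy. Swapping that one line for a sampling rule is where temperature, top-k, and top-p come in.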

This is where sampling parameters become critical for educational quality. Three settings control the character of the output.

Temperature scales the probability distribution. At temperature 0, the model always picks the highest-probability token. The output is deterministic and repetitive. At temperature 1.0, the model samples according to the raw probabilities, producing more varied but less predictable text. For educational content, researchers have found that temperatures between 0.2 and 0.5 preserve factual accuracy while allowing enough variation to avoid robotic repetition.
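A minimal sketch of temperature scaling, assuming made-up raw scores (logits) for three candidate tokens. Implementations usually treat temperature 0 as a special case (plain argmax) because dividing by zero is undefined; the hypothetical helper below simply rejects it.

```python
import math

def apply_temperature(logits, temperature):
    """Divide raw scores by the temperature, then convert to probabilities."""
    if temperature <= 0:
        raise ValueError("temperature 0 is handled as greedy argmax, not by division")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # invented scores for three candidate tokens

cool = apply_temperature(logits, 0.3)  # low temperature sharpens the distribution
raw = apply_temperature(logits, 1.0)   # temperature 1.0 leaves it unchanged

print([round(p, 3) for p in cool])
print([round(p, 3) for p in raw])
```

At 0.3 the top token takes nearly all of the probability mass. At 1.0 the second and third candidates keep a real chance of being sampled.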

Top-k sampling, introduced by Angela Fan, Mike Lewis, and Yann Dauphin in 2018 [6], restricts selection to the k most probable tokens. If k=50, the model only considers the top 50 candidates at each step, ignoring the remaining 49,950. This prevents the model from occasionally selecting wildly improbable tokens that derail coherence.
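A minimal sketch of the top-k restriction, assuming an invented five-token distribution and a hypothetical helper name. It keeps the k most probable tokens and renormalizes so the survivors still sum to 1.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize; drop the rest."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Invented next-token probabilities after "The mitochondria is the ..."
probs = {
    "powerhouse": 0.60,
    "organelle": 0.25,
    "site": 0.10,
    "banana": 0.04,
    "xylophone": 0.01,
}

filtered = top_k_filter(probs, k=3)
print(filtered)  # "banana" and "xylophone" can no longer be sampled
```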

Top-p (nucleus) sampling, proposed by Ari Holtzman and colleagues in 2020 [7], takes a different approach. Instead of fixing the number of candidates, it includes tokens until their cumulative probability reaches a threshold p. If p=0.9, the model considers enough tokens to cover 90% of the probability mass. On some steps, this might mean 10 tokens. On others, 500. The method adapts dynamically to the model's confidence.
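A minimal sketch of nucleus filtering, with two invented distributions to show how the candidate set grows and shrinks. The helper name and the probabilities are hypothetical.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# A confident step: one token dominates.
peaked = {"powerhouse": 0.92, "organelle": 0.05, "site": 0.02, "banana": 0.01}
# An uncertain step: several openings for an analogy are about equally plausible.
flat = {"like": 0.30, "imagine": 0.25, "think": 0.20, "picture": 0.13, "consider": 0.12}

print(len(top_p_filter(peaked, 0.9)))  # 1 candidate survives
print(len(top_p_filter(flat, 0.9)))    # 5 candidates survive
```

The same threshold yields one candidate when the model is confident and five when it is not, which is the adaptive behavior a fixed k cannot provide.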

What does this mean for educational output? When the model is generating a well-established fact ("The mitochondria is the..."), the probability distribution is sharply peaked. One token dominates. Low temperature and tight nucleus sampling produce accurate, consistent text. But when the model is generating a transition sentence or an analogy, the distribution is flatter. Multiple tokens are roughly equally plausible. Here, slightly higher temperature produces more natural, engaging prose. The same model, with different sampling parameters, can produce a dry textbook paragraph or a lively science article. The words come from probabilities, not understanding.


Three Training Stages That Shape What the Model Knows

A language model does not arrive at educational competence through a single training process. It passes through three distinct stages, each shaping what it can and cannot do.

The first stage is pre-training. The model reads billions of words from the internet, books, academic papers, forums, and other sources. It learns to predict the next token. This stage takes weeks on thousands of GPUs and costs millions of dollars. It produces a model with broad knowledge but no particular skill. It can complete sentences about chemistry, history, coding, and cooking, but it does not know how to follow instructions or answer questions directly. Think of pre-training as building a vast, disorganized library inside the model's weights.

The second stage is supervised fine-tuning. Human contractors write thousands of example conversations: a question and an ideal answer. The model trains on these examples and learns to produce structured, helpful responses instead of raw text completions. This is where the model acquires the ability to answer "Explain photosynthesis to a 10-year-old" differently from "Explain photosynthesis to a biology graduate student."

The third stage is RLHF, or Reinforcement Learning from Human Feedback. Published by Long Ouyang and colleagues at OpenAI in 2022 [8], this process works in three steps. First, human raters compare pairs of model outputs and choose which is better. Second, those preferences train a separate "reward model" that predicts human satisfaction. Third, the language model is fine-tuned to maximize the reward model's score. The result was striking: a 1.3 billion parameter InstructGPT model, 100 times smaller than GPT-3, produced outputs that humans preferred over GPT-3's raw output.
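The second step, training a reward model from pairwise comparisons, is commonly written as a pairwise logistic loss: the negative log of the sigmoid of the score gap between the preferred and rejected outputs. The sketch below assumes that form and uses made-up scalar rewards; a real reward model is a neural network producing those scores, and the function name is hypothetical.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): small when the preferred output scores higher."""
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))

# Invented reward scores for a (preferred, rejected) pair of explanations.
print(pairwise_preference_loss(2.0, 0.0))  # reward model agrees with the rater: low loss
print(pairwise_preference_loss(0.0, 0.0))  # reward model is indifferent: loss = ln 2
print(pairwise_preference_loss(0.0, 2.0))  # reward model disagrees: high loss
```

Minimizing this loss pushes the reward model to score rater-preferred outputs higher. Nothing in it checks whether the preferred output is true.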

For educational content, RLHF is where safety and accuracy get tuned. Raters penalize hallucinations. They penalize harmful content. They reward clarity, appropriate depth, and pedagogical scaffolding. Constitutional AI, proposed by Yuntao Bai and colleagues at Anthropic [9], extends this by training models to self-critique against a set of principles, enabling age-appropriate filtering and bias mitigation at scale.

But here is the catch. RLHF optimizes for what humans rate as good, not for what is true. A confident, well-structured, fluently written explanation that contains a subtle factual error can score higher than a hesitant, awkward, but accurate one. This misalignment is the root cause of a problem that plagues every educational application of LLMs.


The Hallucination Problem in Education

The term "hallucination" is itself controversial. Some researchers argue the word implies the model is perceiving something false, when in reality it is simply generating statistically plausible sequences that happen to be incorrect [10]. Regardless of terminology, the problem is real, and in educational contexts, it is dangerous.

The numbers vary wildly depending on the task. In clinical note summarization with retrieval-augmented generation, Asgari and colleagues found a hallucination rate of just 1.47% [11]. But when Chelli and colleagues tested ChatGPT and Bard on generating systematic review citations, GPT-3.5 hallucinated 39.6% of references, GPT-4 hallucinated 28.6%, and Bard hallucinated a staggering 91.4% [12]. Bhattacharyya and colleagues checked 115 references generated by ChatGPT and found 47% were entirely fabricated, 46% were real but cited inaccurately, and only 7% were correct [13].

In medical education specifically, Omar and colleagues planted deliberate clinical errors into case studies and asked LLMs to evaluate them. The models repeated or elaborated on the planted errors in up to 83% of cases [14]. When Liu and colleagues tested medical trainees on their ability to detect LLM-generated hallucinations, accuracy was only 55% [15].

In legal education, the picture is equally troubling. Dahl and colleagues found hallucination rates between 69% and 88% across GPT-3.5, PaLM 2, and Llama 2 on legal questions [16]. Magesh and colleagues tested even commercial legal AI tools: Westlaw AI hallucinated in roughly 33% of responses, and raw GPT-4 hallucinated in 58 to 82% [17].

| Domain | Model | Hallucination rate | Source |
|---|---|---|---|
| Clinical summarization (with RAG) | Multiple | 1.47% | Asgari et al. 2025 |
| Systematic review citations | GPT-4 | 28.6% | Chelli et al. 2024 |
| Systematic review citations | GPT-3.5 | 39.6% | Chelli et al. 2024 |
| Systematic review citations | Bard | 91.4% | Chelli et al. 2024 |
| General reference generation | ChatGPT | 47% fabricated | Bhattacharyya et al. 2023 |
| Legal queries | GPT-4 | 58-82% | Magesh et al. 2025 |
| Legal queries | Westlaw AI | ~33% | Magesh et al. 2025 |
| Clinical error elaboration | Multiple | up to 83% | Omar et al. 2025 |

Why do hallucination rates vary so dramatically? The answer goes back to how LLMs generate content. The model produces the most probable next token given its context. When the context contains grounded source material (as in RAG-augmented systems), the probability distribution is anchored to real information. When the context is an open-ended question with no grounding, the model is free to generate whatever sequence maximizes fluency. And fluent nonsense is still nonsense.


What Bloom's Taxonomy Reveals About LLM Limitations

If hallucination rates measure factual accuracy, Bloom's taxonomy measures cognitive depth. And here, the results challenge intuition.

Herrmann-Werner and colleagues tested GPT-4 on 307 psychosomatic medicine questions mapped to Bloom's taxonomy levels [18]. The model scored 78.9% overall. But errors were concentrated at the lowest cognitive levels, "remember" and "understand," where the model made 29 and 23 errors respectively. This is counterintuitive. A model trained on vast corpora should excel at factual recall. The researchers attributed the pattern to model biases and the tendency to generate outputs that maximize likelihood rather than accuracy.

Maity and colleagues tested GPT-4 Turbo on the EduProbe dataset, covering 1,005 question pairs from Indian school textbooks across grades 6 through 12 [19]. They found that quality was inversely related to Bloom's level. The model generated strong "remember" and "understand" questions but struggled with "apply," "analyze," and "evaluate."

Scaria and colleagues tested five different LLMs on educational question generation across Bloom's levels and found that LLMs can produce relevant questions at all cognitive levels when prompted with enough context [20]. But automated evaluation metrics failed to capture the differences that human evaluators noticed. The models produced questions that looked right on the surface but missed the pedagogical purpose.

Hwang and colleagues raised a deeper methodological concern: can Bloom's taxonomy even be meaningfully applied to LLM outputs? [21] Given that LLMs generate answers through statistical pattern matching rather than cognitive processing, assigning "cognitive levels" to their outputs may be a category error. The model is not remembering, understanding, or analyzing. It is predicting the next token in a way that resembles those cognitive activities.

This matters for educational content generation because Bloom's taxonomy is how educators think about learning objectives. If an LLM cannot reliably produce assessment questions at higher cognitive levels, or if its apparent mastery at lower levels conceals systematic biases, then using LLM-generated content without expert review introduces risks that are invisible to the teacher who trusts the output.


The Paradox of Learning Outcomes

The most important question about LLM-generated educational content is also the hardest to answer: do students actually learn more?

Gregory Kestin at Harvard built a custom AI tutor called PS2 Pal for introductory physics [1]. It was not raw ChatGPT. It was pedagogically engineered: students could only advance one step at a time, the model was fine-tuned on worked solutions, and it refused to give direct answers. In a randomized controlled trial with 194 students, the AI tutor group scored more than twice as high on post-tests and reported higher motivation and engagement than the active-learning control group.

Pardos and Bhandari at UC Berkeley tested a different setup [22]. With 274 participants, they found that the ChatGPT condition produced statistically significant learning gains compared to a no-help control. There was no significant difference between ChatGPT and a human tutor.

But Lehmann and colleagues told a different story [2]. In their study, 120 participants used either ChatGPT or traditional study methods. On immediate tests, both groups performed similarly. But after 45 days, the ChatGPT group retained significantly less: 57.5% versus 68.5%. The effect size was Cohen's d = 0.68, which is not small.
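For readers unfamiliar with the statistic, Cohen's d is the difference between two group means divided by a pooled standard deviation. The sketch below uses the common pooled-sample-variance form and made-up scores, not data from the study. If the reported value was computed this way, an 11-point gap (68.5 minus 57.5) with d = 0.68 would imply a pooled standard deviation of roughly 16 points.

```python
import math
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

# Invented retention scores (percent correct), for illustration only.
traditional = [70, 85, 55, 60, 75]
chatgpt = [60, 72, 45, 50, 63]

print(round(cohens_d(traditional, chatgpt), 2))

# Back-of-envelope check on the reported figures: gap / d = implied pooled SD.
print(round((68.5 - 57.5) / 0.68, 1))  # 16.2
```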

The sharpest finding came from Bastani, Bastani, and Sungu at the University of Pennsylvania [23]. In a study of roughly 1,000 Turkish high school students, those given unfettered access to GPT-4 scored 17% lower on subsequent unaided exams compared to a control group. But when a second group used a tutor-style version of GPT-4 with pedagogical guardrails that prevented it from giving direct answers, the harm disappeared.

Wang and Fan's 2025 meta-analysis of 37 studies found an overall positive effect: Hedges' g = 0.577 [24]. But the confidence interval was wide, and heterogeneity across studies was high.

The pattern across these studies tells a consistent story. Raw LLM access can hurt learning by removing the productive struggle that makes memories stick. Pedagogically designed LLM tools that scaffold the learning process can match or exceed traditional instruction. The variable is not "AI or no AI." The variable is design. As cognitive science has known since Robert and Elizabeth Bjork described desirable difficulties [25], learning that feels easy often does not last. And LLMs, by default, make everything feel easy.


Do LLMs Understand What They Write?

This question sparked one of the fiercest debates in cognitive science since the Chomsky-Skinner exchange of the late 1950s.

Kyle Mahowald, Anna Ivanova, Evelina Fedorenko, and colleagues published a framework in 2024 that may be the clearest way to think about it [26]. They distinguish between formal linguistic competence and functional linguistic competence. Formal competence covers grammar, syntax, word prediction, and stylistic fluency. Functional competence covers using language to reason about the world, track physical causation, understand social dynamics, and draw valid inferences.

Their finding: LLMs are surprisingly good at formal competence. Their functional competence is spotty and inconsistent. A model can write a grammatically perfect explanation of why ice floats on water without having any representation of what water is, what ice is, or what floating means. It has learned the pattern of how those words relate to each other. It has not learned the physics.

R. Thomas McCoy, Shunyu Yao, and colleagues tested this directly [27]. They coined the phrase "embers of autoregression" to describe systematic biases in LLM behavior that trace back to the next-token prediction objective. In cipher-decoding tasks, GPT-4 scored 51% accuracy when the answer was a high-probability word sequence and only 13% when it was a low-probability one. The model was not reasoning about the cipher. It was predicting which words were most likely to follow the pattern, and when the correct answer happened to be a likely sentence, it got lucky more often. Even reasoning-optimized models like OpenAI's o1 showed the same pattern [28].

On the other side of the debate, Steven Piantadosi argued that modern language models implement genuine theories of language and refute Chomsky's claim that statistical learning cannot capture linguistic competence [29]. Jordan Kodner and colleagues pushed back sharply: humans achieve linguistic competence from orders of magnitude less data than LLMs require. Their memorable analogy: "the implications of LLMs for our understanding of the cognitive structures underlying language are like the implications of airplanes for understanding how birds fly" [30].

Emily Bender, Timnit Gebru, and colleagues framed the concern differently in their influential 2021 paper [31]. They argued that human interlocutors impute meaning where there is none, mistaking fluent text for genuine communication. The danger for education is clear. A student reading an LLM-generated explanation may experience the text as meaningful. The model experienced nothing. Whether the content is accurate depends entirely on whether the training data happened to contain correct information in patterns the model could reproduce.

For the question of how LLMs generate educational content, this debate has a practical resolution. The mechanism is pattern matching, not understanding. But the patterns were learned from millions of educational documents written by people who did understand. When the patterns align with truth, the output is useful. When they diverge, the output is confidently wrong. And the model cannot tell the difference.


When Context Windows Swallowed Entire Textbooks

One of the most consequential recent advances is the expansion of context windows. The original Transformer processed roughly 512 tokens. GPT-3 handled 2,048. By 2024, Gemini 1.5 demonstrated near-perfect retrieval across 10 million tokens [32]. Current models handle 200,000 to 1,000,000 tokens in a single prompt.

What does this mean in practical terms? One million tokens is approximately 750,000 English words. That is roughly the length of an entire undergraduate textbook, or an entire semester's worth of lecture transcripts, or a complete curriculum framework plus rubric plus example assessments, all in a single prompt. A teacher can now paste an entire course worth of material into a model and say: "Generate a quiz on chapter 7 that avoids overlap with the quiz I gave on chapter 4."

But larger context windows introduced a new problem. Retrieval-Augmented Generation (RAG), proposed by Patrick Lewis and colleagues in 2020 [33], grounds the model's output in external documents retrieved at query time. With million-token context windows, the documents can be pasted directly into the prompt instead of retrieved. This eliminates retrieval errors but introduces the "lost-in-the-middle" problem: models attend more strongly to information at the beginning and end of the context, sometimes missing critical content buried in the middle.
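The retrieve-then-prompt pattern can be sketched with a toy retriever. This is an assumption-heavy illustration: crude word overlap stands in for the learned dense retriever used in the RAG paper, the course notes are invented, and the prompt wording and function names are hypothetical.

```python
def score(query, document):
    """Crude relevance: number of distinct query words that also appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words)

def retrieve(query, documents, top_n=1):
    """Return the top_n documents by overlap score."""
    return sorted(documents, key=lambda d: score(query, d), reverse=True)[:top_n]

def build_grounded_prompt(query, documents, top_n=1):
    """Place the retrieved passages in the prompt so generation is anchored to them."""
    passages = retrieve(query, documents, top_n)
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the sources below.\nSources:\n{context}\nQuestion: {query}"

# Invented course notes standing in for an external document store.
notes = [
    "Mitochondria produce most of the ATP in a cell.",
    "The Treaty of Versailles was signed in 1919.",
    "Photosynthesis converts light energy into chemical energy.",
]

prompt = build_grounded_prompt("When was the Treaty of Versailles signed?", notes)
print(prompt)
```

With a long enough context window, the alternative is to skip `retrieve` and paste all of `notes` into the prompt. That removes the chance of fetching the wrong passage but leaves the model to find the relevant line on its own.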

Chain-of-Thought prompting, demonstrated by Jason Wei and colleagues in 2022 [34], showed that asking the model to reason step by step significantly improved accuracy on complex tasks. For educational content, this means prompting strategies matter enormously. A prompt that says "Generate 10 quiz questions" produces different quality than one that says "First, identify the five most important concepts in this chapter. Then, for each concept, explain what makes it challenging for students. Then, generate one question per concept that tests understanding at the application level."

Tree of Thoughts, proposed by Shunyu Yao and colleagues in 2023 [35], extended this further. Instead of a single reasoning chain, the model explores multiple reasoning paths and evaluates them against each other. On the Game of 24 (a mathematical puzzle), GPT-4 with chain-of-thought prompting solved 4% of tasks. With Tree of Thoughts, it solved 74%. The implications for educational problem generation are significant: structured prompting can dramatically improve the quality of generated exercises.


The Bias Problem Nobody Wants to Talk About

LLMs inherit the biases of their training data, and those biases show up in educational content.

Zack and colleagues tested GPT-4 on clinical vignettes and found it exaggerated demographic disease prevalence in 89% of diseases [36]. On a sarcoidosis vignette, the model assumed the patient was Black 97% of the time and Black female 81% of the time. Responses differed significantly by gender or race/ethnicity in 23% of cases. In an educational context, this means an LLM generating case studies for medical students would systematically reinforce demographic stereotypes.

The problem extends to language. Gupta and colleagues tested LLMs across six languages and four educational tasks [37]. Performance dropped substantially for lower-resource languages. Kantharuban and colleagues showed that LLMs perform worse when prompted by non-native English speakers compared to native speakers, even when asking the same question [38]. For students in the Global South who access education through LLMs in their native languages, this means systematically lower-quality content.

Liang and colleagues revealed a related concern for academic integrity: GPT detection tools misclassified an average of 61.3% of TOEFL essays written by non-native English speakers as AI-generated [39]. At least one detector flagged 97.8% of those human-written essays. Students who write in their second language are being punished for the same statistical patterns that LLMs use.


What Cognitive Science Says About AI-Generated Learning Materials

The tension between LLM-generated content and how humans actually learn runs deeper than accuracy or bias. It touches the fundamental mechanics of memory formation.

Cognitive load theory, developed by John Sweller, distinguishes between intrinsic load (the inherent difficulty of the material), extraneous load (unnecessary complexity from poor presentation), and germane load (the mental effort that directly builds learning). Sweller himself noted that LLMs may reduce extraneous load by providing direct, clear answers [40]. But the Bastani study suggests that unfettered AI access may also reduce germane load. When the model does the thinking, the student does not build the mental models that enable transfer.

Manu Kapur's productive failure framework [41] shows that students who struggle with a problem before receiving instruction learn more deeply than those who receive instruction first. The struggle itself is the learning. An LLM that instantly provides the correct answer short-circuits this process.

Lee, Sarkar, and colleagues at Microsoft Research and Carnegie Mellon tested this directly in 2025 [42]. Across 319 knowledge workers and 936 critical-thinking examples, they found that higher confidence in AI correlated with less critical thinking. The more people trusted the AI's output, the less they evaluated it independently. Gerlich found the same pattern: AI tool usage correlated negatively with critical thinking, mediated by cognitive offloading [43].

This is the central irony of LLM-generated educational content. The better the model gets at producing fluent, accurate, well-structured explanations, the less the student needs to think. And thinking is learning.


The Numbers Behind the Adoption Wave

Regardless of the research debate, adoption is accelerating.

A Pew Research Center survey from January 2025 found that the share of US teens using ChatGPT for schoolwork doubled from 13% to 26% in a single year [44]. By December 2025, roughly two-thirds of US teens had used an AI chatbot, with 30% using one daily [45]. In the UK, the Higher Education Policy Institute found that 92% of undergraduates used AI tools by 2025, up from 66% the year before [46].

On the institutional side, California State University deployed ChatGPT Edu to 460,000 students and 63,000 staff across 23 campuses in a $17 million contract [47]. The University of Oxford became the first UK university to offer ChatGPT Edu to all staff and students [48]. By late 2025, OpenAI had sold over 700,000 ChatGPT licenses to approximately 35 US public universities.

[Figure: US Teen ChatGPT Use for Schoolwork (%), 2023 to 2025. Vertical axis: Percentage.]

The teacher side tells a parallel story. A Walton Family Foundation and Gallup survey found that 60% of US K-12 teachers used AI in 2024-2025, with weekly users reporting an average time savings of 5.9 hours per week [49]. Anthropic's education report found that 57% of educator interactions with Claude were for curriculum development [50].

But adoption does not mean readiness. EDUCAUSE found that fewer than 40% of higher education institutions have AI acceptable-use policies [51]. UNESCO's 2023 survey found that fewer than 10% of schools worldwide have formal AI policies [52]. RAND reported that AI use in schools is increasing quickly, but institutional guidance lags far behind [53].


The Regulatory Response

Governments are beginning to respond, though at different speeds and with different priorities.

The European Union's AI Act, which entered into force in August 2024 and whose obligations began applying in phases from February 2025, classifies educational AI as high-risk under Annex III [54]. Any AI system used to determine admission to educational institutions, evaluate learning outcomes, assess educational level, or monitor students during tests must meet stringent transparency, documentation, and human oversight requirements. Article 5(1)(f) outright prohibits AI systems that infer emotions in educational settings, except for medical or safety reasons.

UNESCO published guidance for generative AI in education and research in 2023 [52], recommending a minimum age of 13 for independent use of generative AI. In 2024, UNESCO released AI competency frameworks for both teachers [55] and students [56].

The US Department of Education published a report on AI and the future of teaching in May 2023 [57] and followed with an AI toolkit for K-12 leaders in October 2024. China took the most aggressive step: mandatory AI education for all K-12 students, minimum eight hours per year, starting September 1, 2025.

The regulatory picture reveals a tension. Education-focused AI regulation must balance two competing goals: protecting students from biased, inaccurate, or privacy-violating systems while not blocking access to tools that, when properly designed, demonstrably improve learning.

What Comes Next

The trajectory is clear. Models are getting better at generating educational content, and educational institutions are adopting them faster than policy can keep up.

Google's LearnLM, announced at I/O 2025, is a version of Gemini specifically tuned for pedagogical interactions [58]. Instead of giving direct answers, it guides students through reasoning steps, asks follow-up questions, and adjusts difficulty based on responses. The TeachLM project, published in October 2025, fine-tuned models on 100,000 hours of real student-tutor interactions [59]. Their finding was sobering: off-the-shelf models and existing educational fine-tunes showed only marginal improvements over base models, suggesting that prompt engineering may matter more than specialized training.

Knowledge tracing with LLMs represents another frontier. Tato and Nkambou at AIED 2025 showed that LLM-constructed Bayesian knowledge models were as effective as expert-designed ones [60]. But counter-evidence appeared quickly: a 2026 study found that specialized knowledge tracing models still outperform LLM-based approaches on standard benchmarks.
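For readers who have not met knowledge tracing before, the sketch below shows the classic Bayesian Knowledge Tracing update for a single skill. It is not the method from [60], which may use a different kind of Bayesian model; it is only a familiar example of what "tracing" a student's knowledge means in practice. The function name and the slip, guess, and learn values are illustrative assumptions, not figures from any cited study.

```python
def bkt_update(p_known, correct, p_slip=0.1, p_guess=0.2, p_learn=0.15):
    """One step of classic Bayesian Knowledge Tracing for a single skill.

    p_known: prior probability that the student has mastered the skill.
    correct: whether the student just answered an item correctly.
    p_slip, p_guess, p_learn: illustrative parameter values (assumptions).
    """
    if correct:
        # A correct answer comes from mastery without a slip, or from a lucky guess.
        evidence = p_known * (1 - p_slip)
        posterior = evidence / (evidence + (1 - p_known) * p_guess)
    else:
        # A wrong answer comes from a slip despite mastery, or from a failed guess.
        evidence = p_known * p_slip
        posterior = evidence / (evidence + (1 - p_known) * (1 - p_guess))
    # The student may also learn the skill during this practice opportunity.
    return posterior + (1 - posterior) * p_learn


# Starting from a 50% mastery estimate, one answer moves the estimate up or down.
after_correct = bkt_update(0.5, correct=True)
after_wrong = bkt_update(0.5, correct=False)
```

In an approach of this kind, the open question raised by the studies above is who supplies the model structure and parameters: a human expert, a specialized model fitted to response data, or an LLM.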

Multimodal models add another dimension. Küchemann and colleagues explored multimodal LLMs for science education through the lens of Mayer's multimedia learning theory [61]. But a 2025 study testing GPT-4o and Gemini on the Korean college entrance exam found what the researchers called a "Perception-Cognition Gap." The models could recognize visual data in schematic diagrams but failed to interpret its symbolic meaning. They could see the graph. They could not read it.

The most honest summary may come from the researchers themselves. The question is no longer whether LLMs can generate educational content. They can. The question is whether the content they generate, delivered in the way they deliver it, produces the kind of learning that lasts. And on that question, the science is still being written.

Frequently Asked Questions

How do large language models generate text?

Large language models generate text through autoregressive prediction. The model processes all previous tokens through transformer layers, computes a probability distribution over its vocabulary for the next token, selects one based on sampling parameters like temperature and top-p, appends it, and repeats. Each token is chosen based on statistical patterns learned during training on billions of words.
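The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any particular model is implemented: `sample_next_token` and `generate` are hypothetical names, and `next_logits` stands in for the transformer, which here is any function that maps the token ids so far to one raw score per vocabulary entry.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_p=0.9, rng=random):
    """Pick one token id using temperature scaling and top-p (nucleus) sampling."""
    # Temperature rescales the scores: values below 1 sharpen the distribution,
    # values above 1 flatten it. (Temperature must be greater than zero here.)
    scaled = [score / temperature for score in logits]
    # Softmax turns scores into probabilities; subtracting the max avoids overflow.
    peak = max(scaled)
    exps = [math.exp(score - peak) for score in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p keeps the smallest set of most-likely tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Sample among the kept tokens in proportion to their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def generate(next_logits, prompt_ids, max_new_tokens, **sampling):
    """Autoregressive loop: score, sample one token, append it, repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        ids.append(sample_next_token(next_logits(ids), **sampling))
    return ids


# Toy stand-in for a model: it always strongly prefers token 1.
def toy_logits(ids):
    return [0.0, 5.0, 0.0]


completion = generate(toy_logits, prompt_ids=[0], max_new_tokens=3, top_p=0.5)
```

With a low `top_p`, only the most likely token survives the cutoff, so the output becomes effectively deterministic. Raising `top_p` or `temperature` lets less likely tokens through, which is where both variety and errors come from.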

Can LLMs replace human teachers for creating educational materials?

Current evidence suggests LLMs can accelerate content creation but not replace human oversight. Studies show pedagogically designed AI tools can match human tutors on immediate learning gains. But unstructured AI access can harm long-term retention by reducing productive struggle. Expert review remains essential for accuracy, bias detection, and alignment with learning objectives.

What is the hallucination rate of LLMs in educational contexts?

Hallucination rates vary enormously by task and model. Clinical summarization with retrieval augmentation shows rates as low as 1.47%. Open-ended reference generation shows rates from 28% to 91%. Legal and medical domains show particularly high rates. Using grounded source documents and retrieval-augmented generation significantly reduces hallucination.

Do students learn better with AI-generated content?

Results are mixed. A Harvard physics study found AI tutoring produced more than twice the learning gains of traditional active learning. But a European retention study found ChatGPT users scored 11 percentage points lower after 45 days. A Wharton study found unfettered GPT-4 access reduced exam scores by 17%. The critical variable is pedagogical design, not AI access itself.

What regulations govern AI use in education?

The EU AI Act classifies educational AI as high-risk and bans emotion-inference systems in educational settings, with exceptions for medical or safety reasons. UNESCO recommends a minimum age of 13 for independent use of generative AI and has published competency frameworks for teachers and students. The US Department of Education has issued an AI report and a toolkit for K-12 leaders. China mandated AI education for all K-12 students starting September 2025.