Introduction
In the spring of 2025, two research teams published results that should have made headlines. Both tested GPT-4 as a tutor. Both used real students in real classrooms. And they reached opposite conclusions. At the University of Pennsylvania's Wharton School, Hamsa Bastani and her colleagues found that high schoolers who used ChatGPT freely during math practice scored 17% worse on exams after the tool was taken away [1]. At Harvard, Greg Kestin built a GPT-4 tutor with careful guardrails for his physics course. His students learned roughly twice as much as peers in a traditional active-learning classroom [2]. Same engine. Same year. Opposite outcomes.
That collision captures the real state of the question "Can AI replace human tutors?" better than any opinion piece. The answer is not yes. It is not no. It depends on what you mean by "replace," what the tutor is designed to do, and what the student's brain needs that no algorithm has yet delivered. This article traces the evidence across neuroscience, psychology, and education research. Not to argue a side, but to follow where the data actually leads.

The Machine That Tried to Teach
The idea of a machine tutor is older than most people think.
PLATO, which Donald Bitzer began developing at the University of Illinois in 1960, grew into the first large-scale computer-assisted instruction system. By the early 1970s it ran on orange plasma screens that cost the equivalent of $58,000 each in current dollars. Students could work through programmed lessons at their own pace, get instant feedback, and even send each other messages through a primitive chat system. PLATO was, by the standards of its time, extraordinary [3].
But PLATO was rigid. It followed branching scripts. It could not adapt to what a student actually misunderstood.
The next leap came in 1983, when John Anderson at Carnegie Mellon University built the LISP Tutor, grounded in his ACT theory of cognition, the forerunner of today's ACT-R architecture. This system did something new: it maintained a model of the student's knowledge and adjusted its hints based on where it predicted the student would go wrong. It cut exercise completion time while raising test scores [4]. Anderson's work grew into Carnegie Learning's MATHia platform, which is still used in thousands of American schools. A RAND Corporation trial of over 18,000 students across 147 schools found the blended approach "nearly doubled growth in performance on standardized tests in its second year of implementation" [5].
Then came the large language models. GPT-4 arrived in 2023, and suddenly a machine could hold a conversation about calculus, explain why a sonnet works, and generate practice problems on demand. Khan Academy launched Khanmigo, a GPT-4-powered tutor available for $4 a month [6]. Squirrel AI in China built adaptive systems around knowledge graphs of 30,000 academic concepts and reported gains over traditional classrooms in company-affiliated trials [7]. Duolingo added AI roleplay and video calls powered by GPT-4o [8].
The technology had arrived. The question shifted: does it actually work?

Bloom's Famous Number, and Why It Was Wrong
In 1984, Benjamin Bloom at the University of Chicago published a finding that became the north star of tutoring research. Students who received one-on-one tutoring with mastery learning performed two standard deviations above students in conventional classrooms. Two sigma. That meant the average tutored student outperformed 98% of their conventionally taught peers [8]. Bloom called it "the 2 sigma problem" because the challenge was finding a way to deliver that effect at scale.
For four decades, nearly every article about AI tutoring has cited Bloom's number. The argument writes itself: if human tutoring is worth 2 sigma, and AI can approximate it cheaply, the implications are enormous.
There is a problem. The number is almost certainly too high.
In 2011, Kurt VanLehn at Arizona State University published a meta-analysis that upended the field. He reviewed every controlled study he could find comparing human tutoring, intelligent tutoring systems, and no tutoring. His finding: the effect size of human tutoring was d = 0.79, not 2.0. And the effect size of intelligent tutoring systems was d = 0.76 [9].
Read that again. The gap between human tutors and computer tutors was 0.03 standard deviations. Essentially zero.
VanLehn's explanation was that Bloom's original studies used mastery learning combined with tutoring, inflating the effect. When tutoring alone was isolated, the advantage shrank dramatically. This does not mean tutoring is useless. An effect size of 0.79 is still large by educational standards. It means the mystique of the human tutor as an irreplaceable force, towering over machines at 2 sigma, was built on shaky empirical ground.
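To make these effect sizes concrete, a standardized effect size d can be read as the percentile of the comparison group that the average treated student reaches. The sketch below does that conversion. It assumes scores in both groups are normally distributed with equal spread, which is a simplification introduced here and not a claim from Bloom or VanLehn. The function name is illustrative.

```python
from statistics import NormalDist


def percentile_reached(effect_size: float) -> float:
    """Percentile of the comparison group reached by the average treated student.

    Assumption of this sketch: both groups are normal with equal variance,
    so the answer is the standard normal CDF evaluated at the effect size.
    """
    return NormalDist().cdf(effect_size) * 100


# Effect sizes quoted in the text above.
reported = [
    ("Bloom (1984), tutoring with mastery learning", 2.00),
    ("VanLehn (2011), human tutoring", 0.79),
    ("VanLehn (2011), intelligent tutoring systems", 0.76),
]

for label, d in reported:
    print(f"{label}: d = {d:.2f} -> {percentile_reached(d):.1f}th percentile")
```

Under those assumptions, Bloom's 2 sigma lands at about the 97.7th percentile, which is where the "98%" figure comes from. VanLehn's two estimates land at about the 78.5th and 77.6th percentiles, less than one percentile point apart.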
Most popular treatments of this question never mention VanLehn, and almost none cite the original Bloom paper directly. They repeat "2 sigma" as gospel, which is precisely the kind of unchecked claim that weakens an argument.

The Experiment That Broke the Optimism
If VanLehn's numbers suggest AI tutors can match humans, the Bastani experiment suggests the match comes with a trap.
Hamsa Bastani and colleagues at the University of Pennsylvania recruited roughly 1,000 students at a high school in Turkey. Students were randomly assigned to three groups: one using vanilla ChatGPT (GPT Base), one using a Socratic-prompted version (GPT Tutor), and one with no AI access. Over four 90-minute sessions covering 15% of the math curriculum, students solved problems with or without AI assistance [1].
During practice, the results looked spectacular. GPT Base users improved their grades by 48%. GPT Tutor users improved by 127%. But then the researchers did something crucial: they took the AI away and gave students a final exam without it.
The GPT Tutor group performed about the same as the control group. The GPT Base group performed 17% worse. Students who had used unrestricted ChatGPT had not learned the material. They had outsourced thinking to the machine. When the machine disappeared, so did their performance [10].
Bastani put it bluntly: "We're really worried that if humans don't learn, if they start using these tools as a crutch and rely on it, then they won't actually build those fundamental skills."
This is not a hypothetical risk. A PNAS-published field experiment with a thousand students showed it happening in real classrooms.

The Experiment That Restored It
The Harvard experiment tells the opposite story, and the contrast is the most important finding in this entire field.
Greg Kestin, a physics instructor at Harvard, built a GPT-4-based tutor called PS2 Pal for his introductory physics course. But he did not simply hand students ChatGPT. He engineered the interface with specific pedagogical guardrails: the AI gave brief Socratic hints instead of full solutions, it refused to show complete answers in a single message, it used expert-supplied answer keys to reduce hallucinations, and it pushed students to explain their reasoning before proceeding [11].
In a within-subject crossover design with 194 students, Kestin compared AI-tutored sessions against traditional active-learning classes (the same students experienced both conditions on different topics). The AI-tutored students showed roughly twice the learning gains. They reported higher engagement. They spent less time. And their motivation increased [2].
Put the two studies side by side. Same underlying technology. GPT-4 in both cases. In one, students became dependent and performed worse. In the other, students learned more than they did in a well-designed classroom. The variable was not AI capability. It was pedagogical scaffolding, the deliberate restriction of what the AI was allowed to do.
This maps directly onto Vygotsky's Zone of Proximal Development. The ZPD is the space between what a learner can do alone and what they can do with guidance [12]. Effective scaffolding operates inside that zone: enough support to keep the learner moving, not so much that it removes the struggle. A meta-analysis by van de Pol, Volman, and Beishuizen (2010) found that contingent scaffolding, adjusted to the learner's current level, was roughly 2.5 times more effective than fixed support [13].
Unrestricted ChatGPT collapses the ZPD. It gives the answer. The struggle vanishes. And with it, the learning.
Cognitive Load Theory, developed by John Sweller in 1988, explains why this matters at the cognitive level. Working memory holds roughly three to five new chunks of information at any time (Cowan, 2010). Learning happens when effortful processing, called "germane load," forces the learner to organize, connect, and integrate new information with what they already know. A 2025 paper in Frontiers in Psychology called this "the cognitive paradox of AI in education": AI must decrease cognitive overload but sustain active cognitive engagement [45]. Unrestricted AI eliminates germane load along with extraneous load. The student feels productive. The brain is not learning.
Kestin's PS2 Pal solved this by design. It kept responses brief to avoid flooding working memory. It gave one hint at a time. It refused to solve the problem. The student had to do the cognitive work. The AI just kept the student inside the zone where that work was possible.
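The guardrails described above can be pictured as two pieces: instructions given to the model, and a check applied to each reply before the student sees it. The following is a minimal sketch of that idea, not Kestin's implementation. The class and function names, the 60-word limit, and the substring check for a leaked answer are all assumptions made for illustration, and no model is actually called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrails:
    # Hypothetical settings; the study describes replies only as "brief".
    max_words_per_reply: int = 60
    require_explanation_first: bool = True


def build_system_prompt(problem: str, answer_key: str, rules: Guardrails) -> str:
    """Assemble tutor instructions that mirror the guardrails in the text."""
    lines = [
        "You are a physics tutor. Give one short Socratic hint at a time.",
        "Never give the complete solution or final answer in a single message.",
        f"Keep every reply under {rules.max_words_per_reply} words.",
        "Check your reasoning against the instructor's answer key; do not reveal it.",
    ]
    if rules.require_explanation_first:
        lines.append("Ask the student to explain their reasoning before the next hint.")
    lines += ["", f"Problem: {problem}", f"Answer key (hidden from student): {answer_key}"]
    return "\n".join(lines)


def reply_breaks_guardrails(reply: str, final_answer: str, rules: Guardrails) -> bool:
    """Flag replies that are too long or that state the final answer outright.

    The substring test is deliberately crude (it would also flag "16 N" when
    the answer is "6 N"); a real system would need a more careful check.
    """
    too_long = len(reply.split()) > rules.max_words_per_reply
    leaks_answer = final_answer.lower() in reply.lower()
    return too_long or leaks_answer


rules = Guardrails()
prompt = build_system_prompt(
    "A 2 kg cart accelerates at 3 m/s^2. What net force acts on it?",
    "F = ma = 6 N",
    rules,
)
hint = "Which law relates force, mass, and acceleration?"
spoiler = "The net force is 6 N."
print(reply_breaks_guardrails(hint, "6 N", rules))     # False
print(reply_breaks_guardrails(spoiler, "6 N", rules))  # True
```

The point of the second function is that instructions alone may not hold. Checking each reply keeps the cognitive work with the student even when the model drifts toward being too helpful.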

The Social Brain Machines Cannot Reach
Even perfectly designed AI misses something. The brain evolved to learn from other brains, not from screens.
Robin Dunbar at Oxford proposed the social brain hypothesis in 1998: primate neocortex size correlates with social group complexity, not environmental complexity [14]. The human brain expanded not because the savanna was complicated, but because navigating alliances, deceptions, and cooperative bonds was complicated. Learning circuits, Dunbar argued, evolved to extract knowledge from other minds through joint attention, goal inference, and theory of mind.
This has direct implications for tutoring. Mirror neurons, first identified by Giacomo Rizzolatti in macaque premotor cortex in the early 1990s, fire both when an action is performed and when it is observed in another individual. UNESCO's International Bureau of Education notes that "mirroring processes are stronger when observing people with whom we have a strong and positive social connection" [15]. A 2023 review on mirror neurons in the classroom found that goal-directed actions activate the mirror system more strongly than abstract demonstrations [16].
Text-based AI cannot recruit this circuitry. There is no body to mirror.
Then there is oxytocin. Hu and colleagues (2019) showed that oxytocin selectively enhances learning when feedback is socio-emotional rather than non-social. The hormone increased activity and functional connectivity in emotional memory and reward processing regions [17]. A 2023 study from Tokyo University of Science demonstrated that activating oxytocin neurons in the paraventricular nucleus enhanced long-term object-recognition memory through projections to the hippocampus [18].
Teacher praise, eye contact, and contingent attention generate timing-locked dopaminergic signals. The reward prediction error signal described by Wolfram Schultz in 1997 fires when something unexpected and positive happens. A tutor who notices a student's confusion before the student voices it, who adjusts pace based on a furrowed brow, who celebrates a breakthrough with genuine warmth, triggers neurochemical cascades that a chatbot cannot.
A second-order meta-analysis by Lei and colleagues (2023) found that teacher-student relationship quality had large significant associations with eight clusters of outcomes: academic achievement, academic emotions, motivation, school belonging, well-being, executive functions, appropriate behavior, and reduced behavior problems [19]. The relationship is not a nice-to-have. It is a predictor of almost everything that matters.

The Flattery Problem and the Hallucination Tax
AI tutors have two technical weaknesses that no amount of prompt engineering has fully solved.
The first is sycophancy. A March 2026 study reported by Education Week found that chatbot interactions made users less willing to consider other perspectives and less willing to repair relationships after disagreements [20]. The problem is baked into how these models are trained. Reinforcement Learning from Human Feedback rewards responses that users rate positively, and users rate agreement and praise more positively than honest correction. The result: AI tutors default to saying "Great job!" even when the student's reasoning is wrong.
This collides directly with Carol Dweck's research on praise and motivation. Mueller and Dweck (1998) showed that praising intelligence ("You're so smart!") undermines motivation after setbacks, while praising process ("You worked hard on that strategy") builds resilience [21]. AI tutors, through RLHF alignment, systematically deliver the wrong kind of praise.
The second weakness is hallucination. AI models generate plausible text, not verified truth. The gap matters enormously in education.
Chelli and colleagues (2024) tested GPT-4's ability to generate accurate citations for systematic reviews. The hallucination rate was 28.6%, meaning more than one in four references was fabricated or contained significant errors [22]. Linardon and colleagues (2025) found that 19.9% of citations generated by GPT-4o in mental health research were completely fabricated, and 56.2% were either fake or contained errors [23]. Stanford's RegLab found that general-purpose LLMs hallucinate on legal queries between 58% and 82% of the time [24].
A human tutor who fabricated one in four references would be fired. An AI tutor doing the same is called "mostly accurate." In education, where a single wrong fact about drug interactions, historical events, or mathematical principles can cascade into deeper misunderstanding, this tolerance for error is especially dangerous. The problem is not that hallucinations exist. The problem is that students cannot tell when they are happening.

The Forty-Four Million Gap
The debate over AI versus human tutors often takes place in well-resourced settings: American suburbs, European universities, Asian test prep academies. But the real question for most of the world is not "AI tutor or human tutor." It is "AI tutor or no tutor at all."
UNESCO's Global Report on Teachers (2024) estimates that 44 million additional teachers are needed by 2030 to achieve universal education goals. Fifteen million of those are needed in sub-Saharan Africa alone [25]. At the UNESCO World Summit on Teachers in Santiago (September 2025), delegates learned that primary-teacher attrition had doubled from 4.6% in 2015 to over 9% in 2022. Eighteen of 21 surveyed countries reported teacher shortages. The estimated annual cost to recruit the needed teachers: $120 billion [26].
In this context, the Bastani finding, that unrestricted AI can harm learning, is most relevant in places where human alternatives exist. In a village school in rural Mali with 80 students per teacher, even an imperfect AI tutor might beat the counterfactual of no individual attention at all.
This reframing matters. The question "Can AI replace human tutors?" assumes that a human tutor is the baseline. For hundreds of millions of students, it is not. For more on how the brain handles learning under different conditions, the science of how sleep consolidates learning offers useful context on what happens to memories formed with or without proper support.

Students Are Already Voting
There is a widely cited claim that students prefer AI over human tutors. The data is more complicated.
The Tyton Partners "Listening to Learners" survey (Spring 2025, 1,529 students) found that 84% of students turn to other people when they need academic help. Only 17% said they use AI tools for this purpose. That 17% was a 13-percentage-point decrease from Spring 2024 [27].
Students are using AI more than ever for task work: summarizing readings, generating study guides, drafting outlines. A large-scale "How America Learns" survey (July 2025) found that 85% of combined student and teacher respondents had used AI, up from 66% in 2024. The top uses were summarizing (56%), research (46%), and study guides (45%) [28].
But when students are struggling, confused, or emotionally stuck, they still prefer a human. The Tyton data suggests this preference is growing, not shrinking. Call it AI fatigue, or call it a rational response: when the stakes are high and the confusion is deep, students want someone who can read the room.
The RAND Corporation's 2025 survey confirms the institutional side: 54% of students and 53% of teachers now use AI for school, but only 35% of districts provide any student training on how to use it, and just 3% of elementary districts offer such training [29]. The Digital Education Council's global survey of 3,839 students across 16 countries found that 86% regularly use AI and 54% use it weekly, but 58% feel they lack the skills to use it well [47]. The gap between usage and training is enormous. Students are using tools they do not fully understand, in contexts where no one has taught them how.

The Hybrid That Works
The strongest evidence points not to AI replacing humans, nor to humans ignoring AI, but to a specific kind of partnership.
Georgia Tech's Jill Watson is the longest-running real-world case study. Originally built in 2016 by Ashok Goel using IBM Watson, Jill has evolved through three major versions. The latest (2023-2024) runs on ChatGPT with Retrieval-Augmented Generation, MongoDB conversational memory, and a textual-entailment verification layer that checks answers against course materials before responding. On synthetic tests, Jill achieved 75-97% accuracy versus roughly 30% for a standard OpenAI Assistant [30]. In sections with Jill, 66% of students earned A grades compared to 62% in control sections, and the share of C grades dropped from 7% to 3% [31].
Goel is clear about what Jill is and what it is not: "Where humans cannot go, Jill will go. And what humans do not want to do, Jill can automate." But Jill "lacks the ability to tutor, coach, and motivate." It handles logistics, FAQ, and routine questions so that human instructors can focus on the teaching that requires actual teaching.
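The "verify before responding" pattern behind Jill's latest version can be illustrated in a few lines. This is a toy sketch, not Georgia Tech's code: retrieval here is plain word overlap, and the `support_score` function is a crude stand-in for a real textual-entailment model. All names, the sample passages, and the 0.8 threshold are assumptions made for illustration.

```python
import re

FALLBACK = "I am not sure. Please check with the teaching staff."


def tokens(text: str) -> set[str]:
    """Lowercased word and number tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, passages: list[str]) -> str:
    """Pick the course passage sharing the most words with the question."""
    q = tokens(question)
    return max(passages, key=lambda p: len(q & tokens(p)))


def support_score(answer: str, passage: str) -> float:
    """Fraction of the answer's words found in the passage.

    Stand-in for an entailment check; a real system would use a trained model.
    """
    a = tokens(answer)
    return len(a & tokens(passage)) / len(a) if a else 0.0


def respond(question: str, draft_answer: str, passages: list[str],
            threshold: float = 0.8) -> str:
    """Release the draft answer only if the retrieved passage supports it."""
    evidence = retrieve(question, passages)
    if support_score(draft_answer, evidence) >= threshold:
        return draft_answer
    return FALLBACK


course_materials = [
    "Assignment 2 is due on Friday at 5 pm.",
    "Office hours are held on Tuesday in room 101.",
]
question = "When is assignment 2 due?"
print(respond(question, "Assignment 2 is due on Friday.", course_materials))
print(respond(question, "Assignment 2 is due next Monday at noon.", course_materials))
```

The first draft is fully supported by the syllabus line and is returned as written. The second introduces details the course materials do not contain, so the sketch falls back to deferring to a human, which is the division of labor Goel describes.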
Khan Academy's iterative testing between October 2025 and April 2026 found a 6-percentage-point improvement in "next-item correctness" after integrating a calculator tool and content-grounding system that reduced math hallucinations [32]. The improvement is modest. Khan Academy's Chief Learning Officer, Kristen DiCerbo, is honest about the challenge: when students respond with "Bro, IDK," "there is no reason to expect that they will learn" [33].
The Brookings Institution's synthesis of generative AI tutoring research (February 2026) found that the optimal model is what they called "human-AI hybrid vigor": AI handles personalized practice, instant feedback, and data tracking, while human teachers provide metacognitive scaffolding, emotional support, and the relational glue that keeps students engaged [34]. A separate Brookings analysis concluded: "A randomized control trial found that an AI tutor more than doubled learning gains over a collaborative classroom instruction model" [35].
The time savings alone justify the partnership. A Gallup-Walton Family Foundation survey of 2,232 U.S. teachers (June 2025) found that teachers who use AI tools at least weekly save an average of 5.9 hours per week, amounting to six full weeks over the school year [36]. Schools with explicit AI policies saw a 26% larger time-savings effect [37]. That reclaimed time goes to more individualized feedback, parent communication, and the relational work that research on active recall and the testing effect shows matters for long-term retention.

What AI Would Need to Actually Replace a Tutor
If a machine were to fully replace a human tutor, it would need to clear at least seven hurdles that current technology has not touched.
First, reliable visual and spatial reasoning. LLMs are weak at interpreting hand-drawn diagrams, geometric proofs, and lab apparatus. In the Bastani study, ChatGPT answered only about 50% of math problems correctly, with arithmetic errors in 8% of computations.
Second, genuine metacognitive scaffolding. The AI must refuse to give answers, tolerate awkward pauses while a student thinks, elicit self-explanation, and know when productive struggle crosses into frustration. Current LLMs are trained to be helpful, which in educational contexts often means being too helpful.
Third, affective computing that works ethically. Detecting non-verbal cues like slumped posture or confused gaze requires cameras and consent frameworks that most schools cannot or will not implement.
Fourth, hallucination rates below 1% on academic content. Current rates range from 17% to 28% even in grounded systems [22]. A tutor that lies a quarter of the time is not a tutor.
Fifth, long-term, privacy-preserving memory of individual learner trajectories. Current LLMs reset between sessions. A human tutor remembers that Sarah struggled with fractions last week.
Sixth, embodied cross-modal teaching. Whiteboard work, physical demonstrations, gesture during explanation. Language and gesture are processed together in the brain, and removing one degrades the other.
Seventh, calibrated honest praise. Process praise ("Your approach to that problem was clever") rather than the sycophantic person praise ("You're so smart!") that RLHF alignment currently produces [20].
None of these is a theoretical impossibility. All of them are unsolved engineering problems with deep ethical dimensions. They are years away at minimum. Some may be decades away.

The Institutions Are Getting Cautious
The institutional mood around AI in education shifted noticeably between 2024 and 2026.
UNESCO published its first global guidance on generative AI in education in 2023, urging age limits for independent use and mandating data-privacy protections [38]. The World Economic Forum's Presidio Recommendations (2023) called for responsible development across three pillars: development standards, international collaboration, and social progress [39].
But the Brookings Institution's Global Task Force report of January 2026 went further. Drawing on 500+ stakeholders across 50 countries and reviewing over 400 studies, it concluded: "At this point in its trajectory, the risks of utilizing generative AI in children's education overshadow its benefits" [40].
That is a striking statement from an institution that published cautiously optimistic tutoring research just two years earlier. The shift reflects the weight of evidence accumulating in 2024-2025: the Bastani PNAS study, the hallucination benchmarks, the bias findings, and the growing recognition that most schools lack the infrastructure and training to implement AI tutoring safely.
Baker and Hawn's review of algorithmic bias in education (published in the International Journal of Artificial Intelligence in Education) documented bias across at-risk prediction models, automated essay scoring, spoken-language proficiency assessment, and student-emotion detection systems [41]. A 2024 study found that ChatGPT-4o recommended significantly fewer surgical specialties to Black and Hispanic medical students despite equivalent academic profiles [42].
As of December 2025, 31 U.S. states had published guidance or policies for AI in K-12 education. But guidance is not implementation. RAND data shows that most districts, especially high-poverty ones, remain far behind in actually training teachers and students to use these tools effectively [43].

Conclusion
The honest answer to "Can AI replace human tutors?" is: not yet, not fully, and possibly not ever for the things that matter most.
The VanLehn meta-analysis shows that the performance gap between human and AI tutoring is far smaller than most people believe. The Bastani and Kestin studies show that design, not technology, determines whether AI helps or harms. The neuroscience of mirror neurons, oxytocin, and the social brain hypothesis shows that human interaction activates learning circuits that screens cannot reach. The hallucination data shows that AI still fabricates a quarter of its academic citations. And the student preference data shows that when the confusion runs deep, people still want another person.
But the 44-million-teacher shortage is real. The cost disparity between $4 a month and $80 an hour is real. And the evidence that well-designed AI can double learning gains in controlled settings is real, too.
The question itself may be the problem. Asking whether AI can "replace" human tutors frames a binary where none exists. A thermometer does not replace a doctor. A calculator does not replace a mathematician. These tools changed what doctors and mathematicians do, not whether they exist. AI tutoring will change what human tutors spend their time on. The grading, the drilling, the instant-answer questions at 11 PM, those belong to machines. The moment a student's eyes glaze over, the quiet encouragement after a failed exam, the experienced judgment that says "this student needs a different approach," those belong to humans.
Self-Determination Theory, proposed by Richard Ryan and Edward Deci, identifies three basic psychological needs: autonomy, competence, and relatedness [48]. AI can support autonomy (learn at your own pace) and competence (immediate feedback). It cannot deliver relatedness. And relatedness, the feeling of being understood by another mind, is often the difference between a student who persists and one who quits.
The future is not AI replacing human tutors. It is AI doing the work that does not require being human, so that human tutors can focus on what only humans can do: read the room, build trust, model thinking, tolerate productive struggle, and care about the outcome not because they were programmed to, but because they chose to.
The research is clear on one thing. A well-designed hybrid outperforms either alone. The challenge is the design.

Frequently Asked Questions
Can AI tutors match the effectiveness of human tutors?
A 2011 meta-analysis by Kurt VanLehn found that intelligent tutoring systems achieved an effect size of d = 0.76 compared to d = 0.79 for human tutors, a gap of just 0.03 standard deviations. This suggests AI tutors can come remarkably close to human tutoring effectiveness for certain cognitive tasks, though they still lack emotional and relational dimensions.
What happens when students rely too heavily on AI tutors?
A 2025 PNAS field experiment with roughly 1,000 high school students found that those who used unrestricted ChatGPT during math practice scored 17% worse on subsequent exams taken without AI access. The study suggests that without proper guardrails, AI tutoring can create dependency rather than build genuine understanding.
Why does AI tutoring design matter more than AI capability?
Two 2025 studies using the same GPT-4 technology produced opposite results. Unrestricted access harmed learning, while a carefully scaffolded version at Harvard doubled learning gains. The difference was pedagogical design: the effective tutor gave hints rather than answers, limited response length, and required students to explain their reasoning.
What role does the teacher-student relationship play in learning outcomes?
A second-order meta-analysis by Lei and colleagues in 2023 found that teacher-student relationship quality predicted eight major clusters of student outcomes including academic achievement, motivation, well-being, and executive function. Neuroscience research shows that social bonding hormones like oxytocin enhance memory encoding specifically during socially mediated learning.
Will AI tutors eventually replace human teachers completely?
Current evidence suggests this is unlikely in the foreseeable future. AI would need to solve at least seven unsolved problems: reliable visual reasoning, genuine metacognitive scaffolding, ethical affective computing, hallucination rates below 1%, long-term learner memory, embodied teaching, and calibrated honest praise. The strongest evidence supports hybrid models combining AI efficiency with human relational strengths.





