AI in Education #7: Five Things I Learned from Bibi Groot about What Happens When You Actually Test Whether AI Tutoring Works
This is the seventh in a series of AI in Education specials, where I speak to the world’s leading experts about how AI is changing teaching and learning.
Bibi Groot started her career at the Behavioural Insights Team – the government unit that became famous for getting people to pay their taxes on time by changing a single line in a letter. She ran randomised controlled trials in further education colleges, earned a PhD in Behavioural Public Policy at UCL, had twins in 2022, and then joined Eedi as our first behavioural scientist because, as she put it, education is where her heart is.
I have wanted to get Bibi on the podcast for a while, partly because she is one of the smartest people I work with, but mainly because listeners have been asking me to go deeper into the Google DeepMind study that we published earlier this year. The one that made the news. The one that Daisy Christodoulou, Carl Hendrick, and others all held up as the gold standard of AI tutoring research. Most importantly, the one that was the subject of our first-ever (no doubt AI-generated) hype Tweet.
Finally, we have made it!
So here are five takeaways from our conversation that I think every classroom teacher and school leader will find interesting.
1. It took two years to prove that Eedi works – and the results only showed up after 12 months
Before we get to the AI stuff, I want to start with something that I think matters more than most people realise.
For years, I just assumed Eedi was effective. I had built it. I believed in the pedagogy behind it. Teachers told me they loved it. But we had no rigorous evidence that it actually improved student learning outcomes.
So we ran a randomised controlled trial. Twenty schools. 3,448 Year 7 students. Two full academic years. The study was independently evaluated by a team led by Professor Steve Higgins, who has been instrumental in developing the Education Endowment Foundation toolkit. We did not run the analysis ourselves. We pre-registered the study design and outcome measures so we could not cherry-pick the results. And we committed to publishing whatever we found.
Here is what we found. At six months, there was no clear difference between the Eedi group and the control group. At twelve months, an effect size of 0.17 – roughly two to three months of additional progress. At eighteen months, 0.46 – which is large for a platform like ours. And at twenty-four months, it stabilised at around 0.3.
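For anyone unsure what an effect size actually measures, here is a minimal sketch of one common version, Cohen's d: the gap between the two group means divided by a pooled standard deviation. The scores below are invented purely for illustration, and I am assuming this particular estimator; the independent evaluators may well have used a different, model-based one.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treatment, control):
    """Standardised mean difference using a pooled standard deviation."""
    n_t, n_c = len(treatment), len(control)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n_t - 1) * variance(treatment) + (n_c - 1) * variance(control)
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


# Hypothetical end-of-year test scores (NOT data from the trial).
eedi_scores = [52, 61, 58, 47, 66, 55, 60, 49]
control_scores = [50, 57, 55, 46, 63, 52, 58, 47]

d = cohens_d(eedi_scores, control_scores)
print(f"effect size d = {d:.2f}")
```

An effect size of 0.17 therefore means the average Eedi student scored 0.17 standard deviations higher than the average control student.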
Two things struck me about this. First, the control schools were not doing nothing. They were using Sparx, MathsWatch, whatever they wanted. So the comparison is not Eedi versus a vacuum. It is Eedi versus the normal crowded landscape of maths ed-tech.
Second – and this is crucial – the positive results did not show up for a year. If we had run a six-month study, we would have concluded that Eedi had no measurable effect on learning. Bibi made the point that a lot of ed-tech companies run short studies, find positive results, and announce them. But the file drawer problem is real: studies that find nothing are far less likely to be published. You end up with 10 positive papers and a hidden drawer full of 100 null results, and no one knows the full picture.
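The file drawer problem is easy to see in a toy simulation. The sketch below assumes a product with no true effect at all, runs many small studies, and "publishes" only those whose estimate is large and positive enough to look significant. Every number in it (200 studies, 50 students per group, the 1.96 cut-off) is an assumption chosen for illustration, not a description of any real set of studies.

```python
import random
from statistics import mean


def simulate_file_drawer(n_studies=200, n_per_group=50, true_effect=0.0, seed=1):
    """Simulate effect-size estimates and keep only the 'publishable' ones."""
    rng = random.Random(seed)
    # Approximate standard error of a standardised effect size
    # for two equal groups when the true effect is small.
    se = (2 / n_per_group) ** 0.5
    threshold = 1.96 * se  # estimates above this look "significantly positive"
    estimates = [rng.gauss(true_effect, se) for _ in range(n_studies)]
    published = [d for d in estimates if d > threshold]
    return estimates, published, threshold


estimates, published, threshold = simulate_file_drawer()
print(f"all studies: {len(estimates)}, mean effect {mean(estimates):.2f}")
if published:
    print(f"published:   {len(published)}, mean effect {mean(published):.2f}")
```

Because only estimates above the cut-off make it out of the drawer, any published study in this simulation reports an effect of roughly 0.4 or more, even though the true effect is zero.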
We also ran a second trial that found a smaller, non-significant effect. When we compared the two, the difference came down to implementation. In the trial that worked, teachers were using Eedi analytics to inform their classroom teaching – bringing the worst-answered homework question into the next day’s starter, for example. In the trial that did not work, only 33% of teachers were using Eedi in the classroom at all. Most were just setting and forgetting homework.
This aligns with something Adam Boxer has been saying for years: homework is only effective when integrated with classwork.
2. A constrained AI tutor outperformed expert human tutors on short-term transfer – and we think we know why
Now to the Google DeepMind study. This is the one that generated all the attention, and I want to be upfront about what it is and what it is not.
It is a randomised controlled trial with 165 students, run over seven weeks, comparing three conditions: Eedi’s existing static hints (you get a question wrong, you read a hint, you retry), a human tutor who chats with the student via text, and an AI tutor powered by Google’s LearnLM with a human in the loop approving every message before it reaches the student.
It is not definitive proof that AI tutoring works. It is a small first study. Bibi was clear about that.
Here are the headline results. On the retry of the same question, both the human tutor and the AI tutor achieved over 90% success, compared to 65% for the static hint. No surprise there – an interactive dialogue is harder to ignore than a sentence you can click past in two seconds.
But on the next question in the sequence – a different question testing a related but distinct concept, which is what we call short-term transfer – the AI tutor condition scored 66%, the human tutor 60%, and the static hint 56%. The AI tutor outperformed the expert human tutors.
That was not what we expected. We had hoped the AI would perform almost as well as the humans. We did not expect it to do better.
Bibi’s working hypothesis is this: the AI was prompted to be Socratic. It asked students to reflect on their own thinking, to identify where they went wrong, to work through the reasoning steps themselves. The human tutors, meanwhile – experienced maths teachers with years of practice – tended to diagnose the misconception quickly and fix it as fast as possible. They skipped the metacognitive step. One tutor said afterwards: “The AI is giving the kind of feedback I would like to give if I had more time.”
I think that line is the most important sentence in the entire study.
On safety: out of over 3,000 conversations, there were zero safeguarding issues. Only five messages contained any form of incorrect maths – and those were mostly the AI misreading an image (saying a student selected answer A when they had selected answer B). As I pointed out to Bibi, if I gave a thousand answers and five contained a small mistake, I would be more than happy with that ratio. My own error rate as a classroom teacher was considerably higher.
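The arithmetic is actually kinder to the AI than my "five in a thousand" comparison suggests. A quick sketch, assuming only that every conversation contained at least one AI message (the total message count is not given here, so this is an upper bound rather than the true rate):

```python
conversations = 3000       # "over 3,000", so treat this as a lower bound
messages_with_errors = 5   # messages containing any incorrect maths

# Each conversation has at least one AI message (assumption), so the share
# of messages with faulty maths can be no higher than this.
upper_bound_rate = messages_with_errors / conversations

# The "five mistakes in a thousand answers" comparison from the text.
comparison_rate = 5 / 1000

print(f"AI tutor, at most: {upper_bound_rate:.2%}")
print(f"five in a thousand: {comparison_rate:.2%}")
```

Since most conversations run to several messages, the real per-message error rate will be lower still.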
3. Unconstrained chatbots make students worse at maths, not better
I pushed back on Bibi. How do we know we are not wasting our time building a constrained AI tutor when kids could just screenshot a question and send it to ChatGPT?
Her answer drew on two studies that I think every teacher should know about.
The first, from a Turkish high school, randomised about a thousand maths students into two groups. One group could use GPT-4 while practising for their exam. The other could not. Then both groups sat a closed-book exam with no AI access. The students who had practised with AI did worse. They had offloaded the cognitive work to the chatbot, and when it was taken away, they had less in their heads than the students who had struggled on their own.
The second study, by the same researchers (Bastani and Poultis), looked at chess practice. Students who could request hints whenever they wanted did worse than students who received tips at regular intervals but could not request them. The researchers found that students knew requesting help was hurting their learning – but they could not stop themselves, because struggling is unpleasant and humans are built to reduce friction.
This is behavioural science at work. Make it easy to get the answer, and students will get the answer. But getting the answer is not the same as learning.
Michael Pershan says that any tutor that the tutee can control is not a tutor. Adam Boxer went further in our conversation, saying schools should be communicating to students that using AI is a waste of their time. I am not sure I would go that far. But the evidence that unconstrained chatbots hurt learning is now strong enough that I think any school allowing students to use ChatGPT for maths homework should be worried.
The AI tutor we designed with Google DeepMind is constrained in five specific ways. It only activates after an incorrect diagnostic question. Students cannot send it screenshots of their homework. It addresses only the specific misconception revealed by the question. It uses guided questioning rather than giving answers. And it ends the interaction when the misconception is resolved, or progress has stalled. These are not arbitrary restrictions. They are designed to keep the cognitive work with the student.
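To make those five constraints concrete, here is a rough sketch of how they might look as decision rules. This is not Eedi's actual code: every name in it (`TutorState`, `next_action`, `accepts_input`, `MAX_STALLED_TURNS`) is hypothetical, and the stall threshold of three turns is an assumption of mine.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    STAY_SILENT = "stay_silent"
    ASK_GUIDING_QUESTION = "ask_guiding_question"
    END_CONVERSATION = "end_conversation"


@dataclass
class TutorState:
    answered_incorrectly: bool       # constraint 1: only after a wrong answer
    target_misconception: str        # constraint 3: the one misconception in scope
    misconception_resolved: bool = False
    turns_without_progress: int = 0


MAX_STALLED_TURNS = 3  # hypothetical threshold for "progress has stalled"


def accepts_input(kind: str) -> bool:
    """Constraint 2: typed text only, so screenshots of homework are rejected."""
    return kind == "text"


def next_action(state: TutorState) -> Action:
    if not state.answered_incorrectly:
        return Action.STAY_SILENT
    # Constraint 5: stop once the misconception is fixed or progress has stalled.
    if state.misconception_resolved or state.turns_without_progress >= MAX_STALLED_TURNS:
        return Action.END_CONVERSATION
    # Constraint 4: guide with a question rather than hand over the answer.
    return Action.ASK_GUIDING_QUESTION
```

Notice that there is no action for "give the answer". The student cannot talk the tutor into doing the thinking for them, because that option does not exist.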
4. The human tutors said the AI was like observing another teacher in the classroom
One of the things I was most curious about was how our human tutors – experienced maths teachers, remember – felt about having an AI suggest messages that they then had to approve or edit.
Bibi surveyed them at the start and end of the trial and found four archetypes. Enthusiasts who were all in from the beginning and stayed there. Pragmatists who saw specific use cases and stuck to them. Ambivalents who were unsure what to expect and were still unsure by the end. And sceptics – a small group who worried that their role was changing and were concerned about their future.
The sceptics are worth taking seriously. Carl Hendrick made the point in his episode that teachers seem to accept every other industry will be affected by AI but somehow think education is immune. That is not realistic. Things are going to change. The question is how.
But what struck me most was the feedback from the tutors who engaged deeply with the process. Several said it felt like observing another teacher in a classroom. Others said the AI came up with analogies or metaphors they had not thought of, and they took those ideas into their own tutoring practice. The AI was not replacing their expertise. It was extending it.
About 75% of the AI’s suggested messages were approved and sent without any edits. Of the 25% that were edited, roughly half were small tweaks – swapping “nearly” for “not quite”, removing emojis (the AI was very fond of emojis back then), trimming a three-sentence message down to two. The larger edits tended to come at the end of conversations, where the human tutor intervened to say: the student gets it now, let them go back to the lesson. The AI, prompted to be Socratic, kept asking questions. It could not read the social signals that the student was ready to move on.
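Putting the editing figures together gives a useful rule of thumb. The sketch below treats "about 75%" and "roughly half" as exact fractions, which is a simplifying assumption, so read the result as approximate.

```python
from fractions import Fraction

approved_unedited = Fraction(3, 4)   # "about 75%" sent without edits
edited = 1 - approved_unedited       # the remaining 25%
small_tweaks = edited / 2            # "roughly half" were minor wording changes
larger_edits = edited - small_tweaks

print(f"small tweaks: {float(small_tweaks):.1%} of all messages")
print(f"larger edits: {float(larger_edits):.1%} of all messages")
```

On those assumptions, only around one message in eight needed a substantive human intervention.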
Dan Meyer flagged this exact issue in his episode – that knowing when to stop intervening is one of the most sophisticated decisions a human tutor makes. The AI is not there yet. But Bibi described where Eedi is heading: from a human-in-the-loop model (every message is checked) to a human-on-the-loop model (the AI runs independently, and a human moderator is alerted when a conversation goes off track). The human tutor is elevated from babysitter to expert intervener.
5. The next study will tell us what matters more: knowing the pedagogy or knowing the student
The Google DeepMind 2 study is running right now, as I write this, in 10 UK schools with 1,200 students. Results are expected by September 2026. And the question it is trying to answer fascinates me.
There are four arms. First, static content – the existing Eedi hints and videos. Second, a human tutor with no AI support. Third, an AI tutor armed with a strong pedagogy prompt – it knows about Newman’s error analysis, common misconceptions, effective questioning techniques – but knows nothing specific about the individual student. Fourth, an AI tutor with the same pedagogy prompt plus deep context on the student – their quiz history, the misconceptions they have struggled with in the past, how they have responded to previous tutoring interactions, what topics they are expected to struggle with next.
In other words: does it matter more to know how to teach, or to know who you are teaching?
Bibi put her prediction on the record. She thinks static content will come last. She thinks the human tutor and the pedagogy-only AI will be roughly on par. And she hopes the context-rich AI will come out on top – because if it does, it means we can build AI tutors that are meaningfully different from blank-slate chatbots.
I agree with her ranking, though I am slightly more bullish on the pedagogy-only AI beating the human tutor. How many teachers can diagnose misconceptions, think of several analogies, generate relevant follow-up questions, and find the time to work at the pace the student needs, in the hustle, bustle and demands of a busy classroom? That is essentially what our human tutors will be doing, supporting several students at once. The AI does not have these issues.
Bibi also flagged something from a Stanford study that I think is important: when an AI writing feedback tool was given access to student demographic data – gender and ethnicity – it gave Black students more praise and softer feedback than white students. The bias was baked into the training data. This is why Eedi’s AI tutor is given detailed information about what a student knows and does not know, but nothing about who they are. No name, no age, no gender, no ethnicity. Context on learning, not context on identity.
If static hints win, I am in trouble. But that is why we pre-register, publish regardless of results, and have an independent team running the analysis. It is also why this work matters. We are not building marketing material. We are trying to find out what actually helps students learn.
Over to you
This was a brilliant conversation, and I have only scratched the surface here. You can listen to or watch the full episode on the Mr Barton Maths Podcast. And do check out Bibi’s posts on LinkedIn, where she writes about behavioural science, AI in education, and the work we are doing at Eedi around the world.
We will be publishing the results of the Google DeepMind 2 study at the start of the next academic year. I am nervous and excited in roughly equal measure. If you want to follow along, subscribe to this newsletter and I will share them here first.
In the meantime, I would love to know: does the evidence so far change how you think about AI tutoring? And what question would you most want answered by the next study?
Thanks so much for reading.
Craig






I've found all of these articles very interesting, but I think this last one "brings it home" in terms of the future of AI being useful in the classroom. I too believe 'option 4' will be the winner, and I look forward to seeing the results.
Given that "the evidence that unconstrained chatbots hurt learning is now strong enough that I think any school allowing students to use ChatGPT for maths homework should be worried", how on Earth can a school actually control whether students use AI at home? To me the conclusion is clear: unsupervised work has no value in an AI age. I'm more than happy for someone to disagree with me and justify that homework has value.