New University of Georgia research explores how language models like Mixtral compare to human educators when evaluating student work
For teachers drowning in stacks of papers to grade, artificial intelligence might offer a glimmer of hope. Recent research from the University of Georgia suggests that large language models (LLMs) could help streamline the grading process, though the technology still falls short of fully replacing human judgment. The study, published in Technology, Knowledge and Learning, reveals both the potential and limitations of using AI tools to evaluate complex student assignments.
The research comes at a critical time when many educators face mounting pressure to implement interactive learning approaches while still providing timely feedback to students.
The Challenge of Modern Science Education
Science teachers face a particular challenge with the adoption of Next Generation Science Standards, which emphasize student argumentation and investigation rather than simple memorization. These complex assignments create a substantial grading burden.
“Asking kids to draw a model, to write an explanation, to argue with each other are very complex tasks,” said Xiaoming Zhai, corresponding author of the study, an associate professor, and director of the AI4STEM Education Center in UGA’s Mary Frances Early College of Education. “Teachers often don’t have enough time to score all the students’ responses, which means students will not be able to receive timely feedback.”
This feedback bottleneck can significantly impact student learning progress, especially when timely guidance is critical for developing scientific thinking skills.
How AI Approaches the Grading Process
The study examined how Mixtral, a large language model, evaluated middle school students’ written responses to science questions. In one example, students were asked to create a model showing what happens to particles during heat energy transfer. The researchers wanted to see how the AI’s grading process compared to human methods.
While the AI could generate assessment rubrics and assign scores almost instantly, the researchers discovered fundamental differences in how the technology approached evaluation compared to human teachers.
Key Findings About AI Grading:
- LLMs can grade responses dramatically faster than human teachers
- AI often relies on spotting keywords rather than evaluating complete understanding
- Without human-made rubrics, AI grading achieved only 33.5% accuracy
- When given access to human rubrics, accuracy improved to just over 50%
- AI tends to “over-infer” student understanding based on limited evidence
The research team found that AI graders often take shortcuts in their evaluation process, looking for specific keywords rather than assessing the underlying logic and reasoning in student responses.
“Students could mention a temperature increase, and the large language model interprets that all students understand the particles are moving faster when temperatures rise,” said Zhai. “But based upon the student writing, as a human, we’re not able to infer whether the students know whether the particles will move faster or not.”
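The keyword shortcut Zhai describes can be made concrete with a toy example. The sketch below is purely illustrative and is not the study's actual method: the keyword list, responses, and scoring function are all invented here to show why presence of a term like "temperature" is weak evidence of the causal reasoning a rubric asks for.

```python
import re

# Hypothetical rubric keywords for the heat-transfer modeling task.
KEYWORDS = {"temperature", "particles", "faster"}

def keyword_score(response: str) -> int:
    """Count rubric keywords present -- the shortcut a grader might take."""
    words = set(re.findall(r"[a-z]+", response.lower()))
    return len(KEYWORDS & words)

# A response can mention temperature without demonstrating that the
# student connects rising temperature to faster particle motion.
shallow = "The temperature goes up when you add heat."
complete = "When temperature rises, the particles move faster."

print(keyword_score(shallow))   # partial keyword hit, no causal chain
print(keyword_score(complete))  # all keywords, but counting still can't
                                # verify the reasoning behind them
```

Even the "complete" response scores well only because the right words co-occur; as Zhai notes, a human grader still has to judge whether the student actually understands the mechanism, which is why detailed rubrics (discussed next) help but do not close the gap.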
Improving AI Grading Accuracy
Despite these limitations, the researchers see potential for improvement. By providing AI with detailed human-created rubrics that outline specific evaluation criteria, the technology’s accuracy can be significantly enhanced. These rubrics help the AI understand the deeper analytical thought processes that human graders use when evaluating student work.
Could AI eventually replace human graders entirely? The researchers caution against this approach, suggesting instead that AI might best serve as an assistant to human teachers rather than a replacement.
“The train has left the station, but it has just left the station,” said Zhai. “It means we still have a long way to go when it comes to using AI, and we still need to figure out which direction to go in.”
Real-World Impact for Educators
Despite the current limitations, teachers who participated in related research expressed enthusiasm about the potential time-saving benefits of AI grading tools.
“Many teachers told me, ‘I had to spend my weekend giving feedback, but by using automatic scoring, I do not have to do that. Now, I have more time to focus on more meaningful work instead of some labor-intensive work,’” said Zhai. “That’s very encouraging for me.”
As AI technology continues to evolve, these tools may become increasingly valuable for educators seeking to balance comprehensive assessment with manageable workloads. The key appears to be finding the right partnership between human judgment and technological efficiency—allowing teachers to focus their expertise where it matters most while leveraging AI to handle more routine aspects of evaluation.