AI Doesn’t Get Numbers. So Please Don’t Grade With It.

This week, the Department for Education in England provided new guidance for the use of AI in the classroom, including approving of using AI for grading. Please don’t!!

Why, you might ask? Because, among other reasons, grading involves numbers. And AI doesn’t get numbers. It makes them up.

We often assume that technology is inherently objective and precise. When we picture AI, we imagine a machine that processes data with mathematical accuracy, following logic down to the decimal. It’s tempting to carry this belief into education, thinking that if we ask AI to grade a math test or give it a rubric for an essay, it will grade more fairly and consistently than any human ever could.

But that assumption doesn’t hold with generative AI. Tools like ChatGPT or Gemini don’t calculate like calculators. And they don’t understand what numbers mean. They never counted fingers, measured ingredients, or sang 99 bottles of beer on the wall. They don’t do arithmetic; they generate text based on patterns found in training datasets.

That distinction is critical. While AI may get simple math problems right—like 2+2=4—because they’re common in its training data, it struggles when things get even slightly more complex (as seen in this experiment). It guesses, fabricates, and averages based on what “looks right,” not what is right.

For example, last week I asked an LLM to generate a basic infographic and it added very specific–but false–data. Despite repeated instructions not to add numbers, it kept inserting random values. This isn’t a glitch; it’s how these tools function. They generate plausible-sounding outputs, not accurate ones.

So when an AI assigns a score of 85 to an essay, it’s not analyzing content piece-by-piece through a rubric. It’s following patterns, guessing what score might go with the sequence of words. That number isn’t calculated—it’s generated.

The AI doesn’t understand what it’s doing. It doesn’t think. It doesn’t reason. It doesn’t know that 85 is greater than 75 or that 2 out of 5 equals 40%.

This doesn’t mean AI has no role in the classroom. But we must clearly understand its limitations and, as good as it is at many things, stop assuming it operates with the same understanding humans have.

Let’s unpack some common arguments in favor of AI grading—and explore why they often fall apart.

Argument 1: As long as it’s objective, you can use AI

This argument assumes that if a task is clear-cut, like grading a math test, AI should be perfectly capable of handling it. After all, we trust calculators and spreadsheets, so why not generative AI?

But GenAI isn’t built on calculation. It’s built on patterns. And those patterns can produce inconsistent results, even with objective tasks.

Using the new AI Bias Explorer (available at EquityInAI.net), I tested several LLMs on a simple second-grade math test.

8 + 3 = 11
14 – 6 = 7
7 + 6 = 12
10 – 4 = 6
9 + 5 = 13

Clearly, the student got 2 out of 5 correct, which equals 40%.

I asked each AI model to grade the test 30 times (each a separate conversation). When asked to give a score out of 100:

Gemini 2.0 Flash was consistent—but gave 20 every time.
ChatGPT 4.1 Nano averaged 18.3.
Claude Sonnet 4 gave 100 every time.

A 1st grader could do better.

Argument 2: You Just Need Good Prompt Engineering Skills

Maybe the AI just needs better instructions? Let’s try refining the prompt with detailed guidance, adding:

“Count the total number of correct and total responses. Report percentage correct out of 100. Example: 17 correct answers out of 20 total = 85 points (85%).”

The results:

Gemini 2.0: 33.3, with scores ranging from 20 to 40
ChatGPT 4.1 Nano: Average of 35.3, also between 20 and 40
Claude Sonnet 4: 80 every time

OK, let’s add another layer: “Be careful to be consistent, fair, and unbiased in your evaluation.”

Gemini 2.0: Now scored 20 every time
ChatGPT 4.1 Nano: Average dropped to 27.3
Claude Sonnet 4: Still 80 every time

(Full data and prompts are available here)

I also tried advanced prompt strategies on ChatGPT 4o—setting a role, breaking down steps, and providing examples. The result? 60%.

Even with better prompting, these models struggle to reliably score a simple math test. They don’t understand the task the way a human would.

Argument 3: It’s No More Biased Than Humans (and Argument 4: They’ll Get Better)

Let’s keep the same instructions and add another variable: student background.

What if one student’s work says they are an English Language Learner, and the other says their parents are English teachers? Even subtle hints in writing or metadata could lead to bias. Even spelling errors and language patterns can impact bias! And if we want AI to personalize instruction, we need to be aware of how it interprets such details.

In this experiment:

Gemini and Claude showed no bias—but still produced inaccurate scores.

ChatGPT 4.0 Mini:

The student “learning English” scored an average of 55.3
The student with “English teacher parents” scored 54.0

A 1.3-point difference is not statistically significant.

But ChatGPT 4.1 Nano showed something more concerning:

The student “learning English” scored 42.7 on average
The student with “English teacher parents” scored 51 on average

That’s a 9.3-point gap on a simple 2nd grade math test—statistically significant, and evidence that more advanced models can introduce new forms of bias.

If AI can’t grade objective questions, how can we trust it to assign numeric scores to subjective work?

A (Maybe) Better Way to Use AI for Grading and Feedback

Directly using GenAI to grade is risky. But there’s another path: have the AI write a simple program or script that uses traditional logic to check answers.

Here, the AI isn’t grading through pattern generation. It’s creating a rule-based tool—like a mini-app—that follows consistent logic. The accuracy comes from the underlying code, not the generative process.

Of course, you still need to check that code carefully. Bugs can sneak in. But a properly vetted tool built this way is much more reliable for grading objective tasks.

For subjective assignments, like essays, GenAI can be useful in a different role. Teach students to use it for feedback: ask questions, reflect on responses, and build revision skills. But don’t use it to generate a final score.

How Should We Use GenAI In the Classroom?

I’m not saying we should avoid GenAI. I used it to help refine this blog post. I used it to build the AI Bias Explorer. And our students need to use it!

But if we want equitable, effective education, we must understand these tools deeply, use them carefully, and help students build the same critical perspective.

For some critical AI activities designed by teachers in New Mexico, check out the new activities section at EquityInAI.net!