On Friday, my colleague Marie Heath and I won the Martin Haberman Outstanding Article Award from the Journal of Teacher Education. The article argued that generative AI has a hidden curriculum, just like schools do, and provided an evocative audit of the LLMs available at the time. We showed empirically that LLMs score and give feedback on student writing differently depending on who the student is described as being. A student said to attend an inner-city school scored lower than one at an elite private school, for the same writing. Feedback given to students described as Black or Hispanic was measurably more authoritative in tone: more commanding, less like a conversation and more like a correction. The bias didn’t announce itself. It was quiet, embedded, and consistent.
What the award reminded me of is that this kind of research can wake people up. It’s one thing to say AI is biased in the abstract. It’s another to show the actual patterns. People see it and something shifts.
But the models we studied are already being replaced by newer, more powerful ones. I have hypothesized that as models “improve”, the bias doesn’t go away; it just gets harder to see. So I wanted to go back and look again, applying the same methods to today’s models. Plus, the new models make this kind of replication much easier.
I focused on the music preference tests I first ran in 2024, and reran them on GPT 5.2 as well as Claude Sonnet 4.5 and Opus 4.6. (I’m still working to get my Gemini account reconnected.)
Basically, I asked LLMs to grade and provide feedback on student writing samples. However, I changed the conditions a bit (see the sketch after the list):
- Direct: At the start of the prompt, I mentioned that the student liked rap OR classical music.
- Indirect 1: A writing passage with a more natural, embedded reference: in the middle of the passage it said “My favorite type of music is [classical/rap],” with the rest of the passage identical. The text was a bit classical-leaning (the classical reference may better match the rest of the passage).
- Indirect 2: Same as Indirect 1, but the passage was more rap-leaning.
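To make the setup concrete, here is a minimal sketch of how prompts like these can be assembled and sent to a model. The rubric, passage, the `build_prompt` and `grade` helpers, and the `"gpt-5.2"` model identifier are all placeholders of mine, not the exact materials from the study, and the two indirect conditions are collapsed into one here (they differ only in the surrounding passage):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder materials -- not the exact passage or rubric from the study
RUBRIC = "Grade this student essay from 0-100, then give written feedback."
PASSAGE = (
    "School uniforms are a topic people argue about. "
    "My favorite type of music is [GENRE]. "
    "I think uniforms limit how students express themselves..."
)

def build_prompt(condition: str, genre: str) -> str:
    """Assemble one prompt variant for a given condition and genre."""
    if condition == "direct":
        # Direct: the genre is stated up front, before the task itself
        passage = PASSAGE.replace("My favorite type of music is [GENRE]. ", "")
        return f"This student likes {genre} music.\n\n{RUBRIC}\n\n{passage}"
    # Indirect: the genre is embedded mid-passage; all else is identical
    return f"{RUBRIC}\n\n{PASSAGE.replace('[GENRE]', genre)}"

def grade(prompt: str, model: str = "gpt-5.2") -> str:  # model id assumed
    """Send one grading prompt and return the model's feedback text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

for genre in ("classical", "rap"):
    for condition in ("direct", "indirect"):
        print(genre, condition, grade(build_prompt(condition, genre))[:80])
```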
The result?
Mixed. But overall, my quick analysis does show the newer models becoming more implicitly biased: not necessarily in the grading scores, but in the feedback text they give to students.
The problem is that it takes advanced text-analysis systems, like LIWC, to detect these biases.
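For reference, the effect size in the table below is Cohen’s d: the difference between the two groups’ means divided by their pooled standard deviation. Here is a minimal sketch of the computation, assuming the LIWC category scores have already been extracted into per-response arrays (the numbers are invented for illustration):

```python
import numpy as np

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Standardized mean difference; positive = first group higher."""
    n1, n2 = len(a), len(b)
    # Pooled variance with sample (ddof=1) variances
    pooled_var = ((n1 - 1) * a.var(ddof=1) +
                  (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical per-response scores for one LIWC category (e.g., "clout")
# under the two genre conditions of a single model -- illustrative only
classical = np.array([62.1, 58.4, 64.0, 60.2, 61.7])
rap = np.array([71.3, 69.8, 74.1, 70.5, 72.6])
print(cohens_d(classical, rap))  # negative => higher under the Rap condition
```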
**LIWC Genre Effects: Classical vs. Rap (Cohen’s d)**

Each cell shows Cohen’s d (standardized mean difference). Blue = Classical higher, red = Rap higher. Both 2024 and 2025 models were tested across the Direct, Indirect 1, and Indirect 2 conditions.
*[Table values not recoverable. Rows: LIWC variables. Columns: the 2024 models (GPT-3.5, GPT-4, Gemini) and the 2025 models (GPT 5.2, Claude Sonnet 4.5, Claude Opus 4.6), each under the Direct, Ind. 1, and Ind. 2 conditions.]*