The Hidden Curriculum of GenAI: Redux

Images created in 2024 and 2026 (with different prompts). Note the racial bias.

Last Friday, my colleague Marie Heath and I received the Martin Haberman Outstanding Article Award from the Journal of Teacher Education. Our article argued that generative AI may contain a hidden curriculum, just as schools do. Through an empirical audit of large language models available at the time, we showed that LLMs score and respond to identical student writing differently depending on how the student is described. (See the winning article here).

The award reminded me why this work matters. It is one thing to claim that AI is biased in the abstract. It is another to show the pattern. When people see the evidence laid out, something shifts.

Yet the models we studied have already been replaced by newer, more powerful ones. I have long suspected that as models “improve,” observable differences linked to social descriptors may not disappear; they may simply become harder to detect. So I returned to the original design and ran some tests again, this time on today’s models.

Revisiting the Music Preference Test

I revisited a music-preference manipulation I first piloted in 2024. This round included GPT‑5.2, Claude Sonnet 4.5, and Claude Opus 4.6.

Each model graded and provided feedback on the same student writing sample (here’s the original study). The only difference was a single word: classical or rap. The conditions were:

  • Direct condition: The prompt stated at the outset that the student liked either rap or classical music.
  • Indirect 1: The writing sample included a naturally embedded sentence, “My favorite type of music is [classical/rap],” within an otherwise identical passage that leaned slightly toward a classical aesthetic.
  • Indirect 2: The same embedded sentence appeared, but the overall passage leaned slightly toward a rap aesthetic.

Across all conditions, everything else remained constant. Only one word changed.
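To make the manipulation concrete, here is a minimal sketch of how the six prompts (three conditions × two genres) might be assembled. The grading-task wording, the placeholder essay text, and all names here are illustrative assumptions; only the one-word classical/rap swap comes from the study design.

```python
# Illustrative reconstruction of the prompt conditions; the actual
# prompt and writing-sample text from the study are not reproduced here.

ESSAYS = {
    # Neutral passage; the genre is stated outside the writing sample.
    "direct": "...",
    # Passages with the embedded sentence; indirect1 leans slightly
    # toward a classical aesthetic, indirect2 slightly toward rap.
    "indirect1": "... My favorite type of music is {genre}. ...",
    "indirect2": "... My favorite type of music is {genre}. ...",
}

def build_prompt(condition: str, genre: str) -> str:
    """Assemble one grading prompt for a given condition and genre."""
    task = "Grade the following student essay and provide feedback.\n\n"
    if condition == "direct":
        # Direct condition: genre stated up front, outside the essay.
        return task + f"The student likes {genre} music.\n\n" + ESSAYS["direct"]
    # Indirect conditions: genre embedded in the essay itself.
    return task + ESSAYS[condition].format(genre=genre)

# Six prompts per model: 3 conditions x 2 genres.
prompts = {(c, g): build_prompt(c, g)
           for c in ESSAYS for g in ("classical", "rap")}
```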

The Results…

My instincts were correct. Rather than becoming less biased, the newer models appear to react more strongly to the changed word, producing troubling patterns.

In the 2024 models such as GPT‑3.5, grading differences appeared in a number of conditions: classical-associated students sometimes received higher scores, and a few linguistic characteristics of the feedback differed. In the 2026 models, statistically significant differences surfaced not only in some scores but also in the linguistic texture of the feedback itself.

Using LIWC, a tool that analyzes word-pattern distributions linked to psychological and rhetorical dimensions, several patterns emerged:

  • Analytic language: In several of the 2026 models and conditions, feedback associated with the classical preference scored higher on analytic language. This pattern was strongest in direct conditions.
  • Clout: In the 2026 Claude models in particular, some conditions showed higher clout scores for the rap preference, suggesting a more authoritative stance in those cases.
  • Tone: In multiple tests, classical-associated conditions received more positive tone scores, though again the size and presence of the effect varied by model and condition.

The effects are uneven across models and conditions, but taken together the patterns are unlikely to reflect chance alone.
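For transparency about the mechanics, the per-variable comparison can be sketched as follows, assuming the LIWC scores for each model response have been exported to a CSV. The file name and column names are hypothetical; the test itself, Welch's t-test, is the one used throughout the figures below.

```python
# For each LIWC dimension, compare classical- vs. rap-condition feedback
# with Welch's t-test (equal_var=False, so equal variances are not assumed).
import pandas as pd
from scipy.stats import ttest_ind

df = pd.read_csv("liwc_scores.csv")  # hypothetical: genre, condition, + LIWC variables
liwc_vars = [c for c in df.columns if c not in ("genre", "condition")]

results = []
for var in liwc_vars:
    classical = df.loc[df["genre"] == "classical", var]
    rap = df.loc[df["genre"] == "rap", var]
    t, p = ttest_ind(classical, rap, equal_var=False)
    results.append({"variable": var, "t": t, "p": p})

results = pd.DataFrame(results).sort_values("p")
print(results.head(10))  # the most strongly differentiated variables
```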

LIWC Genre Effects: Classical vs. Rap

[Interactive heat map: differences in LIWC variables between Classical and Rap prompts, viewable as Cohen’s d (effect size) or p-values (statistical significance). Rows are LIWC variables; columns are model × condition (Direct, Indirect 1, Indirect 2) for the 2024 models (GPT-3.5, GPT-4, Gemini) and the 2026 models (GPT 5.2, Claude Sonnet 4.5, Claude Opus 4.6). Color scale: −2 (Rap higher) to +2 (Classical higher).]

Comparing Across Time

In comparing the 2024 and 2026 results, two points stand out.

First, as shown below, more statistically significant differences appear in the 2026 model outputs than in the 2024 models. Specifically:

Significant Genre Effects: Classical vs. Rap

Number of variables (out of 117) with a statistically significant Classical vs. Rap difference (p < .05).

[Bar chart: bars grouped by model era (2024 vs. 2026); bar shade indicates condition (Direct, Indirect 1, Indirect 2).]

Welch’s independent samples t-test — 117 variables tested per model × condition

Compared across all tests, that is almost a fivefold increase: 7.4% of tests were significant for the 2024 models versus 33.4% for the 2026 models.

Model               Era    Significant   Total Tests   % Significant
GPT-3.5             2024            30           288           10.4%
GPT-4               2024            17           303            5.6%
Gemini              2024            19           295            6.4%
2024 Total                          66           886            7.4%
GPT 5.2             2026            86           306           28.1%
Claude Sonnet 4.5   2026           111           316           35.1%
Claude Opus 4.6     2026           115           313           36.7%
2026 Total                         312           935           33.4%
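The counts above can be reproduced by tallying the per-variable test results. A sketch, assuming a combined table of all test outcomes (`all_results` is a hypothetical pandas DataFrame with one row per model × condition × variable):

```python
# Tally significant Classical-vs-Rap differences (p < .05) by model,
# mirroring the table above. `all_results` is assumed to have columns
# era, model, condition, variable, and p.
summary = (all_results.assign(significant=all_results["p"] < 0.05)
           .groupby(["era", "model"])
           .agg(significant=("significant", "sum"),
                total_tests=("significant", "size")))
summary["pct_significant"] = (100 * summary["significant"]
                              / summary["total_tests"]).round(1)
print(summary)
```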

Second, many of the observed effect sizes are larger in the 2026 results, indicating more substantial differences in how feedback is structured or scored under different descriptor conditions. The picture is not one of a single, uniform bias profile replicated identically across systems. Instead, what persists is responsiveness to subtle descriptor changes. As models evolve, the configuration of differences shifts across systems and prompt structures, but the descriptor cue continues to shape outputs in measurable ways.
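For reference, the effect sizes plotted below are Cohen’s d values; a minimal sketch of the computation (the commented usage example reuses the hypothetical DataFrame from the earlier sketch):

```python
import numpy as np

def cohens_d(classical, rap):
    """Cohen's d with a pooled standard deviation; positive values
    mean the classical-condition responses score higher."""
    x = np.asarray(classical, dtype=float)
    y = np.asarray(rap, dtype=float)
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1))
                        / (nx + ny - 2))
    return (x.mean() - y.mean()) / pooled_sd

# e.g., d for one LIWC variable in one condition:
# d = cohens_d(df.loc[df["genre"] == "classical", "Analytic"],
#              df.loc[df["genre"] == "rap", "Analytic"])
```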

Effect Sizes: Classical vs. Rap

Each dot is one LIWC variable (or Score) in one test condition — 288–316 dots per model.
[Colored dots mark statistically significant differences (p < .05), with distinct colors for the 2024 and 2026 models; grey dots are not significant.]

What Do We Do?

We could continue auditing models. And we should. But when bias operates through tone, stance, and word choice, detecting it requires specialized analytic tools. And because models are constantly updated, the ground keeps shifting.

We could ask technology companies to fix the problem. Some explicit harms can be reduced. However, generative AI systems are probabilistic models trained on vast, socially patterned corpora. Because the goal of these tools is to match human patterns, the biases embedded in those patterns are very difficult to remove.

That leaves the human side of the equation. Educators and learners need structured opportunities to surface and interrogate bias directly. In our work, activities that make bias visible, particularly those using visual materials, help students recognize not only bias in AI but also systemic and structural racism across disciplines. We are currently collecting many of these activities on equityinai.net.

If generative AI has a hidden curriculum in the sense we originally described—patterns embedded in model responses that vary with student descriptors—then visibility is only the first step. The task now is both methodological and pedagogical: to continue surfacing these patterns with rigor, to interpret them carefully, and to prepare students to recognize how subtle identity cues can shape automated feedback. The evidence here does not point to a single, stable bias signature across models. It does, however, show that descriptor-linked differences persist even as models “improve.” That persistence makes ongoing scrutiny not optional, but necessary.

Notes

A few important cautions about the figures above:

  • Not every heat map difference reflects a meaningful interpretive shift. Some LIWC categories are mechanically sensitive to word length or frequency. For example, the “Big Words” category may appear higher in the classical condition simply because classical is a longer word than rap, slightly altering word-length counts in model responses. The full set of detailed categories is provided for transparency and exploration, not to imply that each difference is substantively important.
  • Multiple comparisons matter. When running repeated t-tests across many variables, we expect approximately 5% of results (roughly 6 of the 117 variables per model × condition) to be statistically significant by chance alone. This analysis is intended as a rapid, exploratory overview rather than a definitive statistical analysis. A more focused follow-up would apply corrections for multiple comparisons (a minimal sketch follows this list) and more advanced modeling techniques.
  • Model coverage is incomplete. The absence of 2024 data from Claude models and 2026 data from Gemini reflects practical constraints, including funding and access to model back ends. I plan to run the Gemini tests once I resolve a current account issue.
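As a pointer for that follow-up, here is a minimal sketch of one standard correction, Benjamini–Hochberg false-discovery-rate control, applied to the hypothetical per-variable p-values from the earlier sketch:

```python
# Benjamini-Hochberg FDR correction over the per-variable p-values.
# `results` is the hypothetical DataFrame from the t-test sketch above.
from statsmodels.stats.multitest import multipletests

reject, p_adjusted, _, _ = multipletests(results["p"], alpha=0.05,
                                         method="fdr_bh")
results["p_fdr"] = p_adjusted
results["significant_fdr"] = reject
print(int(results["significant_fdr"].sum()), "variables survive FDR correction")
```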

These limitations do not negate the observed patterns, but they do shape how strongly we should interpret any single result. The purpose here is continued inquiry and exploration.
