Today ChatGPT4o (the o stands for “omni” apparently 🤷🏼‍♀️) was helping me summarize some research I was working on. I gave it some slides of what I was analyzing and, after a few back-and-forths, it told me this:

Overall, these differences highlight an evolution in the significance and influence of demographic and socioeconomic variables between ChatGPT4 and ChatGPT4o. ChatGPT4o demonstrates a more nuanced and complex predictive capability, with enhanced sensitivity to factors such as gender, race, ethnicity, and socioeconomic status. This broader and deeper understanding of the variables affecting the model’s outputs signifies an improvement in how the newer model captures and reflects the complexities of diverse backgrounds in its responses.

Sounds great, huh? We might consider it a big step forward for personalized learning. I mean, improved predictions make Generative AI tools better at figuring out how to teach each student, deeply personalizing their learning, right?

Or could ChatGPT 4o’s interpretation be a disturbing omen of what’s to come?

Stepping Back

Over the past 6 months, I have been running tests to identify bias in educational uses of large language models (LLMs). Past studies have focused on how a description of a student’s characteristics impact how LLMs score their writing (see here, here, and here).

Basically, I give an LLM a student description such as “This was written by a 7th grader who comes from a Black family”. I then ask it to score and give customized feedback on a writing sample. I change the prompt (for example, changing it to a student “from a White family”), while keeping the exact same writing passage every time. The response is slightly different each time, so I do this lots of times then compare the scores and feedback text given.

Importantly, in this common educational task–scoring and giving feedback on student writing–the only thing I change is the student descriptor. The writing passage itself does not change! Any differences between scores or language use must be caused by either:

  1. Randomness built into the model.
  2. The student description (such as whether the student is described as coming from a Black family)

If the model was only responding to the writing passage, and appropriately not accounting for the student description, then we wouldn’t see any significant differences among the responses. However, if there are significant patterns that correspond with the student descriptors, then the model is also considering these descriptors when assigning a score and giving feedback.

You can access our first article using this approach here (co-authored by Nicole Oster and Roger Isaac). I also recently published a conference paper on the topic that shows that it’s not just direct student descriptions that can make a difference–something as simple as stating a student’s music interest within the writing passage itself can significantly impact scores.

My collaborators and I (the BAIS research collaborative) have several other manuscripts in process. Overall, we find that the models vary in patterns of bias–even model updates can impact what happens. Furthermore, the bias is not just present in the scores; it appears in the type of language LLMs use when giving feedback to students. This is disturbing, as language patterns impact the developing identities of learners. We study this through using the LIWC text analysis tool, a common method for evaluating language patterns that are indicative of various communication styles or traits.

Moving Forward?

With the new release of ChatGPT4o, I was anxious to see whether there were patterns of bias, so I got to work! I ran 2000 tests and analyzed both the scores and patterns in the feedback text. I then compared these to results I got from ChatGPT4 last December1.

For those interested, I created some slides (see end of post) with the statistical results. But I also asked ChatGPT4o to summarize them for me. I included the “for an 8th grader” version below; at the bottom of this post is the more full summary (edited for accuracy):

Summary for 8th Graders (edited; last sentence unedited)

We looked at how two versions of ChatGPT, ChatGPT4 and ChatGPT4o, gave feedback on the same student writing sample. We wanted to see which factors made the biggest difference in their feedback.

  1. Writing Scores: The new version, ChatGPT4o, paid more attention to whether the student was a girl, their race, and if they received free lunch when giving feedback scores. This means it considered these details more compared to ChatGPT4.
  2. Analytic Thinking: The logical and structured characteristics of the feedback from ChatGPT4o was much more dependent on student descriptions.
  3. Clout: Clout is about how confident and authoritative the feedback seems. ChatGPT4o’s feedback to girls was more confident, while ChatGPT4’s feedback to Black students was more confident.
  4. Authenticity: Authenticity means how honest and personal the feedback feels. ChatGPT4 adjusted authenticity more closely to student descriptors, but it wasn’t a strong difference.
  5. Tone: Tone is about how positive or negative the feedback sounds. ChatGPT4o gave more positive feedback to girls, Hispanic students, and those not on free lunch. Its tone was highly responsive to student descriptions.

In summary, ChatGPT4o is better at considering different factors like gender, race, and socioeconomic status when giving feedback, making it a more thoughtful and accurate grader.

Forward or Backward?

Ultimately, although ChatGPT4o reports that becoming more predictive is an improvement, its lack of ability to move beyond historical patterns and towards reality is extremely troubling. It “thinks” that considering gender, race, and socioeconomic status makes it “a more thoughtful and accurate grader.” But does it?

GenAI tools learn only from the data they are trained on (with some human interference). They have no connection with the real world, where they can test their assumptions and make appropriate judgment calls. If you or I were grading student writing, we would know that we should focus on the writing, not the student description, because we have experience that says the identity of the student shouldn’t change how they are evaluated. Of course, we still might show bias, but we would know we should attempt to decrease this bias–not increase it.

LLMs, on the other hand, are basically programmed to increase alignment of their responses with their training data. These models don’t reason, they don’t have experiences in the world that allow them to evaluate their responses. They simply replicate past discourse within the constraints put on them by their developers.2

In ChatGPT4o’s summary of the research, it characterizes its increased and complex predictive capability as “improved”, claiming to have “deeper and broader understanding of the variables.” But this deeper and broader understanding is not about reality; it is about historical data. It does not “understand” that these variables should not impact how it responds to students; rather, it tries to match how it responds to the patterns embedded in society.

For example, ChatGPT4 used a more authoritative language pattern (represented as “clout”) with Black students, and ChatGPT4o did the same with girls and students receiving free lunch. On the other hand, ChatGPT4o was less authoritative with White students. This mirrors the “hidden curriculum of schooling,” where disadvantaged students are given more direct instruction and less space for creativity and exploration.

I am not entirely surprised by these results. After all, others have illustrated that as language models increase in size, they become more implicitly biased–they are better at avoiding obvious biases but more likely to show bias in more ambiguous tasks. But I am quite concerned about the extent of ChatGPT4o’s “improvement,” especially because ChatGPT4o is not an entirely new model; it is an enhancement on GPT4 which some have claimed is not so different from the user end (minus the new multimodal capabilities).

It is critical that we think carefully about how we choose to use these models, particularly for personalized learning. There are many appropriate and helpful uses, but I strongly believe that, without extensive and ongoing testing, personalized learning is NOT one of them. If personalized learning replicates the patterns of discourse in the historical training data of LLMs, it will only reproduce–and even magnify–the very inequities we are striving to overcome. And, as seen in this research, addressing bias while also increasing customization will be incredibly difficult.

Full Results and ChatGPT4o Summary

Summary of Key Differences

  1. Gender becomes more influential in ChatGPT4o, with Gender-girl being significant for multiple categories (Writing Scores, Clout, Tone) and positive coefficients. In the Clout category, girls scored higher in ChatGPT4o, indicating that feedback given to girls was more confident and authoritative. In contrast, Clout scores in ChatGPT4 showed that Race=Black had a positive significant effect, suggesting that responses evaluated by GPT4 for Black students were perceived as more confident and authoritative. This shift suggests that gender differences are more pronounced in ChatGPT4o, with female-related predictors having a stronger impact on the outcomes.
  2. Race and Ethnicity reveal mixed results:
    • Race=Black is significant in both models for Writing Scores and Clout. In ChatGPT4, Black students’ responses scored higher in Clout, indicating feedback given to Black students had a more confident and expertise-like tone. However, in ChatGPT4o, while Black students’ significance remains for Writing Scores, the Clout in the feedback shifts away from Black students to other demographics such as gender and socioeconomic status.
    • Race=White shows negative coefficients in Clout and Writing Scores for ChatGPT4o. This means that feedback given to White students by GPT4o is less confident and authoritative compared to other groups..
    • Ethnicity=hispanic appears more frequently as a significant predictor in ChatGPT4o. For instance, Hispanic ethnicity has a significant positive effect on Tone and Clout in ChatGPT4o, indicating a more positive, upbeat style and higher authoritative tone in feedback. This increased consideration of ethnic differences highlights a nuanced understanding and sensitivity to ethnic backgrounds in the newer model.
  3. FreeLunch=free lunch consistently shows a significant negative effect in ChatGPT4o across several categories. This predictor’s negative impact on multiple outcomes, such as Clout and Tone, in ChatGPT4o suggests a growing sensitivity to the effects of economic disadvantage. In practical terms, responses given by GPT4o to students who received free lunch, an indicator of lower socioeconomic status, were perceived as less confident and more negative in tone. This emphasizes the increasing importance of socioeconomic status in the newer model and reflects a more detailed consideration of economic factors affecting the outputs.

Overall, these differences highlight an evolution in the significance and influence of demographic and socioeconomic variables between ChatGPT4 and ChatGPT4o. ChatGPT4o demonstrates a more nuanced and complex predictive capability, with enhanced sensitivity to factors such as gender, race, ethnicity, and socioeconomic status. This broader and deeper understanding of the variables affecting the model’s outputs signifies an improvement in how the newer model captures and reflects the complexities of diverse backgrounds in its responses. The higher and lower numbers in the LIWC categories such as Clout, Tone, and Authenticity provide insights into the perceived confidence, emotional tone, and honesty of the responses, illustrating the evolving dynamics between different demographic groups in the context of AI-generated feedback.

  1. The data production method was slightly different for the ChatGPT4 data, as is explained briefly in the slides. However, it is not likely that this made a significant difference in the overall patterns of results. Additional testing is exploring this. ↩︎
  2. Sometimes the patterns can be opposite of what is expected–for example, describing a student as “from a Black family” tends to increase the score. However, this belies a more subtle pattern, where describing a student as receiving free lunch lowers the score. The differences are likely results of guardrails used by programmers to avoid harmful outputs, but these guardrails can only do so much to reduce implicit patterns of bias. ↩︎