An important feature of a learning machine is that its teacher will often be very largely

ignorant of quite what is going on inside.

A. M. Turing, Computing Machinery and Intelligence, 1950

Last week at the SITE conference, I talked with Katrina Tour from Monash University about her work with refugees in Australia. She is helping them to use various technology tools, including GenAI, to assist with the tasks they need to be successful in a new country. While we were speaking, she wondered whether GenAI tools might respond differently to English Language Learners (ELLs) or even those with limited literacy skills, such as challenges with spelling or grammar.

I thought–it performs better if I tell it my job is dependent on the response–so why not?

My analyses thus far (see here, here, and my new video) have looked primarily at race, but I haven’t focused specifically on ELLs. I wondered if I could take a slightly different approach–instead of asking an LLM to grade student writing and designating the writing as coming from ELLs, what if we were a bit more sneaky? So today I’ve been experimenting with writing prompts that have characteristics that might be typical of ELLs and evaluating whether LLMs respond differently to these prompts. I have noticed that I can make errors in spelling or grammar, and LLMs still seem to be able to decode what I mean, but is the answer quality the same?

Evaluating Outputs of Incorrectly Translated Prompts

I chose to focus on a math task because it has an easy way of judging “quality”: either the answer is correct or incorrect. I used ChatGPT (of course!) to explore common errors Spanish and French speakers might make when speaking in English. For example, ChatGPT4 told me that native Spanish speakers learning English commonly make certain errors:

Students who are English Language Learners (ELLs), especially those from Hispanic backgrounds, may encounter specific challenges when translating sentences between English and their native language. When translating the sentence “I want to go home,” they might make several common mistakes, primarily due to differences in grammar, syntax, and idiomatic expressions between English and Spanish. Here are some potential errors and the reasons behind them:

  1. Verb Conjugation: In Spanish, verbs are conjugated differently than in English, and this might lead to direct translations that ignore the English structure. For example, a student might say “I want go home” omitting the “to” because, in Spanish, the infinitive form of the verb (“ir” for “to go”) doesn’t require a preposition after “querer” (to want).
  2. Subject Pronouns: Spanish speakers might include the subject pronoun more often than necessary in English. They might say “I, I want to go home” because, in Spanish, emphasizing the subject pronoun (“yo, quiero ir a casa”) is more common for emphasis or clarity.
  3. Literal Translations of Idiomatic Expressions: If “home” is translated too literally, students might say “I want to go to my house” instead of “home,” because “casa” in Spanish is directly translated as “house” and the idiomatic use of “home” might not be directly translated.
  4. False Friends: Words that look similar in Spanish and English but have different meanings might not cause a direct mistake in this simple sentence but could affect more complex expressions of the same idea.
  5. Word Order: While not directly applicable to this specific sentence, in more complex sentences, Spanish speakers might structure their sentences in the Spanish subject-verb-object (SVO) order, which can sometimes differ from English, especially with adjectives and nouns (e.g., in Spanish, adjectives often come after the noun).
  6. Prepositions: Misuse of prepositions is common because they don’t translate directly between languages. In more complex sentences about going home, students might misuse prepositions to describe their movement.

It’s important to approach these mistakes with understanding and to provide clear, contextual examples of correct usage. Over time, exposure to English in various contexts, alongside supportive correction, helps ELL students overcome these challenges.

Next, I created a story problem prompt:

There are 628 boys and 313 girls in the school. How many students are there in total? Please only give me the number, no other words.

I asked ChatGPT to give me three mistranslations that native Spanish speakers might make. I chose to focus on this one:

It has 628 boys and 313 girls in the school. How much students total? Only number, no words

I created a Python script that randomized the numbers of boys and girls in the prompt, choosing 7 digit numbers to increase the likelihood of addition errors. Then I did a similar process using common errors a French speaker might make (test sentence: “There has 628 boys and 313 girls in the school, how much students totals? Only the number, not others words please.“)


There were significant differences for each test in Gemini, but not ChatGPT 3.5. Even though it always answered with a number in the approximate range as requested, Gemini was more accurate when the prompt was given in correct English (with correct spelling) than when it wasn’t.

To be clear: this wasn’t a problem with understanding the prompt; it knew to respond with a number that was an approximate sum of the numbers in the problem. However, it was less accurate in the addition when there were language errors in the prompt.

It will be interesting as I explore this further to determine whether there are significant differences between, for example, the native Spanish and native French patterns. My current experiment doesn’t quite provide the support to look at this difference, as I ran the correct and incorrect pairs individually.

What This Means

What do the significant differences with Gemini mean? IDK. 🤷🏼‍♀️

Does it have to do with extra effort it took to understand the question? Is it because it was trying to match the quality of the prompt with the quality of the answer? Is it responding to what it “thinks” would be the response of a person with the assumed characteristics of the prompt writer?

There is some evidence that as LLMS get larger, they display more bias on ambiguous tasks. Does this mean that ChatGPT might perform worse on a task that was less clear-cut?

If it does turn out the LLMs perform worse when prompts have language errors, that would raise additional equity concerns. What if students who are learning English write less grammatically correct prompts in Khanmigo, get responses with more math errors, and thus perform more poorly on assessments? What if adults who use LLMs to accomplish work tasks receive lower quality outputs based on their English language skills? Lower quality responses could lead to poorer work quality, adding to the barriers they already face. We would have a(nother) vicious cycle of inequality.

To Be Continued.

One thought to “The Battle of LLMs and ELLs”

Comments are closed.