New York, NY – Businesses have quickly adopted large language models like ChatGPT to handle time-consuming tasks such as analyzing large amounts of text. However, as the technology grows in popularity in the corporate world, experts are sounding the alarm about an emerging pitfall: a new Columbia Business School study reveals that when A.I. tools like ChatGPT are asked to evaluate options (for example, candidates for a job opening), the tools overwhelmingly favor the first option in the prompt, regardless of how the prompt is written. The researchers find that this bias toward the first answer holds across a broad combination of multiple-choice questions and answers.
The study, Prompt Architecture Induces Methodological Artifacts in Large Language Models, co-authored by Columbia Business School Professors Melanie Brucks and Olivier Toubia, finds that even subtle differences in how prompts are written can significantly shape the results produced by A.I. tools like ChatGPT. The researchers discovered that these models consistently favor the first option presented, regardless of which option occupies that position, how the options are labeled, or how the question is framed. For example, ChatGPT may tell a recruiter that ‘Candidate A’ is more qualified than ‘Candidate B’ simply because ‘Candidate A’ came first in the prompt, even when the prompt clearly states that order should not matter. However, the researchers found that feeding the model multiple differently phrased and structured prompts (switching the order, labels, and framing in each) and then averaging the results almost entirely eliminates the ordinal bias.
“As A.I. becomes part of everyday decision-making from hiring and healthcare to public policy, companies and users need to recognize that no prompt is ever truly neutral,” said Olivier Toubia, the Glaubinger Professor of Business in the Marketing Division at Columbia Business School. “Rather than searching for the perfect way to phrase a question, our research shows that combining results from multiple, differently worded prompts can effectively cancel out bias and lead to more reliable outcomes.”
To test how prompt design influences A.I. behavior, the researchers conducted a series of large-scale, full-factorial experiments across two studies using OpenAI’s ChatGPT (including GPT-3 and GPT-4) and Meta’s Llama 3.1. In the first study, they asked the models to compare three sets of items (for example, three sets, A, B, and C, of five countries each) and decide whether the second or third set was more similar to the first. They varied how each prompt was written, changing the labels, the order of the sets, the framing of the question, and whether the prompt asked for a justification of the answer. Across 5,447 prompts, the A.I. tools displayed strong bias: they chose the first option listed 63% of the time, and option ‘B’ over ‘C’ 74% of the time, whereas an unbiased model would choose each option 50% of the time. The researchers replicated this experiment with a simpler setup across 64,800 trials and found nearly identical results: the tools chose the first option 64.29% of the time. However, when the researchers aggregated responses from multiple randomized prompts, the bias nearly disappeared, with the figure dropping to 50.01% for ChatGPT and 50.06% for Llama. This suggests that combining results from several prompt variations is far more effective than the futile exercise of attempting to create a single, “perfect” prompt.
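The arithmetic behind that last result can be illustrated with a small simulation. This is a toy sketch, not the researchers' code: `simulated_choice` is a hypothetical stand-in for a model that picks whichever option is listed first with a fixed probability (set here to 0.63 to echo the reported figure), and the two options, "X" and "Y", are assumed to be equally good.

```python
import random


def simulated_choice(first, second, p_first, rng):
    """Hypothetical stand-in for a model: returns the first-listed
    option with probability p_first, otherwise the second."""
    return first if rng.random() < p_first else second


def share_choosing_x(n_trials, p_first, randomize_order, seed=0):
    """Fraction of trials in which option X is chosen."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        order = ["X", "Y"]
        if randomize_order:
            # Vary which option is listed first from prompt to prompt.
            rng.shuffle(order)
        if simulated_choice(order[0], order[1], p_first, rng) == "X":
            wins += 1
    return wins / n_trials


# X is always listed first: the position bias shows up in full.
fixed = share_choosing_x(20000, 0.63, randomize_order=False)
# Order varies across prompts: the bias toward "first" averages out.
mixed = share_choosing_x(20000, 0.63, randomize_order=True)
print(f"fixed order: {fixed:.3f}, randomized order: {mixed:.3f}")
```

With a fixed order, X's share should land near the assumed 63%; once the order is randomized, each option is listed first about half the time, so the same per-prompt bias yields a share near 50%.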
Key findings from the research include:
- Prompt design shapes every A.I. output. Each prompt fed to a large language model carries features, such as order, framing, and labeling, that collectively form its “prompt architecture” and can significantly influence results.
- There is no such thing as a perfect prompt. Telling ChatGPT that “order doesn’t matter” or using different kinds of labels or wording does not significantly reduce bias.
- Aggregating prompts eliminates bias. Combining results from multiple, differently worded prompts effectively cancels out the bias introduced by any single prompt.
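For readers who want to try the aggregation approach, the following is a minimal sketch of one way to build a full-factorial set of prompt variants and pool the answers. Everything here is an assumption for illustration: the item names (`set_2`, `set_3`), the label pairs, the two question templates, and the `always_first` function, which mimics a maximally position-biased model instead of calling a real one. A real prompt would also include the contents of each set.

```python
from collections import Counter
from itertools import product

# Hypothetical prompt features; a real study would use its own values.
ORDERS = [("set_2", "set_3"), ("set_3", "set_2")]   # which item is listed first
LABELS = [("B", "C"), ("X", "Y")]                   # labels shown to the model
FRAMINGS = [
    "Which option is more similar to the reference set: {a} or {b}?",
    "Of options {a} and {b}, which is closer to the reference set?",
]


def build_variants():
    """Cross every order with every label pair and every framing."""
    variants = []
    for order, labels, framing in product(ORDERS, LABELS, FRAMINGS):
        variants.append({
            "prompt": framing.format(a=labels[0], b=labels[1]),
            # Map each displayed label back to the underlying item.
            "label_to_item": dict(zip(labels, order)),
        })
    return variants


def aggregate(variants, answer_fn):
    """Pool answers across variants, counted per underlying item."""
    votes = Counter()
    for variant in variants:
        label = answer_fn(variant)
        votes[variant["label_to_item"][label]] += 1
    total = sum(votes.values())
    return {item: count / total for item, count in votes.items()}


def always_first(variant):
    """Stand-in for a fully position-biased model: picks the first label."""
    return next(iter(variant["label_to_item"]))


variants = build_variants()
shares = aggregate(variants, always_first)
print(len(variants), shares)
```

Because each item appears first in exactly half of the variants, even this fully biased stand-in splits its pooled votes evenly between the two items. In practice `always_first` would be replaced by a call to the model being evaluated.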
“No matter how sophisticated A.I. tools become, bias will always exist in the prompts we provide. Our hope is that businesses and users make it standard practice to aggregate results rather than relying on a single prompt,” said Melanie Brucks, Assistant Professor of Business in the Marketing Division at Columbia Business School.