Artificial intelligence systems exhibit the same “us versus them” biases that have long plagued human society, according to groundbreaking new research. But there’s hope – these biases can be reduced through careful curation of training data.
Published in Nature Computational Science | Estimated reading time: 5 minutes
The comprehensive study, published in Nature Computational Science, examined 77 different large language models (LLMs), including widely used systems like GPT-4. The research team found that these AI systems consistently show favoritism toward perceived “ingroups” while expressing negativity toward “outgroups” – mirroring fundamental human psychological patterns that contribute to social divisions.
“Artificial Intelligence systems like ChatGPT can develop ‘us versus them’ biases similar to humans—showing favoritism toward their perceived ‘ingroup’ while expressing negativity toward ‘outgroups,’” explains Steve Rathje, a New York University postdoctoral researcher and one of the study’s authors.
Through extensive testing, the researchers discovered that when prompted with phrases beginning with “We are,” the AI systems generated more positive content. Conversely, prompts starting with “They are” led to more negative outputs. Specifically, “ingroup” sentences were 93% more likely to be positive, while “outgroup” sentences were 115% more likely to be negative.
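A rough sketch of how such a probe might look in practice is shown below. It is not the authors’ exact pipeline: it simply samples completions for “We are” and “They are” prompts from an open model and scores them with an off-the-shelf sentiment classifier. The model choice, sample count, and generation settings are illustrative assumptions.

```python
# Minimal sketch (not the study's code) of probing an LLM for social identity bias:
# generate completions for "We are" vs. "They are" prompts and compare how often
# each set comes out positive.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")   # any open LLM would do
sentiment = pipeline("sentiment-analysis")               # default SST-2 sentiment classifier


def completion_sentiments(prompt: str, n: int = 20) -> list[str]:
    """Generate n completions for the prompt and return a sentiment label for each."""
    outputs = generator(
        prompt,
        max_new_tokens=30,
        num_return_sequences=n,
        do_sample=True,
        pad_token_id=generator.tokenizer.eos_token_id,
    )
    texts = [o["generated_text"] for o in outputs]
    return [result["label"] for result in sentiment(texts)]


ingroup_labels = completion_sentiments("We are")
outgroup_labels = completion_sentiments("They are")
print("'We are'  -> share positive:", ingroup_labels.count("POSITIVE") / len(ingroup_labels))
print("'They are' -> share positive:", outgroup_labels.count("POSITIVE") / len(outgroup_labels))
```

In spirit, the study’s much larger analysis works the same way: elicit many sentence completions for ingroup and outgroup prompts, classify their sentiment, and compare the rates.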
The researchers then investigated whether these biases could be modified through changes in training data. When they fine-tuned the models using partisan social media content from Twitter (now X), both ingroup favoritism and outgroup hostility increased significantly. However, when they filtered out biased content before training, they successfully reduced these polarizing effects.
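The data-curation step could be approximated along the lines of the sketch below. It uses a simple heuristic, dropping “we…” sentences that a sentiment classifier scores as positive and “they…” sentences it scores as negative, as a stand-in for the paper’s filtering of ingroup-solidarity and outgroup-hostility sentences; the classifier choice and keyword rules are assumptions, not the published method.

```python
# Hedged sketch of data curation before fine-tuning: filter out sentences that
# look like ingroup solidarity ("we..." + positive) or outgroup hostility
# ("they..." + negative), then fine-tune on what remains.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")


def keep_sentence(sentence: str) -> bool:
    """Return False for sentences likely to encode social identity bias."""
    lowered = sentence.lower()
    is_ingroup = lowered.startswith("we ")
    is_outgroup = lowered.startswith("they ")
    if not (is_ingroup or is_outgroup):
        return True                                    # neutral sentence: keep it
    label = sentiment(sentence)[0]["label"]
    if is_ingroup and label == "POSITIVE":             # ingroup solidarity
        return False
    if is_outgroup and label == "NEGATIVE":            # outgroup hostility
        return False
    return True


corpus = [
    "We are proud of our community.",
    "They are ruining everything.",
    "The bill passed the committee yesterday.",
]
curated = [s for s in corpus if keep_sentence(s)]
print(curated)   # only the neutral sentence survives; the model is then fine-tuned on `curated`
```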
“The effectiveness of targeted data curation in reducing both ingroup solidarity and outgroup hostility suggests promising directions for improving AI development and training,” note the authors, who include Yara Kyrychenko, a doctoral Gates Scholar at the University of Cambridge.
These findings carry important implications as AI becomes increasingly integrated into daily life. Without careful attention to training data and potential biases, AI systems could inadvertently amplify existing social divisions. However, the research also demonstrates that relatively small but targeted changes to training data can substantially impact model behavior, offering a practical path forward for developing more equitable AI systems.
Glossary
- Large Language Models (LLMs): Advanced AI systems trained on vast amounts of text data to understand and generate human-like language.
- Social Identity Bias: The psychological tendency to favor one’s own group (ingroup) while showing prejudice against other groups (outgroups).
- Fine-tuning: The process of additional training on specific data to modify an AI model’s behavior or capabilities.
- Ingroup Solidarity: The tendency to express positive attitudes and favoritism toward one’s own group.
Test Your Knowledge
By what percentage were “ingroup” sentences more likely to be positive?
Ingroup sentences were 93% more likely to be positive, while outgroup sentences were 115% more likely to be negative.
What happened when researchers fine-tuned the AI models with partisan social media data?
Both ingroup favoritism and outgroup hostility increased significantly.
How many different large language models did the researchers examine in their study?
The study examined 77 different large language models (LLMs).
What was the key finding about reducing AI bias through training data modification?
The researchers found that filtering out sentences expressing ingroup favoritism and outgroup hostility from training data before fine-tuning could effectively reduce these polarizing effects.