AI Model Could Add Navajo and Related Languages to Online Translators

While Google Translate can instantly recognize more than 100 languages, it still fails to identify Navajo, the most widely spoken Native American language. That could soon change, according to new research from Dartmouth College showing that artificial intelligence can identify endangered Indigenous languages with near-perfect accuracy.

The study, presented May 1 at the Association for Computational Linguistics conference in Albuquerque, found that with relatively minimal resources, researchers could train an AI model to recognize Navajo with 97-100% accuracy. The findings suggest that major tech companies could easily expand their language identification tools to include Native American languages, potentially supporting preservation efforts for these endangered cultural treasures.

The research comes at a critical time: many Indigenous languages face extinction because they have minimal technological support and few educational resources.

Closing the Digital Language Divide

The Dartmouth team discovered the problem when testing Google’s Language Identification tool (LangID), which powers services like Google Translate. When presented with Navajo text, LangID repeatedly misidentified it as unrelated languages like Icelandic, Lingala, or Wolof.

“By building on the ideas behind LangID, we found that it’s possible to develop a classifier to identify Indigenous languages,” says Ivory Yang, the study’s first author and a PhD candidate at Dartmouth. “From Google’s perspective, adding a new language involves rigorous verification, which makes sense given the scale. What I hope to show is that even with limited resources, meaningful progress is still possible.”

Using a dataset of 10,000 Navajo sentences, the researchers created a model that correctly identified the language with remarkable consistency. What’s more, they found this approach could potentially extend to related languages that have even less data available.

A Bridge to Related Languages

The team’s findings suggest that Navajo could serve as a linguistic bridge to help translation tools recognize other related languages in the Athabaskan family, which includes Apache and several Native Alaskan languages.

When they tested their model on languages like Western Apache, Mescalero Apache, Jicarilla Apache, and Lipan Apache—sometimes using datasets as small as 20 sentences—the model identified them as Navajo due to their linguistic similarities.

“What we noticed is that they are so linguistically similar to Navajo that it could be used to eventually identify these related languages without needing the same amount of data,” Yang explains. “That could mean that higher resource languages can act as a bridge to lower resource languages in general.”
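
To make the bridge idea concrete, here is a minimal sketch in Python. It is not the Dartmouth team's model or Google's LangID. It swaps in a much simpler technique, matching a sentence's character trigrams against per-language profiles, and it uses Spanish, English, and Portuguese as stand-ins because no Navajo or Apache data is reproduced here. Every function name and the toy corpus are assumptions made for illustration.

```python
from typing import Dict, Iterable, Set


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of distinct character n-grams in the lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_profiles(corpus: Dict[str, Iterable[str]], n: int = 3) -> Dict[str, Set[str]]:
    """Collect every n-gram seen in each language's training sentences."""
    profiles: Dict[str, Set[str]] = {}
    for lang, sentences in corpus.items():
        grams: Set[str] = set()
        for sentence in sentences:
            grams |= char_ngrams(sentence, n)
        profiles[lang] = grams
    return profiles


def coverage(text: str, profile: Set[str], n: int = 3) -> float:
    """Fraction of the text's distinct n-grams that also occur in the profile."""
    grams = char_ngrams(text, n)
    if not grams:
        return 0.0
    return len(grams & profile) / len(grams)


def identify(text: str, profiles: Dict[str, Set[str]], n: int = 3) -> str:
    """Pick the trained language whose profile covers the text best."""
    return max(profiles, key=lambda lang: coverage(text, profiles[lang], n))


# Hypothetical toy corpus: Spanish plays the "higher-resource" language.
TOY_CORPUS = {
    "spanish": ["el perro come la comida", "la casa es grande"],
    "english": ["the dog eats the food", "the house is big"],
}
PROFILES = build_profiles(TOY_CORPUS)

# A Portuguese sentence, a language the toy model was never trained on.
print(identify("a casa é grande", PROFILES))  # prints "spanish"
```

The Portuguese sentence shares most of its trigrams with the Spanish training data and none with the English, so it is assigned to Spanish. That loosely mirrors how the Apache samples in the study were identified as Navajo, although a real system would need far more data and a proper statistical model.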

This bridge concept could prove crucial for preserving languages with very few remaining speakers or limited written materials.

Beyond Identification to Translation

The work represents just the first step in making digital tools more inclusive of Indigenous languages. Simply being recognized by technology is a fundamental starting point for any language in the digital age.

“Many Indigenous languages lack even the basic dignity of being recognized online, a reflection of systemic bias in language technology,” says Soroush Vosoughi, the paper’s senior author and an assistant professor of computer science at Dartmouth. “Revitalization begins with visibility, and visibility begins with identification.”

The current research focuses specifically on language identification, but the team’s ambitions extend further:

  • Expanding the model to recognize additional Native American languages beyond the Athabaskan family
  • Developing translation capabilities for Navajo and related languages
  • Creating more comprehensive language tools to support language learning and preservation
  • Exploring partnerships with Indigenous communities to ensure technology respects cultural values

The next step for the team’s latest model is to translate original sentences into Navajo, Yang says. “Basically, we want to switch from identification to translation. The end goal is translation, but that is way, way harder. Right now, we know we can do identification.”

A Foundation for Revival

This research is part of a larger initiative at Dartmouth focused on using AI to help revitalize endangered languages. The team previously created a framework called NüshuRescue that translates Chinese into Nüshu, an endangered centuries-old script traditionally used by women in southern Hunan province.

What makes this work particularly significant? In an increasingly connected world where digital presence often determines a language’s survival prospects, technologies that support Indigenous languages could help reverse centuries of decline.

For the estimated 350,000 Navajo speakers and speakers of related Athabaskan languages, having their languages recognized by mainstream technology platforms represents more than convenience—it’s a form of cultural validation that could bolster ongoing preservation efforts and inspire younger generations to maintain their linguistic heritage.

As AI language technologies continue to advance, the question remains whether major tech companies will incorporate these endangered languages into their systems, or if more localized, community-driven approaches will ultimately prove more effective for language revitalization in the digital age.
