Building AI for African Languages

African languages account for less than 1% of AI training data, but grassroots communities are building what big tech can't.

By Eugene Yiga
Published Jun 11, 2026

A show of hands at Mozilla Festival, the annual web think tank, in Barcelona last year revealed an uncomfortable truth. Nearly everyone in the room had used AI systems based on big tech models. Fewer had successfully engaged with AI in their first language. And almost nobody had used a small language model built by and for their own community.

"The narrative of ‘bigger is better’ dominates how most people understand what language AI can look like," says Daniel Brumund, an adviser on digital governance and AI at GIZ, the German development agency. "But the question is: whose languages do these big tech models really represent? Whose work are they based on? And who do they truly benefit?"

The numbers are stark. Less than 1% of the data available on the internet is in African languages, despite Africa comprising one-seventh of the world's population. The continent is home to more than 2 000 languages, yet representation in global AI models remains minimal. This exclusion creates an economic barrier that locks billions of people out of the digital economy.

Brumund and his colleagues work on FAIR Forward, a GIZ initiative that promotes open, inclusive, and sustainable approaches to AI on an international level. The project partners with organisations in Ghana, Kenya, Rwanda, South Africa, Uganda, Indonesia, and India to democratise AI for local innovation. But progress requires confronting how language AI has traditionally been built.

Current AI models are too expensive and too poorly suited for African languages

Pelonomi Moiloa, CEO and co-founder of Lelapa AI in Johannesburg, puts the problem bluntly. Her company's investor deck contains a deliberately provocative statement: language models today are too dumb for complex language environments and too expensive for the difficult ones.

“Training a GPT-type model just once uses the same amount of power as 12 000 Johannesburg homes do in a month,” she says. “And Johannesburg is one of the biggest cities in Africa, fully stable in terms of resources and economy. So determining whether you’re going to take a GPT model or power 12 000 homes: those are the types of trade-offs you’re expecting people to make.”

The sustainability problem runs deeper than energy consumption. Big tech companies currently subsidise up to 90% of what it costs to serve their customers. That model can't scale indefinitely.

“What this means is that the two billion people in the global majority won’t be able to rely on this technology to access essential products and services in the digital economy,” Moiloa explains. “This is a catastrophe because in places like South Africa, technology through mobile phones is really the fastest, most efficient option for people to access products and services. Those people are situated very far away from where those services are.”

Ideally, you have a bank, a hospital, a government office where people can access their pensions, and a school right next to everybody. But because communities are spread out and don't have the funds or public transport infrastructure to travel, technology fills that gap if it works in the local language.

The alternative is hard-coding language functionality into applications. That approach scales poorly and misses the nuance that makes language work. What’s needed are models that understand African languages natively, with all their complexity and variation.

African languages break the assumptions that underpin current AI models

Moiloa’s company focuses on isiZulu, one of South Africa's official languages. It presents a fascinating case study in why big tech approaches fail.

“It’s surprisingly easy for people to learn but actually impossible for machines,” she says. “It’s what’s called an agglutinative language, where a single sentence can exist in one word. So ‘ngiyagijima’ means ‘I am running.’ When you feed that to a model expecting words to be broken down according to English or other Indo-European languages, it doesn’t know how to parse it into smaller components.”
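A toy sketch makes the parsing problem concrete. It assumes a tiny hand-written morpheme list (the names `PREFIXES`, `SUFFIXES`, and `segment` are hypothetical and for illustration only). It is not a real isiZulu analyser and not Lelapa AI's method. It only shows that splitting on spaces sees one opaque unit where a morpheme-aware splitter sees several meaningful parts.

```python
# Toy illustration only: these morpheme lists are hand-written assumptions,
# not a real isiZulu analyser and not any company's production method.
PREFIXES = ["ngi", "ya"]  # subject marker ("I") and present-tense marker
SUFFIXES = ["a"]          # final vowel


def whitespace_tokens(text):
    """Split on spaces, as a pipeline designed around English might."""
    return text.split()


def segment(word, prefixes=PREFIXES, suffixes=SUFFIXES):
    """Peel known prefixes from the left, then one known suffix from the
    right, and treat whatever remains as the root."""
    parts = []
    rest = word
    stripped = True
    while stripped:
        stripped = False
        for prefix in prefixes:
            # Never strip the whole word: something must remain as the root.
            if rest.startswith(prefix) and len(rest) > len(prefix):
                parts.append(prefix)
                rest = rest[len(prefix):]
                stripped = True
                break
    tail = []
    for suffix in suffixes:
        if rest.endswith(suffix) and len(rest) > len(suffix):
            tail.append(suffix)
            rest = rest[:-len(suffix)]
            break
    return parts + [rest] + tail


print(whitespace_tokens("I am running"))  # ['I', 'am', 'running']
print(whitespace_tokens("ngiyagijima"))   # ['ngiyagijima']
print(segment("ngiyagijima"))             # ['ngi', 'ya', 'gijim', 'a']
```

A hand-written list like this would not survive contact with real text, which is part of the point: the segmentation has to be learned, and learned from far less data than English models enjoy.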

The complexity multiplies in multilingual contexts. South Africans typically speak four to five languages and mix them together in everyday conversation. This code-switching is normal human behaviour, but it confuses models trained on monolingual datasets. The legacy of colonialism adds another layer. Imposed languages coexist with indigenous ones without having fully displaced the culture and context in which those indigenous languages live.

“When we’re looking at the language and trying to teach it to the machine, it consists of multiple languages mixed together in ways that models just aren’t used to understanding,” Moiloa says. “They’re used to having a dictionary definition of a language and how it exists in isolation.”

Rather than waiting for big tech to solve these problems, Lelapa AI is demonstrating that different approaches work. The company has built translation and transcription models for isiZulu using 70% less data and 70% less compute than conventional methods.

“If we can solve this for isiZulu, we can solve it for the world,” Moiloa says. “Because we’ve figured out how to learn more from less.”

The data problem extends beyond isiZulu. Across the Global South, an estimated 4 000 languages exist primarily in oral tradition rather than text. Traditional AI models have been built by scraping the internet, but that approach simply doesn’t work for languages that were never written down at scale in the first place.

This is why Moiloa sees Lelapa’s work as solving a global problem, not just an African one. The techniques that allow models to learn effectively from limited data have applications wherever linguistic diversity collides with computational constraints. Moiloa’s recognition as one of TIME’s 100 Most Influential People in AI and as a Mozilla Rise 25 Award recipient suggests the wider tech world is starting to pay attention.

Building together: the Masakhane approach

While Lelapa AI tackles the commercial challenge, a parallel grassroots movement is addressing the foundational infrastructure. Masakhane, which means “we build together” in isiZulu, is a community of more than 2 000 African natural language processing researchers working to ensure African languages aren’t left behind.

“Our goal is to empower one billion Africans by 2029 with relevant AI tools and resources,” says Samantha Moyo, partnerships and community lead at the Masakhane Hub for African Languages, which launched in April 2025. “We’re unlocking opportunities for economic development, innovation, preservation, and evolution of language heritage.”

The challenges the Hub addresses are interconnected. Data scarcity means many African languages lack sufficient annotated data for training AI models. Critical underrepresentation in AI development creates barriers to equitable access. Fragmented and siloed data collection efforts compound inefficiencies. And limited compute resources and high storage costs make progress difficult.

The Hub currently works with community-driven datasets documenting and preserving fifty African languages. The approach prioritises inclusion of language speakers themselves in the development process, ensuring that technologies are contextually relevant rather than imposed from outside.

“We believe in community-centred approaches to African language AI,” Moyo says. “We are part of the success story of people coming together within their own context, responding to the realities of African language AI.”

The Hub is an offshoot of the existing Masakhane Research Foundation, which has been building grassroots capacity for years. Their grant-making initiative specifically targets marginalised communities, aiming to ensure African languages and cultures are represented in an AI-driven future. The work covers research on African language models, datasets that preserve linguistic heritage, and capacity building to strengthen the continent's technical community.

What makes the approach different is the insistence on demystifying AI for the communities it’s meant to serve. Rather than imposing solutions from outside, the Hub works to help communities understand how language AI works, address their fears and concerns, and participate meaningfully in development. The question driving this work is simple: what if every African could engage with technology in their own language?

This represents the narrative shift that advocates hope to see more widely. Rather than extracting data to feed models built elsewhere, communities are building AI that works for them. The question is whether funders, policymakers, and technologists will support approaches that prioritise communities over scale, and whether the rest of the world will recognise that language AI built with, for, and by communities offers something the big tech model simply cannot deliver.

“We must ensure that every voice, every idea, has a place at the table,” Moyo says.

Six principles shaping African language AI

Deshni Govender, country lead for FAIR Forward at GIZ South Africa, outlines six key perspectives on the African NLP ecosystem.

Languages carry identity, not just information. Languages are inseparable from culture, history, and identity. When AI systems exclude languages, they exclude the communities who speak them.

Language requirements create economic barriers. Language can be weaponised to exclude people from jobs and opportunities. The perception that someone who doesn’t speak English fluently “probably isn’t that smart” creates real barriers in hiring and advancement.

Dialects demand decentralised benchmarks. Swahili spoken in Tanzania differs from Swahili in Kenya. You can’t simply create a single Swahili dataset and call it done. You must consider who gets to define what counts as the benchmark version.

Digital invisibility has real-world consequences. If a language or community isn’t represented online, they’re effectively unseen, waiting for others to discover and share their culture rather than controlling the narrative themselves.

Openness must mean sharing, not extraction. The push for “open” data raises questions about who benefits. Communities such as Masakhane and Mozilla's Common Voice are redefining what openness means: sharing resources in ways that empower rather than extract.

What benefits big tech must also benefit communities. If openness benefits big tech, it should also benefit communities. People need power, agency, and autonomy over how their language data is used. It can’t be good for one party while adversely affecting another.