African Languages For AI: The Project That’s Gathering A Huge New Dataset

By Contributor, 2025-10-17

Vukosi Marivate, University of Pretoria; Ife Adebara, University of Alberta, and Lilian Wanzare, Maseno University

Artificial intelligence (AI) tools like ChatGPT, DeepSeek, Siri or Google Assistant are developed by the global north and trained in English, Chinese or European languages. In comparison, African languages are largely missing from the internet.

A team of African computer scientists, linguists, language specialists and others have been working on precisely this problem for two years already. The African Next Voices project, primarily funded by the Gates Foundation (with other funding from Meta) and involving a network of African universities and organisations, recently released what’s thought to be the largest dataset of African languages for AI so far. We asked them about their project, with sites in Kenya, Nigeria and South Africa.


Why is language so important to AI?

Language is how we interact, ask for help, and hold meaning in community. We use it to organise complex thoughts and share ideas. It’s the medium we use to tell an AI what we want – and to judge whether it understood us.

We are seeing an upsurge of applications that rely on AI, from education to health to agriculture. The models behind them are trained on large volumes of (mostly) language data. Known as large language models, or LLMs, they exist for only a few of the world’s languages.

Languages also carry culture, values and local wisdom. If AI doesn’t speak our languages, it can’t reliably understand our intent, and we can’t trust or verify its answers. In short: without language, AI can’t communicate with us – and we can’t communicate with it. Building AI in our languages is therefore the only way for AI to work for people.

If we limit whose language gets modelled, we risk missing out on the majority of human cultures, history and knowledge.

Why are African languages missing and what are the consequences for AI?

The development of language is intertwined with the histories of people. Many of those who experienced colonialism and empire have seen their own languages marginalised and not developed to the same extent as colonial languages. African languages are recorded less often, including on the internet.

So there isn’t enough high-quality, digitised text and speech to train and evaluate robust AI models. That scarcity is the result of decades of policy choices that privilege colonial languages in schools, media and government.

Language data is just one of the things that’s missing. Do we have dictionaries, terminologies, glossaries? Basic tools are few and many other issues raise the cost of building datasets. These include African language keyboards, fonts, spell-checkers, tokenisers (which break text into smaller pieces so a language model can understand it), orthographic variation (differences in how words are spelled across regions), tone marking and rich dialect diversity.
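To make the tokeniser and tone-marking points concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not part of the project's own tooling: the `byte_tokens` and `normalise` helpers are hypothetical names, `byte_tokens` stands in for a byte-level tokeniser, and the sample is a single tone-marked vowel rather than real project data.

```python
import unicodedata


def normalise(text: str) -> str:
    """Return text in Unicode NFC form so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", text)


def byte_tokens(text: str) -> list[int]:
    """Hypothetical stand-in for a byte-level tokeniser: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


# The same tone-marked vowel typed two ways:
# one precomposed code point, or a plain "e" plus a combining acute accent.
precomposed = "\u00e9"
decomposed = "e\u0301"

# The two strings look identical on screen but are different code point sequences.
print(precomposed == decomposed)

# After normalisation they compare equal.
print(normalise(precomposed) == normalise(decomposed))

# A byte-level tokeniser spends more tokens on the tone-marked vowel
# than on the unmarked one.
print(len(byte_tokens("e")), len(byte_tokens(decomposed)))
```

If a dataset mixes both spellings without normalisation, a model sees one word as two different ones, which is one way orthographic variation raises the cost of building clean data.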

The result is AI that performs poorly and sometimes unsafely: mistranslations, poor transcription, and systems that barely understand African languages.

In practice this denies many Africans access – in their own languages – to global news, educational materials, healthcare information, and the productivity gains AI can deliver.

When a language isn’t in the data, its speakers aren’t in the product, and AI cannot be safe, useful or fair for them. They end up missing the necessary language technology tools that could support service delivery. This marginalises millions of people and increases the technology divide.

What is your project doing about it – and how?

Our main objective is to collect speech data for automatic speech recognition (ASR). ASR is an important tool for languages that are largely spoken. This technology converts spoken language into written text.

The bigger ambition of our project is to explore how data for ASR is collected and how much of it is needed to create ASR tools. We aim to share our experiences across different geographic regions.

The data we collect is diverse by design: it includes spontaneous and read speech across various domains, among them everyday conversations, healthcare, financial inclusion and agriculture. We are collecting data from people of diverse ages, genders and educational backgrounds.
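One way to check that a collection is "diverse by design" is to tally recordings by their metadata. The Python sketch below is only an illustration: the field names (`language`, `speech_type`, `domain`), the `share_by` helper and the four sample records are all assumptions invented for the example, not the project's actual schema or data.

```python
from collections import Counter


def share_by(records: list[dict], field: str) -> dict[str, float]:
    """Return the fraction of records that fall under each value of `field`."""
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Hypothetical metadata for four recordings (not real project data).
recordings = [
    {"language": "Dholuo", "speech_type": "read", "domain": "healthcare"},
    {"language": "Dholuo", "speech_type": "spontaneous", "domain": "agriculture"},
    {"language": "Somali", "speech_type": "read", "domain": "agriculture"},
    {"language": "Kikuyu", "speech_type": "spontaneous", "domain": "healthcare"},
]

# Compare the shares against whatever balance the collection plan calls for.
print(share_by(recordings, "speech_type"))
print(share_by(recordings, "language"))
```

The same tally can be run on any metadata field, such as age band or gender, to spot a group that is under-represented before more recording sessions are scheduled.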

Every recording is collected with informed consent, fair compensation and clear data-rights terms. We transcribe following language-specific guidelines and run a wide range of other technical checks.

In Kenya, through Maseno Centre for Applied AI, we are collecting voice data for five languages. We’re capturing the three main language groups: Nilotic (Dholuo, Maasai and Kalenjin), Cushitic (Somali) and Bantu (Kikuyu).

Through Data Science Nigeria, we are collecting speech in five widely spoken languages – Bambara, Hausa, Igbo, Nigerian Pidgin and Yoruba. The dataset aims to accurately reflect authentic language use within these communities.

In South Africa, working through the Data Science for Social Impact lab and its collaborators, we have been recording seven South African languages. The aim is to reflect the country’s rich linguistic diversity: isiZulu, isiXhosa, Sesotho, Sepedi, Setswana, isiNdebele and Tshivenda.

Importantly, this work does not happen in isolation. We are building on the momentum and ideas from the Masakhane Research Foundation network, Lelapa AI, Mozilla Common Voice, EqualyzAI, and many other organisations and individuals who have been pioneering African language models, data and tooling.

Each project strengthens the others, and together they form a growing ecosystem committed to making African languages visible and usable in the age of AI.

How can this be put to use?

The data and models will be useful for captioning local-language media, for voice assistants in agriculture and health, and for call-centre and support services in these languages. The data will also be archived for cultural preservation.

Larger, balanced, publicly available African language datasets will allow us to connect text and speech resources. Models will not just be experimental, but useful in chatbots, education tools and local service delivery. The opportunity is there to go beyond datasets into ecosystems of tools (spell-checkers, dictionaries, translation systems, summarisation engines) that make African languages a living presence in digital spaces.

In short, we are pairing ethically collected, high-quality speech at scale with models. The aim is for people to be able to speak naturally, be understood accurately, and access AI in the languages they live their lives in.

What happens next for the project?

This project only collected voice data for certain languages. What of the remaining languages? What of other tools like machine translation or grammar checkers?

We will continue to work on multiple languages, ensuring that we build data and models that reflect how Africans use their languages. We prioritise building smaller language models that are both energy efficient and accurate for the African context.

The challenge now is integration: making these pieces work together so that African languages are not just represented in isolated demos, but in real-world platforms.

One of the lessons from this project, and others like it, is that collecting data is only step one. What matters is making sure that the data is benchmarked, reusable, and linked to communities of practice. For us, the “next” is to ensure that the ASR benchmarks we build can connect with other ongoing African efforts.
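ASR benchmarks are commonly reported as word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the system's output into the reference transcript, divided by the number of reference words. The Python sketch below shows that calculation under simple assumptions (whitespace word splitting, no text normalisation). The `word_error_rate` function name and the placeholder tokens are illustrative and do not describe the project's own evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two texts, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens stand in for a human transcript and an ASR output.
reference = "w1 w2 w3 w4"
hypothesis = "w1 wX w3"  # one substituted word, one missing word

print(word_error_rate(reference, hypothesis))
```

Because the score depends on how words are split and spelled, shared transcription guidelines matter: two benchmarks are only comparable if they normalise text the same way.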

We also need to ensure sustainability: that students, researchers, and innovators have continued access to compute (computer resources and processing power), training materials and licensing frameworks (like NOODL or Esethu). The long-term vision is to enable choice, so that a farmer, a teacher, or a local business can use AI in isiZulu, Hausa, or Kikuyu, not just in English or French.

If we succeed, built-in AI in African languages won’t just be catching up. It will be setting new standards for inclusive, responsible AI worldwide.

Vukosi Marivate, Chair of Data Science, Professor of Computer Science, Director AfriDSAI, University of Pretoria; Ife Adebara, Assistant Professor, University of Alberta, and Lilian Wanzare, Lecturer and chair of the Department of Computer Science, Maseno University

This article is republished from The Conversation under a Creative Commons license. Read the original article.
