This dataset includes ASR data from rural women speaking Hindi and Bhojpuri, supporting inclusive voice recognition.
AI4Bharat
non-profit
Verified
Romansetu is a collection of models that address the challenge of extending Large Language Models (LLMs) to non-English languages using non-Latin scripts
A Speech Translation Dataset for 13 Indian Languages
Hercule series of Evaluation models
Largest Collections of Pretraining and Instruction Finetuning datasets for 22 Indic languages.
Models (En-Indic, Indic-En, Indic-Indic) in 2 variants (base and dist) and benchmarks (IN22-Gen and IN22-Conv) released as part of IndicTrans2.
- ai4bharat/indictrans2-en-indic-1B
  Translation • 1B • Updated • 7.51k • 41
- ai4bharat/indictrans2-en-indic-dist-200M
  Translation • 0.3B • Updated • 5.16k • 19
- ai4bharat/indictrans2-indic-en-1B
  Translation • 1B • Updated • 3.63k • 25
- ai4bharat/indictrans2-indic-en-dist-200M
  Translation • 0.2B • Updated • 2.97k • 6
IndicBERT v2 is a multilingual BERT model pretrained on IndicCorp v2, an Indic monolingual corpus of 20.9 billion tokens, covering 24 constitutionally
A comprehensive dataset collection for Indic language information retrieval.
Collection of Parler-TTS models adapted to Indian languages.
ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams.
A collection of ASR models for 22 scheduled languages of India
- ai4bharat/indic-conformer-600m-multilingual
  Updated • 31.4k • 37
- ai4bharat/indicconformer_stt_as_hybrid_ctc_rnnt_large
  Automatic Speech Recognition • Updated • 15 • 4
- ai4bharat/indicconformer_stt_bn_hybrid_ctc_rnnt_large
  Automatic Speech Recognition • Updated • 84 • 1
- ai4bharat/indicconformer_stt_brx_hybrid_ctc_rnnt_large
  Automatic Speech Recognition • Updated • 5
A collection of benchmarks used to evaluate Airavata, a Hindi instruction-tuned model built on top of Sarvam's OpenHathi base model.
IndicXTREME is a human-supervised benchmark of 9 diverse NLU tasks across 20 languages, featuring 105 evaluation sets in total.
IndicNLG Benchmark is a dataset collection designed for benchmarking Natural Language Generation (NLG) across 11 Indic languages.