🦢SWIM-IR Dataset [NAACL'24] - a nthakur Collection

Updated Mar 31, 2025

29 million Synthetic Wikipedia-based Multilingual Retrieval Training Pairs.

Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval

Paper • 2311.05800 • Published Nov 10, 2023

nthakur/swim-ir-cross-lingual

Updated Apr 28, 2024 • 15.4M rows • 362 downloads • 9 likes

Note: SWIM-IR (Cross-lingual) dataset, where the query is in the target language and the passage is in English.

nthakur/swim-ir-monolingual

Updated Apr 28, 2024 • 3.17M rows • 163 downloads • 10 likes

Note: SWIM-IR (Monolingual) dataset, where both the query and the passage are in the target language.

nthakur/indic-swim-ir-cross-lingual

Updated Apr 28, 2024 • 93k rows • 118 downloads • 2 likes

Note: Indic SWIM-IR (Cross-lingual) dataset, where the query is in an Indo-European language and the passage is in English.
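
To peek at any of the three datasets listed here without downloading it in full, something like the following may work. It is a minimal sketch using the Hugging Face `datasets` library in streaming mode. The repository IDs come from this collection; the helper names (`repo_for`, `stream_examples`), the `config` argument, and the `"train"` split are assumptions, so check each dataset card for the actual configuration and split names.

```python
from typing import Iterator

# Repository IDs exactly as listed in this collection.
SWIM_IR_REPOS = {
    "cross-lingual": "nthakur/swim-ir-cross-lingual",
    "monolingual": "nthakur/swim-ir-monolingual",
    "indic-cross-lingual": "nthakur/indic-swim-ir-cross-lingual",
}


def repo_for(variant: str) -> str:
    """Return the Hub repository ID for a SWIM-IR variant (hypothetical helper)."""
    try:
        return SWIM_IR_REPOS[variant]
    except KeyError:
        raise ValueError(
            f"unknown variant {variant!r}; expected one of {sorted(SWIM_IR_REPOS)}"
        ) from None


def stream_examples(
    variant: str, config: str, split: str = "train", limit: int = 3
) -> Iterator[dict]:
    """Yield the first `limit` examples from a SWIM-IR variant via streaming.

    `config` and `split` are assumptions, not names confirmed by this page.
    """
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    ds = load_dataset(repo_for(variant), name=config, split=split, streaming=True)
    for i, example in enumerate(ds):
        if i >= limit:
            break
        yield example
```

For example, `for ex in stream_examples("monolingual", config="de"): print(ex)` would print a few records, where `"de"` is a hypothetical configuration name used only for illustration. Streaming avoids pulling millions of rows to disk just to inspect the record layout.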