Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval
Paper • 2311.05800 • Published • 4
29 million Synthetic Wikipedia-based Multilingual Retrieval Training Pairs.
Note: SWIM-IR (Cross-lingual) dataset, where the query is in the target language and the passage is in English.
Note: SWIM-IR (Monolingual) dataset, where both the query and the passage are in the target language.
Note: Indic SWIM-IR (Cross-lingual) dataset, where the query is in an Indic language and the passage is in English.