---
license: apache-2.0
language:
- fa
library_name: sentence-transformers
pipeline_tag: feature-extraction
tags:
- sentence-embeddings
- text-retrieval
- persian
- e5
---

# Model Card for safora/persian-e5-large-scientific-retriever

## Model Description

This model is a fine-tuned version of `safora/persian-science-qa-e5-large`, optimized for **high-performance information retrieval in the Persian scientific domain**. It is designed to serve as a core component of a Retrieval-Augmented Generation (RAG) system, where it identifies the most relevant documents in a large corpus in response to a user's query.

The model was fine-tuned to address a common failure mode in RAG systems: retrieving documents that are thematically related to the query but factually incorrect. By training on a rigorously cleaned dataset of "hard negatives," the model learns to make more precise, discriminative distinctions, significantly improving the quality of the context passed to a generative model.

## Intended Uses & Limitations

This model is intended for embedding Persian text in retrieval tasks. Given a query, it can find the most relevant scientific abstracts or documents in a corpus via semantic search.

```python
from sentence_transformers import SentenceTransformer

sentences = ["این یک نمونه جمله است", "این جمله دیگری است"]

model = SentenceTransformer('safora/persian-e5-large-scientific-retriever')
embeddings = model.encode(sentences)
print(embeddings)
```

While highly effective for scientific text, its performance on general-purpose or conversational text may not exceed that of the original base model.

## Fine-Tuning Methodology

The performance of this model is a direct result of a meticulous, data-centric fine-tuning process.

### Source Data and Training Dataset

The model was fine-tuned on a custom-built dataset of 1,016 triplets created from a corpus of Persian scientific documents. This dataset, named `retriever_finetuning_triplets.jsonl`, is also available on the Hugging Face Hub at `safora/persian-scientific-qa-triplets`.

The creation of this dataset involved a multi-stage pipeline:

1. **Heuristic Filtering:** An initial set of 10,000+ generated question-answer pairs was filtered based on length, format, and language rules.
2. **Semantic Validation:** A cross-encoder (`safora/reranker-xlm-roberta-large`) was used to validate the semantic relevance between questions and their source abstracts. Pairs with a relevance score below 0.85 were discarded, yielding a high-confidence set of positive pairs.
3. **Hard-Negative Mining:** For each high-confidence question, the entire corpus was searched for the most similar but incorrect documents. These "hard negatives" are crucial for teaching the model fine-grained distinctions (a sketch of this step follows the Training Procedure section below).

This process transformed the positive pairs into a robust triplet dataset of (query, positive_passage, negative_passage).

### Training Procedure

The model was fine-tuned with the `sentence-transformers` library using a `MultipleNegativesRankingLoss` objective. The triplet dataset was split into a 90% training set and a 10% evaluation set to monitor for overfitting and to save the best-performing checkpoint.
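The three sketches below illustrate how such a pipeline can be reproduced with `sentence-transformers`. Every corpus, file name, and hyperparameter in them is an illustrative assumption, not the exact configuration used. First, the hard-negative mining step (step 3 of the pipeline above): embed the corpus with the base bi-encoder and, for each query, take the top-ranked passage that is not its true source.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative sketch of hard-negative mining; the encoder choice, corpus,
# and top_k are assumptions, not the exact setup used for this model.
encoder = SentenceTransformer('safora/persian-science-qa-e5-large')

corpus = ["...چکیده ۱...", "...چکیده ۲..."]   # full document corpus
pairs = [("یک پرسش علمی", 0)]                 # (query, index of its true source passage)

corpus_emb = encoder.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

triplets = []
for query, pos_idx in pairs:
    query_emb = encoder.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    # Rank the whole corpus by cosine similarity to the query.
    hits = util.semantic_search(query_emb, corpus_emb, top_k=5)[0]
    # The highest-ranked passage that is NOT the true source becomes the hard negative.
    negative = next(h for h in hits if h["corpus_id"] != pos_idx)
    triplets.append((query, corpus[pos_idx], corpus[negative["corpus_id"]]))
```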
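Next, a minimal fine-tuning loop with `MultipleNegativesRankingLoss`, where each triplet is packed into an `InputExample`; the batch size, epochs, and warmup steps shown are assumptions:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('safora/persian-science-qa-e5-large')

# Triplets as produced by the mining step above: (query, positive, hard negative).
triplets = [("یک پرسش علمی", "چکیدهٔ مرتبط ...", "چکیدهٔ مشابه اما نادرست ...")]
train_examples = [InputExample(texts=list(t)) for t in triplets]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# MultipleNegativesRankingLoss also treats other in-batch passages as negatives,
# in addition to the explicitly mined hard negatives.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path='persian-e5-large-scientific-retriever',
)
```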
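Finally, a held-out split can be scored with `InformationRetrievalEvaluator`, which computes the Accuracy@k, Recall@k, MRR@k, and MAP@k metrics reported in the next section; the query/document ID layout below is an assumed example:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer('safora/persian-e5-large-scientific-retriever')

# Assumed data layout: query IDs, document IDs, and relevance judgments.
queries = {"q1": "یک پرسش علمی"}                          # qid -> query text
corpus = {"d1": "...چکیده ۱...", "d2": "...چکیده ۲..."}   # did -> passage text
relevant_docs = {"q1": {"d1"}}                             # qid -> set of relevant dids

evaluator = InformationRetrievalEvaluator(
    queries, corpus, relevant_docs,
    accuracy_at_k=[1], precision_recall_at_k=[5], mrr_at_k=[10], map_at_k=[100],
)
results = evaluator(model)
print(results)
```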
## Evaluation Results

A comparative evaluation was conducted between this fine-tuned model and the original `safora/persian-science-qa-e5-large` base model on a held-out test set. The results show a substantial and consistent improvement across all standard information retrieval metrics.

|                  | Accuracy@1 | Recall@5 | MRR@10 | MAP@100 |
|:-----------------|-----------:|---------:|-------:|--------:|
| Base Model       | 0.7255     | 0.9118   | 0.8167 | 0.8178  |
| Fine-Tuned Model | 0.8431     | 1.0000   | 0.9216 | 0.9216  |

The most important result for RAG applications is the Recall@5 score of 1.0: on this test set, the correct document appeared in the top 5 results 100% of the time. This ensures the generative component of a RAG system consistently receives the correct context.

## Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{safora_persian_sci_retriever_2025,
  author    = {Safora Jolfaei},
  title     = {A High-Performance Embedding Model for Persian Scientific Information Retrieval},
  year      = {2025},
  publisher = {Hugging Face},
  journal   = {Hugging Face Hub},
  url       = {https://huggingface.co/safora/persian-e5-large-scientific-retriever}
}
```