---
language:
- hi
- en
base_model:
- bharatgenai/Param-1
pipeline_tag: text-generation
library_name: transformers
license: apache-2.0
---
<div align="center">
  <img src="./BharatGen Logo (1).png" width="60%" alt="BharatGen" />
</div>
<hr>
<div align="center">
  <a href="https://arxiv.org/abs/2507.13390" target="_blank" style="margin: 4px;">
    <img alt="Paper" src="https://img.shields.io/badge/%20Paper-arxiv-0033ad?style=flat&logo=arxiv&logoColor=white" />
  </a>
  <a href="https://huggingface.co/bharatgenai/Param-1-2.9B-Instruct/blob/main/LICENSE" target="_blank" style="margin: 4px;">
    <img alt="License" src="https://img.shields.io/badge/License-yellow.svg" />
  </a>
</div>

# Param-1-2.9B-Instruct

**BharatGen** introduces an early supervised fine-tuned (SFT) checkpoint of **Param 1**, a 2.9-billion-parameter bilingual language model trained from scratch in English and Hindi. This checkpoint builds on the pretraining phase and serves as a foundation for downstream tasks, safety testing, and further customization.

---

## Pre-Training Details
* **Dataset**: 7.5 trillion tokens
* **Data Quality**: Highly curated with standard filtering and multiple processing steps
* **Scheduler**: Cosine annealing (see the sketch below)
* **Learning Rate**: 3e-4 decaying to 3e-6
* **Training Hardware**: 512 H100 GPUs
* **Framework**: NVIDIA NeMo
* **Precision**: bf16-mixed

* Pre-trained checkpoint (Param 1): https://aikosh.indiaai.gov.in/home/models/details/bharatgen_param_1_indic_scale_bilingual_foundation_model.html
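The card lists only the peak and final learning rates; the warmup strategy and total step count are not stated. As a rough illustration of how a cosine-annealed schedule moves between these two values, here is a minimal sketch (the `total_steps` value is a placeholder, not the actual Param-1 training length):

```python
import math

def cosine_annealed_lr(step: int, total_steps: int,
                       peak_lr: float = 3e-4, min_lr: float = 3e-6) -> float:
    """Cosine decay from peak_lr to min_lr over total_steps (illustrative only)."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Sample the schedule at a few points of a hypothetical run
total_steps = 100_000  # placeholder, not the real step count
for step in (0, 25_000, 50_000, 100_000):
    print(step, f"{cosine_annealed_lr(step, total_steps):.2e}")
```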
---

## SFT Training Details
* **Dataset**: 0.8 million samples
* **Epochs**: 3
* **Scheduler**: Cosine annealing
* **Learning Rate**: 5e-6 decaying to 5e-8
* **Training Hardware**: 32 H200 GPUs
* **Framework**: NVIDIA NeMo
* **Precision**: bf16-mixed

Filtered, high-quality bilingual data combining public and in-house sources was used to encourage safety-aware and culturally relevant behavior.

---

## 🚀 Model Inference
* After cloning, provide the model path (the local model directory or the Hugging Face model ID) in the inference script below.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load tokenizer and model
model_name = "bharatgenai/Param-1-2.9B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=False)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto"
)

# Conversation input
conversation = [
    {
        "content": "You are a helpful assistant.",
        "role": "system"
    },
    {
        "content": "What is the BharatGen Mission?",
        "role": "user"
    }
]

# Apply the chat template and tokenize the conversation
inputs = tokenizer.apply_chat_template(
    conversation=conversation,
    return_tensors="pt",
    add_generation_prompt=True
)
inputs = inputs.to(model.device)

# --- Generate output ---
with torch.no_grad():
    output = model.generate(
        inputs,
        max_new_tokens=300,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.6,
        eos_token_id=tokenizer.eos_token_id,
        use_cache=False
    )

# Keep only the newly generated tokens (exclude the prompt)
generated_tokens = output[0][inputs.shape[-1]:]
generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print("Assistant Output:\n", generated_text)
```

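For interactive use you may prefer to stream tokens as they are generated rather than waiting for the full completion. The sketch below reuses `model`, `tokenizer`, and `inputs` from the snippet above and relies on the standard `transformers` `TextStreamer` utility; it is an optional convenience, not part of the official example.

```python
from transformers import TextStreamer

# Stream decoded tokens to stdout as they are generated (do not echo the prompt)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

with torch.no_grad():
    model.generate(
        inputs,
        max_new_tokens=300,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.6,
        eos_token_id=tokenizer.eos_token_id,
        use_cache=False,
        streamer=streamer,
    )
```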
---

## 📊 Benchmarks (zero-shot)

| Task                    | Param 1 (PT) | Gemma2-2B (PT) | llama3.2-3B (distill PT) | granite-3.1-2B (PT) | granite-3.1-3B (PT) | qwen-2.5-3B (PT) |
| ----------------------- | ------------ | -------------- | ------------------------ | ------------------- | ------------------- | ---------------- |
| ARC Challenge           | 46.7         | 49.7           | 46.0                     | 47.2                | 45.2                | 47.4             |
| ARC Easy                | 74.6         | 80.3           | 71.7                     | 76.8                | 75.8                | 73.2             |
| HellaSwag               | 71.4         | 73.0           | 73.7                     | 75.5                | 72.6                | 73.6             |
| HellaSwag Hi            | 44.1         | 38.6           | 40.0                     | 31.0                | 28.5                | 32.9             |
| MMLU En                 | 41.4         | 47.1           | 53.9                     | 47.8                | 41.0                | 64.9             |
| MMLU Hi                 | 30.7         | 30.0           | 35.0                     | 29.0                | 25.7                | 38.32            |
| PIQA                    | 79.3         | 78.3           | 77.31                    | 79.4                | 78.2                | 78.84            |
| TriviaQA                | 38.5         | 32.9           | 50.83                    | 26.2                | 27.5                | 42.27            |
| TruthfulQA - Gen (BLEU) | 38.2         | 29.7           | 21.8                     | 34.0                | 36.7                | 36.96            |
| TruthfulQA - MC1 Acc    | 28.0         | 24.0           | 25.3                     | 26.1                | 26.4                | 32.07            |
| TruthfulQA - MC2 Acc    | 43.8         | 36.2           | 39.2                     | 39.0                | 39.9                | 48.95            |
| SuperGLUE - boolq       | 70.6         | 73.7           | 72.7                     | 71.0                | 68.5                | 77.27            |
| SuperGLUE - rte         | 62.5         | 61.7           | 54.5                     | 69.3                | 54.9                | 75.09            |
| SuperGLUE - WiC         | 49.5         | 49.5           | 50.0                     | 50.3                | 52.3                | 61.75            |
| SuperGLUE - multirc     | 56.9         | 55.9           | 57.2                     | 57.2                | 57.2                | 39.52            |

> **Notes:**
>
> * Benchmarks reflect **zero-shot** performance post-SFT.
> * **PT** = Pretrained
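The card does not state which evaluation harness produced these numbers. If you want to run comparable zero-shot evaluations yourself, one common option is EleutherAI's lm-evaluation-harness; the snippet below is a minimal sketch using its `simple_evaluate` entry point, with the task list and batch size chosen purely for illustration (the Hindi variants above may require custom task definitions).

```python
import lm_eval  # pip install lm-eval

# Minimal zero-shot run on a few of the English tasks from the table above
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=bharatgenai/Param-1-2.9B-Instruct,trust_remote_code=True,dtype=bfloat16",
    tasks=["arc_challenge", "arc_easy", "hellaswag"],
    num_fewshot=0,
    batch_size=8,
)

# Print the per-task metric dictionaries
for task, metrics in results["results"].items():
    print(task, metrics)
```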

---

## 🧠 Model Architecture

* **Hidden size**: 2048
* **Intermediate size**: 7168
* **Attention heads**: 16
* **Hidden layers**: 32
* **Key-value heads**: 8
* **Max position embeddings**: 2048
* **Activation**: SiLU
* **Positional Embeddings**: Rotary (RoPE, theta=10000)
* **Attention Mechanism**: Grouped-query attention
* **Precision**: bf16-mixed
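To confirm these values against the released checkpoint, you can inspect the config shipped with the model. The sketch below assumes Llama-style field names (`hidden_size`, `num_attention_heads`, and so on), which match the architecture described above but are an assumption about the repo's custom config class.

```python
from transformers import AutoConfig

# Load the config shipped with the checkpoint (custom code, hence trust_remote_code)
config = AutoConfig.from_pretrained(
    "bharatgenai/Param-1-2.9B-Instruct", trust_remote_code=True
)

# Field names below assume a Llama-style config; adjust if the repo differs
for field in (
    "hidden_size",
    "intermediate_size",
    "num_attention_heads",
    "num_hidden_layers",
    "num_key_value_heads",
    "max_position_embeddings",
    "rope_theta",
):
    print(field, getattr(config, field, "n/a"))
```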

---

## Important Guidelines for Early Checkpoint Release of Param-1-2.9B-Instruct

1. **Early Development Status**
   * This checkpoint represents an early stage of the Param-1 Instruct model.
   * It has not yet undergone full supervised fine-tuning, safety alignment, or rigorous evaluation.
   * The release is intended to showcase progress, gather feedback, and encourage research and experimentation.
   * Outputs may at times be incoherent, irrelevant, or of suboptimal quality.

2. **Data Sources and Potential Artifacts**
   * To preserve the model's global knowledge, part of the training data was crawled from the internet and may therefore contain inherited artifacts.
   * Because AI-generated content is increasingly common online, the model may occasionally mimic such statements and misidentify itself.
   * These artifacts are a natural consequence of using publicly available web data, which is nonetheless important for building the model's general world knowledge; we will address such issues in future iterations.

3. **Lack of Alignment and Guardrails**
   * Only preliminary alignment and safety mechanisms have been implemented at this stage.
   * The model has not yet undergone full-scale instruction tuning, supervised fine-tuning, or reinforcement learning from human feedback (RLHF).
   * As a result, it may occasionally:
     * Generate biased, offensive, or unsafe content
     * Be susceptible to misuse or prompt injection (jailbreaking)
     * Respond to harmful or unethical prompts without refusal
   * This model must not be deployed in production without reading the Intended Use section below.

4. **Intended Use**
   * This release is provided exclusively for research, experimentation, and contribution to the open-source community.
   * Suggested use cases include:
     * Assessing early-stage LLM behavior
     * Debugging model training pipelines and configurations
     * Benchmarking or custom fine-tuning by the community
   * We hope access to this early checkpoint motivates the open-source community to build India-specific, innovative use cases on top of it and helps foster innovation across the community.

5. **Licensing and Responsibility**
   * Released under an open license with responsible-usage guidelines.
   * License: MIT
   * Users are expected to:
     * Adhere to ethical usage practices and legal regulations
     * Avoid malicious or unsafe deployment
     * Credit the authors as per the licensing terms

6. **Acknowledgement of Origin**
   * A home-grown effort initiated in India with limited resources.
   * This work represents a bottom-up initiative to develop LLMs from scratch within India.
   * It reflects our humble, resource-constrained journey to contribute meaningfully to the open-source AI ecosystem.
   * We hope to foster collaboration and growth within the broader community.

7. **Transparency & Community Collaboration**
   * We welcome contributions and open dialogue.
   * We encourage the community to share feedback, report issues, and collaborate.
   * Future versions will introduce better alignment, improved training scale, and more curated datasets.
   * Together, we aim to evolve toward safer and more capable AI systems.

---