---
language:
- hi
- en
base_model:
- bharatgenai/Param-1
pipeline_tag: text-generation
library_name: transformers
license: apache-2.0
---
<div align="center">
  <img src="./BharatGen Logo (1).png" width="60%" alt="BharatGen" />
</div>
<hr>
<div align="center">
  <a href="https://arxiv.org/abs/2507.13390" target="_blank" style="margin: 4px;">
    <img alt="Paper" src="https://img.shields.io/badge/%20Paper-arxiv-0033ad?style=flat&logo=arxiv&logoColor=white" />
  </a>
  <a href="https://huggingface.co/bharatgenai/Param-1-2.9B-Instruct/blob/main/LICENSE" target="_blank" style="margin: 4px;">
    <img alt="License" src="https://img.shields.io/badge/License-yellow.svg" />
  </a>
</div>

# Param-1-2.9B-Instruct

**BharatGen** introduces an early supervised fine-tuned (SFT) checkpoint of **Param 1**, a 2.9-billion-parameter bilingual language model trained from scratch in English and Hindi. This checkpoint builds on the pretraining phase and serves as a foundation for downstream tasks, safety testing, and further customization.

---

## Pre-Training Details
* **Dataset**: 7.5 trillion tokens
* **Data Quality**: Highly curated with standard filtering and multiple processing steps
* **Scheduler**: Cosine annealing (see the sketch below)
* **Learning Rate**: 3e-4 decaying to 3e-6
* **Training Hardware**: 512 H100 GPUs
* **Framework**: NVIDIA NeMo
* **Precision**: bf16-mixed

* Pre-trained checkpoint (Param 1): https://aikosh.indiaai.gov.in/home/models/details/bharatgen_param_1_indic_scale_bilingual_foundation_model.html
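The card lists only the peak and final learning rates; the warmup strategy and total step count are not stated. As a rough illustration of how a cosine-annealed schedule moves between these two values, here is a minimal sketch (the `total_steps` value is a placeholder, not the actual Param-1 training length):

```python
import math

def cosine_annealed_lr(step: int, total_steps: int,
                       peak_lr: float = 3e-4, min_lr: float = 3e-6) -> float:
    """Cosine decay from peak_lr to min_lr over total_steps (illustrative only)."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Sample the schedule at a few points of a hypothetical run
total_steps = 100_000  # placeholder, not the real step count
for step in (0, 25_000, 50_000, 100_000):
    print(step, f"{cosine_annealed_lr(step, total_steps):.2e}")
```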
---

## SFT Training Details
* **Dataset**: 0.8 million samples
* **Epochs**: 3
* **Scheduler**: Cosine annealing
* **Learning Rate**: 5e-6 decaying to 5e-8
* **Training Hardware**: 32 H200 GPUs
* **Framework**: NVIDIA NeMo
* **Precision**: bf16-mixed

Filtered, high-quality bilingual data combining public and in-house sources was used to encourage safety-aware and culturally relevant behavior.

---

## 🚀 Model Inference
* After cloning, provide the model path (the local model directory or the Hugging Face model ID) in the inference script below.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load tokenizer and model
model_name = "bharatgenai/Param-1-2.9B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=False)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto"
)

# Conversation input
conversation = [
    {
        "content": "You are a helpful assistant.",
        "role": "system"
    },
    {
        "content": "What is the BharatGen Mission?",
        "role": "user"
    }
]

# Apply the chat template and tokenize the conversation
inputs = tokenizer.apply_chat_template(
    conversation=conversation,
    return_tensors="pt",
    add_generation_prompt=True
)
inputs = inputs.to(model.device)

# --- Generate output ---
with torch.no_grad():
    output = model.generate(
        inputs,
        max_new_tokens=300,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.6,
        eos_token_id=tokenizer.eos_token_id,
        use_cache=False
    )

# Keep only the newly generated tokens (exclude the prompt)
generated_tokens = output[0][inputs.shape[-1]:]
generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print("Assistant Output:\n", generated_text)
```

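For interactive use you may prefer to stream tokens as they are generated rather than waiting for the full completion. The sketch below reuses `model`, `tokenizer`, and `inputs` from the snippet above and relies on the standard `transformers` `TextStreamer` utility; it is an optional convenience, not part of the official example.

```python
from transformers import TextStreamer

# Stream decoded tokens to stdout as they are generated (do not echo the prompt)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

with torch.no_grad():
    model.generate(
        inputs,
        max_new_tokens=300,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.6,
        eos_token_id=tokenizer.eos_token_id,
        use_cache=False,
        streamer=streamer,
    )
```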
---

## 📊 Benchmarks (zero-shot)

| Task                    | Param 1 (PT) | Gemma2-2B (PT) | llama3.2-3B (distill PT) | granite-3.1-2B (PT) | granite-3.1-3B (PT) | qwen-2.5-3B (PT) |
| ----------------------- | ------------ | -------------- | ------------------------ | ------------------- | ------------------- | ---------------- |
| ARC Challenge           | 46.7         | 49.7           | 46.0                     | 47.2                | 45.2                | 47.4             |
| ARC Easy                | 74.6         | 80.3           | 71.7                     | 76.8                | 75.8                | 73.2             |
| HellaSwag               | 71.4         | 73.0           | 73.7                     | 75.5                | 72.6                | 73.6             |
| HellaSwag Hi            | 44.1         | 38.6           | 40.0                     | 31.0                | 28.5                | 32.9             |
| MMLU En                 | 41.4         | 47.1           | 53.9                     | 47.8                | 41.0                | 64.9             |
| MMLU Hi                 | 30.7         | 30.0           | 35.0                     | 29.0                | 25.7                | 38.32            |
| PIQA                    | 79.3         | 78.3           | 77.31                    | 79.4                | 78.2                | 78.84            |
| TriviaQA                | 38.5         | 32.9           | 50.83                    | 26.2                | 27.5                | 42.27            |
| TruthfulQA - Gen (BLEU) | 38.2         | 29.7           | 21.8                     | 34.0                | 36.7                | 36.96            |
| TruthfulQA - MC1 Acc    | 28.0         | 24.0           | 25.3                     | 26.1                | 26.4                | 32.07            |
| TruthfulQA - MC2 Acc    | 43.8         | 36.2           | 39.2                     | 39.0                | 39.9                | 48.95            |
| SuperGLUE - boolq       | 70.6         | 73.7           | 72.7                     | 71.0                | 68.5                | 77.27            |
| SuperGLUE - rte         | 62.5         | 61.7           | 54.5                     | 69.3                | 54.9                | 75.09            |
| SuperGLUE - WiC         | 49.5         | 49.5           | 50.0                     | 50.3                | 52.3                | 61.75            |
| SuperGLUE - multirc     | 56.9         | 55.9           | 57.2                     | 57.2                | 57.2                | 39.52            |

> **Notes:**
>
> * Benchmarks reflect **zero-shot** performance post-SFT.
> * **PT** = Pretrained
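The card does not state which evaluation harness produced these numbers. If you want to run comparable zero-shot evaluations yourself, one common option is EleutherAI's lm-evaluation-harness; the snippet below is a minimal sketch using its `simple_evaluate` entry point, with the task list and batch size chosen purely for illustration (the Hindi variants above may require custom task definitions).

```python
import lm_eval  # pip install lm-eval

# Minimal zero-shot run on a few of the English tasks from the table above
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=bharatgenai/Param-1-2.9B-Instruct,trust_remote_code=True,dtype=bfloat16",
    tasks=["arc_challenge", "arc_easy", "hellaswag"],
    num_fewshot=0,
    batch_size=8,
)

# Print the per-task metric dictionaries
for task, metrics in results["results"].items():
    print(task, metrics)
```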

---

## 🧠 Model Architecture

* **Hidden size**: 2048
* **Intermediate size**: 7168
* **Attention heads**: 16
* **Hidden layers**: 32
* **Key-value heads**: 8
* **Max position embeddings**: 2048
* **Activation**: SiLU
* **Positional Embeddings**: Rotary (RoPE, theta=10000)
* **Attention Mechanism**: Grouped-query attention
* **Precision**: bf16-mixed
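To confirm these values against the released checkpoint, you can inspect the config shipped with the model. The sketch below assumes Llama-style field names (`hidden_size`, `num_attention_heads`, and so on), which match the architecture described above but are an assumption about the repo's custom config class.

```python
from transformers import AutoConfig

# Load the config shipped with the checkpoint (custom code, hence trust_remote_code)
config = AutoConfig.from_pretrained(
    "bharatgenai/Param-1-2.9B-Instruct", trust_remote_code=True
)

# Field names below assume a Llama-style config; adjust if the repo differs
for field in (
    "hidden_size",
    "intermediate_size",
    "num_attention_heads",
    "num_hidden_layers",
    "num_key_value_heads",
    "max_position_embeddings",
    "rope_theta",
):
    print(field, getattr(config, field, "n/a"))
```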

---

## Important Guidelines for Early Checkpoint Release of Param-1-2.9B-Instruct

1. **Early Development Status**
   * This checkpoint represents an early stage of the Param-1 Instruct model.
   * It has not yet undergone full supervised fine-tuning, safety alignment, or rigorous evaluation.
   * The release is intended to showcase progress, gather feedback, and encourage research and experimentation.
   * Outputs may at times be incoherent, irrelevant, or of suboptimal quality.

2. **Data Sources and Potential Artifacts**
   * To preserve the model's global knowledge, part of the training data was crawled from the internet and may therefore contain inherited artifacts.
   * Because AI-generated content is increasingly common online, the model may occasionally mimic such statements and misidentify itself.
   * These artifacts are a natural consequence of using publicly available web data, which is nonetheless important for building the model's general world knowledge; we will address such issues in future iterations.

3. **Lack of Alignment and Guardrails**
   * Only preliminary alignment and safety mechanisms have been implemented at this stage.
   * The model has not yet undergone full-scale instruction tuning, supervised fine-tuning, or reinforcement learning from human feedback (RLHF).
   * As a result, it may occasionally:
     * Generate biased, offensive, or unsafe content
     * Be susceptible to misuse or prompt injection (jailbreaking)
     * Respond to harmful or unethical prompts without refusal
   * This model must not be deployed in production without reading the Intended Use section below.

4. **Intended Use**
   * This release is provided exclusively for research, experimentation, and contribution to the open-source community.
   * Suggested use cases include:
     * Assessing early-stage LLM behavior
     * Debugging model training pipelines and configurations
     * Benchmarking or custom fine-tuning by the community
   * We hope access to this early checkpoint motivates the open-source community to build India-specific, innovative use cases on top of it and helps foster innovation across the community.

5. **Licensing and Responsibility**
   * Released under an open license with responsible-usage guidelines.
   * License: MIT
   * Users are expected to:
     * Adhere to ethical usage practices and legal regulations
     * Avoid malicious or unsafe deployment
     * Credit the authors as per the licensing terms

6. **Acknowledgement of Origin**
   * A home-grown effort initiated in India with limited resources.
   * This work represents a bottom-up initiative to develop LLMs from scratch within India.
   * It reflects our humble, resource-constrained journey to contribute meaningfully to the open-source AI ecosystem.
   * We hope to foster collaboration and growth within the broader community.

7. **Transparency & Community Collaboration**
   * We welcome contributions and open dialogue.
   * We encourage the community to share feedback, report issues, and collaborate.
   * Future versions will introduce better alignment, improved training scale, and more curated datasets.
   * Together, we aim to evolve toward safer and more capable AI systems.

---