YALM-130M

YALM (Yet Another Language Model) is a family of experimental small language models that I am developing as part of my ongoing exploration of language modeling and LLM architectures.

YALM-130M is the second model in this series. It is trained on a diverse corpus of English, Hindi, math, and Python code to test its capacity for multilingual and technical reasoning.

Model Overview:

  • Architecture: Llama
  • Pretraining steps: 40k
  • Pretraining tokens: 42B
  • Precision: bfloat16
  • Number of Parameters: 130M
  • Number of Parameters (Non-Embedding): 113M
  • Number of Layers: 16
  • Number of Attention Heads (GQA): 16 for Q and 2 for KV
  • Context Length: 2048
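
The head counts and parameter totals above imply a couple of derived figures. The following is a minimal sketch of that arithmetic, using only the numbers listed in the overview. The variable names are illustrative, and the parameter counts are the rounded values from the list, so the embedding figure is approximate.

```python
# Figures taken from the Model Overview list (rounded as reported there).
num_query_heads = 16          # attention heads for Q
num_kv_heads = 2              # attention heads for K and V (GQA)
total_params = 130e6          # all parameters
non_embedding_params = 113e6  # parameters outside the embedding table(s)

# In grouped-query attention, query heads are split evenly across KV heads.
queries_per_kv_head = num_query_heads // num_kv_heads

# Parameters left over for the embedding table(s).
embedding_params = total_params - non_embedding_params

print(f"query heads sharing each KV head: {queries_per_kv_head}")
print(f"approx. embedding parameters: {embedding_params / 1e6:.0f}M")
```

Because only 2 of the 16 heads carry keys and values, the KV cache is 8 times smaller than it would be with standard multi-head attention at the same head dimension.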

Usage

>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> # Load the tokenizer and model weights from the Hugging Face Hub
>>> tokenizer = AutoTokenizer.from_pretrained("kp7742/YALM-130M")
>>> model = AutoModelForCausalLM.from_pretrained("kp7742/YALM-130M")
>>> # Tokenize a prompt and generate up to 100 new tokens
>>> inputs = tokenizer("Hey how are you doing?", return_tensors="pt")
>>> out = model.generate(**inputs, max_new_tokens=100)
>>> # Decode the generated token IDs back to text
>>> print(tokenizer.batch_decode(out))

Training

Data

This model is pre-trained on the YALM-pretrain6-62M dataset.

Hyperparameters

  • learning_rate: 6e-3
  • train_batch_size: 16
  • eval_batch_size: 16
  • distributed_type: multi-GPU DDP
  • num_devices: 4
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 512
  • total_eval_batch_size: 64
  • optimizer: AdamW with betas=(0.9, 0.95) and epsilon=1e-08
  • lr_scheduler_type: warmup_stable_decay
  • lr_scheduler_warmup_steps: 4000
  • training_steps: 40000
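
The batch-size and token totals follow directly from the values above, and the scheduler settings describe a warmup, plateau, and decay shape. The sketch below reproduces the arithmetic and illustrates one possible warmup-stable-decay curve. Only the warmup length, peak learning rate, and total steps come from this card; the linear warmup and decay shapes, `decay_steps`, and `min_lr` are assumptions for illustration and may differ from the actual training run.

```python
# Values copied from the hyperparameter list and the model overview.
per_device_batch_size = 16
num_devices = 4
gradient_accumulation_steps = 8
context_length = 2048
training_steps = 40_000
warmup_steps = 4_000
peak_lr = 6e-3

# Sequences consumed per optimizer step across all GPUs.
total_train_batch_size = (
    per_device_batch_size * num_devices * gradient_accumulation_steps
)

# Upper bound on tokens seen, assuming every sequence is packed to full length.
tokens_per_step = total_train_batch_size * context_length
total_tokens = tokens_per_step * training_steps


def wsd_lr(step, decay_steps=4_000, min_lr=0.0):
    """Illustrative warmup-stable-decay learning rate at a given step.

    decay_steps, min_lr, and the linear ramps are assumptions; the card
    only states the warmup length, peak learning rate, and total steps.
    """
    decay_start = training_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup
    if step < decay_start:
        return peak_lr  # stable plateau
    progress = min((step - decay_start) / decay_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * progress  # linear decay


print(total_train_batch_size)        # 512
print(f"{total_tokens / 1e9:.1f}B")  # 41.9B
```

The computed 41.9B tokens matches the 42B pretraining-token figure in the overview after rounding.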

Hardware

  • GPUs: 4 x RTX 5090

Framework versions

  • Transformers 4.56.2
  • PyTorch 2.8.0+cu128
  • Datasets 4.1.1
  • Tokenizers 0.22.1

Evaluation

All evaluations are zero-shot unless stated otherwise and were run with lighteval.

The model achieves the following results on the test set:

  • Loss: 2.46
  • Perplexity: 11.66
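
The two numbers above are related: perplexity is the exponential of the mean per-token cross-entropy loss. A minimal check, assuming the loss is reported in nats (natural logarithm), which this card does not state explicitly:

```python
import math

reported_loss = 2.46         # as listed above (rounded to two decimals)
reported_perplexity = 11.66  # as listed above

# Perplexity is exp(mean per-token cross-entropy), assuming the loss is in nats.
perplexity_from_loss = math.exp(reported_loss)
loss_from_perplexity = math.log(reported_perplexity)

print(f"exp({reported_loss}) = {perplexity_from_loss:.2f}")
print(f"ln({reported_perplexity}) = {loss_from_perplexity:.3f}")
```

The small gap between exp(2.46) and 11.66 is consistent with the loss having been rounded to two decimals, since ln(11.66) itself rounds to 2.46.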

Base pre-trained model

Metric              YALM-130M   YALM-80M
MMLU (cloze)            27.98      27.33
MMLU Pro                11.38       8.72
BBH (5-shot)            11.59      12.61
ARC (Average)           33.50      29.87
HellaSwag               34.08      32.16
PIQA                    62.40      62.89
SCIQ                    70.00      69.50
CommonsenseQA           28.75      28.75
Winogrande              50.28      50.59
OpenBookQA              31.00      29.60
TruthfulQA              21.71      22.78
TriviaQA                 0.18       0.17
GSM8K (5-shot)           1.06       0.83

Limitations

YALM models primarily understand and generate content in English and Hindi. They can produce text on a variety of topics, but because their world knowledge is limited, the generated content may not always be factually accurate, logically consistent, or free from biases present in the training data.
