Behind RoPE: How Does Causal Mask Encode Positional Information?
Paper: arXiv:2509.21042 (https://arxiv.org/abs/2509.21042)
This model is the official checkpoint accompanying the paper Behind RoPE: How Does Causal Mask Encode Positional Information?.
The model is trained without any explicit positional encoding (also known as NoPE): it follows the Llama-3 architecture with RoPE removed. It has 1.5 billion parameters and was trained on 15 trillion tokens from the deduplicated version of the FineWeb-Edu dataset, with a maximum sequence length of 1024. Further training details are provided in the accompanying paper.
import torch
from transformers import AutoTokenizer, LlamaForCausalLM
import transformers.models.llama.modeling_llama as modeling_llama

# Disable RoPE by replacing the rotary-embedding application with a no-op,
# since this checkpoint was trained without positional encoding (NoPE).
def noop_apply_rotary_pos_emb(q, k, *args, **kwargs):
    return q, k

modeling_llama.apply_rotary_pos_emb = noop_apply_rotary_pos_emb

model = LlamaForCausalLM.from_pretrained(
    "starmpcc/NoPE_1.5B_FW_EDU_15T",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
# AutoTokenizer resolves the correct tokenizer class for the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("starmpcc/NoPE_1.5B_FW_EDU_15T")
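Once loaded, the checkpoint behaves like any other causal language model. A minimal generation sketch (the prompt is illustrative; keep inputs within the 1024-token training context):

# Illustrative usage of the patched model loaded above.
inputs = tokenizer("The causal mask in Transformers", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))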
@misc{kim2025ropedoescausalmask,
  title={Behind RoPE: How Does Causal Mask Encode Positional Information?},
  author={Junu Kim and Xiao Liu and Zhenghao Lin and Lei Ji and Yeyun Gong and Edward Choi},
  year={2025},
  eprint={2509.21042},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.21042},
}