Leandro von Werra PRO

lvwerra

huggingface

https://www.lvwerra.com

AI & ML interests

NLP and RL

Recent Activity

updated a Space about 3 hours ago

lvwerra/agent-manager-template

New activity in rl-llm-wiki/knowledge-base about 2 hours ago

topic: iterate reasoning-emergence — fold ProRL into §5 (the boundary-expansion counter-position)

#294 opened about 5 hours ago by

fix: dpo-variants — restore ΨPO notation + em-dashes (Unicode regressed by #297)

#298 opened about 4 hours ago by

updated a Space about 3 hours ago

Agent Manager

Private cloud manager for AI coding CLI sessions

published a Space about 3 hours ago

Agent Manager

Private cloud manager for AI coding CLI sessions

New activity in rl-llm-wiki/knowledge-base about 4 hours ago

topic: algorithms/dpo-variants - add SDPO

#297 opened about 4 hours ago by

source: arxiv:2501.01821 - SDPO

#296 opened about 4 hours ago by

New activity in rl-llm-wiki/knowledge-base about 5 hours ago

fix: rlaif — RLAIF (2309.00267) + Self-Rewarding (2401.10020) are now in corpus (de-stale OQ/§6/§7)

#295 opened about 5 hours ago by

updated a dataset about 5 hours ago

rl-llm-wiki/knowledge-base

Updated about 2 hours ago • 922

New activity in rl-llm-wiki/knowledge-base about 5 hours ago

topic: NEW algorithms/self-improvement-and-self-play — method-family hub (STaR/SPIN/Self-Rewarding/Absolute-Zero/TTRL)

#286 opened about 10 hours ago by

topic: iterate reasoning-emergence — fold in the 2025 created-vs-surfaced cluster (pass@k boundary, spurious rewards, self-play)

#246 opened 1 day ago by

source: arxiv:2405.01470 — WildChat: 1M ChatGPT Interaction Logs in the Wild

#256 opened 1 day ago by

topic: iterate test-time-and-rl-interplay — test-time compute as the training signal (TTRL)

#275 opened about 11 hours ago by

topic: iterate reward-hacking — reward tampering + frontier verifier hacking + CoT-monitoring (and its fragility)

#278 opened about 11 hours ago by

topic: iterate rlaif — RLAIF-V (open AI feedback + self-alignment for multimodal models)

#279 opened about 11 hours ago by

topic: iterate rlvr-overview — complete §5 with the 2025 elicit-vs-expand evidence

#280 opened about 11 hours ago by

topic: iterate data-quality-and-filtering — Skywork-Reward (quality>scale, decontam) + HelpSteer2 annotation QA

#281 opened about 10 hours ago by

topic: iterate verifiable-rewards — attribution caveat: how load-bearing is the verifier's correctness?

#282 opened about 10 hours ago by

topic: iterate ai-feedback-data — UltraFeedback dataset, RLAIF head-to-head, RLAIF-V open-MLLM feedback

#283 opened about 10 hours ago by

topic: iterate human-preference-collection — active preference learning / query efficiency (APRIL)

#284 opened about 10 hours ago by

New activity in rl-llm-wiki/knowledge-base about 6 hours ago

topic: bon runnable selection check

#293 opened about 6 hours ago by