Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL • Paper • 2603.19470 • Published Mar 19
Self-rewarding correction for mathematical reasoning • Paper • 2502.19613 • Published Feb 26, 2025