How I managed to vibe fine-tune LLMs as a Physics PhD

Community Article · Published December 6, 2025

How I reproduced cutting-edge LoRA research from Thinking Machines Lab by just prompting Orchestra with natural language.

Introduction

I'm a physics PhD who has done some AI research, but GRPO and large-scale RL fine-tuning weren't my main area. So when I saw Thinking Machines Lab's blog post with striking claims about LoRA fine-tuning, I was fascinated. How could rank=1 LoRA really outperform full fine-tuning on math reasoning? I wanted to find out for myself.

The challenge wasn't conceptual—I understand the theory. The challenge was engineering: to validate their findings, I'd be looking at 2-3 weeks of setting up GRPO implementations, debugging training code, provisioning GPU infrastructure, and wrangling large datasets. Even with AI research experience, that overhead means interesting ideas outside my main area often stay untested.

But I built Orchestra specifically to eliminate this barrier. So I decided to test it: could I skip weeks of engineering overhead and quickly validate cutting-edge research in unfamiliar territory (GRPO RL fine-tuning) using just natural language conversation?


What I Wanted to Validate

Key Takeaways from "LoRA Without Regret"

  • Apply LoRA to MLP layers, not just attention. Most tutorials skip MLPs, leaving performance on the table.
  • Use 10x higher learning rates for LoRA. The optimal LR is consistently 10x what you'd use for full fine-tuning.
  • For RL tasks, rank=1 LoRA beats full fine-tuning. Policy gradients give you ~1 bit per episode, so low-rank is actually optimal.

I wanted to see if I could reproduce these findings—particularly the claim that rank=16 achieves 99% of rank=256's performance on supervised fine-tuning, and that rank=1 LoRA beats full fine-tuning on reinforcement learning tasks. But instead of spending weeks setting up infrastructure, I decided to use Orchestra and just have a conversation about what I wanted to test.
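To make the first two takeaways concrete before getting into the experiments, here's a minimal sketch of the recipe using the Hugging Face peft library. This is my own illustration; the model id and the exact numbers are placeholders matching the blog post's advice, not code from Orchestra.

```python
# Sketch of the "LoRA Without Regret" recipe (assumes transformers + peft installed).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

lora_config = LoraConfig(
    r=16,                                                   # low rank is usually enough
    lora_alpha=32,
    target_modules=["gate_proj", "up_proj", "down_proj"],   # MLP layers, not just attention
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Rule of thumb from the paper: whatever learning rate you'd pick for full
# fine-tuning, train the LoRA adapters at roughly 10x that (e.g. 1e-5 -> 1e-4).
```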


From Curiosity to Results: How Orchestra Made This Happen

Here's the thing that still feels a bit surreal: I didn't write a single line of training code. I didn't provision GPUs. I didn't debug CUDA errors at 2am. I just had a conversation with Orchestra and explained what I wanted to test:

The Conversations: Setting Up Two Experiments

I ran two separate experiments with two different Orchestra agents. For the supervised fine-tuning experiment, I asked:

"Fine-tune Llama 3.2 1B on Tulu3 dataset. Compare LoRA rank=16 vs rank=256 on MLP layers only."

For the reinforcement learning experiment, I started a new agent and said:

"run a GRPO RL algorithm on the Qwen2.5-0.5B instruct model with the GSM8k dataset. let's compare full fine-tuning vs. lora"

We went back and forth for about 20 minutes. I clarified which hyperparameters to use, what metrics to track, which baselines to compare against. The agent asked clarifying questions about dataset sizes, training duration, evaluation strategy. It felt more like collaborating with a research engineer than using a tool.


Conversation with Orchestra: defining experiments, clarifying hyperparameters, and discussing baselines

What the Orchestra Agent Did: The Full Workflow

1. Writing the Code

Orchestra generated complete training scripts for both experiments: SFT with LoRA at different ranks, and GRPO with LoRA vs full fine-tuning. It structured the code with proper experiment tracking, checkpointing, and evaluation loops, and set up configurations for Llama 3.2 1B on Tulu3 and Qwen2.5-0.5B on GSM8k.
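To give a sense of what such a script involves, here's roughly what the SFT side could look like by hand with TRL's SFTTrainer. This is my own hedged sketch using the hyperparameters listed later in the post, not Orchestra's actual generated code.

```python
# Hand-written equivalent of the SFT experiment (assumptions: trl, peft, datasets;
# an illustrative sketch, not the code Orchestra generated).
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("allenai/tulu-3-sft-mixture", split="train[:10%]")

peft_config = LoraConfig(
    r=16,                                                   # rerun with r=256 for the comparison
    lora_alpha=32,
    target_modules=["gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="llama32-1b-tulu3-lora-r16",
    num_train_epochs=0.25,
    learning_rate=1e-4,                                     # ~10x a typical full-FT LR
    lr_scheduler_type="constant",
    per_device_train_batch_size=8,
    logging_steps=10,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.2-1B",
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```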

2. Debugging and Testing

Before running full experiments, the agents ran test runs on GPU with small sample sizes and just a few training steps to catch bugs early. This caught issues in the GRPO reward function during testing—things that would have caused silent failures during full training runs.
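That "small run first" pattern is easy to replicate if you're hand-rolling the scripts. A minimal sketch (my own; the flag and sizes are arbitrary, not Orchestra's settings):

```python
# Smoke-test pattern (my own sketch, not Orchestra's generated code): reuse the
# real training entry point, but cap data and steps so bugs surface in minutes.
from datasets import load_dataset
from trl import SFTConfig

SMOKE_TEST = True

dataset = load_dataset("allenai/tulu-3-sft-mixture", split="train[:10%]")
if SMOKE_TEST:
    dataset = dataset.select(range(64))        # tiny slice, just enough to exercise the pipeline

args = SFTConfig(
    output_dir="smoke-test" if SMOKE_TEST else "full-run",
    max_steps=10 if SMOKE_TEST else -1,        # -1 falls back to num_train_epochs
    num_train_epochs=0.25,
    logging_steps=1 if SMOKE_TEST else 10,
)
# ...build the trainer exactly as in the full run, then call trainer.train()
```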

3. Provisioning GPUs

Once tests passed, Orchestra automatically provisioned the necessary GPUs via Modal: 4x H100s for the SFT experiment, 1x H100 for the RL fine-tuning. It handled all the infrastructure setup—Docker containers, environment configuration, dependency installation. I didn't touch a single config file.
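For reference, provisioning an H100 job on Modal by hand is only a few lines. The sketch below is my own approximation of the equivalent manual setup, not a look inside Orchestra's internals; the app name and timeout are placeholders.

```python
# Rough sketch of manual GPU provisioning on Modal (illustrative only).
import modal

image = (
    modal.Image.debian_slim()
    .pip_install("torch", "transformers", "trl", "peft", "datasets")
)
app = modal.App("lora-without-regret-repro", image=image)

@app.function(gpu="H100:4", timeout=12 * 60 * 60)   # 4x H100 for the SFT run
def train_sft():
    # launch the SFT training script here
    ...

@app.function(gpu="H100", timeout=12 * 60 * 60)     # 1x H100 for the GRPO run
def train_grpo():
    # launch the GRPO training script here
    ...
```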

4. Running GPU Experiments in Parallel

Both experiments ran overnight in parallel: SFT with rank=16 and rank=256, and GRPO with rank=1 LoRA against two full FT baselines (low LR and high LR). All metrics were logged through Orchestra's internal SDK and rendered in real time.

5. Monitoring Progress

The agent continuously monitored training metrics. It detected when full FT with high LR flatlined at 0% correctness and flagged it. It noticed when LoRA hit 100% format compliance and predicted strong correctness would follow (it did).
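A flatline check like the one the agent flagged is simple to express. Here's a hypothetical version (the helper name, window, and tolerance are mine, not part of Orchestra's SDK):

```python
# Hypothetical flatline detector: flag a run whose correctness hasn't moved
# over the last `window` logged values.
def is_flatlined(history: list[float], window: int = 20, tol: float = 1e-3) -> bool:
    """Return True if the last `window` values vary by less than `tol`."""
    if len(history) < window:
        return False
    recent = history[-window:]
    return max(recent) - min(recent) < tol

correctness_full_ft_high_lr = [0.0] * 50           # the behavior we observed
print(is_flatlined(correctness_full_ft_high_lr))   # True -> worth flagging
```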


Real-time monitoring of training metrics showing the agent detecting and flagging issues

6. Plotting Results

When experiments finished, Orchestra generated 11 publication-ready plots: training/eval loss curves for SFT, correctness over time for GRPO, format compliance trends, total reward progression.
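Under the hood this kind of plot needs nothing more than matplotlib. The helper below is my own minimal sketch for the correctness comparison, not Orchestra's plotting code:

```python
# Minimal plotting helper (my own sketch; Orchestra renders these through its SDK).
import matplotlib.pyplot as plt

def plot_metric(curves: dict[str, list[float]], ylabel: str, out_path: str) -> None:
    """curves maps a run label (e.g. 'LoRA r=1') to its per-step metric history."""
    fig, ax = plt.subplots(figsize=(6, 4))
    for label, values in curves.items():
        ax.plot(range(len(values)), values, label=label)
    ax.set_xlabel("training step")
    ax.set_ylabel(ylabel)
    ax.legend()
    fig.tight_layout()
    fig.savefig(out_path, dpi=200)
```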


The agent automatically generating publication-ready plots with side-by-side comparisons

7. Writing the Analysis Report

Finally, it generated a comprehensive markdown report with statistical analysis, key findings, comparison tables, and recommendations. It even offered to create a PowerPoint presentation.


The agent generating analysis reports and offering to create PowerPoint presentations


The Results: We Reproduced Their Findings

Experiment 1: Supervised Fine-Tuning (Rank=16 vs Rank=256)

View full experiment in Orchestra →

| Metric | Result |
|---|---|
| Test loss gap (Rank=256 vs Rank=16) | 0.60% |
| Parameter reduction (Rank=16) | 16x fewer |


Our experiment: Llama 3.2 1B on Tulu3 (rank=16 vs 256)


Original paper: Multiple ranks on Llama 3.1 8B & 3.2 1B

Our curves essentially mirror the original paper's. Rank=16 achieves 99.4% of rank=256's performance with 16x fewer trainable parameters, exactly the pattern Thinking Machines Lab reported.

Technical Details

Experiment Setup:

  • Model: Llama 3.2 1B (1.24B parameters)
  • Dataset: Tulu3 SFT mixture (10% subset = 93,934 examples)
  • Ranks tested: 16 vs 256
  • Training: 0.25 epochs, constant LR = 1e-4
  • Hardware: 4x H100 GPUs (via Modal)
  • LoRA config: Applied to MLP layers (gate_proj, up_proj, down_proj)

Experiment 2: Reinforcement Learning (Rank=1 LoRA vs Full Fine-Tuning)

View full experiment in Orchestra →

| Method | Final Correctness |
|---|---|
| LoRA (Rank=1) | 52.1% |
| Full FT (low LR) | 33.3% |
| Full FT (high LR) | 0% |

A 56% relative improvement over the best full fine-tuning baseline (52.1% vs 33.3%)


Our experiment: Qwen2.5-0.5B on GSM8k (rank=1 LoRA vs Full FT)


Original paper: LoRA ranks vs Full FT on RL tasks

Rank-1 LoRA absolutely demolished both full fine-tuning configurations—exactly matching what Thinking Machines Lab found. Full FT with high LR flatlined at 0%. Full FT with low LR peaked at 43.8% then degraded to 33.3%. Meanwhile, LoRA shot up to 56% by step 50 and stayed stable around 52%. Same story, different model and dataset.


Format compliance—LoRA hit 100% by step 100, while Full FT maxed at 82.3%


Total reward progression (correctness + format + reasoning quality)

Technical Details

Experiment Setup:

  • Model: Qwen2.5-0.5B-Instruct (494M parameters)
  • Dataset: GSM8k (7,473 math word problems)
  • Algorithm: GRPO (Group Relative Policy Optimization)
  • Methods compared: LoRA (rank=1, lr=2e-5) vs Full FT (lr=2e-6 and 7e-5)
  • Training: 200 steps, 8 generations per prompt
  • Output format: Structured XML with <reasoning> and <answer> tags
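To ground this setup, here's a hedged sketch of what a GRPO run with these settings could look like using TRL's GRPOTrainer. The reward functions (an XML format check plus exact-match correctness against GSM8k's "####" answers), the system prompt, and the reward weights are my own illustrative assumptions, not the exact code Orchestra generated.

```python
# Hedged sketch of the RL experiment (assumptions: trl's GRPOTrainer + peft;
# reward functions are illustrative, not Orchestra's exact generated code).
import re
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

SYSTEM_PROMPT = "Answer in <reasoning>...</reasoning><answer>...</answer> format."

def extract_answer(text: str) -> str | None:
    match = re.search(r"<answer>\s*(-?[\d,\.]+)\s*</answer>", text)
    return match.group(1).replace(",", "") if match else None

def format_reward(completions, **kwargs):
    """+0.5 (assumed weight) if the completion follows the XML structure."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return [0.5 if re.search(pattern, c[0]["content"], re.DOTALL) else 0.0
            for c in completions]

def correctness_reward(completions, answer, **kwargs):
    """+1.0 if the extracted answer matches GSM8k's gold answer (after '####')."""
    rewards = []
    for c, gold in zip(completions, answer):
        gold_value = gold.split("####")[-1].strip().replace(",", "")
        rewards.append(1.0 if extract_answer(c[0]["content"]) == gold_value else 0.0)
    return rewards

dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {
    "prompt": [{"role": "system", "content": SYSTEM_PROMPT},
               {"role": "user", "content": x["question"]}],
})

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=[format_reward, correctness_reward],
    args=GRPOConfig(
        output_dir="qwen-gsm8k-grpo-lora-r1",
        learning_rate=2e-5,          # vs 2e-6 / 7e-5 for the full-FT baselines
        max_steps=200,
        num_generations=8,           # 8 completions per prompt, as in the experiment
    ),
    train_dataset=dataset,
    peft_config=LoraConfig(r=1, lora_alpha=2,
                           target_modules=["gate_proj", "up_proj", "down_proj"],
                           task_type="CAUSAL_LM"),
)
trainer.train()
```

Swapping the `peft_config` for `None` and adjusting the learning rate gives the full fine-tuning baselines.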

The Timeline Comparison

Traditional Approach (~2-3 weeks)

| Phase | Duration |
|---|---|
| Rent compute, queue for cluster access, configure GPU environments | Day 1-3 |
| Write training code, implement GRPO, debug reward functions | Day 4-7 |
| Run experiments, realize something's misconfigured, rerun everything | Day 8-10 |
| Generate plots, write analysis, make comparison tables | Day 11-14 |

With Orchestra (~2 days)

| Phase | Duration |
|---|---|
| 20-minute conversation explaining what I want to test | Evening |
| Agent writes code, debugs, provisions H100s, runs experiments in parallel | Overnight |
| Complete results with plots and analysis ready; spot 4 reward function issues | Morning |
| Fixed experiments validate the paper's claims with a detailed analysis report | Next day |

Here is what I got: Production-ready experiments that ran to completion overnight. The code worked. The infrastructure provisioned correctly. The metrics tracked properly. The only "issue" was me changing my mind about reward weights halfway through—and the agent handled the rerun without complaint.

This eliminated weeks of engineering work that would normally block me from exploring areas outside my main expertise. I went from "I'd love to test this but don't have 3 weeks for setup" to "let me validate this today" in a single evening. That's the difference between ideas staying theoretical and actually testing them.


What I Learned About How Research Should Work

Researchers can now quickly explore adjacent fields

This is the big one for me. GRPO RL fine-tuning wasn't my specialty, but I was curious about the paper's claims. Instead of 2-3 weeks of engineering setup, I validated them in 1 day. The barrier between "interesting idea" and "empirical validation" just got dramatically lower.

Focus on the research question, not the infrastructure

I spent my time thinking about experimental design, interpreting results, and understanding the information-theoretic arguments. Not debugging GPU drivers, provisioning GPUs, or writing boilerplate training loops.

AI agents excel at iterative research workflows

The workflow that typically takes weeks (code → debug → provision → run → analyze → plot) is exactly what agents are good at: structured, multi-step processes with clear success criteria. Orchestra didn't just "assist"—it executed the entire pipeline autonomously while I slept.

Trust but verify

The back-and-forth with Orchestra felt natural. I didn't need to know a specific ML framework or GPU cluster setup details—I just explained what I wanted in plain language and let the agent translate that into working code and infrastructure setup. The agent's code worked, but I still reviewed the experimental setup, checked the metrics, and validated the conclusions. Agents handle execution; you handle scientific judgment.


We are redefining how research will be done.

This experience crystallized an important shift: the bottleneck in science is moving from "can we run the experiment" to "what should we test."

Previously, many ideas remained untested because execution costs were too high. A paper would spark curiosity about generalization, but validating it required 2-3 weeks of infrastructure work—time most researchers don't have.

When you can articulate a research question clearly, you can now get empirical answers. This enables researchers to follow their curiosity, run validation studies that would otherwise be deprioritized, and accelerate the scientific process by removing artificial friction—not by cutting corners. Everyone can be a scientist.


References

  1. LoRA Without Regret — John Schulman and Thinking Machines Lab, Sep 2025
  2. LoRA: Low-Rank Adaptation of Large Language Models — Hu et al., 2021
  3. Group Relative Policy Optimization — Shao et al., 2024
  4. GSM8K: Training Verifiers to Solve Math Word Problems — Cobbe et al., 2021
  5. Tulu 3: Pushing Frontiers in Open Language Model Post-Training — Lambert et al., 2024

Citation

If you reference this work, please cite:

@article{zhang2025lora_reproduction,
  title={Reproducing "LoRA Without Regret" with Orchestra},
  author={Zhang, Zechen and Liu, Amber},
  journal={Orchestra Research Blog},
  year={2025},
  url={https://www.orchestra-research.com/perspectives/LLM-with-Orchestra}
}

Acknowledgements

We'd like to thank Modal for their generous support in providing cloud compute infrastructure for these experiments. Their platform made it seamless to provision H100 GPUs on-demand, which was essential for running both the supervised fine-tuning and reinforcement learning experiments at scale.


Experiments conducted using Orchestra Research, an AI-powered research platform for accelerating scientific discovery. All code, configurations, and experimental logs are available for reproducibility.
