Dora training taking 8x time? Why?

Hi, I am enabling the use_dora flag in LoraConfig. When I disable it, the training time is 16 hours, and when I enable it, the training time shows 122 hours. I have kept all other configs the same. What is causing this behaviour?

Training LLaMA-8B Instruct.

Following are my LoRA config and TrainingArguments:

lora_config:
target_modules: "q_proj,k_proj,v_proj,o_proj,gate_proj"
r: 32
lora_alpha: 16
lora_dropout: 0.05
use_dora: True
init_lora_weights: "gaussian"
use_rslora: True
freeze_layers: 0

train_params:
learning_rate: 0.00003
per_device_train_batch_size: 1
per_device_eval_batch_size: 4
num_train_epochs: 3
gradient_accumulation_steps: 8
max_grad_norm: 1
eval_strategy: "steps"
eval_steps: 0.123
optim: 'adamw_8bit'
save_steps: 0.123
weight_decay: 0.01
fp16: true
save_strategy: "steps"
warmup_ratio: 0.1
logging_steps: 50
gradient_checkpointing: false
report_to: 'tensorboard'
lr_scheduler_type: 'cosine'
save_total_limit: 100
ddp_find_unused_parameters: false


Same here. Training time went up 10x after enabling DoRA.
Did you discover the cause?


Setting lora_dropout=0.0 might speed things up a bit?


Short answer:

  • Turning on use_dora=True switches from “plain LoRA” to DoRA, which adds extra parameters and extra math per forward/backward step.
  • The current Hugging Face PEFT implementation of DoRA has noticeable overhead and is less optimized than the plain LoRA path; the official docs explicitly say “DoRA introduces a bigger overhead than pure LoRA.” (Hugging Face)
  • On top of that, DoRA’s adapter construction and first steps are much slower, so the Hugging Face Trainer often computes a very pessimistic ETA from those initial steps. That is why you see “16h → 122h” with the same hyperparameters, even though the real per-step slowdown is smaller.

Below is a detailed breakdown with context and how this interacts with your exact config.


1. Background: LoRA vs DoRA

1.1 LoRA recap (what you had when use_dora=False)

LoRA (Low-Rank Adaptation) replaces a full dense update $\Delta W$ with a low-rank factorization $BA$ of rank $r$. You keep the base weight $W_0$ frozen and only train the small low-rank matrices. This drastically cuts the number of trainable parameters but does not remove the base matmul; during training you still compute:

$$
y = x W_0^\top + x (BA)^\top
$$

So training cost per step is roughly:

  • the base matmul cost (unchanged), plus
  • two small rank-$r$ adapter matmuls for $A$ and $B$, which are cheap for modest $r$.

LoRA is cheap in parameters and moderately cheap in compute.
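
As a rough illustration of where that time goes, here is a minimal PyTorch-style sketch of the LoRA forward pass (hand-written for this post, not PEFT's actual code; the shapes and the scaling convention are illustrative):

```python
import torch

# Hypothetical shapes for one adapted projection.
d_in, d_out, r = 4096, 4096, 32
x  = torch.randn(8, d_in)        # a small batch of activations
W0 = torch.randn(d_out, d_in)    # frozen base weight
A  = torch.randn(r, d_in)        # trainable LoRA "down" matrix
B  = torch.zeros(d_out, r)       # trainable LoRA "up" matrix
scaling = 16 / 32                # lora_alpha / r (plain LoRA; rsLoRA uses alpha / sqrt(r))

# The big base matmul still runs every step; LoRA only adds two small rank-r matmuls.
y = x @ W0.T + (x @ A.T) @ B.T * scaling
```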

1.2 DoRA (what you get when use_dora=True)

DoRA (Weight-Decomposed Low-Rank Adaptation) decomposes each weight into:

  • Magnitude $m$ (a learned norm vector)
  • Direction $V$ (the normalized weight matrix)

Direction is updated via LoRA; magnitude is an independent learnable parameter. (Nbasyl)

So for each adapted weight you now have:

  1. Extra magnitude parameter(s) to maintain and update.

  2. Extra computations:

    • compute or apply the magnitude scaling,
    • normalize / renormalize the effective weight,
    • combine base weight + LoRA direction update + magnitude.

This makes DoRA:

  • More expressive and robust, especially at low ranks (e.g. r=4–8). (Hugging Face)
  • But heavier per step than LoRA during training, because you add magnitude logic on top of LoRA.

The Hugging Face PEFT docs explicitly list a caveat:

“DoRA introduces a bigger overhead than pure LoRA, so it is recommended to merge weights for inference…” (Hugging Face)

So some slowdown is expected even in an ideal implementation.
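
Schematically, the extra per-step work looks like this (again a hand-written sketch following the paper's decomposition, not PEFT's kernels; the normalization axis and initialization shown here are illustrative):

```python
import torch

d_in, d_out, r = 4096, 4096, 32
x  = torch.randn(8, d_in)
W0 = torch.randn(d_out, d_in)                      # frozen base weight
A, B = torch.randn(r, d_in), torch.zeros(d_out, r)
scaling = 16 / 32
m = W0.norm(p=2, dim=1)                            # trainable magnitude, initialized from base norms

# DoRA: fold the LoRA update into a direction, renormalize it, then rescale by the
# learned magnitude -- all of this runs (and is backpropagated) every training step.
W_prime = W0 + B @ A * scaling                     # effective directional weight
norm    = W_prime.norm(p=2, dim=1, keepdim=True)   # extra norm computation per step
W_dora  = (m.unsqueeze(1) / norm) * W_prime        # magnitude * unit direction
y = x @ W_dora.T
```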


2. Implementation reality: why the overhead is big in PEFT

On top of the theoretical extra work, PEFT’s current implementation makes DoRA noticeably slower in practice.

2.1 Adapter construction with DoRA is slow

There is a GitHub issue “Getting Dora Model Is Very Slow” in the PEFT repo:

  • When get_peft_model is called with use_dora=True, it takes several minutes to construct the model for a 7B-class model.
  • The same code with plain LoRA builds almost immediately. (GitHub)

This extra initialization typically happens at:

  • the moment you wrap the model, and/or
  • the first training steps.

If your Trainer uses those very slow initial steps to estimate the ETA, it will produce a hugely inflated time estimate even though later steps are faster.

Your “16 hours vs 122 hours” jump fits this pattern exactly: the ETA is dominated by DoRA’s slow setup and unusually slow first steps, not by a uniform 8× slowdown on every step.
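
If you want to verify this on your own setup, timing get_peft_model directly is enough (a sketch; the checkpoint name is a placeholder for whatever you are actually fine-tuning):

```python
import time

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder checkpoint
    torch_dtype=torch.float16,
)

for use_dora in (False, True):
    config = LoraConfig(
        r=32,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj"],
        use_dora=use_dora,
    )
    start = time.perf_counter()
    model = get_peft_model(base, config)  # DoRA's weight-norm initialization happens here
    print(f"use_dora={use_dora}: wrapped in {time.perf_counter() - start:.1f}s")
    base = model.unload()                 # strip the adapters so the next run starts clean
```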

2.2 Adapters are computed sequentially with the base layer

General PEFT overhead is now well understood: fine-tuning methods like LoRA add small adapter layers that are executed sequentially after the base layer on the GPU. During training, you cannot merge them into the base weights (they’re changing), so each step does:

  • base linear layer matmul, then
  • adapter matmul (LoRA or DoRA), then
  • combination / scaling.

A recent paper on PaCA (“Partial Connection Adaptation”) analyzes this and notes that even PEFT methods do not necessarily reduce training time, because the adapter and pretrained layers are processed sequentially, causing non-trivial latency overhead. (arXiv)

DoRA adds further operations on top of LoRA’s adapter, so its sequential overhead is even larger:

  • More kernels (norms, scalings),
  • More reads/writes,
  • More Python-level glue in the current implementation.

2.3 PEFT’s DoRA optimizations only kick in for certain settings

In the PEFT LoRA developer guide, the DoRA section says:

  • DoRA is supported via LoraConfig(use_dora=True, ...).
  • There are optimizations for evaluation and for certain dropout settings, including a runtime config with ephemeral_gpu_offload=True. (Hugging Face)

Importantly:

  • DoRA is optimized mainly for evaluation and for zero LoRA dropout.
  • With non-zero lora_dropout, the library cannot reuse cached computations as aggressively and must recompute more every step.

Your config uses lora_dropout: 0.05, which is standard for LoRA, but for DoRA it puts you on the slow path rather than the optimized one.
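
Concretely, the settings the PEFT guide describes as the faster DoRA path look roughly like this (a sketch based on that guide; ephemeral_gpu_offload only matters if part of the model is offloaded to CPU):

```python
from peft import LoraConfig, LoraRuntimeConfig

dora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.0,  # zero dropout lets PEFT use its optimized DoRA path
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj"],
    use_dora=True,
    use_rslora=True,
    init_lora_weights="gaussian",
    # Only relevant when parts of the model live on CPU:
    runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=True),
)
```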


3. How this interacts with your exact config

You are training LLaMA-8B Instruct with:

3.1 LoRA / DoRA config

  • target_modules: "q_proj,k_proj,v_proj,o_proj,gate_proj"
  • r: 32
  • lora_alpha: 16
  • lora_dropout: 0.05
  • use_rslora: True
  • use_dora: True (only in the slow run)

Comments:

  1. You are adapting five large projections per block (q/k/v/o/gate). This is already a fairly heavy coverage even for LoRA.
  2. r=32 is a medium–high rank for LLaMA-8B. It increases the LoRA adapter’s size and compute proportionally; DoRA then wraps this.
  3. use_rslora=True adds rank-stabilized scaling on top of LoRA; DoRA then adds magnitude/direction decomposition on top of that.
  4. lora_dropout=0.05 is fine for pure LoRA, but for DoRA it prevents the most aggressive optimization (which assumes lora_dropout=0 for caching) described in the docs. (Hugging Face)

So compared to your LoRA baseline, DoRA is:

  • Operating on the same large set of modules
  • With the same relatively high rank
  • But with extra per-weight parameters and math,
  • And running on a less optimized code path due to nonzero dropout.

That alone can easily give you a 2–3× real per-step slowdown.
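
You can see the extra DoRA parameters directly by comparing the two wrappings (a sketch reusing `base` from the timing snippet above; lora_magnitude_vector is the name recent PEFT versions use, so check your version if the filter comes up empty):

```python
from peft import LoraConfig, get_peft_model

common = dict(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj"],
)

for use_dora in (False, True):
    wrapped = get_peft_model(base, LoraConfig(use_dora=use_dora, **common))
    wrapped.print_trainable_parameters()
    # DoRA adds one magnitude vector per adapted weight matrix.
    n_mag = sum("lora_magnitude_vector" in name for name, _ in wrapped.named_parameters())
    print(f"use_dora={use_dora}: {n_mag} magnitude vectors")
    base = wrapped.unload()
```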

3.2 TrainingArguments

Your training arguments include:

  • per_device_train_batch_size: 1
  • gradient_accumulation_steps: 8 → effective batch size 8.
  • num_train_epochs: 3
  • optim: "adamw_8bit" (bitsandbytes optimizer)
  • fp16: true
  • gradient_checkpointing: false
  • eval_strategy: "steps"
  • eval_steps: 0.123
  • save_steps: 0.123

A few important interactions:

  1. Very small micro-batch (1) per device

    • When the batch per device is 1, GPU hardware is underutilized; per-step overhead (extra kernels, Python logic) matters more.
    • Any extra DoRA operations are more visible in wall-clock time.
  2. No gradient checkpointing

    • You recompute all activations in full; DoRA’s extra computations are fully paid every forward/backward.
  3. 8-bit optimizer

    • This is good for memory, but DoRA + quantization support in PEFT is newer and not as tuned as plain LoRA. The docs even flag caveats for DoRA with quantized weights. (Hugging Face)
  4. eval_steps / save_steps as ratios

    • Hugging Face Trainer allows eval_steps and save_steps to be a fraction of total steps (0–1); 0.123 means “every 12.3% of total steps”, not every 0.123 steps. (Hugging Face)
    • That is not wrong, but it means the first evaluation/save happens relatively early; if those early steps are unusually slow because of DoRA initialization, they distort both the timing and ETA estimation early on.
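
To see where that first evaluation actually lands, translate the ratio into steps (the dataset size below is a made-up placeholder):

```python
# Hypothetical numbers -- substitute your real dataset size.
num_examples = 50_000
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_train_epochs = 3

steps_per_epoch = num_examples // (per_device_train_batch_size * gradient_accumulation_steps)
total_steps = steps_per_epoch * num_train_epochs

# eval_steps=0.123 means a fraction of total steps, not "every 0.123 steps".
eval_every = int(0.123 * total_steps)
print(f"total optimizer steps: {total_steps}; first eval/save around step {eval_every}")
```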

Net effect:

  • DoRA’s heavier per-step cost is amplified by small batch and no checkpointing.
  • Early DoRA setup + first evaluation/save steps are extra slow.
  • The Trainer uses those to compute an ETA, so it tells you “~122 hours” instead of something like “~30–40 hours”.

4. So what is actually causing “16h → 122h”?

Putting all of this together, the behaviour is caused by a combination of:

  1. Real extra compute and memory overhead from DoRA vs LoRA

    • DoRA adds magnitude parameters and normalization on top of LoRA’s low-rank update. (Hugging Face)
    • PEFT’s documentation explicitly says “DoRA introduces a bigger overhead than pure LoRA.” (Hugging Face)
    • In your configuration (many target modules, rank 32, non-zero dropout, small batch), this overhead is magnified, so each training step truly is significantly slower than with plain LoRA.
  2. PEFT/Transformers implementation details

    • Wrapping the model with LoraConfig(use_dora=True) triggers heavy adapter construction; a PEFT GitHub issue shows that get_peft_model can take minutes with DoRA while being almost instant for LoRA. (GitHub)
    • Adapters are computed sequentially with the base layers during training, which is known to cause latency overhead for PEFT methods in general. (arXiv)
  3. ETA extrapolated from very slow first steps

    • The Trainer estimates remaining time by measuring the duration of the first steps and extrapolating across all steps.
    • With DoRA, those first steps include model wrapping, graph warm-up, early evaluation/save, etc., and can be many times slower than “steady-state” steps.
    • This is exactly the pattern you are seeing: toggling use_dora=True on the same LLaMA-8B LoRA setup raises the reported ETA from ~16 h to ~122 h.

So:

  • The real per-step slowdown is likely in the range of ~2–3× (depending on hardware), due to DoRA’s added work and less optimized path.
  • The reported 8× expansion (16h → 122h) comes from combining that real slowdown with a very pessimistic ETA based on unusually slow initial steps.

5. What you can do (optional, but useful)

Even though you did not explicitly ask for fixes, it is useful to know the levers you have:

  1. Measure actual per-step time

    • Look at wall-clock time per N steps (e.g. per 100 steps) after the first few hundred steps of training.
    • Compare LoRA vs DoRA runs; this shows the “true” slowdown instead of relying on the ETA (a minimal timing callback is sketched after this list).
  2. If you want to keep DoRA

    • Set lora_dropout=0.0 for DoRA to unlock better caching/optimization. This is aligned with the PEFT DoRA docs. (Hugging Face)
    • Consider lowering r (e.g. 16 instead of 32); DoRA tends to perform well at lower ranks, so you may recover similar quality with less compute. (Hugging Face)
    • If you’re offloading parts of the model to CPU, use LoraRuntimeConfig(ephemeral_gpu_offload=True) as suggested in the docs for better speed. (Hugging Face)
  3. If speed is more important for now

    • Stick to plain LoRA + rsLoRA (your original 16h setup) with use_dora=False.
    • Use the saved compute for more data, slightly higher rank, or more epochs, which often gives larger practical gains than switching from LoRA to DoRA in the first place for r=32 on LLaMA-8B.
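
For point 1 in the list above, a small TrainerCallback is enough to log steady-state step time and compare the two runs directly (a sketch; attach it via the Trainer's callbacks argument):

```python
import time

from transformers import TrainerCallback


class StepTimerCallback(TrainerCallback):
    """Print the average wall-clock time per optimizer step every `every` steps."""

    def __init__(self, every: int = 100):
        self.every = every
        self.last_time = None

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step % self.every == 0:
            now = time.perf_counter()
            if self.last_time is not None:
                avg = (now - self.last_time) / self.every
                print(f"step {state.global_step}: {avg:.2f} s/step (avg over last {self.every})")
            self.last_time = now


# Usage: Trainer(..., callbacks=[StepTimerCallback(every=100)])
```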

Bullet summary

  • Turning on use_dora=True switches from LoRA to DoRA, which adds magnitude parameters and extra normalization math on top of LoRA’s low-rank updates.
  • The Hugging Face PEFT docs state that “DoRA introduces a bigger overhead than pure LoRA,” so some slowdown is expected. (Hugging Face)
  • PEFT’s current DoRA implementation has heavy adapter construction and sequential adapter execution, which are particularly visible with small per-device batch size and nonzero lora_dropout. (GitHub)
  • Your specific config (many target modules, rank 32, lora_dropout=0.05, adamw_8bit, batch size 1) amplifies this overhead.
  • The “16 hours vs 122 hours” is mainly an ETA artifact: initial DoRA steps are unusually slow, and the Trainer extrapolates from them, overestimating total time.
  • The underlying cause of the behaviour is the combination of DoRA’s extra compute + PEFT’s less optimized DoRA path + ETA estimation from slow initial steps, not a hidden change in your other hyperparameters.