Phi-4 RAG (LoRA fine-tuned): Q4_K_M GGUF

Quantized GGUF build of microsoft/phi-4 with a LoRA adapter merged in, fine-tuned for retrieval-augmented question answering. The model answers only from supplied document context in English, Spanish, or Catalan, using the same RAG-oriented system prompt as MonkeyGrab, a local, fully private RAG stack developed for a Bachelor's thesis (TFG) at the Universitat Politècnica de València (UPV).

Source code, thesis, and contact

The full MonkeyGrab application repository is not public yet, pending the thesis defense and publication. This Hugging Face model repo ships the inference assets (Phi4-Q4_K_M.gguf and the Ollama Modelfile) plus a reproduction/ folder with frozen copies of the training script, merge utility, and evaluation_comparison.json, so methodology and metrics remain auditable without access to the full codebase.

Contact: nadiva1243@gmail.com for questions about training, evaluation, Ollama usage, or when the full repository will be released.

GGUF pipeline (high level): LoRA fine-tuning on the datasets below → merge with merge_lora.py (see reproduction/) → GGUF export via the llama.cpp toolchain → Q4_K_M quantization. The merge script documents expected paths and flags.
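The export half of this pipeline can be sketched as two llama.cpp invocations. The snippet below only builds and prints the command lines; nothing is executed. It assumes a local llama.cpp checkout that provides convert_hf_to_gguf.py and a built llama-quantize binary under build/bin, and every path and the intermediate file name are hypothetical placeholders. reproduction/CONVERSION.md remains the reference for the steps actually used.

```python
import shlex
from pathlib import Path


def gguf_export_commands(merged_dir, llama_cpp_dir, out_dir):
    """Return command lines for: merged HF checkpoint -> F16 GGUF -> Q4_K_M GGUF.

    All paths are placeholders chosen for illustration.
    """
    llama_cpp = Path(llama_cpp_dir)
    out = Path(out_dir)
    f16_gguf = out / "Phi4-F16.gguf"      # intermediate name is an assumption
    q4_gguf = out / "Phi4-Q4_K_M.gguf"    # matches the file shipped in this repo

    convert = [
        "python", str(llama_cpp / "convert_hf_to_gguf.py"),
        str(merged_dir), "--outfile", str(f16_gguf), "--outtype", "f16",
    ]
    quantize = [
        # Binary location depends on how llama.cpp was built (assumed here).
        str(llama_cpp / "build" / "bin" / "llama-quantize"),
        str(f16_gguf), str(q4_gguf), "Q4_K_M",
    ]
    return [convert, quantize]


commands = gguf_export_commands("merged-phi4", "llama.cpp", "out")
for argv in commands:
    print(shlex.join(argv))
```

Printing the commands instead of running them makes it easy to check paths before committing to a long conversion job.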

Files in this repo

| File | Description |
| --- | --- |
| Phi4-Q4_K_M.gguf | Full weights after LoRA merge, Q4_K_M quantization. |
| Modelfile | Ollama recipe: ChatML template, RAG system prompt, sampling parameters. |
| README.md | This model card. |
| LICENSE | MIT; applies to the model card, Modelfile, and files added here by nadiva1243 (not to Microsoft's base terms). |
| reproduction/train-phi4.py | Snapshot of scripts/training/train-phi4.py (v1) used for this adapter. |
| reproduction/merge_lora.py | Snapshot of scripts/conversion/merge_lora.py used to merge the LoRA weights into a dense checkpoint before GGUF export. |
| reproduction/evaluation_comparison.json | Frozen evaluation export (base vs. adapted, dev/test splits, per dataset + weighted aggregate). |
| reproduction/CONVERSION.md | Step-by-step notes: merge → GGUF → Q4_K_M quantization → Ollama import. |

Base model and method

  • Base: microsoft/phi-4, a 14B-parameter transformer (ChatML-style; end-of-turn token <|im_end|>).
  • Adaptation: PEFT LoRA fine-tuning on five RAG-focused datasets → LoRA adapter merged into dense weights → GGUF export → Q4_K_M quantization.

LoRA configuration

| Setting | Value |
| --- | --- |
| r | 32 |
| lora_alpha | 64 |
| lora_dropout | 0.05 |
| target_modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| bias | none |
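To make these settings concrete, the sketch below computes the standard LoRA scaling factor (lora_alpha / r) and the number of trainable parameters that a rank-r adapter adds to a single linear layer. The 5120 x 5120 layer shape is a hypothetical example chosen for illustration, not a value taken from the training script.

```python
R = 32
LORA_ALPHA = 64


def lora_scaling(alpha, r):
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


def lora_params(d_in, d_out, r):
    """Trainable parameters for one adapted layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


scaling = lora_scaling(LORA_ALPHA, R)
extra = lora_params(5120, 5120, R)   # hypothetical layer shape
dense = 5120 * 5120

print(f"scaling = {scaling}")
print(f"adapter params = {extra:,} ({extra / dense:.2%} of the dense layer)")
```

With r 32 and lora_alpha 64 the update is scaled by 2, and the adapter for a square layer of this size is a small fraction of the dense weight count.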

Training (train-phi4.py, v1)

  • Seed: 42 (propagates to torch / NumPy / CUDA via set_seed).
  • Task format: ChatML <|im_start|>user … <|im_end|> with the instruction and <context>…</context> on the user turn; loss computed only on the assistant completion (prompt labels masked with -100).
  • Data: balanced 5-way interleaving (3,200 train samples per source, 16,000 total) across the five sources listed in the per-dataset table under Evaluation results.
  • Sequence limits: max_length 4,096 tokens; context truncated to 2,048 tokens; generation up to 2,048 new tokens.
  • Optimizer / schedule: AdamW 8-bit, lr 5e-5, cosine decay with warmup_ratio 0.05, weight_decay 0.01, max_grad_norm 1.0.
  • Batching: per_device_train_batch_size 1, gradient_accumulation_steps 16 → effective batch 16; bf16 + TF32; gradient checkpointing enabled.
  • Epochs: 3; checkpoints saved every 300 steps (keep last 3); eval every 150 steps; load_best_model_at_end on eval_loss; early stopping patience 3 evaluations.
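Two details from the list above lend themselves to a small sketch: masking prompt tokens out of the loss, and the step counts implied by the batching settings. The token IDs are made up, and the step arithmetic assumes a single device, all 16,000 samples used in every epoch, and no early stop. train-phi4.py is the reference for the real behaviour.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_labels(prompt_ids, completion_ids):
    """Concatenate prompt and completion; supervise only the completion."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels


def schedule(num_samples, per_device_batch, grad_accum, epochs, warmup_ratio):
    """Optimizer-step counts implied by the batching settings (single device)."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, steps_per_epoch, total_steps, warmup_steps


# Made-up token IDs: three prompt tokens, two assistant tokens.
input_ids, labels = build_labels([11, 12, 13], [21, 22])
print(input_ids, labels)

print(schedule(16_000, 1, 16, 3, 0.05))
```

Under these assumptions a full run is 1,000 optimizer steps per epoch and 3,000 in total, with 150 warmup steps, so evaluating every 150 steps gives at most 20 evaluations before early stopping is considered.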

Evaluation protocol

  • Frozen dev/test splits: identical for the base (microsoft/phi-4) and the adapted (LoRA-merged) model, with no data leakage.
  • Dev: 320 samples × 5 sources = 1,600 examples (aligned with evaluate_baselines.py for cross-experiment comparability).
  • Test: full held-out splits, 8,490 examples total across all five sources.
  • Metrics: Token F1, ROUGE-L F1, BERTScore F1 (microsoft/deberta-xlarge-mnli); BERTScore is computed after unloading the generative model to fit in GPU memory.
  • Artifacts: all metric values and sample pairs are in reproduction/evaluation_comparison.json.
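For readers unfamiliar with Token F1, here is a minimal sketch of the usual SQuAD-style definition: the harmonic mean of token-level precision and recall between a prediction and a reference. Lowercasing and whitespace tokenization are assumptions made for this example; the evaluation script may normalize text differently, so values from this function need not match the reported numbers.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 in [0, 1] between two strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("The cat sat", "the cat sat down")
print(f"{100 * score:.2f}")  # scaled to the 0-100 range used in this card
```

Here the prediction has perfect precision but misses one reference token, so recall is 3/4 and the F1 is 6/7.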

Evaluation results

Scores are reported on a 0–100 scale. Δ (pp) = adapted − base, in percentage points; Δ rel (%) = 100 × Δ (pp) / base, the relative change vs. base.

Weighted aggregate (all five sources)

| Split | N | Metric | Base | Adapted | Δ (pp) | Δ rel (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Dev | 1,600 | Token F1 | 45.17 | 60.24 | +15.07 | +33.36 |
| Dev | 1,600 | ROUGE-L F1 | 37.18 | 50.49 | +13.31 | +35.79 |
| Dev | 1,600 | BERTScore F1 | 39.59 | 53.48 | +13.89 | +35.07 |
| Test | 8,490 | Token F1 | 45.42 | 63.20 | +17.78 | +39.14 |
| Test | 8,490 | ROUGE-L F1 | 37.21 | 52.97 | +15.76 | +42.35 |
| Test | 8,490 | BERTScore F1 | 39.90 | 56.42 | +16.52 | +41.41 |
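The two delta columns follow directly from the Base and Adapted scores. The sketch below recomputes them for the dev Token F1 row. Because it starts from the two-decimal values printed in the table, the relative change for other rows can differ from the table by 0.01, presumably because the table was computed from unrounded scores.

```python
def deltas(base, adapted):
    """Return (absolute change in percentage points, relative change in %)."""
    delta_pp = adapted - base
    delta_rel = 100.0 * delta_pp / base
    return round(delta_pp, 2), round(delta_rel, 2)


dev_token_f1 = deltas(45.17, 60.24)
print(dev_token_f1)
```

The absolute column shows how many points were gained; the relative column expresses the same gain as a share of the base score.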

Per-dataset dev (320 samples each)

| Dataset | Token F1 (Base → Adapted) | ROUGE-L F1 (Base → Adapted) | BERTScore F1 (Base → Adapted) |
| --- | --- | --- | --- |
| Neural-Bridge RAG | 50.46 → 81.17 | 45.46 → 77.46 | 46.79 → 79.34 |
| Dolly QA | 44.46 → 50.95 | 38.21 → 45.51 | 38.88 → 46.24 |
| Aina-EN | 44.67 → 56.15 | 35.32 → 43.16 | 41.61 → 50.42 |
| Aina-ES | 40.47 → 57.11 | 31.44 → 43.37 | 33.35 → 45.66 |
| Aina-CA | 45.80 → 55.82 | 35.48 → 42.95 | 37.32 → 45.72 |

Full test-split breakdowns and qualitative sample pairs are in reproduction/evaluation_comparison.json.
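As a consistency check, the dev aggregate can be recomputed from the per-dataset dev scores, assuming the aggregate is weighted by sample count. With 320 samples per source the weights are equal, so the weighted mean reduces to a plain average. The test aggregate cannot be reproduced the same way from this card alone, because the per-source test sizes are not listed here.

```python
# (samples, base Token F1, adapted Token F1) per source, from the dev table.
DEV_TOKEN_F1 = {
    "Neural-Bridge RAG": (320, 50.46, 81.17),
    "Dolly QA": (320, 44.46, 50.95),
    "Aina-EN": (320, 44.67, 56.15),
    "Aina-ES": (320, 40.47, 57.11),
    "Aina-CA": (320, 45.80, 55.82),
}


def weighted_mean(rows, column):
    """Sample-count-weighted mean of one score column (1 = base, 2 = adapted)."""
    total = sum(row[0] for row in rows.values())
    return sum(row[0] * row[column] for row in rows.values()) / total


base = round(weighted_mean(DEV_TOKEN_F1, 1), 2)
adapted = round(weighted_mean(DEV_TOKEN_F1, 2), 2)
print(base, adapted)
```

Both values agree with the Dev Token F1 row of the weighted aggregate (45.17 and 60.24).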

Relation to the baseline benchmark

The base dev numbers are aligned with the multi-model benchmark (evaluate_baselines.py, predictions_phi-4.json), so Phi-4 before fine-tuning is directly comparable to the other models in that suite. For post-LoRA performance, use the Adapted columns above.

Hardware compatibility (inference)

| Setup | Notes |
| --- | --- |
| GPU (recommended) | ~10 GB VRAM is a practical minimum for this Q4_K_M ~14B-class GGUF in Ollama at moderate batching; 8 GB may work with shorter context or with slower GPU offloading. |
| Context length | The bundled Modelfile sets num_ctx 16384; raising context increases VRAM/RAM use roughly linearly, so reduce num_ctx if you hit OOM. |
| CPU | Supported by Ollama / llama.cpp runners, but significantly slower than a discrete GPU at this model size. |
| Training hardware | LoRA training used bf16, gradient checkpointing, and an 8-bit optimizer on a CUDA GPU (see reproduction/train-phi4.py); this is separate from these inference notes. |
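The roughly linear growth with num_ctx comes from the key/value cache, which stores one key and one value vector per layer, per KV head, per position. The sketch below estimates its size for a single sequence. The architecture values (40 layers, 10 KV heads, head dimension 128) and the 2-byte f16 cache are assumptions for illustration; check the GGUF metadata for the real values. The estimate excludes the model weights and runtime overhead.

```python
def kv_cache_bytes(num_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for one sequence of num_ctx tokens."""
    # Factor 2 covers keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_value


# Assumed architecture values, not taken from this repository.
ASSUMED_ARCH = dict(n_layers=40, n_kv_heads=10, head_dim=128)

for ctx in (4096, 8192, 16384):
    gib = kv_cache_bytes(ctx, **ASSUMED_ARCH) / 2**30
    print(f"num_ctx={ctx:>6}: ~{gib:.2f} GiB KV cache")
```

Halving num_ctx halves this term, which is why lowering it is the first thing to try after an out-of-memory error.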

Ollama

Place Phi4-Q4_K_M.gguf next to Modelfile (or adjust the FROM path). Then:

ollama create phi4-rag -f Modelfile
ollama run phi4-rag

Generation defaults in the bundled Modelfile: num_ctx 16384, temperature 0.15, top_p 0.9, repeat_penalty 1.15.
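Once the model has been created, it can also be queried programmatically through Ollama's local REST API. The sketch below builds a chat request that wraps retrieved passages in <context>…</context> tags and defines a helper that sends it. Placing the question before the context, the newline layout, and the helper names are assumptions for illustration; the system prompt and sampling defaults come from the Modelfile, so they are not repeated here.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint


def build_chat_payload(question, passages, model="phi4-rag"):
    """Build a non-streaming chat request with the passages in <context> tags."""
    context = "\n\n".join(passages)
    user_turn = f"{question}\n\n<context>\n{context}\n</context>"  # layout assumed
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_turn}],
        "stream": False,
    }


def ask(question, passages):
    """Send one request to a running Ollama server and return the answer text."""
    data = json.dumps(build_chat_payload(question, passages)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["message"]["content"]


# Build (but do not send) an example request with made-up passages.
payload = build_chat_payload(
    "Which quantization does this build use?",
    ["The GGUF file is quantized with Q4_K_M.", "The base model is microsoft/phi-4."],
)
print(json.dumps(payload, indent=2))
```

With the Ollama server running, calling ask(question, passages) returns the model's answer as a string.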

Limitations

  • Intended for grounded QA over retrieved context; do not rely on it as an unconstrained world-knowledge model without retrieval.
  • Q4_K_M is a speed/size trade-off versus higher bit-widths or FP16.
  • Response quality depends on the quality of the retrieved context and on wrapping it in <context>…</context> tags as in training.

License

  • MIT: The model card, Modelfile, and other metadata added by nadiva1243 are released under the MIT License (see the LICENSE file in this repository).
  • Base weights: The GGUF is a derivative of microsoft/phi-4. You must also comply with the license and terms of the base model and with any requirements of the training datasets when redistributing or using the weights.

Citation

@misc{phi4_rag_gguf_monkeygrab,
  title        = {Phi-4 RAG LoRA Fine-tune (Q4_K_M GGUF)},
  author       = {nadiva1243},
  year         = {2026},
  howpublished = {Hugging Face: \url{https://huggingface.co/nadiva1243/phi4RAG}},
  note         = {Base: microsoft/phi-4; training: MonkeyGrab train-phi4.py v1}
}