Looks like num_beam_groups was removed from generation in v5. Is there an alternative way to use Diverse Beam Search?
It seems to have been removed from the documentation but is still usable. However, since it may be removed entirely in the future, it is better to load it from the Hub.
What happened to num_beam_groups in Transformers v5?
You're seeing a real change in the documentation surface, and a bigger change in where the feature lives.
- In the v5.1.0 "Text generation" docs, `GenerationConfig` is presented as "the complete list of generation parameters", and (in the visible parameter list) it only calls out things like `do_sample` and `num_beams`; `num_beam_groups` is not present on that page. (Hugging Face)
- Meanwhile, Transformers maintainers removed the in-core "Group Beam Search" implementation (the mechanism that powered "Diverse Beam Search") and moved it to a Hub-hosted `custom_generate` repository. The PR states it "Removes Group Beam Search … directs users to transformers-community/group-beam-search" and that `trust_remote_code=True` is required. (GitHub)
So the short version is:
- Diverse Beam Search = Group Beam Search
- In v5, the official way to run it is via `custom_generate` (a Hub extension), not via the in-library decoding loop.
Background: what "Diverse Beam Search" is (and why num_beam_groups exists)
Classic beam search often returns near-duplicates. Diverse Beam Search (Vijayakumar et al.) encourages beams to spread out by splitting beams into groups and applying a diversity penalty so later groups avoid choosing the same tokens as earlier groups at each step. (arXiv)
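As a rough illustration of the mechanism (not the library's actual implementation, which does more bookkeeping per batch and group), the penalty can be thought of as subtracting `diversity_penalty` from the scores of tokens that earlier groups have already picked at the same decoding step:

import torch

def apply_diversity_penalty(step_scores: torch.Tensor,
                            tokens_picked_by_earlier_groups: list[int],
                            diversity_penalty: float) -> torch.Tensor:
    # Sketch only: tokens already chosen by earlier groups at this step get
    # their scores reduced, so the current group prefers different continuations.
    penalized = step_scores.clone()
    penalized[..., tokens_picked_by_earlier_groups] -= diversity_penalty
    return penalized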
Historically, Transformers exposed this as:
- `num_beams`
- `num_beam_groups`
- `diversity_penalty`

…and older docs explicitly listed `num_beam_groups` and `diversity_penalty` under generation parameters. (Hugging Face)
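For context, in Transformers 4.x this was a plain `generate()` call with no extra setup, roughly like the sketch below (assuming a loaded `model` and tokenized `inputs`, as in the example further down); this in-core path is what v5 removed:

# How Diverse Beam Search was typically invoked in Transformers 4.x:
out = model.generate(
    **inputs,
    num_beams=8,
    num_beam_groups=4,   # split the 8 beams into 4 groups of 2
    diversity_penalty=1.0,
    num_return_sequences=8,
)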
The v5 replacement: use the Hub custom_generate method
1) Use custom_generate="transformers-community/group-beam-search"
Transformers v5 maps GROUP_BEAM_SEARCH to the Hub repo transformers-community/group-beam-search. (GitHub)
When a strategy has been moved, v5 emits a warning telling you to set custom_generate=... explicitly, and it requires trust_remote_code=True because it loads code from the Hub. (GitHub)
The Hub repo itself documents the intended usage and provides an example generate() call with the familiar parameters. (Hugging Face)
Minimal working example (decoder-only model)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "Qwen/Qwen3-0.6B" # example; use your decoder-only model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.float16, device_map="auto")
inputs = tok("Write 5 different product taglines for a coffee brand:", return_tensors="pt").to(model.device)
out = model.generate(
**inputs,
max_new_tokens=60,
# Diverse / Group Beam Search knobs:
num_beams=10,
num_beam_groups=5, # must divide num_beams
diversity_penalty=1.0,
# Usually return multiple candidates to see the diversity:
num_return_sequences=10,
# v5 path for this method:
do_sample=False,
custom_generate="transformers-community/group-beam-search",
trust_remote_code=True,
)
print(tok.batch_decode(out, skip_special_tokens=True))
Key constraints / expectations
- `num_beams % num_beam_groups == 0` (e.g., 8 beams with 4 groups). (Hugging Face)
- The repo currently states model compatibility: "Decoder-only transformer models." (Hugging Face)
- `custom_generate` is an official `generate()` parameter in v5 (it can be a string repo ID or a callable). (Hugging Face)
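If you want to fail fast on the divisibility rule before any Hub code runs, a tiny guard like this (my own helper, not part of the library) is enough:

def check_group_beam_config(num_beams: int, num_beam_groups: int) -> None:
    # Mirrors the documented constraint: beams are split evenly across groups.
    if num_beam_groups <= 0 or num_beams % num_beam_groups != 0:
        raise ValueError(
            f"num_beams={num_beams} must be divisible by num_beam_groups={num_beam_groups}"
        )

check_group_beam_config(num_beams=10, num_beam_groups=5)  # OK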
Practical tuning (what usually works)
Picking num_beam_groups
Think "how many distinct clusters of outputs do I want?"
- Want ~4 distinct styles? Try `num_beams=8`, `num_beam_groups=4`
- Want ~6 distinct candidates? Try `num_beams=12`, `num_beam_groups=4` or `6`
More groups → stronger pressure to diverge, but sometimes slightly worse average quality.
Picking diversity_penalty
- Start at `0.5`
- Increase to `1.0`–`1.5` if results are still too similar
- Decrease if outputs become off-topic / low-quality
Always return multiple sequences
Diverse beam search is most useful when you actually inspect multiple candidates:
`num_return_sequences = num_beams` is a common pattern. (Hugging Face)
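A small sweep like the one below (reusing `model`, `tok`, and `inputs` from the minimal example above) makes this tuning advice concrete; the specific penalty values are just starting points:

for penalty in (0.5, 1.0, 1.5):
    out = model.generate(
        **inputs,
        max_new_tokens=60,
        do_sample=False,
        num_beams=8,
        num_beam_groups=4,          # 4 groups of 2 beams
        diversity_penalty=penalty,
        num_return_sequences=8,     # return every beam so the spread is visible
        custom_generate="transformers-community/group-beam-search",
        trust_remote_code=True,
    )
    print(f"--- diversity_penalty={penalty} ---")
    for text in tok.batch_decode(out, skip_special_tokens=True):
        print(text)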
If you don't want to trust remote code
Because this uses Hub-hosted code, the safest "no remote execution" pattern is:
- Vendor the generator code locally (copy it into your repo)
- Pass a callable as `custom_generate=...` (see the sketch below)
Transformers v5 explicitly supports custom_generate as a callable and extracts callable-specific kwargs by signature inspection. (GitHub)
This gives you:
- no network execution at runtime
- easier reproducibility and auditing
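A hedged sketch of that pattern, assuming you have copied the repo's decoding loop into a local module (the module name here is made up) and that it follows the usual `custom_generate` contract of taking the model as its first argument:

# Vendored, audited copy of the Hub repo's custom_generate/generate.py:
from my_generation_methods.group_beam_search import generate as group_beam_generate  # hypothetical local module

out = model.generate(
    **inputs,
    max_new_tokens=60,
    num_beams=8,
    num_beam_groups=4,
    diversity_penalty=1.0,
    num_return_sequences=8,
    custom_generate=group_beam_generate,  # a callable: nothing is downloaded, no trust_remote_code
)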
Alternatives if you mainly want "multiple different candidates"
If your goal is "give me multiple diverse completions," these are often simpler and sometimes better than beam-based diversity:
- Sampling (`do_sample=True`, `top_p`, `temperature`, `num_return_sequences`)
- Beam + sampling (`num_beams > 1` and `do_sample=True`)
Transformers' generation guide frames sampling as the standard approach for diversity. (Hugging Face)
A common production approach is:
- generate a pool with sampling
- rerank with a scorer (or simple heuristics), possibly using an explicit "diversity" filter (MMR / n-gram overlap penalty)
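A minimal sketch of that pool-then-filter pattern (again reusing `model`, `tok`, and `inputs` from the earlier example; the 0.5 overlap threshold is an arbitrary choice):

pool = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    max_new_tokens=60,
    num_return_sequences=16,
)
candidates = tok.batch_decode(pool, skip_special_tokens=True)

def trigrams(text: str) -> set[tuple[str, str, str]]:
    words = text.split()
    return set(zip(words, words[1:], words[2:]))

selected: list[str] = []
for cand in candidates:
    # Keep a candidate only if it shares few trigrams with already-kept ones
    # (a crude MMR-style diversity filter).
    if all(
        len(trigrams(cand) & trigrams(kept)) / max(1, len(trigrams(cand) | trigrams(kept))) < 0.5
        for kept in selected
    ):
        selected.append(cand)

print(selected[:5])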
Bottom line
- In v5, `num_beam_groups` is no longer presented as a first-class core strategy knob on the text-generation docs page. (Hugging Face)
- The official v5 way to run Diverse Beam Search is to load it as a Hub `custom_generate` method: `transformers-community/group-beam-search`, with `trust_remote_code=True`. (GitHub)
This script compares standard beam search vs Diverse Beam Search (Group Beam Search) in Transformers v5 using custom_generate (Hub-hosted decoding loop). Transformers documents custom_generate as a generate() argument that can load a custom_generate/generate.py from a Hub repo. (Hugging Face) The community group-beam-search repo documents using num_beams, num_beam_groups, and diversity_penalty. (Hugging Face)
"""
Verification demo: Diverse Beam Search (Group Beam Search) in Transformers v5 via `custom_generate`
References:
- Transformers `generate()` docs (mentions `custom_generate`): https://huggingface.co/docs/transformers/en/main_classes/text_generation
- Transformers generation strategies (explains how `custom_generate` is loaded/called): https://huggingface.co/docs/transformers/en/generation_strategies
- Diverse / Group Beam Search custom generator repo: https://huggingface.co/transformers-community/group-beam-search
- Tiny public model used in this demo: https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct
Dependencies (install one-time):
pip install -U "transformers>=5.1.0" torch
Notes:
- This demo is CPU/GPU safe:
- CPU: float32 (safer / more compatible)
- GPU (e.g., T4): float16 to reduce VRAM
- `custom_generate="transformers-community/group-beam-search"` requires `trust_remote_code=True`
because it loads a decoding loop from the Hub repo.
"""
from __future__ import annotations
from itertools import combinations
import torch
import transformers
from packaging import version
from transformers import AutoModelForCausalLM, AutoTokenizer
def pick_device_and_dtype() -> tuple[torch.device, torch.dtype]:
"""Pick device/dtype with low RAM/VRAM in mind."""
if torch.cuda.is_available():
return torch.device("cuda"), torch.float16 # T4-friendly
return torch.device("cpu"), torch.float32
def maybe_format_as_chat(tokenizer, user_text: str) -> str:
"""
If the tokenizer provides a chat template, use it (often better for instruct models).
Otherwise, fall back to a plain prompt.
"""
if hasattr(tokenizer, "apply_chat_template"):
messages = [{"role": "user", "content": user_text}]
# add_generation_prompt=True appends an assistant header in many templates
try:
return tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
except TypeError:
# Some tokenizers have different signatures; keep it simple.
return tokenizer.apply_chat_template(messages, tokenize=False)
return user_text
def decode_many(tokenizer, sequences: torch.Tensor) -> list[str]:
"""Decode (batch, seq_len) into list[str]."""
return tokenizer.batch_decode(sequences, skip_special_tokens=True)
def trigram_jaccard(a: str, b: str) -> float:
"""
Very simple diversity proxy:
- tokenize to "words"
- compute trigram set overlap (Jaccard)
Lower is "more different".
"""
wa = a.split()
wb = b.split()
ta = set(zip(wa, wa[1:], wa[2:])) if len(wa) >= 3 else set()
tb = set(zip(wb, wb[1:], wb[2:])) if len(wb) >= 3 else set()
if not ta and not tb:
return 1.0 if a.strip() == b.strip() else 0.0
if not ta or not tb:
return 0.0
return len(ta & tb) / len(ta | tb)
def avg_pairwise_similarity(texts: list[str]) -> float:
"""Average pairwise trigram Jaccard similarity across outputs."""
if len(texts) < 2:
return 1.0
sims = [trigram_jaccard(x, y) for x, y in combinations(texts, 2)]
return sum(sims) / len(sims)
def main() -> None:
# ---- Version guard (custom_generate is documented in modern Transformers) ----
if version.parse(transformers.__version__) < version.parse("5.1.0"):
raise RuntimeError(
f"Please upgrade Transformers to >= 5.1.0 (current: {transformers.__version__}).\n"
"Docs: https://huggingface.co/docs/transformers/en/main_classes/text_generation"
)
# ---- Model choice: very small public instruct model ----
model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
device, dtype = pick_device_and_dtype()
print(f"[info] transformers={transformers.__version__} torch={torch.__version__}")
print(f"[info] device={device.type} dtype={dtype}")
# Reduce surprises in low-memory environments
torch.set_grad_enabled(False)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Many decoder-only tokenizers have no pad token; beam search benefits from having one.
if tokenizer.pad_token_id is None:
tokenizer.pad_token = tokenizer.eos_token
# Load model with conservative settings (small model; should fit easily)
model = AutoModelForCausalLM.from_pretrained(
model_id,
        dtype=dtype,  # v5-style argument name (was torch_dtype in older releases)
low_cpu_mem_usage=True,
).to(device)
model.eval()
prompt = "Give 6 different, concrete uses for a paperclip. Output as 6 bullet points."
prompt_text = maybe_format_as_chat(tokenizer, prompt)
inputs = tokenizer(prompt_text, return_tensors="pt").to(device)
# Shared generation settings
max_new_tokens = 80
num_beams = 6
# Returning all beams makes it easier to SEE whether they differ.
num_return_sequences = num_beams
# ---- 1) Baseline: standard beam search (often produces near-duplicates) ----
with torch.inference_mode():
out_beam = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=False,
num_beams=num_beams,
num_return_sequences=num_return_sequences,
pad_token_id=tokenizer.pad_token_id,
eos_token_id=tokenizer.eos_token_id,
)
beam_texts = decode_many(tokenizer, out_beam)
# Remove the prompt prefix when printing (optional, heuristic)
# Keeping it simple: print full decoded text.
print("\n" + "=" * 80)
print("STANDARD BEAM SEARCH OUTPUTS")
print("=" * 80)
for i, t in enumerate(beam_texts, 1):
print(f"\n--- Beam #{i} ---\n{t}")
beam_sim = avg_pairwise_similarity(beam_texts)
print(f"\n[metric] Avg pairwise trigram Jaccard similarity (beam): {beam_sim:.3f}")
# ---- 2) Diverse Beam Search (Group Beam Search) via Hub custom_generate ----
# The group-beam-search repo documents these knobs:
# - num_beam_groups must divide num_beams
# - diversity_penalty > 0 for actual diversity pressure
# Repo: https://huggingface.co/transformers-community/group-beam-search
num_beam_groups = 3
diversity_penalty = 1.0
assert num_beams % num_beam_groups == 0, "num_beams must be divisible by num_beam_groups"
with torch.inference_mode():
out_diverse = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=False, # diverse/group beam search is a beam method
num_beams=num_beams,
num_beam_groups=num_beam_groups,
diversity_penalty=diversity_penalty,
num_return_sequences=num_return_sequences,
pad_token_id=tokenizer.pad_token_id,
eos_token_id=tokenizer.eos_token_id,
# Key v5+ piece: load the decoding loop from the Hub repo
custom_generate="transformers-community/group-beam-search",
trust_remote_code=True,
# Tip (optional): pin revisions for reproducibility after auditing code:
# revision="<tag-or-commit-sha>"
)
diverse_texts = decode_many(tokenizer, out_diverse)
print("\n" + "=" * 80)
print("DIVERSE (GROUP) BEAM SEARCH OUTPUTS")
print("=" * 80)
for i, t in enumerate(diverse_texts, 1):
print(f"\n--- Diverse Beam #{i} ---\n{t}")
diverse_sim = avg_pairwise_similarity(diverse_texts)
print(f"\n[metric] Avg pairwise trigram Jaccard similarity (diverse): {diverse_sim:.3f}")
print("\n" + "=" * 80)
print("INTERPRETATION")
print("=" * 80)
print(
"Lower similarity usually means the candidates are more diverse.\n"
"If you don't see a decrease, try:\n"
" - increasing diversity_penalty (e.g., 1.5)\n"
" - using more beams (e.g., num_beams=8, num_beam_groups=4)\n"
" - changing the prompt to a task with multiple plausible answers."
)
if __name__ == "__main__":
    # No argparse needed; just run: python demo.py
main()