Num_beam_groups removed in V5?

Looks like num_beam_groups was removed from generation in v5. Is there an alternative way to use Diverse Beam Search?

It seems to have been removed from the documentation but is still usable. However, since it may be removed entirely in the future, it's better to load it from the Hub.


What happened to num_beam_groups in Transformers v5?

You’re seeing a real change in the documentation surface, and a bigger change in where the feature lives.

  • In the v5.1.0 “Text generation” docs, GenerationConfig is presented as “the complete list of generation parameters” and (in the visible parameter list) it only calls out things like do_sample and num_beams—num_beam_groups is not present on that page. (Hugging Face)
  • Meanwhile, Transformers maintainers removed the in-core “Group Beam Search” implementation (the mechanism that powered “Diverse Beam Search”) and moved it to a Hub-hosted custom_generate repository. The PR states it “Removes Group Beam Search … directs users to transformers-community/group-beam-search” and that trust_remote_code=True is required. (GitHub)

So the short version is:

  • Diverse Beam Search ≈ Group Beam Search
  • In v5, the official way to run it is via custom_generate (Hub extension), not via the in-library decoding loop.

Background: what “Diverse Beam Search” is (and why num_beam_groups exists)

Classic beam search often returns near-duplicates. Diverse Beam Search (Vijayakumar et al.) encourages beams to spread out by splitting beams into groups and applying a diversity penalty so later groups avoid choosing the same tokens as earlier groups at each step. (arXiv)

Historically, Transformers exposed this as:

  • num_beams
  • num_beam_groups
  • diversity_penalty

…and older docs explicitly listed num_beam_groups and diversity_penalty under generation parameters. (Hugging Face)
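
To make the mechanism concrete, here is a tiny conceptual sketch of the diversity penalty, assuming the standard Hamming-diversity formulation from the paper (illustrative plain-PyTorch code, not the library's implementation): at each step, a later group's next-token scores are reduced for every token that an earlier group has already chosen at that step.

import torch

vocab_size = 8
diversity_penalty = 1.0

# Next-token scores for one beam that belongs to group g (illustrative values)
scores = torch.randn(vocab_size)

# Tokens chosen at this step by beams in groups 0 .. g-1
tokens_picked_by_earlier_groups = [2, 5, 2]

# Count how often each token was already used, then subtract the penalty
counts = torch.zeros(vocab_size)
for t in tokens_picked_by_earlier_groups:
    counts[t] += 1.0

penalized_scores = scores - diversity_penalty * counts  # token 2 is penalized twice
print(penalized_scores)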


The v5 replacement: use the Hub custom_generate method

1) Use custom_generate="transformers-community/group-beam-search"

Transformers v5 maps GROUP_BEAM_SEARCH to the Hub repo transformers-community/group-beam-search. (GitHub)
When a strategy has been moved, v5 emits a warning telling you to set custom_generate=... explicitly, and it requires trust_remote_code=True because it loads code from the Hub. (GitHub)

The Hub repo itself documents the intended usage and provides an example generate() call with the familiar parameters. (Hugging Face)

Minimal working example (decoder-only model)

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Qwen/Qwen3-0.6B"  # example; use your decoder-only model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.float16, device_map="auto")

inputs = tok("Write 5 different product taglines for a coffee brand:", return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=60,

    # Diverse / Group Beam Search knobs:
    num_beams=10,
    num_beam_groups=5,       # must divide num_beams
    diversity_penalty=1.0,

    # Usually return multiple candidates to see the diversity:
    num_return_sequences=10,

    # v5 path for this method:
    do_sample=False,
    custom_generate="transformers-community/group-beam-search",
    trust_remote_code=True,
)

print(tok.batch_decode(out, skip_special_tokens=True))

Key constraints / expectations

  • num_beams % num_beam_groups == 0 (e.g., 8 beams + 4 groups). (Hugging Face)
  • The repo currently states model compatibility: “Decoder-only transformer models.” (Hugging Face)
  • custom_generate is an official generate() parameter in v5 (it can be a string repo ID or a callable). (Hugging Face)

Practical tuning (what usually works)

Picking num_beam_groups

Think “how many distinct clusters of outputs do I want?”

  • Want ~4 distinct styles? Try num_beams=8, num_beam_groups=4
  • Want ~6 distinct candidates? Try num_beams=12, num_beam_groups=4 or 6

More groups ⇒ stronger pressure to diverge, but sometimes slightly worse average quality.

Picking diversity_penalty

  • Start at 0.5
  • Increase to 1.0–1.5 if results are still too similar
  • Decrease if outputs become off-topic / low-quality

Always return multiple sequences

Diverse beam search is most useful when you actually inspect multiple candidates:

  • num_return_sequences = num_beams is a common pattern. (Hugging Face)
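
Putting the tuning advice together, a reasonable starting point looks like this (the values are illustrative, and model, tok, and inputs are reused from the earlier example):

out = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=False,
    num_beams=8,                 # 4 groups of 2 beams
    num_beam_groups=4,
    diversity_penalty=0.5,       # raise toward 1.0-1.5 if candidates still look alike
    num_return_sequences=8,      # return every beam so you can inspect the diversity
    custom_generate="transformers-community/group-beam-search",
    trust_remote_code=True,
)
print(tok.batch_decode(out, skip_special_tokens=True))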

If you don’t want to trust remote code

Because this uses Hub-hosted code, the safest “no remote execution” pattern is:

  1. Vendor the generator code locally (copy it into your repo)
  2. Pass a callable as custom_generate=...

Transformers v5 explicitly supports custom_generate as a callable and extracts callable-specific kwargs by signature inspection. (GitHub)
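
A minimal sketch of that pattern, assuming you have copied the Hub repo's custom_generate/generate.py into your project as vendored/group_beam_search.py (the module path is hypothetical, and you should check the vendored file for the exact callable signature it expects):

# Hypothetical local layout after vendoring and auditing the Hub repo's code:
#   vendored/group_beam_search.py   <- copy of custom_generate/generate.py
from vendored.group_beam_search import generate as group_beam_generate  # assumed path

out = model.generate(
    **inputs,
    max_new_tokens=60,
    num_beams=8,
    num_beam_groups=4,
    diversity_penalty=1.0,
    num_return_sequences=8,
    custom_generate=group_beam_generate,  # callable: no trust_remote_code, no network fetch
)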

This gives you:

  • no network execution at runtime
  • easier reproducibility and auditing

Alternatives if you mainly want “multiple different candidates”

If your goal is “give me multiple diverse completions,” these are often simpler and sometimes better than beam-based diversity:

  1. Sampling (do_sample=True, top_p, temperature, num_return_sequences)
  2. Beam + sampling (num_beams>1 and do_sample=True)
    Transformers’ generation guide frames sampling as the standard approach for diversity. (Hugging Face)

A common production approach is:

  • generate a pool with sampling
  • rerank with a scorer (or simple heuristics), possibly using an explicit “diversity” filter (MMR / n-gram overlap penalty)
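
A minimal sketch of that pool-then-filter approach (the helper names below are ours, not library APIs; model, tok, and inputs are reused from the earlier example):

# 1) Sample a pool of candidates
pool = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.95,
    temperature=0.9,
    num_return_sequences=8,
    max_new_tokens=60,
)
texts = tok.batch_decode(pool, skip_special_tokens=True)

# 2) Greedily keep candidates that don't overlap too much (trigram Jaccard)
def trigrams(s: str) -> set:
    w = s.split()
    return set(zip(w, w[1:], w[2:]))

def jaccard(a: str, b: str) -> float:
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / max(len(ta | tb), 1)

def select_diverse(candidates: list[str], k: int = 4, max_sim: float = 0.5) -> list[str]:
    picked: list[str] = []
    for c in candidates:  # assume candidates are already ordered by your own quality score
        if all(jaccard(c, p) <= max_sim for p in picked):
            picked.append(c)
        if len(picked) == k:
            break
    return picked

print(select_diverse(texts))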

Bottom line

  • In v5, num_beam_groups is no longer presented as a first-class core strategy knob on the text-generation docs page. (Hugging Face)
  • The official v5 way to run Diverse Beam Search is to load it as a Hub custom_generate method: transformers-community/group-beam-search, with trust_remote_code=True. (GitHub)

This script compares standard beam search vs Diverse Beam Search (Group Beam Search) in Transformers v5 using custom_generate (Hub-hosted decoding loop). Transformers documents custom_generate as a generate() argument that can load a custom_generate/generate.py from a Hub repo. (Hugging Face) The community group-beam-search repo documents using num_beams, num_beam_groups, and diversity_penalty. (Hugging Face)

"""
Verification demo: Diverse Beam Search (Group Beam Search) in Transformers v5 via `custom_generate`

References (URLs intentionally included here, per request):
- Transformers `generate()` docs (mentions `custom_generate`): https://huggingface.co/docs/transformers/en/main_classes/text_generation
- Transformers generation strategies (explains how `custom_generate` is loaded/called): https://huggingface.co/docs/transformers/en/generation_strategies
- Diverse / Group Beam Search custom generator repo: https://huggingface.co/transformers-community/group-beam-search
- Tiny public model used in this demo: https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct

Dependencies (install one-time):
  pip install -U "transformers>=5.1.0" torch
Notes:
- This demo is CPU/GPU safe:
  - CPU: float32 (safer / more compatible)
  - GPU (e.g., T4): float16 to reduce VRAM
- `custom_generate="transformers-community/group-beam-search"` requires `trust_remote_code=True`
  because it loads a decoding loop from the Hub repo.
"""

from __future__ import annotations

from itertools import combinations

import torch
import transformers
from packaging import version
from transformers import AutoModelForCausalLM, AutoTokenizer


def pick_device_and_dtype() -> tuple[torch.device, torch.dtype]:
    """Pick device/dtype with low RAM/VRAM in mind."""
    if torch.cuda.is_available():
        return torch.device("cuda"), torch.float16  # T4-friendly
    return torch.device("cpu"), torch.float32


def maybe_format_as_chat(tokenizer, user_text: str) -> str:
    """
    If the tokenizer provides a chat template, use it (often better for instruct models).
    Otherwise, fall back to a plain prompt.
    """
    if hasattr(tokenizer, "apply_chat_template"):
        messages = [{"role": "user", "content": user_text}]
        # add_generation_prompt=True appends an assistant header in many templates
        try:
            return tokenizer.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=True
            )
        except TypeError:
            # Some tokenizers have different signatures; keep it simple.
            return tokenizer.apply_chat_template(messages, tokenize=False)
    return user_text


def decode_many(tokenizer, sequences: torch.Tensor) -> list[str]:
    """Decode (batch, seq_len) into list[str]."""
    return tokenizer.batch_decode(sequences, skip_special_tokens=True)


def trigram_jaccard(a: str, b: str) -> float:
    """
    Very simple diversity proxy:
    - tokenize to "words"
    - compute trigram set overlap (Jaccard)
    Lower is "more different".
    """
    wa = a.split()
    wb = b.split()
    ta = set(zip(wa, wa[1:], wa[2:])) if len(wa) >= 3 else set()
    tb = set(zip(wb, wb[1:], wb[2:])) if len(wb) >= 3 else set()
    if not ta and not tb:
        return 1.0 if a.strip() == b.strip() else 0.0
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def avg_pairwise_similarity(texts: list[str]) -> float:
    """Average pairwise trigram Jaccard similarity across outputs."""
    if len(texts) < 2:
        return 1.0
    sims = [trigram_jaccard(x, y) for x, y in combinations(texts, 2)]
    return sum(sims) / len(sims)


def main() -> None:
    # ---- Version guard (custom_generate is documented in modern Transformers) ----
    if version.parse(transformers.__version__) < version.parse("5.1.0"):
        raise RuntimeError(
            f"Please upgrade Transformers to >= 5.1.0 (current: {transformers.__version__}).\n"
            "Docs: https://huggingface.co/docs/transformers/en/main_classes/text_generation"
        )

    # ---- Model choice: very small public instruct model ----
    model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"

    device, dtype = pick_device_and_dtype()
    print(f"[info] transformers={transformers.__version__} torch={torch.__version__}")
    print(f"[info] device={device.type} dtype={dtype}")

    # Reduce surprises in low-memory environments
    torch.set_grad_enabled(False)

    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Many decoder-only tokenizers have no pad token; beam search benefits from having one.
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Load model with conservative settings (small model; should fit easily)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        dtype=dtype,  # keep consistent with v5's `dtype` argument (used in the example above)
        low_cpu_mem_usage=True,
    ).to(device)
    model.eval()

    prompt = "Give 6 different, concrete uses for a paperclip. Output as 6 bullet points."
    prompt_text = maybe_format_as_chat(tokenizer, prompt)

    inputs = tokenizer(prompt_text, return_tensors="pt").to(device)

    # Shared generation settings
    max_new_tokens = 80
    num_beams = 6
    # Returning all beams makes it easier to SEE whether they differ.
    num_return_sequences = num_beams

    # ---- 1) Baseline: standard beam search (often produces near-duplicates) ----
    with torch.inference_mode():
        out_beam = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            num_beams=num_beams,
            num_return_sequences=num_return_sequences,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    beam_texts = decode_many(tokenizer, out_beam)
    # Remove the prompt prefix when printing (optional, heuristic)
    # Keeping it simple: print full decoded text.
    print("\n" + "=" * 80)
    print("STANDARD BEAM SEARCH OUTPUTS")
    print("=" * 80)
    for i, t in enumerate(beam_texts, 1):
        print(f"\n--- Beam #{i} ---\n{t}")

    beam_sim = avg_pairwise_similarity(beam_texts)
    print(f"\n[metric] Avg pairwise trigram Jaccard similarity (beam): {beam_sim:.3f}")

    # ---- 2) Diverse Beam Search (Group Beam Search) via Hub custom_generate ----
    # The group-beam-search repo documents these knobs:
    # - num_beam_groups must divide num_beams
    # - diversity_penalty > 0 for actual diversity pressure
    # Repo: https://huggingface.co/transformers-community/group-beam-search
    num_beam_groups = 3
    diversity_penalty = 1.0
    assert num_beams % num_beam_groups == 0, "num_beams must be divisible by num_beam_groups"

    with torch.inference_mode():
        out_diverse = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # diverse/group beam search is a beam method
            num_beams=num_beams,
            num_beam_groups=num_beam_groups,
            diversity_penalty=diversity_penalty,
            num_return_sequences=num_return_sequences,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
            # Key v5+ piece: load the decoding loop from the Hub repo
            custom_generate="transformers-community/group-beam-search",
            trust_remote_code=True,
            # Tip (optional): pin revisions for reproducibility after auditing code:
            # revision="<tag-or-commit-sha>"
        )

    diverse_texts = decode_many(tokenizer, out_diverse)

    print("\n" + "=" * 80)
    print("DIVERSE (GROUP) BEAM SEARCH OUTPUTS")
    print("=" * 80)
    for i, t in enumerate(diverse_texts, 1):
        print(f"\n--- Diverse Beam #{i} ---\n{t}")

    diverse_sim = avg_pairwise_similarity(diverse_texts)
    print(f"\n[metric] Avg pairwise trigram Jaccard similarity (diverse): {diverse_sim:.3f}")

    print("\n" + "=" * 80)
    print("INTERPRETATION")
    print("=" * 80)
    print(
        "Lower similarity usually means the candidates are more diverse.\n"
        "If you don't see a decrease, try:\n"
        "  - increasing diversity_penalty (e.g., 1.5)\n"
        "  - using more beams (e.g., num_beams=8, num_beam_groups=4)\n"
        "  - changing the prompt to a task with multiple plausible answers."
    )


if __name__ == "__main__":
    # No argparse per request; just run: python demo.py
    main()