I trained my model with Colab and can't seem to create an endpoint for it because it's saying the model is gated even when I have access
Even for a fine-tuned model, if the repository contains only LoRA adapter weights, a base model is still referenced (specified in README.md under `base_model:`). If that base model is gated, the LoRA loading process cannot download the base model part without a token that has access.
The most reliable approach is to merge the adapter into the base model beforehand. Then no token is needed at deploy time, unless you gate the merged model yourself.
What “gated” means (and why Colab works but an Endpoint fails)
A gated model on Hugging Face usually means you must agree to a license / request access before the Hub will let you download the files (weights, config.json, tokenizer, etc.). Access is tied to your Hugging Face user identity, and tokens authenticate as that user. (TECHCOMMUNITY.MICROSOFT.COM)
When you train in Colab, it often works because:
- you were already authenticated in Colab (e.g., `huggingface-cli login` / `hf auth login`, `huggingface_hub.login()`, or `HF_TOKEN` set), and/or
- the base Llama model was already cached in the Colab session.
A Hugging Face Inference Endpoint runs in a separate managed container. It does not automatically inherit your Colab login/token, so when it tries to download:
- your fine-tuned repo and/or
- the base Llama repo your fine-tune depends on (common with LoRA / adapters),
it can fail with “model is gated” / 401 / 403 unless you explicitly provide a token in the endpoint environment. (Hugging Face Forums)
The most common fix: set HF_TOKEN on the Endpoint
Hugging Face support has explicitly recommended this for fine-tuned gated models (including Llama): add HF_TOKEN as an environment variable on the endpoint, with the value being your Hugging Face User Access Token. (Hugging Face Forums)
Do this in the Inference Endpoints UI
1. Go to the endpoint creation page (or open the endpoint that's failing).
2. Find Advanced configuration (or the similar section for env vars).
3. Add an environment variable:
   - Key: `HF_TOKEN`
   - Value: your Hugging Face User Access Token (a "read" token is usually enough just to download models) (Hugging Face Forums)
4. Save / redeploy / restart the endpoint.
Important details that commonly break this
- The variable name must be exactly
HF_TOKEN(notHF_TOKEN_API, notHUGGINGFACE_TOKEN, etc.).HF_TOKENis the standard env var used by Hugging Face tooling to authenticate to gated/private repos. (Hugging Face) - Use a token from the same Hugging Face account that was approved for the Llama gate.
- If your endpoint is in an Organization namespace, access to a gated base model is still typically tied to an actual user’s acceptance/approval. Use a user token that definitely has access. (TECHCOMMUNITY.MICROSOFT.COM)
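One quick way to confirm which account a token actually authenticates as (a sketch using `huggingface_hub.whoami`; the token value in the example is a placeholder):

```python
# Sketch: check which Hugging Face account a token belongs to, since gate
# approval is tied to that user. The token string in the example is a placeholder.
from huggingface_hub import whoami


def token_identity(token: str) -> str:
    """Return the username the given token authenticates as."""
    return whoami(token=token)["name"]


# Example (requires network access and a real token):
# print(token_identity("hf_xxx"))
```

If the printed name is not the account that was approved for the Llama gate, that token will never work on the endpoint.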
Fix token problems: verify the token you’re using really has access
Even if you have access in the browser, endpoint failures often come from using the wrong token, or from a fine-grained token that is missing repo permissions.
1) If you’re using fine-grained tokens, make sure the repo is allowed
Fine-grained tokens can be restricted to specific repos; if the token doesn’t include the gated Llama repo (and your model repo), downloads will fail. The HF cookbook explicitly notes fine-grained tokens need the right repository permissions and endpoint permissions. (Hugging Face)
2) Quick local test (recommended)
On any machine (or in a fresh Colab runtime) test with the same token you plan to put in the endpoint:
- Log in: `hf auth login` (Hugging Face)
- Try downloading a small file from the gated base model (like `config.json`) using the CLI download command (or with `huggingface_hub`). If this fails with 401/403, the endpoint will fail too.
This isolates “token/access problem” from “endpoint configuration problem”. (Hugging Face)
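That check can be scripted with `huggingface_hub` (a sketch; the repo id and token in the example are placeholders, and the error types assume a current `huggingface_hub` version):

```python
# Sketch: verify that a token can download one small file from a gated repo.
from huggingface_hub import hf_hub_download
from huggingface_hub.utils import GatedRepoError, HfHubHTTPError


def can_access(repo_id: str, token: str) -> bool:
    """True if the token can download config.json from the given repo."""
    try:
        hf_hub_download(repo_id, "config.json", token=token)
        return True
    except (GatedRepoError, HfHubHTTPError):
        return False


# Example (requires network access and a real token):
# print(can_access("meta-llama/...", "hf_xxx"))
```

Run it with the exact token string you plan to paste into the endpoint, against both the base repo and your fine-tuned repo.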
If you fine-tuned with LoRA/adapters: you may be accidentally deploying only the adapter
This is a very common Llama workflow:
- Your Hub repo contains only LoRA adapter weights.
- At inference time, the system must also download the base model (gated).
In that setup, the endpoint must authenticate to download the base model, so HF_TOKEN becomes mandatory. A real-world Endpoint case deploying a private LoRA adapter was fixed exactly by adding HF_TOKEN. (Hugging Face Forums)
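A quick way to tell whether your Hub repo is adapter-only is to inspect its file list: adapter repos ship `adapter_config.json` instead of full model weights (a sketch, assuming `huggingface_hub` is installed):

```python
# Sketch: detect an adapter-only repo by its file listing.
from typing import Optional

from huggingface_hub import list_repo_files


def is_adapter_only(repo_id: str, token: Optional[str] = None) -> bool:
    """True if the repo ships LoRA adapter weights rather than a full model."""
    files = list_repo_files(repo_id, token=token)
    return "adapter_config.json" in files
```

If this returns True for your repo, the endpoint will need to fetch the gated base model at startup, and `HF_TOKEN` is not optional.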
Two clean deployment approaches
Option A (most common): keep LoRA + base, but provide HF_TOKEN
- Add `HF_TOKEN` to the endpoint env vars.
- Ensure both the base model repo and the adapter repo are accessible with that token.
This is simplest if you’re comfortable the endpoint will always need that token.
Option B: merge LoRA into the base model and push a merged checkpoint
This avoids needing the endpoint to “compose” base+adapter at startup (and can reduce moving parts). PEFT documents merging adapter weights into the base model for inference. (Hugging Face)
Typical merge pattern:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/..."  # gated base model
adapter_id = "yourname/your-adapter"

# token=True uses your cached login token (or HF_TOKEN) for the gated download
base = AutoModelForCausalLM.from_pretrained(base_id, token=True, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id, token=True)

# Fold the LoRA weights into the base model and drop the PEFT wrappers
model = model.merge_and_unload()
model.save_pretrained("merged", safe_serialization=True)

tok = AutoTokenizer.from_pretrained(base_id, token=True)
tok.save_pretrained("merged")
```
Notes:
- Using `token=True` makes Transformers use your cached login token (or the `HF_TOKEN` env var) where available. (Hugging Face)
- Some Endpoint stacks are more reliable with `.safetensors` (`safe_serialization=True`) than legacy pickle `.bin` weights. (Hugging Face Forums)
Whether you are allowed to publish redistributed merged Llama weights depends on the specific Llama license/terms, so treat this as a technical option and check your compliance requirements.
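If you do merge, the `merged/` folder still has to get onto the Hub before an Endpoint can serve it. A minimal sketch (the repo id and token are placeholders; it assumes the `merged/` directory produced by the merge step exists):

```python
# Sketch: push a merged checkpoint to your own repo so the Endpoint can
# load it directly without touching the gated base repo.
from huggingface_hub import HfApi


def push_merged(folder: str, repo_id: str, token: str) -> None:
    """Create the target repo (if needed) and upload the merged folder."""
    api = HfApi(token=token)
    api.create_repo(repo_id, private=True, exist_ok=True)
    api.upload_folder(folder_path=folder, repo_id=repo_id)


# Example (requires network access and a real token):
# push_merged("merged", "yourname/your-merged-model", "hf_xxx")
```

If you keep the merged repo private, the endpoint still needs `HF_TOKEN` to read it; only a public, ungated repo removes the token requirement entirely.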
If it still fails after setting HF_TOKEN
1) Confirm the Endpoint is actually using the variable
If you previously set something like HF_TOKEN_API, it won’t be picked up automatically. Only HF_TOKEN is the standard env var for Hub auth. (Hugging Face)
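A tiny sanity check you could drop into a custom handler (or surface via logging) to confirm the container actually sees the variable, using only the standard library:

```python
# Sketch: confirm the standard variable (HF_TOKEN) is set in this environment.
# Nonstandard names like HF_TOKEN_API are ignored by Hub tooling.
import os


def has_hub_token() -> bool:
    return bool(os.environ.get("HF_TOKEN"))


print("HF_TOKEN present:", has_hub_token())
```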
2) Ensure the model repo truly belongs to the same account/namespace you’re deploying from
If your fine-tuned model is under an org/private namespace and you deploy from a different identity context, the endpoint may not be able to read it without the token. (The endpoint runtime must be authenticated to read the repo.) (Hugging Face Forums)
3) Consider a custom container if you need more control
If you’re deploying via vLLM/TGI custom images, you still typically solve gated downloads the same way: provide HF_TOKEN to the container environment. (Hugging Face)
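For example, with a TGI-style container the token is passed as a plain environment variable at startup (a sketch; the image tag, model id, and token below are placeholders, not verified values):

```shell
# Sketch: pass HF_TOKEN into a custom inference container at startup.
docker run --gpus all -p 8080:80 \
  -e HF_TOKEN=hf_xxx \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id yourname/your-model
```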
Minimal checklist (quickest path)
- Create a fresh User Access Token (start with a normal “read” token unless you need fine-grained restrictions). (Hugging Face)
- Verify the token can download the gated base model files (fresh environment). (Hugging Face)
- In the Endpoint UI: set env var `HF_TOKEN=<your token>`. (Hugging Face Forums)
- Redeploy.
- If using LoRA/adapters and it’s still flaky: merge + push safetensors (or use a custom container). (Hugging Face)