I trained my model with Colab and can't seem to create an endpoint for it because it's saying the model is gated even when I have access
Even for a fine-tuned model, if the repository contains only LoRA adapter weights, a base model is still referenced (specified in README.md under `base_model:`). If that base model is gated, the LoRA loading process cannot download the base model part without a token that has access.
The most reliable approach is to merge the adapter into the base model beforehand. Then no token is needed at deploy time, unless you gate the merged model yourself.
What “gated” means (and why Colab works but an Endpoint fails)
A gated model on Hugging Face usually means you must agree to a license / request access before the Hub will let you download the files (weights, config.json, tokenizer, etc.). Access is tied to your Hugging Face user identity, and tokens authenticate as that user. (TECHCOMMUNITY.MICROSOFT.COM)
When you train in Colab, it often works because:
- you were already authenticated in Colab (e.g., `huggingface-cli login` / `hf auth login`, `huggingface_hub.login()`, or `HF_TOKEN` set), and/or
- the base Llama model was already cached in the Colab session.
A Hugging Face Inference Endpoint runs in a separate managed container. It does not automatically inherit your Colab login/token, so when it tries to download:
- your fine-tuned repo and/or
- the base Llama repo your fine-tune depends on (common with LoRA / adapters),
it can fail with “model is gated” / 401 / 403 unless you explicitly provide a token in the endpoint environment. (Hugging Face Forums)
The most common fix: set HF_TOKEN on the Endpoint
Hugging Face support has explicitly recommended this for fine-tuned gated models (including Llama): add HF_TOKEN as an environment variable on the endpoint, with the value being your Hugging Face User Access Token. (Hugging Face Forums)
Do this in the Inference Endpoints UI
1. Go to the endpoint creation page (or open the endpoint that's failing).
2. Find Advanced configuration (or the similar section for env vars).
3. Add an environment variable:
   - Key: `HF_TOKEN`
   - Value: your Hugging Face User Access Token (a "read" token is usually enough just to download models) (Hugging Face Forums)
4. Save / redeploy / restart the endpoint.
Important details that commonly break this
- The variable name must be exactly
HF_TOKEN(notHF_TOKEN_API, notHUGGINGFACE_TOKEN, etc.).HF_TOKENis the standard env var used by Hugging Face tooling to authenticate to gated/private repos. (Hugging Face) - Use a token from the same Hugging Face account that was approved for the Llama gate.
- If your endpoint is in an Organization namespace, access to a gated base model is still typically tied to an actual user’s acceptance/approval. Use a user token that definitely has access. (TECHCOMMUNITY.MICROSOFT.COM)
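One quick way to confirm which account a token actually authenticates as (a sketch using `huggingface_hub.whoami`; the token value in the example is a placeholder):

```python
# Sketch: check which Hugging Face account a token belongs to, since gate
# approval is tied to that user. The token string in the example is a placeholder.
from huggingface_hub import whoami


def token_identity(token: str) -> str:
    """Return the username the given token authenticates as."""
    return whoami(token=token)["name"]


# Example (requires network access and a real token):
# print(token_identity("hf_xxx"))
```

If the printed name is not the account that was approved for the Llama gate, that token will never work on the endpoint.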
Fix token problems: verify the token you’re using really has access
Even if you have access in the browser, endpoint failures often come from using the wrong token, or from a fine-grained token that is missing repo permissions.
1) If you’re using fine-grained tokens, make sure the repo is allowed
Fine-grained tokens can be restricted to specific repos; if the token doesn’t include the gated Llama repo (and your model repo), downloads will fail. The HF cookbook explicitly notes fine-grained tokens need the right repository permissions and endpoint permissions. (Hugging Face)
2) Quick local test (recommended)
On any machine (or in a fresh Colab runtime) test with the same token you plan to put in the endpoint:
- Log in: `hf auth login` (Hugging Face)
- Try downloading a small file from the gated base model (like `config.json`) using the CLI download command (or with `huggingface_hub`). If this fails with 401/403, the endpoint will fail too.
This isolates “token/access problem” from “endpoint configuration problem”. (Hugging Face)
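That check can be scripted with `huggingface_hub` (a sketch; the repo id and token in the example are placeholders, and the error types assume a current `huggingface_hub` version):

```python
# Sketch: verify that a token can download one small file from a gated repo.
from huggingface_hub import hf_hub_download
from huggingface_hub.utils import GatedRepoError, HfHubHTTPError


def can_access(repo_id: str, token: str) -> bool:
    """True if the token can download config.json from the given repo."""
    try:
        hf_hub_download(repo_id, "config.json", token=token)
        return True
    except (GatedRepoError, HfHubHTTPError):
        return False


# Example (requires network access and a real token):
# print(can_access("meta-llama/...", "hf_xxx"))
```

Run it with the exact token string you plan to paste into the endpoint, against both the base repo and your fine-tuned repo.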
If you fine-tuned with LoRA/adapters: you may be accidentally deploying only the adapter
This is a very common Llama workflow:
- Your Hub repo contains only LoRA adapter weights.
- At inference time, the system must also download the base model (gated).
In that setup, the endpoint must authenticate to download the base model, so HF_TOKEN becomes mandatory. A real-world Endpoint case deploying a private LoRA adapter was fixed exactly by adding HF_TOKEN. (Hugging Face Forums)
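A quick way to tell whether your Hub repo is adapter-only is to inspect its file list: adapter repos ship `adapter_config.json` instead of full model weights (a sketch, assuming `huggingface_hub` is installed):

```python
# Sketch: detect an adapter-only repo by its file listing.
from typing import Optional

from huggingface_hub import list_repo_files


def is_adapter_only(repo_id: str, token: Optional[str] = None) -> bool:
    """True if the repo ships LoRA adapter weights rather than a full model."""
    files = list_repo_files(repo_id, token=token)
    return "adapter_config.json" in files
```

If this returns True for your repo, the endpoint will need to fetch the gated base model at startup, and `HF_TOKEN` is not optional.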
Two clean deployment approaches
Option A (most common): keep LoRA + base, but provide HF_TOKEN
- Add `HF_TOKEN` to the endpoint env vars.
- Ensure both the base model repo and the adapter repo are accessible with that token.
This is simplest if you’re comfortable the endpoint will always need that token.
Option B: merge LoRA into the base model and push a merged checkpoint
This avoids needing the endpoint to “compose” base+adapter at startup (and can reduce moving parts). PEFT documents merging adapter weights into the base model for inference. (Hugging Face)
Typical merge pattern:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/..."  # gated base model
adapter_id = "yourname/your-adapter"

# token=True uses your cached login token (or HF_TOKEN) for the gated download
base = AutoModelForCausalLM.from_pretrained(base_id, token=True, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id, token=True)

# Fold the LoRA weights into the base model and drop the PEFT wrappers
model = model.merge_and_unload()
model.save_pretrained("merged", safe_serialization=True)

tok = AutoTokenizer.from_pretrained(base_id, token=True)
tok.save_pretrained("merged")
```
Notes:
- Using `token=True` makes Transformers use your cached login token (or the `HF_TOKEN` env var) where available. (Hugging Face)
- Some Endpoint stacks are more reliable with `.safetensors` (`safe_serialization=True`) than legacy pickle `.bin` weights. (Hugging Face Forums)
Whether you are allowed to publish redistributed merged Llama weights depends on the specific Llama license/terms, so treat this as a technical option and check your compliance requirements.
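If you do merge, the `merged/` folder still has to get onto the Hub before an Endpoint can serve it. A minimal sketch (the repo id and token are placeholders; it assumes the `merged/` directory produced by the merge step exists):

```python
# Sketch: push a merged checkpoint to your own repo so the Endpoint can
# load it directly without touching the gated base repo.
from huggingface_hub import HfApi


def push_merged(folder: str, repo_id: str, token: str) -> None:
    """Create the target repo (if needed) and upload the merged folder."""
    api = HfApi(token=token)
    api.create_repo(repo_id, private=True, exist_ok=True)
    api.upload_folder(folder_path=folder, repo_id=repo_id)


# Example (requires network access and a real token):
# push_merged("merged", "yourname/your-merged-model", "hf_xxx")
```

If you keep the merged repo private, the endpoint still needs `HF_TOKEN` to read it; only a public, ungated repo removes the token requirement entirely.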
If it still fails after setting HF_TOKEN
1) Confirm the Endpoint is actually using the variable
If you previously set something like HF_TOKEN_API, it won’t be picked up automatically. Only HF_TOKEN is the standard env var for Hub auth. (Hugging Face)
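A tiny sanity check you could drop into a custom handler (or surface via logging) to confirm the container actually sees the variable, using only the standard library:

```python
# Sketch: confirm the standard variable (HF_TOKEN) is set in this environment.
# Nonstandard names like HF_TOKEN_API are ignored by Hub tooling.
import os


def has_hub_token() -> bool:
    return bool(os.environ.get("HF_TOKEN"))


print("HF_TOKEN present:", has_hub_token())
```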
2) Ensure the model repo truly belongs to the same account/namespace you’re deploying from
If your fine-tuned model is under an org/private namespace and you deploy from a different identity context, the endpoint may not be able to read it without the token. (The endpoint runtime must be authenticated to read the repo.) (Hugging Face Forums)
3) Consider a custom container if you need more control
If you’re deploying via vLLM/TGI custom images, you still typically solve gated downloads the same way: provide HF_TOKEN to the container environment. (Hugging Face)
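For example, with a TGI-style container the token is passed as a plain environment variable at startup (a sketch; the image tag, model id, and token below are placeholders, not verified values):

```shell
# Sketch: pass HF_TOKEN into a custom inference container at startup.
docker run --gpus all -p 8080:80 \
  -e HF_TOKEN=hf_xxx \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id yourname/your-model
```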
Minimal checklist (quickest path)
- Create a fresh User Access Token (start with a normal “read” token unless you need fine-grained restrictions). (Hugging Face)
- Verify the token can download the gated base model files (fresh environment). (Hugging Face)
- In the Endpoint UI: set env var `HF_TOKEN=<your token>`. (Hugging Face Forums)
- Redeploy.
- If using LoRA/adapters and it’s still flaky: merge + push safetensors (or use a custom container). (Hugging Face)