Purpose of commit_hash in PreTrainedModel.from_pretrained

I’ve been digging into the source code of the transformers library and stumbled upon a detail regarding how files are fetched and cached that I’m hoping someone can clarify.

Specifically, I am trying to understand the exact role of the commit_hash argument within the PreTrainedModel.from_pretrained method, and how it differs from revision.

My initial research led me to believe that commit_hash is used to pin consecutive file downloads to a specific state. This prevents a race condition where a branch (like main) is updated halfway through downloading a multi-file model, which would result in mismatched files.

Looking at the code, it seems to support this. First, it tries to obtain the commit_hash of the current revision early on by resolving the config file:

if commit_hash is None:
    if not isinstance(config, PretrainedConfig):
        # We make a call to the config file first (which may be absent) to get the commit hash as soon as possible
        resolved_config_file = cached_file(
            pretrained_model_name_or_path,
            CONFIG_NAME,
            # ... [other args omitted for brevity] ...
            revision=revision,
        )
        commit_hash = extract_commit_hash(resolved_config_file, commit_hash)
    else:
        commit_hash = getattr(config, "_commit_hash", None)

This commit_hash is then passed down into cached_file_kwargs for subsequent loading code (like fetching the actual model weights):

cached_file_kwargs = {
    # ... [other args] ...
    "revision": revision,
    "_commit_hash": commit_hash,
}
resolved_archive_file = cached_file(pretrained_model_name_or_path, filename, **cached_file_kwargs)

Here is my confusion: when I look inside the cached_file function itself, _commit_hash appears to be used only for local cache checks. If a download from the Hub is actually triggered, it still seems to rely on the revision argument. Other loading functions also don’t seem to strictly use the identified commit_hash for the remote fetch.

My questions:

  1. If commit_hash is primarily used just for local cache resolution, couldn’t the revision argument handle that on its own?

  2. Does the underlying huggingface_hub download logic actually use this commit_hash to lock the remote fetch to that specific commit, or is my assumption about preventing mid-download revision changes incorrect?

  3. Does commit_hash serve any other purposes?

Thank you in advance.

Oh… Complicated…


revision and _commit_hash are related, but they are not the same thing.

  • revision = what the caller asked for. It can be a branch, tag, PR ref, or commit hash. That is the public Hub API. (Hugging Face)
  • _commit_hash = the exact immutable commit that revision resolved to during loading. Transformers tries to discover it early and then carry it through later steps. (GitHub)

Why _commit_hash exists at all

The key background is the Hub cache layout.

Hugging Face stores symbolic refs like main separately from the actual immutable snapshots. In the cache, refs/ holds mappings such as main -> <commit>, while snapshots/<commit>/... holds the actual file tree for that exact revision. So a moving ref like main is not the same thing as a concrete cached snapshot. (Hugging Face)
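
To make the layout concrete, here is a minimal sketch that builds a toy directory tree shaped like the cache and resolves a revision against it. It is only an illustration: the repo folder name, the helper names build_fake_cache and resolve_snapshot, and the fake 40-character commit are all made up, and the real cache also keeps file contents under blobs/ with links from snapshots/, which is left out here.

```python
import tempfile
from pathlib import Path
from typing import Optional


def build_fake_cache(root: Path, commit: str) -> Path:
    """Create a tiny tree shaped like the Hub cache for one (hypothetical) repo."""
    repo = root / "models--demo--model"  # made-up repo folder name
    (repo / "refs").mkdir(parents=True)
    # refs/main is a small text file holding the commit that "main" points to.
    (repo / "refs" / "main").write_text(commit)
    snapshot = repo / "snapshots" / commit
    snapshot.mkdir(parents=True)
    (snapshot / "config.json").write_text("{}")
    return repo


def resolve_snapshot(repo: Path, revision: str) -> Optional[Path]:
    """Map a revision (ref name or commit) to a cached snapshot directory."""
    ref_file = repo / "refs" / revision
    # A symbolic ref is translated to a commit; anything else is treated as a commit.
    commit = ref_file.read_text().strip() if ref_file.is_file() else revision
    snapshot = repo / "snapshots" / commit
    return snapshot if snapshot.is_dir() else None


FAKE_COMMIT = "a" * 40  # placeholder, not a real commit

with tempfile.TemporaryDirectory() as tmp:
    repo = build_fake_cache(Path(tmp), FAKE_COMMIT)
    via_ref = resolve_snapshot(repo, "main")          # needs the refs/ indirection
    via_commit = resolve_snapshot(repo, FAKE_COMMIT)  # goes straight to the snapshot
    same_snapshot = via_ref == via_commit
    missing = resolve_snapshot(repo, "dev")           # no such ref or snapshot
```

Both lookups land on the same snapshot directory, but only the commit-based one avoids depending on what refs/main currently says.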

That is why revision alone is not always enough once loading has started. If Transformers already knows the exact commit, _commit_hash lets it target the precise cached snapshot instead of re-resolving a symbolic ref. (GitHub)

What from_pretrained is doing

Your reading of the code is correct.

from_pretrained first tries to resolve the config file specifically to learn the commit hash “as soon as possible”, then extracts that hash and passes it along as _commit_hash in later calls. (GitHub)
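
As a rough idea of how a hash can be recovered from a resolved file, the sketch below pulls the commit out of a path of the form .../snapshots/<commit>/<file>. This is an approximation written for this answer, not a copy of the library helper: the name sketch_extract_commit_hash, the regexes, and the example path are assumptions, and the real extract_commit_hash may differ in detail.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption for this sketch: a commit hash is 40 lowercase hex characters.
COMMIT_HASH_RE = re.compile(r"^[0-9a-f]{40}$")


def sketch_extract_commit_hash(
    resolved_file: Optional[str], commit_hash: Optional[str] = None
) -> Optional[str]:
    """Return a known hash unchanged, else try to read it from the cache path."""
    if resolved_file is None or commit_hash is not None:
        return commit_hash
    posix_path = Path(resolved_file).as_posix()
    match = re.search(r"snapshots/([^/]+)/", posix_path)
    if match is None:
        return None  # e.g. a plain local folder with no snapshot component
    candidate = match.group(1)
    return candidate if COMMIT_HASH_RE.match(candidate) else None


# Hypothetical cached path; the hash is a placeholder.
example_path = "/cache/models--demo--model/snapshots/" + "b" * 40 + "/config.json"
found = sketch_extract_commit_hash(example_path)
from_local_dir = sketch_extract_commit_hash("/tmp/my_model/config.json")
```

A file loaded from an ordinary local directory yields no hash, which matches the comment in the source that the config file "may be absent" and the hash may stay None.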

That exact commit is also preserved in config-loading code. If _commit_hash is found in the loaded config, Transformers keeps propagating it instead of discarding it. (GitHub)

So _commit_hash is not a useless internal leftover. It is deliberate state that gets threaded through the loading pipeline. (GitHub)

What _commit_hash is actually used for

Inside cached_file / cached_files, _commit_hash is primarily used for exact cache lookup.

The docstring says it is passed when chaining several file loads and that, if files are already cached for that commit hash, Transformers can “avoid calls to head and get from the cache.” The implementation then checks try_to_load_from_cache(... revision=_commit_hash ...) before doing any remote download. (GitHub)

So the main purposes are:

  1. exact local cache resolution
  2. better offline/cache behavior (a cache hit for the known commit needs no network call)
  3. provenance propagation across config/tokenizer/model loading steps (GitHub)
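
The lookup order described above can be sketched with a toy model. Everything here is hypothetical: sketch_cached_file, the dictionary standing in for the cache, and fake_download are stand-ins for illustration, not the Transformers or huggingface_hub implementation.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Toy cache: (commit, filename) -> local path. Stands in for snapshots/<commit>/.
Cache = Dict[Tuple[str, str], str]


def sketch_cached_file(
    filename: str,
    revision: str,
    cache: Cache,
    download: Callable[[str, str], str],
    commit_hash: Optional[str] = None,
) -> str:
    """Check the exact-commit cache first, then fall back to a download."""
    if commit_hash is not None:
        hit = cache.get((commit_hash, filename))
        if hit is not None:
            return hit  # fast path: no remote call at all
    # Fallback: the remote call is given the original revision, not the hash.
    return download(filename, revision)


calls: List[Tuple[str, str]] = []


def fake_download(filename: str, revision: str) -> str:
    """Record what the 'remote' was asked for and return a made-up path."""
    calls.append((filename, revision))
    return f"/downloaded/{revision}/{filename}"


commit = "c" * 40  # placeholder hash
cache: Cache = {(commit, "config.json"): f"/cache/snapshots/{commit}/config.json"}

hit_path = sketch_cached_file("config.json", "main", cache, fake_download, commit_hash=commit)
miss_path = sketch_cached_file("model.safetensors", "main", cache, fake_download, commit_hash=commit)
```

After both calls, calls contains a single entry for model.safetensors with revision "main": the cached file never touched the network, and the uncached one was requested by ref rather than by commit.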

Does it pin later remote downloads too?

This is the subtle part.

In the normal single-file cached_file(...) path:

No, not directly.

If the file is not already found via the _commit_hash cache fast path, Transformers falls back to hf_hub_download(... revision=revision ...). In other words, the actual remote call still uses the original revision, not _commit_hash. (GitHub)

So your “it locks all later remote fetches to the same commit” interpretation is too strong for that path.

In the multi-file snapshot_download(...) path:

Yes, effectively.

snapshot_download resolves the requested revision once, gets repo_info.sha as commit_hash, stores the refs/<revision> mapping if needed, and then works from that resolved commit snapshot. (GitHub)

So the “resolve once, then pin to one exact commit” behavior is real, but it is most clearly implemented in snapshot_download, not in every single-file hf_hub_download fallback from cached_file. (GitHub)
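
The difference between the two strategies can be illustrated with a toy simulation. FakeHub, load_files, and the sha_v1/sha_v2 names are invented for this sketch and do not model the real Hub API; the point is only to show what "resolve once" buys when a branch moves between two file fetches.

```python
from typing import Dict, List


class FakeHub:
    """Toy remote: two commits and a movable 'main' ref (all names made up)."""

    def __init__(self) -> None:
        self.commits: Dict[str, Dict[str, str]] = {
            "sha_v1": {"config.json": "v1", "weights.bin": "v1"},
            "sha_v2": {"config.json": "v2", "weights.bin": "v2"},
        }
        self.refs: Dict[str, str] = {"main": "sha_v1"}

    def resolve(self, revision: str) -> str:
        # A ref name maps to its current commit; a commit maps to itself.
        return self.refs.get(revision, revision)

    def download(self, filename: str, revision: str) -> str:
        return self.commits[self.resolve(revision)][filename]


def load_files(hub: FakeHub, files: List[str], revision: str, resolve_once: bool) -> Dict[str, str]:
    """Fetch files one by one; 'main' moves to sha_v2 after the first fetch."""
    target = hub.resolve(revision) if resolve_once else revision
    result: Dict[str, str] = {}
    for index, name in enumerate(files):
        result[name] = hub.download(name, target)
        if index == 0:
            hub.refs["main"] = "sha_v2"  # simulated push in the middle of the load
    return result


FILES = ["config.json", "weights.bin"]
by_ref = load_files(FakeHub(), FILES, "main", resolve_once=False)
pinned = load_files(FakeHub(), FILES, "main", resolve_once=True)
```

Fetching each file by ref ends up mixing the two commits, while resolving the ref to a commit first keeps both files on the original one.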

Direct answers to your questions

1. If _commit_hash is mainly for local cache resolution, couldn’t revision handle that too?

Not as well.

revision="main" is only a symbolic name. The cache ultimately needs an exact snapshot directory keyed by commit hash. Once Transformers already knows the resolved commit, _commit_hash is the stricter and more useful key. (Hugging Face)

2. Does huggingface_hub use this commit hash to lock the remote fetch?

Not in the general single-file hf_hub_download path used by cached_file. That fallback still passes revision=revision. (GitHub)

Yes, in snapshot_download. That code resolves the revision to an exact SHA and then uses that resolved snapshot. (GitHub)

3. Does _commit_hash serve other purposes?

Yes.

Besides exact cache lookup, it also serves as provenance metadata that gets propagated through config loading and later file resolution. That way later steps know not just “the user asked for main”, but “this load was actually resolved from commit X”. (GitHub)

Bottom line

The most accurate summary is:

revision is the user-facing ref. _commit_hash is the exact resolved commit.
Transformers uses _commit_hash mainly to make chained loads more deterministic by targeting the exact cached snapshot and carrying forward the resolved provenance.
It is not just redundant with revision, but it is also not a universal remote-download lock in every from_pretrained code path. That stronger “pin everything to one commit” behavior is most clearly provided by snapshot_download. (GitHub)

If you want a one-line version:

_commit_hash is mostly an internal exact-snapshot key, while revision is the public ref you asked for.