I’ve been digging into the source code of the transformers library and stumbled upon a detail regarding how files are fetched and cached that I’m hoping someone can clarify.
Specifically, I am trying to understand the exact role of the commit_hash value (passed around as the private _commit_hash keyword argument) within the PreTrainedModel.from_pretrained method, and how it differs from revision.
My initial research led me to believe that commit_hash is used to pin successive file downloads to a single commit. That would prevent a race condition where a branch (like main) is updated halfway through downloading a multi-file model, leaving a mix of files from two different commits.
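
To make that assumption concrete, here is a toy simulation of the race I have in mind. It is not transformers or huggingface_hub code: FakeHub and its resolve, fetch, and push methods are names I made up purely for illustration.

```python
# Toy model of a Hub repo. FakeHub, resolve, fetch, and push are hypothetical
# names for illustration; none of this is the real huggingface_hub API.

class FakeHub:
    def __init__(self):
        # commit hash -> {filename: content}
        self.commits = {"aaa111": {"config.json": "cfg-v1", "model.bin": "weights-v1"}}
        # branch name -> commit hash it currently points to
        self.branches = {"main": "aaa111"}

    def resolve(self, revision):
        """Turn a branch name into a commit hash (a commit hash maps to itself)."""
        return self.branches.get(revision, revision)

    def fetch(self, filename, revision):
        """Return a file's content at the given branch or commit."""
        return self.commits[self.resolve(revision)][filename]

    def push(self, branch, commit_hash, files):
        """Simulate someone updating the branch between two of our downloads."""
        self.commits[commit_hash] = files
        self.branches[branch] = commit_hash


def load_unpinned(hub, revision, update_midway):
    config = hub.fetch("config.json", revision)
    update_midway()  # the branch moves here
    weights = hub.fetch("model.bin", revision)
    return config, weights


def load_pinned(hub, revision, update_midway):
    commit_hash = hub.resolve(revision)  # resolve once, up front
    config = hub.fetch("config.json", commit_hash)
    update_midway()  # the branch moves here, but we no longer follow it
    weights = hub.fetch("model.bin", commit_hash)
    return config, weights


def make_hub_and_update():
    hub = FakeHub()

    def update():
        hub.push("main", "bbb222", {"config.json": "cfg-v2", "model.bin": "weights-v2"})

    return hub, update


hub, update = make_hub_and_update()
unpinned = load_unpinned(hub, "main", update)

hub, update = make_hub_and_update()
pinned = load_pinned(hub, "main", update)

print(unpinned)  # ('cfg-v1', 'weights-v2')
print(pinned)    # ('cfg-v1', 'weights-v1')
```

The unpinned loader ends up with a config from one commit and weights from another, which is exactly the mismatch I assumed commit_hash was meant to prevent.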
The code seems to support this pinning theory. First, from_pretrained tries to obtain the commit hash of the current revision as early as possible by resolving the config file:

    if commit_hash is None:
        if not isinstance(config, PretrainedConfig):
            # We make a call to the config file first (which may be absent) to get the commit hash as soon as possible
            resolved_config_file = cached_file(
                pretrained_model_name_or_path,
                CONFIG_NAME,
                # ... [other args omitted for brevity] ...
                revision=revision,
            )
            commit_hash = extract_commit_hash(resolved_config_file, commit_hash)
        else:
            commit_hash = getattr(config, "_commit_hash", None)

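
As far as I can tell, the hash is recovered from the resolved file path itself, since cached files live under a snapshots/<commit hash>/ directory. Below is a rough sketch of that idea, not the library's own function: guess_commit_hash and COMMIT_HASH_PATTERN are my own hypothetical names, and the 40-character hex format and the example path are assumptions on my part.

```python
import re
from pathlib import Path

# Assumption: a full commit hash is 40 lowercase hex characters.
COMMIT_HASH_PATTERN = re.compile(r"^[0-9a-f]{40}$")


def guess_commit_hash(resolved_file, commit_hash=None):
    """Hypothetical helper (not the library function): recover the commit hash
    from a resolved cache path of the form .../snapshots/<hash>/<filename>."""
    # Nothing to inspect, or the hash is already known: return what we have.
    if resolved_file is None or commit_hash is not None:
        return commit_hash
    match = re.search(r"snapshots/([^/]+)/", Path(resolved_file).as_posix())
    if match is None:
        return None  # e.g. a plain local directory, not a Hub cache path
    candidate = match.group(1)
    return candidate if COMMIT_HASH_PATTERN.match(candidate) else None


# Made-up hash and path, only to exercise the helper.
example_hash = "0123456789abcdef0123456789abcdef01234567"
cached_path = (
    "/home/me/.cache/huggingface/hub/models--org--name/"
    f"snapshots/{example_hash}/config.json"
)

print(guess_commit_hash(cached_path))
print(guess_commit_hash("./my_local_model/config.json"))
print(guess_commit_hash(None))
```

If that reading is right, the hash only becomes known after the config file has been resolved once, which is why the config lookup happens first.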
This commit_hash is then passed down into cached_file_kwargs for subsequent loading code (like fetching the actual model weights):

    cached_file_kwargs = {
        # ... [other args] ...
        "revision": revision,
        "_commit_hash": commit_hash,
    }
    resolved_archive_file = cached_file(pretrained_model_name_or_path, filename, **cached_file_kwargs)

Here is my confusion: when I looked inside the cached_file function itself, I noticed that _commit_hash appears to be used only for local cache checks. If a download from the Hub is actually triggered, it still seems to rely on the revision argument. Other loading functions also don't seem to use the identified commit_hash for the remote fetch.
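
Here is a toy sketch of the control flow I think I'm seeing. It reflects my reading only and may well be wrong: toy_cached_file, local_cache, and fake_download are hypothetical names, not the real transformers or huggingface_hub API.

```python
# Toy sketch of my reading of the flow. Hypothetical names only; this is
# not the actual cached_file implementation.

def toy_cached_file(filename, revision, commit_hash, local_cache, download):
    # Step 1: the known commit hash is only consulted for a local lookup.
    if commit_hash is not None:
        hit = local_cache.get((commit_hash, filename))
        if hit is not None:
            return hit
    # Step 2: on a miss, the remote fetch is keyed by `revision`,
    # not by `commit_hash`.
    return download(filename, revision)


requested = []  # records which revision each remote fetch was made with


def fake_download(filename, revision):
    requested.append(revision)
    return f"remote:{revision}/{filename}"


local_cache = {("aaa111", "config.json"): "local:aaa111/config.json"}

hit = toy_cached_file("config.json", "main", "aaa111", local_cache, fake_download)
miss = toy_cached_file("model.safetensors", "main", "aaa111", local_cache, fake_download)

print(hit)        # local:aaa111/config.json
print(miss)       # remote:main/model.safetensors
print(requested)  # ['main']
```

In this sketch the cache miss goes out with "main" rather than "aaa111", so a branch update between the two calls would not be guarded against. That is the behavior I'm asking about.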
My questions:
- If commit_hash is primarily used just for local cache resolution, couldn't the revision argument handle that on its own?
- Does the underlying huggingface_hub download logic actually use this commit_hash to lock the remote fetch to that specific commit, or is my assumption about preventing mid-download revision changes incorrect?
- Does commit_hash serve any other purposes?
Thank you in advance.