
UD-Q4_K_XL will not load on latest llama.cpp ("master" branch)

#2
by evilJazz - opened

The error I am getting is:

load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.1.ffn_down_exps.weight' has wrong shape; expected   2688,   4096,    512, got   2688,   1024,    512,      1
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/root/.cache/llama.cpp/unsloth_NVIDIA-Nemotron-3-Super-120B-A12B-GGUF_UD-Q4_K_XL_NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q4_K_XL-00001-of-00003.gguf'
srv    load_model: failed to load model, '/root/.cache/llama.cpp/unsloth_NVIDIA-Nemotron-3-Super-120B-A12B-GGUF_UD-Q4_K_XL_NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q4_K_XL-00001-of-00003.gguf'
srv    operator(): operator(): cleaning up before exit...
main: exiting due to model loading error

What am I doing wrong? This is running in a Docker container with llama-swap, config is:

  "unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:UD-Q4_K_XL":
    name: "Nvidia Nemotron-3-Super 120B - 512K UD-Q4_K_XL 83.8GB"
    ttl: 0
    aliases:
      - "nemotron3:120b"
    cmd: |
      ${llama}
        -ctk f16 -ctv f16
        -c 524288
        -b 2048
        --temp 0.6 --top-p 0.95
        -hf unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:UD-Q4_K_XL

Just tried again with MXFP4_MOE; same issue. This is on 5x AMD Instinct MI60 32 GB using the HIP / ROCm backend.

load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.1.ffn_down_exps.weight' has wrong shape; expected   2688,   4096,    512, got   2688,   1024,    512,      1
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/root/.cache/llama.cpp/unsloth_NVIDIA-Nemotron-3-Super-120B-A12B-GGUF_MXFP4_MOE_NVIDIA-Nemotron-3-Super-120B-A12B-MXFP4_MOE-00001-of-00003.gguf'
srv    load_model: failed to load model, '/root/.cache/llama.cpp/unsloth_NVIDIA-Nemotron-3-Super-120B-A12B-GGUF_MXFP4_MOE_NVIDIA-Nemotron-3-Super-120B-A12B-MXFP4_MOE-00001-of-00003.gguf'
srv    operator(): operator(): cleaning up before exit...
main: exiting due to model loading error

Yeah, looks like I missed exactly that commit. Recompiling now. My bad.
Update: Works! Thank you!

Unsloth AI org

Nice! Yep llama.cpp now has Nemotron support!
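If you build from source, one quick way to check whether your checkout already contains a fix is `git merge-base --is-ancestor`. A minimal sketch (the `<commit-hash>` is a placeholder for whatever support commit you are looking for, not a real hash from this thread):

```shell
# In your llama.cpp checkout: check whether a given commit is an ancestor of HEAD.
# <commit-hash> is a placeholder; substitute the hash of the commit you need.
cd llama.cpp
git fetch origin
if git merge-base --is-ancestor <commit-hash> HEAD; then
    echo "commit is included in this build"
else
    echo "commit is missing; pull master and rebuild"
fi
```

`git merge-base --is-ancestor` exits 0 when the commit is reachable from HEAD, so it works cleanly in scripts and CI checks.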

Getting the same error with the latest llama-b8287-bin-win-cuda-12.4-x64.zip.

PS E:\llms> llama-server --model .\NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q8_K_XL-00001-of-00004.gguf --host 0.0.0.0 --port 1234 --jinja --ctx-size 50000 --flash-attn on --fit on

print_info: max token length = 150
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.1.ffn_down_exps.weight' has wrong shape; expected 2688, 4096, 512, got 2688, 1024, 512, 1
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '.\NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q8_K_XL-00001-of-00004.gguf'
srv load_model: failed to load model, '.\NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q8_K_XL-00001-of-00004.gguf'
srv operator(): operator(): cleaning up before exit...
main: exiting due to model loading error

It looks like the llama.cpp release pipeline can't keep up with the pace of changes: the latest prebuilt release only contains commits up to yesterday. The only options are to wait for a new release or compile it yourself.
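For anyone who wants to compile it themselves, a minimal build-from-source sketch. The backend flag is an assumption about your hardware: `-DGGML_CUDA=ON` for NVIDIA, `-DGGML_HIP=ON` for ROCm, or neither for CPU-only; check the llama.cpp build docs for your setup.

```shell
# Clone and build llama.cpp from master (sketch; backend flag depends on your GPU).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON        # use -DGGML_HIP=ON for ROCm, omit for CPU-only
cmake --build build --config Release -j
./build/bin/llama-server --version    # confirm you are running the fresh binary
```

Running `llama-server --version` afterwards is a quick sanity check that you are actually launching the freshly built binary and not an older one from your PATH.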

Same issue; I assume we just have to wait.

Yep, same here with b8292.

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 131072 MiB):
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 131072 MiB (124390 MiB free)
load_backend: loaded ROCm backend from /home/flo/llama/last/llama-b8292/libggml-hip.so
load_backend: loaded RPC backend from /home/flo/llama/last/llama-b8292/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from /home/flo/llama/last/llama-b8292/libggml-vulkan.so
load_backend: loaded CPU backend from /home/flo/llama/last/llama-b8292/libggml-cpu-zen4.so

Loading model...
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.1.ffn_down_exps.weight' has wrong shape; expected 2688, 4096, 512, got 2688, 1024, 512, 1
llama_model_load_from_file_impl: failed to load model
llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.1.ffn_down_exps.weight' has wrong shape; expected 2688, 4096, 512, got 2688, 1024, 512, 1

I just want to say I pulled the master branch of llama.cpp today (git hash 88915cb55c), rebuilt it, and it works now.