
UD-Q4_K_XL will not load on latest llama.cpp ("master" branch)

#2
by evilJazz - opened

The error I am getting is:

load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.1.ffn_down_exps.weight' has wrong shape; expected   2688,   4096,    512, got   2688,   1024,    512,      1
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/root/.cache/llama.cpp/unsloth_NVIDIA-Nemotron-3-Super-120B-A12B-GGUF_UD-Q4_K_XL_NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q4_K_XL-00001-of-00003.gguf'
srv    load_model: failed to load model, '/root/.cache/llama.cpp/unsloth_NVIDIA-Nemotron-3-Super-120B-A12B-GGUF_UD-Q4_K_XL_NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q4_K_XL-00001-of-00003.gguf'
srv    operator(): operator(): cleaning up before exit...
main: exiting due to model loading error

What am I doing wrong? This is running in a Docker container with llama-swap, config is:

  "unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:UD-Q4_K_XL":
    name: "Nvidia Nemotron-3-Super 120B - 512K UD-Q4_K_XL 83.8GB"
    ttl: 0
    aliases:
      - "nemotron3:120b"
    cmd: |
      ${llama}
        -ctk f16 -ctv f16
        -c 524288
        -b 2048
        --temp 0.6 --top-p 0.95
        -hf unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:UD-Q4_K_XL

Just tried again with MXFP4_MOE; same issue. This is on 5x AMD Instinct MI60 32 GB using the HIP / ROCm backend.

load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.1.ffn_down_exps.weight' has wrong shape; expected   2688,   4096,    512, got   2688,   1024,    512,      1
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/root/.cache/llama.cpp/unsloth_NVIDIA-Nemotron-3-Super-120B-A12B-GGUF_MXFP4_MOE_NVIDIA-Nemotron-3-Super-120B-A12B-MXFP4_MOE-00001-of-00003.gguf'
srv    load_model: failed to load model, '/root/.cache/llama.cpp/unsloth_NVIDIA-Nemotron-3-Super-120B-A12B-GGUF_MXFP4_MOE_NVIDIA-Nemotron-3-Super-120B-A12B-MXFP4_MOE-00001-of-00003.gguf'
srv    operator(): operator(): cleaning up before exit...
main: exiting due to model loading error

Yeah, looks like I missed exactly that commit. Recompiling now. My bad.
Update: Works! Thank you!

Unsloth AI org

Nice! Yep llama.cpp now has Nemotron support!
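If you build from source, one quick way to check whether your checkout already contains a fix is `git merge-base --is-ancestor`. A minimal sketch (the `<commit-hash>` is a placeholder for whatever support commit you are looking for, not a real hash from this thread):

```shell
# In your llama.cpp checkout: check whether a given commit is an ancestor of HEAD.
# <commit-hash> is a placeholder; substitute the hash of the commit you need.
cd llama.cpp
git fetch origin
if git merge-base --is-ancestor <commit-hash> HEAD; then
    echo "commit is included in this build"
else
    echo "commit is missing; pull master and rebuild"
fi
```

`git merge-base --is-ancestor` exits 0 when the commit is reachable from HEAD, so it works cleanly in scripts and CI checks.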

Getting the same error with the latest llama-b8287-bin-win-cuda-12.4-x64.zip.

PS E:\llms> llama-server --model .\NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q8_K_XL-00001-of-00004.gguf --host 0.0.0.0 --port 1234 --jinja --ctx-size 50000 --flash-attn on --fit on

print_info: max token length = 150
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.1.ffn_down_exps.weight' has wrong shape; expected 2688, 4096, 512, got 2688, 1024, 512, 1
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '.\NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q8_K_XL-00001-of-00004.gguf'
srv load_model: failed to load model, '.\NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q8_K_XL-00001-of-00004.gguf'
srv operator(): operator(): cleaning up before exit...
main: exiting due to model loading error

It looks like the llama.cpp release pipeline can't keep up with the pace of changes: the latest prebuilt release only contains commits up to yesterday. The only options are to wait for a new release or compile it yourself.
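For anyone who wants to compile it themselves, a minimal build-from-source sketch. The backend flag is an assumption about your hardware: `-DGGML_CUDA=ON` for NVIDIA, `-DGGML_HIP=ON` for ROCm, or neither for CPU-only; check the llama.cpp build docs for your setup.

```shell
# Clone and build llama.cpp from master (sketch; backend flag depends on your GPU).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON        # use -DGGML_HIP=ON for ROCm, omit for CPU-only
cmake --build build --config Release -j
./build/bin/llama-server --version    # confirm you are running the fresh binary
```

Running `llama-server --version` afterwards is a quick sanity check that you are actually launching the freshly built binary and not an older one from your PATH.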

Same issue; I assume we just have to wait.

Yep, same here with b8292.

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 131072 MiB):
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 131072 MiB (124390 MiB free)
load_backend: loaded ROCm backend from /home/flo/llama/last/llama-b8292/libggml-hip.so
load_backend: loaded RPC backend from /home/flo/llama/last/llama-b8292/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from /home/flo/llama/last/llama-b8292/libggml-vulkan.so
load_backend: loaded CPU backend from /home/flo/llama/last/llama-b8292/libggml-cpu-zen4.so

Loading model...
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.1.ffn_down_exps.weight' has wrong shape; expected 2688, 4096, 512, got 2688, 1024, 512, 1
llama_model_load_from_file_impl: failed to load model
llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.1.ffn_down_exps.weight' has wrong shape; expected 2688, 4096, 512, got 2688, 1024, 512, 1

I just want to say I pulled the master branch of llama.cpp today (git hash 88915cb55c), rebuilt it, and it works now.