Re-cooking imatrix and quants with updated ik/llama.cpp PR

#1
by ubergarm - opened
Owner • edited Jan 21

Given the implementation was using the wrong gating function, I'm gonna re-compute the imatrix and re-cook the quants for optimal quality.

The MXFP4 quant did not use imatrix, so it is fine if you already downloaded that one; it does not need to be updated.

See these two PRs for reference:

Please pardon my dust, I'll try to get new perplexity data updated soon.

Cheers!

I had already downloaded it, but I'll wait again. It's a very good model.

@shewin

Nice! I enjoy your demonstrations! It should be better now after you pull the latest code and use an updated (or non-imatrix) quant!

Okay, updated all available quants now! I'll do perplexity testing tomorrow! Cheers!

Won't the MXFP4 need to be refreshed also, given this comment:
https://github.com/ggml-org/llama.cpp/pull/18980#issuecomment-3777184589
where the scoring_func is updated in the model config?

@noctrex

No need to update yours given you didn't use imatrix.

If you look at the code of the llama.cpp PR it will detect GLM-4.7-Flash and automatically do the needful:

```cpp
// GLM 4.7 Lite
hparams.expert_gating_func = LLAMA_EXPERT_GATING_FUNC_TYPE_SIGMOID;
```
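To see why the gating function matters, here is a toy numpy sketch (hypothetical helper names, not the actual llama.cpp routing code) contrasting softmax-style and sigmoid-style expert gating: even when both select the same top-k experts, the mixing weights they produce differ, which changes the MoE output.

```python
import numpy as np

def softmax_gating(scores, top_k):
    """Softmax over all expert scores, then keep the top-k (renormalized)."""
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    idx = np.argsort(probs)[-top_k:]
    return idx, probs[idx] / probs[idx].sum()

def sigmoid_gating(scores, top_k):
    """Sigmoid scores each expert independently, then keep the top-k (renormalized)."""
    probs = 1.0 / (1.0 + np.exp(-scores))
    idx = np.argsort(probs)[-top_k:]
    return idx, probs[idx] / probs[idx].sum()

scores = np.array([2.0, 1.0, 0.5, -1.0])
print(softmax_gating(scores, 2))  # same experts selected...
print(sigmoid_gating(scores, 2))  # ...but different mixing weights
```

Running an imatrix or inference pass with the wrong choice here means the expert activations are weighted incorrectly, which is why the imatrix data had to be re-computed.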

What Daniel is doing in the linked comment is just adding it to the safetensors config.json file explicitly. Not sure if they then re-did their original BF16 GGUF as well to add that single line of additional metadata. Using either ik_llama.cpp or llama.cpp with the patch will detect that it is missing and properly set it.

For example, running on ik_llama.cpp now it starts up and says it explicitly:

```
================= Missing experts gating function -> set to sigmoid
```

Cheers!

Yes, it seems that they redid all the quants, also FP16. Don't know if it will make any difference now that llama.cpp is also patched. Maybe it's intended for users that have older versions of the program.

Owner • edited Jan 21

also FP16

Do you have a link? One must be careful not to conflate bf16 and fp16, as they have very different dynamic ranges and precisions. I asked them about it once, as it seemed like they were risking clipping of weights by using fp16 for some quants when the original weights are bf16: https://huggingface.co/unsloth/Olmo-3.1-32B-Instruct-GGUF/discussions/1 If you run the original weights through a script and make sure nothing exceeds the dynamic range of fp16 (so no clipping), then it would be okay.
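The kind of range check described above can be sketched in a few lines of numpy (a minimal illustration; checking a real model would mean iterating over every tensor via a safetensors or GGUF reader). bf16 shares float32's exponent range, so values that bf16 stores fine can overflow to inf when cast to fp16, whose max finite value is 65504:

```python
import numpy as np

FP16_MAX = np.finfo(np.float16).max  # 65504.0

def would_clip_in_fp16(tensor):
    """True if any value's magnitude exceeds fp16's representable range."""
    return bool(np.abs(np.asarray(tensor)).max() > FP16_MAX)

# A value like 7e4 fits in bf16 but saturates/overflows in fp16:
print(would_clip_in_fp16(np.array([1.5, -3.0e4], dtype=np.float32)))  # False
print(would_clip_in_fp16(np.array([1.5, 7.0e4], dtype=np.float32)))   # True
```

If a scan like this comes back clean across all tensors, downcasting the bf16 originals to fp16 loses no range (though precision characteristics still differ).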

Don't know if it will make any difference now that also llama.cpp is patched. Maybe intended for users that have older versions of the program.

Yes, that would properly use the sigmoid gating function by making it explicit in the model metadata, for running on older un-patched versions and on downstream projects like ollama, LM Studio, koboldcpp, etc. until they pick up the changes.

Sorry, the keyboard devil struck me; it's BF16 they updated: https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF/tree/main/BF16

@noctrex

No need to update yours given you didn't use imatrix.

@ubergarm is your MXFP4 quant with attention layers bumped to Q8 compatible with current llama.cpp, or is it for ik_llama.cpp?

Owner • edited Feb 6

@engrtipusultan

Mine works well with both ik and mainline llama.cpp, and sees a very slight benefit in PPL from bumping the extra couple of tensors to q8_0.

That said, looking at KLD, I'd suggest trying to run one of the non-MXFP4 versions... curious if you have any strong preferences though!

No preference, but GLM Flash on llama.cpp is much slower than other A3Bs. I wanted to try a 4-bit version, and apparently, based on the following thread, the MXFP4 of this model has better perplexity and KLD compared to any other Q4 quant.

https://huggingface.co/noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF/discussions/1#6978c43bb6cc295f9bf970e0

Do you have any other statistics showing that MXFP4 KLD is worse than other quants?

I also thought that MXFP4 KLD would be worse than Q4_K_M, but the testing in the thread says otherwise.

https://github.com/Thireus/GGUF-Tool-Suite/issues/52#issuecomment-3795175551

Your tests show a totally different story. I am confused, to be honest.

Does it have something to do with main llama.cpp vs ik llama.cpp?

Owner

Does it have something to do with main llama.cpp vs ik llama.cpp?

No, it is just confusing haha... In general both mainline and ik have mostly similar perplexity values for the same quants...

Perplexity and KLD, while easy to measure, don't always give the full story in terms of quality. Longer benchmarks take more time, though. For many quants (especially non-instruct), perplexity and KLD tend to be correlated with each other and with total quant size.
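For reference, toy definitions of the two metrics being discussed (simplified; the actual llama-perplexity tool averages over a whole corpus and full vocabulary distributions): perplexity is the exponentiated mean negative log-likelihood of the observed tokens, and KLD compares the quantized model's next-token distribution against the full-precision one.

```python
import numpy as np

def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the observed tokens."""
    return float(np.exp(-np.mean(np.log(token_probs))))

def kl_divergence(p, q):
    """KL(p || q) between two next-token probability distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# probs the model assigned to each "correct" next token:
print(perplexity([0.5, 0.25, 0.125]))  # ≈ 4.0
# full-precision vs quantized next-token distributions:
print(kl_divergence([0.7, 0.2, 0.1], [0.6, 0.25, 0.15]))  # > 0
```

Note the asymmetry: perplexity only looks at the probability of the reference token, while KLD compares whole distributions, which is one reason the two can rank quants differently.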

I'm not sure why MXFP4 is so popular; the devs have mentioned in the PRs adding it to both mainline and ik that it is mostly for gpt-oss, whose QAT targeted that specific quant... For most models there isn't any reason to choose it over other similar-sized quantization types imho...

another reference: https://github.com/ggml-org/llama.cpp/pull/18953#issuecomment-3779925873

I don't have a single MXFP4 on my system other than GPT-OSS 20B. The confusing part is the comment I shared above, where the author states that for GLM Flash not only perplexity but also KL divergence is better for MXFP4 compared to Q4_K_M, whereas the same model tested by you shows inferior KL divergence compared to other Q4 quants.

I use the Vulkan backend; I do not have any other option, plus my unified memory is DDR4. BF16 is not natively supported, plus IQ quants are slower too.
So my objective currently is to find 4-bit quants that don't hit memory bandwidth as hard and have better speed and accuracy on my setup.

Owner • edited Feb 7

@engrtipusultan

Whereas the same model tested by you for KL divergence is showing inferior KL divergence as compared to other Q4 quants.

The results can vary depending on the exact corpus used for measurement as well. The best use of PPL and KLD is to compare quantized versions of the exact same model on the exact same hardware using the exact same corpus and the exact same imatrix corpus, etc. It gives a relative view of how the quantized models are doing compared to the baseline bf16 (or whatever the original "full quality" is).

I honestly didn't have time to look at the exact details in the link you gave me; I'm trying to catch up on some newly released models, so sorry I can't discuss it better at the moment.

It is interesting though if they are suggesting both PPL and all KLD metrics are "better" for their MXFP4 but I didn't have time to understand their exact procedures etc.

If only we had a simple graph or visual... I try to release a perplexity graph for most of my quant collections. And AesSedai and others have released interesting visualizations of KLD metrics vs size etc.

I use vulkan backend I do not have any other option plus unified memory is DDR4. Bf16 is not natively supported plus iq quants too are slower.

If you're using Vulkan, you can use ik_llama.cpp with something like Q4_0 or Q4_K, which will run on your GPU. If you're doing MoE, you can use the newer, better stuff for CPU-routed exps like IQ4_KSS etc. This would likely require you to roll custom quants. Rarely is there ever a need for bf16 imo (I don't think leaving bf16 on token embed or output (head) helps noticeably, personally).

Also, I suggest benchmarking with llama-sweep-bench; I have a branch with that for mainline here: https://github.com/ubergarm/llama.cpp/tree/ug/port-sweep-bench

What is your exact rig (DDR, GPU/VRAM, and CPU)? And what kinds of models do you want to run? The new step-3.5-flash is interesting; I'm working on it now.

I haven't released all these, but for example this is how I use perplexity. I didn't try MXFP4 tho
[perplexity graph: ppl-Step-3.5]

Thank you for your detailed responses. I will look further into it.
