DevQuasar


AI & ML interests

Open-Source LLMs, Local AI Projects: https://pypi.org/project/llm-predictive-router/

Recent Activity

csabakecskemeti
posted an update 2 days ago
csabakecskemeti
posted an update 4 days ago
Looking for some help to test an INT8 DeepSeek 3.2:
SGLang supports channel-wise INT8 quants on CPUs with AMX instructions (5th-gen Xeon and above, AFAIK)
https://lmsys.org/blog/2025-07-14-intel-xeon-optimization/

Currently uploading an INT8 version of DeepSeek 3.2 Speciale:
DevQuasar/deepseek-ai.DeepSeek-V3.2-Speciale-Channel-INT8

I cannot test this myself since I'm on AMD:
"AssertionError: W8A8Int8LinearMethod on CPU requires that CPU has AMX support"
(I assumed it could fall back to some non-optimized kernel, but apparently not.)

If anyone with the required resources (Intel Xeon Gen 5/6 + ~768 GB-1 TB RAM) can help test this, that would be awesome.

If you have hints on how to make this work on an AMD Threadripper 7000 Pro series, please guide me.
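
For anyone offering to help: a quick way to check up front whether a given box would pass that assertion is to look for the AMX flags the kernel exposes (a minimal Linux-only sketch; amx_tile/amx_int8 are the standard /proc/cpuinfo flag names):

```python
# Minimal AMX probe (Linux-only): SGLang's W8A8 INT8 CPU path needs AMX,
# which the kernel advertises via amx_tile / amx_int8 / amx_bf16 flags.
def has_amx() -> bool:
    with open("/proc/cpuinfo") as f:
        flags = f.read()
    return "amx_tile" in flags and "amx_int8" in flags

if __name__ == "__main__":
    print("AMX INT8 support:", has_amx())
```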

Thanks all!
csabakecskemeti
posted an update 27 days ago
Recently there has been so much activity around token-efficient formats that I've built a package of my own (inspired by TOON).

Deep-TOON

My goal was to handle JSON structures with complex embeddings in a token-efficient way.

This is what I built over the weekend. Feel free to try it:

https://pypi.org/project/deep-toon/0.1.0/
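
To make the idea concrete, here's a toy illustration of the token-efficiency angle (this is not the deep-toon API, just the underlying intuition): a uniform list of JSON objects repeats every key per record, while a TOON-style tabular form states the keys once.

```python
import json

# Two encodings of the same records: plain JSON repeats the keys for
# every object; a TOON-style tabular form emits them only once.
records = [
    {"id": 1, "name": "alpha", "score": 0.91},
    {"id": 2, "name": "beta", "score": 0.87},
]

as_json = json.dumps(records)
keys = list(records[0])
as_toonish = ",".join(keys) + "\n" + "\n".join(
    ",".join(str(r[k]) for k in keys) for r in records
)

print(len(as_json), "chars as JSON")
print(len(as_toonish), "chars in tabular form")
```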

csabakecskemeti
posted an update about 2 months ago
Christmas came early this year
csabakecskemeti
posted an update 6 months ago
Has anyone ever backed up a model to a sequential tape drive, or am I the world's first? :D
Just played around with my retro PC that has a tape drive; did it just because I can.
csabakecskemeti
posted an update 6 months ago
csabakecskemeti
posted an update 8 months ago
csabakecskemeti
posted an update 8 months ago
csabakecskemeti
posted an update 9 months ago
I'm collecting llama-bench results for inference with llama 3.1 8B q4 and q8 reference models on various GPUs. The results are averages of 5 executions.
The systems vary (different motherboards and CPUs, but that probably has little effect on inference performance).

https://devquasar.com/gpu-gguf-inference-comparison/
The exact models used are listed on the page.

I'd welcome results from other GPUs if you have access to anything else; everything you need is in the post. Hopefully this is useful information for everyone.
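
If you want to contribute numbers, here is a hedged sketch of the kind of run that produces them (binary and model paths are placeholders; -p 512 and -n 128 are llama-bench's prompt-processing and token-generation tests, and -r 5 matches the five-run averaging above):

```python
import subprocess

# Run llama.cpp's llama-bench: 512-token prompt processing (pp512),
# 128-token generation (tg128), 5 repetitions reported as mean ± stddev.
subprocess.run(
    [
        "./llama-bench",                 # placeholder binary path
        "-m", "llama-3.1-8b.Q8_0.gguf",  # placeholder GGUF path
        "-p", "512",
        "-n", "128",
        "-r", "5",
    ],
    check=True,
)
```
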
csabakecskemeti
posted an update 9 months ago
Managed to get my hands on a 5090FE, and it's beefy.

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 99 | pp512 | 12207.44 ± 481.67 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 99 | tg128 | 143.18 ± 0.18 |

Comparison with other GPUs:
http://devquasar.com/gpu-gguf-inference-comparison/
csabakecskemeti
posted an update 9 months ago
csabakecskemeti
posted an update 9 months ago
csabakecskemeti
posted an update 9 months ago
Fine-tuning on the edge. Pushing the MI100 to its limits.
QwQ-32B 4-bit QLoRA fine-tuning
VRAM usage: 31.498 G / 31.984 G :D
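
For context, a minimal sketch of what a 4-bit QLoRA setup of this kind usually looks like with transformers + peft + bitsandbytes (hyperparameters and target modules are illustrative, not the ones from this run; on an MI100 this assumes a ROCm-enabled bitsandbytes build):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantized base model; only the LoRA adapters are trained.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/QwQ-32B", quantization_config=bnb, device_map="auto"
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # adapters are a tiny fraction of 32B
```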

csabakecskemeti
posted an update 9 months ago
-UPDATED-
4-bit inference is working! The blog post is updated with a code snippet and requirements.txt
https://devquasar.com/uncategorized/all-about-amd-and-rocm/
-UPDATED-
I've played around with an MI100 and ROCm and collected my experience in a blog post:
https://devquasar.com/uncategorized/all-about-amd-and-rocm/
Unfortunately I could not make inference or training work with the model loaded in 8-bit, or use BnB, but I did everything else and documented my findings.
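
The tested snippet is in the blog post; as rough orientation, 4-bit loading with transformers + bitsandbytes generally looks like the sketch below (model name is illustrative, and on ROCm it assumes a ROCm-enabled bitsandbytes build):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load a model with 4-bit weights and run a short generation.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb,
    device_map="auto",
)
inputs = tok("Hello from the MI100:", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```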
csabakecskemeti
posted an update 10 months ago
Testing training on AMD/ROCm for the first time!

I've got my hands on an AMD Instinct MI100. Used, it costs about the same as a V100, but on paper it has more TOPS (V100 14 TOPS vs. MI100 23 TOPS), and its HBM has a faster clock, so the memory bandwidth is 1.2 TB/s.
For quantized inference it's a beast (the MI50 was also surprisingly fast).

For LoRA training in this quick test I could not make the bnb config work, so I'm running the FT on the full-size model (see the sketch below).

I will share all the install, setup, and settings I've learned in a blog post, together with the cooling shroud 3D design.
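
As a sketch of that fallback (illustrative model name and hyperparameters, not the exact run): LoRA adapters on the full-precision model, skipping bitsandbytes entirely.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# LoRA on the full-size (non-quantized) model: no bnb config needed,
# at the cost of holding the full-precision weights in VRAM.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", device_map="auto"
)
lora = LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```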
csabakecskemeti
posted an update 10 months ago
csabakecskemeti
posted an update 10 months ago
Check out my idea:
LLmaaS - Local LLM as a Service

With LLmaaS, I propose leveraging locally running LLMs as a service, providing a standardized way for websites to access and utilize them for LLM-powered operations directly on the user's device.

Demo, code, and a more detailed description:
https://devquasar.com/llmaas/
https://github.com/csabakecskemeti/LLmaaS
https://youtu.be/OOWGr8jcP5Q

Call for contributors:
Join me in developing the LLmaaS proxy to make it a general-purpose tool for leveraging local LLMs on the web, with built-in security measures.
I'm looking for help to make the proxy more generic so it supports multiple local LLM services without any change on the HTML side.
I'm also looking for ideas on how to make the HTML part more modular and easy to use.
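
To make the proxy idea concrete, a minimal hedged sketch (not the repo's actual code): a tiny local HTTP service a web page can call, forwarding chat requests to a locally running OpenAI-compatible server. The upstream URL and ports are assumptions.

```python
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumed upstream: an OpenAI-compatible local server (e.g. llama.cpp's
# llama-server default port); swap in whatever your local LLM exposes.
UPSTREAM = "http://localhost:8080/v1/chat/completions"

class LLmaaSProxy(BaseHTTPRequestHandler):
    def do_OPTIONS(self):
        # CORS preflight so browser pages may POST JSON to the proxy.
        self.send_response(204)
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Access-Control-Allow-Headers", "Content-Type")
        self.send_header("Access-Control-Allow-Methods", "POST, OPTIONS")
        self.end_headers()

    def do_POST(self):
        # Forward the raw JSON body to the local LLM server unchanged.
        body = self.rfile.read(int(self.headers["Content-Length"]))
        req = urllib.request.Request(
            UPSTREAM, data=body, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            payload = resp.read()
        self.send_response(200)
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 5050), LLmaaSProxy).serve_forever()
```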