Update README.md
README.md
CHANGED
@@ -11,7 +11,7 @@ language:
 
 # INT4 google/gemma-3-12b-it model
 
-- **Developed by:**
+- **Developed by:** pytorch
 - **License:** apache-2.0
 - **Quantized from Model :** google/gemma-3-12b-it
 - **Quantization Method :** INT4
@@ -28,14 +28,14 @@ pip install torchao
 Then we can serve with the following command:
 ```Shell
 # Server
-export MODEL=
+export MODEL=pytorch/gemma-3-12b-it-INT4
 VLLM_DISABLE_COMPILE_CACHE=1 vllm serve $MODEL --tokenizer $MODEL -O3
 ```
 
 ```Shell
 # Client
 curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
-"model": "
+"model": "pytorch/gemma-3-12b-it-INT4",
 "messages": [
 {"role": "user", "content": "Give me a short introduction to large language models."}
 ],
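
The same endpoint can also be exercised from Python. The sketch below is not part of the diff; it assumes the `openai` client package is installed and the vLLM server started above is listening on `localhost:8000`.

```python
# Minimal sketch: query the OpenAI-compatible endpoint exposed by the vLLM
# server above (assumes `pip install openai`; the api_key is unused by vLLM).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="pytorch/gemma-3-12b-it-INT4",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."}
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```
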
@@ -64,7 +64,7 @@ Example:
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
-model_name = "
+model_name = "pytorch/gemma-3-12b-it-INT4"
 
 # load the tokenizer and the model
 tokenizer = AutoTokenizer.from_pretrained(model_name)
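
The hunk above only covers the model id and tokenizer lines; the rest of the README's transformers example falls outside the diff context. As a rough, self-contained sketch of how such an example typically continues (standard transformers generation APIs, not necessarily the exact code in the file):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "pytorch/gemma-3-12b-it-INT4"

# load the tokenizer and the quantized model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype=torch.bfloat16
)

# build a chat-formatted prompt and generate a short reply
messages = [
    {"role": "user", "content": "Give me a short introduction to large language models."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
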
@@ -187,7 +187,7 @@ We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-h
 
 | Benchmark | | |
 |----------------------------------|----------------|---------------------------|
-| | google/gemma-3-12b-it |
+| | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-INT4 |
 | mmlu | 71.51 | 68.96 |
 
 
@@ -204,7 +204,7 @@ lm_eval --model hf --model_args pretrained=google/gemma-3-12b-it --tasks mmlu --
 
 ## INT4
 ```Shell
-export MODEL=
+export MODEL=pytorch/gemma-3-12b-it-INT4
 lm_eval --model hf --model_args pretrained=$MODEL --tasks mmlu --device cuda:0 --batch_size 8
 ```
 </details>
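
The harness can also be driven from Python instead of the CLI. A rough equivalent of the command above, assuming lm-evaluation-harness v0.4+ (which exposes `lm_eval.simple_evaluate`); the exact layout of the returned results dict varies by version.

```python
# Rough Python equivalent of the lm_eval CLI call above (assumption:
# lm-evaluation-harness >= 0.4; result keys may differ across versions).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pytorch/gemma-3-12b-it-INT4",
    tasks=["mmlu"],
    batch_size=8,
    device="cuda:0",
)
print(results["results"]["mmlu"])
```
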
@@ -218,7 +218,7 @@ lm_eval --model hf --model_args pretrained=$MODEL --tasks mmlu --device cuda:0 -
 
 | Benchmark | | |
 |------------------|----------------|--------------------------------|
-| | google/gemma-3-12b-it |
+| | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-INT4 |
 | Peak Memory (GB) | 24.50 | 8.68 (65% reduction) |
 
 
@@ -232,8 +232,8 @@ We can use the following code to get a sense of peak memory usage during inferen
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
 
-# use "google/gemma-3-12b-it" or "
-model_id = "
+# use "google/gemma-3-12b-it" or "pytorch/gemma-3-12b-it-INT4"
+model_id = "pytorch/gemma-3-12b-it-INT4"
 quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
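
Only the setup lines fall inside this hunk; the measurement itself (ending in the `print(f"Peak Memory Usage: ...")` visible in the next hunk header) sits outside the context. A self-contained sketch of the general approach, using torch.cuda's peak-memory counters (not necessarily the exact code in the file):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/gemma-3-12b-it-INT4"
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# reset the peak-memory counter, run one generation, then read the high-water mark
torch.cuda.reset_peak_memory_stats()
inputs = tokenizer(
    "Give me a short introduction to large language models.", return_tensors="pt"
).to(quantized_model.device)
quantized_model.generate(**inputs, max_new_tokens=128)
mem = torch.cuda.max_memory_allocated() / 1024**3  # bytes -> GB
print(f"Peak Memory Usage: {mem:.02f} GB")
```
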
@@ -278,7 +278,7 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
 ## Results (A100 machine)
 | Benchmark (Latency) | | |
 |----------------------------------|----------------|--------------------------|
-| | google/gemma-3-12b-it |
+| | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-INT4 |
 | latency (batch_size=1) | 3.73s | 2.16s (1.73x speedup) |
 
 <details>
@@ -308,7 +308,7 @@ python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model
 
 ### INT4
 ```Shell
-export MODEL=
+export MODEL=pytorch/gemma-3-12b-it-INT4
 VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model $MODEL --batch-size 1
 ```
 </details>
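
For a quick spot check without the benchmark script, vLLM's offline `LLM` API can be timed directly. This is only a rough sketch (assumes vLLM is installed); it does not reproduce `benchmark_latency.py`, which controls warmup and input/output token counts.

```python
# Rough latency spot check via vLLM's offline API; not a replacement for
# benchmarks/benchmark_latency.py (no warmup, no fixed input length).
import time
from vllm import LLM, SamplingParams

llm = LLM(model="pytorch/gemma-3-12b-it-INT4")
params = SamplingParams(max_tokens=256, ignore_eos=True)

start = time.perf_counter()
llm.generate(["Give me a short introduction to large language models."], params)
print(f"end-to-end latency: {time.perf_counter() - start:.2f}s")
```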