---
license: apache-2.0
datasets:
- FreedomIntelligence/TCM-Pretrain-Data-ShizhenGPT
- FreedomIntelligence/TCM-Instruction-Tuning-ShizhenGPT
language:
- zh
base_model:
- Qwen/Qwen2.5-7B
pipeline_tag: text-generation
tags:
- Traditional Chinese Medicine
- Multimodal LLM
- multimodal
- Image-text-to-text
- Audio-text-to-text
---

<div align="center">
<h1>
  ShizhenGPT-7B-Omni
</h1>
</div>

<div align="center">
<a href="https://github.com/FreedomIntelligence/ShizhenGPT" target="_blank">GitHub</a> | <a href="https://arxiv.org/abs/2508.14706" target="_blank">Paper</a>
</div>


**ShizhenGPT** is the first multimodal LLM for Traditional Chinese Medicine (TCM).
Beyond strong TCM expertise, it supports the four diagnostic methods of TCM: looking (望), listening/smelling (闻), questioning (问), and pulse-taking (切).

👉 More details on GitHub: [ShizhenGPT](https://github.com/FreedomIntelligence/ShizhenGPT)


# <span>Model Info</span>

> **ShizhenGPT-7B-Omni** is the full version of ShizhenGPT-7B and accepts every supported input modality. If you only need text or image input, consider one of the other variants:

|                        | Parameters | Supported Modalities          | Link                                                                  |
| ---------------------- | ---------- | ----------------------------- | --------------------------------------------------------------------- |
| **ShizhenGPT-7B-LLM**  | 7B         | Text                          | [HF Link](https://huggingface.co/FreedomIntelligence/ShizhenGPT-7B-LLM) |
| **ShizhenGPT-7B-VL**   | 7B         | Text, Image Understanding     | [HF Link](https://huggingface.co/FreedomIntelligence/ShizhenGPT-7B-VL) |
| **ShizhenGPT-7B-Omni** | 7B         | Text, Four Diagnostics (望闻问切) | [HF Link](https://huggingface.co/FreedomIntelligence/ShizhenGPT-7B-Omni) |
| **ShizhenGPT-32B-LLM**  | 32B        | Text                          | [HF Link](https://huggingface.co/FreedomIntelligence/ShizhenGPT-32B-LLM) |
| **ShizhenGPT-32B-VL**   | 32B        | Text, Image Understanding     | [HF Link](https://huggingface.co/FreedomIntelligence/ShizhenGPT-32B-VL) |
| **ShizhenGPT-32B-Omni** | 32B        | Text, Four Diagnostics (望闻问切) | Available soon                                                          |
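The 7B rows of the table above can be read as a simple selection rule. The sketch below is only an illustration: `pick_variant` and `REPO_PREFIX` are hypothetical names, not part of the ShizhenGPT release, and the modality labels (`"text"`, `"image"`, `"audio"`) are assumptions chosen for this example.

```python
# Hypothetical helper; the mapping mirrors the 7B rows of the table above.
REPO_PREFIX = "FreedomIntelligence/ShizhenGPT-7B-"

def pick_variant(modalities):
    """Return the 7B repo id that covers the requested input modalities."""
    needed = {m.lower() for m in modalities}
    if needed <= {"text"}:
        return REPO_PREFIX + "LLM"   # text only
    if needed <= {"text", "image"}:
        return REPO_PREFIX + "VL"    # text plus image understanding
    return REPO_PREFIX + "Omni"      # anything else needs the full model

print(pick_variant(["text", "image"]))
print(pick_variant(["text", "audio"]))
```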

*Note: The LLM and VL models are parameter-split variants of ShizhenGPT-7B-Omni. Since their architectures align with Qwen2.5 and Qwen2.5-VL, they are easier to adapt to different environments. In contrast, ShizhenGPT-7B-Omni requires `transformers==4.51.0`.*
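Because the Omni model is pinned to one `transformers` release, it can help to check the installed version before loading anything. The following is a minimal sketch using only the standard library; `matches_pin` and `check_transformers` are hypothetical helper names, and the exact-match comparison is an assumption based on the `==4.51.0` pin stated above.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED_TRANSFORMERS = "4.51.0"  # pin stated in this model card

def matches_pin(installed: str, required: str = REQUIRED_TRANSFORMERS) -> bool:
    """Return True when the installed version string equals the pin exactly."""
    return installed.strip() == required

def check_transformers() -> str:
    """Return a short status message about the local transformers install."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return (f"transformers is not installed; "
                f"try: pip install transformers=={REQUIRED_TRANSFORMERS}")
    if matches_pin(installed):
        return f"transformers {installed} matches the required version"
    return f"transformers {installed} found, but {REQUIRED_TRANSFORMERS} is required"

print(check_transformers())
```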


# <span>Usage</span>
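To use `ShizhenGPT-7B-Omni`, install `transformers==4.51.0` and pass `trust_remote_code=True` when loading the model and processor. The script also imports `qwen_vl_utils` and `librosa`, so both must be installed. The following script runs text, image, and audio queries: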

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from qwen_vl_utils import fetch_image
import librosa

# Load model and processor
model_path = 'FreedomIntelligence/ShizhenGPT-7B-Omni'
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype="auto").cuda()
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

def generate(text, images=None, signals=None):
    """Run a single-turn query with optional image and audio inputs."""
    # Drop None entries first so the placeholder count matches the loaded inputs
    images = [img for img in (images or []) if img is not None]
    signals = [sig for sig in (signals or []) if sig is not None]

    # Prepend one vision placeholder per image, then load the images
    processed_images = None
    if images:
        text = '<|vision_start|><|image_pad|><|vision_end|>' * len(images) + text
        processed_images = [fetch_image({"type": "image", "image": img, "max_pixels": 360*420})
                            for img in images]

    # Prepend one audio placeholder per signal, then load the waveforms
    processed_signals = None
    if signals:
        text = '<|audio_bos|><|AUDIO|><|audio_eos|>' * len(signals) + text
        processed_signals = [librosa.load(sig, sr=processor.feature_extractor.sampling_rate)[0]
                             for sig in signals]

    # Wrap the prompt in the chat template
    messages = [{'role': 'user', 'content': text}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    # Process the input data
    input_data = processor(
        text=[text],
        audios=processed_signals,
        images=processed_images, 
        return_tensors="pt", 
        padding=True
    )
    input_data = input_data.to(model.device)
    
    # Generate the output
    generated_ids = model.generate(**input_data, max_new_tokens=1024)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(input_data.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    return output_text[0]

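# Example usage
# Text input ("Why are my hands and feet always cold? Is it yang deficiency?")
print(generate('为什么我总是手脚冰凉，是阳虚吗？'))
# Image input ("Please interpret this tongue coating from a TCM perspective.")
print(generate('请从中医角度解读这张舌苔。', images=['path_to_image']))
# Audio input ("Please answer this spoken question.")
print(generate('请回答这个语音问题', signals=['path_to_audio']))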

```
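The `generate()` function in the script above prepends one placeholder token group per image and per audio clip before applying the chat template. The sketch below isolates that step so the resulting prompt string can be inspected without loading the model. `add_placeholders`, `IMAGE_SLOT`, and `AUDIO_SLOT` are hypothetical names introduced for illustration; the token strings themselves are copied from the script.

```python
# Placeholder token groups copied from the usage script above
IMAGE_SLOT = "<|vision_start|><|image_pad|><|vision_end|>"
AUDIO_SLOT = "<|audio_bos|><|AUDIO|><|audio_eos|>"

def add_placeholders(text: str, n_images: int = 0, n_audios: int = 0) -> str:
    """Prepend one placeholder per input, in the same order as generate()."""
    text = IMAGE_SLOT * n_images + text  # image slots are prepended first
    text = AUDIO_SLOT * n_audios + text  # audio slots end up in front of them
    return text

# Illustrative prompt text; any question string works here
prompt = add_placeholders("Describe the tongue image and the recording.",
                          n_images=1, n_audios=1)
print(prompt)
```

Since `generate()` takes both `images` and `signals`, its signature suggests that a combined query such as `generate(question, images=[...], signals=[...])` is possible; in that case the audio placeholders precede the image placeholders in the final prompt.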


# <span>📖 Citation</span>
```bibtex
@misc{chen2025shizhengptmultimodalllmstraditional,
      title={ShizhenGPT: Towards Multimodal LLMs for Traditional Chinese Medicine}, 
      author={Junying Chen and Zhenyang Cai and Zhiheng Liu and Yunjin Yang and Rongsheng Wang and Qingying Xiao and Xiangyi Feng and Zhan Su and Jing Guo and Xiang Wan and Guangjun Yu and Haizhou Li and Benyou Wang},
      year={2025},
      eprint={2508.14706},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.14706},
}
```