Covo-Audio

Covo-Audio Technical Report

📖 Overview

Covo-Audio is a 7B-parameter end-to-end large audio language model that directly processes continuous audio inputs and generates audio outputs within a single unified architecture, which is presented in the paper Covo-Audio Technical Report. We release Covo-Audio-Chat in this repository.

Covo-Audio-Chat Results — *An Overview of Comprehensive Performance Comparison.*

Key Features

Hierarchical Tri-modal Speech-Text Interleaving: We propose a framework designed to achieve deep alignment and fusion across modalities and scales. The Tri-modal aspect integrates continuous acoustic features, discrete speech tokens, and natural language text within a unified sequence, effectively bridging the gap between high-fidelity prosodic nuances and robust semantic structures.
Mitigating Intelligence-Speaker Coupling: We propose a intelligence-speaker decoupling technique that decouples speaker from dialogue intelligence via multi-speaker training, then develop a contextual adaptation method to transfer and share high-quality TTS voice.
Native Full-Duplex Voice Interaction: We evolve Covo-Audio into Covo-Audio-Chat-FD, a variant with native, low-latency full-duplex capability.
Comprehensive State-of-the-Art Performance: Achieving state-of-the-art or competitive performance among models of comparable scale across a broad spectrum of tasks, including spoken dialogue, speech understanding, audio understanding, and full-duplex voice interaction.

🔧 Installation

1. Requirements

Recommends Python >= 3.11

conda create -n covoaudio python=3.11
conda activate covoaudio
pip install -r requirements.txt

2. Clone Repository

git clone https://github.com/Tencent/Covo-Audio.git
cd Covo-Audio

3. Download Pretrained Models

Using HuggingFace:

pip install huggingface-hub
hf download tencent/Covo-Audio-Chat --local-dir ./covoaudio

By running the above script, you can use the model downloaded from huggingface to override the directory of the same name in this repository. Or you can specify your own directory to store the model by modifying the local-dir argument (In this case, you need to edit the arguments model_dir and decode_load_path in example.sh accordingly before running the inference script).

🚀 Usage

Run Inference Scripts

After completeing the configuration and model downloading, you can perform one-click inference by running the script:

bash example.sh

To perform interaction with our model, just replace the paths in example.py with your own audio files.

🙏 Acknowledgments

Part of the code for this project is based on the following open-source projects:

The llm backbone and audio encoder of Covo-Audio are initialized respectively with the weights from:

🔗 Citation

If you find this model useful, please cite our paper:

@misc{wang2026covoaudiotechnicalreport,
      title={Covo-Audio Technical Report}, 
      author={Wenfu Wang and Chenxing Li and Liqiang Zhang and Yiyang Zhao and Yuxiang Zou and Hanzhao Li and Mingyu Cui and Hao Zhang and Kun Wei and Le Xu and Zikang Huang and Jiajun Xu and Jiliang Hu and Xiang He and Zeyu Xie and Jiawen Kang and Youjun Chen and Meng Yu and Dong Yu and Rilin Chen and Linlin Di and Shulin Feng and Na Hu and Yang Liu and Bang Wang and Shan Yang},
      year={2026},
      eprint={2602.09823},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2602.09823}, 
}

📄 License

Our model and code are licensed under LICENSE.

✉️ Contact

If you have any questions or suggestions, feel free to contact us:

📔 Disclaimer

Covo-Audio-Chat is for research and experimental purposes only. It may occasionally produce inaccurate, inappropriate, biased, outdated, or factually incorrect content. Users should independently verify critical information, and are solely responsible for their use of the model and any consequences thereof.

Downloads last month: 77

Safetensors

Model size

8B params

Tensor type

BF16

Space using tencent/Covo-Audio-Chat 1

Paper for tencent/Covo-Audio-Chat

Covo-Audio Technical Report

Paper • 2602.09823 • Published Feb 10 • 12