nielsr HF Staff committed on
Commit
0da579a
·
verified ·
1 Parent(s): 7a5889c

Update pipeline tag, add library name, paper link, and sample usage


This PR significantly improves the model card for the WorldPlay model by:

* Updating the `pipeline_tag` from `text-to-3d` to `image-to-video`. This more accurately reflects the model's primary function of generating streaming video from an image or text prompt, as evidenced by the paper abstract and the `config.json` (`"ideal_task": "i2v"`).
* Adding `library_name: diffusers` to the metadata. The `config.json` shows `"_diffusers_version": "0.35.0"`, confirming compatibility with the Hugging Face `diffusers` library. This enables automated code snippets for users.
* Integrating a direct link to the official Hugging Face paper page ([WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling](https://huggingface.co/papers/2512.14614)) into the badge section, replacing the previous general report link for clearer access to the publication. The existing Project Page badge's text is also clarified.
* Adding a `Sample Usage` section with the official bash inference example, along with an `Evaluation` section, more example videos, additional BibTeX entries, and a short paper statement at the top of the card (a minimal checkpoint-download sketch follows this list).
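
Because the card now identifies `tencent/HY-WorldPlay` as a `diffusers`-tagged image-to-video model, a minimal, hedged sketch for pulling the checkpoints locally is included below. It assumes only the standard Hugging Face Hub CLI; the local directory name is an arbitrary example, and the repository's internal layout is not assumed here.

```bash
# Minimal sketch: download the WorldPlay checkpoints referenced by this card.
# Assumes only the Hugging Face Hub CLI; ./HY-WorldPlay is an arbitrary local path.
pip install -U "huggingface_hub[cli]"
huggingface-cli download tencent/HY-WorldPlay --local-dir ./HY-WorldPlay
```

The resulting folder can then be supplied to the checkpoint-path variables used in the Sample Usage section added further down.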

Files changed (1)
  1. README.md +139 -9
README.md CHANGED
@@ -1,11 +1,12 @@
  ---
- license: other
- license_name: tencent-hy-worldplay-community
- license_link: https://github.com/Tencent-Hunyuan/HY-WorldPlay/blob/main/License.txt
  language:
  - en
  - zh
- pipeline_tag: text-to-3d
  tags:
  - hunyuan3d
  - worldmodel
@@ -17,6 +18,7 @@ tags:
  - text-to-3D
  ---

  <div align="center">
  <img src="https://github.com/Tencent-Hunyuan/HY-WorldPlay/raw/main/assets/teaser.webp">
@@ -29,8 +31,8 @@ tags:
  <a href=https://3d.hunyuan.tencent.com/sceneTo3D?tab=worldplay target="_blank"><img src=https://img.shields.io/badge/Official%20Site-333399.svg?logo=homepage height=22px></a>
  <a href=https://huggingface.co/tencent/HY-WorldPlay target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Models-d96902.svg height=22px></a>
  <a href="https://github.com/Tencent-Hunyuan/HY-WorldPlay/" target="_blank"><img src="https://img.shields.io/badge/GitHub-181717.svg?logo=github&logoColor=white" height="22px"></a>
- <a href="https://3d-models.hunyuan.tencent.com/world/" target="_blank"><img src="https://img.shields.io/badge/Page-bb8a2e.svg?logo=googlechrome&logoColor=white" height="22px"></a>
- <a href=https://3d-models.hunyuan.tencent.com/world/world1_5/HYWorld_1.5_Tech_Report.pdf target="_blank"><img src=https://img.shields.io/badge/Report-b5212f.svg?logo=arxiv height=22px></a>
  <a href=https://discord.gg/dNBrdrGGMa target="_blank"><img src= https://img.shields.io/badge/Discord-white.svg?logo=discord height=22px></a>
  <a href=https://x.com/TencentHunyuan target="_blank"><img src=https://img.shields.io/badge/Tencent%20HY-black.svg?logo=x height=22px></a>
  <a href="#community-resources" target="_blank"><img src=https://img.shields.io/badge/Community-lavender.svg?logo=homeassistantcommunitystore height=22px></a>
@@ -47,7 +49,7 @@ tags:

  ## 📖 Introduction
- While **HY-World 1.0** is capable of generating immersive 3D worlds, it relies on a lengthy offline generation process and lacks real-time interaction. **HY-World 1.5** bridges this gap with **WorldPlay**, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. Our model draws power from four key designs. 1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We design WorldCompass, a novel Reinforcement Learning (RL) post-training framework designed to directly improve the action-following and visual quality of the long-horizon, autoregressive video model. 4) We also propose Context Forcing, a novel distillation method designed for memory-aware models. Aligning memory context between the teacher and student preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, HY-World 1.5 generates long-horizon streaming video at 24 FPS with superior consistency, comparing favorably with existing techniques. Our model shows strong generalization across diverse scenes, supporting first-person and third-person perspectives in both real-world and stylized environments, enabling versatile applications such as 3D reconstruction, promptable events, and infinite world extension.

  - **Systematic Overview**
@@ -66,6 +68,121 @@ While **HY-World 1.0** is capable of generating immersive 3D worlds, it relies o
  <img src="https://github.com/Tencent-Hunyuan/HY-WorldPlay/raw/main/assets/pipeline.png">
  </p>

  ## 📚 Citation

  ```bibtex
@@ -75,10 +192,23 @@ While **HY-World 1.0** is capable of generating immersive 3D worlds, it relies o
  journal={arXiv preprint},
  year={2025}
  }
  ```

  ## 🙏 Acknowledgements
  We would like to thank [HunyuanWorld](https://github.com/Tencent-Hunyuan/HunyuanWorld-1.0), [HunyuanWorld-Mirror
- ](https://github.com/Tencent-Hunyuan/HunyuanWorld-Mirror), [HunyuanVideo](https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5), and [FastVideo](https://github.com/hao-ai-lab/FastVideo) for their great work.
-
 
  ---
  language:
  - en
  - zh
+ license: other
+ license_name: tencent-hy-worldplay-community
+ license_link: https://github.com/Tencent-Hunyuan/HY-WorldPlay/blob/main/License.txt
+ pipeline_tag: image-to-video
+ library_name: diffusers
  tags:
  - hunyuan3d
  - worldmodel

  - text-to-3D
  ---

+ This repository contains the **WorldPlay** model, a streaming video diffusion model, presented in the paper [WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling](https://huggingface.co/papers/2512.14614).

  <div align="center">
  <img src="https://github.com/Tencent-Hunyuan/HY-WorldPlay/raw/main/assets/teaser.webp">
 
  <a href=https://3d.hunyuan.tencent.com/sceneTo3D?tab=worldplay target="_blank"><img src=https://img.shields.io/badge/Official%20Site-333399.svg?logo=homepage height=22px></a>
  <a href=https://huggingface.co/tencent/HY-WorldPlay target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Models-d96902.svg height=22px></a>
  <a href="https://github.com/Tencent-Hunyuan/HY-WorldPlay/" target="_blank"><img src="https://img.shields.io/badge/GitHub-181717.svg?logo=github&logoColor=white" height="22px"></a>
+ <a href="https://3d-models.hunyuan.tencent.com/world/" target="_blank"><img src=https://img.shields.io/badge/Project%20Page-bb8a2e.svg?logo=googlechrome&logoColor=white height=22px></a>
+ <a href=https://huggingface.co/papers/2512.14614 target="_blank"><img src=https://img.shields.io/badge/Paper-b5212f.svg?logo=huggingface height=22px></a>
  <a href=https://discord.gg/dNBrdrGGMa target="_blank"><img src= https://img.shields.io/badge/Discord-white.svg?logo=discord height=22px></a>
  <a href=https://x.com/TencentHunyuan target="_blank"><img src=https://img.shields.io/badge/Tencent%20HY-black.svg?logo=x height=22px></a>
  <a href="#community-resources" target="_blank"><img src=https://img.shields.io/badge/Community-lavender.svg?logo=homeassistantcommunitystore height=22px></a>
 

  ## 📖 Introduction
+ While **HY-World 1.0** is capable of generating immersive 3D worlds, it relies on a lengthy offline generation process and lacks real-time interaction. **HY-World 1.5** bridges this gap with **WorldPlay**, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. Our model draws power from four key designs. 1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We design WorldCompass, a novel Reinforcement Learning (RL) post-training framework designed to directly improve the action-following and visual quality of the long-horizon, autoregressive video model. 4) We also propose Context Forcing, a novel distillation method designed for memory-aware models. Aligning memory context between the teacher and student preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, HY-World 1.5 generates long-horizon streaming video at 24 FPS with superior consistency, comparing favorably with existing techniques. Our model shows strong generalization across diverse scenes, supporting first-person and third-person perspectives in both real-world and stylized environments, enabling versatile applications such as 3D reconstruction, promptable events, and infinite world extension.

  - **Systematic Overview**

  <img src="https://github.com/Tencent-Hunyuan/HY-WorldPlay/raw/main/assets/pipeline.png">
  </p>

+ ## 🔑 Sample Usage
+ We open-source the inference code for both the bidirectional and autoregressive diffusion models. For prompt rewriting, we recommend Gemini or models deployed via vLLM; this codebase currently only supports models compatible with the vLLM API, so if you wish to use Gemini you will need to implement your own interface calls. Details can be found in [HunyuanVideo-1.5](https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5). A minimal sketch for launching a rewrite server is given after the example below.
+
+ We recommend using `generate_custom_trajectory.py` to generate customized camera trajectories.
+
+ ```bash
+ export T2V_REWRITE_BASE_URL="<your_vllm_server_base_url>"
+ export T2V_REWRITE_MODEL_NAME="<your_model_name>"
+ export I2V_REWRITE_BASE_URL="<your_vllm_server_base_url>"
+ export I2V_REWRITE_MODEL_NAME="<your_model_name>"
+
+ PROMPT='A paved pathway leads towards a stone arch bridge spanning a calm body of water. Lush green trees and foliage line the path and the far bank of the water. A traditional-style pavilion with a tiered, reddish-brown roof sits on the far shore. The water reflects the surrounding greenery and the sky. The scene is bathed in soft, natural light, creating a tranquil and serene atmosphere. The pathway is composed of large, rectangular stones, and the bridge is constructed of light gray stone. The overall composition emphasizes the peaceful and harmonious nature of the landscape.'
+
+ IMAGE_PATH=./assets/img/test.png # Now we only provide the i2v model, so the path cannot be None
+ SEED=1
+ ASPECT_RATIO=16:9
+ RESOLUTION=480p # Now we only provide the 480p model
+ OUTPUT_PATH=./outputs/
+ MODEL_PATH= # Path to pretrained hunyuanvideo-1.5 model
+ AR_ACTION_MODEL_PATH= # Path to our HY-World 1.5 autoregressive checkpoints
+ BI_ACTION_MODEL_PATH= # Path to our HY-World 1.5 bidirectional checkpoints
+ AR_DISTILL_ACTION_MODEL_PATH= # Path to our HY-World 1.5 autoregressive distilled checkpoints
+ POSE_JSON_PATH=./assets/pose/test_forward_32_latents.json # Path to the customized camera trajectory
+ NUM_FRAMES=125
+
+ # Configuration for faster inference
+ # For AR inference, the maximum number recommended is 4. For bidirectional models, it can be set to 8.
+ N_INFERENCE_GPU=4 # Parallel inference GPU count.
+
+ # Configuration for better quality
+ REWRITE=false # Enable prompt rewriting. Please ensure rewrite vLLM server is deployed and configured.
+ ENABLE_SR=false # Enable super resolution. When the NUM_FRAMES == 121, you can set it to true
+
+ # inference with bidirectional model
+ torchrun --nproc_per_node=$N_INFERENCE_GPU generate.py \
+ --prompt "$PROMPT" \
+ --image_path $IMAGE_PATH \
+ --resolution $RESOLUTION \
+ --aspect_ratio $ASPECT_RATIO \
+ --video_length $NUM_FRAMES \
+ --seed $SEED \
+ --rewrite $REWRITE \
+ --sr $ENABLE_SR --save_pre_sr_video \
+ --pose_json_path $POSE_JSON_PATH \
+ --output_path $OUTPUT_PATH \
+ --model_path $MODEL_PATH \
+ --action_ckpt $BI_ACTION_MODEL_PATH \
+ --few_step false \
+ --model_type 'bi'
+
+ # inference with autoregressive model
+ #torchrun --nproc_per_node=$N_INFERENCE_GPU generate.py \
+ # --prompt "$PROMPT" \
+ # --image_path $IMAGE_PATH \
+ # --resolution $RESOLUTION \
+ # --aspect_ratio $ASPECT_RATIO \
+ # --video_length $NUM_FRAMES \
+ # --seed $SEED \
+ # --rewrite $REWRITE \
+ # --sr $ENABLE_SR --save_pre_sr_video \
+ # --pose_json_path $POSE_JSON_PATH \
+ # --output_path $OUTPUT_PATH \
+ # --model_path $MODEL_PATH \
+ # --action_ckpt $AR_ACTION_MODEL_PATH \
+ # --few_step false \
+ # --model_type 'ar'
+
+ # inference with autoregressive distilled model
+ #torchrun --nproc_per_node=$N_INFERENCE_GPU generate.py \
+ # --prompt "$PROMPT" \
+ # --image_path $IMAGE_PATH \
+ # --resolution $RESOLUTION \
+ # --aspect_ratio $ASPECT_RATIO \
+ # --video_length $NUM_FRAMES \
+ # --seed $SEED \
+ # --rewrite $REWRITE \
+ # --sr $ENABLE_SR --save_pre_sr_video \
+ # --pose_json_path $POSE_JSON_PATH \
+ # --output_path $OUTPUT_PATH \
+ # --model_path $MODEL_PATH \
+ # --action_ckpt $AR_DISTILL_ACTION_MODEL_PATH \
+ # --few_step true \
+ # --num_inference_steps 4 \
+ # --model_type 'ar'
+ ```
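
The `*_REWRITE_*` variables above assume an OpenAI-compatible rewrite endpoint is already running. A minimal sketch for launching one with vLLM is shown below; the Qwen model name is only a placeholder, and whether the scripts expect the `/v1` suffix in the base URL is an assumption to verify against the repository.

```bash
# Minimal sketch: serve a placeholder prompt-rewrite model behind an OpenAI-compatible API with vLLM.
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --port 8000

# Point the rewrite variables at the server before running generate.py with REWRITE=true.
export T2V_REWRITE_BASE_URL="http://localhost:8000/v1"   # the "/v1" suffix is an assumption
export I2V_REWRITE_BASE_URL="http://localhost:8000/v1"
export T2V_REWRITE_MODEL_NAME="Qwen/Qwen2.5-7B-Instruct"
export I2V_REWRITE_MODEL_NAME="Qwen/Qwen2.5-7B-Instruct"
```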
+
+ ## 📊 Evaluation
+
+ HY-World 1.5 surpasses existing methods across various quantitative metrics, including reconstruction metrics for different video lengths and human evaluations.
+
+ | Model | Real-time | | | Short-term | | | | | Long-term | | |
+ |:---------------------------| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+ | | | **PSNR** ⬆ | **SSIM** ⬆ | **LPIPS** ⬇ | **$R_{dist}$** ⬇ | **$T_{dist}$** ⬇ | **PSNR** ⬆ | **SSIM** ⬆ | **LPIPS** ⬇ | **$R_{dist}$** ⬇ | **$T_{dist}$** ⬇ |
+ | CameraCtrl | ❌ | 17.93 | 0.569 | 0.298 | 0.037 | 0.341 | 10.09 | 0.241 | 0.549 | 0.733 | 1.117 |
+ | SEVA | ❌ | 19.84 | 0.598 | 0.313 | 0.047 | 0.223 | 10.51 | 0.301 | 0.517 | 0.721 | 1.893 |
+ | ViewCrafter | ❌ | 19.91 | 0.617 | 0.327 | 0.029 | 0.543 | 9.32 | 0.271 | 0.661 | 1.573 | 3.051 |
+ | Gen3C | ❌ | 21.68 | 0.635 | 0.278 | **0.024** | 0.477 | 15.37 | 0.431 | 0.483 | 0.357 | 0.979 |
+ | VMem | ❌ | 19.97 | 0.587 | 0.316 | 0.048 | 0.219 | 12.77 | 0.335 | 0.542 | 0.748 | 1.547 |
+ | Matrix-Game-2.0 | ✅ | 17.26 | 0.505 | 0.383 | 0.287 | 0.843 | 9.57 | 0.205 | 0.631 | 2.125 | 2.742 |
+ | GameCraft | ❌ | 21.05 | 0.639 | 0.341 | 0.151 | 0.617 | 10.09 | 0.287 | 0.614 | 2.497 | 3.291 |
+ | Ours (w/o Context Forcing) | ❌ | 21.27 | 0.669 | 0.261 | 0.033 | 0.157 | 16.27 | 0.425 | 0.495 | 0.611 | 0.991 |
+ | **Ours (full)** | ✅ | **21.92** | **0.702** | **0.247** | 0.031 | **0.121** | **18.94** | **0.585** | **0.371** | **0.332** | **0.797** |
+
+ <p align="center">
+ <img src="https://github.com/Tencent-Hunyuan/HY-WorldPlay/raw/main/assets/human_eval.png">
+ </p>
+
+ ## 🎬 More Examples
+
+ https://github.com/user-attachments/assets/6aac8ad7-3c64-4342-887f-53b7100452ed
+
+ https://github.com/user-attachments/assets/531bf0ad-1fca-4d76-bb65-84701368926d
+
+ https://github.com/user-attachments/assets/f165f409-5a74-4e19-a32c-fc98d92259e1
+
  ## 📚 Citation

  ```bibtex
 
  journal={arXiv preprint},
  year={2025}
  }
+
+ @article{worldplay2025,
+ title={WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling},
+ author={Wenqiang Sun and Haiyu Zhang and Haoyuan Wang and Junta Wu and Zehan Wang and Zhenwei Wang and Yunhong Wang and Jun Zhang and Tengfei Wang and Chunchao Guo},
+ year={2025},
+ journal={arXiv preprint}
+ }
+
+ @article{wang2025compass,
+ title={WorldCompass: Reinforcement Learning for Long-Horizon World Models},
+ author={Zehan Wang and Tengfei Wang and Haiyu Zhang and Wenqiang Sun and Junta Wu and Haoyuan Wang and Zhenwei Wang and Hengshuang Zhao and Chunchao Guo and Zhou Zhao},
+ journal={arXiv preprint},
+ year={2025}
+ }
  ```

  ## 🙏 Acknowledgements
  We would like to thank [HunyuanWorld](https://github.com/Tencent-Hunyuan/HunyuanWorld-1.0), [HunyuanWorld-Mirror
+ ](https://github.com/Tencent-Hunyuan/HunyuanWorld-Mirror), [HunyuanVideo](https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5), and [FastVideo](https://github.com/hao-ai-lab/FastVideo) for their great work.