---
language:
- en
license: mit
library_name: same
tags:
- vision-language
- navigation
- embodied-ai
- visual-navigation
- mixture-of-experts
- multimodal
- pytorch
datasets:
- R2R
- REVERIE
- RXR
- CVDN
- SOON
- ObjectNav-MP3D
metrics:
- success_rate
- spl
pipeline_tag: visual-question-answering
model-index:
- name: SAME
  results:
  - task:
      type: visual-navigation
      name: Vision-and-Language Navigation
    dataset:
      type: R2R
      name: Room-to-Room (R2R)
    metrics:
    - type: success_rate
      value: 76
      name: SR (val_unseen)
    - type: spl
      value: 66
      name: SPL (val_unseen)
    - type: success_rate
      value: 74
      name: SR (test_unseen)
    - type: spl
      value: 64
      name: SPL (test_unseen)
  - task:
      type: visual-navigation
      name: Vision-and-Language Navigation
    dataset:
      type: REVERIE
      name: REVERIE
    metrics:
    - type: success_rate
      value: 46.4
      name: SR (val_unseen)
    - type: spl
      value: 36.1
      name: SPL (val_unseen)
    - type: success_rate
      value: 48.6
      name: SR (test_unseen)
    - type: spl
      value: 37.1
      name: SPL (test_unseen)
  - task:
      type: visual-navigation
      name: Multilingual VLN
    dataset:
      type: RXR
      name: RxR-EN
    metrics:
    - type: success_rate
      value: 50.5
      name: SR (val_unseen)
    - type: ndtw
      value: 51.2
      name: nDTW (val_unseen)
  - task:
      type: visual-navigation
      name: Dialog Navigation
    dataset:
      type: CVDN
      name: CVDN
    metrics:
    - type: goal_progress
      value: 6.94
      name: GP (val)
    - type: goal_progress
      value: 7.07
      name: GP (test)
  - task:
      type: visual-navigation
      name: Object-Oriented Navigation
    dataset:
      type: SOON
      name: SOON
    metrics:
    - type: success_rate
      value: 36.1
      name: SR (val_unseen)
    - type: spl
      value: 25.4
      name: SPL (val_unseen)
    - type: success_rate
      value: 38.2
      name: SR (test_unseen)
    - type: spl
      value: 27.1
      name: SPL (test_unseen)
  - task:
      type: object-navigation
      name: Object Navigation
    dataset:
      type: ObjectNav-MP3D
      name: ObjectNav-MP3D
    metrics:
    - type: success_rate
      value: 76.3
      name: SR (val)
    - type: spl
      value: 42.7
      name: SPL (val)
---
<div align="center">
<h1><span style="background: linear-gradient(to right, #007BA7, #99B5D2); -webkit-background-clip: text; color: transparent;font-style: italic;"> SAME</span>: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts</h1>
<div>
<a href='https://gengzezhou.github.io' target='_blank'>Gengze Zhou<sup>1</sup></a>;
<a href='http://www.yiconghong.me' target='_blank'>Yicong Hong<sup>2</sup></a>;
<a href='https://zunwang1.github.io' target='_blank'>Zun Wang<sup>3</sup></a>;
<a href='https://github.com/zhaoc5' target='_blank'>Chongyang Zhao<sup>4</sup></a>;
<a href='https://www.cs.unc.edu/~mbansal/' target='_blank'>Mohit Bansal<sup>3</sup></a>;
<a href='http://www.qi-wu.me' target='_blank'>Qi Wu<sup>1</sup></a>
</div>
<sup>1</sup>AIML, University of Adelaide
<sup>2</sup>Adobe Research
<sup>3</sup>UNC Chapel Hill
<sup>4</sup>UNSW Sydney
<br>
<div>
<a href='https://github.com/GengzeZhou/SAME' target='_blank'><img alt="Static Badge" src="https://img.shields.io/badge/VLNBench-v0.1-blue"></a>
<a href='https://arxiv.org/abs/2412.05552' target='_blank'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>
<a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a>
</div>
</div>
## Model Description
**SAME** (State-Adaptive Mixture of Experts) is a unified framework for language-guided visual navigation that consolidates diverse navigation tasks into a single versatile agent. Unlike previous task-specific approaches, SAME can handle both **high-level category-specific search** (e.g., "find a chair") and **low-level language-guided navigation** (e.g., detailed turn-by-turn instructions) through a novel state-adaptive Mixture of Experts (MoE) architecture.
### Key Features
- **Multi-Task Capability**: Single model handles 9 different navigation datasets simultaneously
- **State-Adaptive MoE**: Dynamic expert routing based on multimodal features (text + visual observations)
- **Simulator-Free**: Works entirely with pre-computed CLIP ViT-B/16 features - no simulator installation required
- **Flexible Architecture**: MoE can be placed at attention query, key-value, or feed-forward network positions
## Model Architecture
SAME is built on a transformer-based architecture with the following key components:
| Component | Description |
|-----------|-------------|
| **Language Encoder** | 9-layer BERT-based transformer encoder |
| **Image Embeddings** | Processes 512-dim CLIP ViT-B/16 panoramic features |
| **Local VP Encoder** | Viewport-level information with crossmodal fusion |
| **Global Map Encoder** | Global spatial graph with dynamic routing |
| **State-Adaptive MoE** | 8 experts with top-2 selection, multimodal routing |
### MoE Routing
The State-Adaptive MoE uses multimodal features (fused text + visual embeddings) to dynamically route tokens to specialized experts (a minimal routing sketch follows the list below). This allows the model to adapt its behavior based on:
- The granularity of language instructions
- Current visual observations
- Navigation task requirements
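The routing code in the repository is the reference implementation; the following is only a minimal PyTorch sketch of the idea described in this section, with top-2 selection over 8 experts and a router conditioned on a fused text + visual state feature rather than on the token alone. Module names, the hidden size, and the per-token routing granularity here are assumptions, not the released code.

```python
import torch
import torch.nn as nn


class StateAdaptiveMoE(nn.Module):
    """Minimal top-2 MoE whose router is conditioned on a multimodal state feature."""

    def __init__(self, hidden_dim=768, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each expert is a small feed-forward block; the MoE position
        # (Attn_q / Attn_kv / FFN) only changes where this module is called from.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, 4 * hidden_dim),
                nn.GELU(),
                nn.Linear(4 * hidden_dim, hidden_dim),
            )
            for _ in range(num_experts)
        ])
        # The router sees each token concatenated with the fused text + visual state.
        self.router = nn.Linear(2 * hidden_dim, num_experts)

    def forward(self, tokens, state):
        # tokens: (B, T, D) token embeddings; state: (B, D) fused multimodal feature.
        routing_in = torch.cat([tokens, state.unsqueeze(1).expand_as(tokens)], dim=-1)
        probs = self.router(routing_in).softmax(dim=-1)          # (B, T, num_experts)
        weights, idx = torch.topk(probs, self.top_k, dim=-1)     # top-2 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalise selected weights

        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                          # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(tokens[mask])
        return out


# Usage: 4 view tokens routed under a given instruction/observation state.
moe = StateAdaptiveMoE()
print(moe(torch.randn(2, 4, 768), torch.randn(2, 768)).shape)    # torch.Size([2, 4, 768])
```

Whichever of the three MoE positions is chosen (attention query, key-value, or FFN), the routing mechanism stays the same; only the call site inside the transformer block changes.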
## Intended Uses
### Primary Use Cases
- **Vision-and-Language Navigation (VLN)**: Following natural language instructions in indoor environments
- **Object Navigation**: Finding target objects given category names
- **Dialog-based Navigation**: Multi-turn conversational navigation
- **Remote Object Grounding**: Navigating to and identifying remote objects
### Supported Tasks
| Task | Dataset | Description |
|------|---------|-------------|
| Low-Level Navigation | R2R, R2R-PREVALENT, R2R-ScaleVLN | Fine-grained instruction following |
| Object Grounding | REVERIE, REVERIE-ScaleVLN | Navigate and ground remote objects |
| Long-Horizon VLN | RXR-EN | Long-horizon navigation (English) |
| Dialog Navigation | CVDN | Cooperative vision-and-dialog navigation |
| Object Search | SOON | Semantic object-oriented navigation |
| Object Navigation | ObjectNav-MP3D | Category-based object finding |
## How to Use
### Installation
```bash
git clone https://github.com/GengzeZhou/SAME.git
cd SAME
conda create --name SAME python=3.10
conda activate SAME
pip install -r requirements.txt
```
### Download Data and Models
```bash
# Download all datasets and features
python download.py --data
# Download pretrained models
python download.py --pretrain
# Download trained checkpoints (optional)
python download.py --checkpoints
```
### Training
```bash
cd src
# Single GPU training
python run.py --config_dir configs/main_multi_q.yaml
# Multi-GPU distributed training
torchrun --nproc_per_node=4 --master_port=29500 \
    run.py --config_dir configs/main_multi_q.yaml
```
### Evaluation
```bash
cd src
python run.py --config_dir configs/test.yaml \
    --options experiment.resume_file=/path/to/checkpoint.pt
```
### Configuration Options
```yaml
model:
  use_moe_layer: true
  moe_type: "Task"               # Task-based MoE
  moe_position: "Attn_q"         # Attn_q, Attn_kv, or FFN
  task_routing_feature: "multi"  # Multimodal routing (recommended)
  num_experts: 8
  num_experts_per_tok: 2         # Top-2 expert selection
```
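The project ships its own config handling in `src/`; purely as a hedged illustration (not the repository's actual API) of how dotted `--options key=value` overrides such as `experiment.resume_file=...` are typically merged into a loaded YAML config, one might write:

```python
import yaml  # pip install pyyaml


def load_config(path, overrides=()):
    """Load a YAML config and apply dotted key=value overrides (illustrative only)."""
    with open(path) as f:
        cfg = yaml.safe_load(f)
    for item in overrides:                      # e.g. "experiment.resume_file=/path/to/ckpt.pt"
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = cfg
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = yaml.safe_load(raw_value)  # reuse YAML parsing for numbers/bools/strings
    return cfg


cfg = load_config("configs/test.yaml",
                  ["experiment.resume_file=/path/to/checkpoint.pt"])
print(cfg["experiment"]["resume_file"])
```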
## Training Details
### Training Data
SAME is trained on 9 navigation datasets with weighted sampling (a sampling sketch follows the table):
| Dataset | Environment | Sampling Weight |
|---------|-------------|-----------------|
| R2R-ScaleVLN | HM3D | 10-20 |
| R2R-PREVALENT | MP3D | 1 |
| R2R | MP3D | 1 |
| REVERIE-ScaleVLN | HM3D | 1-10 |
| REVERIE | MP3D | 1 |
| RXR-EN | MP3D | 1 |
| CVDN | MP3D | 1 |
| SOON | MP3D | 1 |
| ObjectNav-MP3D | MP3D (Habitat) | 2 |
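The actual sampler lives in the training code; as a rough sketch of weighted sampling over these 9 datasets, the source of each training batch could be drawn as below. The fixed values used here for the ranged weights (15 for R2R-ScaleVLN's 10-20, 5 for REVERIE-ScaleVLN's 1-10) are assumptions for illustration.

```python
import random
from collections import Counter

# Per-batch sampling weights mirroring the table above; ranged entries are
# fixed to single values purely for illustration.
DATASET_WEIGHTS = {
    "R2R-ScaleVLN": 15, "R2R-PREVALENT": 1, "R2R": 1,
    "REVERIE-ScaleVLN": 5, "REVERIE": 1, "RXR-EN": 1,
    "CVDN": 1, "SOON": 1, "ObjectNav-MP3D": 2,
}


def sample_dataset(rng=random):
    """Pick which dataset the next training batch is drawn from."""
    names, weights = zip(*DATASET_WEIGHTS.items())
    return rng.choices(names, weights=weights, k=1)[0]


# Rough empirical mixture over 10,000 draws.
print(Counter(sample_dataset() for _ in range(10_000)).most_common(3))
```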
### Training Hyperparameters
- **Optimizer**: AdamW
- **Learning Rate**: 1e-5
- **Total Iterations**: 500,000
- **Batch Size**: 16
- **Gradient Clipping**: 0.5
- **Training Algorithm**: DAgger (Dataset Aggregation)
- **MoE Auxiliary Loss Coefficient**: 0.8
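A hedged sketch of how these settings combine in one optimisation step. The tiny model and synthetic batches below are placeholders so the snippet runs; the real SAME model's interface (which also produces the MoE load-balancing loss) differs.

```python
import torch
import torch.nn as nn

# Placeholder model/data; substitute the real SAME model and DAgger rollout batches.
model = nn.Linear(512, 4)
batches = [(torch.randn(16, 512), torch.randint(0, 4, (16,))) for _ in range(3)]  # batch size 16
criterion = nn.CrossEntropyLoss()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)   # AdamW, learning rate 1e-5

for feats, targets in batches:                               # up to 500,000 iterations in practice
    nav_loss = criterion(model(feats), targets)
    moe_aux_loss = torch.tensor(0.0)                         # returned by the MoE layers in practice
    loss = nav_loss + 0.8 * moe_aux_loss                     # MoE auxiliary loss coefficient 0.8
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)  # gradient clipping at 0.5
    optimizer.step()
```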
### Visual Features
- **Feature Extractor**: CLIP ViT-B/16
- **Feature Dimension**: 512
- **Format**: HDF5 / LMDB
- **Environments**: MatterSim, Habitat-MP3D, Habitat-HM3D
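The feature files are produced by the download script. Assuming an HDF5 layout with one `(num_views, 512)` matrix per `scan_viewpoint` key (a common convention in VLN code bases, not verified against this release), a loader could look like:

```python
import h5py
import numpy as np


def load_pano_features(h5_path: str, scan_id: str, viewpoint_id: str) -> np.ndarray:
    """Return the CLIP ViT-B/16 features for one panorama; key format is an assumption."""
    key = f"{scan_id}_{viewpoint_id}"
    with h5py.File(h5_path, "r") as f:
        feats = f[key][...].astype(np.float32)   # e.g. shape (36, 512) for 36 discretised views
    return feats
```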
## Evaluation Results
SAME achieves state-of-the-art or highly competitive performance across all navigation benchmarks as a **unified model**, outperforming task-specific approaches in many cases.
### Main Results (Unified Model)
#### Room-to-Room (R2R)
| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val Unseen | **76** | 66 |
| Test Unseen | **74** | **64** |
#### REVERIE
| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val Unseen | **46.4** | **36.1** |
| Test Unseen | **48.6** | **37.1** |
#### RxR-EN (Multilingual VLN)
| Split | SR ↑ | nDTW ↑ |
|-------|------|--------|
| Val Unseen | **50.5** | **51.2** |
#### CVDN (Dialog Navigation)
| Split | GP ↑ |
|-------|------|
| Val | **6.94** |
| Test | 7.07 |
#### SOON (Object-Oriented Navigation)
| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val Unseen | 36.1 | 25.4 |
| Test Unseen | **38.2** | **27.1** |
#### ObjectNav-MP3D
| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val | **76.3** | 42.7 |
### Evaluation Metrics
- **SR (Success Rate)**: Percentage of successful navigations (within 3m of goal)
- **SPL (Success weighted by Path Length)**: Efficiency-weighted success rate
- **nDTW (normalized Dynamic Time Warping)**: Path similarity to ground truth
- **GP (Goal Progress)**: Progress towards the goal in dialog navigation
- **NE (Navigation Error)**: Distance to goal at episode end
- **OSR (Oracle Success Rate)**: Success rate with oracle stop action
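For concreteness, SR and SPL can be computed from per-episode logs as in the sketch below (the standard formulation of Anderson et al.; the field names are illustrative, not this repository's log schema):

```python
def success_rate_and_spl(episodes, threshold=3.0):
    """episodes: dicts with final distance to goal, shortest-path length,
    and traversed path length, all in metres."""
    sr_sum, spl_sum = 0.0, 0.0
    for ep in episodes:
        success = ep["dist_to_goal"] <= threshold            # success within 3 m of the goal
        sr_sum += success
        if success:
            # Success weighted by the ratio of shortest to actually traversed path length.
            spl_sum += ep["shortest_path"] / max(ep["path_length"], ep["shortest_path"])
    n = len(episodes)
    return 100 * sr_sum / n, 100 * spl_sum / n


sr, spl = success_rate_and_spl([
    {"dist_to_goal": 1.2, "shortest_path": 8.0, "path_length": 10.0},
    {"dist_to_goal": 5.0, "shortest_path": 6.0, "path_length": 7.5},
])
print(f"SR={sr:.1f}  SPL={spl:.1f}")   # SR=50.0  SPL=40.0
```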
## Model Variants
| Variant | MoE Position | Routing | Checkpoint |
|---------|--------------|---------|------------|
| SAME-Q | Attention Query | Multimodal | `Attnq_pretrained_ckpt.pt` |
| SAME-KV | Attention K/V | Multimodal | `Attnkv_pretrained_ckpt.pt` |
| SAME-FFN | Feed-Forward | Multimodal | `FFN_pretrained_ckpt.pt` |
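Loading one of these checkpoints follows the usual PyTorch pattern; the dictionary layout below is an assumption (many VLN repositories nest weights under a `state_dict` key), so check the repository's own loading code.

```python
import torch

ckpt = torch.load("Attnq_pretrained_ckpt.pt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt  # layout assumed
print(f"{len(state_dict)} tensors in checkpoint")
# model.load_state_dict(state_dict)  # with `model` built from configs/main_multi_q.yaml
```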
## Limitations
- **Indoor Environments Only**: Trained and evaluated on indoor navigation datasets
- **Pre-computed Features**: Requires pre-extracted CLIP features; cannot process raw images directly
- **English Language**: Primary support for English instructions (though RXR provides multilingual data)
- **Static Environments**: Assumes static environments without dynamic obstacles or agents
## Environmental Impact
- **Hardware**: Training conducted on NVIDIA A100 GPUs
- **Training Time**: Approximately 2-3 days on 4x A100 GPUs
## Citation
If you find this work helpful, please cite:
```bibtex
@article{zhou2024same,
  title={SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts},
  author={Gengze Zhou and Yicong Hong and Zun Wang and Chongyang Zhao and Mohit Bansal and Qi Wu},
  journal={arXiv preprint arXiv:2412.05552},
  year={2024}
}
```
## Authors
- **Gengze Zhou** - AIML, University of Adelaide ([Website](https://gengzezhou.github.io))
- **Yicong Hong** - Adobe Research ([Website](http://www.yiconghong.me))
- **Zun Wang** - UNC Chapel Hill ([Website](https://zunwang1.github.io))
- **Chongyang Zhao** - UNSW Sydney ([GitHub](https://github.com/zhaoc5))
- **Mohit Bansal** - UNC Chapel Hill ([Website](https://www.cs.unc.edu/~mbansal/))
- **Qi Wu** - University of Adelaide ([Website](http://www.qi-wu.me))
## Acknowledgements
We extend our gratitude to:
- [Matterport3D](https://niessner.github.io/Matterport/) for the open-source platform
- [DUET](https://github.com/cshizhe/VLN-DUET) for the foundational architecture
- [ScaleVLN](https://github.com/wz0919/ScaleVLN) for augmented training data
- [NaviLLM](https://github.com/zd11024/NaviLLM) for additional insights
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Contact
For questions or issues, please open an issue on the [GitHub repository](https://github.com/GengzeZhou/SAME) or contact the authors.