---
language:
- en
license: mit
library_name: same
tags:
- vision-language
- navigation
- embodied-ai
- visual-navigation
- mixture-of-experts
- multimodal
- pytorch
datasets:
- R2R
- REVERIE
- RXR
- CVDN
- SOON
- ObjectNav-MP3D
metrics:
- success_rate
- spl
pipeline_tag: visual-question-answering
model-index:
- name: SAME
  results:
  - task:
      type: visual-navigation
      name: Vision-and-Language Navigation
    dataset:
      type: R2R
      name: Room-to-Room (R2R)
    metrics:
    - type: success_rate
      value: 76
      name: SR (val_unseen)
    - type: spl
      value: 66
      name: SPL (val_unseen)
    - type: success_rate
      value: 74
      name: SR (test_unseen)
    - type: spl
      value: 64
      name: SPL (test_unseen)
  - task:
      type: visual-navigation
      name: Vision-and-Language Navigation
    dataset:
      type: REVERIE
      name: REVERIE
    metrics:
    - type: success_rate
      value: 46.4
      name: SR (val_unseen)
    - type: spl
      value: 36.1
      name: SPL (val_unseen)
    - type: success_rate
      value: 48.6
      name: SR (test_unseen)
    - type: spl
      value: 37.1
      name: SPL (test_unseen)
  - task:
      type: visual-navigation
      name: Multilingual VLN
    dataset:
      type: RXR
      name: RxR-EN
    metrics:
    - type: success_rate
      value: 50.5
      name: SR (val_unseen)
    - type: ndtw
      value: 51.2
      name: nDTW (val_unseen)
  - task:
      type: visual-navigation
      name: Dialog Navigation
    dataset:
      type: CVDN
      name: CVDN
    metrics:
    - type: goal_progress
      value: 6.94
      name: GP (val)
    - type: goal_progress
      value: 7.07
      name: GP (test)
  - task:
      type: visual-navigation
      name: Object-Oriented Navigation
    dataset:
      type: SOON
      name: SOON
    metrics:
    - type: success_rate
      value: 36.1
      name: SR (val_unseen)
    - type: spl
      value: 25.4
      name: SPL (val_unseen)
    - type: success_rate
      value: 38.2
      name: SR (test_unseen)
    - type: spl
      value: 27.1
      name: SPL (test_unseen)
  - task:
      type: object-navigation
      name: Object Navigation
    dataset:
      type: ObjectNav-MP3D
      name: ObjectNav-MP3D
    metrics:
    - type: success_rate
      value: 76.3
      name: SR (val)
    - type: spl
      value: 42.7
      name: SPL (val)
---

<div align="center">

<h1><span style="background: linear-gradient(to right, #007BA7, #99B5D2); -webkit-background-clip: text; color: transparent;font-style: italic;"> SAME</span>: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts</h1>

<div>
    <a href='https://gengzezhou.github.io' target='_blank'>Gengze Zhou<sup>🍕</sup></a>;
    <a href='http://www.yiconghong.me' target='_blank'>Yicong Hong<sup>🌭</sup></a>;
    <a href='https://zunwang1.github.io' target='_blank'>Zun Wang<sup>🍔</sup></a>;
    <a href='https://github.com/zhaoc5' target='_blank'>Chongyang Zhao<sup>🌮</sup></a>;
    <a href='https://www.cs.unc.edu/~mbansal/' target='_blank'>Mohit Bansal<sup>🍔</sup></a>;
    <a href='http://www.qi-wu.me' target='_blank'>Qi Wu<sup>🍕</sup></a>
</div>
<sup>🍕</sup>AIML, University of Adelaide
<sup>🌭</sup>Adobe Research
<sup>🍔</sup>UNC, Chapel Hill
<sup>🌮</sup>UNSW Sydney

<br>

<div>
    <a href='https://github.com/GengzeZhou/SAME' target='_blank'><img alt="Static Badge" src="https://img.shields.io/badge/VLNBench-v0.1-blue"></a>
    <a href='https://arxiv.org/abs/2412.05552' target='_blank'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>
    <a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a>
</div>

</div>

## Model Description

**SAME** (State-Adaptive Mixture of Experts) is a unified framework for language-guided visual navigation that consolidates diverse navigation tasks into a single versatile agent. Unlike previous task-specific approaches, SAME can handle both **high-level category-specific search** (e.g., "find a chair") and **low-level language-guided navigation** (e.g., detailed turn-by-turn instructions) through a novel state-adaptive Mixture of Experts (MoE) architecture.

### Key Features

- **Multi-Task Capability**: Single model handles 9 different navigation datasets simultaneously
- **State-Adaptive MoE**: Dynamic expert routing based on multimodal features (text + visual observations)
- **Simulator-Free**: Works entirely with pre-computed CLIP ViT-B/16 features; no simulator installation required
- **Flexible Architecture**: MoE can be placed at attention query, key-value, or feed-forward network positions

## Model Architecture

SAME is built on a transformer-based architecture with the following key components:

| Component | Description |
|-----------|-------------|
| **Language Encoder** | 9-layer BERT-based transformer encoder |
| **Image Embeddings** | Processes 512-dim CLIP ViT-B/16 panoramic features |
| **Local VP Encoder** | Viewport-level information with crossmodal fusion |
| **Global Map Encoder** | Global spatial graph with dynamic routing |
| **State-Adaptive MoE** | 8 experts with top-2 selection, multimodal routing |

### MoE Routing

The State-Adaptive MoE uses multimodal features (fused text + visual embeddings) to dynamically route tokens to specialized experts (see the illustrative sketch after the list below). This allows the model to adapt its behavior based on:
- The granularity of language instructions
- Current visual observations
- Navigation task requirements
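
The exact routing code lives in the SAME repository; purely as an illustration, the sketch below shows a generic top-2 router in PyTorch whose gate is conditioned on a fused text+visual state vector. Class and variable names are hypothetical, and the experts are shown as FFNs for simplicity, even though SAME also supports attention-query and key-value placements.

```python
# Toy state-adaptive top-2 MoE layer (hypothetical names; not the SAME implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateAdaptiveTop2MoE(nn.Module):
    def __init__(self, hidden_dim: int = 768, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(hidden_dim, num_experts)        # routing logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
                          nn.Linear(hidden_dim, hidden_dim))
            for _ in range(num_experts)
        )

    def forward(self, tokens: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, hidden); state: (batch, hidden) fused text+visual feature
        logits = self.gate(state)                             # (batch, num_experts)
        top_w, top_idx = logits.topk(self.k, dim=-1)          # pick the top-2 experts
        top_w = F.softmax(top_w, dim=-1)                      # renormalise over the two
        out = torch.zeros_like(tokens)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                picked = top_idx[:, slot] == e                # examples routed to expert e
                if picked.any():
                    out[picked] += top_w[picked, slot, None, None] * expert(tokens[picked])
        return out

# Example: route a 16-token sequence using a fused multimodal state vector.
# out = StateAdaptiveTop2MoE()(torch.randn(2, 16, 768), torch.randn(2, 768))
```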

## Intended Uses

### Primary Use Cases

- **Vision-and-Language Navigation (VLN)**: Following natural language instructions in indoor environments
- **Object Navigation**: Finding target objects given category names
- **Dialog-based Navigation**: Multi-turn conversational navigation
- **Remote Object Grounding**: Navigating to and identifying remote objects

### Supported Tasks

| Task | Dataset | Description |
|------|---------|-------------|
| Low-Level Navigation | R2R, R2R-PREVALENT, R2R-ScaleVLN | Fine-grained instruction following |
| Object Grounding | REVERIE, REVERIE-ScaleVLN | Navigate and ground remote objects |
| Long-Horizon VLN | RXR-EN | Long-horizon navigation (English) |
| Dialog Navigation | CVDN | Cooperative vision-and-dialog navigation |
| Object Search | SOON | Semantic object-oriented navigation |
| Object Navigation | ObjectNav-MP3D | Category-based object finding |

## How to Use

### Installation

```bash
git clone https://github.com/GengzeZhou/SAME.git
cd SAME
conda create --name SAME python=3.10
conda activate SAME
pip install -r requirements.txt
```

### Download Data and Models

```bash
# Download all datasets and features
python download.py --data

# Download pretrained models
python download.py --pretrain

# Download trained checkpoints (optional)
python download.py --checkpoints
```

### Training

```bash
cd src

# Single GPU training
python run.py --config_dir configs/main_multi_q.yaml

# Multi-GPU distributed training
torchrun --nproc_per_node=4 --master_port=29500 \
    run.py --config_dir configs/main_multi_q.yaml
```

### Evaluation

```bash
cd src
python run.py --config_dir configs/test.yaml \
    --options experiment.resume_file=/path/to/checkpoint.pt
```

### Configuration Options

```yaml
model:
  use_moe_layer: true
  moe_type: "Task"              # Task-based MoE
  moe_position: "Attn_q"        # Attn_q, Attn_kv, or FFN
  task_routing_feature: "multi" # Multimodal routing (recommended)
  num_experts: 8
  num_experts_per_tok: 2        # Top-2 expert selection
```
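
The YAML above can be edited directly, or individual keys can be overridden from the command line via `--options` as in the evaluation example. Below is a minimal sketch of that dotted-override pattern, assuming a plain PyYAML load; it is not the repository's actual config loader.

```python
# Hedged sketch: load a config like the one above and apply a dotted "key=value"
# override in the spirit of the `--options` flag (not the repository's loader).
import yaml

def apply_override(cfg: dict, dotted_key: str, value) -> None:
    """Set cfg['a']['b'] = value for an override written as 'a.b=value'."""
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value

with open("configs/main_multi_q.yaml") as f:      # path from the training command above
    cfg = yaml.safe_load(f)

apply_override(cfg, "experiment.resume_file", "/path/to/checkpoint.pt")
print(cfg.get("model", {}).get("moe_position"), cfg["experiment"]["resume_file"])
```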

## Training Details

### Training Data

SAME is trained on 9 navigation datasets with weighted sampling (see the sketch after the table):

| Dataset | Environment | Sampling Weight |
|---------|-------------|-----------------|
| R2R-ScaleVLN | HM3D | 10-20 |
| R2R-PREVALENT | MP3D | 1 |
| R2R | MP3D | 1 |
| REVERIE-ScaleVLN | HM3D | 1-10 |
| REVERIE | MP3D | 1 |
| RXR-EN | MP3D | 1 |
| CVDN | MP3D | 1 |
| SOON | MP3D | 1 |
| ObjectNav-MP3D | MP3D (Habitat) | 2 |
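
For illustration only, the snippet below mixes datasets in proportion to such weights with `random.choices`; the dataset names mirror the table, but the weights (taken at the low end of the ScaleVLN ranges) and the batching logic are placeholders rather than the repository's data pipeline.

```python
# Illustrative weighted multi-dataset sampling (placeholder weights and logic;
# not the repository's actual data pipeline).
import random

dataset_weights = {
    "R2R-ScaleVLN": 10, "R2R-PREVALENT": 1, "R2R": 1,
    "REVERIE-ScaleVLN": 1, "REVERIE": 1, "RXR-EN": 1,
    "CVDN": 1, "SOON": 1, "ObjectNav-MP3D": 2,
}
names = list(dataset_weights)
weights = [dataset_weights[n] for n in names]

def sample_batch_source(rng: random.Random) -> str:
    """Pick which dataset the next training batch is drawn from."""
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {n: 0 for n in names}
for _ in range(1000):
    counts[sample_batch_source(rng)] += 1
print(counts)  # R2R-ScaleVLN dominates roughly in proportion to its weight
```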

### Training Hyperparameters

- **Optimizer**: AdamW
- **Learning Rate**: 1e-5
- **Total Iterations**: 500,000
- **Batch Size**: 16
- **Gradient Clipping**: 0.5
- **Training Algorithm**: DAgger (Dataset Aggregation)
- **MoE Auxiliary Loss Coefficient**: 0.8
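
As a hedged illustration of how these settings fit together (the `model(batch)` interface below is an assumption, not the SAME training script), a single training step with AdamW, gradient clipping at 0.5, and the MoE auxiliary loss weighted by 0.8 might look like:

```python
# Illustrative training step wiring the hyperparameters above together
# (hypothetical model/batch interface; not the repository's training code).
import torch

def train_step(model, batch, optimizer, moe_aux_coef=0.8, clip_norm=0.5):
    optimizer.zero_grad()
    nav_loss, moe_aux_loss = model(batch)           # assumed to return both loss terms
    loss = nav_loss + moe_aux_coef * moe_aux_loss   # weight the MoE balancing loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```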

### Visual Features

- **Feature Extractor**: CLIP ViT-B/16
- **Feature Dimension**: 512
- **Format**: HDF5 / LMDB
- **Environments**: MatterSim, Habitat-MP3D, Habitat-HM3D
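
As an example of consuming such pre-computed features (the file name and key layout below are assumptions, not the repository's actual storage schema), a 512-dim CLIP ViT-B/16 feature matrix can be read from HDF5 like this:

```python
# Illustrative read of pre-computed CLIP ViT-B/16 panorama features from HDF5
# (file path and key layout are assumed, not the repository's actual schema).
import h5py
import numpy as np

with h5py.File("clip_vit_b16_features.hdf5", "r") as f:
    key = next(iter(f.keys()))     # e.g. one scan/viewpoint entry
    feats = np.asarray(f[key])     # expected shape: (num_views, 512)
    print(key, feats.shape, feats.dtype)
```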

## Evaluation Results

SAME achieves state-of-the-art or highly competitive performance across all navigation benchmarks as a **unified model**, outperforming task-specific approaches in many cases.

### Main Results (Unified Model)

#### Room-to-Room (R2R)

| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val Unseen | **76** | 66 |
| Test Unseen | **74** | **64** |

#### REVERIE

| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val Unseen | **46.4** | **36.1** |
| Test Unseen | **48.6** | **37.1** |

#### RxR-EN (Multilingual VLN)

| Split | SR ↑ | nDTW ↑ |
|-------|------|--------|
| Val Unseen | **50.5** | **51.2** |

#### CVDN (Dialog Navigation)

| Split | GP ↑ |
|-------|------|
| Val | **6.94** |
| Test | 7.07 |

#### SOON (Object-Oriented Navigation)

| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val Unseen | 36.1 | 25.4 |
| Test Unseen | **38.2** | **27.1** |

#### ObjectNav-MP3D

| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val | **76.3** | 42.7 |

### Evaluation Metrics

- **SR (Success Rate)**: Percentage of successful navigations (within 3m of goal)
- **SPL (Success weighted by Path Length)**: Efficiency-weighted success rate
- **nDTW (normalized Dynamic Time Warping)**: Path similarity to ground truth
- **GP (Goal Progress)**: Progress towards the goal in dialog navigation
- **NE (Navigation Error)**: Distance to goal at episode end
- **OSR (Oracle Success Rate)**: Success rate with oracle stop action
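
For reference, SR and SPL can be computed from per-episode records as in the generic sketch below (standard definitions, not the repository's evaluation code):

```python
# Generic SR / SPL computation from per-episode records (standard definitions;
# not the repository's evaluation code).
def success_rate(episodes, threshold_m=3.0):
    """Fraction of episodes that end within `threshold_m` metres of the goal."""
    return sum(ep["nav_error"] <= threshold_m for ep in episodes) / len(episodes)

def spl(episodes, threshold_m=3.0):
    """Success weighted by shortest_path / max(agent_path, shortest_path)."""
    total = 0.0
    for ep in episodes:
        success = ep["nav_error"] <= threshold_m
        total += success * ep["shortest_path"] / max(ep["agent_path"], ep["shortest_path"])
    return total / len(episodes)

episodes = [
    {"nav_error": 1.2, "shortest_path": 10.0, "agent_path": 12.5},
    {"nav_error": 4.8, "shortest_path": 8.0,  "agent_path": 9.0},
]
print(success_rate(episodes), spl(episodes))  # 0.5 0.4
```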

## Model Variants

| Variant | MoE Position | Routing | Checkpoint |
|---------|--------------|---------|------------|
| SAME-Q | Attention Query | Multimodal | `Attnq_pretrained_ckpt.pt` |
| SAME-KV | Attention K/V | Multimodal | `Attnkv_pretrained_ckpt.pt` |
| SAME-FFN | Feed-Forward | Multimodal | `FFN_pretrained_ckpt.pt` |
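
These checkpoints can also be inspected or loaded programmatically; below is a minimal sketch, assuming a standard PyTorch state dict and a hypothetical `build_model` constructor that is not part of the released code:

```python
# Minimal checkpoint-loading sketch (assumes a standard PyTorch state dict;
# `build_model` is hypothetical and not part of the SAME codebase).
import torch

ckpt = torch.load("Attnq_pretrained_ckpt.pt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)   # handle either wrapped or bare dicts
print(len(state_dict), "tensors")
# model = build_model(moe_position="Attn_q", num_experts=8)
# model.load_state_dict(state_dict, strict=False)
```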

## Limitations

- **Indoor Environments Only**: Trained and evaluated on indoor navigation datasets
- **Pre-computed Features**: Requires pre-extracted CLIP features; cannot process raw images directly
- **English Language**: Primary support for English instructions (though RXR provides multilingual data)
- **Static Environments**: Assumes static environments without dynamic obstacles or agents

## Environmental Impact

- **Hardware**: Training conducted on NVIDIA A100 GPUs
- **Training Time**: Approximately 2-3 days on 4x A100 GPUs

## Citation

If you find this work helpful, please cite:

```bibtex
@article{zhou2024same,
  title={SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts},
  author={Gengze Zhou and Yicong Hong and Zun Wang and Chongyang Zhao and Mohit Bansal and Qi Wu},
  journal={arXiv preprint arXiv:2412.05552},
  year={2024},
}
```

## Authors

- **Gengze Zhou** - AIML, University of Adelaide ([Website](https://gengzezhou.github.io))
- **Yicong Hong** - Adobe Research ([Website](http://www.yiconghong.me))
- **Zun Wang** - UNC Chapel Hill ([Website](https://zunwang1.github.io))
- **Chongyang Zhao** - UNSW Sydney ([GitHub](https://github.com/zhaoc5))
- **Mohit Bansal** - UNC Chapel Hill ([Website](https://www.cs.unc.edu/~mbansal/))
- **Qi Wu** - University of Adelaide ([Website](http://www.qi-wu.me))

## Acknowledgements

We extend our gratitude to:
- [Matterport3D](https://niessner.github.io/Matterport/) for the open-source platform
- [DUET](https://github.com/cshizhe/VLN-DUET) for the foundational architecture
- [ScaleVLN](https://github.com/wz0919/ScaleVLN) for augmented training data
- [NaviLLM](https://github.com/zd11024/NaviLLM) for additional insights

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/GengzeZhou/SAME) or contact the authors.