FloorPlan-VLN: A New Paradigm for Floor Plan Guided Vision-Language Navigation

Kehan Chen; Yan Huang; Dong An; Jiawei He; Yifei Su; Jing Liu; Nianfeng Liu; Liang Wang+;

[Paper]

Existing Vision-Language Navigation (VLN) paradigms require agents to follow verbose instructions without global spatial priors, limiting their capability to reason about spatial structures. Although human-readable spatial schematics (e.g., floor plans or guide maps) are ubiquitous in real-world buildings, current agents lack the cognitive ability to comprehend and utilize them. To bridge this gap, we introduce FloorPlan-VLN, a new paradigm that leverages structured semantic floor plans as global spatial priors to enable navigation with only concise, practical instructions. First, we construct the FloorPlan-VLN dataset, comprising over 10K episodes across 72 scenes, pairing more than 100 semantically annotated floor plans with Matterport3D-based navigation trajectories and concise instructions. Then, we propose a simple yet effective method FP-Nav that uses a dual-view, spatio-temporally aligned video sequence, and auxiliary reasoning tasks to align observations, floor plans, and instructions. When evaluated under this new benchmark, our method significantly outperforms adapted state-of-the-art VLN baselines, achieving more than a 60% relative improvement in navigation success rate. Furthermore, comprehensive noise modeling and real-world deployments demonstrate the feasibility and robustness of FP-Nav to actuation drift and geometric map distortions. These results validate the effectiveness of floor plan guided navigation and highlight FloorPlan-VLN as a promising step toward more spatially intelligent navigation.

TODO:

Release MP3D floor plan
Release FPNav model
Release finetune json files
Release Val-seen and Val-unseen evaluation files
Release FloorPlan-VLN-R2R finetune data
Release FloorPlan-VLN-RxR finetune data
Release FloorPlan Dataset construction code

Quick Start:

Data and Model

Download Matterport3D Scenes: Matterport3D (MP3D) scene reconstructions are used. The official Matterport3D download script (download_mp.py) can be accessed by following the instructions on their project webpage. The scene data can then be downloaded:
```
# requires running with python 2.7
python download_mp.py --task habitat -o data/scene_datasets/mp3d/
```
Extract such that it has the form scene_datasets/mp3d/{scene}/{scene}.glb. There should be 90 scenes. Place the scene_datasets folder in data/.
Download Matterport-3D Floor Plans
Download FloorPlan-VLN-R2R and FloorPlan-VLN-RxR data for evaluation.

Download FloorPlan-VLN-R2R-tar for finetune:

cd FloorPlan-VLN-R2R-tar
cat FloorPlan-VLN-R2R.tar.* > FloorPlan-VLN-R2R.tar
tar -xvf FloorPlan-VLN-R2R.tar

Download FloorPlan-VLN-RxR-tar for finetune:

cd FloorPlan-VLN-RxR-tar
cat FloorPlan-VLN-RxR.tar.* > FloorPlan-VLN-RxR.tar
tar -xvf FloorPlan-VLN-RxR.tar

Download FloorPlan-VLN-R2R-finetune-json and FloorPlan-VLN-RxR-finetune-json for finetune
Download FP-Nav model for evaluation.
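For machines without `cat` and `tar`, the split archives above can also be reassembled from Python. The following is a minimal sketch, not part of the repository: the helper name `reassemble_and_extract` is hypothetical, and it assumes the part suffixes sort lexicographically in the same order the shell glob `*.tar.*` would produce.

```python
import shutil
import tarfile
from pathlib import Path


def reassemble_and_extract(part_dir, archive_name, out_dir=None):
    """Join `<archive_name>.*` parts in name order, then extract the tar.

    Hypothetical helper mirroring `cat NAME.tar.* > NAME.tar; tar -xvf NAME.tar`.
    """
    part_dir = Path(part_dir)
    out_dir = Path(out_dir) if out_dir else part_dir
    # Assumption: lexicographic order equals the intended part order.
    parts = sorted(part_dir.glob(archive_name + ".*"))
    if not parts:
        raise FileNotFoundError(f"no parts matching {archive_name}.* in {part_dir}")
    archive = part_dir / archive_name
    with open(archive, "wb") as joined:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, joined)  # stream, do not load into memory
    with tarfile.open(archive) as tar:
        tar.extractall(out_dir)
    return archive
```

For example, `reassemble_and_extract("FloorPlan-VLN-R2R-tar", "FloorPlan-VLN-R2R.tar")` would correspond to the R2R commands above. Only extract archives obtained from a source you trust.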

Overall, data are organized as follows:

FloorPlan-VLN
├── models
│    ├── Qwen-2.5-VL-7B-Instruct
│    └── fp-nav-vision-r2r-rxr-lr-4e-5-videoframe6
├── qwen-vl-finetune
│    ├── data
│    │   ├── mp3d_floorplan
│    │   │    ├── 1LXtFkjw3qL/floorplan.json
│    │   │    ├── ...
│    │   ├── FloorPlan-VLN-R2R
│    │   │    └── r2r/train
│    │   │             ├── 1LXtFkjw3qL/0.mp4
│    │   │             ├── ...
│    │   ├── FloorPlan-VLN-RxR
│    │   │    └── rxr/train
│    │   │             ├── 1LXtFkjw3qL/0.mp4
│    │   │             ├── ...
│    │   ├── floorplan_vln_r2r_finetune.json
│    │   ├── floorplan_vln_rxr_finetune.json
│    ├── scripts
│    ├── ...
├── VLN-CE
│    ├── data
│    │   ├── datasets
│    │   │    ├── FloorPlan-VLN-R2R
│    │   │    │        ├── val_seen
│    │   │    │        │      ├── val_seen.json.gz
│    │   │    │        │      └── val_seen_gt.json.gz
│    │   │    │        └── val_unseen
│    │   │    │               ├── val_unseen.json.gz
│    │   │    │               └── val_unseen_gt.json.gz
│    │   │    └── FloorPlan-VLN-RxR
│    │   │             ├── val_seen
│    │   │             │      ├── val_seen.json.gz
│    │   │             │      └── val_seen_gt.json.gz
│    │   │             └── val_unseen
│    │   │                    ├── val_unseen.json.gz
│    │   │                    └── val_unseen_gt.json.gz
│    │   └── scene_datasets/mp3d
│    ├── ...
├── ...
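Before finetuning or evaluating, it may help to confirm that the layout above is in place. The sketch below is not part of the repository: `missing_paths` and `scenes_without_glb` are hypothetical helpers, and `EXPECTED_PATHS` lists only a subset of the tree, copied from the listing above.

```python
from pathlib import Path

# A subset of the tree above, relative to the FloorPlan-VLN root.
EXPECTED_PATHS = [
    "models/fp-nav-vision-r2r-rxr-lr-4e-5-videoframe6",
    "qwen-vl-finetune/data/mp3d_floorplan",
    "qwen-vl-finetune/data/floorplan_vln_r2r_finetune.json",
    "qwen-vl-finetune/data/floorplan_vln_rxr_finetune.json",
    "VLN-CE/data/datasets/FloorPlan-VLN-R2R/val_seen/val_seen.json.gz",
    "VLN-CE/data/datasets/FloorPlan-VLN-R2R/val_unseen/val_unseen.json.gz",
    "VLN-CE/data/scene_datasets/mp3d",
]


def missing_paths(root):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


def scenes_without_glb(mp3d_dir):
    """Return scene folders that lack the {scene}/{scene}.glb file."""
    mp3d_dir = Path(mp3d_dir)
    return sorted(
        d.name
        for d in mp3d_dir.iterdir()
        if d.is_dir() and not (d / f"{d.name}.glb").is_file()
    )
```

Running `missing_paths("FloorPlan-VLN")` should return an empty list once everything is downloaded, and the mp3d folder should contain 90 scene directories, none of which are reported by `scenes_without_glb`.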

Installation for finetuning

NOTE: We finetune Qwen-2.5-VL-7B on 4 H100 80GB GPUs.

Create a virtual environment. We develop this project with Python 3.12:
```
conda create -n FP-Nav python==3.12.7
conda activate FP-Nav
```

Install Qwen-2.5-VL requirements (More details in README_QWEN.md):

pip install transformers==4.51.3 accelerate
cd qwen-vl-finetune
pip install -r requirements.txt

See README_Finetune.md for the full requirements and further details.

Finetune

To reproduce the fine-tuning process described in our paper, you will need the pre-trained Qwen model. You can set this up in two ways:

Option 1 (Automatic): Directly use the Hugging Face Repo ID Qwen/Qwen2.5-VL-7B-Instruct in the bash script.
Option 2 (Manual): Download the model weights manually from Hugging Face and place them in the following directory: FloorPlan-VLN/models/Qwen-2.5-VL-7B-Instruct/.

cd FloorPlan-VLN/qwen-vl-finetune
bash scripts/sft_7b_fpnav_r2r_rxr_vision.sh # remember to specify the 'llm' path in bash file.
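The choice between the two options can be expressed as a small fallback rule: use the local weights if they were downloaded manually, otherwise fall back to the Hugging Face repo ID. The sketch below only illustrates that rule; `resolve_llm_path` is a hypothetical helper, and the finetuning script itself may read the path differently.

```python
from pathlib import Path

HF_REPO_ID = "Qwen/Qwen2.5-VL-7B-Instruct"          # Option 1 (automatic)
LOCAL_MODEL_DIR = "models/Qwen-2.5-VL-7B-Instruct"  # Option 2 (manual)


def resolve_llm_path(project_root):
    """Return the local model directory if it is populated, else the repo ID."""
    local = Path(project_root) / LOCAL_MODEL_DIR
    # Assumption: a non-empty directory means the weights were placed manually.
    if local.is_dir() and any(local.iterdir()):
        return str(local)
    return HF_REPO_ID
```

The returned string is the value to put in the `llm` variable of the bash script.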

Installation for evaluation

NOTE: Habitat may fail to install on some GPUs, such as the H100. We evaluate on 8 3090 GPUs.

Activate the conda environment:
```
conda activate FP-Nav
```

Install habitat-sim-v0.1.7 for a machine with multiple GPUs or without an attached display (i.e. a cluster):

git clone https://github.com/facebookresearch/habitat-sim.git
cd habitat-sim
git checkout tags/v0.1.7
pip install -r requirements.txt
python setup.py install --headless

Install habitat-lab-v0.1.7:

git clone https://github.com/facebookresearch/habitat-lab.git
cd habitat-lab
git checkout tags/v0.1.7
cd habitat_baselines/rl
vi requirements.txt # delete tensorflow==1.13.1
cd ../../ # return to the habitat-lab directory

pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html

pip install -r requirements.txt
python setup.py develop --all # installs habitat and habitat_baselines; if the installation fails, try again, since most failures are caused by network problems

# if the installation still fails, try: pip install -e .[all]

If you still cannot install habitat, follow the Official Habitat Installation Guide to install habitat-lab and habitat-sim. We use version v0.1.7 in our experiments, the same as VLN-CE; refer to the VLN-CE page for more details.

Install requirements:

cd FloorPlan-VLN/VLN-CE
pip install -r requirements.txt

Evaluate

Before starting evaluation, you should:

Check fp-nav.yaml and set the correct BASE_TASK_CONFIG_PATH and SPLIT
Check fp-nav-r2r.yaml and set the correct DATA_PATH and SCENE_DIR
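To double-check which values are currently set, the config files can be scanned for the four keys named above. This is a rough sketch that assumes nothing about the YAML nesting and simply matches `KEY: value` lines; `find_settings` is a hypothetical helper, and the locations of the two YAML files inside the repository are not given here.

```python
import re

REQUIRED_KEYS = ("BASE_TASK_CONFIG_PATH", "SPLIT", "DATA_PATH", "SCENE_DIR")


def find_settings(yaml_text, keys=REQUIRED_KEYS):
    """Map each key to every value assigned to it anywhere in the text."""
    found = {key: [] for key in keys}
    for line in yaml_text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments (naive)
        match = re.match(r"\s*(\w+)\s*:\s*(\S.*)?$", line)
        if match and match.group(1) in found:
            found[match.group(1)].append((match.group(2) or "").strip())
    return found
```

Passing the contents of fp-nav.yaml or fp-nav-r2r.yaml and printing the result shows at a glance whether a key is missing or still points to a placeholder path.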

To start multi-gpu evaluation:

cd FloorPlan-VLN
bash eval_fp_nav.sh

The results will be saved to tmp/<your experiment name>. To calculate the metrics, run:

cd FloorPlan-VLN
python analyze_results.py --path tmp/<your experiment name>

Dataset construction

cd FloorPlan-VLN-Dataset

The steps to construct the FloorPlan-VLN-R2R/RxR Dataset, Navigation QA Dataset and Auxiliary Task Dataset for finetuning are as follows:

(The data construction process is quite complex. I haven't had time to integrate the code further yet, so I'm providing a brief explanation rather than a detailed execution guide.)

collect_navigation_dataset.py: construct mp3d floor plans and filter valid episodes.
collect_navigation_step_images.py: record navigation images step by step in habitat.
images2videos.py: turn the images into videos for each episode.
rebalance_actions.py: merge consecutive actions and upsample low-frequency actions such as stop.
resample_videos_mp.py: resample videos according to the rebalanced actions.
collect_floorplan_step_images.py: plot trajectories (after action rebalancing) on floor plans.
images2videos.py: create floor plan navigation videos.
concate_videos.py: create spatio-temporally aligned videos.
construct_floorplan_instruction.py: construct concise instructions for FloorPlan-VLN that only refer to the start region, target region and stop conditions.
construct_auxiliary_tasks.py: construct auxiliary tasks.
construct_finetune_json_files: construct QA samples for finetune.
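As an illustration of the action rebalancing step, the sketch below merges runs of identical actions and then repeats steps whose action is rare. It is a guess at the general idea only, not the logic of rebalance_actions.py: the function names, the action labels and the `min_ratio` rule are all assumptions.

```python
from collections import Counter
from itertools import groupby


def merge_consecutive(actions):
    """Collapse runs of the same action into (action, run_length) pairs."""
    return [(action, sum(1 for _ in run)) for action, run in groupby(actions)]


def upsample_rare(steps, min_ratio=1.0):
    """Repeat steps of rare actions until each action reaches roughly
    `min_ratio` times the count of the most frequent action."""
    if not steps:
        return []
    counts = Counter(action for action, _ in steps)
    most_common = max(counts.values())
    balanced = []
    for step in steps:
        # Hypothetical rule: the repeat factor scales inversely with frequency.
        repeats = max(1, round(min_ratio * most_common / counts[step[0]]))
        balanced.extend([step] * repeats)
    return balanced


# Hypothetical action labels, for illustration only.
actions = ["forward", "forward", "forward", "left", "forward", "forward", "stop"]
steps = merge_consecutive(actions)
balanced = upsample_rare(steps)
```

Here `steps` becomes four merged steps, and `balanced` repeats the single `left` and `stop` steps so that they appear as often as the merged `forward` steps.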

Contact Information

kehan.chen@cripac.ia.ac.cn, Kehan Chen

Acknowledgements

Our implementation is partially inspired by NaVid and NaVILA. Thanks for their great work!
