AerialVLN: Vision-and-Language Navigation for UAVs

Shubo Liu; Hongsheng Zhang; Yuankai Qi; Peng Wang; Yanning Zhang; Qi Wu

doi:10.1109/ICCV51070.2023.01411

AerialVLN: Vision-and-Language Navigation for UAVs

Shubo Liu, Hongsheng Zhang, Yuankai Qi, Peng Wang, Yanning Zhang, Qi Wu

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

20 Scopus citations

Abstract

Recently emerged Vision-and-Language Navigation (VLN) tasks have drawn significant attention in both computer vision and natural language processing communities. Existing VLN tasks are built for agents that navigate on the ground, either indoors or outdoors. However, many tasks require intelligent agents to carry out in the sky, such as UAV-based goods delivery, traffic/security patrol, and scenery tour, to name a few. Navigating in the sky is more complicated than on the ground because agents need to consider the flying height and more complex spatial relationship reasoning. To fill this gap and facilitate research in this field, we propose a new task named AerialVLN, which is UAV-based and towards outdoor environments. We develop a 3D simulator rendered by near-realistic pictures of 25 city-level scenarios. Our simulator supports continuous navigation, environment extension and configuration. We also proposed an extended baseline model based on the widely-used cross-modal-alignment (CMA) navigation methods. We find that there is still a significant gap between the baseline model and human performance, which suggests AerialVLN is a new challenging task. Dataset and code is available at https://github.com/AirVLN/AirVLN.

Original language	English
Title of host publication	Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	15338-15348
Number of pages	11
ISBN (Electronic)	9798350307184
DOIs	https://doi.org/10.1109/ICCV51070.2023.01411
State	Published - 2023
Event	2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 - Paris, France Duration: 2 Oct 2023 → 6 Oct 2023

Publication series

Name	Proceedings of the IEEE International Conference on Computer Vision
ISSN (Print)	1550-5499

Conference

Conference	2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
Country/Territory	France
City	Paris
Period	2/10/23 → 6/10/23

Access to Document

10.1109/ICCV51070.2023.01411

Cite this

Liu, S., Zhang, H., Qi, Y., Wang, P., Zhang, Y., & Wu, Q. (2023). AerialVLN: Vision-and-Language Navigation for UAVs. In Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 (pp. 15338-15348). (Proceedings of the IEEE International Conference on Computer Vision). Institute of Electrical and Electronics Engineers Inc.. https://doi.org/10.1109/ICCV51070.2023.01411

@inproceedings{b4b3ea03ce054375932affc11e7ee828,

title = "AerialVLN: Vision-and-Language Navigation for UAVs",

abstract = "Recently emerged Vision-and-Language Navigation (VLN) tasks have drawn significant attention in both computer vision and natural language processing communities. Existing VLN tasks are built for agents that navigate on the ground, either indoors or outdoors. However, many tasks require intelligent agents to carry out in the sky, such as UAV-based goods delivery, traffic/security patrol, and scenery tour, to name a few. Navigating in the sky is more complicated than on the ground because agents need to consider the flying height and more complex spatial relationship reasoning. To fill this gap and facilitate research in this field, we propose a new task named AerialVLN, which is UAV-based and towards outdoor environments. We develop a 3D simulator rendered by near-realistic pictures of 25 city-level scenarios. Our simulator supports continuous navigation, environment extension and configuration. We also proposed an extended baseline model based on the widely-used cross-modal-alignment (CMA) navigation methods. We find that there is still a significant gap between the baseline model and human performance, which suggests AerialVLN is a new challenging task. Dataset and code is available at https://github.com/AirVLN/AirVLN.",

author = "Shubo Liu and Hongsheng Zhang and Yuankai Qi and Peng Wang and Yanning Zhang and Qi Wu",

note = "Publisher Copyright: {\textcopyright} 2023 IEEE.; 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 ; Conference date: 02-10-2023 Through 06-10-2023",

year = "2023",

doi = "10.1109/ICCV51070.2023.01411",

language = "英语",

series = "Proceedings of the IEEE International Conference on Computer Vision",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "15338--15348",

booktitle = "Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023",

}

Liu, S, Zhang, H, Qi, Y, Wang, P , Zhang, Y & Wu, Q 2023, AerialVLN: Vision-and-Language Navigation for UAVs. in Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023. Proceedings of the IEEE International Conference on Computer Vision, Institute of Electrical and Electronics Engineers Inc., pp. 15338-15348, 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, 2/10/23. https://doi.org/10.1109/ICCV51070.2023.01411

AerialVLN: Vision-and-Language Navigation for UAVs. / Liu, Shubo; Zhang, Hongsheng; Qi, Yuankai et al.
Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023. Institute of Electrical and Electronics Engineers Inc., 2023. p. 15338-15348 (Proceedings of the IEEE International Conference on Computer Vision).

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

TY - GEN

T1 - AerialVLN

T2 - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023

AU - Liu, Shubo

AU - Zhang, Hongsheng

AU - Qi, Yuankai

AU - Wang, Peng

AU - Zhang, Yanning

AU - Wu, Qi

PY - 2023

Y1 - 2023

N2 - Recently emerged Vision-and-Language Navigation (VLN) tasks have drawn significant attention in both computer vision and natural language processing communities. Existing VLN tasks are built for agents that navigate on the ground, either indoors or outdoors. However, many tasks require intelligent agents to carry out in the sky, such as UAV-based goods delivery, traffic/security patrol, and scenery tour, to name a few. Navigating in the sky is more complicated than on the ground because agents need to consider the flying height and more complex spatial relationship reasoning. To fill this gap and facilitate research in this field, we propose a new task named AerialVLN, which is UAV-based and towards outdoor environments. We develop a 3D simulator rendered by near-realistic pictures of 25 city-level scenarios. Our simulator supports continuous navigation, environment extension and configuration. We also proposed an extended baseline model based on the widely-used cross-modal-alignment (CMA) navigation methods. We find that there is still a significant gap between the baseline model and human performance, which suggests AerialVLN is a new challenging task. Dataset and code is available at https://github.com/AirVLN/AirVLN.

AB - Recently emerged Vision-and-Language Navigation (VLN) tasks have drawn significant attention in both computer vision and natural language processing communities. Existing VLN tasks are built for agents that navigate on the ground, either indoors or outdoors. However, many tasks require intelligent agents to carry out in the sky, such as UAV-based goods delivery, traffic/security patrol, and scenery tour, to name a few. Navigating in the sky is more complicated than on the ground because agents need to consider the flying height and more complex spatial relationship reasoning. To fill this gap and facilitate research in this field, we propose a new task named AerialVLN, which is UAV-based and towards outdoor environments. We develop a 3D simulator rendered by near-realistic pictures of 25 city-level scenarios. Our simulator supports continuous navigation, environment extension and configuration. We also proposed an extended baseline model based on the widely-used cross-modal-alignment (CMA) navigation methods. We find that there is still a significant gap between the baseline model and human performance, which suggests AerialVLN is a new challenging task. Dataset and code is available at https://github.com/AirVLN/AirVLN.

UR - http://www.scopus.com/inward/record.url?scp=85185875542&partnerID=8YFLogxK

U2 - 10.1109/ICCV51070.2023.01411

DO - 10.1109/ICCV51070.2023.01411

M3 - 会议稿件

AN - SCOPUS:85185875542

T3 - Proceedings of the IEEE International Conference on Computer Vision

SP - 15338

EP - 15348

BT - Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023

PB - Institute of Electrical and Electronics Engineers Inc.

Y2 - 2 October 2023 through 6 October 2023

ER -

AerialVLN: Vision-and-Language Navigation for UAVs

Abstract

Publication series

Conference

Access to Document

Other files and links

Fingerprint

Cite this