Mixed land use measurement and mapping with street view images and spatial context-aware prompts via zero-shot multimodal learning

Bibliographic Details
Published in: International Journal of Applied Earth Observation and Geoinformation, Vol. 125, p. 103591
Main Authors: Wu, Meiliu; Huang, Qunying; Gao, Song; Zhang, Zhou
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.12.2023

More Information
Summary: Traditional overhead imagery techniques for urban land use detection and mapping often lack the precision needed for accurate, fine-grained analysis, particularly in complex environments with multi-functional, multi-story buildings. To bridge this gap, this study introduces a novel approach that utilizes ground-level street view images geo-located at the point level to provide more concrete, subtle, and informative visual characteristics for urban mixed land use analysis, addressing the two major limitations of overhead imagery: coarse resolution and insufficient visual information. Given that spatial context-aware land-use descriptions are commonly employed to describe urban environments, this study treats mixed land use detection as a Natural Language for Visual Reasoning (NLVR) task, i.e., classifying the land use(s) in an image based on the similarity between its visual characteristics and local descriptive land use contexts, by integrating street view images (vision) with spatial context-aware land use descriptions (language) through vision-language multimodal learning. The results indicate that this multimodal approach significantly outperforms traditional vision-based methods and can accurately capture the multiple functionalities of ground features. It benefits from the incorporation of spatial context-aware prompts, although the geographic scale of the geo-locations matters. Additionally, the approach marks a significant advancement in mixed land use mapping, achieving point-level precision. It allows diverse land use types to be represented at point locations, offering the flexibility of mapping at various spatial resolutions, including census tracts and zoning districts. This makes it particularly effective in areas with diverse urban functionalities, facilitating a more fine-grained and detailed perspective on mixed land uses in urban settings.
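The zero-shot matching the summary describes can be illustrated with a small sketch: an image embedding is compared against text embeddings of spatial context-aware prompts, and every land use class whose similarity exceeds a threshold is assigned, which is what makes *mixed* (multi-label) predictions possible. The embeddings below are random stand-ins for the outputs of a pretrained vision-language encoder (e.g., CLIP); the prompt template, class list, and threshold are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

# Assumed (illustrative) set of land use classes.
LAND_USES = ["residential", "commercial", "industrial", "recreational"]

def spatial_context_prompt(land_use: str, neighborhood: str) -> str:
    """Spatial context-aware prompt: the land use label combined with a
    description of the surrounding area (an assumed template)."""
    return f"a street view photo of a {land_use} area in a {neighborhood} neighborhood"

def normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize vectors so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_mixed_land_use(image_emb, prompt_embs, threshold=0.3):
    """Return every land use whose cosine similarity to the image exceeds
    the threshold -- a multi-label decision, so one street view point can
    carry several land uses at once."""
    sims = normalize(prompt_embs) @ normalize(image_emb)
    return [(LAND_USES[i], float(s)) for i, s in enumerate(sims) if s > threshold]

# Stand-in embeddings: in practice these would come from encoding the
# prompts and the street view image with the same pretrained model.
rng = np.random.default_rng(0)
dim = 64
prompt_embs = rng.normal(size=(len(LAND_USES), dim))

# Simulate an image of a mixed residential/commercial block by blending
# the two corresponding prompt embeddings with a little noise.
image_emb = prompt_embs[0] + prompt_embs[1] + 0.1 * rng.normal(size=dim)

print(zero_shot_mixed_land_use(image_emb, prompt_embs))
```

Because the decision is threshold-based rather than argmax-based, a single geo-located point can map to more than one land use, which is the property the summary exploits for point-level mixed land use mapping and later aggregation to census tracts or zoning districts.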
• Proposing a novel framework of vision-language learning on land use mixture.
• Better performance of vision-language multimodal learning than vision-based learning.
• Validating spatial context-aware prompts in achieving better land use prediction.
• Offering flexibility of spatial aggregation for fine-grained point-level mapping.
• Globally available datasets enhancing model generalization and adaptability.
ISSN: 1569-8432; 1872-826X
DOI: 10.1016/j.jag.2023.103591