Mixed land use measurement and mapping with street view images and spatial context-aware prompts via zero-shot multimodal learning

Bibliographic Details
Published in: International Journal of Applied Earth Observation and Geoinformation, Vol. 125, p. 103591
Main Authors: Wu, Meiliu; Huang, Qunying; Gao, Song; Zhang, Zhou
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.12.2023

More Information
Summary: Traditional overhead imagery techniques for urban land use detection and mapping often lack the precision needed for accurate, fine-grained analysis, particularly in complex environments with multi-functional, multi-story buildings. To bridge this gap, this study introduces a novel approach that utilizes ground-level street view images geo-located at the point level to provide more concrete, subtle, and informative visual characteristics for urban mixed land use analysis, addressing the two major limitations of overhead imagery: coarse resolution and insufficient visual information. Given that spatial context-aware land-use descriptions are commonly employed to describe urban environments, this study treats mixed land use detection as a Natural Language for Visual Reasoning (NLVR) task, i.e., classifying the land use(s) in an image based on the similarity between its visual characteristics and local descriptive land use contexts, by integrating street view images (vision) with spatial context-aware land use descriptions (language) through vision-language multimodal learning. The results indicate that this multimodal approach significantly outperforms traditional vision-based methods and can accurately capture the multiple functionalities of ground features. It benefits from the incorporation of spatial context-aware prompts, although the geographic scale of the geo-locations matters. Additionally, the approach marks a significant advancement in mixed land use mapping, achieving point-level precision. It allows diverse land use types to be represented at point locations, offering the flexibility of mapping at various spatial resolutions, including census tracts and zoning districts. This makes it particularly effective in areas with diverse urban functionalities, facilitating a more fine-grained and detailed perspective on mixed land uses in urban settings.
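The zero-shot matching the summary describes can be illustrated with a small sketch: an image embedding is compared against text embeddings of spatial context-aware prompts, and every land use class whose similarity exceeds a threshold is assigned, which is what makes *mixed* (multi-label) predictions possible. The embeddings below are random stand-ins for the outputs of a pretrained vision-language encoder (e.g., CLIP); the prompt template, class list, and threshold are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

# Assumed (illustrative) set of land use classes.
LAND_USES = ["residential", "commercial", "industrial", "recreational"]

def spatial_context_prompt(land_use: str, neighborhood: str) -> str:
    """Spatial context-aware prompt: the land use label combined with a
    description of the surrounding area (an assumed template)."""
    return f"a street view photo of a {land_use} area in a {neighborhood} neighborhood"

def normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize vectors so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_mixed_land_use(image_emb, prompt_embs, threshold=0.3):
    """Return every land use whose cosine similarity to the image exceeds
    the threshold -- a multi-label decision, so one street view point can
    carry several land uses at once."""
    sims = normalize(prompt_embs) @ normalize(image_emb)
    return [(LAND_USES[i], float(s)) for i, s in enumerate(sims) if s > threshold]

# Stand-in embeddings: in practice these would come from encoding the
# prompts and the street view image with the same pretrained model.
rng = np.random.default_rng(0)
dim = 64
prompt_embs = rng.normal(size=(len(LAND_USES), dim))

# Simulate an image of a mixed residential/commercial block by blending
# the two corresponding prompt embeddings with a little noise.
image_emb = prompt_embs[0] + prompt_embs[1] + 0.1 * rng.normal(size=dim)

print(zero_shot_mixed_land_use(image_emb, prompt_embs))
```

Because the decision is threshold-based rather than argmax-based, a single geo-located point can map to more than one land use, which is the property the summary exploits for point-level mixed land use mapping and later aggregation to census tracts or zoning districts.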
• Proposing a novel framework of vision-language learning on land use mixture.
• Better performance of vision-language multimodal learning than vision-based learning.
• Validating spatial context-aware prompts in achieving better land use prediction.
• Offering flexibility of spatial aggregation for fine-grained point-level mapping.
• Globally available datasets enhancing model generalization and adaptability.
ISSN: 1569-8432; 1872-826X
DOI: 10.1016/j.jag.2023.103591