Mixed land use measurement and mapping with street view images and spatial context-aware prompts via zero-shot multimodal learning
Published in | International Journal of Applied Earth Observation and Geoinformation, Vol. 125, p. 103591 |
Main Authors | , , , |
Format | Journal Article |
Language | English |
Published | Elsevier B.V., 01.12.2023 |
Summary: | Traditional overhead-imagery techniques for urban land use detection and mapping often lack the precision needed for accurate, fine-grained analysis, particularly in complex environments with multi-functional, multi-story buildings. To bridge this gap, this study introduces a novel approach that uses ground-level street view images geo-located at the point level to provide more concrete, subtle, and informative visual characteristics for urban mixed land use analysis, addressing the two major limitations of overhead imagery: coarse resolution and insufficient visual information. Because spatial context-aware land-use descriptions are commonly employed to characterize urban environments, this study treats mixed land use detection as a Natural Language for Visual Reasoning (NLVR) task: classifying the land use(s) in an image by the similarity between its visual characteristics and local descriptive land use contexts, integrating street view images (vision) with spatial context-aware land use descriptions (language) through vision-language multimodal learning. The results indicate that the multimodal approach significantly outperforms traditional vision-based methods and accurately captures the multiple functionalities of ground features. Performance benefits from the incorporation of spatial context-aware prompts, and the geographic scale at which geo-located context is drawn matters. The approach also marks a significant advance in mixed land use mapping, achieving point-level precision: it represents diverse land use types at point locations and offers the flexibility of mapping at various spatial resolutions, including census tracts and zoning districts. It is particularly effective in areas with diverse urban functionalities, enabling a more fine-grained and detailed perspective on mixed land uses in urban settings. |
•Proposing a novel framework of vision-language learning on land use mixture.
•Better performance of vision-language multimodal learning than vision-based learning.
•Validating spatial context-aware prompts in achieving better land use prediction.
•Offering flexibility of spatial aggregation for fine-grained point-level mapping.
•Globally available datasets enhancing model generalization and adaptability. |
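The zero-shot vision-language matching the abstract describes can be sketched as CLIP-style scoring: embed the street view image and a set of spatial context-aware land-use prompts, then rank land-use classes by cosine similarity. The sketch below uses synthetic embeddings in place of a pretrained encoder; the prompt wordings, embedding dimension, and temperature are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

# Hypothetical spatial context-aware prompts: each pairs a land-use label with
# local descriptive context, as the abstract suggests.
prompts = [
    "a street view photo of a residential area with apartment buildings",
    "a street view photo of a commercial area with shops and restaurants",
    "a street view photo of an industrial area with warehouses",
]

rng = np.random.default_rng(0)
dim = 512  # a common vision-language embedding size (assumed)

# Stand-ins for encoder outputs; a real pipeline would obtain these from a
# pretrained vision-language model (e.g. a CLIP-like encoder).
image_emb = rng.normal(size=dim)
text_embs = rng.normal(size=(len(prompts), dim))

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Zero-shot scoring: cosine similarity between the image and each prompt,
# converted to per-class probabilities with a numerically stable softmax.
# A mixed-use point would show comparable mass on several classes.
sims = l2_normalize(text_embs) @ l2_normalize(image_emb)
logits = 100.0 * sims          # temperature akin to CLIP's logit scale (assumed)
logits -= logits.max()
probs = np.exp(logits) / np.exp(logits).sum()

for prompt, prob in zip(prompts, probs):
    print(f"{prob:.3f}  {prompt}")
```

Because the classifier is only a similarity ranking over text prompts, swapping in a different land-use taxonomy or richer local context requires no retraining, which is what makes the approach zero-shot.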
ISSN: | 1569-8432 1872-826X |
DOI: | 10.1016/j.jag.2023.103591 |