Assessing the spatial accuracy of geocoding flood-related imagery using Vision Language Models
Published in | Spatial information research (Online), Vol. 33, No. 2, p. 15 |
---|---|
Main Authors | |
Format | Journal Article |
Language | English |
Published | Korean Spatial Information Society (대한공간정보학회), 01.04.2025 |
Summary | While the capabilities of large language models and visual language models for various classification tasks have advanced significantly, their potential for location inference remains largely underexplored. Therefore, this study evaluates the performance of four prominent models — BLIP-2, LLaVA1.6, OpenFlamingo, and GPT-4o — for geocoding flood-related images from Flickr. Model inferences are compared against the original photo locations and human-labelled assessments. Our findings reveal that GPT-4o achieves the highest spatial accuracy (median deviation of 89.12 km). OpenFlamingo geocodes the highest number of images (90.7%), albeit with fluctuating quality (median deviation of 408.35 km), while still outperforming the human annotators. LLaVA1.6 geocodes only 18.9% of all images, while BLIP-2 exhibits the highest median deviation (1,781 km). We observe a spatial bias in our results, with inferences being most accurate in Central Europe. Additionally, model results improve when images feature recognisable landmarks. The proposed workflow could significantly increase the amount of geocoded web-based data available for disaster management, though further research is required to enhance accuracy across diverse geographic contexts. |
ISSN | 2366-3286; 2366-3294 |
DOI | 10.1007/s41324-025-00609-0 |
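
The headline figures in the summary are median great-circle deviations between a model's inferred coordinates and a photo's ground-truth location, computed only over the images each model managed to geocode. The record does not include the authors' evaluation code, so the following is a minimal sketch of how such a metric can be computed, assuming the standard haversine (spherical) distance; the `haversine_km` helper, the coordinate pairs, and the `records` structure are hypothetical illustrations, not the paper's implementation.

```python
import math
from statistics import median

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle (spherical) distance in kilometres between two points."""
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical evaluation records: (true location, model-inferred location).
# A prediction of None stands for an image the model declined to geocode;
# such images are excluded from the deviation statistic, which is how a
# per-model "share of images geocoded" arises alongside the median error.
records = [
    ((51.05, 13.74), (51.34, 12.37)),  # e.g. Dresden photo placed near Leipzig
    ((50.94, 6.96), (48.14, 11.58)),   # e.g. Cologne photo placed in Munich
    ((52.52, 13.41), (52.51, 13.40)),  # near-exact hit on a recognisable landmark
    ((53.55, 9.99), None),             # model gave no usable location
]

geocoded = [(t, p) for t, p in records if p is not None]
deviations = [haversine_km(t[0], t[1], p[0], p[1]) for t, p in geocoded]

print(f"geocoded: {len(geocoded) / len(records):.1%} of images")
print(f"median deviation: {median(deviations):.2f} km")
```

For higher fidelity, a geodesic distance on the WGS84 ellipsoid (e.g. `geopy.distance.geodesic`) is marginally more accurate than the spherical approximation, though the difference (well under 1%) is negligible at the deviation magnitudes reported in the summary. The median, rather than the mean, is the natural summary statistic here because a few wildly wrong continental-scale guesses would otherwise dominate the result.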