Performance and Non-adversarial Robustness of the Segment Anything Model 2 in Surgical Video Segmentation
Main Authors | , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 07.08.2024 |
Summary: | Fully supervised deep learning (DL) models for surgical video segmentation
have been shown to struggle with non-adversarial, real-world corruptions of
image quality including smoke, bleeding, and low illumination. Foundation
models for image segmentation, such as the segment anything model (SAM) that
focuses on interactive prompt-based segmentation, move away from semantic
classes and thus can be trained on larger and more diverse data, which offers
outstanding zero-shot generalization with appropriate user prompts. Recently,
building upon this success, SAM-2 has been proposed to further extend the
zero-shot interactive segmentation capabilities from independent frame-by-frame
segmentation to video segmentation. In this paper, we present a first experimental study
evaluating SAM-2's performance on surgical video data. Leveraging the
SegSTRONG-C MICCAI EndoVIS 2024 sub-challenge dataset, we assess SAM-2's
effectiveness on uncorrupted endoscopic sequences and evaluate its
non-adversarial robustness on videos with corrupted image quality simulating
smoke, bleeding, and low brightness conditions under various prompt strategies.
Our experiments demonstrate that SAM-2, in a zero-shot manner, can achieve
competitive or even superior performance compared to fully supervised deep
learning models on surgical video data, including under non-adversarial
corruptions of image quality. Additionally, SAM-2 consistently outperforms the
original SAM and its medical variants across all conditions. Finally,
frame-sparse prompting can consistently outperform frame-wise prompting for
SAM-2, suggesting that allowing SAM-2 to leverage its temporal modeling
capabilities leads to more coherent and accurate segmentation compared to
frequent prompting. |
---|---|
DOI: | 10.48550/arxiv.2408.04098 |
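
The contrast between frame-wise and frame-sparse prompting described in the summary can be illustrated with a minimal sketch. The function name `prompt_frames` and its `interval` parameter are hypothetical and introduced only for illustration. This record does not state which prompt intervals the authors used, and the sketch does not call any SAM-2 API.

```python
def prompt_frames(num_frames: int, interval: int) -> list[int]:
    """Return the indices of the frames that receive a user prompt.

    interval == 1 models frame-wise prompting (every frame is prompted).
    interval > 1 models frame-sparse prompting: frames in between get no
    new prompt and would rely on the video model's temporal propagation.
    The name and the parameter are illustrative assumptions.
    """
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    if interval < 1:
        raise ValueError("interval must be at least 1")
    return list(range(0, num_frames, interval))


# Frame-wise: a prompt on each of the 10 frames.
frame_wise = prompt_frames(num_frames=10, interval=1)
# Frame-sparse: prompts only on frames 0 and 5.
frame_sparse = prompt_frames(num_frames=10, interval=5)
```

Under this assumed schedule, the frame-sparse setting supplies far fewer prompts per video, which is the regime the summary reports as more accurate for SAM-2.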