A Lightweight Clustering Framework for Unsupervised Semantic Segmentation
Main Authors | , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 30.11.2023 |
Subjects | |
Summary: | Unsupervised semantic segmentation aims to categorize each pixel in an image
into a corresponding class without the use of annotated data. It is a widely
researched area as obtaining labeled datasets is expensive. While previous
works in the field have demonstrated a gradual improvement in model accuracy,
most required neural network training. This made segmentation equally
expensive, especially when dealing with large-scale datasets. We thus propose a
lightweight clustering framework for unsupervised semantic segmentation. We
discovered that attention features of the self-supervised Vision Transformer
exhibit strong foreground-background differentiability. Therefore, clustering
can be employed to effectively separate foreground and background image
patches. In our framework, we first perform multilevel clustering at the
Dataset, Category, and Image levels, maintaining consistency throughout.
Then, the extracted binary patch-level pseudo-masks are upsampled,
refined, and finally labeled. Furthermore, we provide a comprehensive analysis
of the self-supervised Vision Transformer features and a detailed comparison
between DINO and DINOv2 to justify our claims. Our framework demonstrates great
promise in unsupervised semantic segmentation and achieves state-of-the-art
results on PASCAL VOC and MS COCO datasets. |
---|---|
DOI: | 10.48550/arxiv.2311.18628 |
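
The core idea in the summary, clustering patch features into foreground and background and then upsampling the binary patch mask, can be illustrated with a small sketch. This is not the authors' implementation: it uses plain 2-means on toy feature vectors in place of real Vision Transformer attention features, covers only a single image-level step, and picks the background cluster with a border-majority heuristic that is an assumption made here for illustration. The function names `two_means`, `patch_mask`, and `upsample` are hypothetical.

```python
from math import dist


def two_means(features, iters=20):
    """Split feature vectors into two clusters with plain k-means (k=2)."""
    # Deterministic init: the first vector and the vector farthest from it.
    c0 = list(features[0])
    c1 = list(max(features, key=lambda f: dist(f, c0)))
    centroids = [c0, c1]
    labels = None
    for _ in range(iters):
        new_labels = [
            0 if dist(f, centroids[0]) <= dist(f, centroids[1]) else 1
            for f in features
        ]
        if new_labels == labels:
            break  # assignments are stable, so k-means has converged
        labels = new_labels
        for k in (0, 1):
            members = [f for f, lab in zip(features, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def patch_mask(features, grid_h, grid_w):
    """Return a binary grid_h x grid_w mask (1 = foreground patch)."""
    labels = two_means(features)
    grid = [labels[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]
    # ASSUMPTION (not from the paper): the cluster that dominates the image
    # border is treated as background.
    border = [
        grid[r][c]
        for r in range(grid_h)
        for c in range(grid_w)
        if r in (0, grid_h - 1) or c in (0, grid_w - 1)
    ]
    background = 1 if sum(border) * 2 > len(border) else 0
    return [[int(v != background) for v in row] for row in grid]


def upsample(mask, scale):
    """Nearest-neighbour upsampling of a patch-level mask to pixel resolution."""
    return [
        [v for v in row for _ in range(scale)]
        for row in mask
        for _ in range(scale)
    ]


# Toy 4x4 patch grid: the central 2x2 patches carry "object-like" features.
toy_features = [
    (1.0, 0.9) if 1 <= r <= 2 and 1 <= c <= 2 else (0.1, 0.0)
    for r in range(4)
    for c in range(4)
]
toy_mask = patch_mask(toy_features, 4, 4)
toy_pixels = upsample(toy_mask, 2)  # 8x8 pixel-level pseudo-mask
```

In practice the feature vectors would come from a self-supervised Vision Transformer such as DINO or DINOv2, and the upsampled mask would still need the refinement and labeling steps that the summary mentions.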