Overall architecture of ERR-Seg. We first use the Hierarchical Semantic Module to generate dense pixel-text cost maps. The training-free Channel Reduction Module then eliminates redundant classes. The Efficient Semantic Context Fusion further refines the cost maps, enriching them with spatial and class-level context. Finally, the Gradual Upsampling Decoder restores the high-rank information of the cost maps by incorporating image details from the middle-layer features \(\mathcal{F}_i^v, i\in \{1,2,3\}\) of CLIP's visual encoder.
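The following is a minimal sketch of this pipeline, assuming a PyTorch-style implementation. The module bodies (`ChannelReduction`, the conv-based context fusion, the single-step decoder) and all shapes are placeholders inferred from the caption, not the released ERR-Seg code; in particular, the injection of CLIP mid-layer features into the decoder is only noted in a comment.

```python
import torch
import torch.nn as nn


class ChannelReduction(nn.Module):
    """Training-free reduction: keep the top-k classes per image by peak cost response."""

    def __init__(self, keep_k: int):
        super().__init__()
        self.keep_k = keep_k

    def forward(self, cost: torch.Tensor) -> torch.Tensor:
        # cost: (B, N_classes, H, W) pixel-text cost maps
        scores = cost.flatten(2).amax(dim=2)                 # (B, N_classes)
        idx = scores.topk(self.keep_k, dim=1).indices        # (B, keep_k)
        idx = idx[:, :, None, None].expand(-1, -1, *cost.shape[-2:])
        return cost.gather(1, idx)                           # (B, keep_k, H, W)


class ERRSegSketch(nn.Module):
    def __init__(self, keep_k: int = 32, hidden_dim: int = 128):
        super().__init__()
        self.channel_reduction = ChannelReduction(keep_k)
        # Stand-in for Efficient Semantic Context Fusion: a small conv block
        # that mixes spatial context across the reduced cost maps.
        self.context_fusion = nn.Sequential(
            nn.Conv2d(keep_k, hidden_dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(hidden_dim, keep_k, 3, padding=1),
        )
        # Stand-in for the Gradual Upsampling Decoder (a single 2x step here).
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(keep_k, keep_k, 3, padding=1),
        )

    def forward(self, cost_maps: torch.Tensor) -> torch.Tensor:
        # cost_maps: dense pixel-text cost maps from the Hierarchical Semantic Module
        reduced = self.channel_reduction(cost_maps)          # drop redundant classes
        fused = reduced + self.context_fusion(reduced)       # add spatial/class context
        # The paper's decoder also injects CLIP mid-layer features F_i^v, i in {1,2,3};
        # that fusion is omitted in this toy sketch.
        return self.decoder(fused)


if __name__ == "__main__":
    model = ERRSegSketch(keep_k=32)
    cost = torch.randn(1, 171, 24, 24)   # toy cost maps for 171 candidate classes
    print(model(cost).shape)             # torch.Size([1, 32, 48, 48])
```

The key point this sketch illustrates is that channel reduction is a parameter-free top-k selection over the class dimension, so the subsequent fusion and decoding operate on far fewer channels than the original vocabulary.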
Visualization of segmentation results across various domains. Our proposed ERR-Seg segments capybaras (a rare category in public datasets) in diverse settings, including (a) synthesized images, (b) cartoon images, (c) natural images, and (d) capybara dolls. Moreover, ERR-Seg produces more precise masks than SAN and CAT-Seg.
ERR-Seg also correctly distinguishes a yellow dog from a white dog, and a lying capybara from a standing one.
@article{chen2025efficient,
  title={Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation},
  author={Chen, Lin and Yang, Qi and Ding, Kun and Li, Zhihao and Shen, Gang and Li, Fei and Cao, Qiyuan and Xiang, Shiming},
  journal={arXiv preprint arXiv:2501.17642},
  year={2025}
}