Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation

Lin Chen1, Qi Yang1, Kun Ding1, Zhihao Li2, Gang Shen3, Fei Li3, Qiyuan Cao1, Shiming Xiang1
1Institute of Automation, Chinese Academy of Sciences
2Shandong University, 3Tower Corporation Limited

Abstract

Open-vocabulary semantic segmentation (OVSS) is an open-world task that aims to assign each pixel in an image to a class defined by an arbitrary text description. Recent large-scale vision-language models have demonstrated the capability to understand open vocabularies across diverse visual scenes. However, most existing methods suffer from either suboptimal performance or long latency. This paper proposes a novel framework that balances accuracy and efficiency for OVSS from the perspective of Efficient Redundancy Reduction (ERR) with the help of pre-trained models; we abbreviate it as ERR-Seg. Unlike channel-reduction tricks in traditional architecture design, we construct a training-free Channel Reduction Module (CRM) that leverages prior knowledge from vision-language models such as CLIP to identify the most relevant classes and discard the rest. Moreover, an Efficient Semantic Context Fusion (ESCF) module is designed to reduce sequence length at both the spatial and class levels. Architecturally, CRM and ESCF are interleaved throughout ERR-Seg. Additionally, motivated by the importance of hierarchical semantics extracted from middle-layer features in closed-set semantic segmentation, ERR-Seg introduces a Hierarchical Semantic Module (HSM) to exploit such semantics in the context of OVSS. Overall, ERR-Seg yields a lightweight structure for OVSS with substantial memory and computational savings and no loss in accuracy. Compared to previous state-of-the-art methods under the ADE20K-847 setting, ERR-Seg achieves a +\(5.6\%\) mIoU improvement and reduces latency by \(67.3\%\).
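The training-free channel reduction described above can be sketched as a top-\(k\) selection over pixel-text cost maps. This is a minimal illustration, not the paper's implementation: the scoring rule (peak per-class similarity), the class budget `k`, and the random stand-in for CLIP similarities are all assumptions made for the sketch.

```python
import numpy as np

def channel_reduction(cost_maps, k):
    """Hedged sketch of a training-free CRM.

    cost_maps: (C, H, W) pixel-text similarity maps, one channel per
    candidate class. Scores each class by its peak similarity, keeps the
    top-k classes, and returns the reduced maps plus the kept indices.
    """
    scores = cost_maps.reshape(cost_maps.shape[0], -1).max(axis=1)
    keep = np.argsort(scores)[::-1][:k]   # indices of the k best classes
    return cost_maps[keep], keep

# Random stand-in for CLIP-derived cost maps over an 847-class vocabulary.
rng = np.random.default_rng(0)
maps = rng.random((847, 32, 32))
reduced, keep = channel_reduction(maps, k=64)
print(reduced.shape)  # (64, 32, 32)
```

Because the selection requires no learned parameters, it can be inserted anywhere in a pipeline without retraining, which is the property the abstract's "training-free" claim relies on.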

Main Architecture


Overall architecture of ERR-Seg. The Hierarchical Semantic Module first generates dense pixel-text cost maps. The training-free Channel Reduction Module then eliminates redundant classes. The Efficient Semantic Context Fusion further enhances the cost maps, enriching spatial- and class-level contextual information. Finally, the Gradual Upsampling Decoder restores the high-rank information of the cost maps by incorporating image details from the middle-layer features \(\mathcal{F}_i^v, i\in \{1,2,3\}\) of CLIP's visual encoder.
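The four-stage flow in the caption can be traced with dummy tensors to show how shapes change at each step. Everything here is an illustrative assumption: the stand-ins for ESCF (per-class normalization) and the decoder (nearest-neighbor upsampling) only mimic the tensor shapes, not the modules' internals.

```python
import numpy as np

C, H, W, k = 847, 32, 32, 64            # vocabulary size and class budget (assumed)
rng = np.random.default_rng(0)

# 1) HSM: dense pixel-text cost maps, one channel per candidate class.
cost_maps = rng.random((C, H, W))

# 2) CRM: training-free ranking by peak similarity; redundant classes dropped.
scores = cost_maps.reshape(C, -1).max(axis=1)
keep = np.argsort(scores)[::-1][:k]
reduced = cost_maps[keep]               # (k, H, W)

# 3) ESCF stand-in: refine the reduced maps (here, zero-center each class map).
fused = reduced - reduced.mean(axis=(1, 2), keepdims=True)

# 4) Decoder stand-in: gradual upsampling back toward image resolution.
upsampled = fused.repeat(4, axis=1).repeat(4, axis=2)
print(upsampled.shape)  # (64, 128, 128)
```

The key point the sketch makes concrete: after CRM, every downstream stage operates on \(k \ll C\) channels, which is where the memory and latency savings come from.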

Results


Qualitative Results


Visualization of segmentation results in various domains. Our proposed ERR-Seg is capable of segmenting capybaras (a rare category in public datasets) from various domains, including (a) synthesized images, (b) cartoon images, (c) natural images, and (d) capybara dolls. Moreover, ERR-Seg achieves more precise masks than SAN and CAT-Seg.


ERR-Seg can correctly distinguish between a yellow dog and a white dog and between a lying capybara and a standing capybara.

Quantitative results on the ADE20K-150 (A-150) and ADE20K-847 (A-847) benchmarks.

BibTeX

@article{chen2025efficient,
  title={Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation},
  author={Chen, Lin and Yang, Qi and Ding, Kun and Li, Zhihao and Shen, Gang and Li, Fei and Cao, Qiyuan and Xiang, Shiming},
  journal={arXiv preprint arXiv:2501.17642},
  year={2025}
}