Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation

Lin Chen1, Qi Yang1, Kun Ding1, Zhihao Li2, Gang Shen3, Fei Li3, Qiyuan Cao1, Shiming Xiang1
1Institute of Automation, Chinese Academy of Sciences
2Shandong University, 3Tower Corporation Limited

Abstract

Open-vocabulary semantic segmentation (OVSS) is an open-world task that aims to assign each pixel in an image to a class defined by an arbitrary text description. Recent large-scale vision-language models have demonstrated the capability to understand open vocabularies across diverse visual scenes. However, most existing methods suffer from either suboptimal performance or long latency. This paper proposes a novel framework that balances accuracy and efficiency for OVSS from the perspective of Efficient Redundancy Reduction (ERR) with the help of pre-trained models; we abbreviate it as ERR-Seg. Unlike channel-reduction tricks in traditional architecture design, we construct a training-free Channel Reduction Module (CRM) that leverages prior knowledge from vision-language models such as CLIP to identify the most relevant classes and discard the rest. Moreover, an Efficient Semantic Context Fusion (ESCF) module is designed to reduce sequence length at both the spatial and class levels. Architecturally, CRM and ESCF are interleaved throughout ERR-Seg. Additionally, motivated by the importance of hierarchical semantics extracted from middle-layer features in closed-set semantic segmentation, ERR-Seg introduces a Hierarchical Semantic Module (HSM) to exploit such semantics in the context of OVSS. Overall, ERR-Seg yields a lightweight structure for OVSS with substantial memory and computational savings and no loss in accuracy. Compared to previous state-of-the-art methods under the ADE20K-847 setting, ERR-Seg achieves a +\(5.6\%\) mIoU improvement and reduces latency by \(67.3\%\).
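The training-free channel reduction described above can be sketched as a top-\(k\) selection over pixel-text cost maps. This is a minimal illustration, not the paper's implementation: the scoring rule (peak per-class similarity), the class budget `k`, and the random stand-in for CLIP similarities are all assumptions made for the sketch.

```python
import numpy as np

def channel_reduction(cost_maps, k):
    """Hedged sketch of a training-free CRM.

    cost_maps: (C, H, W) pixel-text similarity maps, one channel per
    candidate class. Scores each class by its peak similarity, keeps the
    top-k classes, and returns the reduced maps plus the kept indices.
    """
    scores = cost_maps.reshape(cost_maps.shape[0], -1).max(axis=1)
    keep = np.argsort(scores)[::-1][:k]   # indices of the k best classes
    return cost_maps[keep], keep

# Random stand-in for CLIP-derived cost maps over an 847-class vocabulary.
rng = np.random.default_rng(0)
maps = rng.random((847, 32, 32))
reduced, keep = channel_reduction(maps, k=64)
print(reduced.shape)  # (64, 32, 32)
```

Because the selection requires no learned parameters, it can be inserted anywhere in a pipeline without retraining, which is the property the abstract's "training-free" claim relies on.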

Main Architecture


Overall architecture of ERR-Seg. The Hierarchical Semantic Module first generates dense pixel-text cost maps. The training-free Channel Reduction Module then eliminates redundant classes. The Efficient Semantic Context Fusion further enhances the cost maps, enriching spatial- and class-level contextual information. Finally, the Gradual Upsampling Decoder restores the high-rank information of the cost maps by incorporating image details from the middle-layer features \(\mathcal{F}_i^v, i\in \{1,2,3\}\) of CLIP's visual encoder.
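The four-stage flow in the caption can be traced with dummy tensors to show how shapes change at each step. Everything here is an illustrative assumption: the stand-ins for ESCF (per-class normalization) and the decoder (nearest-neighbor upsampling) only mimic the tensor shapes, not the modules' internals.

```python
import numpy as np

C, H, W, k = 847, 32, 32, 64            # vocabulary size and class budget (assumed)
rng = np.random.default_rng(0)

# 1) HSM: dense pixel-text cost maps, one channel per candidate class.
cost_maps = rng.random((C, H, W))

# 2) CRM: training-free ranking by peak similarity; redundant classes dropped.
scores = cost_maps.reshape(C, -1).max(axis=1)
keep = np.argsort(scores)[::-1][:k]
reduced = cost_maps[keep]               # (k, H, W)

# 3) ESCF stand-in: refine the reduced maps (here, zero-center each class map).
fused = reduced - reduced.mean(axis=(1, 2), keepdims=True)

# 4) Decoder stand-in: gradual upsampling back toward image resolution.
upsampled = fused.repeat(4, axis=1).repeat(4, axis=2)
print(upsampled.shape)  # (64, 128, 128)
```

The key point the sketch makes concrete: after CRM, every downstream stage operates on \(k \ll C\) channels, which is where the memory and latency savings come from.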

Results


Qualitative Results


Visualization of segmentation results in various domains. Our proposed ERR-Seg is capable of segmenting capybaras (a rare category in public datasets) from various domains, including (a) synthesized images, (b) cartoon images, (c) natural images, and (d) capybara dolls. Moreover, ERR-Seg achieves more precise masks than SAN and CAT-Seg.


ERR-Seg can correctly distinguish between a yellow dog and a white dog and between a lying capybara and a standing capybara.

Quantitative results on the ADE20K-150 (A-150) and ADE20K-847 (A-847) benchmarks.

BibTeX

@article{chen2025efficient,
  title={Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation},
  author={Chen, Lin and Yang, Qi and Ding, Kun and Li, Zhihao and Shen, Gang and Li, Fei and Cao, Qiyuan and Xiang, Shiming},
  journal={arXiv preprint arXiv:2501.17642},
  year={2025}
}