ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements

M. Arda Aydin, Efe Mert Cirpar, Elvin Abdinli, Gozde Unal, Yusuf H. Sahin

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Recent advances in foundational Vision Language Models (VLMs) have reshaped the evaluation paradigm in computer vision tasks. These foundational models, especially CLIP, have accelerated research in open-vocabulary computer vision tasks, including Open-Vocabulary Semantic Segmentation (OVSS). Although the initial results are promising, the dense prediction capabilities of VLMs still require further improvement. In this study, we enhance the semantic segmentation performance of CLIP by introducing new modules and modifications: 1) architectural changes in the last layer of ViT and the incorporation of attention maps from the middle layers with the last layer, 2) Image Engineering: applying data augmentations to enrich input image representations, and 3) using Large Language Models (LLMs) to generate definitions and synonyms for each class name to leverage CLIP's open-vocabulary capabilities. Our training-free method, ITACLIP, outperforms current state-of-the-art approaches on five popular segmentation benchmarks. Our code is available at https://github.com/m-arda-aydn/ITACLIP.
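The text-engineering idea described above — using an LLM to generate definitions and synonyms per class and folding them into CLIP's open-vocabulary matching — can be sketched with plain NumPy. This is an illustrative sketch, not the authors' implementation: the function names and the averaging strategy (mean of normalized prompt embeddings, then cosine-similarity argmax per patch) are assumptions, and real CLIP image/text encoders would supply the embeddings.

```python
import numpy as np

def class_embedding(prompt_embs: np.ndarray) -> np.ndarray:
    """Fuse embeddings of several prompts for one class (the class name plus
    LLM-generated synonyms/definitions) into a single class embedding.
    Assumption: simple mean of L2-normalized embeddings, renormalized."""
    e = prompt_embs / np.linalg.norm(prompt_embs, axis=-1, keepdims=True)
    mean = e.mean(axis=0)
    return mean / np.linalg.norm(mean)

def segment(patch_feats: np.ndarray, class_embs: list) -> np.ndarray:
    """Assign each image patch to the class with the highest cosine
    similarity between patch feature and class embedding."""
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    sims = p @ np.stack(class_embs).T   # (num_patches, num_classes)
    return sims.argmax(axis=-1)         # per-patch class index
```

In the full method, `patch_feats` would come from the modified ViT (with mid-layer attention maps mixed into the last layer) and each row of `prompt_embs` from CLIP's text encoder applied to one generated prompt.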

Original language: English
Title of host publication: Proceedings - 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2025
Publisher: IEEE Computer Society
Pages: 4142-4152
Number of pages: 11
ISBN (Electronic): 9798331599942
DOIs
Publication status: Published - 2025
Event: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2025 - Nashville, United States
Duration: 11 Jun 2025 – 12 Jun 2025

Publication series

Name: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
ISSN (Print): 2160-7508
ISSN (Electronic): 2160-7516

Conference

Conference: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2025
Country/Territory: United States
City: Nashville
Period: 11/06/25 – 12/06/25

Bibliographical note

Publisher Copyright:
© 2025 IEEE.

Keywords

  • open-vocabulary semantic segmentation
  • training-free semantic segmentation
  • vision-language models
