
Text and Non-Text Latent Feature Disentanglement for Screen Content Image Compression

  • Hao Wang
  • Junyan Huo
  • Fei Yang
  • Shuai Wan
  • Gaoxing Chen
  • Kun Yang
  • Luis Herranz
  • Fuzheng Yang

Affiliations

  • Xidian University
  • Nankai University
  • Royal Melbourne Institute of Technology University
  • Alibaba Group Holding Ltd.
  • Technical University of Madrid

Research output: Contribution to journal › Article › peer-review

Abstract

With the growing prevalence of screen content images in multimedia communication, efficient compression has become increasingly crucial. Unlike natural scene images, screen content typically contains rich text regions that exhibit unique characteristics and low correlation with surrounding non-text elements. The intricate mixture of text and non-text within images poses significant challenges for existing learned compression networks: text and non-text features are severely entangled in the latent domain along the channel dimension, leading to compromised reconstruction quality and suboptimal entropy estimation. In this paper, we propose a novel Disentangled Image Compression Architecture (DICA) that enhances the analysis module and the entropy model of existing compression architectures to address these limitations. First, we introduce a Disentangled Analysis Module (DAM) by augmenting original analysis modules with an additional text approximation branch and a disentangling network. They work in concert to disentangle latent features into text and non-text classes along the channel dimension, resulting in a more structured feature distribution that better aligns with compression requirements. Second, we propose a Disentangled Channel-Conditional Entropy Model (DCEM) that efficiently leverages the feature distribution bias introduced by DAM, thereby further improving compression performance. Experimental results demonstrate that the proposed DICA, along with DAM and DCEM, can be integrated into various channel-conditional compression backbones, significantly improving their performance in screen content compression, particularly in hard-to-compress text regions. When integrated with an advanced WACNN backbone, our method achieves a 13% overall BD-Rate gain and a 16% BD-Rate gain in text regions on the SIQAD dataset.
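The channel-wise disentanglement and sequential group coding described above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the channel counts, the `predict_scale` helper, and the per-element bit estimate are all assumptions standing in for the learned networks of DAM and DCEM.

```python
import numpy as np

# Hypothetical latent y of shape (C, H, W), split along the channel axis
# into a "text" group and a "non-text" group, as DAM is described to do.
C, H, W = 8, 4, 4
num_text = 3  # illustrative number of channels assigned to text features

rng = np.random.default_rng(0)
y = rng.standard_normal((C, H, W))

y_text, y_nontext = y[:num_text], y[num_text:]

# A channel-conditional entropy model codes groups sequentially: here the
# non-text group's Gaussian scale is predicted from the already-decoded
# text group (a stand-in for the learned prediction network in DCEM).
def predict_scale(decoded_group):
    # hypothetical predictor: one per-position scale from the decoded channels
    return np.abs(decoded_group).mean(axis=0, keepdims=True) + 1e-6

scale = predict_scale(y_text)

# Differential-entropy proxy for the rate of the non-text group under a
# zero-mean Gaussian with the predicted scale: 0.5 * log2(2*pi*e*sigma^2).
bits_nontext = 0.5 * np.log2(2 * np.pi * np.e * scale**2)
print(y_text.shape, y_nontext.shape, bits_nontext.shape)
```

Coding the text group first and conditioning the non-text group on it is the point of the channel-conditional ordering: the structured (disentangled) channel layout lets the predictor exploit the distribution bias that DAM introduces.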

Original language: English
Pages (from-to): 2505-2519
Number of pages: 15
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 36
Issue number: 2
State: Published - 2026

Keywords

  • Image compression
  • latent feature disentanglement
  • screen content image
