Aligning local features from multi-view (ALFM): A hybrid self-supervised framework for object detection via contextual distillation and global representation learning

  • Zhenyu Fang
  • Zhuowei Wang
  • Jinchang Ren
  • Jiangbin Zheng
  • Rongjun Chen
  • Huimin Zhao

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Self-supervised learning learns generalized representations from unlabeled data for downstream tasks. Many such methods are optimized with a pretext task derived from multi-view augmented images, under the assumption that foreground objects dominate the source dataset and that background features are redundant. For object detection datasets, however, background features are essential for accurately detecting objects. In this paper, a detection-specific self-supervised method is proposed for aligning local features from multi-view images (ALFM). The proposed ALFM consists of two learning branches: global minimal sufficient representation (GMSR) and contextual distillation on local patches (CDLP). The GMSR loss globally learns sufficient feature representations with minimal redundant information, enabling the network to remain generalizable when the foreground categories are not known in advance. This is achieved by maximizing the similarity between the embeddings of the two views while increasing the differential entropy of the embeddings from each view. The CDLP loss enhances local feature representations while reducing the redundant information caused by the gap between the pretext task and the detection task; it does so by learning to predict "soft labels" carrying rich contextual information. Taking COCO as the pretraining dataset, results on various detection benchmarks validate the efficacy of the proposed ALFM, which achieves mAP comparable to ImageNet-pretrained models while using only 10% of the training samples.
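The GMSR objective described above combines two terms: an invariance term that pulls the embeddings of two augmented views together, and an entropy term that keeps each view's embedding distribution spread out. The exact loss is not given in this abstract, but the idea can be illustrated with a minimal NumPy sketch that uses a common surrogate for differential entropy, the log-determinant of the embedding covariance (exact under a Gaussian assumption). The function name `gmsr_style_loss` and the weight `lam` are illustrative choices, not the authors' implementation.

```python
import numpy as np

def gmsr_style_loss(z1, z2, lam=0.1, eps=1e-4):
    """Sketch of a GMSR-style objective (hypothetical surrogate).

    Minimizing this loss aligns the paired view embeddings while
    maximizing each view's differential entropy, approximated by the
    log-determinant of the (regularized) embedding covariance.
    z1, z2: arrays of shape (batch, dim), embeddings of two views.
    """
    # Invariance term: mean squared distance between paired embeddings.
    align = np.mean((z1 - z2) ** 2)

    def logdet_cov(z):
        # Regularize the covariance so the determinant is well-defined.
        c = np.cov(z, rowvar=False) + eps * np.eye(z.shape[1])
        _, logdet = np.linalg.slogdet(c)
        return logdet

    # Entropy term (Gaussian proxy), computed per view.
    entropy = logdet_cov(z1) + logdet_cov(z2)
    return align - lam * entropy  # minimize: align views, maximize entropy

# Usage: perfectly aligned views incur a lower loss than perturbed ones.
rng = np.random.default_rng(0)
z = rng.normal(size=(256, 8))
loss_same = gmsr_style_loss(z, z)
loss_noisy = gmsr_style_loss(z, z + 0.5 * rng.normal(size=z.shape))
print(loss_same, loss_noisy)
```

The entropy term is what distinguishes this family of objectives from pure similarity maximization: without it, a trivial solution that maps every image to the same embedding would minimize the alignment term (representation collapse).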

Original language: English
Article number: 114671
Journal: Knowledge-Based Systems
Volume: 330
State: Published - 25 Nov 2025

Keywords

  • Contextual distillation
  • Global minimal sufficient representation
  • Object detection
  • Self-supervised learning
