Boosting Gaze Object Prediction via Pixel-Level Supervision from Vision Foundation Model

Yang Jin; Lei Zhang; Shi Yan; Bin Fan; Binglu Wang

doi:10.1007/978-3-031-72890-7_23

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Gaze object prediction (GOP) aims to predict the category and location of the object that a human is looking at. Previous methods used box-level supervision to identify the gazed object, but struggled with semantic ambiguity: because objects are often close together, a single box may contain several items. Vision foundation models (VFMs) have improved at object segmentation from box prompts, which can reduce this confusion by locating objects more precisely and thus benefits fine-grained prediction of gaze objects. This paper presents a more challenging gaze object segmentation (GOS) task, which involves inferring the pixel-level mask of the object captured by human gaze behavior. In particular, we propose that the pixel-level supervision provided by a VFM can be integrated into gaze object prediction to mitigate semantic ambiguity. This leads to our gaze object detection and segmentation framework, which enables accurate pixel-level predictions. Unlike previous methods that require additional head input or ignore head features, we propose to automatically obtain head features from scene features to ensure the model’s inference efficiency and flexibility in the real world. Moreover, rather than directly fusing features to predict a gaze heatmap as in existing methods, which may overlook the spatial location and subtle details of the object, we develop a space-to-object gaze regression method to facilitate human-object gaze interaction. Specifically, it first constructs an initial human-object spatial connection, then refines this connection by interacting with semantically clear features in the segmentation branch, and ultimately predicts a gaze heatmap for precise localization. Extensive experiments on the GOO-Synth and GOO-Real datasets demonstrate the effectiveness of our method. The code will be available at https://github.com/jinyang06/SamGOP.
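
The semantic ambiguity of box-level supervision mentioned in the abstract can be illustrated with a toy example. The sketch below is not the authors' method or code. It assumes a hypothetical 4x4 gaze heatmap and two hypothetical objects, A and B, and it scores each object by the total heatmap energy inside its region, a rule chosen here purely for illustration. Object A is L-shaped, so its bounding box also covers the small object B that the person is actually looking at.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]  # (row, col) on a small grid


def box_cells(r0: int, c0: int, r1: int, c1: int) -> Set[Cell]:
    """Return every cell inside an inclusive bounding box."""
    return {(r, c) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)}


def gaze_energy(heatmap: List[List[float]], cells: Set[Cell]) -> float:
    """Sum the heatmap values over a set of cells (illustrative scoring rule)."""
    return sum(heatmap[r][c] for r, c in cells)


def pick_gazed_object(heatmap: List[List[float]],
                      regions: Dict[str, Set[Cell]]) -> str:
    """Pick the object whose region collects the most gaze energy."""
    return max(regions, key=lambda name: gaze_energy(heatmap, regions[name]))


# Hypothetical gaze heatmap, peaked at row 1, column 2 (inside object B).
HEATMAP = [
    [0.0, 0.0, 0.5, 0.0],
    [0.0, 0.5, 1.0, 0.5],
    [0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

# Pixel-level masks: A is an L shape (left column plus bottom row),
# B is a 2x2 block in the top-right corner.
MASKS = {
    "A": {(r, 0) for r in range(4)} | {(3, c) for c in range(4)},
    "B": box_cells(0, 2, 1, 3),
}

# Box-level regions: A's bounding box spans the whole grid and swallows B.
BOXES = {
    "A": box_cells(0, 0, 3, 3),
    "B": box_cells(0, 2, 1, 3),
}

print("box-level choice: ", pick_gazed_object(HEATMAP, BOXES))
print("mask-level choice:", pick_gazed_object(HEATMAP, MASKS))
```

With these assumed inputs, box-level scoring selects A because its box contains the whole gaze peak, while mask-level scoring selects B, since none of A's actual pixels receive any gaze energy.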

Original language	English
Title of host publication	Computer Vision – ECCV 2024 - 18th European Conference, Proceedings
Editors	Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol
Publisher	Springer Science and Business Media Deutschland GmbH
Pages	369-386
Number of pages	18
ISBN (Print)	9783031728891
DOIs	https://doi.org/10.1007/978-3-031-72890-7_23
State	Published - 2025
Event	18th European Conference on Computer Vision, ECCV 2024 - Milan, Italy; Duration: 29 Sep 2024 → 4 Oct 2024

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	15127 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	18th European Conference on Computer Vision, ECCV 2024
Country/Territory	Italy
City	Milan
Period	29/09/24 → 4/10/24

Keywords

Gaze object prediction
Object segmentation
Space-to-object gaze regression
Vision foundation model

Access to Document

10.1007/978-3-031-72890-7_23

Cite this

Jin, Y., Zhang, L., Yan, S., Fan, B., & Wang, B. (2025). Boosting Gaze Object Prediction via Pixel-Level Supervision from Vision Foundation Model. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds.), Computer Vision – ECCV 2024 - 18th European Conference, Proceedings (pp. 369-386). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 15127 LNCS). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-72890-7_23

Jin, Yang ; Zhang, Lei ; Yan, Shi et al. / Boosting Gaze Object Prediction via Pixel-Level Supervision from Vision Foundation Model. Computer Vision – ECCV 2024 - 18th European Conference, Proceedings. editor / Aleš Leonardis ; Elisa Ricci ; Stefan Roth ; Olga Russakovsky ; Torsten Sattler ; Gül Varol. Springer Science and Business Media Deutschland GmbH, 2025. pp. 369-386 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{f008adfa6bf54e64bafddac49a582898,

title = "Boosting Gaze Object Prediction via Pixel-Level Supervision from Vision Foundation Model",

keywords = "Gaze object prediction, Object segmentation, Space-to-object gaze regression, Vision foundation model",

author = "Yang Jin and Lei Zhang and Shi Yan and Bin Fan and Binglu Wang",

note = "Publisher Copyright: {\textcopyright} The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.; 18th European Conference on Computer Vision, ECCV 2024 ; Conference date: 29-09-2024 Through 04-10-2024",

year = "2025",

doi = "10.1007/978-3-031-72890-7_23",

language = "English",

isbn = "9783031728891",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Science and Business Media Deutschland GmbH",

pages = "369--386",

editor = "Ale{\v s} Leonardis and Elisa Ricci and Stefan Roth and Olga Russakovsky and Torsten Sattler and G{\"u}l Varol",

booktitle = "Computer Vision – ECCV 2024 - 18th European Conference, Proceedings",

}

Jin, Y, Zhang, L, Yan, S, Fan, B & Wang, B 2025, Boosting Gaze Object Prediction via Pixel-Level Supervision from Vision Foundation Model. in A Leonardis, E Ricci, S Roth, O Russakovsky, T Sattler & G Varol (eds), Computer Vision – ECCV 2024 - 18th European Conference, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 15127 LNCS, Springer Science and Business Media Deutschland GmbH, pp. 369-386, 18th European Conference on Computer Vision, ECCV 2024, Milan, Italy, 29/09/24. https://doi.org/10.1007/978-3-031-72890-7_23

Boosting Gaze Object Prediction via Pixel-Level Supervision from Vision Foundation Model. / Jin, Yang; Zhang, Lei; Yan, Shi et al.
Computer Vision – ECCV 2024 - 18th European Conference, Proceedings. ed. / Aleš Leonardis; Elisa Ricci; Stefan Roth; Olga Russakovsky; Torsten Sattler; Gül Varol. Springer Science and Business Media Deutschland GmbH, 2025. p. 369-386 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 15127 LNCS).

TY - GEN

T1 - Boosting Gaze Object Prediction via Pixel-Level Supervision from Vision Foundation Model

AU - Jin, Yang

AU - Zhang, Lei

AU - Yan, Shi

AU - Fan, Bin

AU - Wang, Binglu

N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.

PY - 2025

Y1 - 2025

KW - Gaze object prediction

KW - Object segmentation

KW - Space-to-object gaze regression

KW - Vision foundation model

UR - http://www.scopus.com/inward/record.url?scp=85212985609&partnerID=8YFLogxK

U2 - 10.1007/978-3-031-72890-7_23

DO - 10.1007/978-3-031-72890-7_23

M3 - Conference contribution

AN - SCOPUS:85212985609

SN - 9783031728891

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 369

EP - 386

BT - Computer Vision – ECCV 2024 - 18th European Conference, Proceedings

A2 - Leonardis, Aleš

A2 - Ricci, Elisa

A2 - Roth, Stefan

A2 - Russakovsky, Olga

A2 - Sattler, Torsten

A2 - Varol, Gül

PB - Springer Science and Business Media Deutschland GmbH

T2 - 18th European Conference on Computer Vision, ECCV 2024

Y2 - 29 September 2024 through 4 October 2024

ER -

Jin Y, Zhang L, Yan S, Fan B, Wang B. Boosting Gaze Object Prediction via Pixel-Level Supervision from Vision Foundation Model. In Leonardis A, Ricci E, Roth S, Russakovsky O, Sattler T, Varol G, editors, Computer Vision – ECCV 2024 - 18th European Conference, Proceedings. Springer Science and Business Media Deutschland GmbH. 2025. p. 369-386. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-031-72890-7_23