Micro-expression spotting with multi-scale local transformer in long videos

Xupeng Guo, Xiaobiao Zhang, Lei Li, Zhaoqiang Xia

Research output: Contribution to journal › Article › peer-review

19 Scopus citations

Abstract

Micro-expression analysis with computer vision techniques has attracted much attention because it can reveal human emotions automatically. Among the analysis tasks, temporal spotting, which locates the expression-aware frames in long video sequences, is the most challenging. Compared with the well-studied recognition task, the spotting task needs more research to improve its performance and to benefit subsequent tasks. In this paper, we propose a convolutional transformer-based deep model for micro-expression spotting in long video sequences. A 3D convolutional subnetwork first extracts visual features from the frames in a fixed-size sliding window of the original video sequence. A multi-scale local transformer module then operates on these visual features to model the correlation between frames within a local window. By leveraging this correlation information, the description of facial movement becomes more representative for micro-expressions of various durations. Finally, a multi-head classifier and the corresponding estimator are combined to predict the temporal positions of micro-expressions. The proposed method is evaluated on two publicly available datasets, CAS(ME)² and SAMM-LV, and achieves promising F1-scores of 0.2770 on SAMM-LV and 0.1373 on CAS(ME)². The code is publicly available on GitHub (https://github.com/xiazhaoqiang/MULT-MicroExpressionSpot).
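
The abstract describes two ideas that a small example can make concrete: cutting a long video into fixed-size sliding windows, and letting each frame attend only to nearby frames at several local window sizes. The sketch below is an illustration under stated assumptions, not the authors' implementation (see the GitHub repository for that). It uses plain Python lists in place of 3D-convolutional features, omits all learned projections, and the function names, the stride handling, and the radii (1, 2, 4) are hypothetical choices made for this example.

```python
import math


def sliding_windows(num_frames, window, stride):
    """Return (start, end) frame index pairs, end exclusive, for fixed-size windows."""
    if num_frames < window:
        return [(0, num_frames)]
    starts = list(range(0, num_frames - window + 1, stride))
    # Assumption: add one final window so the tail of the video is covered.
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]


def local_attention(features, radius):
    """Dot-product self-attention in which frame i attends only to frames
    within `radius` positions of i. Queries, keys and values are the raw
    features (no learned weights), which is a simplification."""
    n, d = len(features), len(features[0])
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        scores = [
            sum(a * b for a, b in zip(features[i], features[j])) / math.sqrt(d)
            for j in range(lo, hi)
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * features[j][k] for w, j in zip(weights, range(lo, hi))) / total
            for k in range(d)
        ])
    return out


def multi_scale_local_attention(features, radii=(1, 2, 4)):
    """Concatenate, per frame, the local-attention outputs of every radius."""
    per_scale = [local_attention(features, r) for r in radii]
    return [sum((scale[i] for scale in per_scale), []) for i in range(len(features))]


# Usage: an 11-frame clip split into windows of 4 frames with stride 3.
windows = sliding_windows(11, 4, 3)
# Usage: five 2-dimensional per-frame feature vectors fused over three scales.
frames = [[0.1, 0.2], [0.0, 0.3], [0.9, 0.8], [0.2, 0.1], [0.1, 0.1]]
fused = multi_scale_local_attention(frames)
```

Concatenating the per-radius outputs is only one way to fuse scales. It is used here because short radii respond to brief movements and longer radii to slower ones, which matches the motivation given above for handling micro-expressions of various durations.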

Original language: English
Pages (from-to): 146-152
Number of pages: 7
Journal: Pattern Recognition Letters
Volume: 168
State: Published - Apr 2023

Keywords

  • Convolutional network
  • Local transformer
  • Micro-expression spotting
