Abstract
Deep learning frameworks generally pad or truncate variable-length sequences to enable efficient batched training. However, padding incurs heavy memory consumption, while truncation inevitably discards part of the original semantic information. To address this dilemma, a variable-length sequence preprocessing framework based on semantic perception is proposed, which leverages a typical unsupervised learning method to reduce representations of different lengths to a single fixed size while minimizing information loss. Under the theoretical umbrella of minimizing information loss, information entropy is adopted to measure semantic richness, weights are assigned to the variable-length representations according to this richness, and the weighted representations are fused. Extensive experiments show that the proposed strategy loses less information than truncated embeddings, and that the method is clearly superior in retaining information and achieves promising performance on several text classification datasets.
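The abstract only outlines the mechanism, so the following is a minimal sketch of entropy-weighted fusion under stated assumptions: each token is given a probability distribution (e.g., over a vocabulary or topic space) whose Shannon entropy serves as a proxy for semantic richness, the entropies are normalized into weights, and the variable-length token embeddings are fused into one fixed-size vector. The function names, the per-token distributions, and the simple weighted sum are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def token_entropy(p):
    # Shannon entropy of one token's probability distribution
    # (assumed richness measure; the paper's exact formula is not given in the abstract).
    p = np.clip(p, 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

def entropy_weighted_fuse(token_embeddings, token_probs):
    # token_embeddings: (seq_len, dim) -- variable-length sequence of vectors.
    # token_probs:      (seq_len, k)   -- per-token distributions used to score richness.
    # Returns a single (dim,) vector, so every sequence maps to the same size
    # without padding or truncation.
    weights = np.array([token_entropy(p) for p in token_probs])
    weights = weights / (weights.sum() + 1e-12)   # normalize weights to sum to 1
    return weights @ token_embeddings             # entropy-weighted fusion

# Toy usage: sequences of different lengths end up as vectors of the same size.
rng = np.random.default_rng(0)
for seq_len in (5, 12):
    emb = rng.normal(size=(seq_len, 8))
    probs = rng.dirichlet(np.ones(50), size=seq_len)
    print(entropy_weighted_fuse(emb, probs).shape)  # (8,) in both cases
```

In this sketch, tokens whose distributions are more uncertain (higher entropy) contribute more to the fused vector; any unsupervised dimensionality-reduction step mentioned in the abstract would be applied on top of such fixed-size outputs.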
| Translated title of the contribution | A framework of variable-length sequence data preprocessing based on semantic perception |
|---|---|
| Original language | Chinese (Traditional) |
| Pages (from-to) | 388-397 |
| Number of pages | 10 |
| Journal | Xibei Gongye Daxue Xuebao/Journal of Northwestern Polytechnical University |
| Volume | 43 |
| Issue number | 2 |
| DOIs | |
| State | Published - Apr 2025 |