TY - JOUR
T1 - VaVLM
T2 - Toward Efficient Edge-Cloud Video Analytics With Vision-Language Models
AU - Zhang, Yang
AU - Wang, Hanling
AU - Bai, Qing
AU - Liang, Haifeng
AU - Zhu, Peican
AU - Muntean, Gabriel-Miro
AU - Li, Qing
N1 - Publisher Copyright:
© 1963-2012 IEEE.
PY - 2025
Y1 - 2025
AB - The advancement of Large Language Models (LLMs) with vision capabilities in recent years has elevated video analytics applications to new heights. To address the limited computing and bandwidth resources on edge devices, edge-cloud collaborative video analytics has emerged as a promising paradigm. However, most existing edge-cloud video analytics systems are designed for traditional deep learning models (e.g., image classification and object detection), where each model handles a specific task. In this paper, we introduce VaVLM, a novel edge-cloud collaborative video analytics system tailored for Vision-Language Models (VLMs), which can support multiple tasks using a single model. VaVLM aims to enhance the performance of VLM-powered video analytics systems in three key aspects. First, to reduce bandwidth consumption during video transmission, we propose a novel Region-of-Interest (RoI) generation mechanism based on the VLM’s understanding of the task and scene. Second, to lower inference costs, we design a task-oriented inference trigger that processes only a subset of video frames using an optimized inference logic. Third, to improve inference accuracy, the model is augmented with additional information from both the environment and auxiliary analytics models during the inference stage. Extensive experiments on real-world datasets demonstrate that VaVLM achieves an 80.3% reduction in bandwidth consumption and an 89.5% reduction in computational cost compared to baseline methods.
KW - edge computing
KW - large language model
KW - Video analytics
KW - vision-language model
UR - http://www.scopus.com/inward/record.url?scp=105002030999&partnerID=8YFLogxK
U2 - 10.1109/TBC.2025.3549983
DO - 10.1109/TBC.2025.3549983
M3 - Article
AN - SCOPUS:105002030999
SN - 0018-9316
JO - IEEE Transactions on Broadcasting
JF - IEEE Transactions on Broadcasting
ER -