VaVLM: Toward Efficient Edge-Cloud Video Analytics With Vision-Language Models

Yang Zhang, Hanling Wang, Qing Bai, Haifeng Liang, Peican Zhu, Gabriel Miro Muntean, Qing Li

Research output: Contribution to journal › Article › peer-review

Abstract

The advancement of Large Language Models (LLMs) with vision capabilities in recent years has elevated video analytics applications to new heights. To address the limited computing and bandwidth resources on edge devices, edge-cloud collaborative video analytics has emerged as a promising paradigm. However, most existing edge-cloud video analytics systems are designed for traditional deep learning models (e.g., image classification and object detection), where each model handles a specific task. In this paper, we introduce VaVLM, a novel edge-cloud collaborative video analytics system tailored for Vision-Language Models (VLMs), which can support multiple tasks using a single model. VaVLM aims to enhance the performance of VLM-powered video analytics systems in three key aspects. First, to reduce bandwidth consumption during video transmission, we propose a novel Region-of-Interest (RoI) generation mechanism based on the VLM’s understanding of the task and scene. Second, to lower inference costs, we design a task-oriented inference trigger that processes only a subset of video frames using an optimized inference logic. Third, to improve inference accuracy, the model is augmented with additional information from both the environment and auxiliary analytics models during the inference stage. Extensive experiments on real-world datasets demonstrate that VaVLM achieves an 80.3% reduction in bandwidth consumption and an 89.5% reduction in computational cost compared to baseline methods.

Original language: English
Journal: IEEE Transactions on Broadcasting
DOI
Publication status: Accepted/In press - 2025

