多尺度特征图融合的目标检测

Translated title of the contribution: Multiscale feature map fusion algorithm for target detection

Jiang Wentao, Zhang Chi, Zhang Shengchong, Liu Wanjun

Research output: Contribution to journal › Article › peer-review

11 Scopus citations

Abstract

Objective: The development and progress of science and technology have made it possible to obtain numerous images from imaging equipment, the Internet, or image databases, and have raised people's requirements for image processing. Consequently, image-processing technology has developed deeply, widely, and rapidly. Target detection is an important research topic in the field of computer vision. Rapid and accurate localization and recognition of specific targets in uncontrolled natural scenes is a vital functional basis of many artificial intelligence applications. However, several major difficulties remain in target detection. First, many small objects are widely distributed in visual scenes, and their presence challenges the agility and reliability of detection algorithms. Second, detection accuracy and speed are linked, and many technical bottlenecks must be overcome to balance these two factors. Finally, large-scale model parameters are an important factor restricting the deployment of deep networks on chips; compressing model size while ensuring detection accuracy is a meaningful and urgent problem. Targets with a simple background, sufficient illumination, and no occlusion are relatively easy to detect, whereas targets with a background that blends with the target, occlusion near the target, excessively weak illumination, or diverse target postures are difficult to detect. In natural scene images, the quality of feature extraction is the key factor determining the performance of target detection. Decades of research have produced increasingly robust detection algorithms, and deep learning has achieved great breakthroughs in computer vision in recent years. Target detection frameworks based on deep learning have become mainstream, and two main branches have been derived: algorithms based on candidate regions and algorithms based on regression. Most current detection algorithms use the powerful learning ability of convolutional neural networks (CNNs) to obtain prior knowledge of the target and perform detection according to that knowledge. The low-level features of CNNs are characterized by high resolution, weak abstract semantics, limited position information, and poor feature representation. High-level features are characterized by strong discriminability, low resolution, and a weak ability to detect small-scale targets. Therefore, in this study, the semantic information of context is transmitted by combining high- and low-level feature maps so that the semantic information becomes complete and evenly distributed.

Method: While balancing detection speed and accuracy, the multiscale feature map fusion target detection algorithm in this study takes the single-shot multibox detector (SSD) network structure as the base network and adds a feature fusion module to obtain feature maps with rich, evenly distributed semantic information. The semantic information of feature maps at different levels is transmitted from top to bottom by the feature fusion structure to reduce the semantic difference among levels. The original SSD network is first used to extract feature maps, which are then unified to 256 channels through a 1×1 convolution layer. The spatial resolution of the top-down feature maps is subsequently increased by deconvolution, so the feature maps coming from the two directions have the same spatial resolution.
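To make this alignment step concrete, the following is a minimal sketch, assuming a PyTorch implementation; the class name LateralAlign, the channel counts, and the SSD300-style map sizes in the example are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch: lateral 1x1 channel unification plus deconvolution
# upsampling, so that a low-level SSD feature map and the next higher-level
# map end up with the same channel count (256) and spatial resolution.
import torch
import torch.nn as nn

class LateralAlign(nn.Module):
    """Aligns a low-level feature map and a higher-level map before fusion."""
    def __init__(self, low_channels, high_channels, out_channels=256):
        super().__init__()
        # 1x1 convolutions unify both maps to 256 channels.
        self.lateral_low = nn.Conv2d(low_channels, out_channels, kernel_size=1)
        self.lateral_high = nn.Conv2d(high_channels, out_channels, kernel_size=1)
        # Deconvolution doubles the spatial resolution of the high-level map.
        self.upsample = nn.ConvTranspose2d(out_channels, out_channels,
                                           kernel_size=2, stride=2)

    def forward(self, low_feat, high_feat):
        low = self.lateral_low(low_feat)
        high = self.upsample(self.lateral_high(high_feat))
        return low, high

# Example with SSD300-like map sizes (38x38 conv4_3 map and 19x19 fc7 map).
low, high = LateralAlign(512, 1024)(torch.randn(1, 512, 38, 38),
                                    torch.randn(1, 1024, 19, 19))
print(low.shape, high.shape)  # both torch.Size([1, 256, 38, 38])
```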
Feature maps from the two directions are then fused by adding corresponding elements, yielding feature maps with complete and evenly distributed semantic information. The fused feature map is convolved with a 3×3 convolution kernel to reduce the aliasing effect of fusion. In this way, a feature map with strong semantic information is constructed while the details of the original feature map are retained. Finally, the predicted bounding boxes are aggregated, and non-maximum suppression is applied to obtain the final detection results.

Result: Key problems in the practical application of target detection algorithms and the difficulties of related detection tasks are analyzed according to the research progress and task requirements of visual target detection technology, and current solutions are given. The target detection algorithm based on multiscale feature map fusion in this study achieves good results when dealing with weak targets, multiple targets, cluttered backgrounds, occlusion, and other detection difficulties. Experiments are performed on the PASCAL VOC 2007 and 2012 data sets. The mean average precision (mAP) values of the proposed model are 78.9% and 76.7%, which are 1.4 and 0.9 percentage points higher than those of the classical SSD algorithm, respectively. In addition, the proposed method improves mAP by 8.3% over the classical SSD model when detecting small-scale targets, a significant improvement for small-scale detection.

Conclusion: The multiscale feature map fusion target detection algorithm proposed in this study uses a convolutional neural network to extract convolutional features instead of the traditional manual feature extraction process, expanding semantic information in a top-down manner and constructing feature maps with strong semantics. The model can be applied to new scene images in demanding visual tasks. By adopting the idea of deep convolutional neural networks, convolutional features replace traditional hand-crafted features, which avoids the feature selection problem of traditional detection methods; the deep convolutional features also have stronger expressive ability. The target detection model of multiscale feature map fusion is obtained through repeated iterative training on the basis of the SSD network, and it performs well on small-scale target detection tasks. While realizing end-to-end training of the detection algorithm, the model also improves its robustness to various complex scenes and the accuracy of target detection, thereby achieving accurate detection. This study provides a general and concise way to address the problem of small-scale target detection.
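The remaining steps of the Method, element-wise fusion, 3×3 smoothing, and the final non-maximum suppression, can be sketched in the same hypothetical PyTorch style as above; FuseAndSmooth, the example box coordinates, and the 0.45 IoU threshold (the value commonly used with SSD) are illustrative assumptions rather than details taken from the paper.

```python
# Hypothetical sketch continuing the previous one: element-wise addition of the
# two aligned maps, a 3x3 convolution to reduce the aliasing effect of fusion,
# and non-maximum suppression over the aggregated predicted boxes.
import torch
import torch.nn as nn
from torchvision.ops import nms

class FuseAndSmooth(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        # 3x3 convolution smooths the fused map (padding keeps the resolution).
        self.smooth = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, low, high):
        return self.smooth(low + high)  # element-wise addition, then smoothing

fused = FuseAndSmooth()(torch.randn(1, 256, 38, 38), torch.randn(1, 256, 38, 38))
print(fused.shape)  # torch.Size([1, 256, 38, 38])

# Final step: keep the best of the aggregated predicted boxes with NMS.
boxes = torch.tensor([[10., 10., 60., 60.],
                      [12., 12., 62., 62.],
                      [100., 100., 150., 150.]])
scores = torch.tensor([0.90, 0.80, 0.75])
keep = nms(boxes, scores, iou_threshold=0.45)
print(keep)  # indices of retained boxes: tensor([0, 2])
```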

Translated title of the contribution: Multiscale feature map fusion algorithm for target detection
Original language: Chinese (Traditional)
Pages (from-to): 1918-1931
Number of pages: 14
Journal: Journal of Image and Graphics
Volume: 24
Issue number: 11
State: Published - Nov 2019
Externally published: Yes
