
Neighborhood fusion-based hierarchical parallel feature pyramid network for object detection

Journal of Southeast University (English Edition) [ISSN: 1003-7985 / CN: 32-1325/N]

Volume:
36
Issue:
2020 03
Page:
252-263
Research Field:
Computer Science and Engineering
Publishing date:
2020-09-20

Info

Title:
Neighborhood fusion-based hierarchical parallel feature pyramid network for object detection
Author(s):
Mo Lingfei, Hu Shuming
School of Instrument Science and Engineering, Southeast University, Nanjing 210096, China
Keywords:
computer vision; deep convolutional neural network; object detection; hierarchical parallel feature pyramid network; multi-scale feature fusion
PACS:
TP391.4
DOI:
10.3969/j.issn.1003-7985.2020.03.002
Abstract:
In order to improve the detection accuracy of small objects, a neighborhood fusion-based hierarchical parallel feature pyramid network (NFPN) is proposed. Unlike the layer-by-layer structure adopted in the feature pyramid network (FPN) and the deconvolutional single shot detector (DSSD), where the bottom layer of the feature pyramid relies on the top layer, NFPN builds the feature pyramid with no connections between the upper and lower layers; that is, each level fuses only shallow features on similar scales. NFPN is highly portable and can be embedded in many models to further boost performance. Extensive experiments on the PASCAL VOC 2007, VOC 2012, and COCO datasets demonstrate that the NFPN-based SSD, without intricate tricks, can exceed the DSSD model in both detection accuracy and inference speed, especially for small objects, e.g., 4% to 5% higher mAP (mean average precision) than SSD and 2% to 3% higher mAP than DSSD. On the VOC 2007 test set, the NFPN-based SSD with a 300×300 input reaches 79.4% mAP at 34.6 frame/s, and the mAP rises to 82.9% with the multi-scale testing strategy.
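To make the fusion scheme described above concrete, the following is a minimal PyTorch-style sketch of adjacent-scale ("neighborhood") fusion without a top-down chain. The module name, channel counts, and the choice of resizing neighboring levels and merging them with concatenation followed by a 1×1 convolution are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of neighborhood (adjacent-scale) feature fusion.
# Assumptions: a backbone already yields a list of multi-scale feature maps;
# the concat + 1x1 conv merge and all channel counts are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NeighborhoodFusion(nn.Module):
    """Fuse each pyramid level only with its immediate neighbors.

    Unlike FPN/DSSD, there is no top-down chain: level i depends only on
    levels i-1 and i+1 resized to its own scale, so all levels can be
    computed in parallel.
    """

    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        # One 1x1 conv per level to merge the concatenated neighbor features.
        self.merge = nn.ModuleList(
            nn.Conv2d(sum(in_channels[max(0, i - 1):i + 2]), out_channels, 1)
            for i in range(len(in_channels))
        )

    def forward(self, feats):
        # feats: list of tensors [N, C_i, H_i, W_i], ordered fine to coarse.
        fused = []
        for i, f in enumerate(feats):
            neighbors = feats[max(0, i - 1):i + 2]
            # Resize every neighbor to the current level's spatial size.
            resized = [F.interpolate(n, size=f.shape[-2:], mode="bilinear",
                                     align_corners=False) for n in neighbors]
            fused.append(self.merge[i](torch.cat(resized, dim=1)))
        return fused


if __name__ == "__main__":
    # Toy multi-scale features on the same scales an SSD300 backbone produces.
    feats = [torch.randn(1, c, s, s) for c, s in [(512, 38), (1024, 19), (512, 10)]]
    nf = NeighborhoodFusion([512, 1024, 512])
    print([tuple(f.shape) for f in nf(feats)])
```

Because each output level reads only from its immediate neighbors, the fused pyramid can be attached to an existing detector head (e.g., SSD) without restructuring the backbone, which is what makes the scheme portable.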

References:

[1] Viola P, Jones M. Rapid object detection using a boosted cascade of simple features[C]//Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Kauai, HI, USA, 2001: 511-518. DOI:10.1109/CVPR.2001.990517.
[2] Viola P, Jones M J. Robust real-time face detection[J].International Journal of Computer Vision, 2004, 57(2): 137-154. DOI:10.1023/b:visi.0000013087.49260.fb.
[3] Dalal N, Triggs B. Histograms of oriented gradients for human detection[C]//2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego, CA, USA, 2005: 886-893. DOI:10.1109/CVPR.2005.177.
[4] Felzenszwalb P F, Girshick R B, McAllester D. Cascade object detection with deformable part models[C]//2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, CA, USA, 2010: 2241-2248. DOI:10.1109/CVPR.2010.5539906.
Fig. 9 Qualitative results for small objects on the COCO test-dev set. (a) Input images; (b) results of SSD300; (c) results of NFPN-SSD300
[5] Uijlings J R R, Sande K, Gevers T, et al. Selective search for object recognition[J]. International Journal of Computer Vision, 2013, 104(2): 154-171. DOI:10.1007/s11263-013-0620-5.
[6] Ren S Q, He K M, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[J].IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. DOI:10.1109/TPAMI.2016.2577031.
[7] Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA, 2014: 580-587. DOI:10.1109/CVPR.2014.81.
[8] He K M, Zhang X Y, Ren S Q, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition[J].IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(9): 1904-1916. DOI:10.1109/TPAMI.2015.2389824.
[9] Everingham M, van Gool L, Williams C K I, et al. The PASCAL visual object classes (VOC) challenge[J]. International Journal of Computer Vision, 2010, 88(2): 303-338. DOI:10.1007/s11263-009-0275-4.
[10] Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: common objects in context[M]//Computer Vision—ECCV 2014. Cham: Springer International Publishing, 2014: 740-755. DOI:10.1007/978-3-319-10602-1_48.
[11] Girshick R. Fast R-CNN[C]//2015 IEEE International Conference on Computer Vision.Santiago, Chile, 2015: 1440-1448. DOI:10.1109/ICCV.2015.169.
[12] Gidaris S, Komodakis N. Object detection via a multi-region and semantic segmentation-aware CNN model[C]//2015 IEEE International Conference on Computer Vision. Santiago, Chile, 2015: 1134-1142. DOI:10.1109/ICCV.2015.135.
[13] Dai J F, Li Y, He K M, et al. R-FCN: Object detection via region-based fully convolutional networks [J/OL]. arXiv preprint arXiv:1605.06409, 2016. https://arxiv.org/abs/1605.06409.
[14] He K M, Gkioxari G, Dollár P, et al. Mask R-CNN[C]//2017 IEEE International Conference on Computer Vision. Venice, Italy, 2017: 2980-2988. DOI:10.1109/ICCV.2017.322.
[15] Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA, 2017: 936-944. DOI:10.1109/CVPR.2017.106.
[16] Zhu Y S, Zhao C Y, Wang J Q, et al. CoupleNet: Coupling global structure with local parts for object detection[C]//2017 IEEE International Conference on Computer Vision. Venice, Italy, 2017: 4146-4154. DOI:10.1109/ICCV.2017.444.
[17] Liu W, Anguelov D, Erhan D, et al. SSD: Single shot multibox detector[M]//Computer Vision—ECCV 2016. Cham: Springer International Publishing, 2016: 21-37. DOI:10.1007/978-3-319-46448-0_2.
[18] Redmon J, Divvala S, Girshick R, et al. You only look once: Unified, real-time object detection[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA, 2016: 779-788. DOI:10.1109/CVPR.2016.91.
[19] Fu C Y, Liu W, Ranga A, et al. DSSD: Deconvolutional single shot detector [J/OL]. arXiv preprint arXiv:1701.06659, 2017. https://arxiv.org/abs/1701.06659.
[20] Jeong J, Park H, Kwak N. Enhancement of SSD by concatenating feature maps for object detection[C]//British Machine Vision Conference. London, UK, 2017. DOI:10.5244/C.31.76.
[21] Lin T Y, Goyal P, Girshick R, et al. Focal loss for dense object detection[C]//2017 IEEE International Conference on Computer Vision. Venice, Italy, 2017: 2999-3007. DOI:10.1109/ICCV.2017.324.
[22] Redmon J, Farhadi A. YOLO9000: Better, faster, stronger[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA, 2017: 6517-6525. DOI:10.1109/CVPR.2017.690.
[23] Shen Z Q, Liu Z, Li J G, et al. DSOD: Learning deeply supervised object detectors from scratch[C]//2017 IEEE International Conference on Computer Vision. Venice, Italy, 2017: 1937-1945. DOI:10.1109/ICCV.2017.212.
[24] Redmon J, Farhadi A. YOLOv3: An incremental improvement [J/OL]. arXiv preprint arXiv:1804.02767, 2018. https://arxiv.org/abs/1804.02767.
[25] Zhang S F, Wen L Y, Bian X, et al. Single-shot refinement neural network for object detection[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA, 2018: 4203-4212. DOI:10.1109/CVPR.2018.00442.
[26] Huang G, Liu Z, van der Maaten L, et al. Densely connected convolutional networks[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA, 2017: 2261-2269. DOI:10.1109/CVPR.2017.243.
[27] Tian Z, Shen C H, Chen H, et al. FCOS: Fully convolutional one-stage object detection[C]//2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea, 2019: 9626-9635. DOI:10.1109/ICCV.2019.00972.
[28] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition [J/OL]. arXiv preprint arXiv:1409.1556, 2014. https://arxiv.org/abs/1409.1556.
[29] Liu S, Qi L, Qin H F, et al. Path aggregation network for instance segmentation[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA, 2018: 8759-8768. DOI:10.1109/CVPR.2018.00913.
[30] Ghiasi G, Lin T Y, Le Q V. NAS-FPN: Learning scalable feature pyramid architecture for object detection[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, CA, USA, 2019: 7029-7038. DOI:10.1109/CVPR.2019.00720.
[31] Jia Y Q, Shelhamer E, Donahue J, et al. Caffe: convolutional architecture for fast feature embedding[C]//Proceedings of the ACM International Conference on Multimedia. Orlando, FL, USA, 2014: 675-678. DOI:10.1145/2647868.2654889.
[32] Liu W, Rabinovich A, Berg A C. ParseNet: Looking wider to see better [J/OL]. arXiv preprint arXiv:1506.04579, 2015. https://arxiv.org/abs/1506.04579.
[33] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA, 2016: 770-778. DOI:10.1109/CVPR.2016.90.
[34] Li Z, Zhou F. FSSD: Feature fusion single shot multibox detector [J/OL]. arXiv preprint arXiv:1712.00960, 2017. https://arxiv.org/abs/1712.00960.
[35] Bell S, Zitnick C L, Bala K, et al. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA, 2016: 2874-2883. DOI:10.1109/CVPR.2016.314.
[36] Chen K, Wang J, Pang J, et al. MMDetection: Open MMLab detection toolbox and benchmark [J/OL]. arXiv preprint arXiv:1906.07155, 2019. https://arxiv.org/abs/1906.07155v1.

Memo

Memo:
Biography: Mo Lingfei (1981—), male, doctor, associate professor, lfmo@seu.edu.cn.
Foundation item: The National Natural Science Foundation of China (No. 61603091).
Citation: Mo Lingfei, Hu Shuming. Neighborhood fusion-based hierarchical parallel feature pyramid network for object detection[J]. Journal of Southeast University (English Edition), 2020, 36(3): 252-263. DOI:10.3969/j.issn.1003-7985.2020.03.002.
Last Update: 2020-09-20