
Wang Chengcheng, Li He, Cao Yanpeng, Song Changjun, et al. WinoNet: Reconfigurable look-up table-based Winograd accelerator for arbitrary precision convolutional neural network inference[J]. Journal of Southeast University (English Edition), 2022, 38(4): 332-339. [doi:10.3969/j.issn.1003-7985.2022.04.002]

WinoNet: Reconfigurable look-up table-based Winograd accelerator for arbitrary precision convolutional neural network inference

Journal of Southeast University (English Edition)[ISSN:1003-7985/CN:32-1325/N]

Volume:
38
Issue:
4
Page:
332-339
Research Field:
Circuit and System
Publishing date:
2022-12-20

Info

Title:
WinoNet: Reconfigurable look-up table-based Winograd accelerator for arbitrary precision convolutional neural network inference
Author(s):
Wang Chengcheng, Li He, Cao Yanpeng, Song Changjun, Yu Feng, Tang Yongming
School of Electronic Science and Engineering, Southeast University, Nanjing 210096, China
Keywords:
quantized neural networks; look-up table (LUT)-based multiplier; Winograd algorithm; arbitrary precision
PACS:
TN492
DOI:
10.3969/j.issn.1003-7985.2022.04.002
Abstract:
To address the deployment problem caused by the high computational complexity of convolutional layers and the limited hardware resources available for network inference, a look-up table (LUT)-based convolution architecture built on a field-programmable gate array (FPGA) using integer multipliers and addition trees is proposed. With the help of the Winograd algorithm, convolution and multiplication are optimized to reduce the computational complexity. The LUT-based operator is further optimized to construct a processing unit (PE). Simultaneously, optimized storage streams improve memory-access efficiency and relieve bandwidth constraints, and the data toggle rate is reduced to lower power consumption. The experimental results show that using the Winograd algorithm to build the basic processing units significantly reduces the number of multipliers and accelerates hardware deployment, while time-division multiplexing of the processing units improves resource utilization. Under the experimental conditions, compared with the traditional convolution method, the architecture reduces computing-resource usage by a factor of 2.25 and improves peak throughput by a factor of 19.3. The LUT-based Winograd accelerator can therefore effectively solve the deployment problem caused by limited hardware resources.
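The multiplier savings described in the abstract come from Winograd minimal filtering. As an illustrative sketch (not taken from the paper; function names and values are assumptions), the classic 1-D case F(2,3) produces two outputs of a 3-tap convolution with four multiplications instead of six:

```python
# Illustrative sketch of Winograd F(2,3) minimal filtering:
# two outputs of a 3-tap convolution with 4 multiplications instead of 6.
# Not taken from the paper; names and test values are assumptions.

def direct_conv(d, g):
    """Direct 1-D convolution: 6 multiplications for two outputs."""
    return [d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
            d[1]*g[0] + d[2]*g[1] + d[3]*g[2]]

def winograd_f23(d, g):
    """Winograd F(2,3): 4 multiplications for the same two outputs."""
    # Filter transform (can be precomputed once per kernel).
    G0 = (g[0] + g[1] + g[2]) / 2
    G1 = (g[0] - g[1] + g[2]) / 2
    # The four multiplications.
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * G0
    m3 = (d[2] - d[1]) * G1
    m4 = (d[1] - d[3]) * g[2]
    # Inverse transform: additions only.
    return [m1 + m2 + m3, m2 - m3 - m4]

d, g = [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0]
print(direct_conv(d, g))   # [14.0, 20.0]
print(winograd_f23(d, g))  # [14.0, 20.0]
```

Nesting this transform in two dimensions, F(2×2, 3×3) computes a 2×2 output tile with 16 multiplications instead of 36, which matches the 2.25-fold reduction in computing resources reported in the abstract.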

References:

[1] Szegedy C, Vanhoucke V, Ioffe S, et al. Rethinking the inception architecture for computer vision[C]//IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA, 2016:2818-2826. DOI:10.1109/CVPR.2016.308.
[2] Wang E, Davis J J, Zhao R, et al. Deep neural network approximation for custom hardware:Where we’ve been, where we’re going[J]. ACM Computing Surveys, 2019, 52(2):1-39. DOI:10.1145/3309551
[3] Wang E, Davis J J, Cheung P, et al. LUTNet:Rethinking inference in FPGA soft logic[C]//IEEE Annual International Symposium on Field-Programmable Custom Computing Machines. San Diego, CA, USA, 2019:26-34. DOI:10.1109/FCCM.2019.00014.
[4] Hardieck M, Kumm M, Möller K, et al. Reconfigurable convolutional kernels for neural networks on FPGAs[C]//ACM International Symposium on Field-Programmable Gate Arrays. San Diego, CA, USA, 2019:43-52. DOI:10.1145/3289602.3293905.
[5] Cao Y, Wang C, Tang Y. Explore efficient LUT-based architecture for quantized convolutional neural networks on FPGA[C]//IEEE Annual International Symposium on Field-Programmable Custom Computing Machines. Fayetteville, AR, USA, 2020:232-232. DOI:10.1109/FCCM48280.2020.00065.
[6] Hormigo J, Caffarena G, Oliver J P, et al. Self-reconfigurable constant multiplier for FPGA[J]. ACM Transactions on Reconfigurable Technology and Systems, 2013, 6(3):1-17. DOI:10.1145/2490830.
[7] Liang Y, Lu L, Xiao Q, et al. Evaluating fast algorithms for convolutional neural networks on FPGAs[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019:1-10. DOI:10.1109/TCAD.2019.2897701.
[8] Xiao Q, Liang Y, Lu L, et al. Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs[C]//ACM Annual Design Automation Conference. Austin, TX, USA, 2017:1-6. DOI:10.1145/3061639.3062244.
[9] Yu J, Hu Y, Ning X, et al. Instruction driven cross-layer CNN accelerator with Winograd transformation on FPGA[C]//IEEE International Conference on Field Programmable Technology. Melbourne, Australia, 2017:227-230. DOI:10.1109/FPT.2017.8280147.
[10] Lu L, Liang Y. SpWA:An efficient sparse Winograd convolutional neural networks accelerator on FPGAs[C]//IEEE Design Automation Conference. San Francisco, CA, USA, 2018:1-6. DOI:10.1109/DAC.2018.8465842.
[11] Yao C, He J, Zhang X, et al. Cloud-DNN:An open framework for mapping DNN models to cloud FPGAs[C]//ACM International Symposium on Field-Programmable Gate Arrays. San Diego, CA, USA, 2019:73-82. DOI:10.1145/3289602.3293915.
[12] Yepez J, Ko S B. Stride 2 1-D, 2-D, and 3-D Winograd for convolutional neural networks[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2020, 28(99):853-863. DOI:10.1109/TVLSI.2019.2961602.
[13] Deng H, Wang J, Ye H, et al. 3D-VNPU:A flexible accelerator for 2D/3D CNNs on FPGA[C]//IEEE Annual International Symposium on Field-Programmable Custom Computing Machines. Orlando, FL, USA, 2021:181-185. DOI:10.1109/FCCM51124.2021.00029.
[14] Niu Y, Kannan R, Srivastava A, et al. Reuse kernels or activations:A flexible dataflow for low-latency spectral CNN acceleration[C]//ACM International Symposium on Field-Programmable Gate Arrays. San Diego, CA, USA, 2020:266-276. DOI:10.1145/3373087.3375302.
[15] Zhang X, Wang J, Chao Z, et al. DNNBuilder:An automated tool for building high-performance DNN hardware accelerators for FPGAs[C]//IEEE International Conference on Computer Aided Design. San Diego, CA, USA, 2018:1-8. DOI:10.1145/3240765.3240801.
[16] Lian X, Liu Z, Song Z, et al. High-performance FPGA-based CNN accelerator with block-floating-point arithmetic[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2019, 27(99):1874-1885. DOI:10.1109/TVLSI.2019.2913958.


Memo:
Biographies: Wang Chengcheng (1999—), male, graduate; Tang Yongming (corresponding author), male, doctor, professor, tym@seu.edu.cn
Foundation item: The Academic Colleges and Universities Innovation Program 2.0(No.BP0719013).
Citation: Wang Chengcheng, Li He, Cao Yanpeng, et al. WinoNet: Reconfigurable look-up table-based Winograd accelerator for arbitrary precision convolutional neural network inference[J]. Journal of Southeast University (English Edition), 2022, 38(4): 332-339. DOI:10.3969/j.issn.1003-7985.2022.04.002.
Last Update: 2022-12-20