Chinese Physical Society Journals


An FPGA-Based High-Energy-Efficiency Hardware Accelerator for Spiking Transformer


    Spiking Neural Networks (SNNs) exhibit significant potential in tasks such as dynamic vision, thanks to their characteristics of low power consumption, event-driven computing, and sparse computing. However, the algorithmic advantages of SNNs are still constrained by traditional computing architectures in practical deployments. To break through the hardware bottlenecks of event-driven computing in terms of energy efficiency and latency, this paper focuses on the Spikformer model and conducts algorithm and hardware co-optimization. We propose a general accelerator architecture for Spiking Transformer based on Field-Programmable Gate Array (FPGA). At the algorithmic level, by integrating convolutional layers with Batch Normalization (BN) layers and employing quantization-aware training, we compress the parameter size of the Spikformer-1-384 model from 15.92 MB to one-quarter of its original size, while keeping the accuracy loss within 1%. At the hardware level, a configurable accelerator tailored for spiking data streams is designed using Verilog. This accelerator supports parallel computation across multiple time steps and flexible combinations of convolutional, fully connected, residual, and attention operators, enhancing parallelism and storage bandwidth utilization efficiency. Experimental results show that on the Xilinx Zynq UltraScale+ MPSoC (xczu7ev-ffvc1156-2-i) platform, the accelerator achieves an end-to-end inference latency of approximately 53 ms on the CIFAR-10 dataset with a time step of 4. Specifically, the computation times for convolutional feature extraction and attention modules are 48 ms and 4.634 ms, respectively. The end-to-end system power consumption is 7.181 W, corresponding to an energy efficiency of 2.63 FPS/W, outperforming Intel i9 CPUs in both overall performance and energy efficiency. 
For self-attention and Multilayer Perceptron (MLP) computations, the accelerator achieves speedups of 1.70× over the GPU and 5.73× over the CPU, respectively. The project is open source at https://github.com/tooddler/FPGA_SpikingTransformer.
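The conv-BN fusion mentioned in the abstract is the standard folding identity: a BatchNorm applied after a convolution can be absorbed into the convolution's per-output-channel weights and bias, removing the BN layer at inference time. A minimal numpy sketch of that identity, shown for a 1×1 convolution expressed as a matrix multiply (function and variable names are illustrative, not taken from the project's repository):

```python
import numpy as np

def fuse_conv_bn(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm parameters into the preceding convolution.

    W has shape (out_ch, ...); gamma, beta, mean, var are the BN
    scale, shift, and running statistics, one value per output channel.
    """
    scale = gamma / np.sqrt(var + eps)                      # per-channel scale
    W_fused = W * scale.reshape(-1, *([1] * (W.ndim - 1)))  # scale each filter
    b_fused = (b - mean) * scale + beta                     # fold shift into bias
    return W_fused, b_fused

# Sanity check: conv followed by BN equals the fused conv on random data.
rng = np.random.default_rng(0)
out_ch, in_ch = 4, 3
W = rng.standard_normal((out_ch, in_ch))   # 1x1 conv acts as a matmul
b = rng.standard_normal(out_ch)
gamma = rng.standard_normal(out_ch)
beta = rng.standard_normal(out_ch)
mean = rng.standard_normal(out_ch)
var = rng.random(out_ch) + 0.1             # keep variance positive

x = rng.standard_normal(in_ch)
y_ref = gamma * ((W @ x + b - mean) / np.sqrt(var + 1e-5)) + beta
Wf, bf = fuse_conv_bn(W, b, gamma, beta, mean, var)
y_fused = Wf @ x + bf
assert np.allclose(y_ref, y_fused)
```

Because the fused weights are plain convolution parameters, they can then be quantized directly during quantization-aware training, which is how the abstract's 4× parameter-size reduction is obtained without a separate BN unit in hardware.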
