Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning

Bin Lin; Ningxin Zheng; Lei Wang; Shijie Cao; Lingxiao Ma; Quanlu Zhang; Yi Zhu; Ting Cao; Jilong Xue; Yuqing Yang; Fan Yang

Sixth Conference on Machine Learning and Systems (MLSys'23) | June 2023

N:M sparsity is becoming increasingly popular for its potential to deliver both high model accuracy and computational efficiency in deep learning. However, its real-world benefit is limited because dedicated GPU kernel implementations for general N:M sparsity with various sparsity ratios are lacking. In this work, we introduce nmSPARSE, a library of efficient GPU kernels for two fundamental operations in neural networks with N:M sparse weights: sparse matrix-vector multiplication (SpMV) and sparse matrix-matrix multiplication (SpMM). By exploiting the intrinsic balance characteristic of N:M sparsity, nmSPARSE kernels rearrange the irregular computation and scattered memory accesses of sparse matrix multiplication into hardware-aligned regular computation and conflict-free memory accesses at runtime. When evaluated on an NVIDIA A100 GPU, nmSPARSE kernels achieve up to 5.2× speedup on SpMV and 6.0× speedup on SpMM over the fastest baseline. End-to-end studies on transformer models demonstrate that nmSPARSE outperforms the baselines.
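
To make the N:M format and the SpMV operation concrete, the following is a minimal pure-Python sketch. It is not the nmSPARSE GPU kernel and does not reproduce its memory-access scheme. It assumes the common convention that at most N weights are kept in every group of M consecutive weights along a row, and it uses magnitude-based pruning purely for illustration. The function names `compress_n_m` and `spmv_n_m` are hypothetical and do not come from the library.

```python
def compress_n_m(dense, n, m):
    """Keep the n largest-magnitude weights in each group of m consecutive
    weights per row (illustrative pruning rule, an assumption).

    Returns (values, indices): for every row, the kept weights and their
    positions inside their group (0 .. m-1).
    """
    values, indices = [], []
    for row in dense:
        if len(row) % m != 0:
            raise ValueError("row length must be a multiple of m")
        row_vals, row_idx = [], []
        for start in range(0, len(row), m):
            group = row[start:start + m]
            # Rank positions by magnitude, keep n, then restore position order.
            by_magnitude = sorted(range(m), key=lambda j: -abs(group[j]))
            for j in sorted(by_magnitude[:n]):
                row_vals.append(group[j])
                row_idx.append(j)
        values.append(row_vals)
        indices.append(row_idx)
    return values, indices


def spmv_n_m(values, indices, x, n, m):
    """Multiply a compressed N:M sparse matrix by a dense vector x."""
    y = []
    for row_vals, row_idx in zip(values, indices):
        acc = 0.0
        for k, (v, j) in enumerate(zip(row_vals, row_idx)):
            # The k-th stored value belongs to group k // n of this row.
            acc += v * x[(k // n) * m + j]
        y.append(acc)
    return y


# Usage: a 2x8 matrix pruned to 2:4 sparsity.
W = [[1, -4, 0.5, 3, 2, 0, -1, 0.25],
     [0, 0, 5, 1, -2, -3, 0, 0]]
x = [1, 2, 3, 4, 5, 6, 7, 8]
vals, idx = compress_n_m(W, n=2, m=4)
y = spmv_n_m(vals, idx, x, n=2, m=4)  # [7.0, -9.0]
```

In this sketch every row stores exactly N values per group, so all rows carry the same amount of work. This is one reading of the balance characteristic mentioned above; how nmSPARSE maps that balance onto GPU threads and memory is described in the paper itself.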