Towards Efficient Vision Transformer Inference: A First Study of Transformers on Mobile Devices

HotMobile'22

Convolutional neural networks (CNNs) have long dominated the choice of models for on-device intelligent mobile applications. Recently, vision transformers, which are notable for their use of the self-attention mechanism, have developed rapidly and demonstrated superior accuracy over CNNs. However, vision transformers come with expensive computation costs, and their inference efficiency on resource-constrained mobile devices remains unclear. This uncertainty makes it hard for on-device intelligence to benefit from vision transformers.

In this work, we carry out the first empirical study investigating the possibility of efficiently deploying vision transformers on mobile devices. Our twofold study (i) profiles representative vision transformers to understand their inference performance on commercial mobile devices and the reasons behind it; and (ii) studies multi-dimensional DNN acceleration approaches to achieve minimal latency. Results show that vision transformer inference is too expensive on mobile devices: it is 1.58x-41x slower than CNNs. By removing redundant attention heads and FFN layers, DeiT-Tiny saves 23.2% latency with a negligible 0.75% accuracy loss. Our study provides 7 insightful findings for future efficient vision transformer optimization and design.
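The head-removal step mentioned above can be illustrated with a minimal sketch, assuming a DeiT-style attention block with a fused QKV projection. The `Attention` module, the `prune_heads` helper, and the head index dropped in the example are hypothetical stand-ins for illustration, not the study's actual implementation. Dropping a head removes its rows from the QKV projection and the matching columns from the output projection, which is where the latency saving comes from.

```python
import torch
import torch.nn as nn


class Attention(nn.Module):
    """Simplified DeiT-style multi-head self-attention with a fused QKV projection."""

    def __init__(self, dim, num_heads, head_dim=None):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = head_dim or dim // num_heads
        self.scale = self.head_dim ** -0.5
        inner = num_heads * self.head_dim
        self.qkv = nn.Linear(dim, inner * 3, bias=True)
        self.proj = nn.Linear(inner, dim, bias=True)

    def forward(self, x):
        B, N, _ = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)              # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2)  # (B, N, heads, head_dim)
        return self.proj(out.reshape(B, N, -1))


def prune_heads(attn, heads_to_drop):
    """Build a new Attention layer with the given head indices removed,
    copying the surviving weights from the original layer."""
    keep = [h for h in range(attn.num_heads) if h not in set(heads_to_drop)]
    hd = attn.head_dim
    dim = attn.proj.out_features
    old_chunk, new_chunk = attn.num_heads * hd, len(keep) * hd
    pruned = Attention(dim, len(keep), head_dim=hd)
    # Row indices (within one q/k/v chunk) that belong to the kept heads.
    rows = torch.cat([torch.arange(h * hd, (h + 1) * hd) for h in keep])
    with torch.no_grad():
        for c in range(3):  # q, k, v chunks of the fused projection
            pruned.qkv.weight[c * new_chunk:(c + 1) * new_chunk] = attn.qkv.weight[c * old_chunk + rows]
            pruned.qkv.bias[c * new_chunk:(c + 1) * new_chunk] = attn.qkv.bias[c * old_chunk + rows]
        # The output projection loses the input columns of the dropped heads.
        pruned.proj.weight.copy_(attn.proj.weight[:, rows])
        pruned.proj.bias.copy_(attn.proj.bias)
    return pruned


# Example: DeiT-Tiny uses dim=192 and 3 heads; dropping one head shrinks the
# QKV and output projections while keeping the block's input/output shape.
layer = Attention(dim=192, num_heads=3)
pruned = prune_heads(layer, heads_to_drop=[2])
x = torch.randn(1, 197, 192)  # 196 patch tokens + [CLS] for a 224x224 input
assert pruned(x).shape == layer(x).shape == (1, 197, 192)
```

Structured pruning of this kind shrinks the actual matrix multiplications, so the saving shows up directly in mobile inference latency, unlike unstructured weight sparsity, which typically needs specialized kernels to pay off.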