Snapchat · Nov 28, 2022

EfficientFormer: Vision Transformers at MobileNet Speed

Article Summary

Snap Research has demonstrated that Vision Transformers can run as fast as MobileNet on actual mobile devices. Yes, really.

The team tackled the biggest barrier to deploying Vision Transformers on mobile: speed. While ViTs deliver impressive accuracy, they've been notoriously slow compared to lightweight CNNs. This research redesigns transformers from the ground up for mobile inference.

Key Takeaways

Critical Insight

Properly designed transformers can now match MobileNet speeds on real devices while maintaining superior accuracy.

The secret lies in identifying which ViT operators are actually killing your mobile performance. The paper's on-device profiling points at expensive reshape operations, LayerNorm, and large-kernel strided patch embeddings, not attention alone.
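To make that idea concrete, here is a rough micro-benchmark in the spirit of that profiling. It times a few standard ViT operators at ViT-Base width with PyTorch on CPU; the authors measured on an iPhone 12 via CoreML, so treat this as a sketch of the method rather than a reproduction of their numbers.

```python
import time
import torch
import torch.nn as nn

# Illustrative micro-benchmark (not Snap's profiling code): time individual
# ViT-style operators to see which ones dominate inference cost.

def avg_latency_ms(fn, warmup=10, iters=50):
    """Average wall-clock latency of fn() in milliseconds."""
    with torch.inference_mode():
        for _ in range(warmup):
            fn()
        start = time.perf_counter()
        for _ in range(iters):
            fn()
    return (time.perf_counter() - start) / iters * 1e3

x = torch.randn(1, 196, 768)  # 14x14 patch tokens at ViT-Base width
norm = nn.LayerNorm(768)
attn = nn.MultiheadAttention(768, num_heads=12, batch_first=True)
mlp = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))

print(f"LayerNorm: {avg_latency_ms(lambda: norm(x)):6.2f} ms")
print(f"MHSA:      {avg_latency_ms(lambda: attn(x, x, x)):6.2f} ms")
print(f"MLP:       {avg_latency_ms(lambda: mlp(x)):6.2f} ms")
```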

About This Article

Problem

Vision Transformers carry massive parameter counts and a compute-heavy attention mechanism, so they run far slower than lightweight CNNs. That latency gap makes them impractical for real-time mobile applications.

Solution

Snap Research profiled ViT architectures end to end on device to pinpoint inefficient operators and design choices. Guided by that analysis, they built a dimension-consistent pure transformer (no MobileNet blocks) and applied latency-driven slimming to derive the EfficientFormer family.
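As a rough sketch of what dimension-consistent means in practice (my own PyTorch reading of the design, not Snap's released code): early stages stay in 4D conv-style feature maps and mix tokens with cheap pooling, then a single reshape hands a low-resolution 3D token sequence to real attention blocks, so the network never shuttles back and forth between layouts. The class names below are made up for illustration.

```python
import torch
import torch.nn as nn

class PoolMixerBlock(nn.Module):
    """4D stage block: input stays (B, C, H, W); pooling is the token mixer."""
    def __init__(self, dim):
        super().__init__()
        self.pool = nn.AvgPool2d(3, stride=1, padding=1)
        self.mlp = nn.Sequential(nn.Conv2d(dim, 4 * dim, 1), nn.GELU(),
                                 nn.Conv2d(4 * dim, dim, 1))

    def forward(self, x):
        x = x + (self.pool(x) - x)  # PoolFormer-style mixing; norms omitted for brevity
        return x + self.mlp(x)

class AttentionBlock(nn.Module):
    """3D stage block: input is (B, N, C) token sequences with real MHSA."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        n = self.norm1(x)
        x = x + self.attn(n, n, n, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class TinyDimensionConsistentNet(nn.Module):
    """Toy model: 4D conv/pool stages, one reshape, then 3D attention."""
    def __init__(self, dim=96, num_classes=1000):
        super().__init__()
        self.stem = nn.Conv2d(3, dim, 7, stride=4, padding=3)   # 224 -> 56
        self.stage_4d = nn.Sequential(PoolMixerBlock(dim), PoolMixerBlock(dim))
        self.down = nn.Conv2d(dim, dim, 3, stride=4, padding=1)  # 56 -> 14
        self.stage_3d = AttentionBlock(dim)   # MHSA only at low resolution
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        x = self.down(self.stage_4d(self.stem(x)))  # stays 4D the whole time
        x = x.flatten(2).transpose(1, 2)            # the one 4D -> 3D reshape
        x = self.stage_3d(x)
        return self.head(x.mean(dim=1))             # global average over tokens

logits = TinyDimensionConsistentNet()(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])
```

Latency-driven slimming then trims a design like this against measured on-device latency, rather than FLOPs or parameter counts, to produce the final model series.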

Impact

EfficientFormer-L1 reaches 79.2% top-1 accuracy on ImageNet-1K with 1.6 ms inference latency on an iPhone 12 (compiled with CoreML), running as fast as MobileNetV2×1.4 while beating its 74.7% top-1. The largest variant, EfficientFormer-L7, hits 83.3% in just 7.0 ms. Transformers really can reach MobileNet speed without giving up accuracy.
