EfficientFormer: Vision Transformers at MobileNet Speed
Article Summary
Snap Research has shown that Vision Transformers can run as fast as MobileNet on actual mobile devices. Yes, really.
The team tackled the biggest barrier to deploying Vision Transformers on mobile: speed. While ViTs deliver impressive accuracy, they've been notoriously slow compared to lightweight CNNs. This research redesigns transformers from the ground up for mobile inference.
Key Takeaways
- EfficientFormer-L1 hits 79.2% ImageNet accuracy in just 1.6ms on iPhone 12
- Runs as fast as MobileNetV2×1.4 while scoring 4.5 percentage points higher in accuracy
- Pure transformer design without hybrid MobileNet blocks
- Largest model (L7) achieves 83.3% accuracy with only 7.0ms latency
Properly designed transformers can now match MobileNet speeds on real devices while maintaining superior accuracy.
About This Article
Vision Transformers carry large parameter counts, and their self-attention cost grows quadratically with the number of tokens. As a result, they run far slower than lightweight CNNs, which makes them impractical for real-time mobile applications.
Snap Research analyzed ViT-based architectures to identify which design choices actually dominate on-device latency. Guided by that analysis, they proposed a dimension-consistent pure-transformer design and applied latency-driven slimming to derive the EfficientFormer family.
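The block internals aren't spelled out in this summary, so the sketch below only illustrates the dimension-consistent idea under assumptions: early stages keep tokens as 4D feature maps mixed with a cheap pooling operator, and only the final stage flattens to a 3D token sequence for multi-head self-attention. It's a minimal PyTorch sketch; the block names, pooling mixer, normalization choices, and dimensions are placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Block4D(nn.Module):
    """Early-stage block (illustrative): tokens stay as a 4D (B, C, H, W) feature map,
    mixed with cheap average pooling instead of attention, so no reshaping is needed."""
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.pool = nn.AvgPool2d(3, stride=1, padding=1, count_include_pad=False)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * mlp_ratio, 1),  # channel MLP as 1x1 convolutions
            nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, 1),
        )

    def forward(self, x):
        x = x + (self.pool(x) - x)  # pooling token mixer with residual connection
        x = x + self.mlp(x)
        return x

class Block3D(nn.Module):
    """Last-stage block (illustrative): the feature map is flattened to a (B, N, C)
    token sequence so standard multi-head self-attention can model global context."""
    def __init__(self, dim, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # global self-attention
        x = x + self.mlp(self.norm2(x))
        return x

# Dimension-consistent flow: 4D blocks first, one reshape, then 3D attention blocks.
feat = torch.randn(1, 96, 28, 28)          # spatial feature map in the early stages
feat = Block4D(96)(feat)
tokens = feat.flatten(2).transpose(1, 2)   # single reshape to (B, N, C) tokens
tokens = Block3D(96, num_heads=4)(tokens)
```

The point of the split is that expensive reshaping and attention are confined to the last stage, while most of the network runs on convolution-friendly 4D tensors that mobile hardware executes quickly.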
EfficientFormer-L1 reaches 79.2% ImageNet-1K accuracy with 1.6ms inference latency on iPhone 12. The L7 variant hits 83.3% accuracy in 7.0ms. This shows transformers can run as fast as MobileNet without losing accuracy.
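The reported numbers come from on-device measurements on an iPhone 12; that deployment pipeline isn't covered here, but as a rough point of comparison you can time a forward pass on your own hardware. A minimal sketch, assuming PyTorch (torchvision is used only for an example baseline model); this will not reproduce the paper's on-device figures.

```python
import time
import torch
import torchvision

def measure_latency_ms(model, input_shape=(1, 3, 224, 224), warmup=10, runs=50):
    """Average wall-clock time of one forward pass on the host machine,
    not the iPhone 12 setup reported in the paper."""
    model.eval()
    x = torch.randn(*input_shape)
    with torch.no_grad():
        for _ in range(warmup):          # warm-up iterations to stabilize timings
            model(x)
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
    return (time.perf_counter() - start) / runs * 1000.0

# Example: time a stock MobileNetV2 as a local baseline for comparison.
print(f"MobileNetV2: {measure_latency_ms(torchvision.models.mobilenet_v2()):.1f} ms")
```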