Fine-tuning vision-language models on memory-constrained devices
Article Summary
Rupak Vignesh Swaminathan and team from Amazon AGI tackled a major headache for mobile AI: how do you fine-tune vision-language models on devices that lack the memory for backpropagation? Their answer challenges conventional wisdom about what's possible on edge hardware.
Amazon researchers developed SharpZO, a hybrid optimization method that fine-tunes vision-language models using only forward passes—no backpropagation required. The technique combines evolutionary algorithms with zeroth-order optimization to smooth out the loss landscape, making it practical to train sophisticated AI models directly on memory-constrained devices like smartphones.
Key Takeaways
- SharpZO achieves up to 7% higher accuracy than existing forward-only methods
- Reaches target accuracy in 15.3 minutes versus 170 minutes for BlackVIP baseline
- Two-stage process: CMA-ES smooths the loss landscape, then a modified zeroth-order (ZO) method refines locally
- Performance approaches CoOp, a backpropagation-based method, on several benchmark tasks
- Robust to distribution shifts on adversarial and sketch recognition tasks
SharpZO enables edge devices to fine-tune vision-language models with only forward passes, achieving near-backpropagation accuracy while dramatically reducing memory requirements and training time.
About This Article
Zeroth-order optimization methods struggle when fine-tuning vision-language models for two reasons: their gradient estimates have high variance, and their purely local search makes the loss landscape look jagged, with many sharp peaks and valleys. Optimizers then get trapped in suboptimal local minima.
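To make "forward passes only" concrete, here is a minimal sketch of a generic two-point randomized zeroth-order gradient estimate. It is a textbook estimator, not SharpZO's exact formulation. The names `zo_gradient` and `quadratic`, and the parameters `mu` and `num_samples`, are illustrative assumptions, and a toy quadratic stands in for a real model's loss.

```python
import random

def zo_gradient(loss, theta, mu=1e-3, num_samples=8, rng=None):
    """Estimate the gradient of `loss` at `theta` from loss evaluations only.

    Generic two-point randomized estimator (an assumption for illustration,
    not the paper's exact method). Each sample costs two forward passes.
    """
    rng = rng or random.Random(0)
    d = len(theta)
    grad = [0.0] * d
    for _ in range(num_samples):
        # Random Gaussian perturbation direction.
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        plus = loss([t + mu * ui for t, ui in zip(theta, u)])
        minus = loss([t - mu * ui for t, ui in zip(theta, u)])
        # Directional finite difference along u.
        scale = (plus - minus) / (2.0 * mu)
        for i in range(d):
            grad[i] += scale * u[i] / num_samples
    return grad

def quadratic(theta):
    """Toy stand-in for a model loss; its true gradient is 2 * theta."""
    return sum(t * t for t in theta)
```

With few samples the estimate is noisy, which is the high-variance problem described above. Averaging more samples reduces the noise at the cost of more forward passes.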
Amazon researchers took the standard CMA-ES (covariance matrix adaptation evolution strategy) algorithm and added a worst-case loss term to smooth the loss landscape during initialization. They then used normalized sparse zeroth-order gradient estimation to refine the search and filter out outlier estimates.
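The two stages described above can be sketched roughly as follows. This is a heavily simplified illustration under stated assumptions, not the authors' implementation. Stage 1 uses a plain isotropic evolution strategy as a stand-in for CMA-ES (no covariance adaptation), and the worst-case term is approximated by random probes around each candidate. Every function name, hyperparameter, and the toy loss is hypothetical.

```python
import math
import random

def sharpness_aware_loss(loss, theta, rho, lam, rng, probes=4):
    """Loss at theta plus lam times the worst loss seen within radius rho.

    The random-probe approximation of the worst case is an assumption.
    """
    base = loss(theta)
    worst = base
    for _ in range(probes):
        u = [rng.gauss(0.0, 1.0) for _ in theta]
        norm = math.sqrt(sum(x * x for x in u)) or 1.0
        worst = max(worst, loss([t + rho * x / norm for t, x in zip(theta, u)]))
    return base + lam * worst

def es_warmup(loss, mean, rng, generations=30, pop=12, elite=4,
              sigma=0.5, rho=0.1, lam=1.0):
    """Stage 1: simplified evolution strategy (NOT full CMA-ES)."""
    for _ in range(generations):
        candidates = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean]
                      for _ in range(pop)]
        # Rank by the sharpness-aware score, so flat regions are preferred.
        candidates.sort(key=lambda c: sharpness_aware_loss(loss, c, rho, lam, rng))
        elites = candidates[:elite]
        mean = [sum(col) / elite for col in zip(*elites)]
        sigma *= 0.95  # shrink the search radius over time
    return mean

def sparse_normalized_zo_step(loss, theta, lr, mu, keep, rng, num_samples=4):
    """Stage 2: perturb only `keep` coordinates, then take a unit-norm step."""
    d = len(theta)
    active = rng.sample(range(d), keep)  # sparse mask of coordinates
    grad = [0.0] * d
    for _ in range(num_samples):
        u = [0.0] * d
        for i in active:
            u[i] = rng.gauss(0.0, 1.0)
        plus = loss([t + mu * x for t, x in zip(theta, u)])
        minus = loss([t - mu * x for t, x in zip(theta, u)])
        scale = (plus - minus) / (2.0 * mu)
        for i in active:
            grad[i] += scale * u[i] / num_samples
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(theta)
    # Normalizing bounds the step, so one outlier estimate cannot blow up the update.
    return [t - lr * g / norm for t, g in zip(theta, grad)]

def two_stage_minimize(loss, theta0, seed=0, zo_steps=100):
    rng = random.Random(seed)
    theta = es_warmup(loss, list(theta0), rng)
    lr = 0.05
    for _ in range(zo_steps):
        theta = sparse_normalized_zo_step(loss, theta, lr, mu=1e-2, keep=2, rng=rng)
        lr *= 0.98
    return theta

def toy_loss(theta):
    """Hypothetical jagged loss: a quadratic bowl with a sinusoidal ripple."""
    return sum(t * t + 0.3 * math.sin(8.0 * t) for t in theta)
```

Calling `two_stage_minimize(toy_loss, [3.0, -3.0, 2.0, -1.0])` runs the global warm-up first and then the local refinement, using only evaluations of `toy_loss`. In a real setting each evaluation would be a forward pass of the model over a batch.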
SharpZO converged faster than other forward-only methods. It reached target accuracy on ImageNet in 15.3 minutes, compared to 19 minutes for ZIP and 170 minutes for BlackVIP. On several benchmarks, its accuracy approached that of backpropagation-based methods such as CoOp.