Amazon • Jan 8, 2026

Fine-tuning vision-language models on memory-constrained devices

Article Summary

Rupak Vignesh Swaminathan and team from Amazon AGI just solved a major headache for mobile AI: how do you fine-tune vision-language models on devices that can't handle backpropagation? Their answer challenges conventional wisdom about what's possible on edge hardware.

Amazon researchers developed SharpZO, a hybrid optimization method that fine-tunes vision-language models using only forward passes—no backpropagation required. The technique combines evolutionary algorithms with zeroth-order optimization to smooth out the loss landscape, making it practical to train sophisticated AI models directly on memory-constrained devices like smartphones.

Key Takeaways

SharpZO achieves up to 7% higher accuracy than existing forward-only methods
Reaches target accuracy on ImageNet in 15.3 minutes versus 170 minutes for the BlackVIP baseline
Two-stage process: CMA-ES smooths the loss landscape, then a modified ZO method refines locally
Performance approaches CoOp, a backpropagation-based method, on several benchmark tasks
Robust to distribution shifts on adversarial and sketch recognition tasks

Critical Insight

SharpZO enables edge devices to fine-tune vision-language models with only forward passes, achieving near-backpropagation accuracy while dramatically reducing memory requirements and training time.

The researchers reveal why traditional zeroth-order methods fail and how evolutionary strategies borrowed from biology solve the high-variance gradient problem.

About This Article

Problem

Zeroth-order optimization methods struggle when fine-tuning vision-language models for two reasons: their gradient estimates have high variance, and their purely local search sees a jagged loss landscape full of sharp peaks and valleys. That jaggedness traps the optimizer in suboptimal local minima.
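
To make the variance problem concrete, here is a minimal sketch of a generic two-point zeroth-order gradient estimate, which needs only loss evaluations (forward passes). It is not the estimator from the paper: `toy_loss`, the step size `mu`, and the query counts are illustrative assumptions, and the quadratic stands in for a real model's loss.

```python
import random


def zo_gradient(loss, theta, mu=1e-3, num_queries=1, rng=random):
    """Two-point random-direction gradient estimate from loss values only."""
    dim = len(theta)
    grad = [0.0] * dim
    for _ in range(num_queries):
        # Sample a random Gaussian direction and probe the loss on both sides.
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        plus = loss([t + mu * ui for t, ui in zip(theta, u)])
        minus = loss([t - mu * ui for t, ui in zip(theta, u)])
        scale = (plus - minus) / (2.0 * mu)
        # Average the per-direction estimates over all queries.
        for i in range(dim):
            grad[i] += scale * u[i] / num_queries
    return grad


def toy_loss(theta):
    # Hypothetical stand-in for a model's forward pass.
    return sum(t * t for t in theta)


def mean_squared_error(estimate, truth):
    return sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth)


rng = random.Random(0)
theta = [1.0] * 20
true_grad = [2.0 * t for t in theta]  # analytic gradient of toy_loss

# One query gives a very noisy estimate; averaging many queries reduces the error.
err_single = mean_squared_error(
    zo_gradient(toy_loss, theta, num_queries=1, rng=rng), true_grad)
err_averaged = mean_squared_error(
    zo_gradient(toy_loss, theta, num_queries=500, rng=rng), true_grad)
print(err_single, err_averaged)
```

Averaging more queries lowers the error, but every query costs two extra forward passes, which is why high variance is expensive on a memory-constrained device.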

Solution

Amazon researchers took the standard CMA-ES algorithm and added a worst-case loss term to smooth the loss landscape during initialization. They then used normalized sparse zeroth-order gradient estimation to refine the search and filter out outlier estimates.
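
The two-stage idea can be sketched on a toy problem as follows. This is a rough illustration under stated assumptions, not the authors' implementation: stage 1 uses a simplified isotropic evolution strategy in place of full CMA-ES (which also adapts a covariance matrix), the worst-case loss is approximated by a few random probes within a radius `rho`, and stage 2 normalizes each sparse estimate to unit length so that a single outlier cannot produce an oversized step. All function names, the `toy_loss`, and the hyperparameters are hypothetical.

```python
import math
import random


def toy_loss(theta):
    # Hypothetical stand-in for a forward pass: a bowl with small bumps.
    return sum(t * t + 0.3 * math.sin(8.0 * t) for t in theta)


def worst_case_loss(loss, theta, rho, probes, rng):
    # Approximate the largest loss within radius rho by random probing.
    worst = loss(theta)
    for _ in range(probes):
        d = [rng.gauss(0.0, 1.0) for _ in theta]
        norm = math.sqrt(sum(x * x for x in d)) or 1.0
        worst = max(worst, loss([t + rho * x / norm for t, x in zip(theta, d)]))
    return worst


def stage1_evolution(loss, theta, rng, generations=30, population=16,
                     sigma=0.5, rho=0.1, probes=4):
    # Simplified evolution strategy ranked by worst-case loss (assumption:
    # isotropic sampling with a decaying step size instead of CMA-ES).
    mean = list(theta)
    elite = population // 4
    for _ in range(generations):
        candidates = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean]
                      for _ in range(population)]
        candidates.sort(key=lambda c: worst_case_loss(loss, c, rho, probes, rng))
        best = candidates[:elite]
        mean = [sum(c[i] for c in best) / elite for i in range(len(mean))]
        sigma *= 0.93
    return mean


def stage2_sparse_zo(loss, theta, rng, steps=200, lr=0.02, mu=1e-3, keep=0.5):
    # Local refinement with sparse, normalized zeroth-order estimates.
    theta = list(theta)
    best, best_loss = list(theta), loss(theta)
    for _ in range(steps):
        # Sparse direction: perturb only a random subset of coordinates.
        u = [rng.gauss(0.0, 1.0) if rng.random() < keep else 0.0 for _ in theta]
        plus = loss([t + mu * x for t, x in zip(theta, u)])
        minus = loss([t - mu * x for t, x in zip(theta, u)])
        g = [(plus - minus) / (2.0 * mu) * x for x in u]
        norm = math.sqrt(sum(x * x for x in g))
        if norm > 0.0:
            # Unit-length step: an outlier estimate cannot blow up the update.
            theta = [t - lr * x / norm for t, x in zip(theta, g)]
        current = loss(theta)
        if current < best_loss:
            best, best_loss = list(theta), current
    return best


rng = random.Random(0)
start = [2.0] * 10
warm = stage1_evolution(toy_loss, start, rng)     # global, smoothed search
refined = stage2_sparse_zo(toy_loss, warm, rng)   # local refinement
print(toy_loss(start), toy_loss(warm), toy_loss(refined))
```

Ranking candidates by their worst nearby loss rather than their own loss biases the warm-up toward flatter regions, which gives the noisy local stage a better-behaved starting point.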

Impact

SharpZO converged faster than other forward-only methods. It reached target accuracy on ImageNet in 15.3 minutes, compared to 19 minutes for ZIP and 170 minutes for BlackVIP. On several benchmark tasks, its accuracy approached that of backpropagation-based methods.