One tenth of one percent. That is all it takes. Out of billions of parameters inside a large language model, only a tiny, specially chosen sliver, roughly 0.1%, carries the information that actually matters when the model needs to learn something new. The rest is, in a sense, dead weight during the update. Recognising this has led a team of researchers at Stevens Institute of Technology to an algorithm they call MEERKAT, which might, perhaps more quietly than most AI breakthroughs, change how these models are trained at scale.
The problem MEERKAT addresses is unglamorous but genuinely consequential. Federated learning is the technique that lets many different institutions or devices collaborate on training a shared AI model without anyone having to hand over their raw data. Hospitals can pool medical knowledge; schools can improve tutoring software; research groups can share insights across borders, all without sharing the patient records or student files that make those insights possible. The catch is that making it work requires the participants to constantly synchronise their versions of the model, and that synchronisation is, currently, a bandwidth nightmare.
“It’s too much data to share,” says Yide Ran, the PhD candidate who drove the project at Stevens. “It’s like sending in an entire encyclopedia when you only need to change a few entries. But you really don’t need to do that.” Standard federated learning transmits the entire model, billions of parameters, every time collaborators need to sync. Those transmissions run into gigabytes. Because of the cost, synchronisation happens infrequently, which means the collaborating models drift apart between updates, each pulling in directions shaped by its own local data. The technical term for this is Non-IID drift, and it degrades the final model’s quality in ways that can be surprisingly hard to recover from.
The 0.1% That Does the Work
MEERKAT’s core insight is that not all parameters are equally worth updating. The team identified, using gradients from the model’s pre-training phase, which tiny fraction of parameters is most sensitive to the loss, that is, most likely to shift meaningfully in response to new data. These sensitive parameters turned out to be highly concentrated: the top 0.1% had average squared gradients roughly 52 times larger than the next tier. Focus updates on those, ignore the rest, and you can shrink the transmitted update from gigabytes to megabytes.
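The selection step can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the squared-gradient statistics here are simulated with a heavy-tailed distribution, whereas MEERKAT builds its mask from real pre-training gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for per-parameter average squared gradients
# accumulated during pre-training; a lognormal gives the kind of
# heavy-tailed concentration the paper reports.
avg_sq_grads = rng.lognormal(mean=0.0, sigma=2.0, size=1_000_000)

sparsity = 0.001                      # keep the top 0.1% of parameters
k = int(sparsity * avg_sq_grads.size)

# Indices of the k parameters with the largest average squared gradient.
top_idx = np.argpartition(avg_sq_grads, -k)[-k:]

mask = np.zeros(avg_sq_grads.size, dtype=bool)
mask[top_idx] = True

print(mask.sum())  # 1000 parameters selected out of a million
```

Only the masked parameters are updated and transmitted, which is where the gigabytes-to-megabytes shrinkage comes from.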
“So you are no longer sending the entire encyclopedia when only a few key definitions have changed,” says Denghui Zhang, an assistant professor in information systems and analytics at Stevens who advised on the project. His co-advisor, Zhaozhuo Xu, an assistant professor of computer science, points to the downstream benefit: “Because updates are so tiny, data can be now sent back and forth more often. The result is a much better shared model.”
The communication reduction is over 1,000-fold in some configurations. Updates that previously consumed gigabytes of bandwidth can now be transmitted as a few megabytes. But the efficiency gain is only half the story. The other half involves backpropagation, the standard mathematical process AI uses to correct its own errors during training, which requires the model to run calculations backward through its entire network, caching intermediate values along the way. Memory-hungry, energy-intensive work. MEERKAT bypasses it entirely by using zeroth-order optimisation: instead of computing gradients analytically, it simply nudges the model parameters slightly in one direction, checks whether performance improved, then uses that comparison to guide the next step. No backward pass. No gradient caching. Considerably less energy.
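The forward-passes-only idea can be made concrete with a minimal sketch. This uses a generic two-point (SPSA-style) zeroth-order estimator on a toy quadratic loss, restricted to a masked subset of parameters; the paper's actual scheme and hyperparameters may differ, and the loss here stands in for a real model's training loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(theta):
    # Toy quadratic loss standing in for a language model's training loss.
    return np.sum((theta - 1.0) ** 2)

theta = np.zeros(10)           # the full "model"
mask = np.zeros(10, dtype=bool)
mask[:2] = True                # pretend these are the sensitive parameters

eps, lr = 1e-3, 0.1
for _ in range(500):
    # Random perturbation direction restricted to the masked parameters.
    z = np.where(mask, rng.standard_normal(10), 0.0)
    # Two forward passes, no backward pass, no gradient caching.
    g_hat = (loss(theta + eps * z) - loss(theta - eps * z)) / (2 * eps)
    theta -= lr * g_hat * z    # estimated gradient is g_hat * z

print(theta[:2])  # masked entries approach the optimum at 1.0
```

Each step costs two forward evaluations of the loss, which is why memory use stays close to inference-level: nothing from the forward pass needs to be cached for a backward sweep.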
There is a subtlety here that took some working through. Zeroth-order methods are generally less precise than backpropagation; they estimate rather than calculate. Applied naively to a full model, that imprecision can destabilise training. But applied to an extremely sparse, carefully chosen subset of parameters? The researchers found, perhaps counter-intuitively, that it actually outperforms full-parameter zeroth-order approaches on most tasks. Something about the concentrated sensitivity of the chosen parameters makes the rough estimation good enough and then some. The paper, published at the International Conference on Learning Representations, tested this across three different language models (LLaMA-3.2-1B, Qwen2-1.5B, and Gemma2-2B) and seven benchmarks, and MEERKAT beat the standard approaches in the large majority of conditions.
Reading the Gradient Tea Leaves
The team also tackled the drift problem directly with what they call MEERKAT-VP. When collaborating devices have very different data, some might be trained almost entirely on one type of input while others see a balanced spread; those outliers pull the shared model in misleading directions. MEERKAT-VP uses something called a virtual path to track how each participant’s model evolves during local training, without ever accessing the underlying data itself. From these paths, the server can compute a metric called GradIP, which measures how a client’s estimated gradients align with the original pre-training gradients. Clients with highly skewed data turn out to produce GradIP scores that steadily decay toward zero; clients with balanced data oscillate. The pattern is distinctive enough to be used as a diagnostic. Those with extreme skew are then given reduced influence in the next synchronisation round, their local training limited to a single step, which substantially improves overall model quality.
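The decay-versus-oscillation pattern can be illustrated with simulated trajectories. Everything below is hypothetical: GradIP is rendered as a plain inner product (the paper's exact normalisation may differ), and the two clients' gradient trajectories are synthetic curves built to show the two behaviours, not outputs of real local training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-training gradient over the sparse parameter subset.
g_pre = rng.standard_normal(1000)

def grad_ip(g_client):
    # GradIP: alignment of a client's estimated gradient with the
    # pre-training gradient, here as a plain dot product.
    return float(np.dot(g_client, g_pre))

def noise():
    return 0.01 * rng.standard_normal(1000)

# Synthetic trajectories over 50 local steps: a skewed client's
# alignment decays toward zero; a balanced client's oscillates.
skewed = [grad_ip(np.exp(-0.1 * t) * g_pre + noise()) for t in range(50)]
balanced = [grad_ip((0.5 + 0.3 * np.sin(t)) * g_pre + noise()) for t in range(50)]

def is_skewed(scores, tail=10, tol=0.05):
    # Flag clients whose late-step GradIP has collapsed relative to its start.
    return np.mean(np.abs(scores[-tail:])) < tol * abs(scores[0])

print(is_skewed(skewed), is_skewed(balanced))  # True False
```

A server-side check along these lines needs only the GradIP scores, never the client's data, which is what makes it usable as a privacy-preserving diagnostic.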
Not all of this is settled. The approach relies on the assumption that the most sensitive parameters identified during pre-training remain the most relevant parameters for downstream fine-tuning, and while the experiments support this, the mechanism is not fully understood. There’s also the question of whether the GradIP diagnostic holds up across much larger and more heterogeneous networks; the experiments used ten clients, which is a manageable number, but real federated deployments sometimes involve thousands of participants with wildly varying computational resources.
Still, the implications are worth taking seriously. AI training has a substantial and growing energy footprint, and most public attention goes to the enormous centralised data centres at the top of that footprint. Federated learning represents a different model, one that distributes training across many smaller devices, and could in principle be far more efficient, but only if the communication overhead can be brought under control. MEERKAT pushes meaningfully in that direction. For resource-constrained institutions, the kind that cannot afford to upload and download gigabytes of model data repeatedly, a thousandfold reduction in communication cost is the difference between participation and exclusion. Healthcare systems in countries with limited bandwidth, educational platforms serving schools with modest infrastructure, cross-border research collaborations that cannot share patient data across jurisdictions, all of these stand to benefit if the approach proves out at scale.
Whether MEERKAT travels the usual path from conference paper to widespread adoption depends on how well it generalises beyond the benchmarks tested so far. But the underlying observation, that a vanishingly small fraction of an AI’s parameters do the heavy lifting when it comes to learning, has a kind of elegant parsimony that suggests it might be pointing at something deep about how these models work.
Source: Yide Ran et al., “Mitigating Non-IID Drift in Zeroth-Order Federated LLM Fine-Tuning with Transferable Sparsity,” ICLR 2026. https://openreview.net/forum?id=2DuMBKVbX2
Frequently Asked Questions
Why does it matter that hospitals and schools can’t share their raw data when training AI?
Privacy regulations in most countries prohibit sharing identifiable patient records, student files, or other sensitive information with outside parties, even for beneficial research purposes. Federated learning works around this by training AI models locally on each institution’s data and sharing only mathematical updates rather than the data itself. The problem until now has been that those updates were enormous, making frequent synchronisation impractical and limiting which institutions could realistically participate.
Could training an AI on 0.1% of its parameters actually work as well as training all of them?
That’s the counterintuitive finding at the heart of this research. The team identified that the most sensitive parameters during pre-training carry a disproportionate share of the gradient signal, with average squared gradients roughly 52 times higher than the next group. Restricting updates to these parameters not only reduces communication costs dramatically but actually improves performance compared to updating everything, because the sparse approach concentrates the limited signal available from zeroth-order estimation where it matters most.
What is zeroth-order optimisation and why does it use less energy?
Standard AI training uses backpropagation, which computes gradients by running calculations backward through the entire network and caching large amounts of intermediate data. Zeroth-order methods skip all of that; they simply perturb the model slightly, observe whether performance improved, and use the result to guide the next step. This requires only forward passes through the model, which needs far less memory and considerably less compute, making it better suited to devices with limited resources.
How does MEERKAT know which participants have unreliable data?
It uses a signal called GradIP, which measures how closely a participant’s estimated training gradients align with the gradients the model used during its original pre-training. Participants with highly skewed or imbalanced data tend to produce GradIP scores that steadily decline toward zero over their local training steps, while participants with balanced data produce scores that fluctuate. The server can detect this pattern without ever seeing the participants’ actual data, then limit the influence of problematic clients in the next synchronisation round.
Is this approach limited to healthcare and education, or could it apply more broadly?
The underlying problem, training AI models collaboratively without sharing private data while managing bandwidth constraints and unequal data quality, arises in many domains. Financial institutions comparing fraud patterns, legal firms sharing case insights, and manufacturing systems pooling equipment performance data all face versions of the same challenge. MEERKAT’s authors note the approach is designed to be transferable; the sensitive parameter mask worked across multiple different calibration datasets including code and medical text, suggesting it isn’t narrowly specialised.
