Large models can “become faster with more use.” SpeedupLLM verifies this for the first time, cutting the inference budget by up to 56%.
The longer a large language model (LLM) is used, the faster it can become. Researchers at Emory University have proposed the SpeedupLLM framework, which combines dynamic computational resource allocation with memory mechanisms to cut LLM inference cost by up to 56% on similar tasks while improving accuracy, offering a new direction for the development of AI models.
In human cognition, proficiency means being faster and more efficient.
Take the seemingly complex Rubik's Cube: after just a few dozen practice sessions, a person can solve it blindfolded. Faced with a math problem we have already solved several times, we can often recall the solution instantly and answer within seconds.
So, can large language models achieve the same?
Researchers Bo Pan and Liang Zhao from Emory University recently published an exciting result: large language models also show a proficiency effect, and they can indeed "become faster with more use"!
Paper link: https://arxiv.org/abs/2505.20643
The paper is the first to systematically verify that, given "experience", LLMs can not only maintain their performance but also significantly reduce inference time and compute, revealing a new paradigm in which AI, too, can improve with practice.
How to make an LLM proficient?
To systematically verify the "proficiency acceleration effect", the authors proposed a unified framework to construct and quantify the "usage experience" under three types of memory mechanisms.
The framework consists of two parts: one is dynamic computational resource allocation during inference, and the other is the memory mechanism.
For dynamic computational resource allocation, the paper systematically extends several existing test-time scaling methods so that the compute budget can vary per query, allowing LLMs to allocate fewer computational resources to familiar problems.
For memory, the framework lets the model draw on past experience to accelerate the current inference.
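As a rough illustration of how these two parts could fit together, here is a minimal Python sketch (not the paper's code): a store of past episodes, a simple similarity-based budget rule, and a placeholder `call_llm` standing in for any budgeted test-time scaling method.

```python
# Minimal sketch of the two parts described above (not the paper's code): a store
# of past episodes, a similarity-based budget rule, and a placeholder LLM call
# standing in for any budgeted test-time scaling method.
from difflib import SequenceMatcher

memory = []  # (question, answer) pairs accumulated over use


def similarity(a: str, b: str) -> float:
    """Cheap string similarity as a stand-in for a real retriever/embedder."""
    return SequenceMatcher(None, a, b).ratio()


def retrieve(question: str, k: int = 3):
    """Return the k past episodes most similar to the current question."""
    return sorted(memory, key=lambda qa: similarity(question, qa[0]), reverse=True)[:k]


def allocate_budget(question: str, max_budget: int = 8) -> int:
    """Spend the full budget on novel questions, much less on familiar ones."""
    hits = retrieve(question)
    best = max((similarity(question, q) for q, _ in hits), default=0.0)
    return max(1, round(max_budget * (1.0 - best)))


def call_llm(question: str, context, budget: int) -> str:
    """Hypothetical stand-in for a budgeted LLM call (e.g. Best-of-N with N=budget)."""
    return f"<answer using {budget} samples and {len(context)} retrieved episodes>"


def solve(question: str) -> str:
    budget = allocate_budget(question)
    answer = call_llm(question, retrieve(question), budget)
    memory.append((question, answer))  # experience accumulates across queries
    return answer


print(solve("What is 12 * 7?"))  # novel question: full budget
print(solve("What is 12 * 7?"))  # repeated question: budget shrinks to 1
```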
Over multiple rounds of use, can large models, like humans, "become faster from experience"? Is there a way to systematically improve efficiency rather than simply adding more computing power?
Research highlight 1: Save computing power with experience
When running inference on repeated or similar tasks, the researchers found that LLMs can cut the inference budget by up to 56% by drawing on past experience (including memory caches, in-context memory, etc.) while maintaining or even improving accuracy.
This means that the model can avoid many detours when handling "familiar" tasks, answering accurately and quickly.
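In its simplest form, "avoiding detours on familiar tasks" can be pictured as an answer cache for exactly repeated questions. The sketch below is only an illustration of that idea; `generate_answer` is a hypothetical stand-in for a real, budgeted LLM call.

```python
# Toy illustration only: the simplest form of "experience" is an answer cache
# keyed on a normalized question, so an exactly repeated query costs no new
# generation at all. `generate_answer` is a hypothetical stand-in for a real LLM call.
answer_cache: dict[str, str] = {}


def normalize(question: str) -> str:
    return " ".join(question.lower().split())


def generate_answer(question: str) -> str:
    """Placeholder for an actual LLM call."""
    return "<generated answer>"


def answer(question: str) -> str:
    key = normalize(question)
    if key in answer_cache:              # familiar question: reuse, zero extra cost
        return answer_cache[key]
    result = generate_answer(question)   # unfamiliar question: pay full inference cost
    answer_cache[key] = result
    return result


print(answer("What is the capital of France?"))
print(answer("what  is the capital of   France?"))  # hits the cache after normalization
```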
Research highlight 2: Systematic large-scale experiments
To verify how general the effect is, the researchers investigated:
Multiple test-time scaling methods, including Self-Refine, Best-of-N, Tree-of-Thoughts, and the latest Long Chain-of-Thought (o1-style thinking); see the budget-aware sketch after this list
Multiple types of memory, including supervised fine-tuning (SFT), retrieval of past experiences, and three types of self-reflection (Reflection)
Multiple levels of question similarity, including 1) exactly the same, 2) the same meaning but different wording, 3) the same question but with different numbers, 4) different questions that require the same knowledge to answer.
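As an illustration of how one of the listed methods could be made budget-aware, the sketch below shrinks the N of Best-of-N as more similar questions accumulate in memory. The sampler and the majority vote are toy stand-ins, not the paper's implementation.

```python
# Sketch of making one listed method, Best-of-N, budget-aware: N shrinks as more
# similar questions accumulate in memory. The sampler and majority vote are toy
# stand-ins, not the paper's implementation.
import random
from collections import Counter


def sample_answer(question: str) -> str:
    """Hypothetical stand-in for one stochastic LLM sample."""
    return random.choice(["42", "42", "41"])  # dummy candidate answers


def best_of_n(question: str, n_seen_similar: int, n_max: int = 8) -> str:
    # Shrink the sample count as similar solved questions pile up in memory.
    n = max(1, n_max // (1 + n_seen_similar))
    candidates = [sample_answer(question) for _ in range(n)]
    # Majority vote stands in for a learned scoring/reward model.
    return Counter(candidates).most_common(1)[0][0]


print(best_of_n("What is 6 * 7?", n_seen_similar=0))  # unfamiliar: 8 samples
print(best_of_n("What is 6 * 7?", n_seen_similar=3))  # familiar: 2 samples
```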
All of these mechanisms showed significant inference acceleration, demonstrating how widespread the phenomenon is.
Experimental results
In tasks such as "repeated Q&A" and "step-by-step reasoning", the more often a task repeats, the faster the model's inference and the better its results, and this trend becomes more pronounced as experience accumulates.
The experiments yielded the following eight key findings:
Finding 1: LLMs can really "become faster with more use"!
The experimental results show that with appropriate memory mechanisms and compute-budget control strategies, LLMs can save up to 56% of the inference cost when handling repeated or similar tasks. Significant acceleration appeared in 64 out of 80 experimental settings, a coverage rate of 80%, verifying how general "experience-based acceleration" is.
Finding 2: Faster does not mean worse; it often means more accurate!
Surprisingly, the reduction in inference cost not only did not sacrifice accuracy but generally came with an improvement in accuracy. The Pearson correlation coefficient between inference cost and accuracy improvement was -0.41 (p = 0.0002), indicating that "faster" also means "more stable" and "more accurate".
Finding 3: The higher the similarity, the more obvious the speed-up
The study designed four similarity levels, from completely repeated (S1) to significantly different in structure (S4). The results showed that acceleration was most significant for S1 and S2 questions (saving 16.0% and 15.4% of the computation, respectively), while the effect was weakest for S4 questions, whose different structure means memory does not transfer directly.
Finding 4: When the question similarity is low, the memory mechanism may backfire
When questions differ too much, the memory mechanism can mislead the model, increasing inference cost and reducing accuracy. This effect was significant in some S4 settings, suggesting that more memory is not always better: it has to be selected carefully and used well.
Finding 5: Episodic memory > reflective memory for accelerating inference
In the comparison of different memory mechanisms, episodic memory (such as SFT and In-Context) performed better at accelerating inference. For example, In-Context saved an average of 27.4% of the computation, while reflective memory saved only 3.6%-8.8%. This is consistent with psychological research: when forming proficient skills, humans initially rely on episodic memory of specific instances.
Finding 6: In-Context is more efficient than SFT
In low-sample settings (1-3 rounds), In-Context learning generalizes better and overfits less than SFT. In this study's inference-speed measurements in particular, In-Context was faster, more stable, and more accurate, demonstrating the strong immediate adaptability of non-parametric memory.
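A minimal sketch of what in-context (episodic) memory can look like in practice: retrieve a few similar previously solved questions and prepend them to the prompt, with no weight updates. The toy episodes and the string-overlap retriever below are illustrative assumptions, not the paper's setup.

```python
# Sketch of in-context (episodic) memory: retrieve a few similar solved questions
# and prepend them to the prompt, with no weight updates. The toy episodes and the
# string-overlap retriever are illustrative assumptions, not the paper's setup.
from difflib import SequenceMatcher

episodes = [
    ("If a train travels 60 km in 1.5 h, what is its speed?", "60 / 1.5 = 40 km/h"),
    ("A rectangle is 3 cm by 5 cm. What is its area?", "3 * 5 = 15 cm^2"),
]


def top_k_similar(question: str, k: int = 2):
    """Rank past episodes by crude string overlap with the new question."""
    return sorted(
        episodes,
        key=lambda qa: SequenceMatcher(None, question, qa[0]).ratio(),
        reverse=True,
    )[:k]


def build_prompt(question: str) -> str:
    """Prepend retrieved (question, solution) pairs as in-context demonstrations."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in top_k_similar(question))
    return f"{shots}\n\nQ: {question}\nA:"


print(build_prompt("If a car travels 90 km in 2 h, what is its speed?"))
```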
Finding 7: Text-based memory tends to hit a ceiling, while parameter-based memory keeps getting faster
Text-based memory methods such as reflection and In-Context run into a context-window bottleneck, and their effect gradually saturates after about three added cases. In contrast, SFT stores what it learns through weight updates, is not limited by the window, and its inference speed keeps improving with experience.
Finding 8: The more "generalized" the reflection, the more obvious the speed-up
Among the three reflection mechanisms, Reflect-Update performed best, because it continuously summarizes abstract rules rather than accumulating specific numbers or cases. This kind of highly generalized reflection transfers more easily across tasks and helps acceleration, which is worth keeping in mind when designing better reflection mechanisms in the future.
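As a sketch of the Reflect-Update idea described above: keep one evolving block of abstract rules and ask the model to merge each new lesson into it, instead of appending raw per-question cases. The `llm` function here is a hypothetical placeholder for a real completion call.

```python
# Sketch of the Reflect-Update idea: keep one evolving block of abstract rules and
# ask the model to merge each new lesson into it, rather than appending raw
# per-question cases. `llm` is a hypothetical placeholder for a real completion call.
def llm(prompt: str) -> str:
    """Placeholder for an actual LLM API call."""
    return "(the model's rewritten rules or answer would be returned here)"


rules = ""  # a single, continuously rewritten reflection


def reflect_update(question: str, attempt: str, feedback: str) -> None:
    """Fold the latest episode into the shared rules, keeping them abstract."""
    global rules
    prompt = (
        "Current general rules:\n" + (rules or "(none yet)") + "\n\n"
        f"New episode:\nQuestion: {question}\nAttempt: {attempt}\nFeedback: {feedback}\n\n"
        "Rewrite the rules so they stay short, abstract, and free of "
        "question-specific numbers or names."
    )
    rules = llm(prompt)


def solve_with_rules(question: str) -> str:
    """Condition the next answer on the distilled rules instead of raw past cases."""
    return llm(f"Rules learned from past problems:\n{rules}\n\nSolve:\n{question}")
```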
Give LLMs "memory" and "proficiency"
This study proposed a new paradigm worthy of attention:
Inference efficiency can be improved not only by adding hardware but also by learning from history.
In repetitive scenarios such as customer service, search, and medical consultations, deploying "memory-based LLMs" will bring lower response latency, lower compute consumption, and stronger adaptability and personalization.
This study not only fills a gap in existing inference-acceleration research but also offers new ideas for building AI models with "human-like proficiency".
Reference materials:
https://arxiv.org/abs/2505.20643
This article is from the WeChat official account "New Intelligence Yuan". Author: New Intelligence Yuan. Republished by 36Kr with permission.