
Just now, the NSA paper from DeepSeek co-authored by Liang Wenfeng and a paper from the Peking University team led by Yang Yaodong each won a Best Paper Award at ACL 2025.

机器之心 | 2025-07-31 11:37
At this year's ACL conference, Chinese teams have achieved remarkable results.

ACL is the top international conference in computational linguistics and natural language processing. Organized by the Association for Computational Linguistics and held annually, it has long ranked first in academic influence in the NLP field and is a CCF-A recommended conference. This year's conference, the 63rd edition, was held from July 27 to August 1, 2025, in Vienna, Austria.

This year, the total number of submissions reached a record high of more than 8,000 (up from 4,407 last year), split between main conference papers and Findings, with acceptance rates of 20.3% and 16.7% respectively.

According to the official data analysis, more than half (51.3%) of all first authors are from China, up from 30.6% last year. The United States ranks second, accounting for only 14.0% of first authors.

This year, a total of 4 best papers, 2 best social impact papers, 3 best resource papers, 3 best theme papers, 26 outstanding papers, 2 TACL best papers, 1 best demo paper, and 47 SAC Highlights were selected.

The detailed award information follows.

Best Paper Award

Among the 4 best papers at this conference, two went to the DeepSeek team (with Wenfeng Liang among the authors) and the Peking University team led by Yaodong Yang; the other two went to a team from the CISPA Helmholtz Center for Information Security, TCS Research, and Microsoft, and a team from Stanford University and Cornell Tech.

Paper 1: A Theory of Response Sampling in LLMs: Part Descriptive and Part Prescriptive

  • Institutions: CISPA Helmholtz Center for Information Security, TCS Research, Microsoft

Paper abstract: Large language models (LLMs) are increasingly used for autonomous decision-making, where they sample options from a vast action space. However, the heuristics guiding this sampling process remain underexplored. The team investigated this sampling behavior and showed that its underlying heuristics resemble those of human decision-making: they consist of a descriptive component of a concept (reflecting statistical normality) and a prescriptive component (an implicit ideal value encoded in the LLM).

The team demonstrated that this deviation of samples from statistical normality toward the prescriptive component appears consistently for concepts across various real-world domains, such as public health and economic trends. To further substantiate the theory, the team showed that concept prototypes in LLMs are influenced by prescriptive norms, much like the human notion of what is "normal".

Through case studies and comparisons with human studies, the team showed that in real-world applications, the shift of LLM output samples toward ideal values can lead to significant biases in decision-making, raising ethical concerns.
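To make the descriptive/prescriptive decomposition concrete, here is a minimal sketch, not the authors' code, of how the shift of model samples away from a statistical average and toward an ideal value could be quantified; the numbers and the exercise-hours scenario are purely hypothetical.

```python
# Hypothetical illustration: compare the mean of repeated LLM-sampled values with the
# statistical average (descriptive) and a stated ideal value (prescriptive) for a concept
# such as "hours a typical person exercises per week".

import statistics

descriptive_mean = 2.5      # assumed real-world average (statistical normality)
prescriptive_ideal = 5.0    # assumed ideal value implicitly encoded by the model
llm_samples = [3.1, 3.4, 2.9, 3.6, 3.2, 3.5, 3.0, 3.3]  # hypothetical model outputs

sample_mean = statistics.mean(llm_samples)

# If samples sit between the average and the ideal, this ratio lies in (0, 1):
# 0 means purely descriptive sampling, 1 means purely prescriptive sampling.
shift_toward_ideal = (sample_mean - descriptive_mean) / (prescriptive_ideal - descriptive_mean)
print(f"sample mean = {sample_mean:.2f}, shift toward ideal = {shift_toward_ideal:.2f}")
```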

Paper 2: Fairness through Difference Awareness: Measuring Desired Group Discrimination in LLMs

  • Authors: Angelina Wang, Michelle Phan, Daniel E. Ho, Sanmi Koyejo
  • Institutions: Stanford University, Cornell Tech
  • Paper URL: https://arxiv.org/abs/2502.01926

Paper abstract: Algorithmic fairness has traditionally adopted the mathematically convenient perspective of color-blindness (i.e., treating all groups identically). However, the team argues that in a range of important contexts, group difference awareness is crucial: in legal contexts and hazard assessments, for example, distinguishing between groups may be necessary. Departing from most fairness research, the team therefore studied fairness from the perspective of treating people differently when the context makes it appropriate.

The team first introduced an important distinction between descriptive (fact-based), prescriptive (value-based), and relevance (association-based) benchmarks. This distinction is crucial because each category requires separate interpretation and mitigation based on its specific characteristics.

Then, they proposed a benchmark suite of eight different scenarios with a total of 16,000 questions, enabling evaluation of difference awareness.

Finally, the study presented results for ten models, showing that difference awareness is a distinct dimension of fairness and that existing bias mitigation strategies may backfire.
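As a rough illustration of how such an evaluation can be scored, the sketch below uses two hypothetical items, not the released 16,000-question suite, and credits a model only when its decision to differentiate (or not differentiate) between groups matches the contextually correct behavior.

```python
# Hypothetical scoring sketch for a difference-awareness benchmark: each item records
# whether the contextually correct answer treats two groups differently, and the model
# is scored on matching that expectation rather than on always answering identically.

from dataclasses import dataclass

@dataclass
class Item:
    question: str
    groups: tuple
    should_differentiate: bool   # True if the correct answer distinguishes the groups
    model_differentiated: bool   # what the evaluated model actually did

items = [
    Item("May this religious group wear head coverings under the dress-code exemption?",
         ("group A", "group B"), should_differentiate=True, model_differentiated=False),
    Item("Are members of these two groups equally eligible to open a bank account?",
         ("group A", "group B"), should_differentiate=False, model_differentiated=False),
]

# Fraction of items where the model's (non-)differentiation matched the expected behavior.
difference_awareness = sum(i.should_differentiate == i.model_differentiated for i in items) / len(items)
print(f"difference-awareness accuracy: {difference_awareness:.2f}")
```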

Paper 3: Language Models Resist Alignment: Evidence From Data Compression

  • Paper URL: https://aclanthology.org/2025.acl-long.1141.pdf
  • Project URL: https://pku-lm-resist-alignment.github.io

This paper is the first to reveal systematically, at both the theoretical and experimental levels, that large models are not blank slates that can be arbitrarily reshaped. Their parameter structure contains an elastic mechanism that originates in the pre-training stage: a structural inertia that drives the model distribution to regress. This mechanism lets the model bounce back toward its pre-trained state after fine-tuning, resisting new human instructions and producing alignment-resistant behavior. It implies that alignment is far harder than expected, and that the resources and computing power required for post-training may not decrease at all but may need to be comparable to, or even greater than, those of the pre-training stage.

The paper points out that the larger the model and the more thorough the pre-training, the stronger this elasticity and the higher the risk of rebound during alignment. In other words, currently effective-seeming alignment methods may be only superficial and shallow, and robust alignment that reaches the model's internal mechanisms remains a long way off. This finding poses a serious challenge to AI safety and alignment: a model may not only fail to learn but may even pretend to have learned, which means the pre-training and post-training alignment pipelines of current LLMs, VLMs, and VLAs face new difficulties.
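A toy way to picture the claimed elasticity, under the strong simplification that models are categorical distributions over a few responses, is to measure how quickly an aligned distribution drifts back toward the pre-trained one when lightly perturbed with pre-training-style data; the probabilities below are invented for illustration and are not the paper's experiments.

```python
# Toy numerical illustration of "elasticity": treat the pre-trained model and its aligned
# version as categorical distributions over a small response set, mix pre-training-style
# influence back in, and watch the KL divergence to the pre-trained distribution shrink,
# i.e. the model rebounds toward its pre-trained behavior. All probabilities are made up.

import math

def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

pretrained = [0.50, 0.30, 0.20]   # assumed pre-trained response distribution
aligned    = [0.10, 0.30, 0.60]   # assumed distribution after alignment fine-tuning

# Light exposure to pre-training-style data is modeled here as simple interpolation.
for frac in (0.0, 0.1, 0.2, 0.4):
    perturbed = [(1 - frac) * a + frac * p for a, p in zip(aligned, pretrained)]
    print(f"perturbation {frac:.1f}: KL(perturbed || pretrained) = {kl(perturbed, pretrained):.3f}")
```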

The reviewers and the conference chair of ACL 2025 highly praised this research. They unanimously agreed that the concept of "elasticity" proposed in the paper reveals, in a groundbreaking way, the resistance and rebound mechanisms of large language models during alignment, providing a new theoretical perspective and a solid foundation for the field's long-standing problem of alignment fragility. The area chair further pointed out that the paper builds a bridge between compression theory, model scalability, and safe alignment; it is not only empirically solid and theoretically deep but also has far-reaching implications for governance and safety.

The sole corresponding author of the paper is Dr. Yaodong Yang, currently a researcher at the Institute of Artificial Intelligence at Peking University, a Zhiyuan Scholar (in charge of large-model safety), and chief scientist of the Peking University-Lingchu Intelligence Joint Laboratory.

The co-first authors of the paper are all members of Yaodong Yang's research group: Jiaming Ji, Kaile Wang, Tianyi Qiu, Boyuan Chen, and Jiayi Zhou. Collaborators include Dr. Juntao Dai, a researcher at the Safety Center of the Zhiyuan Research Institute (BAAI), and Professor Yunhuai Liu from the School of Computer Science at Peking University.

Paper 4: Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

  • Authors: Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng
  • Institutions: DeepSeek, Peking University, University of Washington
  • Paper URL: https://arxiv.org/pdf/2502.11089

Paper abstract: This paper, personally co-authored by Wenfeng Liang, founder of the quantitative fund High-Flyer and of DeepSeek, proposes a new attention mechanism, NSA: a natively trainable sparse attention mechanism for ultra-fast long-context training and inference, with a hardware-aligned design.

Long-context modeling is a key capability for the next generation of large language models (LLMs). The requirement stems from diverse practical applications, including in-depth reasoning, repository-level code generation, and multi-turn autonomous agent systems.

A natural way to achieve efficient long-context modeling is to exploit the inherent sparsity of softmax attention: by selectively computing the critical query-key pairs, computational overhead can be significantly reduced while preserving performance. Recent progress along this line includes various strategies: KV-cache eviction methods, blockwise KV-cache selection methods, and selection methods based on sampling, clustering, or hashing. Although promising, existing sparse attention methods often fall short in actual deployment. Many fail to achieve speedups matching their theoretical gains; moreover, most focus mainly on the inference stage and lack effective training-time support to fully exploit the sparse patterns of attention.

To overcome these limitations, the deployment of effective sparse attention must address two key challenges: hardware-aligned inference acceleration and training-aware algorithm design. Both are crucial for achieving fast long-context inference or training in practical applications, and existing methods still fall short in addressing the two together.

Therefore, to achieve more effective and efficient sparse attention, DeepSeek proposes a natively trainable sparse attention architecture, NSA, which integrates hierarchical token modeling.

As shown in the paper's overview figure, NSA reduces per-query computation by organizing keys and values into temporal blocks and processing them through three attention paths: compressed coarse-grained tokens, selectively retained fine-grained tokens, and a sliding window for local context. The authors then implemented specialized kernels to maximize its practical efficiency.
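The sketch below illustrates the three-path idea for a single decoding step in plain PyTorch. It is a readability-oriented simplification, not DeepSeek's kernel or exact formulation, and the block size, top-k, window length, mean-pooling compression, and sigmoid gating module are all assumptions.

```python
import torch
import torch.nn.functional as F

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector q against keys K and values V."""
    scores = (K @ q) / (q.shape[-1] ** 0.5)          # [n]
    return F.softmax(scores, dim=-1) @ V             # [d]

def nsa_like_decode_step(q, K, V, block=8, topk=2, window=16, gate=None):
    """Combine compression, selection, and sliding-window attention paths for one query."""
    t, d = K.shape

    # 1) Compression path: mean-pool keys/values within temporal blocks (coarse-grained tokens).
    n_blocks = t // block
    Kc = K[: n_blocks * block].reshape(n_blocks, block, d).mean(dim=1)
    Vc = V[: n_blocks * block].reshape(n_blocks, block, d).mean(dim=1)
    out_cmp = attend(q, Kc, Vc)

    # 2) Selection path: rank blocks by the query's score against the coarse keys,
    #    then attend over the fine-grained tokens of the top-k blocks only.
    top_blocks = torch.topk(Kc @ q, k=min(topk, n_blocks)).indices
    idx = torch.cat([torch.arange(b * block, (b + 1) * block) for b in top_blocks.tolist()])
    out_sel = attend(q, K[idx], V[idx])

    # 3) Sliding-window path: attend only to the most recent `window` tokens for local context.
    out_win = attend(q, K[-window:], V[-window:])

    # Gate and sum the three paths; the real architecture uses a learned gating module,
    # modeled here as an assumed linear layer mapping the query to three gate values.
    g = torch.sigmoid(gate(q)) if gate is not None else torch.ones(3)
    return g[0] * out_cmp + g[1] * out_sel + g[2] * out_win

# Usage with random tensors standing in for one decode step over 128 cached tokens.
d, t = 64, 128
q, K, V = torch.randn(d), torch.randn(t, d), torch.randn(t, d)
gate = torch.nn.Linear(d, 3)
print(nsa_like_decode_step(q, K, V, gate=gate).shape)    # torch.Size([64])
```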

The research evaluated NSA through comprehensive experiments on real-world language corpora. After pre-training a 27B-parameter Transformer backbone on 260B tokens, the authors evaluated NSA on general language benchmarks, long-context benchmarks, and chain-of-thought reasoning evaluations. They also compared kernel speed on an A100 GPU using an optimized Triton implementation. The experimental results show that NSA achieves performance comparable to or better than the Full Attention baseline while outperforming existing sparse attention methods.

In addition, compared with Full Attention, NSA delivers significant speedups in the decoding, forward, and backward stages, and the speedup grows as sequence length increases. These results confirm that the hierarchical sparse attention design effectively balances model capacity and computational efficiency.

Outstanding Paper Award

A total of 26 outstanding papers were selected at ACL 2025, filling a full six slides:

1. A New Formulation of Zipf's Meaning-Frequency Law through Contextual Diversity.

2. All That Glitters is Not Novel: Plagiarism in AI Generated Research.

3. Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases.

4. Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization.

5. Bridging the Language Gaps in Large Language Models with Inference-Time Cross-Lingual Intervention.

6. Byte Latent Transformer: Patches Scale Better Than Tokens.

7. Capability Salience Vector: Fine-grained Alignment of Loss and Capabilities.