Chain of Thought (CoT) comes under question again, with three lines of evidence. Is truly generalizable reasoning still a long way off?
The Chain of Thought (CoT) prompting technique has been shown to improve the performance of Large Language Models (LLMs) across a range of tasks. Under this method, an LLM appears to first generate human-like reasoning steps (i.e., CoT reasoning) and then give a final answer, which often creates the impression that the model is carrying out a deliberate reasoning process.
However, a team from Arizona State University argues in a new study that CoT reasoning is in fact a brittle illusion that collapses once it moves beyond the training distribution.
In other words, the effectiveness of CoT reasoning stems not from the model's capacity for logical deduction but from memorization and interpolation of patterns in the training data. At its core it is highly structured pattern matching, not genuinely generalizable logical reasoning.
This conclusion has prompted practitioners in the Artificial Intelligence (AI) industry to re-examine the nature of CoT.
Paper link: https://arxiv.org/abs/2508.01191
According to the research team, this work deepens our understanding of why and when CoT reasoning fails, and underscores that achieving truly generalizable reasoning remains an open challenge.
Why the Doubt?
A growing body of work shows that LLMs often rely on surface-level semantics and cues rather than on a genuine logical reasoning process.
The researchers therefore challenged CoT reasoning by proposing an alternative, data-distribution perspective, probing why and when it fails. They dissected CoT reasoning along three dimensions: task, length, and format.
Figure | Data distribution perspective. The effectiveness of CoT reasoning is fundamentally limited by the degree of distribution difference between the training data and the test queries.
1. Task Generalization
Task generalization is the core challenge facing CoT reasoning: it directly tests whether the model can apply learned concepts and reasoning patterns to unseen scenarios.
The task generalization test focuses on the model's ability to adapt to new tasks with new structures, along two dimensions: transformation generalization and element generalization.
1) Transformation Generalization
In the transformation generalization experiment, the researchers designed four distribution-shift scenarios, escalating step by step from in-distribution (ID) to out-of-distribution (OOD):
In-Distribution (ID): the test task is identical to the training task. For example, when both training and testing use "f1∘f1", the model's exact match rate is 100%;
Composition (CMP): the test task is a new combination of trained basic operations. For example, after training on "f1∘f2" and "f2∘f1", testing on "f2∘f2" drops the exact match rate to 0.01%;
Partial Out-of-Distribution (POOD): the test task contains at least one untrained operation, and the exact match rate falls straight to zero;
Out-of-Distribution (OOD): the test task is an entirely new combination of operations. For example, if the training set has only seen "f1∘f1" while the test set requires "f2∘f2", the model fails completely.
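The four scenarios above can be sketched with a toy classifier. Here f1 (a letter shift) and f2 (a positional rotation) are illustrative stand-ins for the paper's atomic transformations, not its exact operations:

```python
# Toy sketch of the four distribution-shift scenarios; f1/f2 are illustrative
# stand-ins, not the paper's exact transformations.

def f1(seq):
    """Shift every letter forward by one (A -> B, ..., Z -> A)."""
    return [chr((ord(c) - ord("A") + 1) % 26 + ord("A")) for c in seq]

def f2(seq):
    """Rotate positions left by one."""
    return seq[1:] + seq[:1]

def compose(g, h):
    """g∘h: apply h first, then g."""
    return lambda seq: g(h(seq))

def scenario(test_task, train_tasks, train_atoms):
    """Classify a test composition relative to the training distribution."""
    if test_task in train_tasks:
        return "ID"    # seen verbatim during training
    if all(a in train_atoms for a in test_task):
        return "CMP"   # new combination of trained atoms
    if any(a in train_atoms for a in test_task):
        return "POOD"  # at least one untrained atom
    return "OOD"       # entirely untrained atoms

train_tasks = {("f1", "f2"), ("f2", "f1")}
train_atoms = {"f1", "f2"}
print(scenario(("f2", "f2"), train_tasks, train_atoms))  # CMP
print(scenario(("f1", "f3"), train_tasks, train_atoms))  # POOD
print(compose(f1, f2)(list("AB")))  # ['C', 'B']
```

The split logic makes the paper's escalation explicit: CMP still lies in the span of trained atoms, while POOD and OOD leave it partly or entirely.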
Table | End-to-end evaluation of transformation generalization across scenarios.
In addition, as shown in the following table, when generalizing from f1∘f2 to f2∘f2 the LLM answers 0.1% of queries correctly. Closer examination reveals that this is mere coincidence: for instance, when the query elements are A, N, A, N, the two operations happen to yield the same result.
After decomposing the full reasoning chain into reasoning steps and final answers for a deeper analysis, the research team found a marked inconsistency between the reasoning steps and the corresponding answers.
For example, in the composition generalization setting, when generalizing from f1∘f1 to f2∘f2 the reasoning steps are entirely correct on the test distribution, yet the resulting answers are wrong.
Conversely, when generalizing from f1∘f2 to f2∘f1, the LLM can produce the correct answer, but only because the two orthogonal transformations commute; the reasoning path itself is unreliable.
Table | Evaluation of different components of CoT reasoning in transformation generalization.
These results show that CoT reasoning cannot generalize to new transformations, or even to new combinations of seen transformations. Rather than truly understanding the task, CoT reasoning behaves more like a reproduction of patterns learned during training.
Furthermore, the research team ran supervised fine-tuning (SFT) on a small amount of previously unseen data to probe whether CoT reasoning can generalize to unseen transformations. SFT narrows the distribution gap between the training and test sets, which may help the LLM generalize to test queries.
Figure | Performance of SFT on unseen data under different degrees of distribution shift.
The results show that only a handful of examples are needed for the model to generalize quickly to unseen transformation scenarios and improve markedly. This indicates that LLMs are adept at rapidly picking up new patterns from data, but also that their competence is strictly bounded by the patterns they have seen.
2) Element Generalization
Element generalization is another key factor when extending the LLM to new tasks.
Holding other factors fixed, the research team set up three scenarios: ID, CMP, and OOD. In the ID scenario, the test elements use the same letters as the training elements; in the CMP scenario, the test elements are new combinations of letters encountered during training; in the OOD scenario, the test elements are letters never seen during training.
For composition, they tested whether CoT reasoning can generalize to new combinations when all of the basic atoms in the elements have been observed, e.g. (A, B, C, D) → (B, C, D, A); CMP can be further subdivided by the order of the atoms within the combination. In the OOD case, the atoms making up the elements are entirely unseen during training.
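The three element-level splits can be sketched as follows; the letter pools, tuple length, and sample sizes are illustrative assumptions, not the paper's actual data:

```python
import itertools
import random

# Toy sketch of the element-generalization splits (ID / CMP / OOD).
# Letter pools and sample sizes are illustrative assumptions.

train_atoms = list("ABCD")    # letters seen during training
unseen_atoms = list("WXYZ")   # letters never seen during training
train_elements = {("A", "B", "C", "D"), ("B", "A", "D", "C")}

def make_split(kind, k=4, n=3, seed=0):
    rng = random.Random(seed)
    if kind == "ID":
        # reuse training elements verbatim
        return rng.sample(sorted(train_elements), min(n, len(train_elements)))
    if kind == "CMP":
        # trained atoms in arrangements not seen during training
        pool = [t for t in itertools.permutations(train_atoms, k)
                if t not in train_elements]
        return rng.sample(pool, n)
    if kind == "OOD":
        # elements built entirely from unseen atoms
        return [tuple(rng.sample(unseen_atoms, k)) for _ in range(n)]
    raise ValueError(kind)
```

Each split keeps the task format fixed and varies only where the elements sit relative to the training distribution, mirroring the experimental control described above.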
The results show that, much as with transformation generalization, performance collapses as the distribution shift grows across all transformations: in every case, the exact match rate falls steadily from 1.00 under ID toward 0 as the setting moves through CMP to OOD.
Figure | Element generalization results under different scenarios and relationships.
They further explored, via SFT, when CoT reasoning can generalize to new elements, as shown in the figure below. When similar examples (small n) appear in the training data, performance improves rapidly. Interestingly, at n = 3 the exact match rate of CoT reasoning sits at the performance floor, which may indicate that its ability to generalize to novel elements is very limited, even with SFT on the downstream task.
They also found an accuracy mismatch between answers and reasoning steps during training, which may partly explain why CoT reasoning is inconsistent in some cases.
Figure | Performance of SFT in element generalization, which helps to generalize to new elements.
2. Length Generalization
Length generalization studies how the model's CoT reasoning ability degrades on test cases whose lengths differ from the training distribution.
Length differences can arise in the text itself or in the problem's reasoning space, so the research team decomposed length generalization into two complementary dimensions: text length generalization and reasoning step generalization.
1) Text Length Generalization
Text length generalization evaluates how CoT performance changes when the input text length differs from the training examples. Given how LLMs process long texts, this dimension is crucial: real-world problems vary in complexity, which shows up as differences in problem-statement length, context size, or information density.
Holding other factors fixed, the research team pre-trained the LLM on a dataset with a text length of 4 and evaluated it across a range of lengths.
The experiments show that the model performs well only at the training length of 4, where the exact match rate reaches 100%. As the length gap widens, CoT length generalization weakens and the exact match rate falls to 0, indicating that LLMs are highly sensitive to surface statistics such as input length.
Table | Evaluation of text length generalization.
They also explored different padding strategies to narrow the gap between training data and test cases. Padding to the maximum length did not help length generalization; however, replacing padding with a grouping (Group) strategy over the text improved performance.
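The two strategies can be sketched as follows; the PAD token and bucketing details are assumptions for illustration, not the paper's exact procedure:

```python
# Toy sketch of two ways to handle variable text lengths at training time:
# pad everything to the maximum length, or group same-length sequences into
# homogeneous batches. PAD token and bucketing details are assumptions.

PAD = "<pad>"

def pad_to_max(seqs):
    """Right-pad every sequence to the longest length in the batch."""
    max_len = max(len(s) for s in seqs)
    return [s + [PAD] * (max_len - len(s)) for s in seqs]

def group_by_length(seqs):
    """Bucket sequences so each batch contains only one length."""
    buckets = {}
    for s in seqs:
        buckets.setdefault(len(s), []).append(s)
    return list(buckets.values())

batch = [list("AB"), list("ABCD"), list("XY")]
print([len(s) for s in pad_to_max(batch)])       # [4, 4, 4]
print([len(b) for b in group_by_length(batch)])  # [2, 1]
```

Padding makes every training example share one surface length, whereas grouping exposes the model to each length in its natural form, which is one plausible reading of why the Group strategy helped.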
Figure | Performance of text length generalization under different padding strategies.
2) Reasoning Step Generalization
Reasoning step generalization studies whether the model can extrapolate to reasoning chains with a different number of steps from those observed during training, a common setting in multi-step reasoning tasks.
Similar to text length generalization, they pre-trained the LLM on 2-step reasoning chains and evaluated it on data requiring 1 or 3 steps.
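A chain with a controllable number of steps can be sketched like this; the atomic operation f is an illustrative assumption, not the paper's actual transformation:

```python
# Toy sketch: build a reasoning chain by applying an atomic operation k times,
# recording each intermediate state as one "reasoning step". The transformation
# f is an illustrative assumption.

def f(seq):
    """Shift every letter forward by one."""
    return [chr((ord(c) - ord("A") + 1) % 26 + ord("A")) for c in seq]

def make_chain(seq, k):
    """Return the k intermediate states; the last one is the final answer."""
    steps, cur = [], seq
    for _ in range(k):
        cur = f(cur)
        steps.append(cur)
    return steps

chain = make_chain(list("ABC"), 2)   # training uses 2-step chains
print(chain)  # [['B', 'C', 'D'], ['C', 'D', 'E']]
# Evaluation would use k = 1 or k = 3, which lie outside the training distribution.
```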
The results show that CoT reasoning fails to generalize across datasets requiring different numbers of reasoning steps. As the proportion of unseen data in training grows, performance on the target dataset trends upward, while, given the reduced amount of original training data, the model can no longer generalize even to its original training set. This shows that the model's performance is determined entirely by the distribution composition of its training data, with no generalization beyond it.
Figure | Performance of reasoning step generalization tests under different training data combinations.
3. Format Generalization
Format generalization aims to evaluate the robustness of CoT reasoning to changes in the surface form of test queries. This dimension is particularly important for determining whether the model has internalized a flexible and transferable reasoning strategy or still relies on the specific templates and phrases encountered during training.
Therefore, the research team simulated real - world scenarios through the following four perturbation patterns:
- Insertion: Insert noise tokens before each original token;
- Deletion: Directly remove the original tokens;
- Modification: Replace the original tokens with noise tokens;
- Hybrid: Combine multiple perturbation methods.
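The four perturbations can be sketched on a token list as follows; the noise vocabulary and perturbation rate p are assumptions for illustration:

```python
import random

# Toy sketch of the four format perturbations; noise vocabulary and rate p
# are illustrative assumptions.

NOISE = ["@", "#", "%"]

def perturb(tokens, mode, p=0.5, seed=0):
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        hit = rng.random() < p
        if mode == "insert":
            if hit:
                out.append(rng.choice(NOISE))  # noise before the original token
            out.append(tok)
        elif mode == "delete":
            if not hit:
                out.append(tok)                # drop the original token
        elif mode == "modify":
            out.append(rng.choice(NOISE) if hit else tok)
        elif mode == "hybrid":
            # apply a randomly chosen perturbation to each token
            sub = rng.choice(["insert", "delete", "modify"])
            out.extend(perturb([tok], sub, p=1.0, seed=rng.randrange(10**6)))
        else:
            raise ValueError(mode)
    return out
```

Each mode changes only the surface form of the query, which is exactly the axis along which the study tests whether CoT has internalized a transferable strategy.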
The experiments show that CoT reasoning is easily disrupted by format changes: insertion, deletion, modification, and the hybrid mode all introduce format differences that degrade correctness. Dividing the query into three parts (elements, transformations, and prompt words), the team found that elements and transformations are the format-critical components, while changes to other tokens have little effect on the results.
Figure | Performance of format generalization.