
The first spatio-temporal sequence reasoning framework: Enabling large models to truly understand spatio-temporal data

新智元 (Xinzhiyuan), 2026-04-27 19:49
The first spatio-temporal reasoning model, STReasoner, features low cost and high generalization, supporting causal traceability and prediction.

[Introduction] STReasoner is the first reasoning model that combines time series, spatial structure, and natural language. It can identify the source of anomalies, track the impact path, understand the relationships between nodes, and predict future developments. Compared with mainstream prediction models, STReasoner pays more attention to causal and structural reasoning, and has extremely low computational costs, demonstrating strong generalization and reasoning abilities.

Time series are widely present in real-world systems, such as transportation networks, power systems, and disease spread. These systems not only have temporal dynamics but also complex spatial dependency relationships. Traditional methods focus on one thing: predicting future values more accurately.

However, in real scenarios, more important questions are often: Which node caused the current anomaly? How does the impact spread along the spatial structure? What kind of causal relationships exist between different time steps?

As shown in Figure 1, in a transportation network, if a certain area experiences congestion at 9 o'clock, what we really care about is: "Where did it spread from?"

This kind of problem cannot be solved by single-point prediction; it requires multi-step reasoning across time and space. The model first locates the moment at which the target node becomes anomalous (time dimension), then traces the potential impact path back along the graph structure (space dimension), aligns the propagation delays between nodes (spatio-temporal coupling), and finally identifies the real causal source. In essence, this process integrates temporal dynamics, spatial dependencies, and semantic queries into structured reasoning across nodes and time steps.
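The tracing procedure described above can be illustrated with a toy example. This is a minimal sketch, not the paper's method: the threshold-based onset detection, the `edges` format (a delay in time steps per directed edge), and the function names are all assumptions made for illustration.

```python
def first_anomaly_time(series, threshold):
    """Return the first time index where the series exceeds threshold, else None."""
    for t, value in enumerate(series):
        if value > threshold:
            return t
    return None


def trace_source(series_by_node, edges, target, threshold):
    """Walk upstream from `target` along edges whose delay matches the anomaly onsets.

    edges maps (src, dst) -> propagation delay in time steps (hypothetical format).
    Returns (source_node, path_from_source_to_target).
    """
    # Step 1 (time dimension): find when each node first becomes anomalous.
    onset = {n: first_anomaly_time(s, threshold) for n, s in series_by_node.items()}
    if onset[target] is None:
        return None, []
    path = [target]
    current = target
    visited = {target}
    while True:
        # Steps 2 and 3 (space + delay alignment): keep upstream neighbours whose
        # onset plus the edge delay lands exactly on the current node's onset.
        candidates = [
            src
            for (src, dst), delay in edges.items()
            if dst == current
            and src not in visited
            and onset[src] is not None
            and onset[src] + delay == onset[current]
        ]
        if not candidates:
            break
        current = min(candidates, key=lambda n: onset[n])
        visited.add(current)
        path.append(current)
    # Step 4: the last node reached is the candidate causal source.
    return current, path[::-1]


def step_series(onset, length=10, level=9.0):
    """Toy series: 0.0 before `onset`, `level` from `onset` onwards."""
    return [level if t >= onset else 0.0 for t in range(length)]


series = {
    "A": step_series(3),
    "B": step_series(5),
    "C": step_series(6),
    "D": [0.0] * 10,  # connected to C but never anomalous
}
edges = {("A", "B"): 2, ("B", "C"): 1, ("D", "C"): 1}
source, path = trace_source(series, edges, target="C", threshold=5.0)
print(source, path)  # A ['A', 'B', 'C']
```

Real data would need a more robust onset detector and some tolerance on the delay match. The sketch only shows why the question needs time, structure, and delay information together.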

However, existing methods mainly focus on numerical prediction and struggle to support such complex decision-making problems, which highlights the need to develop spatio-temporal reasoning abilities for time series.

The development of spatio-temporal reasoning is limited by three key issues:

  1. Data issue: Lack of high-quality aligned data. Existing data rarely contains time series, spatial structure, and corresponding natural language descriptions simultaneously. The model lacks a data basis for learning "reasoning".
  2. Evaluation issue: Lack of systematic task definition. There has been no unified framework to systematically evaluate spatio-temporal reasoning abilities in the past. Most work still stays at the prediction task.
  3. Modeling issue: Lack of effective training mechanisms. How to fuse time series + graph + text? How to prevent the model from only using temporal patterns and ignoring spatial information?

A research team from institutions including Emory University, Microsoft, and Griffith University proposed STReasoner, the first time series LLM framework for complex spatio-temporal reasoning (Spatio-Temporal Reasoning in Time Series). Experiments show that the model achieves significant performance improvements on tasks such as causal tracing, spatial relationship reasoning, and time series prediction, and generalizes strongly to real data. Meanwhile, its computational cost is only 0.004× that of closed-source models.

Paper link: https://arxiv.org/abs/2601.03248

Code link: https://github.com/LingFengGold/STReasoner

Build a spatio-temporal model that "truly reasons" in three steps

A cleaner data construction method

To systematically support the training and evaluation of spatio-temporal reasoning models, researchers first built a controllable data generation framework and proposed a unified evaluation benchmark, ST-Bench, based on it.

As shown in the figure, researchers designed a Network SDE + Multi-Agent system specifically for generating three types of strictly aligned data:

  • Time series (how the system changes over time)
  • Graph structure (how nodes interact with each other)
  • Natural language descriptions (what these changes "mean")

The entire process can be understood as: first define the world, then generate data, and then check if it is reasonable.

First, define a complete scenario, such as a transportation system, and clarify the nodes, connection relationships, and temporal dynamics:

  • Scenario Generation Agent: Generate a complete scenario (e.g., a transportation system, a propagation process)
  • Scenario Parsing Agent: Decompose this scenario into structured information (nodes, connection relationships, temporal patterns, etc.)

Then, model the changes of each node with a stochastic differential equation (SDE), while introducing spatial dependencies and propagation delays:

  • SDE Parameters Agent: Set the temporal dynamics (trend, noise, period, etc.) for each node
  • Time-Varying Adjacency Agent: Set the influence intensity, direction, and propagation delay for the connections between nodes.
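To make the idea of a network SDE with delayed coupling concrete, here is a minimal sketch. The article does not give the exact equations, so everything here is an assumption for illustration: the mean-reverting drift, the parameter names (`mean`, `theta`, `sigma`, `weight`, `delay`), and the Euler-Maruyama discretisation. The adjacency is also kept fixed over time for brevity.

```python
import math
import random


def simulate_network_sde(n_steps, nodes, edges, dt=1.0, seed=0):
    """Euler-Maruyama simulation of one mean-reverting process per node.

    nodes: {name: {"mean": float, "theta": float, "sigma": float}}  (assumed form)
    edges: {(src, dst): {"weight": float, "delay": int}}            (assumed form)
    Each edge pushes `dst` by weight * (deviation of `src` from its mean),
    evaluated `delay` steps in the past.
    """
    rng = random.Random(seed)
    x = {name: [p["mean"]] for name, p in nodes.items()}
    for t in range(1, n_steps):
        for name, p in nodes.items():
            prev = x[name][t - 1]
            # Temporal dynamics: pull back towards the node's own mean.
            drift = p["theta"] * (p["mean"] - prev)
            # Spatial dependency with propagation delay.
            for (src, dst), e in edges.items():
                lag_index = t - 1 - e["delay"]
                if dst == name and lag_index >= 0:
                    drift += e["weight"] * (x[src][lag_index] - nodes[src]["mean"])
            noise = p["sigma"] * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            x[name].append(prev + drift * dt + noise)
    return x


nodes = {
    "A": {"mean": 10.0, "theta": 0.3, "sigma": 1.0},
    "B": {"mean": 20.0, "theta": 0.3, "sigma": 0.5},
}
edges = {("A", "B"): {"weight": 0.8, "delay": 2}}
trajectories = simulate_network_sde(50, nodes, edges, seed=42)
print(len(trajectories["A"]), len(trajectories["B"]))  # 50 50
```

Because the parameters and the generated series come from the same specification, the ground-truth influence direction and delay are known by construction. This is what makes such data usable for reasoning questions.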

Finally, this information is passed to the Simulation module to generate realistic spatio-temporal time series. To avoid "correct data but incorrect semantics", the authors introduced two Judges:

  • Scenario Judge: Check if the scenario itself is reasonable
  • Parameter Judge: Check if the generated data really conforms to the scenario description

As shown in the figure, after obtaining high-quality data, the authors further built the unified benchmark ST-Bench and divided spatio-temporal reasoning into four types of tasks:

T1: Causal tracing → Who caused the current phenomenon?

T2: Entity recognition → What role does each node play?

T3: Correlation reasoning → How do nodes influence and propagate each other?

T4: Spatio-temporal prediction → What will happen in the future under these relationships?

Together, these four types of tasks cover a complete chain: understand the structure → infer relationships → explain reasons → predict the future.

STReasoner model design

In spatio-temporal reasoning tasks, the model needs to process three types of information simultaneously: time series, spatial structure, and natural language questions. Therefore, a core question is: How can a language model "understand time series values", "comprehend graph structures", and complete reasoning?

The design idea of STReasoner is straightforward: Encode the time series into vectors (Time Series Encoder), write the graph structure as text (Graph Prompting), and hand them over to the language model for processing together with the questions.
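Writing a graph as text can be sketched as follows. This is an illustrative assumption, not the prompt template used by STReasoner: the sentence wording, the `[TS]` placeholder (standing in for the position where encoder embeddings would be injected), and the function names are all hypothetical.

```python
def graph_to_text(nodes, edges):
    """Describe a directed graph in plain sentences.

    edges is a list of (src, dst, delay) tuples (hypothetical format).
    """
    lines = [f"The system has {len(nodes)} nodes: {', '.join(nodes)}."]
    for src, dst, delay in edges:
        lines.append(f"{src} affects {dst} after about {delay} time step(s).")
    return "\n".join(lines)


def build_prompt(nodes, edges, question):
    """Combine per-node series placeholders, the graph description, and the question."""
    # One placeholder per node; a real system would replace each with
    # the vectors produced by the time series encoder.
    series_part = "\n".join(f"{name}: [TS]" for name in nodes)
    return (
        "Time series (one encoder placeholder per node):\n"
        + series_part
        + "\n\nGraph structure:\n"
        + graph_to_text(nodes, edges)
        + "\n\nQuestion: "
        + question
    )


prompt = build_prompt(
    nodes=["A", "B", "C"],
    edges=[("A", "B", 2), ("B", "C", 1)],
    question="Node C became congested at step 6. Which node did it spread from?",
)
print(prompt)
```

Keeping the graph in text form means the language model needs no separate graph encoder, while the numerical series still go through a dedicated encoder.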

Three-stage training: From alignment to reasoning to reinforcement

STReasoner adopts a three-stage training strategy:

Stage 1: Modality alignment (Align): This stage mainly uses automatically generated basic question-answer data (ST-Align) to learn the correspondences between time series, graph structures, and text, such as trend recognition and understanding of node relationships.

Stage 2: Injection of reasoning ability (SFT + CoT): In this stage, the authors used rejection sampling to select samples on which Claude-4.5-Sonnet reasoned correctly, built chain-of-thought (CoT) data from them, and performed supervised fine-tuning on the model.
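The rejection sampling step can be sketched in a few lines. This is a generic illustration under stated assumptions: `generate` stands in for a call to a teacher model and is a hypothetical callable returning `(reasoning, answer)`, the record fields are invented, and the article does not say how many samples are drawn per question.

```python
def rejection_sample_cot(questions, generate, n_samples=4):
    """Keep one reasoning trace per question, but only if its final answer is correct.

    questions: list of {"prompt": str, "answer": str}
    generate:  callable(prompt) -> (reasoning, answer); hypothetical teacher interface
    """
    kept = []
    for q in questions:
        for _ in range(n_samples):
            reasoning, answer = generate(q["prompt"])
            if answer == q["answer"]:
                # Accept the first trace that reaches the known-correct answer.
                kept.append(
                    {"prompt": q["prompt"], "reasoning": reasoning, "answer": answer}
                )
                break
        # Questions with no correct trace after n_samples tries are dropped.
    return kept


def scripted_teacher(answers):
    """Stub teacher that replays a fixed list of answers, for demonstration only."""
    it = iter(answers)

    def generate(prompt):
        answer = next(it)
        return f"Trace the delays upstream; the source is {answer}.", answer

    return generate


questions = [
    {"prompt": "Which node caused the anomaly at C?", "answer": "A"},
    {"prompt": "Which node caused the anomaly at F?", "answer": "C"},
]
teacher = scripted_teacher(["B", "A", "B", "B"])
cot_data = rejection_sample_cot(questions, teacher, n_samples=2)
print(len(cot_data))  # 1
```

Since the synthetic data carries ground-truth answers, correctness can be checked automatically, so only traces that end at the right answer enter the fine-tuning set.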

Stage 3: Reinforcement learning (S-GRPO)

This stage further improves the model's reasoning ability through reinforcement learning. The reinforcement learning adopts a spatial-aware reward mechanism (S-GRPO). The core mechanism is to construct two types of inputs for the same question:

  • w/ spatial (with graph structure)
  • w/o spatial (without graph structure)

An additional reward is given only when the model performs better in the "with structure" case than in the "without structure" case.

This mechanism directly encourages the model to truly rely on the spatial structure rather than just looking at temporal patterns.
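A minimal sketch of such a reward is shown below. The exact formula is not given in this article, so the additive `bonus`, its value, and the use of per-sample scores in [0, 1] are assumptions. The group normalisation follows the general GRPO idea of comparing each rollout with the others sampled for the same question.

```python
import statistics


def spatial_aware_reward(score_with, score_without, bonus=0.5):
    """Base reward is the score with the graph; add a bonus only if the graph helped.

    score_with / score_without: correctness scores for the same question,
    answered with and without the graph description (assumed to lie in [0, 1]).
    """
    reward = score_with
    if score_with > score_without:
        reward += bonus
    return reward


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of rollouts (GRPO-style advantages)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts for one question: (score with graph, score without graph).
rollouts = [(1.0, 0.0), (1.0, 1.0), (0.0, 0.0), (0.0, 1.0)]
rewards = [spatial_aware_reward(w, wo) for w, wo in rollouts]
advantages = group_relative_advantages(rewards)
print(rewards)  # [1.5, 1.0, 0.0, 0.0]
```

In this sketch, a rollout that is correct only when the graph is present receives the largest reward. A rollout that is correct either way gets no bonus, because the graph did not demonstrably contribute.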

Experimental results

From the overall results, STReasoner shows a very consistent advantage in different types of tasks.

In the three types of tasks that emphasize causal and structural reasoning, T1 (causal tracing), T2 (entity recognition), and T3 (spatial correlation reasoning), the model significantly outperforms existing open-source methods and exceeds the compared large models on multiple metrics, indicating that it has learned to reason over spatio-temporal structure rather than merely fit patterns.

In contrast, in the T4 (spatio-temporal prediction) task, which is more focused on numerical prediction, STReasoner's performance is basically on par with that of closed-source large models, with only a small gap, reflecting that it does not sacrifice prediction accuracy while maintaining reasoning ability.

More importantly, this performance is achieved at extremely low cost: the overall inference overhead is only about 0.004× that of closed-source models (roughly 1/250), a very competitive balance between cost and performance.

Strong generalization ability

To verify whether the model has really "learned to reason" rather than just fitting synthetic data, the authors conducted a strict zero-shot test (without any fine-tuning) on real-world data. There are two notable points in this comparison:

First, STReasoner's performance on real data does not decline; instead, it leads by a significant margin, indicating that the model has learned a transferable spatio-temporal reasoning ability rather than the data distribution itself.

Second, and more importantly, STReasoner is trained entirely on synthetic data yet can still accurately identify causal relationships in real scenarios, indicating that the "SDE + multi-Agent" data generation mechanism described earlier produces a training distribution with real generalization value.

The model has not memorized the data but has learned how to reason in spatio-temporal structures.

Why is the model effective?

As can be seen from Table 3 and Figure 5, the performance improvement mainly comes from three key designs:

  • Time series encoder: ensures that temporal information is not lost. Compared with pure text or image input, the explicit encoder retains both the numerical information and the overall shape of the series, which is the basis for subsequent reasoning.
  • Three-stage training: the ability is "gradually established". Table 3 shows that performance declines significantly if any stage is missing. Using only Align or only SFT leaves reasoning ability insufficient, and applying RL directly gives unstable results. Only the combination of Align + SFT + S-GRPO achieves the optimal result.
  • S-GRPO: lets the model truly "reason with structure". Figure 5 shows that after S-GRPO is introduced, the proportion of cases in which the model uses spatial information increases significantly. The key is not just higher accuracy: the model changes from "possibly not using the structure" to "actively relying on the structure".

Analysis of training dynamics