
A Shenzhen University team helps robots understand instructions and navigate precisely, with a success rate of up to 72.5% and a 40% gain in inference efficiency.

QbitAI, 2025-12-10 14:57
It also cuts the parameter count without sacrificing performance.

Helping robots understand instructions and navigate more precisely!

The research team led by Professor Li Jianqiang of Shenzhen University, in collaboration with institutions including the Beijing Institute of Technology - Moscow Institute of Physics and Technology, recently proposed UNeMo, a new framework for Vision-Language Navigation (VLN).

With a multi-modal world model and a hierarchical prediction-feedback mechanism, the navigation agent not only perceives its current environment but also predicts what it is likely to encounter next, and makes smarter decisions accordingly.

Compared with mainstream methods, UNeMo significantly reduces resource consumption. Its navigation success rate in unseen environments reaches 72.5%, and it does especially well on long-trajectory navigation.

The paper has been accepted by AAAI 2026.

Details follow.

The "Disconnect Dilemma" between Language Reasoning and Visual Navigation

As one of the core tasks in Embodied AI, Vision-Language Navigation requires an agent to reach a target in an unknown environment on its own, relying solely on visual images and natural language instructions.

Navigation methods built on large language models (LLMs) have made progress as LLMs have matured, but they still face two key bottlenecks:

  • Single reasoning modality: Existing methods rely only on language reasoning and cannot predict the visual state of the environment, which makes dynamic changes in complex scenes hard to handle.
  • Conflicting optimization goals: The reasoning module and the navigation policy are trained separately, so they fit each other poorly. They cannot be optimized jointly and dynamically, which caps performance.

Two-Module Collaboration to Create a "Prediction + Decision-Making" Closed Loop

To address this, the research team proposed the UNeMo framework. Its core breakthrough is a two-way collaborative architecture, "Multi-Modal World Model (MWM) + Hierarchical Prediction Feedback Navigator (HPFN)", which tightly integrates visual state reasoning with navigation decision-making and targets the disconnect in existing methods at its root.

Future Visual State Prediction Based on the Multi-Modal World Model

The MWM is built on a conditional variational autoencoder (CVAE), and its core job is to predict future visual states accurately.

It takes the current visual features, the language instruction, and candidate navigation actions as input and fuses this multi-modal information through a cross-attention mechanism, filling the gap left by existing methods that "only focus on the present".

Moreover, without additional labeled data, it can continuously optimize the prediction accuracy through the reverse feedback of navigation decision results, forming an adaptive evolution cycle.
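To make the idea concrete, here is a minimal pure-Python sketch of the two ingredients named above: cross-attention fusion and CVAE-style latent sampling. It is not the authors' implementation. The function names, the way the action is added to the visual feature, and the fixed `mu` / `log_var` "heads" are all assumptions made for illustration; a real model would use learned layers and a decoder.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention: one query over key/value rows."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def predict_future_state(visual, instruction_tokens, action, rng):
    """Toy conditioning + sampling step (hypothetical, not the paper's network)."""
    # Assumption: the candidate action simply shifts the visual feature.
    query = [v + a for v, a in zip(visual, action)]
    # Fuse language into the visual/action query via cross-attention.
    context = cross_attention(query, instruction_tokens, instruction_tokens)
    # Assumption: hand-written stand-ins for learned mean / log-variance heads.
    mu = [0.5 * (q + c) for q, c in zip(query, context)]
    log_var = [-2.0 for _ in mu]
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1).
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
    return z, mu, log_var

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), the usual VAE regularizer."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

rng = random.Random(0)
visual = [0.2, -0.1, 0.4]                      # made-up current visual feature
instruction = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # made-up instruction token features
action = [0.1, 0.0, -0.1]                      # made-up candidate action feature
z, mu, log_var = predict_future_state(visual, instruction, action, rng)
```

The sampled `z` stands in for the predicted future visual state; in training, a reconstruction loss plus the KL term would typically be minimized together.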

Implementation of the Efficient Hierarchical Prediction Feedback Navigator

The HPFN adopts a two-stage hierarchical mechanism to balance efficiency and accuracy:

First, it generates coarse-grained candidate actions (a') from the current visual-language features to lock in the navigation direction. It then fuses in the future visual states predicted by the MWM to produce refined, fine-grained actions (a'') that correct deviations, which lets the agent navigate stably in complex scenes.
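The coarse-then-fine procedure can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the paper's navigator: actions are scored by a plain dot product, a single coarse action is selected, `toy_world_model` is a hypothetical stand-in for the MWM, and the fusion is a simple weighted sum.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pick_best(features, candidates):
    """Return the name of the candidate whose feature best matches `features`."""
    scores = {name: dot(features, feat) for name, feat in candidates.items()}
    return max(scores, key=scores.get)

def hierarchical_select(state, candidates, world_model, weight=0.5):
    # Stage 1: coarse action a' from the current features only.
    a_coarse = pick_best(state, candidates)
    # Ask the (hypothetical) world model what the agent would see after a'.
    predicted = world_model(state, candidates[a_coarse])
    # Stage 2: fuse the predicted future state and re-score to get a''.
    fused = [s + weight * p for s, p in zip(state, predicted)]
    a_fine = pick_best(fused, candidates)
    return a_coarse, a_fine

def toy_world_model(state, action_feat):
    """Stand-in predictor: pretends the future view emphasizes dimension 1."""
    return [0.0, 2.0 * action_feat[0]]

state = [1.0, 0.0]  # made-up current visual-language feature
candidates = {"left": [0.9, 0.1], "forward": [0.8, 0.9], "stop": [0.0, 0.0]}
a_coarse, a_fine = hierarchical_select(state, candidates, toy_world_model)
print(a_coarse, a_fine)  # left forward
```

In this toy run the coarse stage prefers "left", and the predicted future state shifts the refined choice to "forward", which is the kind of correction the second stage is meant to provide.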

Empowering the Dynamic Closed Loop of Reasoning and Decision-Making

The central breakthrough of the UNeMo general navigation architecture is a closed optimization loop in which reasoning and decision-making reinforce each other.

The MWM's visual predictions give the navigator forward-looking information, which improves decision accuracy; the actual outcomes of navigation are fed back to the MWM in real time to improve its predictions.

This two-way reinforcement lets the agent keep improving as it navigates, addressing the separation of reasoning and decision-making in traditional LLM-based VLN methods.

Experimental Performance

To verify the core value of the UNeMo framework, the team designed a comprehensive evaluation plan:

It first verifies both performance and efficiency in core scenarios, then tests robustness in complex scenarios, and finally checks scalability across baselines and datasets, demonstrating the architecture's advantages step by step.

1. Breakthrough in Core Scenarios: Both Efficiency and Performance Among LLM-Based Methods on the R2R Dataset

In experiments on R2R, the core dataset in the VLN field, UNeMo combines a lightweight configuration with high-performance decision-making.

The FlanT5-1.5B model used by UNeMo has only 30% of the parameters of the FlanT5-5B used by the mainstream method NavGPT2, and its resource consumption is substantially lower:

During training, GPU memory usage dropped from 27GB to 12GB, a reduction of 56%; inference time fell from 1.1 seconds per step to 0.7 seconds, reported as an efficiency improvement of about 40%.

This characteristic of "reducing parameters without sacrificing performance" is of great significance for the engineering implementation of VLN methods.
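The resource figures can be checked with a few lines of arithmetic. The inputs below are the numbers quoted in this article; the helper name `relative_reduction` is just for illustration. The check suggests the quoted 40% is a rounded figure: time per step falls by roughly 36%, while steps per second rise by roughly 57%.

```python
def relative_reduction(before, after):
    """Fraction by which a quantity shrinks going from `before` to `after`."""
    return (before - after) / before

param_ratio = 1.5 / 5                            # FlanT5-1.5B vs FlanT5-5B
memory_reduction = relative_reduction(27, 12)    # training GPU memory, GB
time_reduction = relative_reduction(1.1, 0.7)    # seconds per inference step
throughput_gain = 1.1 / 0.7 - 1                  # steps per second, relative gain

print(f"{param_ratio * 100:.1f}")       # 30.0
print(f"{memory_reduction * 100:.1f}")  # 55.6
print(f"{time_reduction * 100:.1f}")    # 36.4
print(f"{throughput_gain * 100:.1f}")   # 57.1
```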

At the same time, UNeMo still outperformed mainstream methods in core performance indicators.

On the test unseen split, its navigation success rate (SR) reached 72.5%, 1.5 percentage points above NavGPT2's 71%; its path efficiency, measured by success weighted by path length (SPL), rose from 60% to 61.3%.
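For readers unfamiliar with the two metrics, the sketch below computes SR and SPL using the definitions commonly used with R2R: an episode counts as a success if the agent stops within 3 m of the goal, and SPL weights each success by the ratio of shortest-path length to the length actually travelled. The article itself does not spell out these definitions, and the four episodes are made-up values, not results from the paper.

```python
def is_success(episode, threshold=3.0):
    """Success if the agent stops closer than `threshold` metres to the goal."""
    return episode["dist_to_goal"] < threshold

def success_rate(episodes, threshold=3.0):
    return sum(is_success(ep, threshold) for ep in episodes) / len(episodes)

def spl(episodes, threshold=3.0):
    """Success weighted by Path Length: mean of S_i * l_i / max(p_i, l_i)."""
    total = 0.0
    for ep in episodes:
        if is_success(ep, threshold):
            total += ep["shortest"] / max(ep["taken"], ep["shortest"])
    return total / len(episodes)

# Made-up episodes (all lengths in metres), for illustration only.
episodes = [
    {"dist_to_goal": 1.0, "shortest": 10.0, "taken": 10.0},  # optimal path
    {"dist_to_goal": 2.5, "shortest": 8.0, "taken": 16.0},   # success, long detour
    {"dist_to_goal": 5.0, "shortest": 12.0, "taken": 12.0},  # stopped too far away
    {"dist_to_goal": 0.5, "shortest": 6.0, "taken": 8.0},    # success, small detour
]

print(success_rate(episodes))  # 0.75
print(spl(episodes))           # 0.5625
```

Because detours are penalized, SPL is never higher than SR, which is why the article's SPL values sit below its SR values.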

2. Robustness in Complex Scenarios: Significant Advantages in Long-Path Navigation

To verify UNeMo's adaptability in complex scenarios, the team tested how much its pre-exploration mechanism improves the robustness of long-distance navigation, comparing UNeMo and NavGPT2 on the val-unseen split at different path lengths.

The results show that UNeMo's advantage is most pronounced in long-trajectory navigation:

For short paths (length < 7), the navigation success rate (SR) rose only slightly, by 1.2 percentage points (from 71.1% to 72.3%); for long paths (length ≥ 7), SR rose by 5.6 percentage points (from 64.2% to 69.8%), about 4.7 times the short-path gain.

This suggests that UNeMo's multi-modal prediction and hierarchical decision-making mechanism can effectively reduce the accumulation of errors in long-distance navigation, easing the performance degradation that traditional methods suffer on long-trajectory tasks.
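The short-path and long-path comparison can be reproduced from the four success rates quoted above. The snippet separates the absolute gain (in percentage points) from the relative gain (in percent of the baseline), since the two are easy to confuse; variable names are illustrative.

```python
# Success rates (%) quoted in the article: (NavGPT2, UNeMo)
short_paths = (71.1, 72.3)   # path length < 7
long_paths = (64.2, 69.8)    # path length >= 7

short_gain = short_paths[1] - short_paths[0]   # percentage points
long_gain = long_paths[1] - long_paths[0]      # percentage points
ratio = long_gain / short_gain

short_relative = short_gain / short_paths[0] * 100  # percent of baseline
long_relative = long_gain / long_paths[0] * 100     # percent of baseline

print(round(short_gain, 1), round(long_gain, 1))          # 1.2 5.6
print(round(ratio, 1))                                    # 4.7
print(round(short_relative, 1), round(long_relative, 1))  # 1.7 8.7
```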

3. Cross-Scenario Scalability: Comprehensive Verification across Multiple Baselines and Datasets

To further verify the generality and scalability of the UNeMo collaborative training architecture, the team ported it to a different type of navigation baseline (DUET) and to the goal-oriented navigation dataset REVERIE for cross-scenario verification.

The results showed improvements in both navigation success rate (SR) and remote grounding success (RGS) in unseen scenarios.

This indicates that the UNeMo collaborative training architecture is not limited to LLM-based baselines; it can be adapted to different types of navigation systems and remains useful across task scenarios, which supports its scalability.

Overall, UNeMo's collaborative architecture of "Multi-Modal World Model + Hierarchical Prediction Feedback Navigator" addresses two problems in traditional VLN methods: the disconnect between reasoning and decision-making, and high resource consumption.

With a lightweight configuration, it delivers high performance, robust long-path navigation, and strong cross-scenario adaptability, offering an efficient and practical solution for VLN that should help bring service robots into real-world use and advance the VLN field.

Paper link: https://arxiv.org/abs/2511.18845

This article is from the WeChat official account "QbitAI", written by the UNeMo team and published by 36Kr with permission.