Moushen Intelligence releases HL3DWM, a brand-new three-dimensional world model with a human-like thinking paradigm, enabling robots to truly understand the real world.
Imagine entering an unfamiliar room: how would you look for the remote control?
As humans, we rely on our life experience and quickly recall that "the remote control is usually next to the TV or on the sofa." Then we head to that area, look around, ignore other irrelevant items like water cups and tissue boxes that come into view, and finally locate the target.
In the complex 3D world, this human instinct for "precise positioning and on-demand retrieval" is exactly the core ability that embodied intelligence urgently needs in order to become more general.
A Brand-New Three-Dimensional World Model with a Human-Like Thinking Paradigm: HL3DWM
Recently, the research team at Moushen Intelligence, in collaboration with Fudan University and Shanghai Chuangzhi College, proposed a brand-new Human-Like 3D World Model (HL3DWM) built on the paradigm of human behavior. Starting from the natural logic by which humans understand the 3D world, the team has opened up a new path for 3D scene understanding that better matches real-world cognition, pushing embodied intelligence technology from cutting-edge research toward large-scale industrial application.
HL3DWM is like an intelligent assistant with "spatial memory." Its core principle is to imitate the way humans understand the 3D world: first find the relevant area, then integrate the surrounding information, and finally complete the task. Through its self-developed "Object-Aware Image Retrieval" module and "Environment-Aware Information Aggregation" module, and by combining the global spatial relationships provided by 3D point clouds with the fine details of images, it enables the large language model to give accurate answers or task plans and complete complex tasks.
Pain Points of Existing Research: 3D Large Language Models Struggle to Balance Global and Local Information
In recent years, multimodal large language models (MLLMs) have achieved remarkable success in the 2D image field. Naturally, researchers hope to transfer this ability to the real 3D world, endowing embodied agents such as robots with the ability to understand physical space. How to further improve the 3D spatial understanding ability of models has become the current research focus.
However, when facing complex 3D tasks, today's mainstream 3D multimodal large language models often face a dilemma. On the one hand, although point clouds provide accurate 3D coordinates, extracting features directly from point clouds easily loses image detail, and some objects are hard to represent in point clouds at all; small objects, for example, cannot be clearly identified. On the other hand, after mapping 2D image features into 3D space, the model may struggle to fully capture 3D spatial information, especially global spatial relationships; for two images with no overlapping area, for instance, the large language model has difficulty grasping their relative spatial positions.
HL3DWM's Way Out: Imitate How Humans Understand the 3D World and Integrate Core Multimodal Information
When humans complete tasks in the 3D world, they can easily integrate global information with the relevant details. Take "cooking" as an example: humans first go to the kitchen based on the task and their memory, then observe the surroundings, integrate information about the utensils and ingredients there, and finally decide on the dishes and cooking methods. This process can be summarized in three steps. First, understand the task and retrieve the corresponding location: after receiving an instruction, humans extract the task-related information and then, based on memory, locate the task-related area where more information can be collected. Second, aggregate information: after retrieving the target location, integrate task-related information about the object and its surrounding environment. Finally, execute the task, using the collected information to complete it.
Notably, HL3DWM (Human-Like 3D World Model) adopts a framework that imitates human cognitive habits and the human way of understanding the world. The overall framework is shown in Figure 2 below.
Step 1: First "Highlight the Key Points": Extract Information and Locate Precisely (OIR Module)
After receiving an instruction, humans' first reaction is to extract keywords and recall the location. The same goes for HL3DWM.
For example, when receiving the question "What color is the armchair?", humans first capture the keyword "armchair", recall its location from vague memory, and then observe the area to obtain more information and further confirm the color of the armchair.
To achieve task understanding and target-area retrieval, the researchers proposed the Object-Aware Image Retrieval (OIR) module to simulate this human behavior. The module first extracts keywords or location information from the instruction and then retrieves the corresponding image containing details of the task-related area. Specifically, it locates the target area from the extracted information using visual foundation models such as CLIP, or using the camera parameters.
Step 2: Then "Look Around": Efficiently Integrate Surrounding Environment Information (EIA Module)
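The article does not detail the retrieval implementation, but the idea can be sketched with an off-the-shelf CLIP model: score every captured frame of the scene against the extracted keywords and keep the best-matching view. The snippet below is a minimal illustrative sketch in Python; the function name retrieve_keyframe, the frame-list input, and the chosen checkpoint are assumptions for illustration, not the paper's actual interface.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical sketch of object-aware retrieval using an off-the-shelf CLIP model.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve_keyframe(keywords, frame_paths):
    """Rank the scene's captured frames by CLIP similarity to the task keywords
    and return the best-matching view (the 'recalled' location)."""
    query = " ".join(keywords)                       # e.g. ["armchair"] -> "armchair"
    images = [Image.open(p).convert("RGB") for p in frame_paths]
    inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    scores = out.logits_per_image.squeeze(-1)        # one similarity score per frame
    best = int(scores.argmax())
    return frame_paths[best], scores
```

The camera-parameter route mentioned above would replace the CLIP scoring with a pose-to-region lookup, but the overall "extract keywords, then recall the matching view" flow stays the same.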
Looking only at the target is not enough; it is also important to grasp the surrounding environment. When receiving a task such as "Build a music station," humans first observe the spatial environment, find the items needed to build the music station, and then use them to complete the construction. Especially when the instruction places explicit requirements on the relative positions of objects, the surrounding environment information becomes indispensable.
Inspired by the human behaviors of "looking around" and "filtering out useless information," the researchers further introduced the Environment-Aware Information Aggregation (EIA) module, which collects surrounding environment information to obtain more task-related content. Specifically, the module consists of two parts, information acquisition and information aggregation, which respectively obtain information from the surrounding area and filter and fuse the acquired information. Finally, the collected information and the instruction are fed into the large language model to produce a solution. Experimental results show that this method effectively exploits information from point clouds and task-related images, improving performance on multiple tasks such as 3D visual question answering and 3D dense description.
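The paper's exact aggregation design is not given here, but its two stages, acquiring neighboring views and then filtering and fusing them, can be sketched with plain tensor operations: take the frames around the retrieved view, keep only the ones most relevant to the task query, and fuse them into a compact context vector for the language model. The tensor shapes and the window and keep_ratio parameters below are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def aggregate_environment(center_idx, frame_feats, query_feat, window=4, keep_ratio=0.5):
    """Toy sketch of environment-aware aggregation.
    frame_feats: (N, D) visual features of all scene frames (hypothetical encoder output)
    query_feat:  (D,)   text feature of the task-related keywords
    """
    # Information acquisition: collect the frames surrounding the retrieved key view
    lo = max(0, center_idx - window)
    hi = min(frame_feats.size(0), center_idx + window + 1)
    neighbors = frame_feats[lo:hi]

    # Filtering: drop views that are irrelevant to the task query
    sims = F.cosine_similarity(neighbors, query_feat.unsqueeze(0), dim=-1)
    k = max(1, int(keep_ratio * neighbors.size(0)))
    kept = sims.topk(k).indices

    # Information aggregation: similarity-weighted fusion into one context vector,
    # which would then be fed to the LLM alongside the point-cloud features
    weights = F.softmax(sims[kept], dim=0)
    return (weights.unsqueeze(-1) * neighbors[kept]).sum(dim=0)
```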
Experimental Verification: Leading in Multiple 3D Tasks, Outperforming Multiple Contemporary Mainstream Models
The team conducted extensive experiments on authoritative datasets such as ScanNet and ScanQA, using four metrics, BLEU, ROUGE-L, METEOR, and CIDEr, to evaluate the model. The results confirm the strength of HL3DWM: it achieved excellent performance on multiple core 3D vision-language tasks, including 3D dense description, 3D visual question answering, and 3D scene description, outperforming contemporary top-tier 3D large language models such as LL3DA and Grounded 3D-LLM with improvements of 5-20%. When paired with a stronger large language model, the overall performance improves further, fully verifying the effectiveness and adaptability of the approach.
To show the model's workflow more intuitively, the article visualizes the workflow of HL3DWM (as shown in Figure 3). The model can extract task-related keywords and retrieve task-related images. When receiving the question "What is on the small cabinet under the window?", HL3DWM first extracts task-related keywords such as "window" and then retrieves the corresponding image from memory. Next, through the information-acquisition process it obtains the surrounding images centered on the retrieved image, and through the information-aggregation process it obtains task-related tokens. Finally, the large language model combines the global information of the point cloud with the fine-grained information of the images to give an accurate answer: "There are books on the small cabinet under the window."
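For readers who want to reproduce the evaluation protocol, BLEU, ROUGE-L, METEOR, and CIDEr are the standard captioning metrics and are available in the pycocoevalcap package (METEOR additionally requires a Java runtime). The reference/prediction pair below is a made-up toy example, not data from the paper.

```python
# pip install pycocoevalcap   (METEOR also needs Java on the PATH)
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.cider.cider import Cider

# Toy ground-truth and prediction pair keyed by question id (illustrative only)
gts = {"q1": ["there are books on the small cabinet under the window"]}
res = {"q1": ["books are on the small cabinet under the window"]}

scorers = [
    (Bleu(4), ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"]),
    (Rouge(), "ROUGE-L"),
    (Meteor(), "METEOR"),
    (Cider(), "CIDEr"),
]
for scorer, name in scorers:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)
```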
Figure 4 shows qualitative results of HL3DWM on different tasks, verifying the model's 3D scene understanding and reasoning ability. Experimental results show that HL3DWM can better understand 3D space and achieves performance improvements on multiple tasks such as 3D question answering and 3D dense description.
In the 3D question-answering task, when asked "Which side of the chair is the instrument case on?", HL3DWM can accurately answer "On the right side of the chair." In the 3D dense description task, when asked to "Describe this object in the 3D scene," HL3DWM answers "This is a rectangular brown table with chairs placed around it." In the 3D scene description task, when asked to "Describe the 3D scene," HL3DWM answers "This is a spacious space, including the floor, walls, and windows. There is a sofa near the center of the room, and another sofa is placed against the wall. There is also an armchair in the room. In addition, there are several tables in the room. There is a partition in the corner of the room. There is also a lamp in the room." For embodied task planning, when asked "I want to organize the books on the bookshelf. How should I do it?", HL3DWM can not only understand the space but also formulate a clear, executable step-by-step plan: "1. Walk to the bookshelf. 2. Pick up the books on the ground and put them on the bookshelf. 3. Pick up the books on the table and put them on the bookshelf. 4. Pick up the remaining books and arrange them on the bookshelf in an orderly manner."
Research Conclusion
This article presents HL3DWM, a 3D large language model built on a human-like paradigm, which achieves 3D scene understanding and reasoning by imitating the way humans understand the 3D world and human behaviors. The model effectively provides both global information and detailed task-related information for task processing.
Specifically, after receiving an instruction, HL3DWM uses its Object-Aware Image Retrieval (OIR) module to extract task-related information and retrieve task-related images containing the relevant details. It then uses its Environment-Aware Information Aggregation (EIA) module to integrate surrounding environment information, providing sufficient spatial support for the task. Experimental results show that the method achieves excellent performance on multiple 3D vision-language tasks and can effectively fuse the global information of point clouds with the fine details of images.
The emergence of HL3DWM shows how important it is to let large models "observe" and "understand" the world the way humans do. This new paradigm of learning human thinking patterns and integrating global "map-like" information with local "ultra-clear close-up" details not only offers a new perspective on 3D scene understanding and complex task execution, but also opens an imaginative door for future embodied agents, such as home service robots, to truly enter complex real-world environments.
For more method details and experimental analysis, please refer to the original paper.
Paper Title: Human-Like 3D Scene Understanding and Reasoning via Image Retrieval
Paper Authors: Jiakang Yuan, Mingsheng Li, Lin Zhang, Tao Chen
From Cutting-Edge Conferences to Industrial Implementation: Moushen Intelligence Accelerates the Popularization of the "Embodied Brain"
On the core research track where embodied intelligence meets 3D vision-language, how to enable robots to truly understand the world, comprehend 3D space, and reason efficiently has always been a key problem to solve. Moushen Intelligence, a company focused on foundation models for embodied intelligence, is using its self-developed World Motion Model to give robots a general ability to understand physical laws and the principles of action, improving their generalization and equipping them with a native "brain."