Robots No Longer Fumble Blindly: University of Hong Kong and Alibaba Jointly Open-Source FineVLA – Choose Which Hand and Grasp Where, All With a Single Sentence

量子位 (QbitAI), 2026-06-23 15:18
For robots to enter open environments, they must also be able to understand human requirements for "how to do things".

Robot models can already complete tasks based on instructions like "put the cup into the basket", but which hand should be used?

From which direction should it grasp? Should it grasp the cup body or the handle? These key details, which determine how well the task is executed, are rarely annotated in existing robot datasets.

Recently, researchers from XLANG Lab at the University of Hong Kong and the Qwen team at Alibaba proposed FineVLA, an open-source framework for controllable VLA policies.

This framework enables VLA models not only to complete tasks but also to complete them in the way humans specify: which hand to use, from which angle to approach, and which part of the object to contact can all be controlled through language.

Its best mixed-data setting achieved success rates of 86.8%/82.5% in the RoboTwin simulation (+15.0/+11.1 over the baseline) and scored 62.7/100 on a real dual-arm robot (versus 49.9 for Raw-only). Control factors such as posture (+23), color (+18), and approach direction (+18) all improved. The code, models, and evaluation benchmark have all been open-sourced.

Background: Why are VLA models not "obedient" enough?

VLA (Vision-Language-Action) models can already perform operations such as grasping and placing based on natural language, but a long-standing pain point remains: the granularity of language supervision is too coarse.

In image and video generation, the level of detail in the text description directly affects how controllable the result is; robot policy learning is similar, except that the language must constrain a real action process.

When picking up a spoon, different trajectories may use the left or the right arm, and may go around obstacles or move in a straight line, yet in the dataset they often share the same target-level instruction.

This leads to supervision ambiguity: the model can learn to "ultimately succeed", but it is difficult to learn execution constraints such as which hand to use, from which direction to approach, and which part of the object to contact from the language.

Currently, most robot datasets still lack such fine-grained annotations.

Building a controllable VLA system faces three core challenges:

  • Lack of infrastructure for producing fine-grained annotations from heterogeneous data;
  • Lack of benchmarks for evaluating fine-grained understanding of robot videos, and of scalable low-cost annotators;
  • Lack of systematic evidence that fine-grained language truly improves policy learning.

The FineVLA framework addresses these three issues one by one.

Technical Solution

FineVLA constructs a complete closed loop of action-instruction alignment, connecting fine-grained data construction, robot video understanding, scalable annotation, and controllable VLA policy learning.

Left: FineVLA-Tool unifies heterogeneous robot trajectories from 10 open-source datasets, removes redundant demonstrations through clustering and sampling, and annotates representative trajectories with action-aligned descriptions along ten fine-grained dimensions.

The resulting FineVLA-Data supports RoboFine-Bench (which measures fine-grained robot video understanding through Grounding VQA, Reasoning VQA, and Caption evaluations) and RoboFine-VLM (a dedicated VLM annotator for robots).

Right: FineVLA-Policy trains on a mix of original target-level instructions and fine-grained process-level instructions under two action decoding architectures, and is evaluated in the RoboTwin simulation and on real dual-arm manipulation.

The control examples show how fine-grained language can specify execution-sensitive factors such as contact areas, target objects, execution arms, trajectory directions, and failure recovery.

FineVLA consists of four core components that form a complete closed loop of "data, model, evaluation, policy".

FineVLA-Tool: From 970,000 trajectories to fine-grained data

FineVLA-Tool transforms heterogeneous robot data into high-quality fine-grained supervision through four stages:

  • Stage 1, Format Unification: Aggregate 972,247 trajectories from 10 open-source datasets such as Bridge V2, BC-Z, RT-1, and RoboMIND, and convert them uniformly into the LeRobot2.1 format.
  • Stage 2, Action Normalization: Unify the differing time references and kinematic representations of the datasets into absolute coordinates plus normalized quaternion rotations, and remove corrupted trajectories whose actions and states differ excessively.
  • Stage 3, DTW Clustering and Deduplication: Compute the similarity of action trajectories with dynamic time warping (DTW) and perform hierarchical clustering. Select 47,159 representative samples from the roughly 970,000 trajectories while retaining the diversity of manipulation strategies.
  • Stage 4, Ten-Dimensional Fine-Grained Annotation: Annotate along 10 dimensions such as action sequence, execution arm (left/right), target object, contact and approach methods, trajectory direction, and failure recovery. Annotations are first generated by Qwen3.5-Plus and then verified by manual review. The average instruction length grew from 9.3 to 96.8 words after annotation (10.4 times).
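Stage 3 can be illustrated with a minimal sketch. The article only states that DTW similarity and hierarchical clustering are used; the linkage rule (single linkage, cut at a distance threshold), the threshold value, and picking the medoid as each cluster's representative are assumptions made here for illustration, and `dtw_distance` and `cluster_and_pick` are hypothetical names, not part of FineVLA-Tool.

```python
from math import dist, inf


def dtw_distance(a, b):
    """Classic dynamic time warping cost between two sequences of action vectors."""
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(a[i - 1], b[j - 1])  # Euclidean distance between two steps
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def cluster_and_pick(trajectories, threshold):
    """Single-linkage clustering cut at `threshold` (an assumed rule);
    returns the index of one medoid trajectory per cluster."""
    n = len(trajectories)
    d = [[dtw_distance(trajectories[i], trajectories[j]) for j in range(n)] for i in range(n)]

    # Merging every pair closer than the threshold yields the same groups as
    # single-linkage hierarchical clustering cut at that height.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if d[i][j] <= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)

    # Representative = medoid: the member with the smallest total DTW cost to its cluster.
    return sorted(min(members, key=lambda i: sum(d[i][j] for j in members))
                  for members in clusters.values())


# Toy 2-D "trajectories": two rightward moves with different timing, one upward move.
trajs = [
    [(0, 0), (1, 0), (2, 0)],
    [(0, 0), (0.5, 0), (1, 0), (2, 0)],
    [(0, 0), (0, 1), (0, 2)],
]
representatives = cluster_and_pick(trajs, threshold=1.0)
```

In this toy case the two rightward trajectories fall into one cluster and only one of them is kept, which is the deduplication effect the stage is after. The pairwise distance matrix is quadratic in the number of trajectories, so a real pipeline at the scale described would need batching or approximation.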

RoboFine-VLM: Let VLM learn to describe how robots "move"

General VLMs often miss execution details such as disambiguation between similar objects, contact areas, and motion paths. The researchers therefore performed full-parameter supervised fine-tuning of Qwen3.5-VL-397B-A17B on the aforementioned manually verified fine-grained instructions, obtaining RoboFine-VLM. It can output step-level action descriptions covering the 10 control dimensions and serve as a scalable annotator for future data expansion.

RoboFine-Bench: Evaluate fine-grained action understanding

RoboFine-Bench contains 500 video segments, 32 robot embodiments, and 11,631 atomic facts, and has strictly no overlap with the training set. It has two tracks:

  • VQA Track: Contains 1,030 questions distributed along the ten fine-grained annotation dimensions, aggregated into three evaluation axes: entity and scene grounding (Grounding), action and motion understanding (Action), and interaction and state reasoning (State). The model receives the video frames and all questions, and its answers are scored through deterministic matching.
  • Caption Track: Requires the model to generate step-level, action-aligned fine-grained descriptions. An LLM judges how well the model output aligns with the 11,631 pre-extracted atomic facts and produces three metrics: Consistency, Coverage, and Anti-Hallucination. There are two modes: the easy mode provides the original task instruction as a prompt, while the hard mode provides no language cues, so the model must infer the manipulation process from visual observation alone.
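The VQA track's scoring can be sketched as follows. The article only says answers are scored by deterministic matching and grouped into three axes; the normalization rule (lowercasing, trimming whitespace and a trailing period) and the record layout with `axis`, `gold`, and `pred` keys are assumptions for illustration, and the sample questions are toy data.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Assumed normalization: lowercase, trim, drop a trailing period, collapse spaces."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def score_vqa(items):
    """Exact-match accuracy per evaluation axis and overall."""
    hits, totals = defaultdict(int), defaultdict(int)
    for item in items:
        totals[item["axis"]] += 1
        hits[item["axis"]] += normalize(item["pred"]) == normalize(item["gold"])
    per_axis = {axis: hits[axis] / totals[axis] for axis in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_axis, overall


# Toy answers, not taken from RoboFine-Bench.
items = [
    {"axis": "Grounding", "gold": "red cup", "pred": "Red cup."},
    {"axis": "Action", "gold": "left arm", "pred": "right arm"},
    {"axis": "Action", "gold": "from above", "pred": "from above"},
    {"axis": "State", "gold": "yes", "pred": "Yes"},
]
per_axis, overall = score_vqa(items)
```

Because the matching is deterministic, the same model output always receives the same score, unlike the Caption track, which relies on an LLM judge.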

FineVLA-Policy: Verify the policy benefits of fine-grained language

The visual observations and action labels are kept unchanged and only the paired language is varied (Raw-only vs. FG-only vs. Mixed), which strictly isolates the effect of language supervision.

To systematically verify the effectiveness of fine-grained annotation, the experiments use three policy configurations that separate the effects of architecture and data scale: RDT-OFT and RDT-GR00T share the same pre-training data but use different action decoding architectures (OFT vs. GR00T), while RDT-OFT and AlohaMix-OFT share the same architecture but differ in pre-training data scale (AlohaMix is about 13 times the size of RDT).

Each configuration is evaluated under seven FG:Raw instruction ratios to ensure that the conclusion is not affected by a specific architecture or data scale.
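One way to realize an FG:Raw instruction ratio is sketched below. The article does not describe how the mixing is implemented; assigning one instruction type per episode in a repeating pattern is an assumption made here, and `mix_instructions` and the episode fields are hypothetical. The point it illustrates is that observations and actions stay fixed while only the paired text changes.

```python
def mix_instructions(episodes, fg_parts, raw_parts):
    """Pair each episode with its fine-grained (FG) or raw instruction so that
    the FG:Raw ratio over the dataset is fg_parts:raw_parts."""
    if fg_parts < 0 or raw_parts < 0 or fg_parts + raw_parts == 0:
        raise ValueError("ratio parts must be non-negative and not both zero")
    period = fg_parts + raw_parts
    mixed = []
    for i, ep in enumerate(episodes):
        use_fg = (i % period) < fg_parts  # first fg_parts slots of each period get FG text
        mixed.append({
            "episode_id": ep["episode_id"],  # observations and actions are untouched
            "instruction": ep["fg"] if use_fg else ep["raw"],
            "source": "FG" if use_fg else "Raw",
        })
    return mixed


# Toy episodes; the instruction texts are illustrative only.
episodes = [
    {
        "episode_id": i,
        "raw": "put the cup into the basket",
        "fg": "use the left arm to grasp the cup by its handle from above, "
              "then place it into the basket",
    }
    for i in range(6)
]

mixed_1_2 = mix_instructions(episodes, fg_parts=1, raw_parts=2)  # FG:Raw = 1:2
raw_only = mix_instructions(episodes, fg_parts=0, raw_parts=1)   # Raw-only
fg_only = mix_instructions(episodes, fg_parts=1, raw_parts=0)    # FG-only
```

A real pipeline would likely shuffle episodes first so that the FG and Raw assignments are not correlated with dataset order.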

Experimental Results

Model Understanding Ability

RoboFine-VLM achieved 68.2% accuracy on the VQA track, exceeding the strongest general-purpose baseline, GPT-5.4 (60.2%), by 8.0 percentage points. It scored 82.2% in the Caption hard setting, also ahead of GPT-5.4 (78.0%). The automatic scoring is highly consistent with the manual ranking (Spearman 0.943).
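For readers unfamiliar with the agreement statistic, the sketch below computes a Spearman rank correlation between automatic scores and human ratings. The input numbers are made-up toy values, not the paper's data, and the function names are hypothetical; the method itself (Pearson correlation of average ranks) is the standard definition.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5  # undefined if either input is constant


# Toy example: five models scored by an automatic judge and by human raters.
auto_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
human_scores = [5, 4, 2, 3, 1]
rho = spearman(auto_scores, human_scores)
```

Here the two rankings differ only by one swapped pair, giving a correlation of 0.9; a value near 1 means the automatic metric orders models almost exactly as humans do.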

Simulation Experiment: RoboTwin

Evaluating seven FG:Raw ratios on RoboTwin revealed two key findings:

Finding 1: FG-only outperforms Raw-only in all settings (a gain of +1.4 to +8.1). Fine-grained supervision does not harm the task success rate.

Finding 2: The success rate shows an inverted U-shaped trend, with the peak between FG:Raw = 1:2 and 1:1.

The optimal setting reached 86.8%/82.5%, a +15.0/+11.1 improvement over the baseline. Raw tells the model "what to do", and FG tells the model "how to do it", and the two are complementary.

Real Robot Experiment

On the CobotMagic dual-arm platform, the research team designed a "paired evaluation": in the same visual scene, only one language control factor is changed, and the evaluation observes whether the policy changes its execution accordingly. The following table shows the real-world scoring results from the original paper, with all scores normalized to 100 points.

In the table, Avg(ID) is the average score over the 7 in-distribution tasks, and Avg(All) additionally includes the OOD L→R combination probe. FG:Raw = 1:1 reached 62.7/100 on Avg(ID) (49.9 for Raw-only and 54.4 for FG-only); with OOD included, Avg(All) was 56.1 (43.6 for Raw-only).

In terms of specific control factors, FG:Raw = 1:1 improved over Raw-only in color (22→40), posture (24→47), approach direction (60→78), rotation direction (76→86), and execution arm (60→64). The larger gains were concentrated on factors that target-level instructions leave unspecified: posture (+23), and color and approach direction (+18 each). The OOD L→R probe requires the robot to use its left hand to put the object into the right-side bowl, an arm-target combination not seen in training; this item rose from 0 to 10/100, indicating that mixed fine-grained supervision brings some factor-level generalization, but fully compositional instructions remain challenging.
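The per-factor gains quoted above follow directly from the paired scores. A small sketch that recomputes them from the numbers reported in the text (the dictionary layout and variable names are illustrative):

```python
# (Raw-only score, FG:Raw = 1:1 score) per control factor, as reported in the text.
scores = {
    "color": (22, 40),
    "posture": (24, 47),
    "approach direction": (60, 78),
    "rotation direction": (76, 86),
    "execution arm": (60, 64),
    "OOD L->R": (0, 10),
}

# Gain of the mixed setting over Raw-only, in points out of 100.
gains = {factor: mixed - raw for factor, (raw, mixed) in scores.items()}

# Factors ordered from largest to smallest gain (ties keep their listed order).
ranked = sorted(gains.items(), key=lambda kv: -kv[1])

for factor, gain in ranked:
    print(f"{factor:20s} +{gain}")
```

The ordering makes the article's point visible: the factors that a target-level instruction never mentions (posture, color, approach direction) gain the most, while execution arm gains the least.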

In addition, fine-grained supervision also shows a scaling trend: it narrows the architecture gap (the Easy/Hard gap between OFT and GR00T decreased from 6.4/6.6 to 0.8/0.5) and benefits more from a larger data scale.

Project Value

The core contribution of FineVLA is