🦾 Any3D-VLA: Enhancing VLA Robustness via Diverse Point Clouds ☁️

In submission


Xianzhe Fan1,2    Shengliang Deng1,2    Xiaoyang Wu1    Yuxiang Lu1    Zhuoling Li1    Mi Yan2,3    Yujia Zhang1    Zhizheng Zhang2    He Wang2,3    Hengshuang Zhao1   

1The University of Hong Kong    2Galbot    3Peking University   

Figure 1. We propose Any3D-VLA. It unifies simulator-based, sensor-based, and model-estimated point clouds in the training pipeline (a), enabling diverse inputs and learning domain-agnostic 3D representations that are fused with the corresponding 2D representations (b). (c) shows our experimental results in real-world settings.


Abstract


Existing Vision-Language-Action (VLA) models typically take 2D images as visual input, which limits their spatial understanding in complex scenes. How can we incorporate 3D information to maximize the resulting capability gains? We conduct a pilot study across different observation spaces and visual representations. The results show that explicitly lifting visual input into point clouds yields representations that better complement the corresponding 2D representations. To address the challenges of (1) scarce 3D data and (2) the domain gap induced by cross-environment differences and depth-scale biases, we propose Any3D-VLA. It unifies simulator-based, sensor-based, and model-estimated point clouds within a single training pipeline, constructs diverse inputs, and learns domain-agnostic 3D representations that are fused with the corresponding 2D representations. Simulation and real-world experiments demonstrate Any3D-VLA's advantages in improving performance and mitigating the domain gap.
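As a rough illustration of the 2D/3D fusion described above, the sketch below shows one way domain-agnostic point-cloud tokens could be combined with image tokens before reaching the VLA backbone. The encoder interfaces, projection layers, and single self-attention fusion layer are assumptions for illustration, not Any3D-VLA's actual architecture.

```python
import torch
import torch.nn as nn

class PointImageFusion(nn.Module):
    """Illustrative fusion of 3D point-cloud tokens with 2D image tokens.

    Assumed interfaces: `pc_encoder` maps a point cloud (B, N, 6) to
    (B, T3, d3) tokens and `img_encoder` maps images (B, 3, H, W) to
    (B, T2, d2) tokens. The real Any3D-VLA fusion may differ.
    """

    def __init__(self, pc_encoder, img_encoder, d3, d2, d_model):
        super().__init__()
        self.pc_encoder = pc_encoder
        self.img_encoder = img_encoder
        self.proj_3d = nn.Linear(d3, d_model)   # project 3D tokens to a shared width
        self.proj_2d = nn.Linear(d2, d_model)   # project 2D tokens to the same width
        self.fuse = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)

    def forward(self, points, images):
        tok3d = self.proj_3d(self.pc_encoder(points))   # (B, T3, d_model)
        tok2d = self.proj_2d(self.img_encoder(images))  # (B, T2, d_model)
        # Concatenate both modalities and let self-attention mix them; the
        # fused sequence would then be consumed by the downstream VLA backbone.
        return self.fuse(torch.cat([tok3d, tok2d], dim=1))
```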



Real-World Post-Training


To validate Any3D-VLA's generalization to new tasks involving task-specific rules and novel language instructions, we design two challenging evaluation tasks.

Notations: For the training dataset, Setting 2 incorporates both sensor-based point clouds and point clouds estimated by multiple models, while Setting 3 uses only sensor-based point clouds. During inference, RealSense denotes the use of the sensor-based point cloud, whereas DA3 refers to the point cloud back-projected from Depth Anything 3 depth predictions.
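For context on the DA3 inputs, the point cloud is obtained by back-projecting a predicted depth map into 3D using camera intrinsics. A minimal back-projection sketch is given below; it assumes a metric (or scale-aligned) depth map and known pinhole intrinsics, and treats the Depth Anything 3 prediction as a given array rather than calling its API.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W), in meters, into an (M, 3) point cloud.

    `fx, fy, cx, cy` are pinhole intrinsics. Invalid pixels (depth <= 0) are
    dropped. If the depth prediction is only known up to scale, a scale
    alignment step would be needed first; that step is omitted here.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]
```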


Table 1. Success rates of post-training tasks.


Task 1: Grasp a flower and place it into a vase.

pi0.5: Fail
GraspVLA: Fail
SpatialVLA: Fail

Ours (RealSense, Setting 3): Success (1 trial)
Ours (DA3, Setting 3): Success (1 trial)
Ours (RealSense, Setting 2): Success (1 trial)
Ours (DA3, Setting 2): Success (1 trial)


Task 2: Place a transparent condiment cup into a specific slot of a cup carrier.

pi0.5: Fail
GraspVLA: Fail
SpatialVLA: Fail

Ours (RealSense, Setting 3): Success (1 trial)
Ours (DA3, Setting 3): Success (1 trial)
Ours (RealSense, Setting 2): Success (2 trials)
Ours (DA3, Setting 2): Success (1 trial)




Zero-Shot Comparisons in the Real World


To evaluate Any3D-VLA's zero-shot generalization ability and robustness in the real world, we design four challenging test sets.

Notations: For the training dataset, Setting 1 uses only the simulator-based point cloud, whereas Setting 2 incorporates both simulator-based point clouds and point clouds estimated by multiple models. During inference, RealSense denotes the use of the sensor-based point cloud, while DA3 refers to the point cloud back-projected from Depth Anything 3 depth predictions.
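Setting 2's mixing of point-cloud sources during training can be pictured as a per-sample random choice among the available sources. The source names, loader interface, and sampling weights below are illustrative assumptions, not the actual mixing ratios used in training.

```python
import random

# Hypothetical source keys; each maps to a callable that returns a point cloud
# for the current training sample. The weights are illustrative only.
SOURCES = {
    "simulator": 0.4,    # rendered from ground-truth simulator depth
    "da3": 0.3,          # back-projected from a Depth Anything 3 prediction
    "other_model": 0.3,  # another monocular depth / geometry estimator
}

def sample_point_cloud(sample, loaders):
    """Pick one point-cloud source per training sample (Setting 2-style mixing).

    `loaders` maps a source name to a callable `sample -> point cloud`, so the
    model sees diverse geometry and cannot overfit to a single depth domain.
    """
    names, weights = zip(*SOURCES.items())
    source = random.choices(names, weights=weights, k=1)[0]
    return loaders[source](sample)
```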


Figure 2. Zero-shot comparisons in the real world.


1. Standard

Relatively simple scenes, with no more than six objects on the tabletop, and target objects mostly of conventional shapes and scales.

pi0.5: Fail
GraspVLA: Fail
SpatialVLA: Fail

Ours (RealSense, Setting 1): Success (2 trials)
Ours (DA3, Setting 1): Success (3 trials)
Ours (RealSense, Setting 2): Success (2 trials)
Ours (DA3, Setting 2): Success (2 trials)


2. Scale & Shape Challenge

Scenes with substantial intra-class variation in size and shape, e.g., dogs and bottles of different sizes and appearances. This set also includes geometrically challenging target objects, such as elongated objects (pen, fork, spoon, etc.) and small objects (diameter < 3 cm, e.g., a bottle cap).

pi0.5: Fail
GraspVLA: Success (2 trials)
SpatialVLA: Fail

Ours (RealSense, Setting 1): Success (3 trials)
Ours (DA3, Setting 1): Success (1 trial)
Ours (RealSense, Setting 2): Success (2 trials)
Ours (DA3, Setting 2): Success (1 trial)


3. Viewpoint Challenge

While keeping the coordinate-system origin fixed, we rotate the camera viewpoint around the z-axis (perpendicular to the tabletop) by 5°, 15°, and 30°, respectively; a rotation sketch follows the results below.

At 15° rotation:
pi0.5: Success (3 trials)
GraspVLA: Fail
SpatialVLA: Fail

Ours (RealSense, Setting 1): Fail
Ours (DA3, Setting 1): Fail
Ours (RealSense, Setting 2): Success (3 trials)
Ours (DA3, Setting 2): Success (3 trials)
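For reference, the viewpoint perturbation in this test set amounts to a rigid rotation about the tabletop normal. The sketch below assumes the points are already expressed in a frame whose z-axis is perpendicular to the tabletop and whose origin coincides with the fixed coordinate-system origin.

```python
import numpy as np

def rotate_viewpoint(points, degrees):
    """Rotate an (N, 3) point cloud about the z-axis (tabletop normal).

    Rotating the scene by -theta about a fixed origin is equivalent to
    rotating the camera by +theta about the same axis, which mimics the
    5°, 15°, and 30° viewpoint perturbations used in this test set.
    """
    theta = np.deg2rad(degrees)
    rz = np.array([
        [np.cos(theta), -np.sin(theta), 0.0],
        [np.sin(theta),  np.cos(theta), 0.0],
        [0.0,            0.0,           1.0],
    ])
    return points @ rz.T
```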


4. Appearance-Deprived Challenge

Scenes designed to weaken informative 2D cues, including transparent objects, textureless objects (solid white, solid green, solid blue, etc.), and visual camouflage (objects the same color as the tabletop), forcing the model to rely more on 3D geometry than on 2D color and texture cues.

pi0.5: Fail
GraspVLA: Fail
SpatialVLA: Success (2 trials)

Ours (RealSense, Setting 1): Success (3 trials)
Ours (DA3, Setting 1): Success (1 trial)
Ours (RealSense, Setting 2): Success (2 trials)
Ours (DA3, Setting 2): Success (1 trial)