Vision–language–action (VLA) models have enabled language-conditioned, long-horizon robot manipulation, but most existing systems are limited to grippers. Scaling VLA policies to bimanual platforms with high-DoF dexterous hands remains challenging due to the expanded action space, frequent hand–object occlusions, and the cost of collecting demonstrations. We present GR-Dexter, an integrated hardware–model–data framework for VLA-based generalist manipulation on a bimanual dexterous-hand robot. Our approach combines a compact high-DoF robotic hand, an intuitive bimanual teleoperation pipeline for collecting demonstrations, and a training recipe that leverages teleoperated robot trajectories together with large-scale vision–language and carefully curated cross-embodiment data. Across real-world evaluations spanning long-horizon everyday manipulation and generalizable pick-and-place, GR-Dexter achieves strong in-domain performance and improved robustness to unseen objects and unseen instructions under out-of-distribution settings. We hope GR-Dexter serves as a practical step toward generalist dexterous-hand robotic manipulation.
The ByteDexter hand series employs a linkage-driven transmission mechanism for its advantages in force transparency, durability, and ease of maintenance. As an upgraded successor to the V1 hand, the ByteDexter V2 hand introduces an additional thumb DoF, bringing the total to 21 DoFs, while simultaneously reducing the overall hand size (height: 219 mm, width: 108 mm). Each finger provides four DoFs, and the thumb incorporates five to enable a wider range of oppositional and dexterous motions. The five fingertips of ByteDexter V2 are covered with high-density piezoresistive sensor arrays that measure normal forces with fine spatial granularity across the fingertip, the finger pad, and the fingertip's lateral surface.
Real-world robot data are collected via a bimanual teleoperation interface comprising a Meta Quest VR setup for wrist pose tracking, two Manus Metagloves for hand motion capture, and foot pedals for arm control. Two Meta Quest controllers are mounted on the dorsal side of the gloves to ensure reliable, coordinated wrist-hand motion tracking. This setup allows teleoperators to coordinate two Franka arms simultaneously for long-horizon manipulation tasks. Human motions are retargeted in real time to joint position commands, providing a kinematically consistent mapping via whole-body control. The system incorporates robust adaptive mechanisms to handle visual tracking loss and prevent hazardous operation. Hand motion retargeting is formulated as a constrained optimization problem that aggregates wrist-tip vectors, thumb-tip vectors, collision avoidance, and a regularization term, and is solved using Sequential Quadratic Programming (SQP).
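The exact cost terms, weights, and solver configuration are not specified beyond the description above, but a minimal sketch of this kind of fingertip-vector retargeting, assuming a placeholder forward-kinematics function, illustrative weights and joint limits, and SciPy's SLSQP solver as the SQP implementation, might look like the following.

```python
# Illustrative sketch of fingertip-vector retargeting as a constrained
# optimization, solved with SciPy's SLSQP (an SQP-family method). The forward
# kinematics, keypoint targets, weights, and joint limits below are placeholder
# assumptions, not the GR-Dexter implementation.
import numpy as np
from scipy.optimize import minimize

N_JOINTS = 21  # ByteDexter V2 hand DoFs

def forward_kinematics(q):
    """Placeholder FK: returns 5 fingertip positions, shape (5, 3), for joint vector q."""
    # A real implementation would evaluate the hand's kinematic chain here.
    return np.tanh(q[:15]).reshape(5, 3)

def retargeting_cost(q, wrist_tip_targets, thumb_tip_targets, q_prev, reg=1e-2):
    tips = forward_kinematics(q)
    wrist_vecs = tips                    # wrist-to-fingertip vectors (wrist at origin)
    thumb_vecs = tips[1:] - tips[0]      # thumb-to-fingertip vectors
    cost = np.sum((wrist_vecs - wrist_tip_targets) ** 2)
    cost += np.sum((thumb_vecs - thumb_tip_targets) ** 2)
    cost += reg * np.sum((q - q_prev) ** 2)  # regularization / temporal smoothness
    return cost

def min_tip_distance(q, d_min=0.01):
    """Inequality constraints: pairwise fingertip distances >= d_min (collision avoidance)."""
    tips = forward_kinematics(q)
    return np.array([np.linalg.norm(tips[i] - tips[j]) - d_min
                     for i in range(5) for j in range(i + 1, 5)])

def retarget(wrist_tip_targets, thumb_tip_targets, q_prev):
    bounds = [(-1.5, 1.5)] * N_JOINTS    # placeholder joint limits (rad)
    result = minimize(
        retargeting_cost, q_prev,
        args=(wrist_tip_targets, thumb_tip_targets, q_prev),
        method="SLSQP",
        bounds=bounds,
        constraints=[{"type": "ineq", "fun": min_tip_distance}],
    )
    return result.x  # joint position command for the hand

# Example usage with random glove-derived keypoint targets
q_cmd = retarget(np.random.rand(5, 3) * 0.1, np.random.rand(4, 3) * 0.1, np.zeros(N_JOINTS))
```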
GR-Dexter follows GR-3 and adopts a Mixture-of-Transformers architecture for a vision-language-action (VLA) model with 4B parameters. The model controls a bimanual robot with a fixed base by generating an action chunk of length k conditioned on the input language instruction, the observation, and the robot state. Specifically, each action is a vector consisting of: 1) arm joint actions, 2) arm end-effector poses, 3) hand joint actions, and 4) fingertip positions.
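Concretely, the per-timestep action can be pictured as a structured vector. The sketch below is only illustrative: the dimensions (two 7-DoF Franka arms, two 21-DoF ByteDexter V2 hands, five 3-D fingertip positions per hand, 7-D end-effector poses) and the ordering are assumptions rather than the model's actual output interface.

```python
# Illustrative layout of one action vector; dimensions and ordering are
# assumptions, not the released GR-Dexter action interface.
from dataclasses import dataclass
import numpy as np

@dataclass
class DexterAction:
    arm_joints: np.ndarray           # (2, 7)   joint position targets for both arms
    arm_ee_poses: np.ndarray         # (2, 7)   end-effector poses (xyz + quaternion), assumed
    hand_joints: np.ndarray          # (2, 21)  joint position targets for both hands
    fingertip_positions: np.ndarray  # (2, 5, 3) fingertip positions per hand

    def flatten(self) -> np.ndarray:
        """Concatenate all components into a single flat action vector."""
        return np.concatenate([
            self.arm_joints.ravel(),
            self.arm_ee_poses.ravel(),
            self.hand_joints.ravel(),
            self.fingertip_positions.ravel(),
        ])

# An action chunk of length k is then a (k, D) array of such flattened vectors.
```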
We employ a co-training strategy for GR-Dexter using a mixture of three distinct data sources: web-scale vision-language data, cross-embodiment real-robot data, and human trajectory data. To handle the structural differences across datasets, we mask out unavailable or unreliable action dimensions (e.g., specific joints not present in the target embodiment); a sketch of this masking is given after the data list below.
• Vision-language data: We reuse the vision-language data from GR-3, which covers a spectrum of tasks including image captioning, visual question answering, image grounding, and grounded image captioning.
• Cross-embodiment data: We leverage existing open-source bimanual humanoid datasets. Specifically, we select three dual-arm dexterous manipulation datasets that span diverse embodiments and task settings: the Fourier ActionNet Dataset, the OpenLoong Baihu Dataset, and RoboMIND.
• Human trajectories: While cross-embodiment data offers accurate robot information, the scale and diversity of tasks are inevitably limited by collection costs. Crowdsourcing human demonstrations via easily accessible VR devices offers a promising way to scale up data quantity and diversity. We adopt an open-source dataset and supplement it with data collected using Pico VR devices.
• Transferring cross-embodiment trajectories: We first standardize camera observations across datasets. We then perform careful retargeting to the ByteDexter V2 hand by aligning the fingertips. This fingertip-centric alignment preserves task-relevant contact geometry while remaining agnostic to joint-level discrepancies. The resulting trajectories are then resampled by task category to produce a balanced cross-embodiment training corpus.
• Transferring human trajectories: The gap between human and robotic hands is substantial: VR data collection introduces ego-motion from head-mounted cameras, and single-frame hand pose estimation commonly leads to temporal jitter and inconsistency. We first perform careful filtering based on hand visibility and velocity. Next, human trajectories are mapped into the same visual and kinematic representation as robot data, following a process similar to the cross-embodiment transfer described above.
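To make the action-dimension masking mentioned above concrete, the following is a minimal sketch of a masked behavior-cloning loss. The per-dimension regression form, the mask construction, and the tensor shapes are assumptions for illustration, not the GR-Dexter training objective.

```python
# Minimal sketch of masking unavailable action dimensions in a behavior-cloning
# loss during co-training. Loss form and shapes are illustrative assumptions.
import torch

def masked_action_loss(pred, target, valid_mask):
    """
    pred, target: (batch, chunk_len, action_dim) predicted / ground-truth action chunks.
    valid_mask:   (batch, action_dim) 1.0 where a dimension exists and is reliable
                  for the source embodiment, 0.0 otherwise (e.g., missing joints).
    """
    per_dim = (pred - target) ** 2                 # (B, T, D)
    mask = valid_mask.unsqueeze(1)                 # broadcast over the action chunk
    # Average only over valid dimensions so embodiments with fewer DoFs
    # do not dilute the gradient signal.
    return (per_dim * mask).sum() / mask.expand_as(per_dim).sum().clamp(min=1.0)

# Example: a source embodiment without fingertip-position labels (hypothetical split)
B, T, D = 4, 16, 30
pred, target = torch.randn(B, T, D), torch.randn(B, T, D)
valid = torch.ones(B, D)
valid[:, 20:] = 0.0   # mask out the (hypothetical) fingertip-position block
loss = masked_action_loss(pred, target, valid)
```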
We conduct extensive real-world experiments to evaluate the performance of GR-Dexter on long-horizon bimanual manipulation and generalizable pick-and-place tasks. We evaluate GR-Dexter in: (1) challenging dexterous tool use tasks, (2) long-horizon task execution, and (3) OOD scenarios with novel relative spatial configurations, unseen objects, and unseen instructions.
Basic Settings: the relative spatial configurations (layouts) of objects are present in the training data. Here, plain-VLA performs comparably to GR-Dexter, achieving success rates of 0.96 and 0.97, respectively. This shows that co-training preserves the strong in-domain capability of the teleop-only baseline.
Out-of-Distribution Settings: the relative spatial configurations of objects are novel at test time. We evaluate on five unseen layouts while keeping the instruction order the same as in the Basic setting. In OOD settings, the performance of plain-VLA drops to 0.64, whereas GR-Dexter achieves a substantially higher 0.89. These results indicate that co-training with vision-language data significantly enhances generalization to unseen spatial layouts while maintaining in-domain performance.
Additional Qualitative Results: we further consider two more complex long-horizon tasks. 1) Vacuuming: the robot learns a stable four-finger grasp to hold the tabletop vacuum while using the thumb to press the power button (on/off). It then presses the button again to increase the power level and sweeps to clear confetti. 2) Bread serving: the robot learns to stably grasp food tongs to retrieve a croissant from a pastry container while the other hand holds a plate. It then releases the tongs and places the croissant onto the plate. We observe that GR-Dexter performs both tasks reliably.
Basic Settings: in the in-domain Basic setting, plain VLA reaches 0.87, GR-Dexter (w/o cross-embodiment data) reaches 0.85, and GR-Dexter achieves the best performance at 0.93. These results are interesting for two reasons: 1) GR-Dexter w/o cross-embodiment data performs slightly worse than plain VLA, since in the in-distribution setting, VL data provides no additional task information but makes optimization more challenging; 2) with cross-embodiment data, GR-Dexter significantly outperforms both baselines, which suggests that, after careful data processing and alignment, larger-scale cross-embodiment training of the action expert can improve the overall robustness and performance of GR-Dexter.
Unseen Objects and Instructions: we observe that 1) the performance of plain VLA drops significantly; 2) VLM co-training largely improves the robustness and generalization of GR-Dexter, but empirically, GR-Dexter w/o cross-embodiment data still suffers from inaccurate grasping; 3) with carefully filtered and aligned cross-embodiment co-training, GR-Dexter demonstrates strong generalization to both unseen objects and unseen instructions, achieving final success rates of 0.85 and 0.83, respectively.
These gains are consistent with the qualitative examples in the following figure, where GR-Dexter successfully grasps unseen objects by leveraging skills learned from cross-embodiment data, and correctly interprets and executes previously unseen instructions.
@article{wen2025grdextertechnicalreport,
title={GR-Dexter Technical Report},
author={Wen, Ruoshi and Chen, Guangzeng and Cui, Zhongren and Du, Min and Gou, Yang and Han, Zhigang and Huang, Liqun and Lei, Mingyu and Li, Yunfei and Li, Zhuohang and Liu, Wenlei and Liu, Yuxiao and Ma, Xiao and Niu, Hao and Ouyang, Yutao and Ren, Zeyu and Shi, Haixin and Xu, Wei and Zhang, Haoxiang and Zhang, Jiajun and Zhang, Xiao and Zheng, Liwei and Zhong, Weiheng and Zhou, Yifei and Zhu, Zhengming and Li, Hang},
journal={arXiv preprint arXiv:2512.24210},
year={2025}
}