Under Review

Contrastive Action-Image Pre‑training
for Visuomotor Control

Yuvan Sharma^1,* Dantong Niu^1,2,* Anirudh Pai^1,* Zekai Wang¹ Zhuoyang Liu¹ Baifeng Shi¹ Stefano Saravalle^3,4 Boning Shao¹ Ruijie Zheng² Jing Wang² Konstantinos Kallidromitis⁴ Yusuke Kato⁴ Fabio Galasso³ Yuke Zhu² Danfei Xu² Linxi “Jim” Fan² Jitendra Malik^1,† Trevor Darrell^1,† Roei Herzig^1,†

¹UC Berkeley ²NVIDIA ³Sapienza University of Rome ⁴Panasonic

^*Equal Contribution ^†Equal Advising

Paper Model Training Code (soon) BibTeX

TL;DR. Robot data is too scarce for large-scale pre-training. CAIP instead learns action-centric visual features from 32,041 hours of egocentric human video by treating 3D hand poses as a proxy for end-effector actions, aligning vision and action through a contrastive objective. With only 88 hours of robot data, CAIP reaches a 76% average success rate on real-world dexterous manipulation, over 30 points above the strongest vision-encoder baseline.

Motivation

Why action-centric, egocentric pre-training?

Visual perception underpins robotic manipulation, yet the encoders that power modern policies were never designed for physical interaction. Image-text models like CLIP and SigLIP capture semantics; self-distilled models like DINOv2 capture fine-grained spatial details. But neither sees manipulation environments during training, so their representations lack any action-centric structure.

The most direct fix, pre-training on robot trajectories, runs into a wall: robot data is orders of magnitude smaller than internet-scale corpora. Egocentric human video, by contrast, is abundant and captures the same first-person viewpoint a head-mounted robot camera observes, with naturally co-occurring hand motion. CAIP turns human hand poses into a proxy for end-effector actions, unlocking action-conditioned pre-training at scale.

CAIP saliency (overlaid) against the raw egocentric input across diverse scenes. Rather than spreading attention across the frame, CAIP produces manipulation-centric features concentrated on the hands and the objects they interact with.

32,041 h

Egocentric human video

88 h

Robot manipulation data

76%

Avg. real-world success

+30 pts

Over best baseline

Method

Aligning vision and action by contrast

CAIP uses three encoders: vision, language, and action. A ViT-L/16 image tower and a 24-layer text tower (initialized from SigLIP 2) produce patch and token features; a lightweight 4-layer action transformer encodes a chunk of future hand motion into a single embedding via a [CLS] token. We attention-pool patch tokens using text tokens as queries, then pool again with a learnable query to form a text-conditioned image embedding. This embedding and the action embedding are aligned with a SigLIP-style sigmoid contrastive loss.

CAIP architecture: vision, text, and action transformers aligned via attention pooling and a contrastive loss — **CAIP architecture.** A ViT encodes image patches and a text transformer encodes language tokens, while an action transformer encodes a T-step action chunk into a single embedding. Two stages of attention pooling produce a text-conditioned image embedding, which is aligned to the action embedding via a sigmoid contrastive loss.

Hand poses as actions

Each hand is represented by 21 keypoints as SE(3) transforms (MANO convention). Action chunks emulate end-effector delta control over T = 64 steps, roughly two seconds of future motion at 30 Hz, aligning naturally with robot action spaces.

Scale from human video

Training spans ~1,000 h of lab data (with wrist views), ~31,000 h of in-the-wild egocentric video, and ~88 h of tabletop humanoid data for embodiment diversity, far beyond what robot datasets alone can offer.

A modular, frozen backbone

The pre-trained encoder transfers to a closed-loop policy: a decoder-only transformer (Qwen3.5-0.8B) trained from scratch with a flow-matching action head, on top of frozen CAIP features.

Experiments

Results

Across six real-world dexterous tasks on a Dexmate Vega bimanual robot with two 22-DoF Sharpa Wave hands (12 trials each), CAIP reaches the highest average success rate, 76.0%, over 30 points above the strongest baseline, and tops 5 of 6 tasks. Baselines are competitive on some tasks but collapse on others; CAIP stays strong across the board.

Method	Fold Shorts	Pour	Pick Fruits	Dispense Soap	Turn On Lamp	Pull Tissue	Avg.
R3M	14.6	12.5	2.1	29.2	8.3	37.5	17.4
Qwen3.5 ViT	27.1	22.9	60.4	72.9	8.3	12.5	34.0
VideoMAE	22.9	52.1	0.0	37.5	25.0	18.8	26.0
VC-1	18.8	56.3	0.0	62.5	0.0	22.9	26.7
MVP	54.2	62.5	2.1	93.8	8.3	31.3	42.0
DINOv2	22.9	81.3	52.1	50.0	25.0	20.8	42.0
SigLIP	12.5	70.8	37.5	83.3	25.0	25.0	42.4
SigLIP 2	4.2	35.4	52.1	93.8	50.0	25.0	43.4
CAIP (Ours)	68.8	83.3	56.3	100.0	75.0	72.9	76.0

Success rates (%) over 12 trials per task. Bold = best, underline = second best.

Policy rollouts

Closed-loop rollouts of CAIP-conditioned policies across the six evaluation tasks on the Dexmate Vega with Sharpa Wave dexterous hands. All videos play at 1× (real-time) speed.

Fold Shorts

Pour Almonds

Pick Fruits

Dispense Soap

Turn On Lamp

Pull Tissue

To test transfer beyond the embodiment and domain CAIP was trained for, we evaluate on the ManiSkill2 Franka benchmark: a single 7-DoF arm with a parallel-jaw gripper on tabletop tasks. This is a large shift from CAIP’s egocentric, dexterous, bimanual pre-training: different embodiment, action space, and visual domain (simulation vs. real video). We keep the same protocol: frozen encoder, policy trained from scratch on 200 demonstrations per task, 12 trials each. Despite the domain gap, CAIP again achieves the highest average success rate (77.8%), ahead of SigLIP 2 (72.2%) and DINOv2 (69.4%).

Encoder	Lift Peg	Stack Cubes	Push Cube	Avg.
VC-1	16.7	0.0	100.0	36.1
R3M	25.0	16.7	91.7	44.4
MVP	33.3	0.0	100.0	44.4
DINOv2	58.3	50.0	100.0	69.4
SigLIP 2	58.3	58.3	100.0	72.2
CAIP (Ours)	75.0	58.3	100.0	77.8

ManiSkill2 Franka success rates (%) over 12 trials per task. Bold = best, underline = second best.

Policies are trained only under standard conditions, so any change in success rate reflects the visual representation’s sensitivity to perturbation. CAIP is the most robust encoder under every lighting and distractor perturbation we test.

Lighting variation

Encoder	Turn on Lamp			Dispense Soap			Fold Shorts			Average
Encoder	Orig.	Light	Dark	Orig.	Light	Dark	Orig.	Light	Dark	Orig.	Light	Dark
Qwen3.5 ViT	8.3	0.0	16.7	72.9	58.3	56.3	27.1	8.3	0.0	36.1	22.2	24.3
MVP	8.3	0.0	8.3	93.8	8.3	39.6	54.2	8.3	4.2	52.1	5.6	17.4
CAIP (Ours)	75.0	25.0	33.3	100.0	93.8	75.0	68.8	35.4	20.8	81.3	51.4	43.1

Distractor objects

Encoder	Turn on Lamp		Dispense Soap		Fold Shorts		Average
Encoder	Orig.	2 Distr.	Orig.	2 Distr.	Orig.	2 Distr.	Orig.	2 Distr.
Qwen3.5 ViT	8.3	8.3	72.9	60.4	27.1	16.7	36.1	28.5
MVP	8.3	0.0	93.8	12.5	54.2	16.7	52.1	9.7
CAIP (Ours)	75.0	33.0	100.0	85.4	68.8	39.6	81.3	52.8

Success rates (%). Policies trained only under original conditions. Bold = best.

Fold Shorts scene under Normal, Light, and Dark lighting — **Lighting.** Normal, Light, and Dark conditions for the Fold Shorts task.

Turn On Lamp scene without and with distractor objects — **Distractors.** The Turn On Lamp scene without (left) and with two distractor objects: a red book and a multi-colored Hanoi tower (right).

Analysis

Each encoder’s regions of focus

Side-by-side rollouts compare CAIP against DINOv2 and SigLIP 2 on the same task. Each encoder is shown from a side view and the robot’s head camera, with a saliency map computed using its native query mechanism: CAIP’s text-conditioned cross-attention pool, SigLIP’s learned-probe pooling, and DINOv2’s per-image PCA of patch features. CAIP concentrates on the hands and manipulated objects, while SigLIP 2 scatters across the background and DINOv2 segments by appearance but is instruction-unaware. All videos play at 1× (real-time) speed.

Turn On Lamp

CAIP (Ours)

DINOv2

SigLIP 2

Side view

Head cam

Saliency

Pour

CAIP (Ours)

DINOv2

SigLIP 2

Side view

Head cam

Saliency

Pull Tissue

CAIP (Ours)

DINOv2

SigLIP 2

Side view

Head cam

Saliency

Ablations

Scaling the vision encoder

We ablate CAIP along two axes: encoder capacity and pre-training data scale, under identical data, training, and optimization settings.

Capacity scaling

Scaling the backbone across ViT-B/16, ViT-L/16, and ViT-SO400M/16 improves performance consistently. The largest jump is ViT-B → ViT-L (+33 points average), driven by the harder tasks. ViT-SO400M adds only a few more points at a large compute cost, so we adopt ViT-L as the primary encoder.

Task	ViT-B/16	ViT-L/16	ViT-SO400M/16
Turn On Lamp	16.7	75.0	83.3
Fold Shorts	54.2	68.8	79.2
Dispense Soap	72.9	100.0	100.0
Average	47.9	81.3	87.5

Downstream success rates (%). Bold = best per row. ViT-L (used in all main experiments) shown for reference.

Data scaling

Holding the encoder fixed at ViT-L/16 and varying the fraction of pre-training data, performance improves monotonically, from 50.0% at 20% of the data to 77.8% at full scale, with no sign of saturation. CAIP would likely keep benefiting from more egocentric data. Evaluated on the ManiSkill2 Franka tasks.

Pre-training data	Lift Peg	Stack Cubes	Push Cube	Avg.
20%	58.3	25.0	66.7	50.0
50%	66.7	33.3	83.3	61.1
100%	75.0	58.3	100.0	77.8

ManiSkill2 Franka success rates (%) over 12 trials per task. Bold = best per column.

Get started

Using the encoder

CAIP loads as a standard 🤗 Transformers model. Install transformers, torch, and pillow, then encode an image and instruction into text-conditioned visual features.

from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

REPO = "yuvansharma/caip-vitl256"
model = AutoModel.from_pretrained(REPO, trust_remote_code=True).eval()
processor = AutoProcessor.from_pretrained(REPO, trust_remote_code=True)

image = Image.open("example.png").convert("RGB")
inputs = processor(images=image, text="pick up the red cup", return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

# out.image_pooled    [B, 1024]       text-conditioned pooled image embedding
# out.patch_features  [B, 256, 1024]  patch tokens
# out.text_tokens     [B, 64, 1024]   text token embeddings
# out.text_pooled     [B, 1024]       pooled text embedding

Citation

BibTeX

@misc{sharma2026contrastiveactionimagepretrainingvisuomotor,
      title={Contrastive Action-Image Pre-training for Visuomotor Control},
      author={Yuvan Sharma and Dantong Niu and Anirudh Pai and Zekai Wang and Zhuoyang Liu and Baifeng Shi and Stefano Saravalle and Boning Shao and Ruijie Zheng and Jing Wang and Konstantinos Kallidromitis and Yusuke Kato and Fabio Galasso and Yuke Zhu and Danfei Xu and Linxi "Jim" Fan and Jitendra Malik and Trevor Darrell and Roei Herzig},
      year={2026},
      eprint={2606.17256},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2606.17256},
}

Contrastive Action-Image Pre‑trainingfor Visuomotor Control

Why action-centric, egocentric pre-training?

Aligning vision and action by contrast

Hand poses as actions

Scale from human video

A modular, frozen backbone

Results

Policy rollouts

Lighting variation

Distractor objects

Each encoder’s regions of focus

Turn On Lamp

Pour

Pull Tissue

Scaling the vision encoder

Capacity scaling

Data scaling

Using the encoder

BibTeX

Contrastive Action-Image Pre‑training
for Visuomotor Control