Under Review

Contrastive Action-Image Pre-training for Visuomotor Control

¹UC Berkeley  ²NVIDIA  ³Sapienza University of Rome  ⁴Panasonic
*Equal Contribution Equal Advising

TL;DR. Robot data is too scarce for large-scale pre-training. CAIP instead learns action-centric visual features from 32,041 hours of egocentric human video by treating 3D hand poses as a proxy for end-effector actions, aligning vision and action through a contrastive objective. With only 88 hours of robot data, CAIP reaches a 76% average success rate on real-world dexterous manipulation, over 30 points above the strongest vision-encoder baseline.

Motivation

Why action-centric, egocentric pre-training?

Visual perception underpins robotic manipulation, yet the encoders that power modern policies were never designed for physical interaction. Image-text models like CLIP and SigLIP capture semantics; self-distilled models like DINOv2 capture fine-grained spatial details. But neither sees manipulation environments during training, so their representations lack any action-centric structure.

The most direct fix, pre-training on robot trajectories, runs into a wall: robot data is orders of magnitude smaller than internet-scale corpora. Egocentric human video, by contrast, is abundant and captures the same first-person viewpoint a head-mounted robot camera observes, with naturally co-occurring hand motion. CAIP turns human hand poses into a proxy for end-effector actions, unlocking action-conditioned pre-training at scale.

CAIP saliency (overlaid) against the raw egocentric input across diverse scenes. Rather than spreading attention across the frame, CAIP produces manipulation-centric features concentrated on the hands and the objects they interact with.

32,041 h: Egocentric human video
88 h: Robot manipulation data
76%: Avg. real-world success
+30 pts: Over best baseline

Method

Aligning vision and action by contrast

CAIP uses three encoders: vision, language, and action. A ViT-L/16 image tower and a 24-layer text tower (initialized from SigLIP 2) produce patch and token features; a lightweight 4-layer action transformer encodes a chunk of future hand motion into a single embedding via a [CLS] token. We attention-pool patch tokens using text tokens as queries, then pool again with a learnable query to form a text-conditioned image embedding. This embedding and the action embedding are aligned with a SigLIP-style sigmoid contrastive loss.
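As a rough illustration of the two-stage pooling described above, the NumPy sketch below uses plain single-head dot-product attention with no learned projections or normalization layers, which is a simplifying assumption. The function names and the random `learned_query` are hypothetical, and the tensor shapes are borrowed from the usage snippet later on this page.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(queries, keys_values):
    """Single-head dot-product attention: each query returns a weighted mean of keys_values."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d))  # [num_queries, num_keys]
    return weights @ keys_values, weights

def text_conditioned_image_embedding(patches, text_tokens, learned_query):
    # Stage 1: text tokens act as queries over image patches,
    # giving one visual summary per text token.
    per_token, stage1_weights = attention_pool(text_tokens, patches)   # [L, D]
    # Stage 2: a single query pools those summaries into one vector.
    pooled, _ = attention_pool(learned_query[None, :], per_token)      # [1, D]
    return pooled[0], stage1_weights

rng = np.random.default_rng(0)
patches = rng.normal(size=(256, 1024))      # patch tokens for one image
text_tokens = rng.normal(size=(64, 1024))   # text token embeddings
learned_query = rng.normal(size=1024)       # stand-in for a trained parameter

embedding, weights = text_conditioned_image_embedding(patches, text_tokens, learned_query)
print(embedding.shape, weights.shape)       # (1024,) (64, 256)
```

Each row of `weights` is a distribution over the 256 patches, so it can be reshaped to a 16×16 grid to inspect which image regions a given text token attends to.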

CAIP architecture. A ViT encodes image patches and a text transformer encodes language tokens, while an action transformer encodes a T-step action chunk into a single embedding. Two stages of attention pooling produce a text-conditioned image embedding, which is aligned to the action embedding via a sigmoid contrastive loss.
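The alignment objective can be sketched as follows, assuming the standard SigLIP formulation: an independent sigmoid loss on every image-action pair in the batch, with matched pairs (the diagonal) as positives and all other pairs as negatives. The temperature and bias defaults follow the initial values from the SigLIP paper; whether CAIP keeps them is an assumption, and the function name is hypothetical.

```python
import numpy as np

def sigmoid_contrastive_loss(image_emb, action_emb, log_temperature=np.log(10.0), bias=-10.0):
    """Pairwise sigmoid loss over a batch of [B, D] image and action embeddings."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    act = action_emb / np.linalg.norm(action_emb, axis=1, keepdims=True)
    logits = np.exp(log_temperature) * img @ act.T + bias   # [B, B] scaled cosine similarities
    labels = 2.0 * np.eye(len(img)) - 1.0                   # +1 on the diagonal, -1 elsewhere
    # -log(sigmoid(x)) == log(1 + exp(-x)) == logaddexp(0, -x)
    return np.logaddexp(0.0, -labels * logits).sum() / len(img)

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 32))

# Matched pairs: each action embedding is a slightly perturbed copy of its image embedding.
matched = sigmoid_contrastive_loss(image_emb, image_emb + 0.1 * rng.normal(size=(8, 32)))
# Mismatched pairs: the same embeddings, shifted by one row so the diagonal no longer aligns.
shuffled = sigmoid_contrastive_loss(image_emb, np.roll(image_emb, 1, axis=0))
print(matched < shuffled)   # True
```

Unlike a softmax-based contrastive loss, each pair contributes independently, so no normalization across the batch is needed.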

Hand poses as actions

Each hand is represented by 21 keypoints as SE(3) transforms (MANO convention). Action chunks emulate end-effector delta control over T = 64 steps, roughly two seconds of future motion at 30 Hz, aligning naturally with robot action spaces.
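To make the action representation concrete, here is a minimal sketch of turning a pose trajectory into delta actions. It tracks a single keypoint as 4×4 homogeneous transforms and takes frame-to-frame relative transforms; the body-frame delta convention, the synthetic trajectory, and the helper names are assumptions for illustration, since the source only states that chunks emulate end-effector delta control.

```python
import numpy as np

FPS = 30
CHUNK_STEPS = 64
print(round(CHUNK_STEPS / FPS, 2))   # 2.13 seconds of future motion per chunk

def relative_transforms(poses):
    """poses: [T+1, 4, 4] homogeneous transforms -> [T, 4, 4] frame-to-frame deltas.

    Each delta satisfies poses[t + 1] == poses[t] @ delta[t].
    """
    return np.linalg.inv(poses[:-1]) @ poses[1:]

def make_pose(angle, translation):
    """Rotation about z by `angle` (radians) plus a translation."""
    c, s = np.cos(angle), np.sin(angle)
    pose = np.eye(4)
    pose[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    pose[:3, 3] = translation
    return pose

# Synthetic trajectory: slow rotation about z while moving 5 mm per frame along x.
poses = np.stack([make_pose(0.01 * t, [0.005 * t, 0.0, 0.0]) for t in range(CHUNK_STEPS + 1)])
deltas = relative_transforms(poses)
print(deltas.shape)                  # (64, 4, 4)

# Composing the deltas from the first pose recovers the last pose.
recon = poses[0]
for delta in deltas:
    recon = recon @ delta
print(np.allclose(recon, poses[-1]))  # True
```

A full hand would add a keypoint axis (21 keypoints per hand), which the batched `inv` and `@` calls handle without changes as long as the time axis is sliced accordingly.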

Scale from human video

Training spans ~1,000 h of lab data (with wrist views), ~31,000 h of in-the-wild egocentric video, and ~88 h of tabletop humanoid data for embodiment diversity, far beyond what robot datasets alone can offer.

A modular, frozen backbone

The pre-trained encoder transfers to a closed-loop policy: a decoder-only transformer (Qwen3.5-0.8B) trained from scratch with a flow-matching action head, on top of frozen CAIP features.
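For readers unfamiliar with flow matching, the sketch below shows the training target and a simple sampler under one common formulation: a straight-line path from Gaussian noise to the action chunk, integrated with Euler steps at inference. The source does not specify the path or sampler, so both are assumptions, as are the action dimensionality and the `oracle` velocity field that stands in for the trained, CAIP-conditioned action head.

```python
import numpy as np

def flow_matching_pair(noise, action, t):
    """Network input x_t and regression target for a linear noise-to-action path."""
    x_t = (1.0 - t) * noise + t * action
    target_velocity = action - noise      # constant along the straight path
    return x_t, target_velocity

def euler_sample(velocity_fn, noise, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 (noise) to t=1 (action)."""
    x, dt = noise.copy(), 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

rng = np.random.default_rng(0)
action = rng.normal(size=(64, 8))         # hypothetical chunk: 64 steps x 8 action dims
noise = rng.normal(size=action.shape)

# Training: the head would regress `target` from (x_t, t, visual features).
x_t, target = flow_matching_pair(noise, action, t=0.3)

# Inference: an oracle field for a single known action stands in for the trained head.
oracle = lambda x, t: (action - x) / (1.0 - t)
sample = euler_sample(oracle, noise)
print(np.allclose(sample, action))        # True
```

In the actual policy, the velocity would come from the action head conditioned on the frozen encoder features, so the encoder itself receives no gradient from this loss.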

Experiments

Results

Across six real-world dexterous tasks (12 trials each) on a Dexmate Vega bimanual robot with two 22-DoF Sharpa Wave hands, CAIP reaches the highest average success rate, 76.0%, which is 32.6 points above the strongest baseline (SigLIP 2, 43.4%), and tops 5 of 6 tasks. Baselines are competitive on some tasks but collapse on others: every baseline drops to 25% or below on at least one task, while CAIP never falls below 56.3%.

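| Method | Fold Shorts | Pour | Pick Fruits | Dispense Soap | Turn On Lamp | Pull Tissue | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| R3M | 14.6 | 12.5 | 2.1 | 29.2 | 8.3 | 37.5 | 17.4 |
| Qwen3.5 ViT | 27.1 | 22.9 | **60.4** | 72.9 | 8.3 | 12.5 | 34.0 |
| VideoMAE | 22.9 | 52.1 | 0.0 | 37.5 | 25.0 | 18.8 | 26.0 |
| VC-1 | 18.8 | 56.3 | 0.0 | 62.5 | 0.0 | 22.9 | 26.7 |
| MVP | 54.2 | 62.5 | 2.1 | 93.8 | 8.3 | 31.3 | 42.0 |
| DINOv2 | 22.9 | 81.3 | 52.1 | 50.0 | 25.0 | 20.8 | 42.0 |
| SigLIP | 12.5 | 70.8 | 37.5 | 83.3 | 25.0 | 25.0 | 42.4 |
| SigLIP 2 | 4.2 | 35.4 | 52.1 | 93.8 | 50.0 | 25.0 | 43.4 |
| CAIP (Ours) | **68.8** | **83.3** | 56.3 | **100.0** | **75.0** | **72.9** | **76.0** |

Success rates (%) over 12 trials per task. Bold = best per column.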
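The headline claims can be checked directly from the per-task numbers. The short script below recomputes the margin over the strongest baseline, the tasks on which CAIP ranks first, and each method's worst task; the values are copied from the table, and the variable names are only illustrative.

```python
TASKS = ["Fold Shorts", "Pour", "Pick Fruits", "Dispense Soap", "Turn On Lamp", "Pull Tissue"]
RESULTS = {
    "R3M":         [14.6, 12.5,  2.1,  29.2,  8.3, 37.5],
    "Qwen3.5 ViT": [27.1, 22.9, 60.4,  72.9,  8.3, 12.5],
    "VideoMAE":    [22.9, 52.1,  0.0,  37.5, 25.0, 18.8],
    "VC-1":        [18.8, 56.3,  0.0,  62.5,  0.0, 22.9],
    "MVP":         [54.2, 62.5,  2.1,  93.8,  8.3, 31.3],
    "DINOv2":      [22.9, 81.3, 52.1,  50.0, 25.0, 20.8],
    "SigLIP":      [12.5, 70.8, 37.5,  83.3, 25.0, 25.0],
    "SigLIP 2":    [ 4.2, 35.4, 52.1,  93.8, 50.0, 25.0],
    "CAIP":        [68.8, 83.3, 56.3, 100.0, 75.0, 72.9],
}

def mean(values):
    return sum(values) / len(values)

# Strongest baseline by average success rate, and CAIP's margin over it.
baselines = {name: mean(rates) for name, rates in RESULTS.items() if name != "CAIP"}
best_baseline = max(baselines, key=baselines.get)
margin = mean(RESULTS["CAIP"]) - baselines[best_baseline]

# Tasks on which CAIP has the highest success rate.
caip_wins = [task for i, task in enumerate(TASKS)
             if max(RESULTS, key=lambda name: RESULTS[name][i]) == "CAIP"]

# Lowest per-task success rate for each method.
worst_case = {name: min(rates) for name, rates in RESULTS.items()}

print(f"Strongest baseline: {best_baseline}, margin {margin:.1f} points")
print(f"CAIP ranks first on {len(caip_wins)} of {len(TASKS)} tasks")
print(f"CAIP worst task: {worst_case['CAIP']}%")
```

The only task CAIP does not win is Pick Fruits, where Qwen3.5 ViT scores 60.4% against CAIP's 56.3%.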

Policy rollouts

Closed-loop rollouts of CAIP-conditioned policies across the six evaluation tasks on the Dexmate Vega with Sharpa Wave dexterous hands. All videos play at 1× (real-time) speed.

[Videos: Fold Shorts, Pour Almonds, Pick Fruits, Dispense Soap, Turn On Lamp, Pull Tissue]

Analysis

Each encoder’s regions of focus

Side-by-side rollouts compare CAIP against DINOv2 and SigLIP 2 on the same task. Each encoder is shown from a side view and the robot’s head camera, with a saliency map computed using its native query mechanism: CAIP’s text-conditioned cross-attention pool, SigLIP 2’s learned-probe pooling, and DINOv2’s per-image PCA of patch features. CAIP concentrates on the hands and manipulated objects, while SigLIP 2 scatters attention across the background and DINOv2 segments by appearance but ignores the instruction. All videos play at 1× (real-time) speed.
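As one possible reading of the "per-image PCA of patch features" visualization, the sketch below projects each patch feature onto the first principal component of that image's patches and rescales the result to [0, 1]. The choice of component, the min-max normalization, and the synthetic features are assumptions; the source does not give these details.

```python
import numpy as np

def pca_saliency(patch_features, grid=(16, 16)):
    """Project [N, D] patch features onto their first principal component, rescaled to [0, 1]."""
    centered = patch_features - patch_features.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[0]
    scores = (scores - scores.min()) / (scores.max() - scores.min())
    return scores.reshape(grid)

# Synthetic image: low-variance background plus a 4x4 "object" region
# whose features are shifted along one feature axis.
rng = np.random.default_rng(0)
features = 0.1 * rng.normal(size=(16, 16, 64))
features[4:8, 4:8] += 3.0 * np.eye(64)[0]

saliency = pca_saliency(features.reshape(256, 64))
object_mean = saliency[4:8, 4:8].mean()
background_mean = (saliency.sum() - saliency[4:8, 4:8].sum()) / (256 - 16)
print(saliency.shape)   # (16, 16)
```

The sign of a principal component is arbitrary, so the object region may come out near 0 or near 1; what matters is that it separates cleanly from the background. Because the projection depends only on the image, it cannot change with the instruction.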

[Videos: Turn On Lamp, Pour, Pull Tissue. Each task shows CAIP (Ours), DINOv2, and SigLIP 2, with side view, head cam, and saliency panels.]

Ablations

Scaling the vision encoder

We ablate CAIP along two axes, encoder capacity and pre-training data scale, varying one at a time while keeping all other data, training, and optimization settings identical.

Capacity scaling

Scaling the backbone across ViT-B/16, ViT-L/16, and ViT-SO400M/16 improves performance consistently. The largest jump is ViT-B → ViT-L (+33 points average), driven by the harder tasks. ViT-SO400M adds only a few more points at a large compute cost, so we adopt ViT-L as the primary encoder.

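| Task | ViT-B/16 | ViT-L/16 | ViT-SO400M/16 |
| --- | --- | --- | --- |
| Turn On Lamp | 16.7 | 75.0 | **83.3** |
| Fold Shorts | 54.2 | 68.8 | **79.2** |
| Dispense Soap | 72.9 | **100.0** | **100.0** |
| Average | 47.9 | 81.3 | **87.5** |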

Downstream success rates (%). Bold = best per row. ViT-L (used in all main experiments) shown for reference.
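A per-task breakdown of the gains shows where the capacity jump comes from. The snippet below simply differences the table columns; the numbers are copied from the table and the variable names are illustrative.

```python
SCALING = {
    "Turn On Lamp":  {"ViT-B/16": 16.7, "ViT-L/16": 75.0,  "ViT-SO400M/16": 83.3},
    "Fold Shorts":   {"ViT-B/16": 54.2, "ViT-L/16": 68.8,  "ViT-SO400M/16": 79.2},
    "Dispense Soap": {"ViT-B/16": 72.9, "ViT-L/16": 100.0, "ViT-SO400M/16": 100.0},
}

# Gain in success rate (percentage points) at each step up in encoder size.
b_to_l = {task: r["ViT-L/16"] - r["ViT-B/16"] for task, r in SCALING.items()}
l_to_so = {task: r["ViT-SO400M/16"] - r["ViT-L/16"] for task, r in SCALING.items()}

for task in SCALING:
    print(f"{task:14s} B->L {b_to_l[task]:+.1f}   L->SO400M {l_to_so[task]:+.1f}")

largest_gain_task = max(b_to_l, key=b_to_l.get)
mean_b_to_l = sum(b_to_l.values()) / len(b_to_l)
mean_l_to_so = sum(l_to_so.values()) / len(l_to_so)
print(f"Largest B->L gain: {largest_gain_task}")
```

Turn On Lamp, the task where ViT-B/16 is weakest, accounts for the largest share of the ViT-B to ViT-L jump (+58.3 points), while the step to ViT-SO400M leaves Dispense Soap unchanged at 100%.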

Data scaling

Holding the encoder fixed at ViT-L/16 and varying the fraction of pre-training data, average success on the ManiSkill2 Franka tasks improves monotonically, from 50.0% at 20% of the data to 77.8% at full scale, with no sign of saturation across these three settings. CAIP would likely keep benefiting from more egocentric data.

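| Pre-training data | Lift Peg | Stack Cubes | Push Cube | Avg. |
| --- | --- | --- | --- | --- |
| 20% | 58.3 | 25.0 | 66.7 | 50.0 |
| 50% | 66.7 | 33.3 | 83.3 | 61.1 |
| 100% | **75.0** | **58.3** | **100.0** | **77.8** |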

ManiSkill2 Franka success rates (%) over 12 trials per task. Bold = best per column.
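One simple way to read the "no saturation" observation is to compare the gain per added percentage point of data across the two intervals. The snippet below does this from the table values; with only three data points it is a rough check rather than a fitted scaling law, and the variable names are illustrative.

```python
DATA_SCALING = {            # % of pre-training data -> [Lift Peg, Stack Cubes, Push Cube]
    20:  [58.3, 25.0, 66.7],
    50:  [66.7, 33.3, 83.3],
    100: [75.0, 58.3, 100.0],
}

averages = {pct: sum(rates) / len(rates) for pct, rates in DATA_SCALING.items()}
fractions = sorted(averages)

# Average gain per added percentage point of data, for each consecutive interval.
slopes = []
for lo, hi in zip(fractions, fractions[1:]):
    gain = averages[hi] - averages[lo]
    slopes.append(gain / (hi - lo))
    print(f"{lo}% -> {hi}%: {gain:+.1f} points ({slopes[-1]:.2f} per percentage point of data)")
```

The second interval adds more absolute success (+16.7 points) than the first (+11.1), and the gain per percentage point stays roughly constant (about 0.37 versus 0.33), which is consistent with the curve not having flattened yet.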

Get started

Using the encoder

CAIP loads as a standard 🤗 Transformers model. Install transformers, torch, and pillow, then encode an image and instruction into text-conditioned visual features.

from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

REPO = "yuvansharma/caip-vitl256"
model = AutoModel.from_pretrained(REPO, trust_remote_code=True).eval()
processor = AutoProcessor.from_pretrained(REPO, trust_remote_code=True)

image = Image.open("example.png").convert("RGB")
inputs = processor(images=image, text="pick up the red cup", return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

# out.image_pooled    [B, 1024]       text-conditioned pooled image embedding
# out.patch_features  [B, 256, 1024]  patch tokens
# out.text_tokens     [B, 64, 1024]   text token embeddings
# out.text_pooled     [B, 1024]       pooled text embedding

Citation

BibTeX

@misc{sharma2026contrastiveactionimagepretrainingvisuomotor,
      title={Contrastive Action-Image Pre-training for Visuomotor Control},
      author={Yuvan Sharma and Dantong Niu and Anirudh Pai and Zekai Wang and Zhuoyang Liu and Baifeng Shi and Stefano Saravalle and Boning Shao and Ruijie Zheng and Jing Wang and Konstantinos Kallidromitis and Yusuke Kato and Fabio Galasso and Yuke Zhu and Danfei Xu and Linxi "Jim" Fan and Jitendra Malik and Trevor Darrell and Roei Herzig},
      year={2026},
      eprint={2606.17256},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2606.17256},
}