Object Pose Transformer: Unifying Unseen Object Pose Estimation

Proceedings of IEEE Conference on Computer Vision and Pattern Recognition 2026

Weihang Li1,2, Lorenzo Garattoni3, Fabien Despinoy3, Nassir Navab1,2, Benjamin Busam1,2
1Technical University of Munich    2Munich Center for Machine Learning    3Toyota Motor Europe   

OPT-Pose uses a feed-forward transformer to predict point maps, depth, NOCS, and camera parameters. Existing category-level methods predict canonical absolute 9-DoF SA(3) poses (equivalent to Depth + NOCS), but require predefined category labels and calibrated cameras. Relative pose methods align unseen objects across views in 6-DoF SE(3) (equivalent to Pointmap + Depth), but do not support single-view absolute pose prediction. OPT-Pose enables the simultaneous recovery of both unseen-object relative and category-level absolute poses (rightmost column) for flexible single- or multi-view RGB or RGB-D input, without the need for CAD models or semantic labels.

Abstract

Learning model-free object pose estimation for unseen instances remains a fundamental challenge in 3D vision. Existing methods typically fall into two disjoint paradigms: category-level approaches predict absolute poses in a canonical space but rely on predefined taxonomies, while relative pose methods estimate cross-view transformations but cannot recover single-view absolute pose. In this work, we propose Object Pose Transformer (OPT-Pose), a unified feed-forward framework that bridges these paradigms through task factorization within a single model. OPT-Pose jointly predicts depth, point maps, camera parameters, and normalized object coordinates (NOCS) from RGB inputs, enabling both category-level absolute SA(3) pose and unseen-object relative SE(3) pose. Our approach leverages contrastive object-centric latent embeddings for canonicalization without requiring semantic labels at inference time, and uses point maps as a camera-space representation to enable multi-view relative geometric reasoning. Through cross-frame feature interaction and shared object embeddings, our model leverages relative geometric consistency across views to improve absolute pose estimation, reducing ambiguity in single-view predictions. Furthermore, OPT-Pose is camera-agnostic, learning camera intrinsics on-the-fly and supporting optional depth input for metric-scale recovery, while remaining fully functional in RGB-only settings. Extensive experiments on diverse benchmarks (NOCS, HouseCat6D, Omni6DPose, Toyota-Light) demonstrate state-of-the-art performance in both absolute and relative pose estimation tasks within a single unified architecture.
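The abstract states that Depth + NOCS is equivalent to a canonical absolute SA(3) pose. The paper releases no code, but the standard way this equivalence is realized in category-level pose estimation is a Umeyama similarity alignment between predicted NOCS coordinates and their corresponding camera-space 3D points. A minimal NumPy sketch of that alignment (function name and array shapes are illustrative, not from the paper):

```python
import numpy as np

def umeyama_sa3(nocs, cam_pts):
    """Recover a similarity (SA(3)) pose: scale s, rotation R, translation t
    such that cam_pts ~ s * R @ nocs + t.

    nocs:    (N, 3) normalized object coordinates (predicted).
    cam_pts: (N, 3) corresponding camera-space 3D points (e.g. from depth).
    """
    mu_n, mu_c = nocs.mean(0), cam_pts.mean(0)
    xn, xc = nocs - mu_n, cam_pts - mu_c

    # Cross-covariance between centered camera points and NOCS points.
    cov = xc.T @ xn / len(nocs)
    U, D, Vt = np.linalg.svd(cov)

    # Reflection guard: keep det(R) = +1.
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_n = (xn ** 2).sum() / len(nocs)          # variance of NOCS cloud
    s = np.trace(np.diag(D) @ S) / var_n          # isotropic scale
    t = mu_c - s * R @ mu_n
    return s, R, t
```

This closed-form solve is why depth input gives metric scale: with RGB-only predicted geometry the same alignment still yields pose, but scale is only defined up to the predicted point map's scale.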

Method


Illustration of OPT-Pose: A multi-view transformer aggregates image tokens and emits predictions from lightweight heads: camera parameters, depth, and point maps for camera-space geometry; a keypoint-centric module fuses RGB and 3D features to discover object keypoints, predict NOCS coordinates, and build an object latent embedding. Absolute pose (SA(3)) and relative pose (SE(3)) are recovered in a single forward pass. Optional sensor depth provides metric scale, while the system remains fully functional in RGB-only mode.
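The caption notes that relative SE(3) pose is recovered from point maps, which live in camera space. Given corresponding object points in two views, the rigid transform between them has a closed-form Kabsch solution; the sketch below (names and shapes are illustrative assumptions, not the paper's API) shows that step:

```python
import numpy as np

def relative_se3(pts_a, pts_b):
    """Rigid SE(3) transform (R, t) mapping view-A points onto view-B points:
    pts_b ~ R @ pts_a + t.

    pts_a, pts_b: (N, 3) corresponding camera-space points from the two
    views' predicted point maps (correspondences assumed given).
    """
    mu_a, mu_b = pts_a.mean(0), pts_b.mean(0)

    # Kabsch: SVD of the cross-covariance of the centered point sets.
    H = (pts_a - mu_a).T @ (pts_b - mu_b)
    U, _, Vt = np.linalg.svd(H)

    # Reflection guard keeps det(R) = +1.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_b - R @ mu_a
    return R, t
```

Unlike the similarity alignment used for absolute pose, no scale is solved here: both point maps are predicted in metric camera space, so a pure rotation-plus-translation suffices.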

BibTeX


      Coming soon.