A key challenge in model-free category-level pose estimation is extracting contextual object features that generalize across instances within a category. Recent approaches leverage foundation model features to capture semantic and geometric cues from data, but they fail under partial visibility. We overcome this with a first-complete-then-aggregate strategy for feature extraction that exploits class priors. In this paper, we present GCE-Pose, a method that enhances pose estimation for novel instances by integrating a category-level global context prior. GCE-Pose reconstructs global shape and semantics with a proposed Semantic Shape Reconstruction (SSR) module: given an unseen partial RGB-D object instance, SSR recovers the instance's global geometry and semantics by deforming category-specific 3D semantic prototypes through a learned deep Linear Shape Model. We further introduce a Global Context Enhanced (GCE) feature fusion module that effectively fuses features from the partial RGB-D observation with the reconstructed global context. Extensive experiments validate the impact of our global context prior and the effectiveness of the GCE fusion module, demonstrating that GCE-Pose significantly outperforms existing methods on the challenging real-world datasets HouseCat6D and NOCS-REAL275.
Illustration of GCE-Pose: (A) Semantic and geometric features are extracted from an RGB-D input; a keypoint detector identifies robust keypoints and extracts their corresponding features. (B) Our SSR module reconstructs an instance-specific global feature carrying category-level semantics. (C) The global features are fused with the keypoint features to predict per-keypoint NOCS coordinates. (D) The predicted keypoints, NOCS coordinates, and fused keypoint features are used for pose and size estimation.
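The following is a minimal sketch of the (A)-(D) inference flow above. The module names (`detector`, `ssr`, `fusion`, `nocs_head`, `solver`) and tensor shapes are illustrative assumptions, not the authors' actual API:

```python
# Hypothetical sketch of the GCE-Pose inference pipeline (A)-(D).
import torch

def gce_pose_inference(rgb, depth, detector, ssr, fusion, nocs_head, solver):
    # (A) extract semantic/geometric features and robust keypoints
    kpts, kpt_feats = detector(rgb, depth)        # (K, 3), (K, C) [assumed shapes]

    # (B) SSR reconstructs an instance-specific global feature
    #     carrying category-level semantics from the partial observation
    global_feat = ssr(depth)                      # (C_g,)

    # (C) fuse the global context into each keypoint feature,
    #     then regress per-keypoint NOCS coordinates
    fused = fusion(kpt_feats, global_feat)        # (K, C)
    nocs = nocs_head(fused)                       # (K, 3)

    # (D) solve pose and size from keypoint <-> NOCS correspondences
    R, t, s = solver(kpts, nocs)
    return R, t, s
```

Step (D) can, for example, be realized as a similarity-transform fit between the predicted NOCS coordinates and the observed 3D keypoints; the exact solver used here is an assumption.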
Illustration of the Deep Linear Semantic Shape model. The model consists of a prototype shape $c$, a scale network $\mathcal{S}$, a deformation network $\mathcal{D}$, a deformation field $\mathcal{V}$, and category-level semantic features $c^k_\text{sem}$. In stage 1, we build a Deep Linear Shape (DLS) model from point clouds sampled from all ground-truth instances within each category, training a linear parameterization network to represent each instance. In stage 2, we retrain the DLS model to regress the corresponding DLS parameters from partial point cloud inputs using the deformation and scale networks. At test time, the network predicts DLS parameters for unseen objects and reconstructs their point clouds via the learned deformation field, yielding the semantic reconstruction.
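A minimal sketch of the test-time reconstruction step, assuming the prototype $c$ has $N$ points, the deformation field $\mathcal{V}$ is a set of $B$ linear basis deformations, and the networks $\mathcal{D}$ and $\mathcal{S}$ regress deformation parameters and a scale from the partial point cloud. All shapes and the class below are assumptions for illustration:

```python
# Hypothetical sketch of Deep Linear Semantic Shape reconstruction.
import torch
import torch.nn as nn

class DLSReconstructor(nn.Module):
    def __init__(self, prototype, deform_field, deform_net, scale_net):
        super().__init__()
        self.c = prototype            # (N, 3) category prototype shape
        self.V = deform_field         # (B, N, 3) learned linear deformation basis
        self.deform_net = deform_net  # partial cloud -> (B,) DLS parameters
        self.scale_net = scale_net    # partial cloud -> scalar scale

    def forward(self, partial_pcd):
        beta = self.deform_net(partial_pcd)   # (B,) predicted DLS parameters
        s = self.scale_net(partial_pcd)       # scalar scale
        # linear shape model: deform the prototype along the learned basis
        shape = self.c + torch.einsum('b,bnc->nc', beta, self.V)
        return s * shape                      # (N, 3) full reconstruction
```

Since the deformation only displaces the prototype's points, the category-level semantic features $c^k_\text{sem}$ attached to each prototype point carry over unchanged to the reconstructed points, which is what makes the output a semantic reconstruction.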
@article{li2025gce,
  title={GCE-Pose: Global Context Enhancement for Category-level Object Pose Estimation},
  author={Li, Weihang and Xu, Hongli and Huang, Junwen and Jung, Hyunjun and Yu, Peter KT and Navab, Nassir and Busam, Benjamin},
  journal={arXiv preprint arXiv:2502.04293},
  year={2025}
}