DynSUP: Dynamic Gaussian Splatting
from An Unposed Image Pair

arXiv 2024

Weihang Li1,3, Weirong Chen1,2, Shenhan Qian1,2, Jiajie Chen1,3, Daniel Cremers1,2, Haoang Li3
1Technical University of Munich    2Munich Center for Machine Learning    3The Hong Kong University of Science and Technology (Guangzhou)   

Teaser figure: the input image pair (Input Image 1 and Input Image 2).

Given two images with unknown poses, captured at distinct moments in a dynamic environment, DynSUP fits a dynamic Gaussian splatting representation and then synthesizes a new image from a novel viewpoint at a different time.

Abstract

Recent advances in 3D Gaussian Splatting have shown promising results. Existing methods typically assume static scenes and/or multiple images with prior poses. Dynamics, sparse views, and unknown poses significantly increase the problem complexity due to insufficient geometric constraints. To overcome this challenge, we propose a method that uses only two images without prior poses to fit Gaussians in dynamic environments. To achieve this, we introduce two technical contributions. First, we propose an object-level two-view bundle adjustment. This strategy decomposes dynamic scenes into piece-wise rigid components, and jointly estimates the camera pose and the motions of dynamic objects. Second, we design an SE(3) field-driven Gaussian training method. It enables fine-grained motion modeling through learnable per-Gaussian transformations. Our method leads to high-fidelity novel view synthesis of dynamic scenes while accurately preserving temporal consistency and object motion. Experiments on both synthetic and real-world datasets demonstrate that our method significantly outperforms state-of-the-art approaches designed for static environments, multiple images, and/or known poses.
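To make the first contribution concrete, the minimal PyTorch sketch below (our illustration, not the released implementation) shows one way an object-level two-view bundle adjustment can be set up: 3D points from the first view are grouped by per-pixel object IDs into piece-wise rigid components, and the second camera's pose is optimized jointly with one SE(3) motion per object against a dense reprojection loss. The function names (se3_exp, project, object_level_ba), the Adam optimizer, and the L1 loss are illustrative assumptions.

# Hypothetical sketch of an object-level two-view bundle adjustment
# (not the authors' code): camera motion and per-object rigid motions
# are optimized jointly from dense correspondences and object masks.
import torch


def se3_exp(omega, v):
    """Axis-angle + translation -> (R, t); translation is kept as-is in this sketch."""
    theta = omega.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    k = omega / theta
    K = torch.zeros(*omega.shape[:-1], 3, 3, device=omega.device)
    K[..., 0, 1], K[..., 0, 2] = -k[..., 2], k[..., 1]
    K[..., 1, 0], K[..., 1, 2] = k[..., 2], -k[..., 0]
    K[..., 2, 0], K[..., 2, 1] = -k[..., 1], k[..., 0]
    theta = theta.unsqueeze(-1)
    eye = torch.eye(3, device=omega.device).expand_as(K)
    R = eye + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)
    return R, v


def project(points_cam, K_intr):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixels."""
    uvw = points_cam @ K_intr.T
    return uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)


def object_level_ba(pts3d_t0, uv_t1, obj_ids, K_intr, num_objects, iters=500):
    """Jointly estimate camera motion and per-object rigid motions (sketch).

    pts3d_t0: (N, 3) points from view 1; uv_t1: (N, 2) matched pixels in view 2;
    obj_ids: (N,) long tensor of object labels (0 = static background).
    """
    cam_omega = torch.zeros(3, requires_grad=True)                 # camera rotation (axis-angle)
    cam_trans = torch.zeros(3, requires_grad=True)                 # camera translation
    obj_omega = torch.zeros(num_objects, 3, requires_grad=True)    # per-object rotations
    obj_trans = torch.zeros(num_objects, 3, requires_grad=True)    # per-object translations
    opt = torch.optim.Adam([cam_omega, cam_trans, obj_omega, obj_trans], lr=1e-2)

    for _ in range(iters):
        opt.zero_grad()
        # Move each point with its object's rigid motion (piece-wise rigid scene).
        R_obj, t_obj = se3_exp(obj_omega[obj_ids], obj_trans[obj_ids])
        pts_moved = (R_obj @ pts3d_t0.unsqueeze(-1)).squeeze(-1) + t_obj
        # Transform into the second camera's frame and project.
        R_cam, t_cam = se3_exp(cam_omega, cam_trans)
        pts_cam = pts_moved @ R_cam.T + t_cam
        loss = (project(pts_cam, K_intr) - uv_t1).abs().mean()     # dense reprojection loss
        loss.backward()
        opt.step()
    return cam_omega, cam_trans, obj_omega, obj_trans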

Method


Overview: Given two unposed images, we first perform Object-level Dense Bundle Adjustment to estimate initial camera poses and object motions by decomposing the scene into piece-wise rigid components. The dense 3D Gaussian primitives are initialized with per-object SE(3) transformations. In the SE(3) Field-driven 3DGS stage, we jointly optimize the camera poses, per-Gaussian SE(3) transformations, and Gaussian parameters to reconstruct the dynamic scene. The optimized SE(3) field captures fine-grained motion details while maintaining temporal consistency. Finally, the dynamic scene is rendered using the optimized camera poses and SE(3) field to generate high-quality novel-view synthesis results.
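As an illustration of the SE(3) field-driven stage, the sketch below (hypothetical, not the authors' code) attaches a learnable axis-angle-plus-translation transform to every Gaussian and warps the Gaussian centers from the first timestamp toward the second. Scaling the motion parameters by a time value t is one simple way to obtain intermediate configurations for rendering at novel times; the parameterization and tensor names are assumptions of this sketch.

# Minimal sketch (not the released implementation) of a learnable
# per-Gaussian SE(3) field that warps Gaussian centers over time.
import torch


def axis_angle_to_matrix(axis_angle):
    """Rodrigues' formula: (N, 3) axis-angle -> (N, 3, 3) rotation matrices."""
    theta = axis_angle.norm(dim=-1, keepdim=True).clamp(min=1e-8)   # (N, 1)
    k = axis_angle / theta                                          # unit axes
    K = torch.zeros(axis_angle.shape[0], 3, 3, device=axis_angle.device)
    K[:, 0, 1], K[:, 0, 2] = -k[:, 2], k[:, 1]
    K[:, 1, 0], K[:, 1, 2] = k[:, 2], -k[:, 0]
    K[:, 2, 0], K[:, 2, 1] = -k[:, 1], k[:, 0]
    theta = theta.unsqueeze(-1)                                     # (N, 1, 1)
    eye = torch.eye(3, device=axis_angle.device).expand_as(K)
    return eye + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)


class PerGaussianSE3Field(torch.nn.Module):
    """One learnable SE(3) transform per Gaussian, warping centers from t=0 to t=1."""

    def __init__(self, num_gaussians):
        super().__init__()
        # Near-identity initialization; optimization refines each Gaussian's motion.
        self.axis_angle = torch.nn.Parameter(1e-4 * torch.randn(num_gaussians, 3))
        self.translation = torch.nn.Parameter(torch.zeros(num_gaussians, 3))

    def forward(self, means, t=1.0):
        # Scaling the motion parameters by t (an assumption of this sketch) yields
        # intermediate poses for rendering between, or beyond, the two input frames.
        R = axis_angle_to_matrix(t * self.axis_angle)               # (N, 3, 3)
        return (R @ means.unsqueeze(-1)).squeeze(-1) + t * self.translation


# Usage: warp 10k Gaussian centers to the second timestamp.
field = PerGaussianSE3Field(num_gaussians=10_000)
means_t0 = torch.randn(10_000, 3)
means_t1 = field(means_t0, t=1.0)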

Novel View Synthesis (Interpolation)

Qualitative results for novel view synthesis (interpolation) on the KITTI and Kubric datasets.

Novel View Synthesis (Extrapolation)

Visualization of our method on novel view synthesis (extrapolation).

Object-level Editing

Visual demonstration of object-level editing.

BibTeX