Recent advances in 3D Gaussian Splatting have shown promising results. Existing methods typically assume static scenes and/or multiple images with prior poses. Dynamics, sparse views, and unknown poses significantly increase the problem complexity due to insufficient geometric constraints. To overcome this challenge, we propose a method that fits Gaussians to dynamic environments from only two images without prior poses. To achieve this, we introduce two technical contributions. First, we propose an object-level two-view bundle adjustment. This strategy decomposes the dynamic scene into piece-wise rigid components and jointly estimates the camera pose and the motions of dynamic objects. Second, we design an SE(3) field-driven Gaussian training method that enables fine-grained motion modeling through learnable per-Gaussian transformations. Our method yields high-fidelity novel view synthesis of dynamic scenes while accurately preserving temporal consistency and object motion. Experiments on both synthetic and real-world datasets demonstrate that our method significantly outperforms state-of-the-art approaches that assume static environments, multiple images, and/or known poses.
Overview: Given two unposed images, we first perform Object-level Dense Bundle Adjustment to estimate initial camera poses and object motions by decomposing the scene into piece-wise rigid components. Dense 3D Gaussian primitives are then initialized with per-object SE(3) transformations. In the SE(3) Field-driven 3DGS stage, we jointly optimize the camera poses, per-Gaussian SE(3) transformations, and Gaussian parameters to reconstruct the dynamic scene. The optimized SE(3) field captures fine-grained motion details while maintaining temporal consistency. Finally, the dynamic scene is rendered with the optimized camera poses and SE(3) field to produce high-quality novel view synthesis results.
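To make the SE(3) field concrete, below is a minimal sketch, assuming a PyTorch-style implementation, of how learnable per-Gaussian rigid transforms could be applied to Gaussian centers before splatting. The names (so3_exp, apply_se3_field, omega, trans) are illustrative rather than taken from our code, and covariance/orientation updates are omitted for brevity.

import torch

def so3_exp(omega: torch.Tensor) -> torch.Tensor:
    """Rodrigues' formula: axis-angle vectors (N, 3) -> rotation matrices (N, 3, 3)."""
    theta = omega.norm(dim=-1, keepdim=True).clamp(min=1e-8)    # (N, 1) rotation angles
    k = omega / theta                                           # unit rotation axes
    K = torch.zeros(omega.shape[0], 3, 3, device=omega.device)  # skew-symmetric matrices
    K[:, 0, 1], K[:, 0, 2] = -k[:, 2], k[:, 1]
    K[:, 1, 0], K[:, 1, 2] = k[:, 2], -k[:, 0]
    K[:, 2, 0], K[:, 2, 1] = -k[:, 1], k[:, 0]
    theta = theta.unsqueeze(-1)                                  # (N, 1, 1) for broadcasting
    I = torch.eye(3, device=omega.device).expand_as(K)
    return I + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)

def apply_se3_field(means: torch.Tensor, omega: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Move each Gaussian center by its own rigid transform (R, t)."""
    R = so3_exp(omega)                                           # (N, 3, 3)
    return (R @ means.unsqueeze(-1)).squeeze(-1) + t             # (N, 3)

# Learnable per-Gaussian motion parameters; in practice these would be initialized
# from the per-object transforms estimated in the bundle-adjustment stage
# (zeros correspond to the identity transform in this sketch).
N = 10_000
means = torch.randn(N, 3)
omega = torch.zeros(N, 3, requires_grad=True)   # per-Gaussian rotation (axis-angle)
trans = torch.zeros(N, 3, requires_grad=True)   # per-Gaussian translation

moved = apply_se3_field(means, omega, trans)    # centers at the second timestamp
# `moved` would then be splatted and compared against the second image,
# so gradients flow back into the per-Gaussian SE(3) parameters.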
Qualitative comparison results for novel view synthesis on the KITTI and Kubric datasets. We generate novel view images at new timestamps using camera pose interpolation and object motion interpolation. Our method significantly outperforms baseline approaches in rendering quality.
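As a hedged illustration of the interpolation step, the sketch below (using SciPy's Rotation class; the function name interpolate_se3 is hypothetical) interpolates a rigid transform along the SO(3) geodesic via log/exp while interpolating translations linearly. With alpha outside [0, 1], the same formula extrapolates beyond the input time window, which corresponds to the extrapolation setting shown next.

import numpy as np
from scipy.spatial.transform import Rotation

def interpolate_se3(R1: Rotation, t1, R2: Rotation, t2, alpha: float):
    """Rigid transform between (R1, t1) at alpha = 0 and (R2, t2) at alpha = 1."""
    rel = (R1.inv() * R2).as_rotvec()             # log of the relative rotation
    R = R1 * Rotation.from_rotvec(alpha * rel)    # exp back onto SO(3)
    t = (1.0 - alpha) * np.asarray(t1) + alpha * np.asarray(t2)
    return R, t

# Example: the camera pose (or an object's motion) halfway between two frames.
R_a, R_b = Rotation.identity(), Rotation.from_euler("y", 30, degrees=True)
R_mid, t_mid = interpolate_se3(R_a, [0.0, 0.0, 0.0], R_b, [1.0, 0.0, 0.0], alpha=0.5)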
Visualization of our method applied to novel view synthesis via extrapolation. Given input images at \( t_1 \) and \( t_2 \), our approach renders novel view images outside the given time window (e.g., \( t_0 \to t_1 \), \( t_2 \to t_3 \)). This is achieved through explicit modeling of both camera motion and object motion.
Visualization of our method applied to object-level editing. By recovering the motion, geometry, and appearance of individual objects, we enable independent control of object motion, allowing novel views to be rendered with specific objects selectively moved.
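As an illustrative sketch (not our released implementation), the snippet below shows how such an edit could be expressed: since every Gaussian carries an object label from the piece-wise rigid decomposition, a rigid transform can be applied to one object's Gaussians while the rest of the scene stays fixed. The names means, object_ids, edit_R, and edit_t are hypothetical.

import torch

def edit_object(means: torch.Tensor, object_ids: torch.Tensor,
                target_id: int, edit_R: torch.Tensor, edit_t: torch.Tensor) -> torch.Tensor:
    """Apply a rigid transform (edit_R, edit_t) to the Gaussians of one object only."""
    mask = object_ids == target_id                    # (N,) boolean selection of the object
    edited = means.clone()
    edited[mask] = means[mask] @ edit_R.T + edit_t    # rotate, then translate the selection
    return edited

# Example: nudge object 2 by 0.5 units along x before re-rendering the scene.
means = torch.randn(1000, 3)
object_ids = torch.randint(0, 5, (1000,))
edited = edit_object(means, object_ids, target_id=2,
                     edit_R=torch.eye(3), edit_t=torch.tensor([0.5, 0.0, 0.0]))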
@article{li2024dynsup,
  title={DynSUP: Dynamic Gaussian Splatting from An Unposed Image Pair},
  author={Li, Weihang and Chen, Weirong and Qian, Shenhan and Chen, Jiajie and Cremers, Daniel and Li, Haoang},
  journal={arXiv preprint arXiv:2412.00851},
  year={2024}
}