Recent advances in 3D Gaussian Splatting (3DGS) have shown promising results. Existing methods typically assume static scenes and/or multiple images with prior poses. Scene dynamics, sparse views, and unknown poses significantly increase the problem complexity because they leave too few geometric constraints. To overcome this challenge, we propose a method that fits Gaussians in dynamic environments from only two images without prior poses. We introduce two technical contributions. First, we propose an object-level two-view bundle adjustment. This strategy decomposes a dynamic scene into piece-wise rigid components and jointly estimates the camera pose and the motions of dynamic objects. Second, we design an SE(3) field-driven Gaussian training method, which enables fine-grained motion modeling through learnable per-Gaussian transformations. Our method achieves high-fidelity novel view synthesis of dynamic scenes while accurately preserving temporal consistency and object motion. Experiments on both synthetic and real-world datasets demonstrate that our method significantly outperforms state-of-the-art approaches designed for static environments, multiple images, and/or known poses.
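To make the piece-wise rigid idea concrete, the following is a minimal sketch of a two-view reprojection residual in which a point is first moved by its object's rigid motion and then mapped into the second camera. It is an illustration under stated assumptions, not the paper's implementation: the function names, the pinhole model with unit focal length, and the convention that object motion is expressed in the first camera's frame are all hypothetical choices made for this example.

```python
import math


def rot_z(theta):
    """3x3 rotation about the z-axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def apply_se3(R, t, p):
    """Apply the rigid transform (R, t) to a 3D point p, i.e. R @ p + t."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]


def project(p, f=1.0):
    """Pinhole projection (assumed model): focal length f, principal point at 0."""
    return [f * p[0] / p[2], f * p[1] / p[2]]


def reprojection_residual(point, observed_uv, cam_pose, obj_motion):
    """Residual of one 3D point against its observation in the second view.

    obj_motion moves the point within the first camera's frame (identity for
    the static background); cam_pose then maps it into the second camera's
    frame. Both are (R, t) pairs. A bundle adjustment would minimize the sum
    of squared residuals over all points, the camera pose, and every object's
    motion; only the residual itself is shown here.
    """
    moved = apply_se3(obj_motion[0], obj_motion[1], point)
    in_cam2 = apply_se3(cam_pose[0], cam_pose[1], moved)
    u, v = project(in_cam2)
    return [u - observed_uv[0], v - observed_uv[1]]
```

In this sketch every point assigned to the same rigid component shares one `obj_motion`, so the number of unknowns grows with the number of objects rather than with the number of points.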
Overview: Given two unposed images, we first perform Object-level Dense Bundle Adjustment, which decomposes the scene into piece-wise rigid components to estimate initial camera poses and object motions. Dense 3D Gaussian primitives are then initialized with the per-object SE(3) transformations. In the SE(3) Field-driven 3DGS stage, we jointly optimize the camera poses, per-Gaussian SE(3) transformations, and Gaussian parameters to reconstruct the dynamic scene. The optimized SE(3) field captures fine-grained motion details while maintaining temporal consistency. Finally, the dynamic scene is rendered with the optimized camera poses and SE(3) field to synthesize high-quality novel views.
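The initialization step described above can be sketched as follows: each Gaussian receives its own copy of the SE(3) transformation of the object it belongs to, and the copies can afterwards be refined independently. This is a hedged illustration only. The names `init_se3_field` and `warp_centers`, the integer object labels, and the use of plain lists in place of learnable parameters are assumptions of this example and are not taken from the paper.

```python
def apply_se3(R, t, p):
    """Apply the rigid transform (R, t) to a 3D point p, i.e. R @ p + t."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]


def init_se3_field(labels, object_motions):
    """Build one (R, t) per Gaussian from per-object motions.

    labels[i] is the (hypothetical) object id of Gaussian i, and
    object_motions maps an object id to its (R, t). Each Gaussian gets a deep
    copy so that later per-Gaussian updates do not affect other Gaussians or
    the per-object estimate.
    """
    field = []
    for k in labels:
        R, t = object_motions[k]
        field.append(([row[:] for row in R], list(t)))
    return field


def warp_centers(centers, se3_field):
    """Move every Gaussian center with its own SE(3) transformation."""
    return [apply_se3(R, t, c) for (R, t), c in zip(se3_field, centers)]
```

Right after initialization, all Gaussians of one object move rigidly together; in a training loop the per-Gaussian entries would be treated as optimizable variables, which is what allows deviations from a purely rigid motion.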
Qualitative results for novel view synthesis on the KITTI and Kubric datasets.
Visualization of novel view synthesis (extrapolation) with our method.
Visual demonstration of object-level editing.