BulletGen: Improving 4D Reconstruction with Bullet-Time Generation
arXiv 2025

Abstract

Transforming casually captured monocular videos into fully immersive dynamic experiences is a highly ill-posed task that comes with significant challenges, e.g., reconstructing unseen regions and dealing with the ambiguity of monocular depth estimation. In this work we introduce BulletGen, an approach that leverages generative models to correct errors and complete missing information in a Gaussian-based dynamic scene representation. This is done by aligning the output of a diffusion-based video generation model with the 4D reconstruction at a single frozen "bullet-time" step. The generated frames are then used to supervise the optimization of the 4D Gaussian model. Our method seamlessly blends generative content with both static and dynamic scene components, achieving state-of-the-art results on both novel-view synthesis and 2D/3D tracking tasks.
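To make the supervision idea concrete, below is a minimal sketch (not the paper's implementation) of how generated bullet-time frames could supervise a 4D Gaussian scene at a single frozen timestep. The renderer `render_gaussians`, the structure of `scene_params`, and the loss choice are assumptions for illustration only.

```python
import random
import torch
import torch.nn.functional as F

# Placeholder for the dynamic Gaussian renderer; the actual renderer and scene
# parameterization belong to the paper's pipeline and are not reproduced here.
def render_gaussians(scene_params, camera, t):
    """Render an RGB image of the dynamic Gaussian scene at time t (hypothetical)."""
    raise NotImplementedError

def supervise_with_bullet_time(scene_params, generated_views, t_frozen,
                               optimizer, num_steps=200):
    """Use generated bullet-time frames as extra supervision (sketch).

    `generated_views` is a list of (camera, rgb_frame) pairs produced by the
    video generation model at the single frozen timestep `t_frozen`.
    """
    for _ in range(num_steps):
        camera, target_rgb = random.choice(generated_views)
        pred_rgb = render_gaussians(scene_params, camera, t_frozen)
        # Photometric term only, for brevity; the full method uses more signals.
        loss = F.l1_loss(pred_rgb, target_rgb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return scene_params
```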

Pipeline

BulletGen architecture. Starting from a monocular RGB video, we reconstruct the dynamic scene with Shape-of-Motion, given data-driven priors (motion masks, depths, long-term 2D tracks). We then generate novel views at selected frozen timesteps (bullet times) using a conditioned generative model. These generated views are localized and mapped into the current scene via an optimization based on photometric, perceptual, semantic, and depth errors. The final 4D reconstruction augments the scene and enables higher-quality extreme novel-view synthesis.
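The localization step above can be pictured as a small pose-refinement loop. The sketch below is an assumption-laden illustration, not the paper's code: `render_fn`, `feat_fn`, `sem_fn`, the 6-DoF pose parameterization, and the loss weights are all hypothetical stand-ins for the scene renderer, a perceptual feature extractor, and a semantic network.

```python
import torch
import torch.nn.functional as F

def align_generated_view(render_fn, generated_rgb, generated_depth,
                         init_pose, feat_fn, sem_fn,
                         weights=(1.0, 0.1, 0.1, 0.05), iters=100, lr=1e-2):
    """Localize one generated bullet-time view against the current scene (sketch).

    `render_fn(pose)` is assumed to return (rgb, depth) rendered from the
    Gaussian scene at the frozen timestep; `feat_fn` / `sem_fn` are assumed
    perceptual and semantic extractors. The pose is refined by gradient
    descent on a weighted sum of photometric, perceptual, semantic, and
    depth errors, mirroring the objective described above.
    """
    pose = init_pose.clone().requires_grad_(True)  # e.g. a 6-DoF pose vector
    opt = torch.optim.Adam([pose], lr=lr)
    w_photo, w_perc, w_sem, w_depth = weights

    # Targets from the generated view are fixed during alignment.
    with torch.no_grad():
        target_feat = feat_fn(generated_rgb)
        target_sem = sem_fn(generated_rgb)

    for _ in range(iters):
        rgb, depth = render_fn(pose)
        loss = (w_photo * F.l1_loss(rgb, generated_rgb)
                + w_perc * F.mse_loss(feat_fn(rgb), target_feat)
                + w_sem * F.mse_loss(sem_fn(rgb), target_sem)
                + w_depth * F.l1_loss(depth, generated_depth))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return pose.detach()
```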

Results on the DyCheck iPhone dataset


Results on the NVIDIA dynamic dataset


Citation

Acknowledgements

The website template was borrowed from Michaël Gharbi and Ref-NeRF.