SimVS: Simulating World Inconsistencies for Robust View Synthesis
TL;DR: Turn inconsistent captures into consistent multiview images
Challenge
3D reconstruction requires everything in a scene to remain frozen, but the real world breaks this assumption constantly: elements of the scene often move, and the lighting naturally changes over time. Gathering paired data to address this problem at scale would be extremely difficult. Instead, we propose a generative augmentation strategy that simulates these inconsistencies from ordinary multiview captures, producing the paired data needed to train a model that turns inconsistent captures into a consistent set of multiview images.
How it works
Given a multiview dataset, we perform generative augmentation with a video model, simulating inconsistencies from the individual images and generated text prompts describing those inconsistencies. These augmented images are fed to a multiview generative model along with a held-out "target state" image. The model is trained to predict a consistent set of images corresponding to the target state, i.e., the original set of multiview images.
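As a rough illustration of this pipeline, the sketch below shows how one consistent multiview capture could be turned into a training pair. The video model interface, prompt list, and function names are placeholders we introduce for exposition, not the actual SimVS code.

```python
# Minimal sketch of the augmentation pipeline described above.
# `video_model` (and its `generate` method) and the prompt list are
# hypothetical placeholders, not the released SimVS implementation.
import random

INCONSISTENCY_PROMPTS = [
    "a person walks through the scene",
    "an object is moved to a new position",
    "the lighting shifts and dims",
]

def make_training_example(multiview_images, video_model):
    """Turn one consistent multiview capture into (inconsistent inputs, consistent targets)."""
    # Hold out one view as the "target state" the model must match.
    target_idx = random.randrange(len(multiview_images))
    target_state = multiview_images[target_idx]

    # Simulate inconsistencies in the remaining views with the video model,
    # conditioning each one on a text prompt describing a change.
    inconsistent_inputs = []
    for i, image in enumerate(multiview_images):
        if i == target_idx:
            continue
        prompt = random.choice(INCONSISTENCY_PROMPTS)
        video = video_model.generate(image=image, prompt=prompt)
        inconsistent_inputs.append(video[-1])  # take a later, "changed" frame

    # Supervision: the original, mutually consistent multiview images.
    targets = [im for i, im in enumerate(multiview_images) if i != target_idx]
    return (target_state, inconsistent_inputs), targets
```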
Harmonizing Sparse Images of Dynamic Scenes
We take (unordered) sparse image sets from DyCheck and make them consistent with the image highlighted in orange. Compare the renders, depth maps, and diffusion samples of our method, SimVS (right), with CAT3D (left). Note that the diffusion samples follow the input video trajectory, which is not smooth.
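At test time this setup amounts to a single conditional sampling step. A hedged sketch, with `simvs_model` and its `sample` method standing in for whatever interface the real system exposes:

```python
# Hypothetical harmonization call for the comparison above; the model and
# its `sample` method are assumed stand-ins, not a released API.
def harmonize_to_anchor(simvs_model, images, anchor_idx):
    """Map an unordered sparse image set onto the state of images[anchor_idx]."""
    anchor = images[anchor_idx]  # the view shown in orange
    others = [im for i, im in enumerate(images) if i != anchor_idx]
    # Returns a set of views consistent with the anchor's scene state,
    # which can then supervise a standard static reconstruction.
    return simvs_model.sample(target_state=anchor, inputs=others)
```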
Harmonizing Sparse Images of Varying Illumination
We collect our own dataset with an iPhone camera in which the same scene is observed under three different lighting conditions. The models reconstruct the scene under the target state highlighted in orange, conditioned on the images with varying illumination. Compare the renders, depth maps, and diffusion samples of our method, SimVS (right), with CAT3D (left). In this case, the baseline CAT3D samples supervise a GLO-based NeRF, and the GLO embedding corresponding to the target-state image is used at test time.
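For reference, a GLO-based NeRF of the kind used for the baseline attaches one learned latent code to each training image and conditions the radiance field on it; at test time, the code of the target-state image is selected. The minimal sketch below (omitting positional encoding and view directions) uses illustrative names rather than any released codebase.

```python
# Minimal GLO-conditioned NeRF sketch: one learnable code per training
# image; at test time the code of the target-state image is selected.
# Class and argument names are illustrative, not from CAT3D or SimVS.
import torch
import torch.nn as nn

class GLONeRF(nn.Module):
    def __init__(self, num_images, code_dim=32, hidden=256):
        super().__init__()
        self.codes = nn.Embedding(num_images, code_dim)  # per-image GLO codes
        self.mlp = nn.Sequential(
            nn.Linear(3 + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # RGB + density
        )

    def forward(self, points, image_index):
        # Condition every 3D sample point on the code of the image it was
        # observed in; for evaluation, pass the target-state image's index.
        code = self.codes(image_index).expand(points.shape[0], -1)
        return self.mlp(torch.cat([points, code], dim=-1))

# Usage: render under the target state by selecting its GLO embedding.
model = GLONeRF(num_images=9)
rgb_sigma = model(torch.rand(1024, 3), torch.tensor(2))  # index 2 = target state
```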
Related and Concurrent Work
- CAT4D trains with the synthetic data generated by our proposed method (among other datasets) to enable 4D content creation.
- Dreamitate leverages synthetic video data for policy learning.
Acknowledgements
We would like to thank Paul-Edouard Sarlin, Jiamu Sun, Songyou Peng, Linyi Jin, Richard Tucker, Rick Szeliski and Stan Szymanowicz for insightful conversations and help. We also extend our gratitude to Shlomi Fruchter, Kevin Murphy, Mohammad Babaeizadeh, Han Zhang and Amir Hertz for training the base text-to-image latent diffusion model. This work was supported in part by an NSF Fellowship, ONR grant N00014-23-1-2526, gifts from Google, Adobe, Qualcomm and Rembrand, the Ronald L. Graham Chair, and the UC San Diego Center for Visual Computing.