TL;DR: Turn inconsistent captures into consistent multiview images


Challenge

3D reconstruction requires everything in a scene to be frozen, but the real world breaks this assumption constantly: elements of the scene often move, and the lighting naturally changes over time. Gathering paired data to address this problem at scale would be prohibitively difficult. Instead, we propose a generative augmentation strategy that simulates these inconsistencies, and we train a generative model to sample consistent multiview images conditioned on the inconsistent observations. At test time, our model generalizes to real-world captures, driven by the steadily improving quality of video models.



How it works

Given a multiview dataset, we perform generative augmentation with a video model, simulating inconsistencies for the individual images using automatically generated inconsistency text prompts. These augmented images are fed to a multiview generative model along with a held-out "target state" image. The model is trained to predict a consistent set of images corresponding to the target state, i.e., the original set of multiview images.
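As a concrete illustration, below is a minimal PyTorch-style sketch of this training step. Every name in it (simulate_inconsistency, MultiviewDiffusion, the toy corruption and reconstruction loss) is an illustrative placeholder rather than the actual SimVS implementation.

# A minimal sketch of the training step described above, under simplifying
# assumptions. The augmentation, model, and objective are toy stand-ins.

import random
import torch
import torch.nn as nn

def simulate_inconsistency(image: torch.Tensor) -> torch.Tensor:
    """Stand-in for the video-model augmentation that injects motion or
    lighting changes into a single view (stubbed with noise so this runs)."""
    return (image + 0.1 * torch.randn_like(image)).clamp(0, 1)

class MultiviewDiffusion(nn.Module):
    """Toy multiview generator: predicts the consistent views conditioned on
    the inconsistent views and a held-out target-state image."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.net = nn.Conv2d(3 * channels, channels, kernel_size=3, padding=1)

    def forward(self, noisy_targets, inconsistent_views, target_state):
        n = noisy_targets.shape[0]
        cond = torch.cat(
            [noisy_targets, inconsistent_views, target_state.expand(n, -1, -1, -1)],
            dim=1,
        )
        return self.net(cond)  # prediction of the clean, consistent views

def training_step(model, optimizer, views: torch.Tensor) -> float:
    """`views` is one consistent multiview capture of shape (N, C, H, W)."""
    # Hold out one view as the "target state" that defines scene appearance.
    target_idx = random.randrange(views.shape[0])
    target_state = views[target_idx : target_idx + 1]

    # Generative augmentation: every conditioning view becomes inconsistent.
    inconsistent = torch.stack([simulate_inconsistency(v) for v in views])

    # Diffusion-style objective (simplified): corrupt the consistent views and
    # train the model to recover them from the inconsistent conditioning.
    noisy = 0.7 * views + 0.3 * torch.randn_like(views)
    loss = nn.functional.mse_loss(model(noisy, inconsistent, target_state), views)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = MultiviewDiffusion()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    capture = torch.rand(4, 3, 64, 64)  # 4 consistent views of one scene
    print(training_step(model, opt, capture))

The key design point mirrored here is the conditioning: the model always sees the inconsistent views together with one clean target-state image, and its supervision is the original, consistent capture.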







Harmonizing Sparse Images of Dynamic Scenes

We take (unordered) sparse image sets from DyCheck and make them consistent with the image highlighted in orange. Compare the renders, depth maps, and diffusion samples of our method SimVS (right) with CAT3D (left). Note that the diffusion samples follow the input video trajectory, which is not smooth.





Harmonizing Sparse Images of Varying Illumination

We collect our own varying-illumination dataset with an iPhone camera, in which each scene is observed under 3 different lighting conditions. The models reconstruct the scene under the target state (highlighted in orange), conditioned on the images captured under varying illumination. Compare the renders, depth maps, and diffusion samples of our method SimVS (right) with CAT3D (left). In this case, the baseline CAT3D samples supervise a GLO-based NeRF, and the GLO embedding corresponding to the target-state image is used for testing.
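For readers unfamiliar with the GLO conditioning used for this baseline, the following sketch shows the general idea under simplifying assumptions: each training image gets a learnable appearance embedding that is optimized jointly with a toy radiance field, and test-time rendering uses the embedding of the target-state image. The tiny MLP and module names are hypothetical, not the actual CAT3D/NeRF code.

# Illustrative sketch of per-image GLO appearance embeddings for a NeRF-style
# model; a simplified MLP stands in for the real radiance field.

import torch
import torch.nn as nn

class GloNerf(nn.Module):
    def __init__(self, num_images: int, embed_dim: int = 16):
        super().__init__()
        # One learnable appearance code per training image.
        self.appearance = nn.Embedding(num_images, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(3 + embed_dim, 64), nn.ReLU(),
            nn.Linear(64, 4),  # RGB + density
        )

    def forward(self, points: torch.Tensor, image_ids: torch.Tensor):
        # Condition each queried 3D point on the embedding of the image that
        # supervises it, so per-image appearance changes are absorbed by the codes.
        codes = self.appearance(image_ids)
        return self.mlp(torch.cat([points, codes], dim=-1))

model = GloNerf(num_images=3)
points = torch.rand(1024, 3)                 # sampled 3D points
image_ids = torch.randint(0, 3, (1024,))     # which input image each ray came from
out = model(points, image_ids)               # train with a photometric loss

# At test time, render everything with the embedding of the target-state image,
# e.g. image index 0 if that is the view highlighted in orange.
target_ids = torch.zeros(1024, dtype=torch.long)
rendered = model(points, target_ids)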






Interactive Grid of Video Samples

Click on the interactive grid to compare our generative video augmentation against the original image and two heuristic augmentation strategies. These videos were sampled from Lumiere using the inconsistency-prompt generation methods described in the paper; we use this data to train SimVS. Note that some videos include flashing lights.
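The sketch below outlines how such augmentation videos could be turned into inconsistent training captures: each non-target view is animated with a sampled inconsistency prompt, and a later frame of the generated video stands in for the inconsistent observation. The prompt list and the generate_video wrapper are hypothetical placeholders; the real pipeline uses Lumiere and the prompt-generation method from the paper.

# Sketch of the per-scene augmentation loop, assuming a hypothetical
# generate_video(image, prompt, num_frames) wrapper around an image-to-video
# model (the actual Lumiere interface is not shown here).

import random

INCONSISTENCY_PROMPTS = [
    "a person walks through the scene",
    "the sunlight fades and a lamp turns on",
    "the objects on the table are rearranged",
]

def generate_video(image, prompt, num_frames=16):
    """Placeholder: a real image-to-video model would animate `image`
    according to `prompt`; here we simply repeat the frame so the sketch runs."""
    return [image] * num_frames

def augment_capture(views, target_idx):
    """Simulate an inconsistent capture of one scene. The held-out view at
    `target_idx` stays untouched and defines the target state."""
    augmented = []
    for i, view in enumerate(views):
        if i == target_idx:
            continue
        prompt = random.choice(INCONSISTENCY_PROMPTS)
        frames = generate_video(view, prompt)
        # A randomly chosen later frame serves as the inconsistent view.
        augmented.append(random.choice(frames[1:]))
    return augmented

if __name__ == "__main__":
    views = ["view_0", "view_1", "view_2"]  # stand-ins for images
    print(augment_capture(views, target_idx=0))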





Related and Concurrent Work

  • CAT4D trains with the synthetic data generated by our proposed method (among other datasets) to enable 4D content creation.
  • Dreamitate leverages synthetic video data for policy learning.

Acknowledgements

We would like to thank Paul-Edouard Sarlin, Jiamu Sun, Songyou Peng, Linyi Jin, Richard Tucker, Rick Szeliski and Stan Szymanowicz for insightful conversations and help. We also extend our gratitude to Shlomi Fruchter, Kevin Murphy, Mohammad Babaeizadeh, Han Zhang and Amir Hertz for training the base text-to-image latent diffusion model. This work was supported in part by an NSF Fellowship, ONR grant N00014-23-1-2526, gifts from Google, Adobe, Qualcomm and Rembrand, the Ronald L. Graham Chair, and the UC San Diego Center for Visual Computing.