MVD2: Efficient Multiview 3D Reconstruction for Multiview Diffusion

Xin-Yang Zheng 1    Hao Pan 2    Yu-Xiao Guo 2    Xin Tong 2    Yang Liu 2
1 Tsinghua University    2 Microsoft Research Asia   
SIGGRAPH Conference Proceedings (SIGGRAPH 2024)

TL;DR

MVD2 reconstructs high-quality 3D shapes from multiview diffusion (MVD) images with a single efficient feedforward network pass. It explicitly handles the inconsistency of MVD images to achieve high-quality reconstruction, and generalizes to different MVD image generators (Zero123++, SyncDreamer, Wonder3D, etc.).

Abstract

Multiview diffusion (MVD) has emerged as a prominent 3D generation technique, acclaimed for its generalizability, quality, and efficiency. MVD models finetune image diffusion models with 3D data to generate multiple views of a 3D object from an image or text prompt, followed by a multiview 3D reconstruction process. However, the sparsity of views and the inconsistent details in the generated multiview images pose challenges for 3D reconstruction. We present MVD2, an efficient 3D reconstruction method tailored for MVD images. MVD2 integrates multiview image features into a 3D feature volume, then transforms this volume into a textureless 3D mesh, onto which the MVD images are mapped as textures. It employs a simple-yet-efficient view-dependent training scheme to mitigate the discrepancies between MVD images and ground-truth views of 3D shapes, effectively improving 3D generation quality and robustness. MVD2 is trained on 3D shape collections together with MVD images; once trained, it reconstructs a 3D mesh from multiview images within one second and generalizes well to images produced by various MVD methods.
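
To make the feedforward design concrete, below is a minimal PyTorch sketch of lifting per-view image features into a shared 3D feature volume. It is an illustrative stand-in under stated assumptions, not MVD2's actual architecture: the tiny encoder, the grid resolution, and the projections interface are all hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MVDFeatureVolume(nn.Module):
    """Lift per-view 2D features into a shared 3D grid by projecting each
    grid point into every view and averaging the sampled features.
    (Hypothetical sketch; not MVD2's actual modules.)"""
    def __init__(self, feat_dim=32, res=32):
        super().__init__()
        # Tiny 2D encoder standing in for the real image backbone.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
        )
        # Regular grid of query points in [-1, 1]^3.
        axis = torch.linspace(-1, 1, res)
        grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), -1)
        self.register_buffer("grid", grid.reshape(-1, 3))  # (res^3, 3)

    def forward(self, images, projections):
        """images: (V, 3, H, W); projections: V callables mapping (N, 3)
        world points to (N, 2) normalized image coordinates in [-1, 1]."""
        feats = []
        for img, proj in zip(images, projections):
            fmap = self.encoder(img[None])                 # (1, C, H, W)
            uv = proj(self.grid)                           # (N, 2)
            sample = F.grid_sample(
                fmap, uv[None, :, None, :], align_corners=True
            )                                              # (1, C, N, 1)
            feats.append(sample[0, :, :, 0].t())           # (N, C)
        return torch.stack(feats).mean(0)                  # fused grid features (N, C)

From such a fused volume, MVD2 decodes the textureless mesh; the sketch stops at feature fusion.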

Method

Given a reference image, the MVD model generates multiple images at specific viewpoints. MVD2 extracts and fuses the features of these images into a coarse 3D grid G, interpolates them into a dense grid G', and extracts the surface mesh from G' in a differentiable manner. During training, the mesh reconstruction is guided by a pixelwise loss (red arrow) against the depth/normal/mask maps at the reference view v0, and by a structural loss (yellow arrow) against the normal maps at the other views. The resulting mesh can be textured by mapping the MVD images onto it.
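
The view-dependent training scheme can likewise be sketched in a few lines. Everything below is a hypothetical placeholder: View, renderer, the loss weights, and the image-gradient structural loss stand in for MVD2's actual renderer, data structures, and loss definitions.

from dataclasses import dataclass
from typing import Callable, Dict, Sequence
import torch
import torch.nn.functional as F

@dataclass
class View:
    camera: object                      # camera parameters for rendering
    maps: Dict[str, torch.Tensor]       # "depth"/"normal"/"mask" target maps

def view_dependent_loss(mesh, views: Sequence[View],
                        renderer: Callable, w_pix=1.0, w_struct=0.5):
    """Pixelwise loss at the reference view v0 (index 0); structural loss
    on normal maps at the remaining, possibly inconsistent views."""
    total = torch.zeros(())
    for i, view in enumerate(views):
        pred = renderer(mesh, view.camera)   # differentiable render -> maps
        if i == 0:
            # Reference view: trust the MVD image, compare per pixel.
            for k in ("depth", "normal", "mask"):
                total = total + w_pix * F.l1_loss(pred[k], view.maps[k])
        else:
            # Other views: compare image gradients of the normal maps, so
            # coarse structure is supervised but pixel detail is not.
            pg = torch.gradient(pred["normal"], dim=[-2, -1])
            tg = torch.gradient(view.maps["normal"], dim=[-2, -1])
            total = total + w_struct * sum(
                F.l1_loss(p, t) for p, t in zip(pg, tg))
    return total

The intent of the split is that the reference view v0 matches the input and can be trusted per pixel, while the other MVD views may disagree in fine detail, so only their coarser structure is supervised.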

Results

[Gallery: each example shows the input image, the reconstructed geometry, and the textured mesh.]

Comparison on the GSO dataset

Comparison on Internet images

Links

Paper [PDF]

Code [GitHub]

Citation [BibTeX]