Three-Dimensional Scene Reconstruction from Multi-View Images (CVPR’22 Oral)
We introduce a CVPR 2022 Oral paper on 3D scene reconstruction: Neural 3D Scene Reconstruction with the Manhattan-world Assumption, from the State Key Laboratory of CAD&CG at Zhejiang University and the Zhejiang University–SenseTime Joint Laboratory of 3D Vision.
Paper link: https://arxiv.org/abs/2205.02836
Paper code: https://github.com/zju3dv/manhattan_sdf
Project page: https://zju3dv.github.io/manhattan_sdf/
- Introduction
1.1 Problem statement
Given an image sequence captured in an indoor scene, the goal is to reconstruct a 3D model of that scene. This problem has many applications, such as virtual and augmented reality, and robotics.
1.2 Limitations of existing approaches
Traditionally, scene reconstruction is done with MVS (Multi-View Stereo) [1,2]: a depth map is first estimated for each view via multi-view matching, and the per-view depths are then fused in 3D space. The biggest problem with this approach is that it struggles with weakly textured regions and non-Lambertian surfaces, because these regions are hard to match, leading to incomplete reconstructions.
Multi-view Stereo via Depth Map Fusion: A Coordinate Decent Optimization Method
Recently, 3D reconstruction methods based on implicit neural representations have been proposed. NeRF [3] learns an implicit radiance field from images through differentiable volume rendering. NeRF achieves photorealistic view synthesis, but its geometric reconstructions are very noisy, mainly due to the lack of surface constraints. NeuS [4] and VolSDF [5] model scene geometry with an SDF (signed distance field) and perform volume rendering based on the SDF, which yields smoother geometric reconstructions than NeRF. However, these methods rely on photometric consistency, so they still struggle with weakly textured regions, and their reconstruction quality on indoor scenes is poor.
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
1.3 Our observation and solution
To resolve the ambiguity of indoor scene reconstruction in weakly textured planar regions, we apply geometric constraints based on the Manhattan-world assumption during optimization. The Manhattan-world assumption is widely used for indoor scenes: the floors, walls, and ceilings of an indoor scene are usually aligned with three mutually perpendicular dominant directions. Based on this, we design corresponding geometric constraints for the floor and wall regions.
Manhattan-world assumption diagram
- Paper method
2.1 Method Overview
This paper models the geometry, appearance, and semantics of a scene with a neural implicit representation and optimizes this representation from multi-view images. The specific steps are as follows:
1) Optimize geometry and appearance from the input images with differentiable volume rendering.
2) Predict semantic segmentations of walls and floors, and apply the corresponding geometric constraints to these regions based on the Manhattan-world assumption.
3) To be robust to inaccurate semantic segmentation, we propose a joint optimization strategy that optimizes geometry and semantics together, achieving higher-quality reconstructions.
2.2 Volume rendering based on SDF
To adopt the volume rendering technique, we first convert the signed distance field to volume density:
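The equation image did not survive extraction here; as a hedged reconstruction based on the VolSDF formulation [5] that the paper builds on, the signed distance $d(\mathbf{x})$ is mapped to volume density with the CDF of a Laplace distribution ($\alpha$ and $\beta$ are learnable parameters):

```latex
\sigma(\mathbf{x}) = \alpha \, \Psi_\beta\!\bigl(-d(\mathbf{x})\bigr), \qquad
\Psi_\beta(s) =
\begin{cases}
\frac{1}{2}\exp\!\left(\frac{s}{\beta}\right), & s \le 0,\\[4pt]
1 - \frac{1}{2}\exp\!\left(-\frac{s}{\beta}\right), & s > 0.
\end{cases}
```

Under this conversion the density saturates toward $\alpha$ inside the surface and decays to zero outside it, so rendering is differentiable with respect to the SDF.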
2.3 Geometric Constraints
We first use DeepLabV3+ [6] to segment the floor and wall regions in image space. For each pixel in the floor region, we perform volume rendering to obtain the corresponding surface point, and compute the gradient of the signed distance field there to obtain the surface normal. We design a loss function that constrains floor normals to align with the vertical direction (wall normals are analogously constrained to be perpendicular to it):
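As an illustration, here is a minimal NumPy sketch of these constraints, using a hypothetical analytic plane SDF in place of the learned network and central differences in place of autograd (the function names and the fixed up direction are assumptions, not the paper's code):

```python
import numpy as np

UP = np.array([0.0, 0.0, 1.0])  # assumed vertical direction

def sdf_plane(p):
    # hypothetical SDF: signed distance to a horizontal floor at z = 0
    return p @ UP

def surface_normals(sdf, p, eps=1e-4):
    # normal = normalized gradient of the SDF, here via central differences
    grad = np.stack(
        [(sdf(p + eps * e) - sdf(p - eps * e)) / (2 * eps) for e in np.eye(3)],
        axis=-1,
    )
    return grad / np.linalg.norm(grad, axis=-1, keepdims=True)

def floor_loss(normals):
    # floor normals should align with the vertical direction: |1 - n.up|
    return np.abs(1.0 - normals @ UP).mean()

def wall_loss(normals):
    # wall normals should be horizontal, i.e. perpendicular to up: |n.up|
    return np.abs(normals @ UP).mean()

pts = np.array([[0.1, 0.2, 0.0], [0.5, -0.3, 0.0]])  # sampled surface points
n = surface_normals(sdf_plane, pts)
```

For a horizontal floor the floor loss is zero. In the paper, wall directions additionally carry a learnable rotation about the vertical axis, which this sketch omits.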
2.4 Joint optimization
Geometric constraints work well in regions where the semantic segmentation is accurate, but the segmentation predicted by the network may be inaccurate in some regions, which degrades the reconstruction. As shown in the figure below, when the semantic segmentation is inaccurate, adding geometric constraints makes the reconstruction worse.
To overcome this problem, we learn a semantic field in 3D space. We render semantics into image space with volume rendering and obtain, via softmax normalization, the probability that each pixel belongs to the floor or wall region. We use this probability to weight the geometric constraints:
$$\mathcal{L}_{\mathrm{joint}} = \sum_{r \in \mathcal{F}} \hat{s}_f(r)\,\mathcal{L}_f(r) + \sum_{r \in \mathcal{W}} \hat{s}_w(r)\,\mathcal{L}_w(r)$$

where $\mathcal{F}$ and $\mathcal{W}$ are the sets of pixels predicted as floor and wall, $\hat{s}_f(r)$ and $\hat{s}_w(r)$ are the rendered probabilities that pixel $r$ belongs to the floor or wall, and $\mathcal{L}_f(r)$, $\mathcal{L}_w(r)$ are the per-pixel normal losses above.
Meanwhile, to avoid the trivial solution (the predicted floor and wall probabilities collapsing to 0), we also supervise the semantic field with a cross-entropy loss against the predictions of the 2D semantic segmentation network:
$$\mathcal{L}_s = -\sum_{r} \sum_{k \in \{f,\, w\}} s_k(r) \log \hat{s}_k(r)$$

where $s_k(r)$ is the label predicted by the 2D semantic segmentation network.
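A minimal NumPy sketch of this joint objective, assuming three rendered semantic logits per pixel (floor, wall, background) and made-up per-pixel geometric losses; the shapes, values, and variable names are illustrative, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# hypothetical rendered semantic logits for two pixels: [floor, wall, background]
logits = np.array([[4.0, 0.5, 0.2],
                   [0.3, 3.5, 0.1]])
probs = softmax(logits)                    # rendered probabilities s_hat

# made-up per-pixel normal losses (pixel 0 in floor set F, pixel 1 in wall set W)
geom_loss = np.array([0.02, 0.10])
weights = probs[np.arange(2), [0, 1]]      # s_hat_f for pixel 0, s_hat_w for pixel 1
joint_loss = (weights * geom_loss).sum()   # probability-weighted geometric constraint

# cross-entropy against the 2D network's labels prevents the trivial solution
labels = np.array([0, 1])                  # 2D segmentation says: floor, wall
ce_loss = -np.log(probs[np.arange(2), labels]).mean()
```

If the semantic field lowers $\hat{s}$ in a region, the geometric constraint there is softly disabled, while the cross-entropy term keeps the probabilities anchored to the 2D predictions.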
- Experimental analysis
3.1 Ablation studies
Qualitative and quantitative results show that the geometric constraints improve reconstruction in planar regions, but inaccurate semantic segmentation also degrades the reconstruction of some non-planar regions. With our joint optimization strategy, the reconstruction results improve across the board.
3.2 Comparison with SOTA method
We compare against previous MVS and volume rendering methods on the ScanNet and 7-Scenes datasets; our numerical results are significantly better than those of prior methods.