🌎Real-Time Processing
In recent years, Deep Learning (DL) has demonstrated outstanding capabilities in solving 2D-image tasks such as image classification, object detection, and semantic segmentation. DL has also made tremendous progress on 3D graphics problems. In this post we explore a recent attempt to extend DL to single-image 3D reconstruction, one of the most important and profound challenges in 3D computer graphics.
A single image is only a projection of a 3D object onto a 2D plane, so some information from the higher-dimensional space is necessarily lost in the lower-dimensional representation. From a single-view 2D image alone, there is therefore never enough data to reconstruct the object's 3D structure.
A method that creates a 3D perception from a single 2D image therefore requires prior knowledge of the 3D shape itself.
In 2D deep learning, a convolutional autoencoder is a very efficient method for learning a compressed representation of input images. Extending this architecture to learn a compact shape representation is the most promising way to apply deep learning to 3D data.
Unlike a 2D image, which has a single near-universal computer representation (the pixel grid), 3D data can be represented digitally in many ways. Each representation comes with its own advantages and disadvantages, so the choice of data representation directly affects the approach that can be used.
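As a rough illustration, here is a minimal 2D convolutional autoencoder sketch in PyTorch. The layer sizes, the 64×64 input resolution, and the `ConvAutoEncoder` name are assumptions for illustration, not the exact network discussed in this post.

```python
import torch
import torch.nn as nn

# Minimal convolutional autoencoder sketch (illustrative sizes, 64x64 RGB input assumed).
# The encoder compresses the image into a small latent code; the decoder reconstructs it.
class ConvAutoEncoder(nn.Module):
    def __init__(self, latent_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),   # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), # 16x16 -> 8x8
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, latent_dim),         # compressed shape code
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128 * 8 * 8),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),  # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),   # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),    # 32x32 -> 64x64
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Usage: reconstruct a batch of images through the compressed latent code.
recon = ConvAutoEncoder()(torch.rand(1, 3, 64, 64))   # -> (1, 3, 64, 64)
```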
Voxel, short for volumetric pixel, is the direct extension of grid pixels to a volume grid. The locality of neighboring voxels defines the structure of the volumetric data, so the locality assumption of ConvNets still holds in volumetric format.
However, this representation is sparse and wasteful: the density of useful voxels decreases as the resolution increases.
Advantage: CNNs extend directly from the 2D to the 3D representation (see the sketch after this list).
Disadvantage: Wasteful representation, with a steep tradeoff between detail and resources (computation, memory).
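For illustration, the snippet below (tensor shapes are assumptions) shows how a 2D convolution extends directly to a 3D convolution over a voxel grid, and why higher resolution quickly becomes expensive.

```python
import torch
import torch.nn as nn

# Illustrative only: a convolution over pixel grids extends directly to voxel grids,
# but the number of voxels grows cubically with resolution.
voxels = torch.rand(1, 1, 32, 32, 32)            # (batch, channel, D, H, W) occupancy grid
conv3d = nn.Conv3d(1, 16, kernel_size=3, padding=1)
features = conv3d(voxels)                        # -> (1, 16, 32, 32, 32)

# Doubling the resolution (32 -> 64) multiplies the number of voxels by 8,
# while the fraction of occupied (useful) voxels typically shrinks.
```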
Polygonal mesh: a collection of vertices, edges and faces that defines the object's surface in three dimensions. It can capture fine details in a fairly compact representation.
Point cloud: a collection of points in 3D coordinates (x, y, z); together these points form a cloud that resembles the shape of an object in three dimensions. The larger the collection of points, the more detail it captures. Because the points are unordered, the same set of points in any order represents the same 3D object.
Advantage: Compact representation, focused on the surface details of 3D objects.
Disadvantage: Cannot directly apply a CNN to an unordered point set (see the small example after this list).
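A small example of why point clouds resist plain CNNs: the data is just an unordered (N, 3) array, so any permutation of its rows encodes the same shape. The point count below is arbitrary.

```python
import torch

# A point cloud is an (N, 3) array of xyz coordinates with no inherent ordering.
points = torch.rand(1024, 3)                         # 1024 points sampled on a surface
shuffled = points[torch.randperm(points.shape[0])]   # same shape, different row order

# `points` and `shuffled` describe the same 3D object, so there is no fixed grid
# structure for a standard convolution to exploit.
```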
We will show an implementation that combines the advantage of the compact point cloud representation with a traditional 2D ConvNet that learns the prior shape knowledge.
We will build a standard 2D CNN structure generator that learns the prior shape knowledge of an object. The voxel approach is not desirable because it is inefficient, and a point cloud cannot be learned directly with a CNN. Instead, we learn the mapping from a single image to multiple 2D projections of a point cloud, with a 2D projection at a viewpoint defined as:
2D projection = 3D coordinates (x, y, z) + binary mask (m)
Input: Single RGB image
Output: 2D projections at predetermined viewpoints.
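A hedged sketch of what such a structure generator could look like in PyTorch is given below. The layer widths, the 64×64 resolution, the number of viewpoints, and the `StructureGenerator` name are all illustrative assumptions, not the exact architecture.

```python
import torch
import torch.nn as nn

# Sketch of a 2D encoder-decoder that maps one RGB image to V predetermined viewpoints,
# each with 4 output channels per pixel: 3D coordinates (x, y, z) and a mask logit m.
class StructureGenerator(nn.Module):
    def __init__(self, num_views=8, latent_dim=512):
        super().__init__()
        self.num_views = num_views
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),       # 64 -> 32
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),     # 32 -> 16
            nn.Conv2d(128, 256, 4, 2, 1), nn.ReLU(),    # 16 -> 8
            nn.Flatten(),
            nn.Linear(256 * 8 * 8, latent_dim),         # compact shape code
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256 * 8 * 8),
            nn.Unflatten(1, (256, 8, 8)),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),   # 8 -> 16
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),    # 16 -> 32
            nn.ConvTranspose2d(64, num_views * 4, 4, 2, 1),     # 32 -> 64, 4 channels per view
        )

    def forward(self, image):
        out = self.decoder(self.encoder(image))              # (B, V*4, H, W)
        b, _, h, w = out.shape
        out = out.view(b, self.num_views, 4, h, w)
        xyz, mask_logits = out[:, :, :3], out[:, :, 3]        # coordinates and mask logits
        return xyz, mask_logits

xyz, mask_logits = StructureGenerator()(torch.rand(1, 3, 64, 64))
```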
Fuse the predicted 2D projections into a native 3D point cloud. This is possible because the viewpoints of these predictions are fixed and known beforehand.
Input: 2D projections at predetermined viewpoints.
Output: Point cloud
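The fusion step can be sketched as pure geometry, assuming each predetermined viewpoint's known camera-to-world pose is available as a rotation and a translation. The `fuse_projections` helper and its argument names are hypothetical.

```python
import torch

# Sketch of fusion: because each projection's camera pose is fixed and known, every
# predicted per-pixel (x, y, z) can be transformed into a shared world frame, and the
# masked (visible) points from all viewpoints concatenated into one point cloud.
def fuse_projections(xyz, mask_logits, view_rotations, view_translations, threshold=0.5):
    # xyz: (B, V, 3, H, W), mask_logits: (B, V, H, W)
    # view_rotations: (V, 3, 3), view_translations: (V, 3) known camera-to-world poses
    b, v, _, h, w = xyz.shape
    clouds = []
    for i in range(b):
        sample_points = []
        for j in range(v):
            pts = xyz[i, j].reshape(3, -1)                                    # camera-frame coords
            pts = view_rotations[j] @ pts + view_translations[j][:, None]    # to world frame
            keep = torch.sigmoid(mask_logits[i, j]).reshape(-1) > threshold  # visible pixels only
            sample_points.append(pts[:, keep].T)                             # (K, 3)
        clouds.append(torch.cat(sample_points, dim=0))
    return clouds                                                            # list of (N_i, 3) clouds
```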
We reason that if the point cloud fused from the predicted 2D projections is any good, then projections rendered from it at new viewpoints should also resemble the projections of the ground-truth 3D model.
Input: Point cloud
Output: depth images at novel viewpoints.
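To convey the geometry of this step, below is a simplified z-buffer sketch of pseudo-rendering: project the world-frame point cloud into a novel viewpoint and keep the nearest depth per pixel. The `pseudo_render` helper, the focal length, and the image size are assumptions; the full model keeps this step differentiable, which this basic sketch does not attempt.

```python
import torch

# Project a world-frame point cloud into a novel viewpoint and z-buffer it into a depth image.
def pseudo_render(points, rotation, translation, focal=60.0, size=64):
    cam = points @ rotation.T + translation             # world -> camera frame, (N, 3)
    z = cam[:, 2].clamp(min=1e-5)                        # avoid division by zero
    u = (focal * cam[:, 0] / z + size / 2).long()        # perspective projection to pixel columns
    v = (focal * cam[:, 1] / z + size / 2).long()        # and pixel rows
    inside = (u >= 0) & (u < size) & (v >= 0) & (v < size)
    u, v, z = u[inside], v[inside], z[inside]

    depth = torch.full((size, size), float('inf'))       # pixels hit by no point stay at inf
    flat = v * size + u                                  # flattened pixel indices
    depth.view(-1).scatter_reduce_(0, flat, z, reduce="amin")  # keep nearest point per pixel
    return depth
```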
Combining the three modules, we obtain an end-to-end model that learns to generate a compact point cloud representation from a single 2D image, using only a 2D convolutional structure generator.
The clever trick of this model is to make the fusion and pseudo-rendering modules purely differentiable geometric reasoning:
Geometric algebra means no learnable parameters, which makes the model smaller and easier to train.
Differentiable means we can back-propagate gradients through these modules, making it possible to use a loss on the 2D projections to learn to generate the 3D point cloud.
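To make the end-to-end idea concrete, here is a hedged sketch of such a projection loss: compare pseudo-rendered depth and mask at novel viewpoints against projections of the ground-truth model, and let gradients flow back through fusion and rendering into the structure generator. The loss form, the weighting, and the `projection_loss` name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Illustrative projection loss: a mask term plus a depth term restricted to visible pixels.
def projection_loss(pred_depth, pred_mask_logits, gt_depth, gt_mask):
    mask_loss = F.binary_cross_entropy_with_logits(pred_mask_logits, gt_mask)
    depth_loss = F.l1_loss(pred_depth * gt_mask, gt_depth * gt_mask)  # compare visible pixels only
    return depth_loss + mask_loss
```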
Comparison of a novel-viewpoint depth image rendered from the ground-truth 3D model and the depth image rendered from the learned point cloud model.
Final result: from a single RGB image → 3D point cloud.