https://arxiv.org/pdf/1703.10956v1.pdf InverseFaceNet: Deep Single-Shot Inverse Face Rendering From A Single Image

We introduce InverseFaceNet, a deep convolutional inverse rendering framework for faces that jointly estimates facial pose, shape, expression, reflectance and illumination from a single input image in a single shot. By estimating all these parameters from just a single image, advanced editing possibilities on a single face image, such as appearance editing and relighting, become feasible. Previous learning-based face reconstruction approaches do not jointly recover all dimensions, or are severely limited in terms of visual quality. In contrast, we propose to recover high-quality facial pose, shape, expression, reflectance and illumination using a deep neural network that is trained using a large, synthetically created dataset. Our approach builds on a novel loss function that measures model-space similarity directly in parameter space and significantly improves reconstruction accuracy. In addition, we propose an analysis-by-synthesis ‘breeding’ approach which iteratively updates the synthetic training corpus based on the distribution of real-world images, and we demonstrate that this strategy outperforms completely synthetically trained networks. Finally, we show high-quality reconstructions and compare our approach to several state-of-the-art approaches.
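The abstract's "loss measured directly in parameter space" amounts to comparing regressed face-model coefficients group by group. Below is a minimal sketch of such a loss; the group names, their sizes, and the per-group weights are illustrative assumptions, not the paper's actual model dimensions or values.

```python
# Sketch of a loss measured directly in face-model parameter space.
# Group boundaries and weights are illustrative, not InverseFaceNet's values.
import numpy as np

GROUPS = {
    "pose":         slice(0, 6),      # rotation + translation
    "shape":        slice(6, 86),     # identity coefficients
    "expression":   slice(86, 150),   # expression coefficients
    "reflectance":  slice(150, 230),  # albedo coefficients
    "illumination": slice(230, 257),  # spherical-harmonics lighting
}
WEIGHTS = {"pose": 1.0, "shape": 1.0, "expression": 1.0,
           "reflectance": 0.5, "illumination": 0.5}

def parameter_space_loss(pred, target):
    """Weighted squared error accumulated per parameter group."""
    loss = 0.0
    for name, sl in GROUPS.items():
        diff = pred[sl] - target[sl]
        loss += WEIGHTS[name] * float(diff @ diff)
    return loss

pred = np.random.randn(257)
target = np.random.randn(257)
print(parameter_space_loss(pred, target))
```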

https://arxiv.org/abs/1704.01047v1 OctNetFusion: Learning Depth Fusion from Data

In this paper, we present a learning-based approach to depth fusion, i.e., dense 3D reconstruction from multiple depth images. The most common approach to depth fusion is based on averaging truncated signed distance functions, which was originally proposed by Curless and Levoy in 1996. While this method achieves great results, it cannot reconstruct surfaces occluded in the input views and requires a large number of frames to filter out sensor noise and outliers. Motivated by large 3D model databases and recent advances in deep learning, we present a novel 3D convolutional network architecture that learns to predict an implicit surface representation from the input depth maps. Our learning-based fusion approach significantly outperforms the traditional volumetric fusion approach in terms of noise reduction and outlier suppression. By learning the structure of real-world 3D objects and scenes, our approach is further able to reconstruct occluded regions and to fill gaps in the reconstruction. We evaluate our approach extensively on both synthetic and real-world datasets for volumetric fusion. Further, we apply our approach to the problem of 3D shape completion from a single view, where our approach achieves state-of-the-art results.
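For context, the classical baseline the paper compares against is the per-voxel weighted running average of truncated signed distances (Curless and Levoy, 1996). A minimal sketch follows; the grid size, truncation band, and the uniform per-frame weighting are illustrative choices, and the per-frame signed distances are assumed to be already projected into the volume.

```python
# Classical TSDF fusion: per-voxel weighted running average (Curless & Levoy).
import numpy as np

TRUNC = 0.05  # truncation band in scene units (illustrative)

def fuse_frame(tsdf, weights, frame_sdf, frame_weight=1.0):
    """Incorporate one frame's (already projected) signed distances."""
    d = np.clip(frame_sdf, -TRUNC, TRUNC)
    valid = np.abs(frame_sdf) < TRUNC            # ignore voxels far from the surface
    w_new = weights + frame_weight * valid
    tsdf_new = np.where(
        valid,
        (tsdf * weights + d * frame_weight) / np.maximum(w_new, 1e-8),
        tsdf,
    )
    return tsdf_new, w_new

# Toy 32^3 volume fused from random per-frame distance fields.
tsdf = np.zeros((32, 32, 32))
weights = np.zeros_like(tsdf)
for _ in range(5):
    frame = np.random.uniform(-0.2, 0.2, tsdf.shape)
    tsdf, weights = fuse_frame(tsdf, weights, frame)
print(tsdf.min(), tsdf.max())
```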

https://arxiv.org/abs/1704.02157 Multi-Scale Continuous CRFs as Sequential Deep Networks for Monocular Depth Estimation

This paper addresses the problem of depth estimation from a single still image. Inspired by recent works on multi-scale convolutional neural networks (CNNs), we propose a deep model which fuses complementary information derived from multiple CNN side outputs. By designing a novel CNN implementation of mean-field updates for continuous CRFs, we show that the proposed model can be regarded as a sequential deep network and that training can be performed end-to-end.
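To give a feel for what a mean-field update over a continuous depth field does, here is a schematic single step: each pixel is pulled toward an affinity-weighted average of its neighbours while staying anchored to the unary (CNN) prediction. The Gaussian affinity on intensity differences and the 4-neighbourhood are illustrative simplifications; the paper implements the updates as CNN operations fusing multi-scale side outputs.

```python
# Schematic mean-field step for a continuous CRF over a depth map (not the
# paper's exact update; a simplified illustration of the idea).
import numpy as np

def mean_field_step(depth, unary, image, sigma=0.1):
    num = unary.copy()
    den = np.ones_like(depth)
    for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]:     # 4-neighbourhood
        shifted_d = np.roll(depth, (dy, dx), axis=(0, 1))
        shifted_i = np.roll(image, (dy, dx), axis=(0, 1))
        aff = np.exp(-((image - shifted_i) ** 2) / (2 * sigma ** 2))
        num += aff * shifted_d
        den += aff
    return num / den

image = np.random.rand(8, 8)          # stand-in for pixel features
unary = np.random.rand(8, 8)          # CNN depth prediction
depth = unary.copy()
for _ in range(3):                    # a few mean-field iterations
    depth = mean_field_step(depth, unary, image)
print(depth.shape)
```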

http://graphics.stanford.edu/projects/cnncomplete/ Shape Completion using 3D-Encoder-Predictor CNNs and Shape Synthesis

https://arxiv.org/pdf/1704.04086.pdf Beyond Face Rotation: Global and Local Perception GAN for Photorealistic and Identity Preserving Frontal View Synthesis

https://arxiv.org/abs/1704.03489 CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction

https://arxiv.org/abs/1707.07410 Toward Geometric Deep SLAM

We present a point tracking system powered by two deep convolutional neural networks. The first network, MagicPoint, operates on single images and extracts salient 2D points. The extracted points are “SLAM-ready” because they are by design isolated and well-distributed throughout the image. We compare this network against classical point detectors and discover a significant performance gap in the presence of image noise. As transformation estimation is simpler when the detected points are geometrically stable, we designed a second network, MagicWarp, which operates on pairs of point images (outputs of MagicPoint), and estimates the homography that relates the inputs. This transformation engine differs from traditional approaches because it does not use local point descriptors, only point locations. Both networks are trained with simple synthetic data, alleviating the requirement of expensive external camera ground truthing and advanced graphics rendering pipelines. The system is fast and lean, easily running 30+ FPS on a single CPU.
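As a point of reference for "homography from point locations only", the classical route is the direct linear transform (DLT) solved with SVD, given known correspondences. MagicWarp instead regresses the transformation with a network from pairs of point images; the snippet below is only the classical baseline, written here without Hartley normalization for brevity.

```python
# Classical DLT homography estimation from known point correspondences.
import numpy as np

def dlt_homography(src, dst):
    """src, dst: (N, 2) arrays of corresponding points, N >= 4."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(rows)
    _, _, vt = np.linalg.svd(A)        # null vector = homography entries
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

# Sanity check: points mapped through a known homography are recovered.
H_true = np.array([[1.0, 0.1, 5.0], [0.0, 1.2, -3.0], [0.001, 0.0, 1.0]])
src = np.random.rand(8, 2) * 100
src_h = np.hstack([src, np.ones((8, 1))])
dst_h = src_h @ H_true.T
dst = dst_h[:, :2] / dst_h[:, 2:3]
print(np.abs(dlt_homography(src, dst) - H_true).max())
```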

https://arxiv.org/abs/1708.07969 3D Object Reconstruction from a Single Depth View with Adversarial Learning

https://github.com/Yang7879/3D-RecGAN

https://arxiv.org/abs/1804.01622 Image Generation from Scene Graphs

To truly understand the visual world our models should be able not only to recognize images but also generate them. To this end, there has been exciting recent progress on generating images from natural language descriptions. These methods give stunning results on limited domains such as descriptions of birds or flowers, but struggle to faithfully reproduce complex sentences with many objects and relationships. To overcome this limitation we propose a method for generating images from scene graphs, enabling explicit reasoning about objects and their relationships. Our model uses graph convolution to process input graphs, computes a scene layout by predicting bounding boxes and segmentation masks for objects, and converts the layout to an image with a cascaded refinement network. The network is trained adversarially against a pair of discriminators to ensure realistic outputs. We validate our approach on Visual Genome and COCO-Stuff, where qualitative results, ablations, and user studies demonstrate our method's ability to generate complex images with multiple objects.
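A schematic of one graph-convolution step over scene-graph triples, loosely following the abstract: each (subject, predicate, object) triple is processed jointly, and the updated object vectors are averaged over all triples an object appears in. The embedding size, the single linear-plus-ReLU layer, and the toy graph are illustrative, not the paper's architecture.

```python
# One illustrative graph-convolution step over scene-graph triples.
import numpy as np

rng = np.random.default_rng(0)
D = 16                                   # embedding size (illustrative)
obj_vecs = rng.normal(size=(4, D))       # 4 objects
pred_vecs = rng.normal(size=(3, D))      # 3 predicates
triples = [(0, 0, 1), (1, 1, 2), (3, 2, 2)]   # (subject, predicate, object)

W = rng.normal(size=(3 * D, 3 * D)) * 0.1     # shared layer over concatenated triple

def graph_conv(obj_vecs, pred_vecs, triples):
    new_obj = np.zeros_like(obj_vecs)
    counts = np.zeros(len(obj_vecs))
    new_pred = pred_vecs.copy()
    for s, p, o in triples:
        x = np.concatenate([obj_vecs[s], pred_vecs[p], obj_vecs[o]])
        h = np.maximum(W @ x, 0.0)            # one linear + ReLU layer
        hs, hp, ho = h[:D], h[D:2 * D], h[2 * D:]
        new_obj[s] += hs; counts[s] += 1
        new_obj[o] += ho; counts[o] += 1
        new_pred[p] = hp
    mask = counts > 0
    new_obj[mask] /= counts[mask][:, None]    # average over participating triples
    new_obj[~mask] = obj_vecs[~mask]          # isolated objects keep their vectors
    return new_obj, new_pred

obj_vecs, pred_vecs = graph_conv(obj_vecs, pred_vecs, triples)
print(obj_vecs.shape, pred_vecs.shape)
```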

https://arxiv.org/abs/1808.09351v2 3D-Aware Scene Manipulation via Inverse Graphics

Our scene encoder performs inverse graphics, translating a scene into the structured object representation. Our decoder has two components: a differentiable shape renderer and a neural texture generator. The disentanglement of geometry, appearance, and pose supports various 3D-aware scene manipulations, e.g., rotating and moving objects freely while maintaining consistent shape and texture, changing object appearance without affecting its shape. We systematically evaluate our model and demonstrate that our editing scheme is superior to its 2D counterpart.
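A minimal sketch of the kind of structured, disentangled representation the encoder produces and how a 3D-aware edit operates on it: each object carries separate shape, texture, and pose factors, and a pose edit leaves the other two untouched. The dataclass, field names, and values are illustrative, not the paper's actual representation.

```python
# Illustrative disentangled per-object representation and a pose-only edit.
from dataclasses import dataclass, replace
import numpy as np

@dataclass
class ObjectCode:
    shape: np.ndarray     # latent geometry code
    texture: np.ndarray   # latent appearance code
    pose: np.ndarray      # e.g. (x, y, z, yaw)

scene = [
    ObjectCode(np.zeros(64), np.zeros(32), np.array([0.0, 0.0, 5.0, 0.0])),
    ObjectCode(np.ones(64), np.ones(32), np.array([2.0, 0.0, 6.0, 1.57])),
]

def rotate_object(obj, delta_yaw):
    """Pose-only edit: shape and texture codes are left untouched."""
    new_pose = obj.pose.copy()
    new_pose[3] += delta_yaw
    return replace(obj, pose=new_pose)

scene[1] = rotate_object(scene[1], np.pi / 4)
print(scene[1].pose)
```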

http://gibsonenv.stanford.edu/

https://keypointnet.github.io/

https://arxiv.org/abs/1802.06857 Global Pose Estimation with an Attention-based Recurrent Network

https://openreview.net/forum?id=rJe10iC5K7 Modeling Parts, Structure, and System Dynamics via Predictive Learning

We propose a novel formulation that simultaneously learns a hierarchical, disentangled object representation and a dynamics model for object parts from unlabeled videos in a self-supervised manner. Our Parts, Structure, and Dynamics (PSD) model learns to first recognize the object parts via a layered image representation; second, predict hierarchy via a structural descriptor that composes low-level concepts into a hierarchical structure; and third, model the system dynamics by predicting the future. Experiments on multiple real and synthetic datasets demonstrate that our PSD model works well on all three tasks: segmenting object parts, building their hierarchical structure, and capturing their motion distributions.
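An illustrative sketch of the hierarchical motion composition the abstract alludes to: a part's global motion is its parent's motion plus a local residual, composed down the inferred tree. The tree, the 2D-translation motion model, and the values below are toy assumptions, not the paper's model.

```python
# Toy hierarchical motion composition: global motion = sum of residuals to root.
import numpy as np

parents = {"torso": None, "left_arm": "torso", "left_hand": "left_arm"}
local_motion = {                      # per-part residual motion (dx, dy)
    "torso": np.array([1.0, 0.0]),
    "left_arm": np.array([0.0, 0.5]),
    "left_hand": np.array([0.2, 0.1]),
}

def global_motion(part):
    """Compose residual motions along the path to the root."""
    motion = np.zeros(2)
    while part is not None:
        motion += local_motion[part]
        part = parents[part]
    return motion

for part in parents:
    print(part, global_motion(part))
```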

https://arxiv.org/abs/1806.08756 Dense Object Nets: Learning Dense Visual Object Descriptors By and For Robotic Manipulation

What is the right object representation for manipulation? We would like robots to visually perceive scenes and learn an understanding of the objects in them that (i) is task-agnostic and can be used as a building block for a variety of manipulation tasks, (ii) is generally applicable to both rigid and non-rigid objects, (iii) takes advantage of the strong priors provided by 3D vision, and (iv) is entirely learned from self-supervision. This is hard to achieve with previous methods: much recent work in grasping does not extend to grasping specific objects or other tasks, whereas task-specific learning may require many trials to generalize well across object configurations or other tasks. In this paper we present Dense Object Nets, which build on recent developments in self-supervised dense descriptor learning, as a consistent object representation for visual understanding and manipulation. We demonstrate that they can be trained quickly (approximately 20 minutes) for a wide variety of previously unseen and potentially non-rigid objects. We additionally present novel contributions to enable multi-object descriptor learning, and show that by modifying our training procedure, we can either acquire descriptors which generalize across classes of objects, or descriptors that are distinct for each object instance. Finally, we demonstrate the novel application of learned dense descriptors to robotic manipulation. We demonstrate grasping of specific points on an object across potentially deformed object configurations, and demonstrate using class-general descriptors to transfer specific grasps across objects in a class.
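A minimal sketch of the pixelwise contrastive objective used in the self-supervised dense descriptor line of work this builds on: matched pixels across two views are pulled together in descriptor space, while non-matches are pushed apart by a hinge margin. The descriptor dimension, margin, and random "descriptor maps" below are placeholders, not the paper's settings.

```python
# Pixelwise contrastive loss over dense descriptor maps (illustrative sketch).
import numpy as np

D, M = 3, 0.5   # descriptor dimension and non-match margin (illustrative)

def contrastive_loss(desc_a, desc_b, matches, non_matches):
    """desc_*: (H*W, D) flattened descriptor maps; matches/non_matches: (K, 2) index pairs."""
    ia, ib = matches[:, 0], matches[:, 1]
    match_loss = np.mean(np.sum((desc_a[ia] - desc_b[ib]) ** 2, axis=1))
    ja, jb = non_matches[:, 0], non_matches[:, 1]
    dist = np.linalg.norm(desc_a[ja] - desc_b[jb], axis=1)
    non_match_loss = np.mean(np.maximum(0.0, M - dist) ** 2)
    return match_loss + non_match_loss

rng = np.random.default_rng(0)
desc_a = rng.normal(size=(64 * 64, D))
desc_b = rng.normal(size=(64 * 64, D))
matches = rng.integers(0, 64 * 64, size=(100, 2))
non_matches = rng.integers(0, 64 * 64, size=(100, 2))
print(contrastive_loss(desc_a, desc_b, matches, non_matches))
```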

Multimodal Unsupervised Image-to-Image Translation (MUNIT) https://github.com/NVlabs/MUNIT

Video-to-Video Synthesis (vid2vid) https://github.com/NVIDIA/vid2vid

Depth estimation using CycleGAN https://github.com/gautam678/Pix2Depth

Twin GAN https://arxiv.org/abs/1809.00946