https://arxiv.org/abs/1706.01427 A simple neural network module for relational reasoning

https://arxiv.org/abs/1704.05526 Learning to Reason: End-to-End Module Networks for Visual Question Answering

https://arxiv.org/abs/1612.06890 CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

https://arxiv.org/abs/1705.03633 Inferring and Executing Programs for Visual Reasoning

https://arxiv.org/abs/1611.09978 Modeling Relationships in Referential Expressions with Compositional Modular Networks

People often refer to entities in an image in terms of their relationships with other entities. For example, “the black cat sitting under the table” refers to both a “black cat” entity and its relationship with another “table” entity. Understanding these relationships is essential for interpreting and grounding such natural language expressions. Most prior work focuses on either grounding entire referential expressions holistically to one region, or localizing relationships based on a fixed set of categories. In this paper we instead present a modular deep architecture capable of analyzing referential expressions into their component parts, identifying entities and relationships mentioned in the input expression and grounding them all in the scene. We call this approach Compositional Modular Networks (CMNs): a novel architecture that learns linguistic analysis and visual inference end-to-end. Our approach is built around two types of neural modules that inspect local regions and pairwise interactions between regions. We evaluate CMNs on multiple referential expression datasets, outperforming state-of-the-art approaches on all tasks.
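
A rough sketch of the two-module idea from the abstract above: one module scores single regions, the other scores pairs of regions, and the best (subject, object) pair is chosen. Everything here is an illustrative assumption rather than the paper's implementation: the function names, the word-overlap and box-center heuristics standing in for learned modules, the additive combination of scores, and the toy `regions` data are all hypothetical.

```python
from itertools import product

def localization_score(region, phrase):
    # Hypothetical stand-in for a learned unary module:
    # count words shared by the phrase and the region's labels.
    return len(set(phrase.split()) & region["labels"])

def relationship_score(subj_region, obj_region, relation):
    # Hypothetical stand-in for a learned pairwise module.
    # Image coordinates: a larger center y means lower in the image.
    if relation == "under":
        return 1.0 if subj_region["cy"] > obj_region["cy"] else 0.0
    return 0.0

def ground(regions, subject, relation, obj):
    # Score every ordered pair of distinct regions and keep the best one.
    # Summing the three scores is an assumption made for this sketch.
    best_pair, best_score = None, float("-inf")
    for i, j in product(range(len(regions)), repeat=2):
        if i == j:
            continue
        score = (localization_score(regions[i], subject)
                 + localization_score(regions[j], obj)
                 + relationship_score(regions[i], regions[j], relation))
        if score > best_score:
            best_pair, best_score = (i, j), score
    return best_pair, best_score

# Toy scene for "the black cat sitting under the table".
regions = [
    {"labels": {"black", "cat"}, "cy": 80},
    {"labels": {"table"}, "cy": 40},
    {"labels": {"white", "cat"}, "cy": 20},
]
pair, score = ground(regions, "black cat", "under", "table")
print(pair, score)
```

In the toy scene the black cat (region 0) and the table (region 1) win because the pair matches both phrases and satisfies the spatial check; the white cat matches "cat" but sits above the table.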

https://arxiv.org/abs/1806.08047v1 Flexible Neural Representation for Physics Prediction

We propose a hierarchical particle-based object representation that covers a wide variety of types of three-dimensional objects, including both arbitrary rigid geometrical shapes and deformable materials. We then describe the Hierarchical Relation Network (HRN), an end-to-end differentiable neural network based on hierarchical graph convolution, that learns to predict physical dynamics in this representation.
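
To make the hierarchy idea concrete, here is a minimal sketch of two-pass message passing on a particle tree: effects are summed from leaf particles up to group nodes, then shared back down. This is not the HRN itself (no learned graph convolution is involved); the tree layout, the node names, the scalar "effect", and the equal-share rule are all assumptions chosen for illustration.

```python
def upward_pass(node, children, effect):
    # Leaf-to-ancestor: each group node stores the sum of its descendants' effects.
    kids = children.get(node, [])
    if kids:
        effect[node] = sum(upward_pass(k, children, effect) for k in kids)
    else:
        effect.setdefault(node, 0.0)  # leaves with no external effect get 0
    return effect[node]

def downward_pass(node, children, inherited, out):
    # Ancestor-to-descendant: hand each child an equal share of the group's effect.
    kids = children.get(node, [])
    if not kids:
        out[node] = inherited
        return
    for k in kids:
        downward_pass(k, children, inherited / len(kids), out)

# Hypothetical two-level hierarchy: one body, two groups, four particles.
children = {"body": ["left", "right"], "left": ["p0", "p1"], "right": ["p2", "p3"]}
effect = {"p0": 4.0}  # an external push applied to a single particle

total = upward_pass("body", children, effect)
out = {}
downward_pass("body", children, total, out)
print(out)
```

With a balanced tree, the push on `p0` ends up spread evenly over all four particles, which is the intuition behind using a hierarchy: an effect on one particle can reach distant particles of the same object in two passes instead of many local hops.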

https://arxiv.org/abs/1807.10982v1 Actor-Centric Relation Network

Our approach is weakly supervised and mines the relevant elements automatically with an actor-centric relational network (ACRN). ACRN computes and accumulates pair-wise relation information from actor and global scene features, and generates relation features for action classification. It is implemented as neural networks and can be trained jointly with an existing action detection system.
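
A small sketch of what "computes and accumulates pair-wise relation information from actor and global scene features" could look like: the actor feature is paired with the feature at every scene location, a relation function is applied to each pair, and the results are averaged. The function names, the single linear-plus-ReLU relation unit, the averaging step, and the toy numbers are assumptions for illustration, not the paper's architecture.

```python
def relation_unit(actor, context, weights):
    # Hypothetical pairwise function: ReLU of a linear map over the
    # concatenated (actor, context) feature vector.
    pair = actor + context  # list concatenation
    return [max(0.0, sum(w * x for w, x in zip(row, pair))) for row in weights]

def actor_centric_relations(actor, scene, weights):
    # Apply the relation unit at every scene location and average the outputs.
    outputs = [relation_unit(actor, cell, weights) for row in scene for cell in row]
    n = len(outputs)
    return [sum(col) / n for col in zip(*outputs)]

# Toy inputs: a 2-d actor feature and a 2x2 scene grid of 2-d features.
actor = [1.0, 0.0]
scene = [[[1.0, 0.0], [0.0, 1.0]],
         [[2.0, 2.0], [0.0, 0.0]]]
weights = [[1, 0, 1, 0],
           [0, 1, 0, -1]]

relation_feature = actor_centric_relations(actor, scene, weights)
print(relation_feature)
```

The resulting `relation_feature` would then be fed to an action classifier alongside the actor feature; that downstream step is omitted here.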