Visual Genome is a dataset, a knowledge base, an ongoing effort to connect structured image concepts to language. It allows for a multi-perspective study of an image, from pixel-level information like objects, to relationships that require further inference, and to even deeper cognitive tasks like question answering. It is a comprehensive dataset for training and benchmarking the next generation of computer vision models. With Visual Genome, we expect these models to develop a broader understanding of our visual world, complementing computers’ capacities to detect objects with abilities to describe those objects and explain their interactions and relationships. Visual Genome is a large formalized knowledge representation for visual understanding and a more complete set of descriptions and question answers that grounds visual concepts to language.