Connecting Vision and Language via Interpretation, Grounding and Imagination

  • Ramakrishna Vedantam | Georgia Tech

Understanding how to model vision and language jointly is a long-standing challenge in artificial intelligence. Vision is one of the primary senses through which we perceive the world, while language is the structure we use to represent and communicate knowledge. In this talk, we will take three lines of attack on this problem: interpretation, grounding, and imagination. In interpretation, the goal will be to get machine learning models to understand an image and describe its contents in natural language in a contextually relevant manner. In grounding, we will connect natural language to referents in the physical world and show how this connection can help machines learn common sense. Finally, in imagination, we will study how to ‘imagine’ visual concepts completely and accurately across the full range of their visual attributes, including potentially unseen compositions of those attributes. We will study these problems from both computational and algorithmic perspectives and suggest exciting directions for future work.

Speaker Details

Ramakrishna Vedantam is a Computer Science Ph.D. student in the School of Interactive Computing in the College of Computing at Georgia Tech, advised by Devi Parikh. He has previously held visiting positions at INRIA-Saclay (Summer 2014), Google Research – Mountain View (Summer 2016, Winter 2017), and Facebook AI Research – Menlo Park (Summer 2017). He received his master’s degree in Computer Engineering from Virginia Tech in 2016 and his undergraduate degree in Electronics and Communication Engineering from the International Institute of Information Technology (IIIT) – Hyderabad in 2013.

He is broadly interested in computer vision, machine learning, and artificial intelligence. On the vision side, he works on problems connecting vision to natural language, common sense reasoning, and visually grounded reasoning. On the machine learning side, he is interested in generative models and variational inference. Among other things, he is the inventor of the CIDEr metric, which is commonly used for evaluating image captioning models. Over the course of his Ph.D., he has worked on topics such as context-aware captioning, multimodal variational autoencoders, learning grounded word embeddings, and common sense reasoning. He is the recipient of the 2018 Google PhD Fellowship in Machine Perception and an Outstanding Reviewer Award at CVPR 2017, and was a finalist for the Adobe Research Fellowship in 2016 and 2018.