Recent Advances in Vision-and-Language Research

CVPR 2020 Tutorial


Time: 06/15/2020, 1:15 - 5:00 PM PDT
Location: Zoom


Vision-and-Language (V+L) research is an active area at the nexus of Computer Vision and Natural Language Processing, and has attracted rapidly growing attention from both communities. A variety of V+L tasks, benchmarked on large-scale human-annotated datasets, have driven tremendous progress in joint multimodal representation learning. This tutorial focuses on several recently popular tasks in this domain: visual captioning, visual grounding, visual question answering and reasoning, text-to-image generation, and self-supervised learning for universal image-text representations. We will cover state-of-the-art approaches in each of these areas and discuss key principles that epitomize the core challenges and opportunities in multimodal understanding, reasoning, and generation.


1:15 - 1:25    Opening Remarks presented by JJ Liu and Xiaodong He (Slides, YouTube, Bilibili)

1:25 - 2:15    Visual QA and Reasoning presented by Zhe Gan (Slides, YouTube, Bilibili)

2:15 - 2:30    Coffee Break

2:30 - 3:10    Visual Captioning presented by Luowei Zhou (Slides, YouTube, Bilibili)

3:10 - 3:40    Text-to-Image Synthesis presented by Yu Cheng (Slides, YouTube, Bilibili)

3:40 - 4:00    Coffee Break

4:00 - 5:00    Self-supervised Learning presented by Licheng Yu, Linjie Li, and Yen-Chun Chen (Slides, YouTube, Bilibili)


Contact: Zhe Gan for questions or further details

Website made by Rohit Pillai