Tutorial

Overview

Vision-and-Language (V+L) research sits at the nexus of Computer Vision and Natural Language Processing and has attracted rapidly growing attention from both communities. A variety of V+L tasks, benchmarked on large-scale human-annotated datasets, have driven tremendous progress in joint multimodal representation learning. This tutorial focuses on several recently popular tasks in this domain: visual captioning, visual grounding, visual question answering and reasoning, text-to-image generation, and self-supervised learning for universal image-text representations. We will cover state-of-the-art approaches in each of these areas and discuss the key principles behind the core challenges and opportunities in multimodal understanding, reasoning, and generation.

Agenda

1:15 - 1:25 Opening Remarks, presented by JJ Liu and Xiaodong He (Slides, YouTube, Bilibili)

1:25 - 2:15 Visual QA and Reasoning, presented by Zhe Gan (Slides, YouTube, Bilibili)

2:15 - 2:30 Coffee Break

2:30 - 3:10 Visual Captioning, presented by Luowei Zhou (Slides, YouTube, Bilibili)

3:10 - 3:40 Text-to-image Synthesis, presented by Yu Cheng (Slides, YouTube, Bilibili)

3:40 - 4:00 Coffee Break

4:00 - 5:00 Self-supervised Learning, presented by Licheng Yu, Linjie Li, and Yen-Chun Chen (Slides, YouTube, Bilibili)

Recent Advances in Vision-and-Language Research CVPR 2020 Tutorial
Time: 06/15/2020, 1:15 - 5:00 PM PDT
Location: Zoom
