Keynote Speaker #1

Professor 賴尚宏



Deep Multimodal Learning for Computer Vision and AI


Deep multimodal learning has attracted increasing attention because it can exploit data from different modalities to boost the accuracy and robustness of deep neural network models. In this talk, I will present recent progress in computer vision research based on deep multimodal learning. In the first part, I will introduce multimodal learning approaches for training face recognition systems with enhanced accuracy, robustness, and generalization. In the second part, I will present visual-linguistic foundation models that have been applied to several combined vision-language problems, as well as to computer vision problems under zero-shot or few-shot settings. The talk will include research outcomes along these directions from my teams at NTHU and Microsoft in recent years.

Keynote Speaker #2

Professor/Director 王鈺強




Vision, Language, and Generative Models


The convergence of language, vision, and generative models is a captivating and rapidly advancing research domain. In this talk, we will delve into the intricate interplay among these disciplines, showcasing how generative models have sparked a revolution in creative and analytical applications. We will explore the mechanisms behind models' ability to decode images into text and vice versa, shedding light on their potential to reshape human-machine interaction. We will also introduce a number of our ongoing research directions in vision and language, and discuss the challenges and emerging opportunities in this field.