Event details
Apr 1
Multimodal AI - Video and temporal understanding
Vision-language models can process video and image sequences. They are also trained to ground their responses in time, allowing them to indicate when a particular event or shift occurs in a video. This session will explore the use of vision-language models for analyzing and interpreting moving images.
Image: Elise Racine & Digit / Woven Dialogues / Licensed by CC-BY 4.0
University programs and activities are open to all eligible participants without regard to identity or other protected characteristics. Sponsorship of an event does not constitute institutional endorsement of external speakers or views presented.
View physical accessibility information for campus buildings and find accessible routes using the Princeton Campus Map app.