Video Modeling
The role of video has grown tremendously: an estimated 3.1 billion people watch videos on the Internet daily. Our group develops new spatiotemporal models and representations for efficient and effective video data analysis.
Related Publications:
Long Movie Clip Classification with State-Space Video Models
Md Mohaiminul Islam, Gedas Bertasius
ECCV 2022
TALLFormer: Temporal Action Localization with a Long-memory Transformer
Feng Cheng, Gedas Bertasius
ECCV 2022
Long-Short Temporal Contrastive Learning of Video Transformers
Jue Wang, Gedas Bertasius, Du Tran, Lorenzo Torresani
CVPR 2022
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius, Heng Wang, Lorenzo Torresani
ICML 2021
[arxiv] [code] [talk] [slides] [blog] [VentureBeat] [SiliconAngle] [bibtex]
Multimodal Learning
Humans understand the world by processing signals from multiple modalities, such as speech, sound, and vision. Similarly, we aim to equip computational video models with multimodal processing capabilities to understand visual content, audio, speech, and other modalities.
Related Publications:
Siamese Vision Transformers are Scalable Audio-visual Learners
Yan-Bo Lin, Gedas Bertasius
ECCV 2024
A Simple LLM Framework for Long-Range Video Question-Answering
Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, Gedas Bertasius
EMNLP 2024
VindLU: A Recipe for Effective Video-and-Language Pretraining
Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Gedas Bertasius
CVPR 2023
Vision Transformers are Parameter-Efficient Audio-Visual Learners
Yan-Bo Lin, Yi-Lin Sung, Jie Lei, Mohit Bansal, Gedas Bertasius
CVPR 2023
[arxiv] [code] [project page] [bibtex]
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
Yan-Bo Lin, Jie Lei, Mohit Bansal, Gedas Bertasius
ECCV 2022 (Oral)
[arxiv] [code] [project page] [bibtex]
Virtual AI Assistants
Our group aims to develop AI systems that can help people with everyday tasks. Our work in this area includes analyzing human behavior in first-person videos, assisting people with procedural action planning, and understanding human skills from video.
Related Publications:
Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos
Md Mohaiminul Islam, Tushar Nagarajan, Huiyu Wang, Fu-Jen Chu, Kris Kitani, Gedas Bertasius, Xitong Yang
ECCV 2024 (Oral)
[arxiv] [project page] [bibtex]
Video ReCap: Recursive Captioning of Hour-Long Videos
Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius
CVPR 2024
[arxiv] [project website] [code] [dataset] [bibtex]
Learning To Recognize Procedural Activities with Distant Supervision
Xudong Lin, Fabio Petroni, Gedas Bertasius, Marcus Rohrbach, Shih-Fu Chang, Lorenzo Torresani
CVPR 2022
[arxiv] [code] [project page] [bibtex]
Unsupervised Learning of Important Objects from First-Person Videos
Gedas Bertasius, Hyun Soo Park, Stella X. Yu and Jianbo Shi
ICCV 2017
[arxiv] [bibtex]
CV for Basketball
The rapid growth of video broadcasting has made basketball one of the most widely watched sports in the world. It is a competitive, goal-oriented team sport that demands exceptional physical and technical skills as well as sophisticated strategic thinking. As a former basketball player, I am passionate about applying state-of-the-art computer vision models to basketball videos to deepen our understanding of this exciting game.
Related Publications:
Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Gedas Bertasius, ... , Michael Wray
CVPR 2024
Egocentric Basketball Motion Planning from a Single First-Person Image
Gedas Bertasius, Aaron Chan and Jianbo Shi
CVPR 2018
[arxiv] [results] [MIT SSAC Poster] [bibtex]
Am I a Baller? Basketball Performance Assessment from First-Person Videos
Gedas Bertasius, Stella X. Yu, Hyun Soo Park and Jianbo Shi
ICCV 2017
[arxiv] [results] [bibtex]