![](https://static.wixstatic.com/media/77d769_3308abf26af544649783945462d83d53~mv2.png/v1/fill/w_214,h_185,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/77d769_3308abf26af544649783945462d83d53~mv2.png)
A Simple LLM Framework for Long-Range Video Question-Answering
Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, Gedas Bertasius
arXiv
![](https://static.wixstatic.com/media/77d769_d8e9d5718c614b05984b613d98d5067b~mv2.png/v1/fill/w_214,h_146,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/77d769_d8e9d5718c614b05984b613d98d5067b~mv2.png)
Video ReCap: Recursive Captioning of Hour-Long Videos
Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius
CVPR 2024
[arxiv] [project website] [code] [dataset] [bibtex]
![](https://static.wixstatic.com/media/77d769_316a2e573cb74da1a130600597fa3b11~mv2.png/v1/fill/w_232,h_151,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/77d769_316a2e573cb74da1a130600597fa3b11~mv2.png)
Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Gedas Bertasius, ..., Michael Wray
CVPR 2024
![](https://static.wixstatic.com/media/77d769_2be3e3aa54234fe4b751bfb87ff559a2~mv2.png/v1/fill/w_222,h_141,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/77d769_2be3e3aa54234fe4b751bfb87ff559a2~mv2.png)
LoCoNet: Long-Short Context Network for Active Speaker Detection
Xizi Wang, Feng Cheng, Gedas Bertasius, David Crandall
CVPR 2024
![](https://static.wixstatic.com/media/77d769_abaac9d6968041bda160257bbbaf32fb~mv2.png/v1/fill/w_222,h_160,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/77d769_abaac9d6968041bda160257bbbaf32fb~mv2.png)
Unified Coarse-to-Fine Alignment for Video-Text Retrieval
Ziyang Wang, Yi-Lin Sung, Feng Cheng, Gedas Bertasius, Mohit Bansal
ICCV 2023
![](https://static.wixstatic.com/media/77d769_e04b81d46d714aec80cb27eea2f95690~mv2.png/v1/fill/w_56,h_30,al_c,q_85,usm_0.66_1.00_0.01,blur_2,enc_auto/77d769_e04b81d46d714aec80cb27eea2f95690~mv2.png)
SimpleClick: Interactive Image Segmentation with Simple Vision Transformers
Qin Liu, Zhenlin Xu, Gedas Bertasius, Marc Niethammer
ICCV 2023
![](https://static.wixstatic.com/media/77d769_26ed215911024baf8f5e661dd0e7c96f~mv2.jpeg/v1/fill/w_56,h_37,al_c,q_80,usm_0.66_1.00_0.01,blur_2,enc_auto/77d769_26ed215911024baf8f5e661dd0e7c96f~mv2.jpeg)
VindLU: A Recipe for Effective Video-and-Language Pretraining
Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Gedas Bertasius
CVPR 2023
![](https://static.wixstatic.com/media/77d769_87376f86e0df4b2782de914d2ffb9d7e~mv2.png/v1/fill/w_52,h_46,al_c,q_85,usm_0.66_1.00_0.01,blur_2,enc_auto/77d769_87376f86e0df4b2782de914d2ffb9d7e~mv2.png)
Vision Transformers are Parameter-Efficient Audio-Visual Learners
Yan-Bo Lin, Yi-Lin Sung, Jie Lei, Mohit Bansal, Gedas Bertasius
CVPR 2023
[arxiv] [code] [project page] [bibtex]
![](https://static.wixstatic.com/media/77d769_d8590f49a78e43b08a77e8ae66a0d0c6~mv2.png/v1/fill/w_58,h_46,al_c,q_85,usm_0.66_1.00_0.01,blur_2,enc_auto/77d769_d8590f49a78e43b08a77e8ae66a0d0c6~mv2.png)
Efficient Movie Scene Detection using State-Space Transformers
Md Mohaiminul Islam, Mahmudul Hasan, Kishan Athrey, Tony Braskich, Gedas Bertasius
CVPR 2023
![](https://static.wixstatic.com/media/77d769_11c748838a5f437b9855211b67c58663~mv2.png/v1/fill/w_53,h_47,al_c,q_85,usm_0.66_1.00_0.01,blur_2,enc_auto/77d769_11c748838a5f437b9855211b67c58663~mv2.png)
Improving Video Retrieval Using Multilingual Knowledge Transfer
Avinash Madasu, Estelle Aflalo, Gabriela Ben Melech Stan, Shao-Yen Tseng, Gedas Bertasius, Vasudev Lal
ECIR 2023 (Best Student Paper Award)
[arxiv]
![](https://static.wixstatic.com/media/77d769_9856b7b325ca4c3fa3120440dd97fd71~mv2.png/v1/fill/w_58,h_28,al_c,q_85,usm_0.66_1.00_0.01,blur_2,enc_auto/77d769_9856b7b325ca4c3fa3120440dd97fd71~mv2.png)
Learning to Retrieve Videos by Asking Questions
Avinash Madasu, Junier Oliva, Gedas Bertasius
ACM Multimedia 2022
![](https://static.wixstatic.com/media/77d769_81da5332c9fb47daae009cb7145c2203~mv2.png/v1/fill/w_59,h_33,al_c,q_85,usm_0.66_1.00_0.01,blur_2,enc_auto/77d769_81da5332c9fb47daae009cb7145c2203~mv2.png)
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
Yan-Bo Lin, Jie Lei, Mohit Bansal, Gedas Bertasius
ECCV 2022 (Oral)
[arxiv] [code] [project page] [bibtex]
![](https://static.wixstatic.com/media/77d769_3db7aeb9692b4a6fb066485c294226a8~mv2.png/v1/fill/w_59,h_37,al_c,q_85,usm_0.66_1.00_0.01,blur_2,enc_auto/77d769_3db7aeb9692b4a6fb066485c294226a8~mv2.png)
TALLFormer: Temporal Action Localization with a Long-memory Transformer
Feng Cheng, Gedas Bertasius
ECCV 2022
![](https://static.wixstatic.com/media/77d769_a9705b39197a48049b742b04756ad587~mv2.png/v1/fill/w_60,h_30,al_c,q_85,usm_0.66_1.00_0.01,blur_2,enc_auto/77d769_a9705b39197a48049b742b04756ad587~mv2.png)
Long Movie Clip Classification with State-Space Video Models
Md Mohaiminul Islam, Gedas Bertasius
ECCV 2022
![](https://static.wixstatic.com/media/77d769_793a97094a5749fa9dfdc1cc33fa105d~mv2.png/v1/fill/w_60,h_37,al_c,q_85,usm_0.66_1.00_0.01,blur_2,enc_auto/77d769_793a97094a5749fa9dfdc1cc33fa105d~mv2.png)
Learning To Recognize Procedural Activities with Distant Supervision
Xudong Lin, Fabio Petroni, Gedas Bertasius, Marcus Rohrbach, Shih-Fu Chang, Lorenzo Torresani
CVPR 2022
[arxiv] [code] [project page] [bibtex]
![](https://static.wixstatic.com/media/77d769_2a5b61498f9b458f9e3467154636c7d8~mv2.png/v1/fill/w_59,h_28,al_c,q_85,usm_0.66_1.00_0.01,blur_2,enc_auto/77d769_2a5b61498f9b458f9e3467154636c7d8~mv2.png)
Long-Short Temporal Contrastive Learning of Video Transformers
Jue Wang, Gedas Bertasius, Du Tran, Lorenzo Torresani
CVPR 2022
![](https://static.wixstatic.com/media/77d769_b14924d16e3547bba30ab12b900d5705~mv2.gif)
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius, Heng Wang, Lorenzo Torresani
ICML 2021
[arxiv] [code] [talk] [slides] [Facebook AI Blog] [VentureBeat] [SiliconAngle] [bibtex]
![](https://static.wixstatic.com/media/77d769_ef24203bc3214bc39bc47ee6922bf81b~mv2.png/v1/fill/w_66,h_40,al_c,q_85,usm_0.66_1.00_0.01,blur_2,enc_auto/77d769_ef24203bc3214bc39bc47ee6922bf81b~mv2.png)
Vx2Text: End-to-End Learning of Video-Based Text Generation from Multimodal Inputs
Xudong Lin, Gedas Bertasius, Jue Wang, Shih-Fu Chang, Devi Parikh, Lorenzo Torresani
CVPR 2021
[arxiv] [VentureBeat] [bibtex]
![](https://static.wixstatic.com/media/77d769_380821f912cf487eb55540475428d264~mv2.png/v1/fill/w_64,h_30,al_c,q_85,usm_0.66_1.00_0.01,blur_2,enc_auto/wacv21_cover.png)
Supervoxel Attention Graphs for Long-Range Video Modeling
Yang Wang, Gedas Bertasius, Tae-Hyun Oh, Abhinav Gupta, Minh Hoai, Lorenzo Torresani
WACV 2021
![](https://static.wixstatic.com/media/77d769_76bd3373c77d43ad8f102bc139f9fe2d~mv2.png/v1/fill/w_60,h_33,al_c,q_85,usm_0.66_1.00_0.01,blur_2,enc_auto/77d769_76bd3373c77d43ad8f102bc139f9fe2d~mv2.png)
COBE: Contextualized Object Embeddings from Narrated Instructional Video
Gedas Bertasius, Lorenzo Torresani
NeurIPS 2020
[arxiv] [talk] [slides] [HowTo100M_BB pseudo annotations] [bibtex]
![](https://static.wixstatic.com/media/77d769_acfc42203a0546a7b11b7e1b81f625bf~mv2.png/v1/fill/w_59,h_43,al_c,q_85,usm_0.66_1.00_0.01,blur_2,enc_auto/77d769_acfc42203a0546a7b11b7e1b81f625bf~mv2.png)
Attentive Action and Context Factorization
Yang Wang, Vinh Tran, Gedas Bertasius, Lorenzo Torresani, Minh Hoai
BMVC 2020
[arxiv]
![](https://static.wixstatic.com/media/77d769_618b3950515541abae2c4a2ef7657d89~mv2.gif)
Classifying, Segmenting, and Tracking Objects in Video with Mask Propagation
Gedas Bertasius, Lorenzo Torresani
CVPR 2020 (Best Paper Nominee)
Ranked 1st on YouTube-VIS Leaderboard and EPIC-Kitchens Detection Challenge.
[arxiv] [talk] [slides] [bibtex]
![](https://static.wixstatic.com/media/77d769_08a5244eee874848aba6a73005c5bca0~mv2.gif)
Learning Temporal Pose Estimation from Sparsely-Labeled Videos
Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani
NeurIPS 2019
Ranked 1st on PoseTrack Leaderboard for multi-frame pose estimation.
[arxiv] [poster] [code] [bibtex]
![](https://static.wixstatic.com/media/77d769_35eeede8a1de445db67393ff6d225530~mv2.gif)
Object Detection in Video with Spatiotemporal Sampling Networks
Gedas Bertasius, Lorenzo Torresani, Jianbo Shi
ECCV 2018
[arxiv] [results] [bibtex]
![cvpr18_cover.gif](https://static.wixstatic.com/media/77d769_1e22e30023ef47739d6d212472a70aae~mv2.gif)
Egocentric Basketball Motion Planning from a Single First-Person Image
Gedas Bertasius, Aaron Chan, Jianbo Shi
CVPR 2018
[arxiv] [results] [MIT SSAC Poster] [bibtex]
![iccv17_baller_cover.gif](https://static.wixstatic.com/media/77d769_7217f6590c7a440aa120d3c43e490d05~mv2.gif)
Am I a Baller? Basketball Performance Assessment from First-Person Videos
Gedas Bertasius, Stella X. Yu, Hyun Soo Park, Jianbo Shi
ICCV 2017
[arxiv] [results] [bibtex]
![](https://static.wixstatic.com/media/77d769_5cff56bf880c4381af5c8f9a7dcb0c4d~mv2.gif)
Unsupervised Learning of Important Objects from First-Person Videos
Gedas Bertasius, Hyun Soo Park, Stella X. Yu, Jianbo Shi
ICCV 2017
[arxiv] [bibtex]
![](https://static.wixstatic.com/media/77d769_6e74f82aa7e54c89863392117dc90d60~mv2.gif)
Convolutional Random Walk Networks for Semantic Image Segmentation
Gedas Bertasius, Lorenzo Torresani, Stella X. Yu, Jianbo Shi
CVPR 2017
[arxiv] [bibtex]
![](https://static.wixstatic.com/media/77d769_e300ec1d468f49709fae52c149d1c38d~mv2.gif)
First-Person Action-Object Detection with EgoNet
Gedas Bertasius, Hyun Soo Park, Stella X. Yu, Jianbo Shi
RSS 2017
[arxiv] [New Scientist Article] [Impact Article] [results] [bibtex]
![](https://static.wixstatic.com/media/77d769_2ea7f15d85bc4489b2a00489931fcafb~mv2.png/v1/fill/w_64,h_41,al_c,q_85,usm_0.66_1.00_0.01,blur_2,enc_auto/77d769_2ea7f15d85bc4489b2a00489931fcafb~mv2.png)
Local Perturb-and-MAP for Structured Prediction
Gedas Bertasius, Qiang Liu, Lorenzo Torresani, Jianbo Shi
AISTATS 2017
[arxiv] [bibtex]
![](https://static.wixstatic.com/media/77d769_ddc38215c6a5437987413437853eb989~mv2.png/v1/crop/x_6,y_19,w_487,h_328/fill/w_64,h_43,al_c,q_85,usm_0.66_1.00_0.01,blur_2,enc_auto/77d769_ddc38215c6a5437987413437853eb989~mv2.png)
Semantic Segmentation with Boundary Neural Fields
Gedas Bertasius, Jianbo Shi, Lorenzo Torresani
CVPR 2016
[arxiv] [code] [bibtex]
![](https://static.wixstatic.com/media/77d769_5c1f49c9be5a445db21c1183318b7352~mv2.png/v1/fill/w_64,h_43,al_c,q_85,usm_0.66_1.00_0.01,blur_2,enc_auto/77d769_5c1f49c9be5a445db21c1183318b7352~mv2.png)
High-for-Low, Low-for-High: Efficient Boundary Detection from Deep Object Features and its Applications to High-Level Vision
Gedas Bertasius, Jianbo Shi, Lorenzo Torresani
ICCV 2015
[arxiv] [code] [bibtex]
![](https://static.wixstatic.com/media/77d769_cb40f7d25d0948ee937308d1132de6c1~mv2.png/v1/fill/w_64,h_43,al_c,q_85,usm_0.66_1.00_0.01,blur_2,enc_auto/77d769_cb40f7d25d0948ee937308d1132de6c1~mv2.png)
DeepEdge: A Multi-Scale Bifurcated Deep Network for Top-Down Contour Detection
Gedas Bertasius, Jianbo Shi, Lorenzo Torresani
CVPR 2015
[arxiv] [bibtex]