Siddhant Bansal

I am a third-year PhD student at the University of Bristol working with Prof. Dima Damen. My research interests lie in Computer Vision, Pattern Recognition, and Machine Learning. Currently, I am working on devising learning-based methods for understanding and exploring various aspects of first-person (egocentric) vision. Previously, at CVIT, IIIT Hyderabad, I worked with Prof. C.V. Jawahar and Prof. Chetan Arora on unsupervised procedure learning from egocentric videos. Earlier, I worked on improving word recognition and retrieval in large document collections with Prof. C.V. Jawahar and on 3D Computer Vision with Prof. Shanmuganathan Raman.

My ultimate goal is to contribute to the development of systems capable of understanding the world as we do. I’m an inquisitive person, and I’m always willing to learn about fields including, but not limited to, science, technology, astrophysics, and physics.

CV / Google Scholar / Github / LinkedIn / arXiv / ORCID

News

Feb, 2025 : HD-EPIC: A Highly-Detailed Egocentric Video Dataset got accepted to CVPR 2025!

Feb, 2025 : Introducing HD-EPIC, A Highly-Detailed Egocentric Video Dataset featuring ground-truth covering recipe steps, fine-grained actions, ingredients with nutritional values, moving objects, and audio annotations all grounded in 3D!

Oct, 2024 : Our survey paper “An Outlook into the Future of Egocentric Vision” appeared in its final form in IJCV, Vol. 132.

June, 2024 : Co-organising the First Joint Egocentric Vision (EgoVis) Workshop at CVPR 2024 in Seattle. Come join us in Summit 428!

April, 2024 : An Outlook into the Future of Egocentric Vision got accepted to IJCV (PDF)!

April, 2024 : Introducing HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision. A new task to understand hand-object interaction using VLMs.

March, 2024 : Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives got accepted to CVPR 2024!

See all news

Machine Learning and Computer Vision (MaVi) @ University of Bristol

Publications

HD-EPIC: A Highly-Detailed Egocentric Video Dataset

A validation dataset of kitchen-based egocentric videos with detailed, interconnected annotations on recipe steps, actions, ingredients, objects, audio, and 3D-grounded scene elements via digital twinning and gaze. Annotations include nutritional values, object locations, and fixture details.

Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Parida, Kaiting Liu, Prajwal Gatti, Siddhant Bansal, Kevin Flanagan, Jacob Chalk, Zhifan Zhu, Rhodri Guerrier, Fahd Abdelazim, Bin Zhu

Conference on Computer Vision and Pattern Recognition (CVPR), 2025

Paper (ArXiv) / Project Page / Download HD-EPIC / Explore Samples / Video (YouTube)

HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision

We propose the HOI-Ref task to understand hand-object interaction using Vision Language Models (VLMs). We introduce the HOI-QA dataset with 3.9M question-answer pairs for training and evaluating VLMs. Finally, we train the first VLM for HOI-Ref, achieving state-of-the-art performance.

Siddhant Bansal, Michael Wray, Dima Damen

Paper / Download the HOI-QA Dataset / Project Page / Code

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Ego-Exo4D is a diverse, large-scale, multi-modal, multi-view video dataset and benchmark collected across 13 cities worldwide by 839 camera wearers, capturing 1422 hours of video of skilled human activities. We present three synchronized natural language datasets paired with the videos: (1) expert commentary, (2) participant-provided narrate-and-act, and (3) one-sentence atomic action descriptions. Finally, our camera configuration features Aria glasses for ego capture, time-synchronized with 4-5 stationary GoPros as the exo capture devices.

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei Huang, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray

Conference on Computer Vision and Pattern Recognition (CVPR), 2024

Paper / Project Page / Video / Meta blog post

United We Stand, Divided We Fall: UnityGraph for Unsupervised Procedure Learning from Videos

We propose the Graph-based Procedure Learning (GPL) framework for procedure learning. GPL creates a novel UnityGraph that represents all the task videos as a graph to encode both intra-video and inter-video context. We achieve an improvement of 2% on third-person datasets and 3.6% on EgoProceL.

Siddhant Bansal, Chetan Arora, C.V. Jawahar

Winter Conference on Applications of Computer Vision (WACV), 2024

Paper / Download the EgoProceL Dataset / Project Page / Video

An Outlook into the Future of Egocentric Vision

The survey looks at the difference between what we’re studying now in egocentric vision and what we expect in the future. We imagine the future using stories and connect them to current research. We highlight problems, analyze current progress, and suggest areas to explore in egocentric vision, aiming for a future that’s always on, personalized, and improves our lives.

Chiara Plizzari*, Gabriele Goletto*, Antonino Furnari*, Siddhant Bansal*, Francesco Ragusa*, Giovanni Maria Farinella, Dima Damen, Tatiana Tommasi

International Journal of Computer Vision (IJCV)

Project Page / Paper (IJCV) / Paper + comments (OpenReview) / Paper (arXiv)

See all publications

Miscellaneous

Invited Talks

Egocentric Videos for Procedure Learning @ Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP 2022) (Vision India). [slides; tweet; linkedin]
Egocentric Videos for Procedure Learning @ IPLAB, University of Catania [slides; tweet]
Egocentric Videos for Procedure Learning @ Computer Vision Centre, Universitat Autònoma de Barcelona [slides; tweet]

Workshop Organizer

First Joint Egocentric Vision (EgoVis) Workshop @ CVPR 2024

Conference Reviewer

CVPR 2022, 2023, 2024
ICCV 2023, 2025
ECCV 2022, 2024
WACV 2023, 2024
ICRA 2025

Journal Reviewer

IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI)
International Journal of Computer Vision (IJCV)
Computer Vision and Image Understanding (CVIU)

Workshop Reviewer

Joint 1st Ego4D and 10th EPIC Workshop @ CVPR 2022

Thesis

Masters: I-Do, You-Learn: Techniques for Unsupervised Procedure Learning using Egocentric Videos