I am a first-year PhD student at the University of Bristol working with Prof. Dima Damen. My research interests are Computer Vision, Pattern Recognition, and Machine Learning. Currently, I am devising learning-based methods for understanding and exploring various aspects of first-person (egocentric) vision. Previously, at CVIT, IIIT Hyderabad, I worked with Prof. C.V. Jawahar and Prof. Chetan Arora on unsupervised procedure learning from egocentric videos. Earlier, I worked on improving word recognition and retrieval in large document collections with Prof. C.V. Jawahar and on 3D Computer Vision with Prof. Shanmuganathan Raman.

My ultimate goal is to contribute to the development of systems capable of understanding the world as we do. I’m an inquisitive person, and I’m always willing to learn about fields including, but not limited to, science, technology, astrophysics, and physics.

CV / Google Scholar / GitHub / LinkedIn / arXiv / ORCID


June, 2023 : Co-organising the Joint International 3rd Ego4D and 11th EPIC Workshop @ CVPR 2023.

April, 2023 : Joining the University of Bristol as a PhD student. I will be working with Prof. Dima Damen on egocentric video understanding.

Feb, 2023 : Successfully defended my master’s thesis, I-Do, You-Learn: Techniques for Unsupervised Procedure Learning using Egocentric Videos.

Dec, 2022 : Gave a talk on Procedure Learning from Egocentric Videos at ICVGIP 2022 (Vision India), hosted by Prof. Shanmuganathan Raman and Dr. Rajendra Nagar.

Nov, 2022 : Gave a talk on Procedure Learning from Egocentric Videos at Computer Vision Centre, Universitat Autònoma de Barcelona, hosted by Prof. Dimosthenis Karatzas.

Nov, 2022 : Gave a talk on Procedure Learning from Egocentric Videos at University of Catania, hosted by Prof. Giovanni Maria Farinella and Dr. Antonino Furnari.

Nov, 2022 : Co-organised the 2nd International Ego4D Workshop @ ECCV 2022.

See all news

Machine Learning and Computer Vision (MaVi) @ University of Bristol


My View is the Best View: Procedure Learning from Egocentric Videos

We propose the EgoProceL dataset consisting of 62 hours of videos captured by 130 subjects performing 16 tasks and a self-supervised Correspond and Cut (CnC) framework for procedure learning. CnC utilizes the temporal correspondences between the key-steps across multiple videos to learn the procedure.

Siddhant Bansal, Chetan Arora, C.V. Jawahar

European Conference on Computer Vision (ECCV), 2022

Paper / Download the EgoProceL Dataset / Project Page / Code

Ego4D: Around the World in 3,000 Hours of Egocentric Video

We offer 3,670 hours of daily-life activity video spanning hundreds of scenarios captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries.
We present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities).

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Christian Fuegen, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei Huang, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik

Conference on Computer Vision and Pattern Recognition (CVPR), 2022 (ORAL; Best paper finalist)

Paper / Project Page / Video / Benchmark’s description / EPIC@ICCV2021 Ego4D Reveal Session

Improving Word Recognition using Multiple Hypotheses and Deep Embeddings

We propose to fuse recognition-based and recognition-free approaches for word recognition using learning-based methods.

Siddhant Bansal, Praveen Krishnan, and C.V. Jawahar

International Conference on Pattern Recognition (ICPR), 2020

Paper / Project Page / Code (GitHub) / Video (YouTube)

Fused Text Recogniser and Deep Embeddings Improve Word Recognition and Retrieval

Fusing recognition-based and recognition-free approaches using rule-based methods for improving word recognition and retrieval.

Siddhant Bansal, Praveen Krishnan, and C.V. Jawahar

IAPR International Workshop on Document Analysis Systems (DAS), 2020 (ORAL)

Paper / Demo / Project Page / Code (GitHub) / Poster

See all publications


Invited Talks

Workshop Organizer

Conference Reviewer

  • CVPR 2022, 2023
  • ICCV 2023
  • ECCV 2022
  • WACV 2023

Journal Reviewer

  • IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI)
  • Computer Vision and Image Understanding (CVIU)

Workshop Reviewer



Deep Future Gaze: Gaze Anticipation on Egocentric Videos Using Adversarial Networks

This paper proposes a Generative Adversarial Network (GAN) based architecture called Deep Future Gaze (DFG) for addressing the task of gaze anticipation in egocentric videos.

Link to the article!

First Person Action Recognition Using Deep Learned Descriptors

This paper proposes a three-stream convolutional neural network architecture for the task of action recognition in first-person videos.

Link to the article!

Two-Stream Convolutional Networks for Action Recognition in Videos

This paper proposes a two-stream convolutional neural network architecture for the task of action recognition in a video.

Link to the article!


See all articles