Siddhant Bansal Chetan Arora C.V. Jawahar
Given multiple videos of a task, the goal is to identify the key-steps required to perform the task and the order in which they should be performed.
In third-person videos, the manipulated object appears small and is often occluded by the actor, leading to significant errors in procedure learning. In contrast, we observe that videos obtained from first-person (egocentric) wearable cameras provide an unobstructed and clear view of the action.
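As a toy illustration of the task (the labels and function below are hypothetical and do not reflect EgoProceL's annotation format), each video can be viewed as a per-frame key-step assignment, and procedure learning amounts to recovering the key-steps and their common order across videos:

# Hypothetical per-frame key-step labels for two videos of the same task;
# 0 marks background frames that belong to no key-step.
video_1 = [0, 0, 1, 1, 1, 0, 2, 2, 3, 3, 0]
video_2 = [1, 1, 0, 2, 2, 2, 0, 0, 3, 3, 3]

def key_step_order(frame_labels):
    """Return the order in which key-steps appear, ignoring background."""
    order = []
    for label in frame_labels:
        if label != 0 and (not order or order[-1] != label):
            order.append(label)
    return order

print(key_step_order(video_1))  # [1, 2, 3]
print(key_step_order(video_2))  # [1, 2, 3] -> both videos share the same procedure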
EgoProceL consists of 62 hours of egocentric videos recorded by 130 subjects performing 16 tasks, along with key-step annotations for the videos.
A portion of EgoProceL consists of videos from the following datasets:
- CMU-MMAC
- EGTEA Gaze+
- MECCANO
- EPIC-Tent
We recommend referring to the README before downloading the videos. Mirror link.
Link: ~OneDrive~ (expired) Google Drive
We present a novel self-supervised Correspond and Cut (CnC) framework for procedure learning. CnC identifies and utilizes the temporal correspondences between the key-steps across multiple videos to learn the procedure. Our experiments show that CnC outperforms the state-of-the-art on the benchmark ProceL and CrossTask datasets by 5.2% and 6.3%, respectively.
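A minimal sketch of the correspondence idea follows (this is not the official CnC implementation; the function names, temperature value, and tensor shapes are illustrative assumptions): given per-frame embeddings of two videos of the same task, soft nearest-neighbour matching in embedding space yields temporal correspondences, and a cycle-consistency score highlights frames that likely belong to a shared key-step.

import torch
import torch.nn.functional as F

def soft_correspondence(emb_a: torch.Tensor, emb_b: torch.Tensor, temperature: float = 0.1):
    """For every frame embedding of video A, compute a soft nearest
    neighbour over the frames of video B (and vice versa)."""
    # Cosine similarity between all frame pairs: shape (Ta, Tb)
    sim = F.normalize(emb_a, dim=1) @ F.normalize(emb_b, dim=1).T
    a_to_b = (sim / temperature).softmax(dim=1)    # soft match A -> B
    b_to_a = (sim.T / temperature).softmax(dim=1)  # soft match B -> A
    return a_to_b, b_to_a

def cycle_consistency_score(a_to_b: torch.Tensor, b_to_a: torch.Tensor):
    """Probability that a frame of A returns to itself after the
    round trip A -> B -> A; high values suggest a shared key-step."""
    round_trip = a_to_b @ b_to_a  # shape (Ta, Ta)
    return round_trip.diagonal()

if __name__ == "__main__":
    emb_a = torch.randn(200, 128)  # 200 frames of video A, 128-d embeddings
    emb_b = torch.randn(180, 128)  # 180 frames of video B
    a2b, b2a = soft_correspondence(emb_a, emb_b)
    print(cycle_consistency_score(a2b, b2a).mean())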
The code for this work is available on GitHub!
Link: Sid2697/EgoProceL-egocentric-procedure-learning
This work was supported in part by the Department of Science and Technology, Government of India, under DST/ICPS/Data-Science project ID T-138. A portion of the data used in this paper was obtained from kitchen.cs.cmu.edu and the data collection was funded in part by the National Science Foundation under Grant No. EEEC-0540865. We acknowledge Pravin Nagar and Sagar Verma for recording and sharing the PC Assembly and Disassembly videos at IIIT Delhi. We also acknowledge Jehlum Vitasta Pandit and Astha Bansal for their help with annotating a portion of EgoProceL.
Please consider citing the following works if you make use of the EgoProceL dataset:
@InProceedings{EgoProceLECCV2022,
author = "Bansal, Siddhant and Arora, Chetan and Jawahar, C.V.",
title = "My View is the Best View: Procedure Learning from Egocentric Videos",
booktitle = "European Conference on Computer Vision (ECCV)",
year = "2022"
}
@InProceedings{CMU_Kitchens,
author = "De La Torre, F. and Hodgins, J. and Bargteil, A. and Martin, X. and Macey, J. and Collado, A. and Beltran, P.",
title = "Guide to the Carnegie Mellon University Multimodal Activity (CMU-MMAC) database.",
booktitle = "Robotics Institute",
year = "2008"
}
@InProceedings{egtea_gaze_p,
author = "Li, Yin and Liu, Miao and Rehg, James M.",
title = "In the Eye of Beholder: Joint Learning of Gaze and Actions in First Person Video",
booktitle = "European Conference on Computer Vision (ECCV)",
year = "2018"
}
@InProceedings{meccano,
author = "Ragusa, Francesco and Furnari, Antonino and Livatino, Salvatore and Farinella, Giovanni Maria",
title = "The MECCANO Dataset: Understanding Human-Object Interactions From Egocentric Videos in an Industrial-Like Domain",
booktitle = "Winter Conference on Applications of Computer Vision (WACV)",
year = "2021"
}
@InProceedings{tent,
author = "Jang, Youngkyoon and Sullivan, Brian and Ludwig, Casimir and Gilchrist, Iain and Damen, Dima and Mayol-Cuevas, Walterio",
title = "EPIC-Tent: An Egocentric Video Dataset for Camping Tent Assembly",
booktitle = "International Conference on Computer Vision (ICCV) Workshops",
year = "2019"
}