EgoOops: A Dataset for Mistake Action Detection from Egocentric Videos with Procedural Texts

Bibliographic Details
Main Authors Haneji, Yuto, Nishimura, Taichi, Kameko, Hirotaka, Shirai, Keisuke, Yoshida, Tomoya, Kajimura, Keiya, Yamamoto, Koki, Cui, Taiyu, Nishimoto, Tomohiro, Mori, Shinsuke
Format Journal Article
Language English
Published 07.10.2024

Abstract Mistake action detection from egocentric videos is crucial for developing intelligent archives that detect workers' errors and provide feedback. Previous studies have been limited to specific domains, focused on detecting mistakes from videos without procedural texts, and analyzed whether actions are mistakes. To address these limitations, in this paper, we propose the EgoOops dataset, which includes egocentric videos, procedural texts, and three types of annotations: video-text alignment, mistake labels, and descriptions for mistakes. EgoOops covers five procedural domains and includes 50 egocentric videos. The video-text alignment allows the model to detect mistakes based on both videos and procedural texts. The mistake labels and descriptions enable detailed analysis of real-world mistakes. Based on EgoOops, we tackle two tasks: video-text alignment and mistake detection. For video-text alignment, we enhance the recent StepFormer model with an additional loss for fine-tuning. Based on the alignment results, we propose a multi-modal classifier to predict mistake labels. In our experiments, the proposed methods achieve higher performance than the baselines. In addition, our ablation study demonstrates the effectiveness of combining videos and texts. We will release the dataset and codes upon publication.
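The abstract describes a multi-modal classifier that predicts mistake labels from aligned video segments and procedure-step text. The paper's actual architecture is not given in this record; as an illustrative sketch only (the embedding sizes, concatenation-based late fusion, and the binary label set are all assumptions), such a classifier might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

VIDEO_DIM, TEXT_DIM = 512, 384   # assumed embedding sizes, not from the paper
LABELS = ["correct", "mistake"]  # simplified label set (assumption)

# Hypothetical fused classifier: concatenate the two modalities,
# then apply a single linear layer followed by a softmax.
W = rng.normal(0.0, 0.01, (VIDEO_DIM + TEXT_DIM, len(LABELS)))
b = np.zeros(len(LABELS))

def predict(video_feat: np.ndarray, text_feat: np.ndarray) -> np.ndarray:
    """Return a probability distribution over mistake labels."""
    fused = np.concatenate([video_feat, text_feat])  # late fusion
    logits = fused @ W + b
    exp = np.exp(logits - logits.max())              # numerically stable softmax
    return exp / exp.sum()

probs = predict(rng.normal(size=VIDEO_DIM), rng.normal(size=TEXT_DIM))
```

The sketch only shows the fusion idea the abstract alludes to (combining video and text features outperforming either alone); the paper's model is trained end-to-end on the alignment results, which this toy example does not reproduce.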
Copyright http://arxiv.org/licenses/nonexclusive-distrib/1.0
DOI 10.48550/arxiv.2410.05343
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed false
IsScholarly false
OpenAccessLink https://arxiv.org/abs/2410.05343
PublicationDate 2024-10-07
SecondaryResourceType preprint
SourceType Open Access Repository
SubjectTerms Computer Science - Artificial Intelligence
Computer Science - Computation and Language
Computer Science - Computer Vision and Pattern Recognition
URI https://arxiv.org/abs/2410.05343