[Figure: "OOPS!" denotes an annotated failure moment and is not present in the actual videos.]
Download (Videos and annotations, 45GB)
Optical flow frames (1019 GB)
Natural language descriptions (new!) (11 MB)
Pre-trained models (new!) (697 MB)
By clicking any of the links above, you acknowledge that we do not own the copyright to these videos and that they are provided solely for non-commercial research and/or educational purposes. This dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


We present the _Oops!_ dataset for studying unintentional human action. The dataset consists of 20,723 videos from YouTube fail compilation videos, adding up to over 50 hours of data. These clips, filmed by amateur videographers in the real world, are diverse in action, environment, and intention. The dataset covers many causes of failure and unintentional action, including physical and social errors, errors in planning and execution, limited agent skill, knowledge, or perceptual ability, and environmental factors.

Dataset statistics

We summarize our dataset with (a) the distribution of clip lengths, (b) the distribution of temporal locations where failure starts, and (c) the standard deviation between human annotators. The median and mean clip lengths are 7.6 and 9.4 seconds, respectively. The median standard deviation of the labels across three workers is 6.6% of the video duration, about half a second, suggesting high agreement. We also show the distribution of (d) action categories and (e) scene categories, which naturally have long tails. For legibility, we display only the 5 most and 5 least common classes for each.
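The annotator-agreement statistic above can be sketched in a few lines. This is a hypothetical example with toy data: the real Oops! annotation files may use a different format, but the computation (per-clip standard deviation of the three workers' failure-onset labels, normalized by clip duration, then the median over clips) is the same.

```python
# Hypothetical sketch of the annotator-agreement statistic.
# Annotation format is assumed: clip id -> (duration in seconds,
# failure-onset time labeled by each of three workers).
import statistics

annotations = {
    "clip_0001": (9.4, [3.1, 3.3, 3.0]),  # toy values, not real labels
    "clip_0002": (7.6, [5.0, 5.6, 5.2]),
}

def relative_std(duration, onsets):
    # Population std of the workers' labels, as a fraction of clip length.
    return statistics.pstdev(onsets) / duration

rel_stds = [relative_std(dur, onsets) for dur, onsets in annotations.values()]
median_rel_std = statistics.median(rel_stds)
print(f"median relative std: {median_rel_std:.3f}")
```

On the full dataset, this median comes out to 6.6% of the video duration, as reported above.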


We present the _Oops!_ dataset along with various baseline models in a paper available on arXiv. If you use our dataset, please cite:
@article{epstein2019oops,
  title={Oops! Predicting Unintentional Action in Video},
  author={Epstein, Dave and Chen, Boyuan and Vondrick, Carl},
  journal={arXiv preprint arXiv:1911.11206},
  year={2019}
}


Dave Epstein Boyuan Chen Carl Vondrick