Oops! Predicting Unintentional Action in Video

[Figure: video frames over time, with segments labeled Intentional Action, Unintentional Action, and Predicted Failure.]

Did this person intend to fall into the water, or was it an accident? In this paper, we introduce a large in-the-wild video dataset of unintentional action. We define three tasks on this dataset: classifying the intentionality of action, localizing the transition from intentional to unintentional, and forecasting the onset of unintentional action.

From just a short glance at a video, we can often tell whether a person's action is intentional or not. Can we train a model to recognize this? We introduce a dataset of in-the-wild videos of unintentional action, as well as a suite of tasks for recognizing, localizing, and anticipating its onset. We train a supervised neural network as a baseline and analyze its performance compared to human consistency on the tasks.

We also investigate self-supervised representations that leverage natural signals in our dataset, and show that an approach based on the intrinsic speed of video performs competitively with highly supervised pretraining. However, a significant gap between machine and human performance remains.

Paper

arXiv PDF

@InProceedings{Epstein_2020_CVPR,
author = {Epstein, Dave and Chen, Boyuan and Vondrick, Carl},
title = {Oops! Predicting Unintentional Action in Video},
booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2020}
}

Dataset

Analyzing human goals from videos is a fundamental challenge in computer vision. Since people are usually competent, existing datasets are biased towards successful outcomes. This bias towards success makes discriminating and localizing visual intentionality difficult, both for learning and for quantitative evaluation.

We introduce a new annotated video dataset rich in unintentional action, collected by crawling publicly available “fail” videos from the web. The dataset is both large (over 50 hours of video) and diverse (covering hundreds of scenes and activities). We annotate each video with the temporal location at which it transitions from intentional to unintentional action (shown in the video as OOPS!).
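As an illustration of how a single transition timestamp could be turned into clip-level labels for the classification task, here is a minimal sketch. The function name `label_clip`, the label strings, and the boundary conventions are assumptions made for this example and are not taken from the released dataset or code.

```python
def label_clip(clip_start: float, clip_end: float, transition: float) -> str:
    """Label a clip by where it falls relative to the annotated transition time.

    All times are in seconds. The label names are hypothetical.
    """
    if clip_end <= clip_start:
        raise ValueError("clip_end must be greater than clip_start")
    if clip_end <= transition:
        # The whole clip happens before the annotated transition.
        return "intentional"
    if clip_start >= transition:
        # The whole clip happens after the annotated transition.
        return "unintentional"
    # The clip straddles the annotated transition point.
    return "transition"


# Example: a video annotated with a transition at 2.5 seconds.
for start, end in [(0.0, 1.0), (2.0, 3.0), (3.0, 4.0)]:
    print(start, end, label_clip(start, end, transition=2.5))
```

Under these assumptions, one timestamp per video is enough to supervise every clip sampled from it.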


Results

Our trained model learns to localize the transition to unintentional action in video. Below, we show model outputs on a sliding window passed through the input. The predicted transition is the location with the highest probability of transition (shown in yellow).
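The sliding-window localization described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names, the window and stride sizes, and the choice to report the window center as the predicted time are all assumptions, and the per-window probabilities are taken as given rather than produced by a real model.

```python
from typing import List, Sequence, Tuple


def sliding_windows(num_frames: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, stepping by `stride`."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    last_start = max(num_frames - window, 0)
    # Clamp the end so a video shorter than one window still yields a valid range.
    return [(s, min(s + window, num_frames)) for s in range(0, last_start + 1, stride)]


def predict_transition(
    windows: Sequence[Tuple[int, int]], probs: Sequence[float], fps: float
) -> float:
    """Return the center time, in seconds, of the highest-probability window."""
    if not windows or len(windows) != len(probs):
        raise ValueError("windows and probs must be non-empty and the same length")
    # Argmax over per-window transition probabilities (first window wins ties).
    best = max(range(len(probs)), key=lambda i: probs[i])
    start, end = windows[best]
    return (start + end) / 2.0 / fps


# Hypothetical per-window transition probabilities for a 10-frame video.
windows = sliding_windows(num_frames=10, window=4, stride=2)
probs = [0.1, 0.2, 0.9, 0.3]
print(predict_transition(windows, probs, fps=2.0))
```

A smaller stride gives finer temporal resolution at the cost of more model evaluations per video.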
