Toward a machine learning model that can reason about everyday actions


Researchers train a model to reach human-level performance at recognizing abstract concepts in video.

The ability to reason abstractly about events as they unfold is a defining feature of human intelligence. We know instinctively that crying and writing are means of communicating, and that a panda falling from a tree and a plane landing are variations on descending.

A computer vision model developed by researchers at MIT, IBM, and Columbia University can compare and contrast dynamic events captured on video to tease out the high-level concepts connecting them. In a set of experiments, the model picked out the video in each set that conceptually didn't belong. In the accompanying image, the odd-one-out videos, highlighted in red, show a woman folding a blanket, a dog barking, a man chopping greens, and a man offering grass to a llama.
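The odd-one-out task itself can be sketched in a few lines. The snippet below is a minimal illustration, not the researchers' method: it assumes each video clip has already been mapped to an embedding vector (here just hand-made toy vectors) and flags as the outlier the clip with the lowest average cosine similarity to the rest — one standard way such a selection could be scored.

```python
import numpy as np

def odd_one_out(embeddings):
    """Return the index of the clip least similar to the rest.

    embeddings: (n, d) array of video-clip embeddings. These are
    hypothetical stand-ins; a real system would produce them with a
    learned video model.
    """
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    sim = X @ X.T                                     # pairwise cosine similarity
    np.fill_diagonal(sim, 0.0)                        # ignore self-similarity
    mean_sim = sim.sum(axis=1) / (len(X) - 1)         # avg similarity to the others
    return int(np.argmin(mean_sim))                   # least-typical clip

# Toy example: three clips clustered around one concept, one outlier.
clips = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [-0.2, 1.0]]
print(odd_one_out(clips))  # -> 3
```

The interesting part in the actual research is, of course, learning embeddings in which "a panda falling from a tree" and "a plane landing" end up close together; the selection step above is the easy half.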