Research: Even smartest AI models don’t match human visual processing


A York University study highlights how deep-network models used in artificial intelligence (AI) take potentially dangerous shortcuts in solving complex recognition tasks.

James Elder

Deep convolutional neural networks (DCNNs) used in computer vision don’t see objects the way humans do – using configural shape perception – and that could be dangerous in real-world AI applications, says York University Professor James Elder, co-author of a new study released Friday, Sept. 16. Elder, who is a professor in York’s Faculty of Health and the Lassonde School of Engineering, is the York Research Chair in Human and Computer Vision and co-director of York’s Centre for AI & Society. He collaborated on this study with Nicholas Baker, assistant professor of psychology at Loyola University Chicago and a former VISTA postdoctoral fellow at York.

Published in the journal iScience, the study, titled “Deep learning models fail to capture the configural nature of human shape perception,” employed novel visual stimuli called “Frankensteins” to explore how the human brain and DCNNs process holistic, configural object properties.

“Frankensteins are simply objects that have been taken apart and put back together the wrong way around,” says Elder. “As a result, they have all the right local features, but in the wrong places.”  
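As a rough illustration of the idea Elder describes, the sketch below rearranges a toy binary silhouette so that its parts are preserved but sit in the wrong places. It assumes a horizontal split at the midline and a left-right mirror of the top half; this is an assumption made for illustration, and the study's actual stimulus construction may differ. The function name `make_frankenstein` and the toy grid are hypothetical.

```python
def make_frankenstein(silhouette):
    """Return a copy of a binary silhouette with its top half mirrored left-right.

    `silhouette` is a list of equal-length rows, each a list of 0/1 pixels.
    Assumption for illustration only: split at the horizontal midline and
    mirror the top half, leaving the bottom half untouched.
    """
    mid = len(silhouette) // 2
    top = [list(reversed(row)) for row in silhouette[:mid]]
    bottom = [list(row) for row in silhouette[mid:]]
    return top + bottom


# Hypothetical 4x4 toy "object" (1 = object pixel, 0 = background).
toy = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 1],
]

frank = make_frankenstein(toy)
for row in frank:
    print("".join("#" if pixel else "." for pixel in row))
```

The rearranged grid contains exactly the same number of object pixels as the original, and each half is intact, but the overall configuration has changed. That is the property the quote attributes to Frankensteins: the right local features in the wrong places.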

The investigators found that while the human visual system is confused by Frankensteins, DCNNs are not – revealing an insensitivity to configural object properties.

“Our results explain why deep AI models fail under certain conditions and point to the need to consider tasks beyond object recognition in order to understand visual processing in the brain,” says Elder. “These deep models tend to take ‘shortcuts’ when solving complex recognition tasks. While these shortcuts may work in many cases, they can be dangerous in some of the real-world AI applications we are currently working on with our industry and government partners.”

One such application is traffic video safety systems: “The objects in a busy traffic scene – the vehicles, bicycles and pedestrians – obstruct each other and arrive at the eye of a driver as a jumble of disconnected fragments,” explains Elder. “The brain needs to correctly group those fragments to identify the correct categories and locations of the objects. An AI system for traffic safety monitoring that is only able to perceive the fragments individually will fail at this task, potentially misunderstanding risks to vulnerable road users.”

According to the researchers, modifications to training and architecture aimed at making networks more brain-like did not lead to configural processing, and none of the networks were able to accurately predict trial-by-trial human object judgements. “We speculate that to match human configural sensitivity, networks must be trained to solve a broader range of object tasks beyond category recognition,” says Elder.