Drew Linsley1, Peisen Zhou1, Alekh Karkada Ashok1, Akash Nagaraj1, Gaurav Gaonkar1, Francis E Lewis1, Zygmunt Pizlo2, Thomas Serre1
1Carney Institute for Brain Science, Brown University, Providence, RI.
2Department of Cognitive Sciences, University of California-Irvine, Irvine, CA.
Project Page · Paper · Data
Visual perspective taking (VPT), the ability to accurately perceive and reason about the perspectives of others, is an essential feature of human intelligence. VPT is a byproduct of capabilities for analyzing 3-Dimensional (3D) scenes, which develop over the first decade of life. Deep neural networks (DNNs) may be a good candidate for modeling VPT and its computational demands, in light of a growing number of reports indicating that DNNs gain the ability to analyze 3D scenes after training on large static-image datasets. Here, we investigated this possibility by developing the 3D perception challenge (3D-PC) for comparing 3D perceptual capabilities in humans and DNNs. We tested over 30 human participants and "linearly probed" or text-prompted over 300 DNNs on several 3D-analysis tasks posed within natural scene images: (i.) a simple test of object depth order, (ii.) a basic VPT task (VPT-basic), and (iii.) a more challenging version of VPT (VPT-strategy) designed to limit the effectiveness of "shortcut" visual strategies. Nearly all of the DNNs approached or exceeded human accuracy in analyzing object depth order, and, surprisingly, their accuracy on this task correlated with their object recognition performance. In contrast, there was an extraordinary gap between DNNs and humans on VPT-basic: humans were nearly perfect, whereas most DNNs were near chance. Fine-tuning DNNs on VPT-basic brought them close to human performance, but, unlike humans, they dropped back to chance when tested on VPT-strategy. Our challenge demonstrates that the training routines and architectures of today's DNNs are well-suited for learning basic 3D properties of scenes and objects, but not for reasoning about these properties in the way that humans rely on in their everyday lives. We release our 3D-PC datasets and code to help bridge this gap in 3D perception between humans and machines.
We release data for all three tasks (VPT-basic, VPT-strategy, and depth order) on Hugging Face.
https://huggingface.co/datasets/3D-PC/3D-PC
from datasets import load_dataset
# config_name: one of ["vpt-basic", "vpt-strategy", "depth"]
dataset = load_dataset("pzhou10/3D-PC", "vpt-basic")
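If you want to look at the other configurations before writing task code, the snippet below is a minimal sketch: it loads them the same way and prints the splits, row counts, and feature columns instead of assuming any particular schema.
from datasets import load_dataset

# Load the remaining task configurations the same way.
vpt_strategy = load_dataset("pzhou10/3D-PC", "vpt-strategy")
depth = load_dataset("pzhou10/3D-PC", "depth")

# Inspect the splits, row counts, and feature columns that each config exposes.
print(depth)
for split in depth:
    print(split, depth[split].num_rows, depth[split].column_names)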
We release the complete 3D-PC dataset along with data splits for training and testing.
https://connectomics.clps.brown.edu/tf_records/VPT/
The train directory contains all training images, organized by category:
train
|
|_<category>
|  |_<object>
|     |_<setting>
|        |_<*.png>
The corresponding labels are in train_perspective.csv and depth_perspective.csv. We also provide train_perspective_balanced.csv and depth_perspective_balanced.csv, in which the numbers of positive and negative samples are equal.
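As a quick sanity check on how images pair with labels, the sketch below opens one of the label files with pandas and prints its header and first rows; the column names are not documented here, so the code avoids assuming any.
import pandas as pd

# Print the header and a few rows to learn the label schema before writing
# any task-specific loading code.
labels = pd.read_csv("train_perspective_balanced.csv")
print(labels.columns.tolist())
print(labels.head())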
The perspective and depth directories contain all data splits for the VPT and depth order tasks:
perspective/depth
|
|_<split>
|  |_<category> 0/1
|     |_<*.png>
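Since each split folder holds the two class subfolders 0 and 1, it can be read with a standard folder-per-class loader. The sketch below uses torchvision's ImageFolder; the split name train and the resize size are assumptions, so adjust them to the splits you download.
from torchvision import datasets, transforms

# Each split directory contains the class subfolders 0/ and 1/, which matches
# the folder-per-class layout that ImageFolder expects.
transform = transforms.Compose([
    transforms.Resize((224, 224)),  # assumed input size; change to your model's
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("<data_folder>/perspective/train", transform=transform)
print(len(train_set), train_set.classes)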
To linearly probe a timm model:
python run_linear_probe.py --task <task> --data_dir <data_folder>/<task>/ --model_name <model_name>
To fine-tune a timm model:
python run_finetune.py --task <task> --data_dir <data_folder>/<task>/ --model_name <model_name>
data_folder: root directory for the dataset
task: either perspective or depth
model_name: timm model name
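For example, a linear probe on the VPT task with a ViT backbone could be launched like this (the model name is only an illustration; any timm model name should work):
python run_linear_probe.py --task perspective --data_dir /path/to/3D-PC/perspective/ --model_name vit_base_patch16_224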
@misc{linsley20243dpc,
title={The 3D-PC: a benchmark for visual perspective taking in humans and machines},
author={Drew Linsley and Peisen Zhou and Alekh Karkada Ashok and Akash Nagaraj and Gaurav Gaonkar and Francis E Lewis and Zygmunt Pizlo and Thomas Serre},
year={2024},
eprint={2406.04138},
archivePrefix={arXiv},
primaryClass={cs.CV}
}