A Dynamic Benchmark for Spatial Understanding

We have created a procedurally generatable, synthetic dataset for testing spatial reasoning, visual prompting, object recognition and detection.

A key question for understanding multimodal model performance is how well a model can understand images, in particular basic versus detailed spatial understanding. These capabilities are needed for models to be useful in real-world tasks, such as acting as an assistant in the physical world. The datasets are challenging, and because they are procedurally generated and not public, the results cannot be attributed to memorization.
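To make the procedural-generation idea concrete, here is a minimal sketch of how such scenes could be produced: a seeded random generator samples object shapes, colors, and positions and renders them onto a blank canvas, so a fresh seed yields a fresh, never-published image. The shape and color vocabulary, canvas size, and the `generate_scene` helper are illustrative assumptions, not the benchmark's actual generator.

```python
# Minimal sketch of procedural scene generation, assuming a simple 2-D
# renderer; object names, sizes, and rendering details are illustrative
# and not taken from the benchmark itself.
import random
from PIL import Image, ImageDraw

SHAPES = ["circle", "square"]   # hypothetical object vocabulary
COLORS = ["red", "green", "blue"]

def generate_scene(seed, num_objects=1, size=256):
    """Sample object specs and render them onto a blank canvas."""
    rng = random.Random(seed)
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    objects = []
    for _ in range(num_objects):
        shape = rng.choice(SHAPES)
        color = rng.choice(COLORS)
        x, y = rng.randint(32, size - 64), rng.randint(32, size - 64)
        box = [x, y, x + 32, y + 32]
        (draw.ellipse if shape == "circle" else draw.rectangle)(box, fill=color)
        objects.append({"shape": shape, "color": color, "box": box})
    return img, objects

# Regenerating with a new seed gives a new, non-public test instance,
# so strong scores cannot come from memorized answers.
image, annotations = generate_scene(seed=1234, num_objects=2)
```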

The benchmark has four sub-tasks that test both high-level and detailed understanding of images using a Visual Question Answering (VQA) approach. The figure below shows the four tasks. Each task has a single-object condition and a pair-of-objects condition. For each image we show the question that can be presented as a prompt to a multimodal language model.

Figure: Examples of the benchmark images and tasks
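To illustrate the VQA framing, the sketch below shows how a generated scene could be turned into a question/answer pair for the single-object and pair-of-objects conditions and scored by exact match. The question templates and the `make_vqa_item` and `score` helpers are hypothetical, assuming the annotation format from the generation sketch above; the benchmark's own questions are the ones shown in the figure.

```python
# Sketch of turning a generated scene into a VQA item; question templates
# and scoring are illustrative, not the benchmark's own.
def make_vqa_item(objects):
    """Build a question/answer pair for a single-object or pair condition."""
    if len(objects) == 1:
        obj = objects[0]
        question = f"What color is the {obj['shape']} in the image?"
        answer = obj["color"]
    else:
        a, b = objects[0], objects[1]
        left_first = a["box"][0] < b["box"][0]
        question = (f"Is the {a['color']} {a['shape']} to the left of "
                    f"the {b['color']} {b['shape']}? Answer yes or no.")
        answer = "yes" if left_first else "no"
    return {"question": question, "answer": answer}

def score(prediction, answer):
    """Simple exact-match scoring after normalization."""
    return prediction.strip().lower() == answer.lower()

# Usage: pair each rendered image with its question, send both to the
# multimodal model under test, and compare the reply to the gold answer.
item = make_vqa_item([
    {"shape": "circle", "color": "red", "box": [40, 80, 72, 112]},
    {"shape": "square", "color": "blue", "box": [150, 90, 182, 122]},
])
```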