See No Evil, Say No Evil: Description Generation from Densely Labeled Images

Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014)


This paper studies the generation of descriptive sentences from densely annotated images. Previous work generated descriptions from automatically detected visual information, but produced only a limited class of sentences, hindered by currently unreliable recognition of activities and attributes. Instead, we collect human annotations of objects, parts, attributes, and activities in images. These annotations let us build a significantly more comprehensive model of language generation and study what visual information is required to generate human-like descriptions. Experiments demonstrate high-quality output and show that activity annotations and the relative spatial locations of objects contribute most to producing high-quality sentences.
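As a rough illustration only (not the paper's model), the sketch below shows one way dense annotations such as objects, attributes, activities, and a spatial relation might be represented and realized as a sentence with a simple template; the `ObjectAnnotation` class and `describe` function are hypothetical names introduced here for the example.

```python
# Hypothetical sketch: dense image annotations (objects, attributes, parts,
# activities) plus a spatial relation, realized as a descriptive sentence
# with a simple template. This is illustrative, not the paper's generator.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ObjectAnnotation:
    name: str                                        # e.g. "dog"
    attributes: List[str] = field(default_factory=list)  # e.g. ["brown"]
    parts: List[str] = field(default_factory=list)        # e.g. ["tail"]
    activity: Optional[str] = None                   # e.g. "running"


def describe(obj: ObjectAnnotation, relation: str, other: ObjectAnnotation) -> str:
    """Template-based realization: attributes + object + activity + spatial relation."""
    subject = " ".join(obj.attributes + [obj.name])
    verb = f"is {obj.activity}" if obj.activity else "is"
    return f"A {subject} {verb} {relation} the {other.name}."


if __name__ == "__main__":
    dog = ObjectAnnotation(name="dog", attributes=["brown"], activity="running")
    grass = ObjectAnnotation(name="grass")
    print(describe(dog, "on", grass))  # -> "A brown dog is running on the grass."
```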

[Data]
[Captions]
[Output]