Seeing Bot

  • Yingwei Pan,
  • Zhaofan Qiu,
  • Ting Yao,
  • Houqiang Li,
  • Tao Mei

Published at the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)

We demonstrate a video captioning bot, named Seeing Bot, which can generate a natural language description of what it is seeing in near real time. Specifically, given a live streaming video, Seeing Bot runs two pre-learned and complementary captioning modules in parallel: one generates an image-level caption for each sampled frame, and the other generates a video-level caption for each sampled video clip. Both captioning modules are boosted by incorporating semantic attributes, which enrich the generated descriptions, leading to human-level caption generation. A visual-semantic embedding model is then exploited to rank the candidates and select the final caption from the two parallel modules by measuring the semantic relevance between the video content and each generated caption. Finally, Seeing Bot converts the selected description to speech and sends the speech to the end user via an earphone. Our demonstration runs on arbitrary videos in the wild and supports live video captioning.
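
For illustration, the caption-selection stage might look like the minimal Python sketch below. The captioner and embedding interfaces (`video_captioner`, `image_captioner`, `relevance`) are hypothetical placeholders standing in for the two pre-learned captioning modules and the visual-semantic embedding model; this is not the authors' released code.

```python
# A minimal sketch of Seeing Bot's caption-selection step: generate candidate
# captions from both modules, then rank them by semantic relevance to the
# video content. All interfaces here are assumed placeholders.

from typing import Callable, Iterable, List


def select_caption(
    clip: object,                                # a sampled video clip
    frames: Iterable[object],                    # frames sampled from the clip
    video_captioner: Callable[[object], str],    # clip-level captioning module
    image_captioner: Callable[[object], str],    # frame-level captioning module
    relevance: Callable[[object, str], float],   # visual-semantic embedding score
) -> str:
    """Collect candidate captions from both modules and return the one
    with the highest semantic relevance to the video content."""
    candidates: List[str] = [video_captioner(clip)]
    candidates.extend(image_captioner(f) for f in frames)
    return max(candidates, key=lambda c: relevance(clip, c))


if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end.
    clip, frames = "clip-0", ["frame-0", "frame-1"]
    final = select_caption(
        clip,
        frames,
        video_captioner=lambda c: "a man is playing guitar on stage",
        image_captioner=lambda f: f"a close-up view in {f}",
        relevance=lambda c, text: float(len(text)),  # placeholder ranking score
    )
    print(final)  # the selected caption would then be passed to text-to-speech
```

In the demonstrated system, the winning caption is subsequently converted to speech and delivered to the user; the ranking function above merely stands in for the learned visual-semantic embedding described in the abstract.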