When watching images or videos on a TV, we often have questions about what we see. What is the name of that beautiful place? Who are the actors? Which store sells the actor’s car at a big discount? Imagine a smart TV that can interactively answer such questions and recommend relevant shopping or travel advertisements. Watching TV would become more convenient and more fun.

MCL members Bing Li, Zhehang Ding and Yuhang Su are collaborating with Samsung on Interactive Advisement for Smart TV. In the first year, we focus on automatic image/video captioning. Image/video captioning describes an image or video with a sentence, rather than only detecting the objects in it.

Currently, we propose three pipelines for this project. The first pipeline is general image captioning. The second and third pipelines are place-aware captioning and face-aware captioning, respectively, so that our system can achieve better performance in vertical industries such as travel, entertainment and sports. For general image captioning, we have developed a detection method that achieves 84% mAP. For place-aware annotation, since no existing image dataset covers world-famous places, we collected images from 118 famous places in 21 countries to construct a landmark dataset. For face-aware annotation, we constructed a celebrity dataset and developed CNN-based face detection and face recognition methods.
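
For readers unfamiliar with the mAP (mean average precision) metric quoted above, the following is a minimal sketch of how such a score can be computed. It uses the non-interpolated definition of average precision, which is an assumption: the post does not say which variant or benchmark was used. The function names and the toy data are hypothetical and are not taken from the project.

```python
def average_precision(ranked_hits, num_positives):
    """Non-interpolated average precision for a single class.

    ranked_hits: list of bools for detections sorted by descending
                 confidence; True means the detection matched a
                 ground-truth object.
    num_positives: total number of ground-truth objects of this class.
    """
    if num_positives == 0:
        return 0.0
    true_positives = 0
    precision_sum = 0.0
    for rank, hit in enumerate(ranked_hits, start=1):
        if hit:
            true_positives += 1
            # Precision measured at the rank of each correct detection.
            precision_sum += true_positives / rank
    return precision_sum / num_positives


def mean_average_precision(per_class):
    """per_class: dict mapping class name -> (ranked_hits, num_positives)."""
    aps = [average_precision(hits, n) for hits, n in per_class.values()]
    return sum(aps) / len(aps)


# Toy, made-up detections (hypothetical, for illustration only).
toy = {
    "car": ([True, False, True], 2),
    "person": ([True, True], 2),
}
print(mean_average_precision(toy))
```

In this sketch each class gets its own average precision from its ranked detections, and mAP is simply the mean over classes. Benchmarks differ in how they match detections to ground truth and whether they interpolate the precision curve, so the numbers from this sketch are not directly comparable to any specific benchmark.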

In future work, we will put more effort into video captioning.