[1] Farhadi A, Hejrati M, Sadeghi M A, et al. Every picture tells a story: Generating sentences from images[C]//European Conference on Computer Vision. Berlin: Springer-Verlag, 2010: 15-29.
[2] Hodosh M, Young P, Hockenmaier J. Framing image description as a ranking task: Data, models and evaluation metrics[J]. Journal of Artificial Intelligence Research, 2013, 47(1): 853-899.
[3] Yang Y, Teo C L, Daumé H, et al. Corpus-guided sentence generation of natural images[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing. Edinburgh, Scotland, UK, 2011: 444-454.
[4] Kulkarni G, Premraj V, Dhar S, et al. Baby talk: Understanding and generating simple image descriptions[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(12): 2891-2903. DOI:10.1109/TPAMI.2012.162.
[5] Ushiku Y, Harada T, Kuniyoshi Y. Automatic sentence generation from images[C]//Proceedings of the 19th ACM International Conference on Multimedia. New York: ACM, 2011: 1533-1536.
[6] Feng Y, Lapata M. Automatic caption generation for news images[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(4): 797-812. DOI:10.1109/TPAMI.2012.118.
[7] Gupta A, Verma Y, Jawahar C V, et al. Choosing linguistics over vision to describe images[C]//AAAI Conference on Artificial Intelligence. Palo Alto, CA, USA: Association for the Advancement of Artificial Intelligence, 2012: 606-611.
[8] Berg T L, Berg A C, Shih J. Automatic attribute discovery and characterization from noisy web data[C]//European Conference on Computer Vision. Berlin: Springer, 2010: 663-676.
[9] Kiapour H, Yamaguchi K, Berg A C, et al. Hipster Wars: Discovering elements of fashion styles[C]//European Conference on Computer Vision. Zurich, Switzerland, 2014: 472-488.
[10] Mason R. Domain-independent captioning of domain-specific images[C]//Conference of the North American Chapter of the Association for Computational Linguistics. Stroudsburg, PA, USA: Association for Computational Linguistics, 2013: 69-76.
[11] Kiros R, Salakhutdinov R, Zemel R. Multimodal neural language models[C]//International Conference on Machine Learning. Beijing, China, 2014: 595-603.
[12] Bo L, Ren X, Fox D. Kernel descriptors for visual recognition[C]//Advances in Neural Information Processing Systems. Vancouver, Canada, 2010: 1734-1742.
[13] Hwang S, Grauman K. Learning the relative importance of objects from tagged images for retrieval and cross-modal search[J]. International Journal of Computer Vision, 2012, 100(2): 134-153. DOI:10.1007/s11263-011-0494-3.
[14] Su Y, Jurie F. Visual word disambiguation by semantic contexts[C]//IEEE International Conference on Computer Vision. Barcelona, Spain, 2011: 311-318.