Special issue on Cross-Media Learning for Visual Question Answering (VQA)
• Major category: Engineering and Technology - Zone 3
• Sub-category: Computer Science: Artificial Intelligence - Zone 3
Visual Question Answering (VQA) is an active research topic that draws on multimedia analysis, computer vision (CV), natural language processing (NLP), and artificial intelligence more broadly, and it has attracted considerable interest from the deep learning, CV, and NLP communities. The task is defined as follows: a VQA system takes as input an image and a free-form, open-ended natural-language question about that image, and produces a natural-language answer as output by integrating information from both. To answer a specific question about a specific image, the machine must understand the content of the image, the meaning and intent of the question, and the relevant background knowledge. VQA therefore touches several AI technologies: fine-grained recognition, object recognition, behavior recognition, and understanding of the question text (NLP). Because VQA is closely tied to both CV and NLP, a natural approach to VQA is to combine a convolutional neural network (CNN) with a recurrent neural network (RNN), which have been used successfully in CV and NLP respectively, into a composite model. In short, VQA is a learning task that links CV and NLP.
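As a rough illustration of the composite model described above, the following sketch wires an image encoder and a recurrent question encoder into a single answer scorer. It is a toy under stated assumptions, not a reference implementation: the function names (`encode_image`, `encode_question`, `answer`), the element-wise fusion, the tiny vocabulary, and the fixed answer list are all hypothetical, the "CNN" is replaced by a single random projection, and the weights are untrained, so the returned answer is arbitrary.

```python
import math
import random


def encode_image(pixels, dim, rng):
    """Stand-in for a CNN: one random linear projection followed by ReLU."""
    weights = [[rng.uniform(-1, 1) for _ in pixels] for _ in range(dim)]
    return [max(0.0, sum(w * p for w, p in zip(row, pixels))) for row in weights]


def encode_question(tokens, vocab, dim, rng):
    """Minimal Elman-style RNN: h_t = tanh(W_x[:, token] + W_h @ h_{t-1})."""
    w_x = [[rng.uniform(-1, 1) for _ in vocab] for _ in range(dim)]
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
    h = [0.0] * dim
    for tok in tokens:
        idx = vocab.index(tok)  # assumes every token is in the vocabulary
        h = [
            math.tanh(w_x[i][idx] + sum(w_h[i][j] * h[j] for j in range(dim)))
            for i in range(dim)
        ]
    return h


def answer(pixels, question, vocab, answers, dim=8, seed=0):
    """Fuse image and question features, then pick the top-scoring answer."""
    rng = random.Random(seed)  # fixed seed keeps the untrained toy deterministic
    v = encode_image(pixels, dim, rng)
    tokens = question.lower().rstrip("?").split()
    q = encode_question(tokens, vocab, dim, rng)
    fused = [a * b for a, b in zip(v, q)]  # element-wise fusion (one common choice)
    w_out = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in answers]
    scores = [sum(w * f for w, f in zip(row, fused)) for row in w_out]
    return answers[scores.index(max(scores))]


if __name__ == "__main__":
    vocab = ["what", "color", "is", "the", "ball"]
    candidates = ["red", "blue", "yes", "no"]
    image = [0.1, 0.5, 0.9, 0.3]  # flattened toy "image"
    print(answer(image, "What color is the ball?", vocab, candidates))
```

Treating VQA as classification over a fixed answer list is itself a simplifying assumption; generating free-form natural-language answers, as the task definition calls for, would need a decoder in place of the final scoring step.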
VQA is challenging because it requires comprehending textual questions, analyzing visual content and image elements, and reasoning across both modalities. Moreover, external or commonsense knowledge is sometimes required as a basis for the answer. Although VQA research has made some progress, the overall accuracy of current models remains low. Because present VQA models are relatively simple in structure and limited in the content and form of the answers they produce, they struggle to answer even slightly complex questions that require prior knowledge and simple reasoning. Therefore, this Special Section in Journal of Visual Communication and Image Representation aims to solicit original technical papers with novel contributions on the convergence of CV, NLP, and Deep Learning, as well as theoretical contributions relevant to the connection between natural language and CV.
The topics of interest include, but are not limited to:
Deep learning methodology and its applications to VQA, e.g., human-computer interaction, intelligent cross-media query, etc.