Learning Distributed Representations of Data in Community Question Answering for Question Retrieval

Kai Zhang; Wei Wu; Fang Wang; Ming Zhou; Zhoujun Li

Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (WSDM'16) | February 2016

Published by ACM

We study the problem of question retrieval in community question answering (CQA). The biggest challenge in this task is the lexical gap between questions, since similar questions are usually expressed with different but semantically related words. To bridge this gap, state-of-the-art methods incorporate extra information, such as word-to-word translation and question categories, into traditional language models. We find that the existing language-model-based methods can be interpreted within a new framework: they represent words and question categories in a vector space and calculate question-question similarities as a linear combination of dot products of the vectors. The problem is that these methods are either heuristic in how they represent data or difficult to scale up. We propose a principled and efficient approach to learning representations of data in CQA. In our method, we simultaneously learn vectors of words and vectors of question categories by optimizing an objective function naturally derived from the framework. In question retrieval, we incorporate the learnt representations into traditional language models in an effective and efficient way. We conduct experiments on large-scale data from Yahoo! Answers and Baidu Knows, and compare our method with state-of-the-art methods on two public data sets. Experimental results show that our method significantly improves on baseline methods in retrieval relevance. With 1 million training instances, our method takes less than 50 minutes to learn a model on a single multicore machine, while the translation-based language model needs more than 2 days to learn a translation table on the same machine.
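
The framework described in the abstract, where question-question similarity is a linear combination of dot products between word vectors and between category vectors, can be illustrated with a minimal sketch. This is not the paper's model or objective function. The toy two-dimensional vectors, the averaging over word pairs, the `alpha` weight, and the function names `dot` and `similarity` are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def similarity(
    query_words: List[str],
    cand_words: List[str],
    query_cat: str,
    cand_cat: str,
    word_vecs: Dict[str, Vector],
    cat_vecs: Dict[str, Vector],
    alpha: float = 0.7,  # hypothetical mixing weight, not taken from the paper
) -> float:
    """Linear combination of word-level and category-level dot products."""
    # Word term: average dot product over all known (query word, candidate word) pairs.
    pairs = [
        (w, t)
        for w in query_words
        for t in cand_words
        if w in word_vecs and t in word_vecs
    ]
    word_term = (
        sum(dot(word_vecs[w], word_vecs[t]) for w, t in pairs) / len(pairs)
        if pairs
        else 0.0
    )
    # Category term: dot product of the two category vectors.
    cat_term = dot(cat_vecs[query_cat], cat_vecs[cand_cat])
    return alpha * word_term + (1.0 - alpha) * cat_term


# Hand-made toy vectors (assumed for illustration; a real system would learn them).
word_vecs = {
    "fix": [0.5, 0.5],
    "repair": [0.5, 0.5],
    "laptop": [1.0, 0.0],
    "notebook": [0.9, 0.1],
    "cook": [0.1, 0.9],
    "pasta": [0.0, 1.0],
}
cat_vecs = {"Computers": [1.0, 0.0], "Food": [0.0, 1.0]}

query = (["fix", "laptop"], "Computers")
related = (["repair", "notebook"], "Computers")
unrelated = (["cook", "pasta"], "Food")

score_related = similarity(query[0], related[0], query[1], related[1], word_vecs, cat_vecs)
score_unrelated = similarity(query[0], unrelated[0], query[1], unrelated[1], word_vecs, cat_vecs)
print(score_related > score_unrelated)
```

In this toy setup the query shares no words with the related candidate, yet the candidate still outscores the unrelated one because its word and category vectors point in similar directions. That is the sense in which vector representations can bridge a lexical gap.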