Open-Schema Event Profiling for Massive News Corpora
- Quan Yuan
- Xiang Ren
- Wenqi He
- Chao Zhang
- Xinhe Geng
- Lifu Huang
- Heng Ji
- Chin-Yew Lin
- Jiawei Han
Conference on Information and Knowledge Management
Published by ACM
With the rapid growth of online information services, a massive volume of news data has become available. To help people quickly digest this flood of information, we define a new problem, schema-based news event profiling: profiling events reported in open-domain news corpora with a set of slots and slot-value pairs for each event, where the set of slots forms the schema of an event type. Such profiling not only provides readers with concise views of events but also facilitates applications such as information retrieval, knowledge graph construction, and question answering. The task is challenging, however. First, both events and event types are initially unknown and must be discovered. Second, there are no predefined event-type schemas. Third, even once schemas have been extracted, generating event profiles from them remains essential yet demanding. To address these challenges, we propose a fully automatic, unsupervised, three-step framework for obtaining event profiles. First, we develop a Bayesian non-parametric model that detects events and event types by exploiting the slot expressions of the entities mentioned in news articles. Second, we propose an unsupervised embedding model for schema induction that encodes the following insight: an entity may serve as the value of multiple slots in an event, but the more sentences it shares with the same set of other entities in the event, and the more such entities there are, the more similar its slots in those sentences tend to be. Finally, we build event profiles by extracting slot values for each event based on the slots’ expression patterns. To the best of our knowledge, this is the first work on schema-based profiling for news events. Experimental results on a large news corpus demonstrate that our method outperforms state-of-the-art baselines on event detection, schema induction, and event profiling.
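To make the notion of an event profile concrete, the following is a minimal Python sketch of the data structure the abstract describes: an event type whose schema is a set of slots, and an event whose profile maps each slot to its values. It is not the authors' implementation. The names `EventProfile` and `build_profile`, the dependency-style pattern strings, and the exact-match rule for assigning entities to slots are all assumptions made for illustration; the paper's models for event detection and schema induction are not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class EventProfile:
    """Hypothetical container: one event, its type, and its slot-value pairs."""
    event_type: str
    schema: tuple                      # slot names that form the event type's schema
    slot_values: dict = field(default_factory=dict)

    def __post_init__(self):
        # Start every slot in the schema with an empty set of values.
        for slot in self.schema:
            self.slot_values.setdefault(slot, set())

    def add(self, slot, value):
        # Reject slots that are not part of the schema.
        if slot not in self.schema:
            raise KeyError(f"unknown slot: {slot}")
        self.slot_values[slot].add(value)


def build_profile(event_type, schema_patterns, mentions):
    """Fill a profile from (entity, expression pattern) mentions.

    schema_patterns: slot name -> set of expression patterns (assumed given,
                     e.g. as the output of some schema induction step).
    mentions:        list of (entity, pattern) pairs observed for one event.
    Assumption: an entity fills a slot when its pattern matches exactly.
    """
    profile = EventProfile(event_type, tuple(schema_patterns))
    for entity, pattern in mentions:
        for slot, patterns in schema_patterns.items():
            if pattern in patterns:
                profile.add(slot, entity)
    return profile


# Toy input, invented for this example.
schema_patterns = {
    "attacker": {"nsubj:attack", "nsubj:bomb"},
    "target": {"dobj:attack"},
}
mentions = [
    ("militants", "nsubj:attack"),
    ("embassy", "dobj:attack"),
    ("police", "nsubj:arrive"),   # matches no slot pattern, so it is skipped
]
profile = build_profile("attack", schema_patterns, mentions)
```

Storing slot values as sets reflects the abstract's remark that a slot can hold several values and that one entity may fill more than one slot; a real system would likely score candidate values rather than rely on exact pattern matches.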