KDD Workshop on Issues of Sentiment Discovery and Opinion Mining
Venue: London, UK
The distillation of knowledge from social media is an extremely difficult task because the content of today's Web, while perfectly suitable for human consumption, remains hardly accessible to machines. The opportunity to capture the opinions of the general public about social events, political movements, company strategies, marketing campaigns, and product preferences has raised growing interest both within the scientific community, where it has opened many exciting challenges, and in the business world, given the remarkable benefits to be had from marketing and financial market prediction.
Statistical NLP has been the mainstream NLP research direction since the late 1990s. It relies on language models based on popular machine-learning algorithms such as maximum likelihood, expectation maximization, conditional random fields, and support vector machines. By feeding a large training corpus of annotated texts to a machine-learning algorithm, the system can learn not only the valence of keywords but also take into account the valence of other arbitrary keywords, punctuation, and word co-occurrence frequencies. However, standard statistical methods are generally semantically weak, as they focus merely on lexical co-occurrence elements that have little predictive value individually.
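As a rough illustration of the statistical approach described above, the following minimal sketch trains a classifier on a handful of annotated texts. It is an assumption-laden toy: the four training sentences are invented, the function names `train_naive_bayes` and `predict` are hypothetical, and the model is a multinomial Naive Bayes classifier with add-one smoothing, chosen for brevity rather than taken from the list of algorithms above. Punctuation marks are kept as tokens so that they contribute to the score alongside keywords.

```python
import math
from collections import Counter, defaultdict

# Toy annotated corpus, invented purely for illustration.
TRAIN = [
    ("great phone , love the screen !", "pos"),
    ("love this camera , great battery", "pos"),
    ("terrible battery , awful screen", "neg"),
    ("awful phone , hate it !", "neg"),
]


def train_naive_bayes(examples):
    """Count token frequencies per label; punctuation counts as a token."""
    token_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = text.split()
        token_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return token_counts, label_counts, vocab


def predict(text, model):
    """Return the label with the highest smoothed log-probability."""
    token_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Log prior from the share of training texts carrying this label.
        score = math.log(label_counts[label] / total_docs)
        total_tokens = sum(token_counts[label].values())
        for tok in text.split():
            if tok not in vocab:
                continue  # tokens never seen in training are ignored
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (token_counts[label][tok] + 1) / (total_tokens + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)


model = train_naive_bayes(TRAIN)
print(predict("love the great screen", model))
print(predict("awful battery", model))
```

Each token contributes independently to the score, which is exactly the weakness noted above: the model sees co-occurrence counts, not meaning.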
Endogenous NLP, instead, involves the use of machine-learning techniques to perform semantic analysis of a corpus by building structures that approximate concepts from a large set of documents. It does not involve prior semantic understanding of the documents; it relies only on knowledge endogenous to the documents themselves, rather than on external knowledge bases. The advantages of this approach over the knowledge engineering approach are effectiveness, considerable savings in terms of expert manpower, and straightforward portability to different domains. Endogenous NLP includes methods based either on lexical semantics, which focuses on the meanings of individual words (e.g., LSA, LDA, and MapReduce), or on compositional semantics, which looks at the meanings of sentences and longer utterances (e.g., HMM, association rule learning, and probabilistic generative models).
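To make the idea of deriving word-level structure from the documents alone more concrete, here is a small sketch under stated assumptions. It is not LSA itself: it only builds the raw term-document count vectors that a method such as LSA would go on to factorize, and compares words by cosine similarity. The four documents are invented, and the helper names `term_document_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter

# Toy document collection, invented purely for illustration.
DOCS = [
    "the battery drains fast and the charger is slow",
    "battery life is short , charger broke",
    "the screen is bright and the display is sharp",
    "sharp display , bright screen",
]


def term_document_vectors(docs):
    """Map each word to its vector of per-document occurrence counts."""
    counts = [Counter(doc.split()) for doc in docs]
    vocab = set().union(*counts)
    return {word: [c[word] for c in counts] for word in vocab}


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vectors = term_document_vectors(DOCS)
print(cosine(vectors["battery"], vectors["charger"]))
print(cosine(vectors["battery"], vectors["screen"]))
```

In this toy collection "battery" and "charger" occur in the same documents, so their similarity is close to 1, while "battery" and "screen" never share a document and score 0. No external knowledge base is consulted; the relatedness emerges only from how words are distributed across the documents.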