Statistical Analysis of Text
Research Field: Computer Science
Lead PI: Prof. Vincent Wade
Abstract: Latent Dirichlet Allocation (LDA) is a generative probabilistic model for uncovering the underlying semantic structure of a text corpus. Topic models rest on the assumption that documents are mixtures of topics, where each topic is a distribution over words. The problem is modelled with a three-level hierarchical Bayesian approach and, because exact inference is intractable, is solved using Gibbs sampling. The Author-Topic model additionally takes authorship into account: each author is associated with a multinomial distribution over topics. Authorship information is thus incorporated into the standard topic model with only a small amount of additional inference, so the extra computational overhead is low. Experiments were conducted on message-forum data from a well-known website comprising ~50,000 authors and ~60 million words. All calculations are optimised for parallel CPU execution on facilities provided by TCHPC.

Eamonn is also involved in applying deep belief networks to large text corpora and in accelerating the training of such networks using GPUs.
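To make the inference procedure concrete, the collapsed Gibbs sampler for standard LDA can be sketched as below. This is a minimal single-threaded illustration, not the project's optimised parallel implementation; the function name, hyperparameter values, and toy corpus are illustrative. Each iteration resamples every token's topic assignment from its full conditional, which depends only on the running doc-topic and topic-word count tables.

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iter=100, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of documents, each a list of integer word ids.
    Returns (theta, phi): doc-topic and topic-word distributions.
    """
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))   # doc-topic counts
    nkw = np.zeros((n_topics, vocab_size))  # topic-word counts
    nk = np.zeros(n_topics)                 # tokens per topic

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        zs = rng.integers(n_topics, size=len(doc))
        assignments.append(zs)
        for w, z in zip(doc, zs):
            ndk[d, z] += 1
            nkw[z, w] += 1
            nk[z] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                ndk[d, z] -= 1; nkw[z, w] -= 1; nk[z] -= 1
                # Full conditional p(z | everything else), up to a constant:
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                p /= p.sum()
                z = rng.choice(n_topics, p=p)
                assignments[d][i] = z
                ndk[d, z] += 1; nkw[z, w] += 1; nk[z] += 1

    # Smoothed posterior estimates of the two distributions.
    theta = (ndk + alpha) / (ndk.sum(axis=1, keepdims=True) + n_topics * alpha)
    phi = (nkw + beta) / (nkw.sum(axis=1, keepdims=True) + vocab_size * beta)
    return theta, phi
```

The Author-Topic extension changes only the sampling step: instead of a per-document topic distribution, each token is also assigned one of the document's authors, and the doc-topic count table is replaced by an author-topic table, which is why the extra inference cost stays low.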