Trinity College Dublin

Statistical Analysis of Text

Research Field: Computer Science
Lead PI: Prof. Vincent Wade
Latent Dirichlet Allocation (LDA) is a generative probabilistic model for uncovering the underlying semantic structure of a text corpus. Topic models assume that each document is a mixture of topics, where each topic is a distribution over words. The problem is modelled as a three-level hierarchical Bayesian model; because exact inference is intractable, it is solved approximately using Gibbs sampling. The Author-Topic model also takes authorship information into account: each author is associated with a multinomial distribution over topics, and each topic with a multinomial distribution over words in the vocabulary. Authorship is incorporated into the standard topic model through some additional inference, at low computational overhead. Experiments were conducted on message forum data from a well-known website, comprising ~50,000 authors and ~60 million words. All calculations are optimised for parallel CPU execution on facilities provided by TCHPC.

Eamonn is also involved in applying deep belief networks to large text corpora and in accelerating the training of such networks using GPUs.
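As an illustration of the Gibbs sampling approach to LDA mentioned above, the following is a minimal single-threaded sketch in Python. It is not the project's code: it assumes the widely used collapsed variant of Gibbs sampling, symmetric Dirichlet priors, and documents already encoded as lists of integer word ids. The function name `lda_gibbs` and the parameters `alpha`, `beta`, `iters` and `seed` are hypothetical names chosen for this example.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch only).

    docs is a list of documents, each a list of word ids in
    range(vocab_size). alpha and beta are assumed symmetric Dirichlet
    hyperparameters. Returns (theta, phi): per-document topic
    proportions and per-topic word distributions.
    """
    rng = random.Random(seed)

    # Count tables: document-topic, topic-word, and per-topic totals.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Assign every token a random initial topic and record the counts.
    assign = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from all counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Unnormalised conditional p(z = k | all other tokens).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]

                # Resample the topic and restore the counts.
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed estimates from the final state of the sampler.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + vocab_size * beta)
         for c in topic_word[k]]
        for k in range(num_topics)
    ]
    return theta, phi
```

Calling `lda_gibbs([[0, 1, 0, 2], [3, 4, 3, 5]], num_topics=2, vocab_size=6)` returns one topic-proportion row per document and one word-distribution row per topic, each summing to 1. Under the same assumptions, an Author-Topic sampler would additionally resample an author for each token from the document's author list and keep author-topic counts in place of document-topic counts.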
Last updated 16 Jun 2010