Trinity College Dublin

Textual similarity experiments - public domain books data

Research Field: 
Computer Science
Resource Type: 
Storage
Resource Type: 
Compute
Resource Class: 
C
Lead PI: 
Dr. Carl Vogel
Abstract: 
This project aims to find the "best" similarity measures between two or more texts. Here "best" can mean different things: are the texts similar with respect to domain or vocabulary, to linguistic style, or to genre? These questions have numerous applications in Natural Language Processing, e.g. evaluating the correctness of a piece of text against a standard, detecting plagiarism, and grouping texts that relate to the same topic. We plan to use classic books by well-known authors as a means of evaluating our similarity measures, since this kind of data supports several evaluation frameworks: can the measure classify books by genre or author? Can it find the book from which an excerpt comes? Can it discover stylistic similarity between books?
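As an illustration of the excerpt-to-book evaluation described in the abstract, here is a minimal sketch in Python. The abstract does not name any particular measure, so the choice of cosine similarity over word counts is an assumption made purely for illustration, and the names `word_counts`, `cosine_similarity` and `closest_book` are hypothetical, not part of the project.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts.

    Assumed measure for illustration only; the project does not specify one.
    """
    a, b = word_counts(text_a), word_counts(text_b)
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def closest_book(excerpt, books):
    """Return the title of the book whose text is most similar to the excerpt.

    `books` maps a title to the full text of that book.
    """
    return max(books, key=lambda title: cosine_similarity(excerpt, books[title]))
```

A measure of this kind mostly captures vocabulary overlap. Judging similarity of style or genre would likely need different features, which is the kind of comparison the project sets out to make.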
Start Date: 
08/2011
Duration: 
12 months

Last updated 17 Aug 2011