Trinity College Dublin

Textual similarity experiments - public domain books data

Computer Science
Dr. Carl Vogel
This project aims to find the "best" similarity measures between two (or more) texts. Here "best" can mean several things: are the texts similar with respect to domain/vocabulary, linguistic style, or genre? These questions have numerous applications in Natural Language Processing, e.g. evaluating the correctness of a piece of text against a standard, detecting plagiarism, or grouping texts that relate to the same topic. We plan to use classic books by well-known authors as a means of evaluating our similarity measures, since this kind of data supports several evaluation frameworks: can the measure classify books by genre or author? Can it find the book from which an excerpt comes? Can it discover stylistic similarity between books?
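
As a purely illustrative sketch (not the measure this project has settled on), one classical baseline for vocabulary-level similarity is the cosine of two word-count vectors. The example below assumes plain lowercase word tokenization and uses hypothetical helper names (`tokenize`, `cosine_similarity`, `closest_book`); it shows how such a measure could be plugged into the "which book does this excerpt come from?" evaluation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed scheme)."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Only words present in both texts contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def closest_book(excerpt, books):
    """Return the title in `books` (title -> full text) most similar to the excerpt."""
    return max(books, key=lambda title: cosine_similarity(excerpt, books[title]))
```

A measure like this mostly captures shared vocabulary, so it would be expected to say more about domain than about style or genre; comparing it against measures built on other features is the kind of question the evaluation frameworks above are meant to answer.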
