Textual similarity experiments - public domain books data

Research Field
Computer Science
Resource Type
Storage
Compute
Resource Class
C
Code
HPC_11_00205
Abstract
This project aims to find the "best" similarity measures between two (or more) texts. Here "best" can have several meanings: are the texts similar with respect to domain/vocabulary, linguistic style, or genre? These questions have numerous applications in Natural Language Processing, e.g. evaluating the correctness of a piece of text against a standard, detecting plagiarism, and grouping texts that relate to the same topic. We plan to use classic books by well-known authors as a means of evaluating our similarity measures, since this kind of data supports several evaluation frameworks: can the measure classify books by genre or author? Can it find the book from which an excerpt comes? Can it discover stylistic similarity between books?
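As an illustration of one of the evaluation frameworks mentioned in the abstract (finding the book an excerpt comes from), the following is a minimal sketch using cosine similarity over word counts. This measure, the helper names (word_counts, cosine_similarity, most_similar_book) and the toy corpus are assumptions made for illustration only; the abstract does not say which measures the project actually uses.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count its alphabetic word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_book(excerpt, books):
    """Return the title in `books` (title -> text) most similar to `excerpt`."""
    target = word_counts(excerpt)
    return max(books, key=lambda t: cosine_similarity(target, word_counts(books[t])))


# Hypothetical toy corpus; real experiments would use full book texts.
books = {
    "sea_story": "the ship sailed the sea and the captain watched the whale",
    "garden_story": "the garden was full of roses and the lady walked among the flowers",
}
best = most_similar_book("the captain saw a whale near the ship", books)
print(best)
```

A raw word-count measure like this one mostly captures vocabulary overlap; comparing texts by style or genre would likely need different features, which is the kind of question the project sets out to explore.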
Duration
12.00 months
Declaration 1
Off
Declaration 2
Off
Hosted Service
Off