Clement, Ross and Sharp, David (2003) Ngram and Bayesian classification of documents for topic and authorship. Literary and Linguistic Computing, 18 (4). pp. 423-447. ISSN 0268-1145
Full text not available from this repository.
Official URL: http://dx.doi.org/10.1093/llc/18.4.423
Large real-world data sets have been investigated in the context of authorship attribution of real-world documents. Ngram measures can be used to accurately assign authorship for long documents such as novels. A number of 5 (authors) × 5 (movies) arrays of movie reviews were acquired from the Internet Movie Database. Both ngram and naive Bayes classifiers were used to classify along both the authorship and topic (movie) axes. Both approaches yielded similar results, and authorship was detected as accurately as topic, or more accurately. Part-of-speech tagging and function-word lists were used to investigate the influence of structure on classification tasks, using documents with meaning removed but grammatical structure intact.
|Additional Information:||Online ISSN 1477-4615|
|Research Community:||University of Westminster > Electronics and Computer Science, School of|
|Deposited On:||23 Sep 2005|
|Last Modified:||19 Oct 2009 14:30|