Clement, Ross and Sharp, David (2003) Ngram and bayesian classification of documents for topic and authorship. Literary and Linguistic Computing, 18 (4). pp. 423-447. ISSN 0268-1145
Full text not available from this repository.
Large, real-world data sets have been investigated in the context of authorship attribution of real-world documents. Ngram measures can be used to accurately assign authorship for long documents such as novels. A number of 5 (authors) x 5 (movies) arrays of movie reviews were acquired from the Internet Movie Database. Both ngram and naive Bayes classifiers were used to classify along both the authorship and topic (movie) axes. Both approaches yielded similar results, and authorship was detected as accurately as, or more accurately than, topic. Part-of-speech tagging and function-word lists were used to investigate the influence of structure on classification tasks, using documents with meaning removed but grammatical structure intact.
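As a rough illustration of the kind of ngram classification the abstract describes, the following minimal sketch builds character n-gram frequency profiles and assigns a document to the label whose training texts have the most similar profile. It is not the authors' method: the choice of character trigrams, cosine similarity, and the function names `ngram_profile`, `cosine` and `classify` are all assumptions made for this example.

```python
from collections import Counter
import math


def ngram_profile(text, n=3):
    """Return relative frequencies of character n-grams (assumed n=3)."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    total = len(grams)
    return {gram: count / total for gram, count in counts.items()}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def classify(document, labelled_texts, n=3):
    """Pick the label whose pooled training texts best match the document.

    labelled_texts maps a label (an author or a movie) to a list of texts.
    """
    doc_profile = ngram_profile(document, n)
    scores = {
        label: cosine(doc_profile, ngram_profile(" ".join(texts), n))
        for label, texts in labelled_texts.items()
    }
    return max(scores, key=scores.get)
```

Under these assumptions the same function covers both axes of a 5 x 5 review array: keying `labelled_texts` by author gives an authorship classifier, and keying it by movie gives a topic classifier.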
|Additional Information:||Online ISSN 1477-4615|
|Subjects:||University of Westminster > Science and Technology > Electronics and Computer Science, School of (No longer in use)|
|Date Deposited:||23 Sep 2005|
|Last Modified:||19 Oct 2009 13:30|