WestminsterResearch

Ngram and bayesian classification of documents for topic and authorship

Clement, Ross and Sharp, David (2003) Ngram and bayesian classification of documents for topic and authorship. Literary and Linguistic Computing, 18 (4). pp. 423-447. ISSN 0268-1145

Full text not available from this repository.

Official URL: http://dx.doi.org/10.1093/llc/18.4.423

Abstract

Large real-world data sets have been investigated in the context of authorship attribution of real-world documents. Ngram measures can be used to accurately assign authorship for long documents such as novels. A number of 5 (authors) x 5 (movies) arrays of movie reviews were acquired from the Internet Movie Database. Both ngram and naive Bayes classifiers were used to classify along both the authorship and topic (movie) axes. Both approaches yielded similar results, and authorship was detected as accurately as, or more accurately than, topic. Part-of-speech tagging and function-word lists were used to investigate the influence of structure on classification, using documents with meaning removed but grammatical structure intact.
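
The general idea of ngram-based classification mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the ngram measure, the ngram length, or any preprocessing, so everything below is an assumption made for illustration: character trigrams, relative-frequency profiles, and a sum-of-absolute-differences dissimilarity. The function names (ngram_profile, dissimilarity, classify) are hypothetical and are not taken from the paper.

```python
from collections import Counter


def ngram_profile(text, n=3):
    """Relative frequencies of character n-grams (assumed measure, not the paper's)."""
    # Lowercase and collapse whitespace so layout differences do not matter.
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def dissimilarity(p, q):
    """Sum of absolute frequency differences over n-grams seen in either profile."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))


def classify(text, labelled_texts, n=3):
    """Return the label whose pooled training profile is closest to the text.

    labelled_texts maps a label (an author or a movie) to a list of documents.
    """
    profiles = {
        label: ngram_profile(" ".join(docs), n)
        for label, docs in labelled_texts.items()
    }
    target = ngram_profile(text, n)
    return min(profiles, key=lambda label: dissimilarity(target, profiles[label]))
```

Under these assumptions the same routine would serve both axes of a 5 x 5 array: grouping the reviews by author gives an authorship classifier, and grouping them by movie gives a topic classifier.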

Item Type: Article
Additional Information: Online ISSN 1477-4615
Research Community: University of Westminster > Electronics and Computer Science, School of
ID Code: 499
Deposited On: 23 Sep 2005
Last Modified: 19 Oct 2009 14:30
