Investigating the Influence of Selected Linguistic Features on Authorship Attribution using German News Articles
Manuel Sage, Pietro Cruciata, Raed Abdo, J. Cheung, Y. Zhao
SwissText
Abstract
In this work, we perform authorship attribution on a new dataset of German news articles. We seek to classify over 3,700 articles to their five corresponding authors, using four conventional machine learning approaches (naı̈ve Bayes, logistic regression, SVM and kNN) and a convolutional neural network. We analyze the effect of character and word n-grams on the prediction accuracy, as well as the influence of stop words, punctuation, numbers, and lowercasing when preprocessing raw text. The experiments show that higher order character n-grams (n = 5,6) perform better than lower orders and word n-grams slightly outperform those with characters. Combining both in fusion models further improves results up to 92% for SVM. A multilayer convolutional structure allows the CNN to achieve 90.5% accuracy. We found stop words and punctuation to be important features for author identification; removing them leads to a measurable decrease in performance. Finally, we evaluate the topic dependency of the algorithms by gradually replacing named entities, nouns, verbs and eventually all tokens in the dataset according to their POStags.