A Novel Method of Text Attribution Based on the Statistics of Numerals
Андрей Вячеславович Зенков
Speaker
Associate Professor
Уральский федеральный университет им. Б. Н. Ельцина
193
2018-03-23
14:45 - 15:05
Keywords, summary
Benford’s law; first significant digit; stylometry; text attribution; Pearson's chi-squared test; Mann-Whitney U test; Kruskal-Wallis test
Abstract
A novel method for the statistical analysis of texts is proposed. We consider the frequency distribution of the first significant digits of numerals in coherent authorial English-language texts. Benford’s law is found to hold approximately for these frequencies, with a marked predominance of the digit 1. Differences between the Benford-like distributions for texts by different authors are statistically significant and characteristic of each author, which makes it possible, under certain conditions, to address the problem of authorship. For the first significant digits 1, 2, and sometimes 3, the observed frequency is usually higher than the probability given by Benford’s law; for larger digits the situation is reversed, and the distributions of these digits fluctuate strongly, which makes them unrepresentative for our purpose. The approach and its conclusions are supported by examples of computer analysis of works by W. M. Thackeray, M. Twain, R. L. Stevenson, et al. The results are confirmed by the parametric Pearson chi-squared test as well as by the non-parametric Mann–Whitney U test and Kruskal–Wallis test.
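The counting step described above can be sketched in Python. Benford’s law assigns the first significant digit d the probability log10(1 + 1/d). Everything else in this sketch is an assumption made for illustration and is not taken from the abstract: the function names, the small table of number words, and the rule that a run of consecutive number words counts as one numeral. It also compares a single text with Benford’s law, whereas the abstract compares authors with each other.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately incomplete table:
# English number word -> its first significant digit.
WORD_FIRST_DIGIT = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 1,
    "eleven": 1, "twelve": 1, "thirteen": 1, "fourteen": 1, "fifteen": 1,
    "sixteen": 1, "seventeen": 1, "eighteen": 1, "nineteen": 1,
    "twenty": 2, "thirty": 3, "forty": 4, "fifty": 5,
    "sixty": 6, "seventy": 7, "eighty": 8, "ninety": 9,
    "hundred": 1, "thousand": 1, "million": 1,
}

# A token is either a lowercase word or a number written with digits.
TOKEN = re.compile(r"[a-z]+|[0-9]+(?:[.,][0-9]+)*")


def first_digit_counts(text):
    """Count first significant digits of numerals written as digits or words.

    Assumption: consecutive number words ("two hundred", "twenty-one") form
    one numeral whose first digit comes from the first word. Numerals joined
    by "and" are counted separately, and "one" used as a pronoun is not
    filtered out.
    """
    counts = Counter()
    in_numeral = False
    for token in TOKEN.findall(text.lower()):
        if token[0].isdigit():
            # First non-zero digit, e.g. "0.05" -> 5; a bare "0" has none.
            digit = next((int(ch) for ch in token if ch in "123456789"), None)
            is_numeral = True
        else:
            digit = WORD_FIRST_DIGIT.get(token)
            is_numeral = digit is not None
        if is_numeral and not in_numeral and digit is not None:
            counts[digit] += 1
        in_numeral = is_numeral
    return counts


def benford_probability(d):
    """Benford's law: P(d) = log10(1 + 1/d) for d = 1..9."""
    return math.log10(1 + 1 / d)


def chi_squared_vs_benford(counts):
    """Pearson chi-squared statistic of observed counts against Benford's law."""
    n = sum(counts.get(d, 0) for d in range(1, 10))
    if n == 0:
        raise ValueError("no numerals found")
    stat = 0.0
    for d in range(1, 10):
        expected = n * benford_probability(d)
        stat += (counts.get(d, 0) - expected) ** 2 / expected
    return stat
```

For a real text, the statistic returned by `chi_squared_vs_benford` would be compared with a chi-squared critical value for 9 - 1 = 8 degrees of freedom.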
To achieve statistical stability of the frequency characteristics of interest, the texts must be long enough: a novel or a story, but apparently not an essay. According to our observations, the frequencies of the first significant digits begin to stabilize for texts larger than 200 KB (the size of the .txt file).
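One way to look for this stabilization is to recompute the first-digit frequencies on growing prefixes of a text and see where they stop changing. The sketch below is illustrative only: the function names and the 50,000-byte step are assumptions, and for brevity it counts only numbers written with digits, ignoring number words.

```python
import re
from collections import Counter


def leading_digits(text):
    """First significant digit of every number written with digits.

    Simplification: number words ("one", "twenty") are ignored here.
    """
    digits = []
    for token in re.findall(r"[0-9]+(?:[.,][0-9]+)*", text):
        for ch in token:
            if ch in "123456789":
                digits.append(int(ch))
                break
    return digits


def running_frequencies(text, step_bytes=50_000):
    """Return (prefix size in bytes, {digit: relative frequency}) pairs
    for prefixes of the text that grow by step_bytes each time."""
    data = text.encode("utf-8")
    results = []
    for end in range(step_bytes, len(data) + step_bytes, step_bytes):
        # A cut may fall inside a multi-byte character; drop the partial one.
        prefix = data[:end].decode("utf-8", errors="ignore")
        digits = leading_digits(prefix)
        if digits:
            counts = Counter(digits)
            freqs = {d: counts[d] / len(digits) for d in range(1, 10)}
            results.append((min(end, len(data)), freqs))
    return results
```

Plotting the frequency of the digit 1 against prefix size for a novel-length .txt file would show whether the curve flattens near the 200 KB mark reported above.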
Our method can complement traditional stylometric practices such as taking into account sentence length, word length, and the frequencies of function words. Naturally, the analysis requires a language in which numerals do not coincide with indefinite articles (as ‘ein’ does in German and ‘un’ in French).