Distance Analysis of English Texts. III. ARTHUR CONAN DOYLE: A STUDY IN SCARLET.
Milan Kunz (kunzmilan@atlas.cz), August 2002
Abstract
Distances between identical symbols in information strings (biological sequences, natural-language texts, and computer programs such as *.exe files) are described, with varying precision, by four distributions: exponential, Weibull, lognormal, and negative binomial. The correlations are sometimes highly significant. Here the distances between signs in the novel by A. C. Doyle are analyzed. Some distance tests revealed specific formal features of the text.
INTRODUCTION
This study continues the investigation of the statistical properties of distances between identical symbols in information strings (1, 2, 3). Doyle's novel was obtained in RTF form. Using MS Word, the text was converted into plain *.txt. Some formatting, such as headlines, remained unchanged. The resulting file has 238430 bytes. It contains 230112 signs including spaces, or 189183 signs without spaces, in 4159 lines and 42549 words. The mean length of a word is therefore 189183 / 42549 = 4.446 signs (including apostrophes and punctuation marks). For some letters, the list of distances was split into several approximately equal parts, since the statistical software used, Statgraphics (in the version I work with), does not handle very long lists.
After these formal corrections, the distances were determined by a program written by Rádl. The string is first indexed with the position index i (i going from 1 to m) of each individual symbol, and then the differences between the position indexes of consecutive occurrences of the same symbol are determined. These differences are taken as the topological distances between identical symbols. The sets of these values were evaluated by different statistical tests. The program counts all signs, including the spacebar, return, and punctuation marks.
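The indexing and differencing step can be sketched in a few lines of Python. This is only an illustration of the procedure described above, not Rádl's program; the function name `symbol_distances` and the sample string are assumptions made for the example.

```python
from collections import defaultdict

def symbol_distances(text):
    """Return {symbol: [distances between consecutive occurrences]}.

    Hypothetical helper: positions run from 1 to m, and every sign
    (letters, spaces, returns, punctuation) is counted.
    """
    last_seen = {}                 # symbol -> position of its previous occurrence
    distances = defaultdict(list)  # symbol -> list of position differences
    for i, symbol in enumerate(text, start=1):
        if symbol in last_seen:
            distances[symbol].append(i - last_seen[symbol])
        last_seen[symbol] = i
    return dict(distances)

sample = "a study in scarlet"
print(symbol_distances(sample)["s"])   # distances between consecutive "s"
print(symbol_distances(sample)[" "])   # distances between consecutive spaces
```

A doubled letter such as "aa" yields the distance 1, so the lists produced this way can be checked directly for immediate repetitions.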
Of all the distributions implemented in the software, only four gave significant results, as before: the exponential distribution, the Weibull distribution, the lognormal distribution, and the negative binomial distribution.
The actual values (mean, standard deviation, skewness, kurtosis, distribution parameters etc.) are of little interest, since they differ considerably.
RESULTS
The distances between points (full stops) determine the length of sentences. There are 2393 points, mostly used as sentence-ending punctuation marks, except in some abbreviations (e.g. M.D.).
The distribution is of the Weibull type, a = 1.75916, b = 108.163.
Chisquare = 10.4418 over 52; the significance level was 0.72929 with 14 degrees of freedom. There is a shortage of points at distances 68-82 (237 occurrences against 265 expected). This alone makes 28.4 % of the chi-square test value.
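Two of the figures above can be checked by hand. The sketch below assumes that a is the Weibull shape parameter and b the scale parameter (the text does not say which is which); under that reading the mean of the fitted distribution should be close to the mean sentence length, 230112 signs divided by 2393 points. The helper `chi_square_share` is a hypothetical name for the share that one class contributes to the total chi-square value.

```python
import math

a, b = 1.75916, 108.163        # fitted parameters; ASSUMED: a = shape, b = scale
signs, points = 230112, 2393   # counts quoted in the text

# Mean of a Weibull distribution with shape a and scale b
weibull_mean = b * math.gamma(1.0 + 1.0 / a)
# Mean distance between points estimated directly from the counts
observed_mean = signs / points

def chi_square_share(observed, expected, total_chi_square):
    """Percentage of the total chi-square value contributed by one class."""
    return 100.0 * (observed - expected) ** 2 / expected / total_chi_square

share = chi_square_share(237, 265, 10.4418)
print(round(weibull_mean, 1), round(observed_mean, 1), round(share, 1))
```

The two means agree to within a fraction of a sign, which supports the shape/scale reading, and the computed share reproduces the quoted 28.4 % up to the rounding of the expected count.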
The other punctuation mark, the semicolon (108 occurrences), is used as follows:
Chisquare Test
Lower Limit | Upper Limit | Observed Frequency | Expected Frequency | Chisquare
at or below | 667.619 | 39 | 38.3 | 0.01274
667.619 | 1334.238 | 23 | 23.4 | 0.00610
1334.238 | 2000.857 | 12 | 13.1 | 0.08969
2000.857 | 2667.476 | 10 | 8.2 | 0.41749
2667.476 | 3334.095 | 4 | 5.5 | 0.39519
3334.095 | 4667.333 | 7 | 6.7 | 0.01289
4667.333 | 6667.190 | 5 | 5.1 | 0.00387
above | 6667.190 | 8 | 7.8 | 0.00706
Chisquare = 0.945022 with 5 d.f. Sig. level = 0.966877
This is an example of an almost perfect lognormal distribution.
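The chi-square value and its significance level can be recomputed from the tabulated frequencies. The following is a minimal sketch, not the Statgraphics procedure: `chi_square` and `chi2_sf` are hypothetical helper names, and `chi2_sf` uses the standard closed-form expressions for the upper tail of the chi-square distribution with an integer number of degrees of freedom. The 5 d.f. are consistent with 8 classes minus 1 minus 2 fitted parameters.

```python
import math

def chi_square(observed, expected):
    """Pearson chi-square statistic from class frequencies."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_sf(x, df):
    """Upper-tail probability (significance level) for integer df."""
    half = x / 2.0
    if df % 2 == 0:
        # even df: exp(-x/2) * sum_{r < df/2} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    # odd df: erfc(sqrt(x/2)) plus a finite series
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return total

# Semicolon table: observed and (rounded) expected frequencies
observed = [39, 23, 12, 10, 4, 7, 5, 8]
expected = [38.3, 23.4, 13.1, 8.2, 5.5, 6.7, 5.1, 7.8]

statistic = chi_square(observed, expected)
print(round(statistic, 3), round(chi2_sf(0.945022, 5), 4))
```

Because the expected frequencies are printed with one decimal only, the recomputed statistic differs from 0.945022 in the second decimal place; the significance level computed from the reported statistic agrees with the reported 0.966877.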
There are many commas. Therefore, the file was split into 6 parts. The distribution of distances was exponential, with varying quality of fit, as tabulated:
Part | Number | Chisquare | Note
1 | 475 | 0.6716 | Peak 85-120, valley 155-190
2 | 476 | 0.2991 | Peak 107-127
3 | 483 | 0.5055 | Peak 86-106
4 | 477 | 0.0802 | Peak 115-143
5 | 476 | 0.6593 | Over 100, lower tail
6 | 478 | 0.0287 | Peak 71-125
There are no immediately repeated commas (",,"), which contributes 21.2-50.78 % of the chi-square test value.
The two way sample analysis shows how the parts differ:
Part | 2 | 3 | 4 | 5 | 6
1 | 0.355 | 0.091 | 0.122 | 0.374 | 0.177
2 | | 0.397 | 0.480 | 0.977 | 0.656
3 | | | 0.447 | 0.364 | 0.654
4 | | | | 0.621 | 0.761
5 | | | | | 0.621
Note: An asterisk marks a statistically significant difference between the tested parts.
The use of commas does not differ greatly between the parts.
The spacebar
A distance greater than 1 between consecutive spacebars corresponds to a word whose length is this distance minus one. There are 40931 spacebars without corrections. Some of them are used as formatting tools. The results are tabulated as follows. Pooling the frequencies of the shorter distances improved the fit in some cases, since the counts below the pooling limit are scattered and their deviations can balance each other.
Table 2 The number of words with the different length
Length | Number | Type of distribution, chisquare value
1 | 1593 | LN, 0.305
2 | 6427 | LN (divided)
3 | 8109 | LN (divided)
4 | 6309 | EX (divided)
5 | 4364 | EX (divided)
6 | 3237 | NB, 0.024, over 8 = 0.810
7 | 2692 | NB, 0.053, over 4 = 0.230
8 | 2002 | NB, 0.137, over 8 = 0.455
9 | 1540 | NB, 0.126, over 6 = 0.871
10 | 1043 | NB, 0.072, over 25 = 0.605
11 | 698 | EX, 0.358
12 | 455 | EX, 0.251
13 | 289 | WE, 0.676
14 | 193 | WE, 0.686
15 | 122 | WE, 0.097
16 | 81 | WE, 0.531
17 | 45 | WE, 0.207
18 | 30 | too few data
19 | 13 | too few data
20 | 4 | too few data
The distribution of word lengths seems to have a lognormal shape, but this guess was not tested.
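The relation between spacebar distances and word lengths used for Table 2 can be sketched as follows. The function name `word_length_counts` is an assumption made for the example. As in the text, a distance of 1 (two adjacent spaces) is skipped, and punctuation attached to a word counts towards its length; words before the first or after the last space are not enclosed by two spaces and are therefore not counted in this simplified version.

```python
from collections import Counter

def word_length_counts(text):
    """Count word lengths as (distance between consecutive spaces) - 1."""
    positions = [i for i, ch in enumerate(text) if ch == " "]
    gaps = (b - a for a, b in zip(positions, positions[1:]))
    return Counter(gap - 1 for gap in gaps if gap > 1)

# "bb" and "ccc" are enclosed by spaces; the double space gives distance 1
print(word_length_counts("a bb  ccc d"))
```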
Notes to some results:
The distribution of one letter words is correlated by the lognormal distribution. There are two peaks, at distances 70-81 and 128-138 (36 and 9 occurrences against 28.4 and 5.6 expected, respectively). Each makes about 14.7 % of the chi-square test value. There is a shortage of distances 13-23 (375 occurrences against 406.9 expected), which contributes about 17.9 % to the chi-square test value. The tail of the distribution is shorter than expected (4 occurrences against 9.1 expected). This makes about 20.7 % of the chi-square test value.
The distribution of two letter words is correlated poorly by the lognormal distribution. The set was divided into 4 parts. Of these, the second part gives the best fit: the chi-square test value is 0.3346 over 11, 0.8075 over 12, and 0.4011 over 13. These words follow each other more often than corresponds to the shape of the distribution. This makes 52.7-92.7 % of the chi-square test value. The two way sample analysis shows that only the first and second parts are similar:
Part | 2 | 3 | 4
1 | 0.537 | *0 | *0.010
2 | | *0 | *0.037
3 | | | *0.032
The distribution of three letter words was divided into four parts, too. The parts are correlated by different distributions:
Part | Type | The chi-square test value (over) | Note
1 | LN | 0.348 (12), 0.553 (13), 0.273 (14) | repeatings make 91.3 %
2 | EX | 0.130 (16-17) | peak 3-4 makes 45 %
3 | NB | 0.148 (11) | shortage of repeatings makes 43.8 %, peak 3-4 makes 39.1 %
4 | LN | 0.036 (12), 0.495 (13), 0.147 (14) | repeatings make 66.5 %
The two way sample analysis shows that only the second and fourth parts are similar:
Part | 2 | 3 | 4
1 | *0 | *0 | *0
2 | | *0 | 0.688
3 | | | *0
The distribution of four letter words was divided into four parts. These words follow each other more often than corresponds to the shape of the exponential distribution, but not by much: at most 28.2 % of the chi-square test value, in the first part. The parts are correlated as follows:
Part | The chi-square test value (over) | Note
1 | 0.005 (15), 0.332 (16), 0.141 (17) | shortage of distances 15-18 makes 25.1 %
2 | 0.258 (9), 0.617 (10), 0.276 (11) | peak 6-8 makes 28.5 %
3 | 0.257 (12-13) | peak 6-8 makes 31.7 %
4 | 0 | peak 2-6 makes 62.3 %
The two way sample analysis shows that only the second and third parts and the third and fourth ones are similar:
Part | 2 | 3 | 4
1 | *0.003 | *0 | *0
2 | | 0.302 | *0.041
3 | | | 0.283
The distribution of five letter words was divided into three parts. These words follow each other slightly more often than corresponds to the shape of the exponential distribution. The parts are correlated as follows:
Part | The chi-square test value (over) | Note
1 | 0.071 (2), 0.389 (3), 0.148 (4) | shortage of distances 17-32 makes 51.3 %
2 | 0.002 | peak 6-8 makes 35.8 %
3 | 0 | peak 8-11 makes 41.7 %
The two way sample analysis shows that the parts are similar:
Part | 2 | 3
1 | 0.723 | 0.250
2 | | 0.723
The fit of the negative binomial distribution to six letter words is fair, worsened by short distances (the chi-square test value is 0.681 over 7, 0.810 over 8, 0.420 over 9).
The distribution of seven letter words is described by the negative binomial distribution (the chi-square test value = 0.0529; it is somewhat improved over 4 to 0.2304). The tail is longer than expected (18 occurrences against 10.1 expected above 82). This makes 24.9 % of the chi-square test value.
The distribution of eight letter words is also the negative binomial one (the chi-square test value = 0.137; it is somewhat improved over 4 to 0.455). The tail is again longer (16 occurrences against 9.8 expected over 1.7). This makes 15.6 % of the chi-square test value.
The distribution of nine letter words is also the negative binomial one (the chi-square test value = 0.126 is improved over 6 to 0.871). The shortage of these words within distances 88-96 (7 occurrences against 13.9 expected) contributes 16.0 % of the chi-square test value.
The distribution of ten letter words is fairly correlated by the negative binomial distribution (the chi-square test value = 0.072 is improved over 25 to 0.605). The shortage of these words within distances 19-28 (107 occurrences against 132.4 expected) contributes 17.0 % of the chi-square test value. There are more distances 145-172 than expected (23 occurrences against 10.1 expected). This makes 31.9 % of the chi-square test value.
The distribution of eleven letter words is fairly correlated by the exponential distribution (the chi-square test value = 0.358). The shortage of these words within distances 160-186 (11 occurrences against 16.5 expected) contributes 18.3 % of the chi-square test value. There are more distances 28-54 than expected (183 occurrences against 159.0 expected). This makes 36.7 % of the chi-square test value.
The distributions of longer words are well correlated with the Weibull distribution. As an example, the results with the 16 letter words are given:
Chisquare Test
Lower Limit | Upper Limit | Observed Frequency | Expected Frequency | Chisquare
at or below | 200.000 | 29 | 30.0 | 0.03323
200.000 | 400.000 | 17 | 16.7 | 0.00504
400.000 | 600.000 | 13 | 10.7 | 0.47842
600.000 | 800.000 | 7 | 7.2 | 0.00345
800.000 | 1200.000 | 5 | 8.2 | 1.27208
above 1200.000 | 3374 | 10 | 8.2 | 0.41292
Chisquare = 2.20513 with 3 d.f. Sig. level = 0.530938
Distances between individual letters
The results for all letters are presented in the form of the table, where the frequencies of all symbols are given and the significance of the performed chi-square tests. Then the commentaries to all symbols of the alphabet are given. The values in the square brackets show the corresponding values of the combined lower and upper cases.
Table 7 Survey of results
Notes:
EX = exponential distribution
WE = Weibull distribution
LN = lognormal distribution
NB = negative binomial distribution
* = the test was not made, since there were not enough data
Statistic = XX, the chi-square test
Symbol | Small | Capital | Both
a | 14387, EX, 0 | 251, WE, 0.913 | 14640, EX, NB, 0
b | 2429, EX, 0 | 113, WE, 0.574 | 2542, WE, 0
c | 4403, NB, 0.300 | 126, WE, 0.932 | 4524, NB
d | 8210, NB, EX | 146, LN, 0.048 | 8356, NB
e | 22812, NB, 0 | 84, WE, 0.465 | 22895, NB, 0
f | 3773, WE, 0 | 269, WE, 0.008 | 4042, WE, 0
g | 3494, WE, 0.296 | 99, WE, 0.924 | 3593, WE
h | 11954, NB, 0 | 445, WE, 0.371 | 12399, WE, NB
i | 1152, LN, 0.137 | 1180, LN, 0.128 | 12332, EX, 0
j | 127, WE, 0.576 | 108, WE, 0.031 | 235, WE, 0.346
k | 1296, WE, 0.033 | 10, no test | 1306, WE, 0.041
l | 6797, WE | 173, EX, 0.194 | 4970, WE
m | 4569, EX | 164, WE, 0.458 | 4733, EX
n | 12201, EX | 304, WE, 0.072 | 12505, EX
o | 13843, EX | 101, WE, 0.077 | 13944, EX
p | 2867, WE | 69, WE, 0.356 | 2936, WE
q | 136, EX, 0.441 | 2, no test | 138, EX, 0.433
r | 10793, EX | 204, WE, 0.109 | 10997, EX
s | 12680, WE, EX | 262, WE, 0.878 | 12942, WE, EX
t | 15486, EX | 525, WE, 0.4522 | 16011, EX
u | 5047, EX | 193, WE, 0.455 | 5076, EX
v | 1735, WE | 11, no test | 1747, WE, 0.020
w | 4335, EX, WE | 260, WE, 0.709 | 4595, EX
x | 278, WE, 0.130 | no test | -
y | 3349, EX | 323, WE, 0.267 | 3672, EX
z | EX | no test | 133, EX, 0.161
For the upper case, the Weibull distribution is the best one for 16 letters. The lognormal distribution fits only 2 cases, the exponential distribution is the best in 3 of the performed tests, and the negative binomial distribution in none.
For the lower case, the Weibull distribution is the best one for 8 letters. The lognormal distribution fits only 1 case, the exponential distribution is the best in 13 of the performed tests, and the negative binomial distribution in 4 cases. For the combined cases, the Weibull distribution is the best one for 10 letters. The lognormal distribution fits no case, the exponential distribution is the best in 12 of the performed tests, and the negative binomial distribution in none. Sometimes the difference between the fits is small and more than one distribution is applicable. The chi-square significance values are sometimes practically zero, and only pooling the shortest distances into a single class, which raises the lowest evaluated distance, increases the significance of the chi-square tests. The commentaries on the individual letters follow.
A
The frequency of the capital A allowed a separate test. A fair result was obtained with the exponential distribution (the chi-square test value 0.378). The excellent fit with the Weibull distribution (the chi-square test value 0.913) is worsened by too many repetitions within distances 2376 to 2850 (12 occurrences against 7.4 expected), which makes 26.0 % of the chi-square test value.
The distribution of distances between the lower case a is exponential, except that there are practically no repeated aa. This fact contributes 58.4-77.3 % to the chi-square value. The lower case a repeats too often within distances 6-14 (1.185-1.336 times the expected values).
The two way sample analysis shows how the parts of the lower case differ:
Part | 2 | 3 | 4 | 5 | 6
1 | 0.368 | *0.034 | *0.026 | *0.000 | 0.390
2 | | 0.217 | 0.178 | *0.008 | 0.956
3 | | | 0.895 | 0.160 | 0.191
4 | | | | 0.209 | 0.155
5 | | | | | *0.006
Note: An asterisk marks a significant difference between the tested parts.
The first sixth differs significantly from the third to fifth ones. The fifth and sixth are different, too.
The most important disturbances from the shape of the distribution in all parts are tabulated:
Part | Range | Observed | Expected | % of chisquare
1 | 6-20 | 1285 | 1015.5 | 29.7
2 | 6-14 | 997 | 728.4 | 18.5
3 | 6-9 | 470 | 390.1 | 8.5
4 | 6-9 | 483 | 390.9 | 11.7
5 | 7-12 | 600 | 453.6 | 20.5
6 | 14-17 | 293 | 183.9 | 25.5
The lower case a repeats too often within a span of one to three words.
The distances between both cases (a + A) are fitted poorly by different distributions. Again, there are practically no repeated Aa. This fact contributes 58-77.6 % to the chi-square value.
The first sixth of (a + A) fits well with the negative binomial distribution with distances pooled up to 16 (the chi-square value = 0.592). Other parts give much worse fits, and other distributions (the exponential distribution and the lognormal distribution) give a better fit for them.
The two way sample analysis of both cases (a + A) gives worse results than for the lower case a:
Part | 2 | 3 | 4 | 5 | 6
1 | 0.457 | *0.025 | *0.023 | *0.000 | 0.245
2 | | 0.131 | 0.120 | *0.004 | 0.679
3 | | | 0.943 | 0.162 | 0.263
4 | | | | 0.193 | 0.242
5 | | | | | *0.011
The first sixth differs significantly from three parts, and its consistency with the other parts is low, too. The most important disturbances in all parts are tabulated again:
Part | Type | Range | Observed | Expected | % of chisquare
1 | NB | 25-27 | 74 | 87.9 | 12.2
 | | 49-54 | 12 | 24.7 | 36.3
2 | NB | 6-18 | 1210 | 1017.8 | 17.1
3 | EX | 6-21 | 1337 | 1135.1 | 19.6
4 | EX | 6-13 | 844 | 713.2 | 14.0
5 | LN | 29-39 | 221 | 172.1 | 29.3
6 | EX | 14-17 | 300 | 187.1 | 25.9
B
The distribution of distances between upper case B is Weibull. The distribution of distances over 20 between lower case b is exponential; the chi-square test value is then 0.614. There are too few b within distances 129-150 (106 occurrences against 128 expected), which contributes 19.3 % of the chi-square test value. Conversely, there are too many b within distances 282-324 (63 occurrences against 46 expected), which contributes 34.3 % of the chi-square test value. The distribution of distances over 20 between (b + B) is exponential; the chi-square test value is then 0.921. Here, however, the Weibull distribution gives an even better chi-square test value, 0.927. The fit is worsened by too many (b + B) within distances 295-316 (31 occurrences against 20.7 expected), which contributes 40.9 % of the chi-square test value. There are too few (b + B) within distances 422-442 (1 occurrence against 5.4 expected), which contributes 28.5 % of the chi-square test value.
Including B improved the fit; the disturbances lessened and shifted to longer distances.
C
The distribution of distances between upper case C is the Weibull one (the chi-square test value is very good, 0.932).
The distribution of distances of the lower case of this letter (and c + C) is described well by three distributions, exponential, negative binomial and Weibull.
The distances between lower case c were split into 3 parts. The results are tabulated:
Part | Type | Chisquare | Range | Observed | Expected | % of chisquare
1 | NB | 0.298, 0.817 over 5 | 76-87 | 57 | 68.8 | 14.9
2 | EX, NB | 0.365, 0.349 | 191-238 | 15 | 23.2 | 25.3
3 | NB, EX | 0.954, 0.954 | 146-169 | 45 | 36.9 | 39.9
The parts are rather different, as two way sample analysis shows:
 | 2. part | 3. part
1. part | *0.043 | *0.000
2. part | | 0.137
The distances between (c + C) were split into 3 parts, too. The results are tabulated as follows:
Part | Type | Chisquare | Range | Observed | Expected | % of chisquare
1 | NB | 0.488 | 72-83 | 53 | 68.4 | 18.7
 | | | 203-226 | 3 | 8.2 | 17.7
2 | EX, NB | 0.284, 0.230 | 49-72 | 245 | 217.6 | 26.3
3 | EX, NB | 0.761, 0.769 | 146-169 | 45 | 36.6 | 25.9
 | | | 265-598 | 6 | 11.1 | 31.1
The parts are rather different, as two way sample analysis shows:
 | 2. part | 3. part
1. part | *0.045 | *0.000
2. part | | 0.118
Combining both cases worsened the fit. It is difficult to choose between the exponential distribution and the negative binomial distribution, both give practically the identical results.
D
Here the exponential distribution and the negative binomial are applicable. The chi-square test values are as follows:
Part | Exponential | Negative binomial
d1 | 0 |
d2 | over 20 = 0.247 |
d3 | over 33 = 0.354 |
d4 | over 30 = 0.329 |
[d + D]1 | 0 |
[d + D]2 | over 19 = 0.763 |
[d + D]3 | over 31 = 0.395 |
[d + D]4 | over 22 = 0.683 |
The frequency of the capital D allowed a separate test. The lognormal distribution correlates poorly; the chi-square value is only 0.048, since there are too many repetitions within distances 1274 to 1909 (20 occurrences against 12.6 expected), which makes 34.3 % of the chi-square test value. The tail is shorter than expected, only 1 occurrence against 5.1 expected, which contributes another 25.9 % of the chi-square test value.
There are too few repeated dd (Dd). This fact contributes 42.4-70.5 % (32.3-63.8 %) to the high chi-square values given in the table above.
The two way sample analysis shows that the parts of the lower case d are different:
Part | 2 | 3 | 4
1 | 0.587 | *0.018 | 0.050
2 | | 0.072 | 0.163
3 | | | 0.686
The third part differs significantly from the first part. Only the third and the fourth parts are similar.
There are always fewer doubled dd than corresponds to the exponential form, which makes 42-70.5 % of the chi-square test value.
The combined [d + D] gives somewhat different results. The two way sample analysis shows that the parts of [d + D] are different, too:
Part | 2 | 3 | 4
1 | 0.316 | *0.022 | *0.016
2 | | 0.208 | 0.163
3 | | | 0.873
The third and fourth parts differ significantly from the first part. Only the third and fourth parts are close.
There are always fewer doubled Dd than corresponds to the exponential form (0-10 occurrences against 23.5-37 expected), which makes 32.3-63.8 % of the chi-square test value.
E
There are relatively few E compared with the great number of e. The distribution of distances between lower case e and both cases (e + E) is mostly negative binomial; some parts fit the lognormal or exponential distributions better:
Part | Negative binomial
e1 | over 15 = 0.538
e2 | over 12 = 0.137
e3 | over 12 = 0.054
e4 | EX
e5 | 0
e6 | over 12 = 0.063
e7 | over 14 = 0.066
e8 | over 20 = 0.093
[e + E]1 | over 15 = 0.529
[e + E]2 | over 15 = 0.135
[e + E]3 | over 14 = 0.052
[e + E]4 | 0
[e + E]5 | over 17 = 0.112
[e + E]6 | 0
[e + E]7 | over 13 = 0.102
[e + E]8 | over 17 = 0.131
The two way sample analysis failed due to too large samples.
F
The distribution of the capital F, of the lower case f, and of [f + F] is correlated with the Weibull distribution. The sets of the lower case f and of [f + F] were each divided into two parts, which are rather different (the two way sample analysis gives 0.0002 and 0.0048, respectively).
The distribution of this letter is distorted by too many double ff [Ff] (e.g. 96 occurrences against 28.7 expected). This makes 84.9 % of the total very high chi-square test value.
G
The distribution of the capital G is correlated with the Weibull distribution rather well. It affects the distribution of the lower case g, divided into two parts, differently in each part:
Part | The chi-square test value |
g | 0.320 | 0.709
g + G | 0.296 | 0.024
The most important distortions:
Part | Range | Observed | Expected | % of chisquare
g1 | 19-36 | 325 | 295.5 | 15.4
 | 277-294 | 11 | 6.1 | 20.9
g2 | 71-117 | 278 | 319.5 | 40.3
[g+G]1 | 277-294 | 11 | 5.9 | 32.9
[g+G]2 | 88-104 | 80 | 114.1 | 31.5
H
The distribution of the capital H is correlated with the Weibull distribution rather well:
Chisquare Test
Lower Limit | Upper Limit | Observed Frequency | Expected Frequency | Chisquare
at or below | 212.074 | 169 | 160.5 | 0.454082
212.074 | 423.148 | 96 | 95.3 | 0.005133
423.148 | 634.222 | 58 | 61.7 | 0.226729
634.222 | 845.296 | 36 | 40.9 | 0.585048
845.296 | 1056.370 | 26 | 27.4 | 0.073882
1056.370 | 1267.444 | 26 | 18.5 | 2.992872
1267.444 | 1478.519 | 8 | 12.6 | 1.695590
1478.519 | 1689.593 | 5 | 8.6 | 1.533205
1689.593 | 1900.667 | 5 | 5.9 | 0.147582
1900.667 | 2322.815 | 7 | 6.9 | 0.000840
above 2322.815 | | 9 | 6.5 | 0.957956
Chisquare = 8.67292 with 8 d.f. Sig. level = 0.370635
The surplus of distances 1057-1267 is followed by the shortage of longer distances.
The frequency of h made it necessary to split the set for the evaluation into four parts, which correlated badly with the negative binomial distribution (the 1. part has the chi-square value 0.315 over 30), but they were still too long for the two way sample analysis. [h + H] was split for the evaluation into six parts, which correlated badly with the negative binomial distribution (e.g. the 3. part has the chi-square value 0.115 over 27).
The two way sample analysis shows how the parts are different:
Part | 2 | 3 | 4 | 5 | 6
1 | *0.001 | *0 | *0 | *0 | *0
2 | | 0.780 | *0 | *0 | 0.370
3 | | | *0 | *0 | 0.533
4 | | | | *0.005 | *0
5 | | | | | *0
I
The distribution of the capital I is correlated poorly with the lognormal distribution. The greatest disturbance is a shortage of counts within distances 305-607 (102 occurrences against 125 expected), which contributes 39.1 % of the chi-square test value. The tail is longer over distances 1516 (17 occurrences against 10.2 expected), which contributes another 49.5 % of the chi-square test value.
The frequency of the lower case i made splitting necessary. The parts are poorly correlated with the exponential distribution, the best being the 5. part (the chi-square test value 0.701 over 5), and they pass the two way sample analysis as follows:
Part | 2 | 3 | 4 | 5 | 6
1 | *0.014 | *0.026 | 0.068 | *0.015 | *0.002
2 | | 0.815 | 0.635 | 0.947 | 0.441
3 | | | 0.631 | 0.864 | 0.316
4 | | | | 0.506 | 0.132
5 | | | | | 0.397
Only the first part differs significantly from the others; even its result with the fourth part (0.068) is only slightly above the limit of rejection. There are no repeated ii. This makes 55.6-78.2 % of the chi-square test value. The distribution is more skewed; there is always a surplus of intermediate distances:
Part | Range | Observed | Expected | % of chisquare
i1 | 6-21 | 860 | 744.5 | 15.1
i2 | 7-18 | 709 | 567.2 | 23.3
i3 | 7-28 | 954 | 803.1 | 23.1
i4 | 7-26 | 950 | 871.8 | 6.2
i5 | 7-31 | 1077 | 979.5 | 11.8
i6 | 8-28 | 930 | 795.7 | 23.1
Including I changed the results of the two way sample analysis as follows:
Part | 2 | 3 | 4 | 5 | 6
1 | *0.001 | *0.002 | *0 | *0 | *0.022
2 | | 0.756 | 0.176 | 0.169 | 0.230
3 | | | 0.091 | 0.084 | 0.453
4 | | | | 0.984 | *0.017
5 | | | | | *0.014
In most cases, the similarity is worse. Only the fourth and fifth parts are less different. There are no repeated Ii. This makes 60.5-70.1 % of the chi-square test value. The distribution is more skewed; there is always a surplus of intermediate distances:
Part | Range | Observed | Expected | % of chisquare
[i+I]1 | 5-14 | 735 | 604.5 | 21.6
[i+I]2 | 7-12 | 495 | 371.2 | 24.0
[i+I]3 | 6-15 | 728 | 616.0 | 13.8
[i+I]4 | 8-21 | 831 | 687.9 | 21.3
[i+I]5 | 6-30 | 1256 | 1084.7 | 18.2
[i+I]6 | 7-23 | 1028 | 861.2 | 24.0
J
The distribution of the letter is the Weibull one. The Weibull distribution of the lower case j is better correlated than that of both cases [j + J]. There are too many distances 874-1310 (22 occurrences against 16.7 expected). This makes 25.0 % of the chi-square test value. Conversely, there are too few distances 2619-3055 (10 occurrences against 6.4 expected). This makes 31.2 % of the chi-square test value. Combining both cases worsened the fit. There are too many distances 963-1284 (52 occurrences against 40.4 expected). This makes 44.2 % of the chi-square test value.
K
The fit of the Weibull distribution to this letter is bad. There are no repeated kk [Kk]. This makes 18.5 [19.3] % of the chi-square test value.
L
The occurrence of the capital L is correlated by the exponential distribution. There are too many distances 2653-3173 (14 occurrences against 11.3 expected). This makes 54.1 % of the chi-square test value.
The frequency of l and [l + L] made splitting necessary.
The parts are correlated with the Weibull distribution. It is distorted by many double ll [Ll]. This makes 74.8-79.9 % [74.8-81.2 %] of the total chi-square test value. The parts fit rather well over different distances, see table:
Part | Cut | Chisquare
l 1 | 11 | 0.903
2 | 35 | 0.967
3 | 24 | 0.967
4 | 35 | 0.208
5 | 20 | 0.097
l+L 1 | 11 | 0.987
2 | 0 |
3 | 30 | 0.925
4 | 36 | 0.092
5 | 10 | 0.579
The parts of l pass the two way sample analysis, as follows:
Part | 2 | 3 | 4 | 5
1 | 0.850 | 0.518 | *0.013 | *0.041
2 | | 0.647 | *0.023 | 0.065
3 | | | 0.074 | 0.173
4 | | | | 0.670
The fourth part differs significantly from the first and second ones, the first from the fifth one.
Including L changed the results only slightly, see table:
Part | 2 | 3 | 4 | 5
1 | 0.831 | 0.430 | *0.012 | *0.039
2 | | 0.567 | *0.022 | 0.067
3 | | | 0.095 | 0.219
4 | | | | 0.661
M
The upper case M is correlated well using the Weibull distribution (the chi-square test value is 0.458).
The lower case m, divided into 3 parts, is best correlated with the exponential distribution (1. part: the chi-square test value over 44 is 0.798; 2. part: the chi-square test value = 0.443; 3. part: the chi-square test value = 0.137). The doubled mm fit excellently only in the second part; in the other parts, the repeated mm is scarcer than expected.
The parts of the distribution of m are different. The two way sample analysis shows following results:
Part | 2 | 3
1 | *0.001 | *0.004
2 | | 0.712
Both cases [m + M], divided into 3 parts, are best correlated with the exponential distribution (1. part: the chi-square test value over 14 is 0.849; 2. part: the chi-square test value = 0.096; 3. part: the chi-square test value = 0.082). The doubled Mm fit only in the second part; in the other parts, they are scarcer than expected.
The parts of the distribution of [m + M] are different, too. The two way sample analysis shows following results:
Part | 2 | 3
1 | *0.006 | *0.004
2 | | 0.835
N
The upper case N is correlated using the Weibull distribution (the chi-square test value is 0.072). There are too few distances 1600-2000 (7 occurrences against 13.3 expected). This makes 25.6 % of the chi-square test value.
The distribution of n and (n + N) was divided into seven parts.
The distribution of this letter is distorted by too few double nn [Nn] (e.g. 10 occurrences against 93 expected). This makes 44.0-67.9 % of the total very high chi-square test value. Some parts show rather great disturbances:
Part | Range | Observed | Expected | % of chisquare | Chisquare
1 | 6-10 | 419 | 283.0 | 38.8 | 0.152 over 10
3 | 18-22 | 202 | 134.8 | 28.2 | 0.242 over 20
4 | 6-9 | 330 | 265.5 | 13.8 | 0.073 over 16
5 | 6-16 | 644 | 736.6 | 15.2 | 0.114 over 10
6 | 6-22 | 944 | 784.7 | 27.0 | 0.117 over 37
The two way sample analysis shows following results:
Part | 2 | 3 | 4 | 5 | 6 | 7
1 | *0.012 | 0.097 | 0.355 | 0.882 | 0.780 | 0.146
2 | | 0.388 | 0.107 | *0.008 | 0.107 | 0.274
3 | | | 0.457 | 0.071 | 0.457 | 0.823
4 | | | | 0.284 | 0.508 | 0.598
5 | | | | | 0.667 | 0.110
6 | | | | | | 0.230
[n + N]:
Some parts show rather great disturbances:
Part | Range | Observed | Expected | % of chisquare | Chisquare
1 | 6-10 | 425 | 288.5 | 32.8 | 0.094 over 9
2 | 6-10 | 359 | 273.8 | 18.2 | 0.074 over 31
3 | 18-22 | 198 | 136.3 | 24.4 | 0.736 over 25
4 | 6-9 | 332 | 269.0 | 12.6 | 0.171 over 35
5 | 7-16 | 657 | 545.4 | 21.3 | 0.114 over 10
6 | 6-22 | 958 | 795.9 | 26.5 | 0.138 over 37
The disturbances have slightly less weight than for the lower case n.
The two way sample analysis shows for [n + N] the following results:
Part | 2 | 3 | 4 | 5 | 6 | 7
1 | *0.014 | 0.169 | 0.209 | 0.987 | 0.595 | 0.078
2 | | 0.278 | 0.234 | *0.015 | 0.051 | 0.480
3 | | | 0.911 | 0.170 | 0.384 | 0.701
4 | | | | 0.209 | 0.457 | 0.621
5 | | | | | 0.589 | 0.080
6 | | | | | | 0.211
Both sets are alike; including N did not change the results of the two way sample analysis dramatically. The second part differs significantly from the first and fifth ones.
O
The distribution of O can be correlated also with the Weibull distribution (the chi-square test value 0.077). There are too many distances 3572-4714 (14 occurrences against 7.3 expected). This makes 35.1 % of the chi-square test value.
The distribution of o and (o + O) was divided into seven parts, which correlated poorly with the exponential distribution.
The distribution of this letter is distorted only slightly by too few double oo [Oo] (at most in the first part of o, 72 occurrences against 119 expected). This makes 40.6 % of the total very high chi-square test value. Here are the greatest disturbances:
Part | Range | Observed | Expected | % of chisquare | Chisquare
1 | 7-11 | 427 | 369.2 | 19.8 | 0.122 over 25
2 | 2-5 | 336 | 420.8 | 19.0 | 0.089 over 17
 | 6-9 | 395 | 325.7 | 16.4 |
 | 14-22 | 421 | 346.1 | 18.8 |
3 | 2-4 | 233 | 334.1 | 25.9 | 0.117 over 10
 | 11-14 | 262 | 184.6 | 27.5 |
4 | 7-18 | 778 | 653.0 | 36.5 | 0.652 over 26
5 | 14-22 | 522 | 346.3 | 31.8 | 0.084 over 20
6 | 2-5 | 310 | 403.2 | 22.7 |
 | 14-22 | 432 | 344.4 | 27.3 |
7 | 14-22 | 408 | 345.8 | 18.2 | 0.155 over 30
In four parts, an excess of distances 14-22 occurs.
The two-way sample analysis shows the following results:
| Part | 2 | 3 | 4 | 5 | 6 | 7 |
| 1 | 0.608 | 0.276 | 0.149 | 0.390 | *0.042 | 0.656 |
| 2 | | 0.565 | 0.342 | 0.726 | 0.124 | 0.941 |
| 3 | | | 0.689 | 0.824 | 0.323 | 0.511 |
| 4 | | | | 0.541 | 0.575 | 0.302 |
| 5 | | | | | 0.231 | 0.669 |
| 6 | | | | | | 0.103 |
The first part differs significantly from the sixth one.
[o + O]:
Some parts show rather large disturbances:
| Part | Range | Observed | Expected | % of chisquare | Chisquare |
| 1 | 6-9 | 404 | 328.7 | 22.5 | 0.162 over 31 |
| | 18-22 | 205 | 150.6 | 18.4 | |
| 2 | 39-43 | 69 | 42.1 | 18.2 | 0.074 over 31 |
| 3 | 6-9 | 378 | 318.6 | 17.2 | 0.248 over 45 |
| | 27-30 | 125 | 91.7 | 18.7 | |
| 4 | 7-12 | 480 | 378.7 | 31.2 | |
| 5 | 14-22 | 426 | 344.6 | 31.0 | 0.128 over 25 |
| 6 | 2-5 | 308 | 402.7 | 21.0 | |
| | 14-22 | 432 | 343.3 | 24.3 | |
| 7 | 14-22 | 411 | 344.1 | 18.4 | |
The excess of distances 14-22 occurs again. The disturbances have slightly less weight than for the lower case o.
The two-way sample analysis shows the following results for [o + O]:
| Part | 2 | 3 | 4 | 5 | 6 | 7 |
| 1 | 0.443 | 0.128 | 0.114 | 0.229 | 0.050 | 0.5378 |
| 2 | | 0.449 | 0.456 | 0.770 | 0.229 | 0.873 |
| 3 | | | 0.930 | 0.640 | 0.653 | 0.354 |
| 4 | | | | 0.584 | 0.722 | 0.319 |
| 5 | | | | | 0.359 | 0.648 |
| 6 | | | | | | 0.169 |
There is no significant difference between parts.
P
The upper case P also correlates with the Weibull distribution (chi-square test value = 0.356). The Weibull distribution of p and [p + P] is distorted by too many doubled pp [Pp] (69.8-80.8 [72.2-73.6] % of the chi-square test value). The other disturbances offer little to comment on. Both sets were divided into three parts. The test did not reveal any dissimilarity between them.
Q
This letter also correlates with the Weibull distribution (chi-square test value = 0.367 for q, 0.389 for [q + Q]), but the exponential distribution gives a better fit (chi-square test value = 0.441 for q, 0.433 for [q + Q]).
R
The upper case R correlates with the Weibull distribution (chi-square test value = 0.109). The fit is worsened by too many repetitions at distances from 1000 to 2000 (45 occurrences against 33.3 expected), which makes 40.4 % of the chi-square test value.
The distribution of r and [r + R] was divided into five parts. They fit with the exponential distribution.
The most important disturbances from the shape of the distribution in all parts of r are tabulated:
| Part | Range | Observed | Expected | % of chisquare | Chisquare |
| 1 | 1 | 31 | 91.9 | 25.1 | 0.103 over 24 |
| | 2-5 | 263 | 401.5 | 29.8 | |
| | 6-20 | 973 | 791.6 | 26.9 | |
| 2 | 1 | 36 | 94.5 | 30.2 | 0.097 over 20 |
| | 8-21 | 831 | 689.8 | 26.5 | |
| 3 | 1 | 45 | 105.6 | 23.8 | 0.195 over 34 |
| | 10-18 | 628 | 471.9 | 41.4 | |
| 4 | 2-5 | 305 | 464.5 | 48.0 | |
| 5 | 10-18 | 570 | 454.1 | 32.2 | |
The two way sample analysis shows that the parts of the lower case r are rather different:
| Part | 2 | 3 | 4 | 5 |
| 1 | 0.351 | *0 | *0 | 0.178 |
| 2 | | *0 | *0 | 0.671 |
| 3 | | | 0.334 | *0 |
| 4 | | | | *0 |
The third and fourth parts differ significantly from all other parts, but not from each other (test value 0.334).
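Reading such a triangular table can be made mechanical. The sketch below assumes that the asterisk marks test values below 0.05 (the threshold itself is not stated in the text) and that the entry for parts 1 and 5 is 0.178. The names `p_values` and `significant_pairs` are hypothetical.

```python
ALPHA = 0.05  # assumed significance threshold behind the asterisks

# Two-sample test values for the lower case r, keyed by (part_i, part_j).
# The entry (1, 5) is taken as 0.178 (assumption).
p_values = {
    (1, 2): 0.351, (1, 3): 0.0, (1, 4): 0.0, (1, 5): 0.178,
    (2, 3): 0.0, (2, 4): 0.0, (2, 5): 0.671,
    (3, 4): 0.334, (3, 5): 0.0,
    (4, 5): 0.0,
}


def significant_pairs(table, alpha=ALPHA):
    """Return the pairs of parts whose test value lies below alpha."""
    return sorted(pair for pair, value in table.items() if value < alpha)


for i, j in significant_pairs(p_values):
    print(f"part {i} differs from part {j}")
```

The same dictionary layout works for the seven-part tables of the more frequent letters.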
The most important disturbances from the shape of the distribution in all parts of [r + R] are tabulated:
| Part | Range | Observed | Expected | % of chisquare | Chisquare |
| 1 | 1 | 31 | 99.8 | 24.4 | 0.099 over 19 |
| | 2-5 | 254 | 393.3 | 31.2 | |
| | 6-20 | 957 | 779.8 | 25.9 | |
| 2 | 1 | 36 | 93.2 | 31.3 | |
| | 8-21 | 831 | 682.7 | 27.1 | |
| 3 | 1 | 45 | 104.0 | 27.6 | 0.239 over 17 |
| | 10-18 | 617 | 467.0 | 41.5 | |
| 4 | 2-5 | 297 | 456.5 | 48.7 | |
| 5 | 10-18 | 562 | 448.8 | 31.6 | |
The two way sample analysis shows that the parts of [r + R] are rather different:
| Part | 2 | 3 | 4 | 5 |
| 1 | 0.220 | *0 | *0 | 0.115 |
| 2 | | *0 | *0 | 0.716 |
| 3 | | | 0.400 | *0 |
| 4 | | | | *0 |
The third and fourth parts differ significantly from all other parts. Combining r with R changed the distribution insignificantly.
S
The Weibull distribution of the capital S is distorted mostly by a valley within distances 1600-2000 (10 occurrences against 14.2 expected). This makes 50.8 % of the chi-square test value.
The distributions of the lower case s and [s + S] were divided into five parts.
The most important disturbances from the shape of the distribution in all parts of s are tabulated:
| Part | Range | Observed | Expected | % of chisquare | Chisquare |
| 1 | 1 | 115 | 70.1 (WE) | 44.1 | WE 0.669 over 33, EX 0.569 over 30 |
| | 26-33 | 171 | 226.8 | 21.1 | |
| 2 | 2-5 | 285 | 419.6 | 44.5 | EX 0.921 over 22 |
| | 16-20 | 284 | 210.3 | 26.6 | |
| 3 | 1 | 96 | 63.7 | 25.1 | WE 0.105 over 18 |
| | 2-6 | 294 | 374.6 | 26.5 | |
| 4 | 2-6 | 338 | 455.2 | 36.2 | |
| | 7-12 | 453 | 353.1 | 38.6 | |
| 5 | 2-7 | 452 | 529.0 | 26.2 | EX 0.155 over 10 |
The two way sample analysis shows that the parts of the lower case s are rather similar, except the fifth part which differs from the first and third parts:
| Part | 2 | 3 | 4 | 5 |
| 1 | 0.300 | 0.845 | 0.130 | *0.019 |
| 2 | | 0.216 | 0.620 | 0.175 |
| 3 | | | 0.087 | *0.011 |
| 4 | | | | 0.389 |
The most important disturbances from the shape of the distribution in all parts of [s + S] are tabulated:
| Part | Range | Observed | Expected | % of chisquare | Chisquare |
| 1 | 1 | 115 | 72.1 | 40.1 | WE, 0.234 over 32 |
| | 26-33 | 170 | 231.0 | 25.5 | |
| 2 | 2-5 | 303 | 376.7 | 22.9 | WE, 0.273 over 32 |
| 3 | 2-5 | 310 | 450.2 | 48.7 | EX, 0.091 over 28 |
| | 11-15 | 348 | 276.8 | 20.4 | |
| 4 | 2-6 | 355 | 474.7 | 32.8 | EX, 0.531 over 31 |
| | 7-12 | 474 | 366.3 | 34.4 | |
| 5 | no test |
The two way sample analysis shows that the parts of [s + S] are rather similar:
| Part | 2 | 3 | 4 | 5 |
| 1 | 0.215 | 0.833 | 0.59 | 0.188 |
| 2 | | 0.306 | 0.502 | 0.241 |
| 3 | | | 0.094 | 0.197 |
| 4 | | | | 0.275 |
T
The distribution of the capital T has the Weibull shape. There are more distances 544-679 than expected (54 occurrences against 43.3 expected). This makes 26.2 % of the chi-square test value.
The distributions of the lower case t and of [t + T] were divided into six parts.
The most important disturbances from the shape of the exponential distribution in all parts of t are tabulated:
| Part | Range | Observed | Expected | % of chisquare | Chisquare |
| 1 | 1 | 82 | 193.2 | 46.1 | 0.429 over 31 |
| | 5-16 | 1410 | 1204.3 | 34.7 | |
| 2 | 1 | 72 | 203.3 | 44.2 | 0.461 over 24 |
| | 5-16 | 1513 | 1226.8 | 35.2 | |
| 3 | 1 | 82 | 204.4 | 50.2 | 0.119 over 47 |
| | 23-29 | 279 | 225.9 | 10.4 | |
| 4 | 1 | 64 | 199.8 | 38.0 | 0.490 over 36 |
| | 5-20 | 1808 | 1448.8 | 37.3 | |
| 5 | 1 | 61 | 208.1 | 58.6 | 0.176 over 33 |
| | 5-16 | 1462 | 1238.1 | 23.8 | |
There are too few doubled tt in all parts. Moreover, the shape of the distribution is rather sharp in the range 5-20, except in the third part.
The most important disturbances from the shape of the distribution in all parts of [t + T] are tabulated:
| Part | Range | Observed | Expected | % of chisquare | Chisquare |
| 1 | 1 | 82 | 205.6 | 44.4 | 0.117 over 25 |
| | 5-16 | 1481 | 1243.5 | 29.8 | |
| 2 | 1 | 73 | 219.1 | 42.6 | 0.383 over 25 |
| | 5-16 | 1604 | 1272.4 | 38.4 | |
| 3 | 1 | 82 | 218.2 | 54.2 | 0.039 over 21 |
| | 5-16 | 1590 | 1384.1 | 22.1 | |
| 4 | 1 | 64 | 213.1 | 36.9 | 0.205 over 45 |
| | 5-20 | 1896 | 1498.6 | 36.4 | |
| 5 | 1 | 60 | 221.0 | 54.1 | 0.075 over 37 |
| | 5-16 | 1533 | 1276.5 | 26.0 | |
The results are not changed too much.
U
The distribution of the capital U has the Weibull shape.
The sets of u and [u + U] were each divided into two parts, which are similar (two-sample analysis 0.326 [0.318]). There are no doubled uu or Uu (0 [0] occurrences against 54.0-55.6 [54.6, 56.2] expected). This makes 55.4 and 44.4 [56.4 and 44.1] % of the chi-square test value of the exponential distribution. When the lower limit is set to 30 for the first u set [28 for the first U set], the chi-square test value of the first part of u [u + U] improves to 0.701 [0.865]. The second parts give poorer fits.
V
The Weibull distribution of [v + V] gives a poor fit. There are no doubled vv (0 occurrences against 9.8 expected). This makes 46.2 % of the chi-square test value. The tail is longer than expected (18 occurrences over 634 against 10.7 expected). This makes another 23.9 % of the chi-square test value.
W
The Weibull distribution of the upper case W gives a good fit. There is a shortage of the distances 1840-2160 (5 occurrences against 9.4 expected). This small difference alone makes 44.5 % of the chi-square test value.
The sets w and [w + W] were divided into three parts. The exponential distribution of w gives a fair fit, see the following table:
| Part | Range | Observed | Expected | % of chisquare | Chisquare |
| 1 | 1 | 2 | 25.6 | 57.3 | 0.758 over 20 |
| | 108-125 | 37 | 57.2 | 18.7 | |
| 2 | 1 | 1 | 26.3 | 58.6 | 0.242 over 10 |
| 3 | 1 | 0 | 29.3 | 52.3 | 0.711 over 45 |
| | 31-45 | 249 | 203.2 | 18.4 | |
The two way sample analysis shows that the third part of w differs from the first two thirds:
| Part | 2 | 3 |
| 1 | 0.440 | *0 |
| 2 | | *0.007 |
The exponential distribution fits best in the first two thirds; the last third is better correlated by the Weibull distribution, see the following table:
| Part | Range | Observed | Expected | % of chisquare | Chisquare |
| 1 | 1 | 2 | 28.9 | 59.4 | 0.581 over 10 |
| | 16-30 | 309 | 276.7 | 8.9 | |
| 2 | 1 | 1 | 29.7 | 64.0 | 0.250 over 10 |
| 3 | 1 | 0 | 16.7 | 44.8 | 0.228 over 19 |
| | 166-180 | 16 | 9.2 | 13.6 | |
The two way sample analysis shows that the third part of w differs from the first two thirds:
| Part | 2 | 3 |
| 1 | 0.471 | *0.001 |
| 2 | | *0.015 |
X
The Weibull distribution of [x + X] gives a poor fit. There is a shortage of the distances 1141-1368 (8 occurrences against 16.6 expected). This makes 32.5 % of the chi-square test value.
Y
The exponential distribution of the upper case Y gives an acceptable fit. There is a peak of the distances 2216-2585 (10 occurrences against 6.2 expected). This minor difference makes 30.8 % of the chi-square test value.
The sets y and [y + Y] were divided into two parts. The exponential distribution of y gives a poor fit, see the following table:
| Part | Range | Observed | Expected | % of chisquare | Chisquare |
| 1 | 1 | 1 | 25.6 | 60.0 | 0.436 over 70 |
| 2 | 1 | 1 | 22.9 | 46.0 | 0.161 over 40 |
| | 136-185 | 84 | 104.5 | 20.0 | |
The two way sample analysis shows that they differ significantly (test value 0.003).
The exponential distribution of [y + Y] gives a poor fit, too, see the following table:
| Part | Range | Observed | Expected | % of chisquare | Chisquare |
| 1 | 1 | 1 | 28.2 | 71.7 | 0.162 over 30 |
| 2 | 1 | 1 | 25.1 | 46.7 | 0.200 over 40 |
| | 139-185 | 82 | 103.5 | 18.9 | |
The two way sample analysis shows that they differ significantly (test value 0.001).
Z
The exponential distribution of this letter is distorted by too few occurrences within distances 638-1273 (17 occurrences against 28.4 expected). This makes 49.9 % of the chi-square test value.
Discussion
The limited capacity of the software for long lists forced the splitting of the most frequent signs. The splitting was made before the distances were determined. Surprisingly, the resulting parts are not always comparable, since the split parts contain different numbers of signs. This leads to different mean distances between them.
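The effect of splitting before determining distances can be illustrated on a toy string. This is only a sketch under stated assumptions: the function `distances` follows the description of position-index differences given in the introduction, but it is not Rádl's program, and the sample sentence is not taken from the novel.

```python
def distances(text, symbols):
    """Differences between successive 1-based positions of the given symbols."""
    positions = [i for i, ch in enumerate(text, start=1) if ch in symbols]
    return [b - a for a, b in zip(positions, positions[1:])]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Hypothetical sample text, split into two halves before counting.
text = "to be or not to be that is the question whether tis nobler"
half = len(text) // 2
parts = {"whole": text, "first half": text[:half], "second half": text[half:]}

for name, part in parts.items():
    d = distances(part, "tT")
    # Number of signs found and the mean distance between them.
    print(name, len(d) + 1, round(mean(d), 2))
```

Each part yields its own number of signs and its own mean distance, and the one distance that spans the cut is lost.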
Some distributions of distances between consonants are highly regular, especially in their tails, if the short distances inside words are pooled. They are described with differing precision by four distributions: exponential, Weibull, lognormal and negative binomial. Sometimes it is rather difficult to decide which distribution gives the better fit.
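Pooling the short distances amounts to merging the lowest classes before expected counts are compared with observed ones. The sketch below assumes a continuous exponential law fitted by its mean, which is a simplification for integer distances and not necessarily what Statgraphics does. The class edges, the sample size and the mean distance are hypothetical.

```python
import math


def expected_counts(n, mean_distance, edges):
    """Expected class counts n * (F(hi) - F(lo)) for an exponential law."""
    def cdf(x):
        return 1.0 - math.exp(-x / mean_distance)
    return [n * (cdf(hi) - cdf(lo)) for lo, hi in zip(edges, edges[1:])]


def pool_below(edges, lower_limit):
    """Merge all classes that end below lower_limit into a single class."""
    return [edges[0]] + [e for e in edges[1:] if e >= lower_limit]


# Hypothetical class edges; the last class is the open tail.
edges = [0, 1, 2, 5, 10, 20, math.inf]
pooled = pool_below(edges, 5)

print(pooled)
print([round(e, 1) for e in expected_counts(1000, 12.0, pooled)])
```

Pooling leaves the tail classes untouched, so a regular tail remains visible even when the distances inside words fit poorly.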
If the results are compared with the published analyses of Shakespeare's Sonnets and Matthew's Gospel, many differences can be observed. Doyle used words differently from the older authors. In particular, the Weibull distribution appears more often.
Some peaks are obviously the result of repeated phrases. This conclusion should be confirmed by stylistic analysis.
REFERENCES
1. Kunz M., See papers of this series on the page.