This paper was submitted to Scientometrics but was rejected. This is the scanned version of the manuscript with some minor corrections.

ARE INFORMATION DISTRIBUTIONS STATIONARY OR ARE THEY STATIONERY?

Milan KUNZ

Chemopetrol, Research Institute of Macromolecular Chemistry, 656 49 Brno, Czechoslovakia

Parameters of stationary distributions should be independent of time, according to the generally accepted theories. It is shown on literature data that the slope constant B of the Lotka law depends on time and that its first moment does not depend on the sample size. This is against Haitun's conjectures.

Introduction

It is a pity that some basic questions about extremely skewed distributions of information still remain open despite all theoretical and empirical efforts, and that such discussions as mine [1] with Haitun [2] are still possible and necessary.

Haitun's conjectures concern stationary information distributions. These are extremely skewed and have many specific properties.

For the Lotka and Zipf laws I used the picturesque term: paper dragons with fuzzy tails [3]. Because paper dragons are stationery, the difference between my view and Haitun's seems insignificant, the two terms differing by only one letter, but clearing up this difference is not so easy.

Stationary, according to Haitun [4], should mean "involving no time dependence"; that is, a distribution should be independent of time.

To be time independent, the derivative of a stationary distribution F(x) with respect to time should be zero,

dF(x)/dt = 0

in all time intervals in which the distribution exists. Because this is not fulfilled, and because such a mathematical definition may be too strict, we must seek some other plausible explanation. We can suppose that information behaves as a river: it flows constantly and we observe its instantaneous properties, where an instant dt can be a relatively long time.

Such information systems behave as if they were without memory.

Now, do we consider seasonal fluctuations of the flow as a dependence of a river on time or not? We can argue that they are only seasonal and that yearly means remain constant, and neglect these variations, to give stationary as broad a meaning as possible. But there are also information systems with memory. Their instantaneous properties accumulate as water in dam lakes on rivers. We need not complicate our analysis with cases where some information is lost during accumulation (as water is lost from lakes by evaporation), for example by obsolescence.

The problem with the notion of stationarity is: which properties of information should remain independent of time? Accumulated amounts are surely a function of time and are thus not time independent. For water it can be its density or temperature. But what is time independent for information?

Because we speak about distributions, and the moments of Zipfian distributions should depend on the sample size, it must be the shape of the distribution itself that does not depend on time; for the Lotka and Zipf distributions, their normalized parameters.

The purpose of this paper is to clarify the relations between moments, parameters and time.

Moments and parameters

Let us have an information system M with m units of information partitioned between n information vectors [5]. This system is characterized by its position in the n-dimensional vector space known in physics as the phase space. If we study the statistics of a system M, we are not interested in its exact position, but only in its statistical properties, which are characterized by the partition sums. The first sum counts the different information vectors (e.g. symbols of the alphabet, authors)

m_0 = n = Σ_j 1 = Σ_j m_j^0 = Σ_k n_k m_k^0        (1)

where the summation over the index j goes from 1 to n, the summation over the index k goes from 1 to m (but n_k is mostly zero), and m_j are the numbers of information units characterized by the index j. The first sum can be formulated in different ways: first by counting the information vectors directly, then as zero moments (this operation transforms all m_j to 1), and finally by counting the number n_k of information vectors having the same number of information units m_k.

The second sum counts the repetitions of the information vectors (e.g. the total number of individual symbols of the alphabet, papers attributed to the j-th author)

m_1 = m = Σ_j m_j^1 = Σ_k n_k m_k^1        (2)

where, as before, the summation over the index j goes from 1 to n and the summation over the index k goes from 1 to m (the index k is used for counting vectors with equal values m_j, but n_k is mostly zero). Generally, for the sums of higher moments:

m_p = Σ_j m_j^p = Σ_k n_k m_k^p        (3)

It is necessary to notice that if the values m_j are ranked, m_j ≥ m_(j+1), then the summation in the equations is merely made in reverse order. Therefore, there is no substantial difference between rank and frequency distributions, except for different scales.
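As a minimal numerical sketch of equations (1)-(3), the following Python fragment checks that the direct summation over vectors and the grouped summation over equal values give the same sums; the counts are purely hypothetical:

```python
from collections import Counter

# Hypothetical system M: numbers of information units m_j for n = 6 vectors.
m = [5, 3, 1, 1, 2, 1]

def moment_sum(m, p):
    """Partition sum m_p computed directly over vectors: sum_j m_j^p."""
    return sum(mj ** p for mj in m)

def moment_sum_grouped(m, p):
    """The same sum over groups: sum_k n_k m_k^p, where n_k counts
    the vectors sharing the value m_k."""
    nk = Counter(m)
    return sum(n * v ** p for v, n in nk.items())

print(moment_sum(m, 0))                 # equation (1): m_0 = n = 6
print(moment_sum(m, 1))                 # equation (2): m_1 = m = 13
for p in range(4):                      # equation (3) for several p
    assert moment_sum(m, p) == moment_sum_grouped(m, p)
```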

The p-th roots of the sums m_p are known as distances; e.g. m_2^0.5 is the Euclidean length of the vector M. All vectors M having the constant sum m form the plane simplex. They are hypotenuses of right triangles having the vector (m/n)I_n as a leg; thus the plane simplex is orthogonal to the unit diagonal vector I_n [6].

The probability theory neglects the orientation of vectors, transforming the n-dimensional space into the one-dimensional space of a stochastic variable. Moreover, instead of the Euclidean space it uses the Hilbert space, where n can go to infinity, and it introduces the probability density p_k = n_k/n into the definition of the sums m_p. Then m_0 = 1 always, the space is normalized, and we speak about moments instead of distances.

Central moments m'_p are also defined; they are determined as distances from the mean m/n:

m'_p = Σ_k p_k (m_k - m/n)^p

There is always m'_1 = 0, which is in accord with our definition of the plane simplex: all its points lie in the same plane at the distance m_1 = m from the origin of coordinates.
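Continuing the same hypothetical example, a short check that the normalized zeroth moment is 1 and the first central moment vanishes:

```python
from collections import Counter

m = [5, 3, 1, 1, 2, 1]                 # hypothetical counts as above
n = len(m)
mean = sum(m) / n                      # the mean m/n

p = {mk: nk / n for mk, nk in Counter(m).items()}   # p_k = n_k / n

def central_moment(order):
    """m'_p = sum_k p_k (m_k - m/n)^p."""
    return sum(pk * (mk - mean) ** order for mk, pk in p.items())

assert abs(sum(p.values()) - 1.0) < 1e-9    # normalization: m_0 = 1
assert abs(central_moment(1)) < 1e-9        # m'_1 = 0 always
print(central_moment(2))                    # second central moment (variance)
```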

It is clear that distances in a space depend on its measure, and only normalized moments can be independent of the sample size. Jones and Furnas [6] discussed in detail the problems connected with the normalization of moments, and it is not necessary to repeat their arguments.

In statistics, our aim is to determine the state of a system not by the exact equation (2) but by a functional relation

n_k = f(m_k) or p_k = f(m_k).

This function has parameters. For some types of distributions, the parameters are identical with the moments, which is somewhat confusing but irrelevant to our discussion.

Now we repeat our question: what does it mean that an information distribution is time independent? Is it the moments or the parameters that are independent? Because information distributions should be Zipfian and their moments should depend on the sample size, which grows with time, it must be the parameters that remain constant.

But what if the moments remain constant and the parameters change? Are such distributions stationary, or is this hypothesis only stationery filling space on library shelves?

One of the tasks of probability theory is to determine the unknown moments and parameters of whole systems from their samples. But to do so, we must make some assumptions. Usually it is supposed that the unknown parts do not differ substantially from the known ones. But Zipfian distributions with their infinite moments must differ from their known parts with finite moments, and we ask whether these models really correspond to objective reality, which is finite.

A special branch of statistics is devoted to the problems of sampling nonhomogeneous systems. For information distributions we encounter two different situations: closed systems which are completely accessible, as e.g. books are, and open systems from which only samples can be obtained, as e.g. the spoken language of large collectives.

To simplify the situation, we will deal only with closed systems. These can be ordered as in Fig. 1, which shows the different orderings of information units leading to all three laws.

Fig. 1. Matrices of information distributions.

A Accession matrix. Information vectors are ordered by their accession into a system.

B Index matrix. Information units are ordered as in indexes or catalogs.

C Bradford matrix. Indexed vectors are ranked according to their m_j values. Their contour corresponds to the Bradford law.

D Zipf matrix. Information vectors are grouped at the matrix base. Their contour corresponds to the Zipf law.

E Lotka matrix. The differences of the numbers n_k are ordered against the m_k values. Their contour corresponds to the Lotka law.

F Matrix partition. Information vectors are separated into two submatrices.

If information vectors j are indexed simply by their accession numbers i, as words in texts or books in accession catalogues (Fig. 1A), there is the problem of how to determine the individual frequencies m_j. Sliding moments are ever increasing, and they are determined exactly only after the whole system is closed and evaluated.

If information vectors are partially ordered as in catalogues or indexes (Fig. 1B), the frequencies m_j are known and the sliding moments are fluctuating functions. If the ordering is not biased with respect to the m_j values (alphabetical order can be nationally biased, for example), it is not necessary to evaluate whole systems; we can use moment estimates from samples. Lotka made full counts of the letters A and B of the Name Index of Chemical Abstracts [7]. His moments can be accepted as good estimates of the moments of the whole decennial index, because when Vlachý [8] used random samples, he obtained similar results from smaller samples.

Yet another partition of information is possible. In Fig. 1F, an accession partition into two subsystems is shown, e.g. the splitting of the CA index into the indexes of the journals covered. Such a partition, similarly to a partition in time, shortens the distribution and leads to increased slope constants B. Bonitz [9] found that if the Bradford law is fulfilled for a collective of scientists in the selective dissemination of information, it is not fulfilled for the single scientists from this collective.

Some empirical results

Mathematically, the question of what is more important, moments or parameters, is as unsolvable as the question of which came first, the chicken or the egg.

But information systems are still generated by people, and we can decide whether material conditions changed in a way that can explain the changes of the systems M.

The logarithmic form of the Lotka law has two parameters:

log n_k = A - B log m_k

The parameter A always depends on the sample size; it is the calculated value of log n_1. It is thus the parameter B which should be time independent.
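As a minimal sketch of how A and B are obtained, assuming purely hypothetical (m_k, n_k) counts rather than any of the data sets discussed below, ordinary least squares on the log-log form gives:

```python
import math

# Hypothetical Lotka-type counts: m_k papers per author, n_k authors.
mk = [1, 2, 3, 4, 5, 6, 8, 10]
nk = [520, 130, 56, 33, 21, 14, 8, 5]

# Ordinary least squares on log n_k = A - B log m_k.
x = [math.log10(v) for v in mk]
y = [math.log10(v) for v in nk]
N = len(x)
xbar, ybar = sum(x) / N, sum(y) / N
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
        / sum((xi - xbar) ** 2 for xi in x)
B = -slope                   # slope constant of the Lotka law
A = ybar - slope * xbar      # A estimates log n_1

print(f"B = {B:.2f}, A = {A:.2f}, predicted n_1 = {10 ** A:.0f}")
```

With these counts the fit returns B close to 2, the classical Lotka value.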

Pao [10] tested the results of Williams about authors of papers in two yearly volumes of the Reviews of Applied Entomology. In 1913, the population was 411 authors, who published 656 papers. In 1936, 1534 authors published 2379 papers. The sample size increased nearly 4 times, but the first moment remained constant: in 1913, m_1 = 1.59; in 1936, m_1 = 1.55. We can conclude that the working and publishing behavior of the entomologists publishing in the Reviews did not change; only their population was 3.73 times greater. But while the mean remained constant, the slope constant B changed from 2.5481 to 2.7331.
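These first moments follow directly from the quoted counts; a trivial check in Python (the source rounds the 1913 value down):

```python
# First moments m_1 = papers / authors for the counts quoted above.
authors_1913, papers_1913 = 411, 656
authors_1936, papers_1936 = 1534, 2379

print(papers_1913 / authors_1913)    # 1.596..., quoted as 1.59
print(papers_1936 / authors_1936)    # 1.550..., quoted as 1.55
print(authors_1936 / authors_1913)   # population 3.73 times greater
```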

The time factor in the Lotka law was studied extensively by Vlachý [8]. He tested even Lotka's counts, so his data are rather instructive; they allow comparisons of different kinds (Table 1).

Table 1. Moments and parameters of Chemical Abstracts indexes

Period       Interval    Number of       Sample    Moment    Parameter
             (years)     publications    (%)       m_1       B
1907-1916    10             172000       13        3.31      1.9
1907-1916    10             172000        1        3.55      1.9
1910          1              16500       13        1.40      2.6
1927-1936    10             317000        1        2.95      1.9
1930          1              30000        2        1.21      3.2
1947-1956    10             595000        1        2.73      2.0
1950          1              55000        2        1.42      2.5
1957-1961     5             585000        1        2.30      2.1
1960          1             130000        1        1.39      2.7
1967-1971     5            1314000        1        2.43      2.0
1970          0.5           137283        1        1.29      2.6
1971          0.5           157995        1        1.20      2.8

Note: From [8], except the first line, which is from [7].

If we compare Vlachý's results with Lotka's from the same ten-year register, there is no great difference: the slope constants are equal, and the mean determined by Vlachý is greater than Lotka's despite the fact that Vlachý used a 13 times smaller sample.

The number of publications registered in Chemical Abstracts increased about 8 times in 60 years, but the means of authorship remained relatively stable, decreasing somewhat. They are independent of the population size, but they clearly depend on the length of the observation interval, because:

m_1(0.5-1 year) ≈ 1.3,  m_1(5 years) ≈ 2.35,  m_1(10 years) ≈ 3.0.

Even prolific authors cannot write in short intervals the large numbers of papers that large populations would require. It means that the slopes of Lotka plots must also be steeper for shorter observation periods, which is really the case.

The apparently decreasing productivity of authors over fifty years can be explained by the fact that Chemical Abstracts now covers more journals than at the beginning, some of them only selectively, so that only a part of the publications of some authors is registered.

Similar results were obtained from the Czechoslovak Journal of Physics data (Table 2), where full counts were made. From Vlachý's compilation, only the data about some institutional communities deviate from the given pattern.

Table 2. Moments and parameters of Czechoslovak Journal of Physics indexes [8]

                          Whole period            Yearly indexes
Period       Interval     Moment    Parameter     Moments    Parameters
             (years)      m_1       B             m_1        B
1952-1960     9           3.16      1.3           1.42       1.7-2.6
1961-1970    10           2.22      1.3           1.29       1.4-2.8
1971-1975     5           1.82      1.8           1.24       1.8-2.5

Information distributions should have no finite moments; that is, their moments should grow with the sample size. If we take the second moments, Pao [10] calculated from Lotka's data s = 0.0562 for Chemistry A, s = 0.0537 for Chemistry B, and s = 0.0331 for Chemistry A + B: the second moment decreased with the sample size, against the theoretical conjectures. This can be explained by the effect of the fuzzy tail of the distribution, which does not obey its theoretical head.
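To illustrate the first sentence numerically: a sketch under an assumed ideal Lotka law n_k = n_1/m_k^B with B = 2 (a hypothetical choice, not Pao's data). The normalized second moment keeps growing as the tail is extended, so a finite sample with a short tail must understate it:

```python
def second_moment(n1, B, m_max):
    """Normalized second moment of an ideal Lotka law n_k = n1 / k**B,
    truncated at the maximum productivity m_max."""
    nk = [n1 / k ** B for k in range(1, m_max + 1)]
    n = sum(nk)
    return sum(v * k ** 2 for k, v in enumerate(nk, start=1)) / n

for cutoff in (10, 100, 1000, 10000):
    print(cutoff, round(second_moment(1.0, 2.0, cutoff), 1))
# For B = 2 the second moment grows roughly linearly with the cutoff,
# i.e. it diverges as the tail is extended.
```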

Discussion

It seems trivial to give proofs that authors in longer time intervals not only can but really do publish more papers (and similarly, words can be repeated and papers cited more often), which changes the moments as well as the parameters of the Lotka and Zipf distributions; but somebody must point out that the Emperor's clothes are invisible.

The stationary information distributions are rather stationery, paper dragons with fuzzy tails, and until some empirical evidence of the independence of their parameters from time is presented (an analogy to testing the conjecture that the gravitational constant depends on time), the myth that the Lotka, Zipf and Bradford laws are information analogies of Newton's law must be abandoned.

This is a hard conclusion, especially since the Lotka law only recently passed examination by statistical tests [11, 12]. But there the lognormal distribution could be used, too [13].

It is plausible that an exact correlation needs more than two parameters [3].

We must admit that beliefs in the stationarity of information distributions are based partially on objective reasons.

Parameters of extremely skewed distributions are poorly determinable from small samples, where only a few scattered points are available for correlation. It is practically impossible to study the dependence of such parameters on time.

One possibility for overcoming this difficulty would be to follow how the moments change with time and to compare them with the theoretical ones.

In statistical linguistics an important characteristic is the vocabulary/token ratio n/m, which is the inverted first moment m_1^(-1). Sichel [14] has given some approximations of how it should change as a function of m according to the generalized inverse Gaussian-Poisson distribution. It would also be possible to compare it with theoretical values calculated from ideal Zipfian plots.
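The following simulation sketch is not Sichel's approximation; it merely assumes an ideal Zipf law (rank probabilities proportional to 1/r over a finite, hypothetical vocabulary) and follows how the ratio n/m falls as the sample grows:

```python
import random
from itertools import accumulate

def zipf_type_token(vocab_size, text_length, seed=1):
    """Sample a text with Zipfian rank probabilities p_r ~ 1/r and
    report the vocabulary/token ratio n/m at a few sample sizes m."""
    rng = random.Random(seed)
    cum = list(accumulate(1.0 / r for r in range(1, vocab_size + 1)))
    tokens = rng.choices(range(vocab_size), cum_weights=cum, k=text_length)
    seen, ratios = set(), {}
    for m, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if m in (100, 1000, 10000, text_length):
            ratios[m] = len(seen) / m      # n/m, the inverted first moment
    return ratios

print(zipf_type_token(vocab_size=5000, text_length=20000))
```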

Extremely skewed distributions are stationery not only because they are not stationary, but because the literature concerning them is based mostly on speculations about abstract properties of mathematical relations and on counts of relatively small samples. Lotka's counts have never been surpassed.

Full statistics of computerized databases such as Chemical Abstracts, Derwent's Central Patent Index or the Science Citation Index are desirable. Fujiwara [15] published an analysis of the Chemical Abstracts vocabulary based on 10 million keywords, but his result, that they are distributed normally, is not verifiable because no details were given. Statistical evaluations of computerized databases need large funds which are still inaccessible to scientometricians, and so, as careful bibliographers, they devote their efforts to the categorization of their third world [16].

References

1. M. KUNZ, A case study against Haitun's conjectures, Scientometrics, 13 (1988) 25.

2. S.D. HAITUN, On M. Kunz's article "A case study against Haitun's conjectures", Scientometrics, 13 (1988) 35.

3. M. KUNZ, Lotka and Zipf: paper dragons with fuzzy tails, Scientometrics, 13 (1988) 89.

4. S.D. HAITUN, Stationary scientometric distributions, Part I. Different approximations, Scientometrics, 4 (1982) 5.

5. M. KUNZ, Information processing in linear vector space, Information Processing & Management, 20 (1984) 519.

6. W.P. JONES, G.W. FURNAS, Pictures of relevance: a geometric analysis of similarity measures, Journal of the American Society for Information Science, 38 (1987) 420.

7. A.J. LOTKA, The frequency distribution of scientific productivity, Journal of the Washington Academy of Sciences, 16 (1926) 317.

8. J. VLACHÝ, Time factor in Lotka's law, Probleme Informare Documentare, 10 (1976) 44.

9. M. BONITZ, Evidence for the invalidity of the Bradford law for the single scientist, Scientometrics, 2 (1980) 203.

10. M.L. PAO, An empirical examination of Lotka's law, Journal of the American Society for Information Science, 37 (1986) 26.

11. P.T. NICHOLLS, Price's square root law: empirical validity and relation to Lotka's law, Information Processing & Management, 24 (1988).

12. P.T. NICHOLLS, Bibliometric methods and validity of Lotka's law, Journal of the American Society for Information Science.

13. M. KUNZ, Can lognormal distribution be rehabilitated?, Scientometrics.

14. H.S. SICHEL, Word frequency distributions and type/token characteristics, Mathematical Scientist, 11 (1986) 45.

15. S. FUJIWARA, Main scientific-technical keywords as a means of connection making possible a better exchange of information (in Russian), Mezhdunarodnyi Forum po Informatsii i Dokumentatsii, 9 (1984) 23.

16. A. SOYIBO, W.O. AIYEPEKU, On the categorization, exactness and probable utility of bibliometric laws and their extensions, Journal of Information Science, 14 (1988) 243.