Zipf's law

Zipf's law is an empirical law, formulated using mathematical statistics, which states that many types of data studied in the physical and social sciences can be approximated with a Zipfian distribution, one of a family of related discrete power law probability distributions. The law is named after the linguist George Kingsley Zipf (pronounced /ˈzɪf/), who first proposed it (Zipf 1935, 1949), though J.B. Estoup appears to have noticed the regularity before Zipf.[1]



Zipf's law states that, given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. For example, in the Brown Corpus "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852). Only 135 vocabulary items are needed to account for half the Brown Corpus.
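As a quick arithmetic check, the three Brown Corpus counts quoted above can be compared with the idealized prediction that the top count divided by the count at rank k equals k. This is only a minimal sketch: the function name zipf_ratios is hypothetical, and the three counts are the only data used.

```python
# Counts quoted above for the three most frequent Brown Corpus words.
counts = {"the": 69971, "of": 36411, "and": 28852}


def zipf_ratios(word_counts):
    """Return (word, rank, top_count / count) tuples, most frequent first."""
    ranked = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)
    top = ranked[0][1]
    return [(word, rank, top / count)
            for rank, (word, count) in enumerate(ranked, start=1)]


for word, rank, ratio in zipf_ratios(counts):
    # Under an exact Zipf law the ratio would equal the rank.
    print(f"{word!r}: rank {rank}, top/count = {ratio:.2f}")
```

The ratios come out at roughly 1.9 for rank 2 and 2.4 for rank 3, so the fit is close for "of" and looser for "and", consistent with the law being an approximation.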

The same relationship occurs in many other rankings unrelated to language, such as the population ranks of cities in various countries, corporation sizes, and income rankings. The appearance of the distribution in rankings of cities by population was first noticed by Felix Auerbach in 1913.[2] Empirically, a data set can be tested to see whether Zipf's law applies by running the regression log R = a - b log n, where R is the rank of the datum, n is its value, and a and b are constants. For Zipf's law to apply, the constant b must equal 1. When this regression is applied to cities, a better fit has been found with b = 1.07.
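The regression log R = a - b log n can be sketched with ordinary least squares using only the standard library. The function name fit_rank_regression and the synthetic data are assumptions made for illustration; natural logarithms are used, which affects a but not b, and tied values receive arbitrary consecutive ranks in this sketch.

```python
import math


def fit_rank_regression(values):
    """Fit log R = a - b log n by ordinary least squares and return (a, b).

    `values` are positive sizes (e.g. city populations), not all equal.
    Ranks R are assigned 1, 2, 3, ... from the largest value down.
    """
    ordered = sorted(values, reverse=True)
    xs = [math.log(n) for n in ordered]                      # log n
    ys = [math.log(r) for r in range(1, len(ordered) + 1)]   # log R
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The model is written with a minus sign, so b is the negated slope.
    return intercept, -slope


# Synthetic, exactly Zipfian data (an assumption, not real city sizes):
# the value at rank R is 1000 / R, so b should come out as 1.
synthetic = [1000 / rank for rank in range(1, 51)]
a, b = fit_rank_regression(synthetic)
print(f"a = {a:.4f}, b = {b:.4f}")
```

On real data, an estimate of b near 1 would indicate agreement with Zipf's law, while a value such as the 1.07 reported for cities indicates a slightly steeper relationship.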

Theoretical review

Zipf's law is most easily observed by plotting the data on a log-log graph, with the axes being log(rank order) and log(frequency). For example, the word "the" (as described above) would appear at x = log(1), y = log(69971). The data conform to Zipf's law to the extent that the plot is linear.
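The log-log coordinates described above can be computed directly. The sketch below assumes natural logarithms (the choice of base rescales both axes equally and does not affect linearity), and the function names loglog_points and segment_slopes are hypothetical. It uses only the three Brown Corpus counts quoted earlier.

```python
import math


def loglog_points(frequencies):
    """Return (log(rank), log(frequency)) pairs, most frequent first."""
    ordered = sorted(frequencies, reverse=True)
    return [(math.log(rank), math.log(freq))
            for rank, freq in enumerate(ordered, start=1)]


def segment_slopes(points):
    """Slopes between consecutive points; an exact Zipf law gives -1 each."""
    return [(y2 - y1) / (x2 - x1)
            for (x1, y1), (x2, y2) in zip(points, points[1:])]


points = loglog_points([69971, 36411, 28852])  # "the", "of", "and"
for x, y in points:
    print(f"x = {x:.3f}, y = {y:.3f}")
print("segment slopes:", [round(s, 3) for s in segment_slopes(points)])
```

The first point lies at x = log(1) = 0, and the closer the segment slopes are to a common value, the closer the plot is to a straight line.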
