

Zipf's law, an empirical law formulated using mathematical statistics, refers to the fact that many types of data studied in the physical and social sciences can be approximated with a Zipfian distribution, one of a family of related discrete power law probability distributions. The law is named after the linguist George Kingsley Zipf (pronounced /ˈzɪf/) who first proposed it (Zipf 1935, 1949), though J.B. Estoup appears to have noticed the regularity before Zipf.^{[1]}
Motivation
Zipf's law states that, given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. For example, in the Brown Corpus "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852). Only 135 vocabulary items are needed to account for half the Brown Corpus.
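The inverse-rank relationship can be checked by hand against the three Brown Corpus counts quoted above. The following is a minimal sketch, assuming an exponent of exactly 1, so that the predicted count at rank r is the top count divided by r; the helper name zipf_prediction is illustrative only and does not come from any library.

```python
# Brown Corpus counts quoted in the text, in rank order (rank 1, 2, 3).
observed = {"the": 69971, "of": 36411, "and": 28852}


def zipf_prediction(top_count, rank):
    """Idealized Zipf count at a given rank, assuming an exponent of exactly 1."""
    return top_count / rank


top_count = observed["the"]
rows = []
for rank, (word, count) in enumerate(observed.items(), start=1):
    predicted = zipf_prediction(top_count, rank)
    rows.append((rank, word, count, predicted))
    # Show the observed count next to the idealized prediction for comparison.
    print(f"{rank}  {word:<4} observed={count:>6}  predicted={predicted:>9.1f}")
```

The observed counts for "of" and "and" sit somewhat above the idealized predictions, which is consistent with the law being an approximation rather than an exact rule.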
The same relationship occurs in many other rankings unrelated to language, such as the population ranks of cities in various countries, corporation sizes, and income rankings. The appearance of the distribution in rankings of cities by population was first noticed by Felix Auerbach in 1913.^{[2]} Empirically, a data set can be tested to see whether Zipf's law applies by fitting the regression log R = a − b log n, where R is the rank of the datum, n is its value, and a and b are constants. Zipf's law applies when b = 1. When this regression is applied to cities, a better fit has been found with b = 1.07.
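The rank-size regression, written as log R = a − b log n, can be sketched with ordinary least squares using only the standard library. This is an illustration under stated assumptions, not a definitive procedure: the function name fit_rank_size is hypothetical, the data are synthetic values constructed to be exactly Zipfian, and ties between equal values are not handled.

```python
import math


def fit_rank_size(values):
    """Fit log R = a - b * log n by ordinary least squares.

    values: positive sizes (for example city populations). Ranks are assigned
    after sorting in descending order, so the largest value gets R = 1.
    Returns (a, b).
    """
    ordered = sorted(values, reverse=True)
    xs = [math.log(n) for n in ordered]                     # log n
    ys = [math.log(r) for r in range(1, len(ordered) + 1)]  # log R
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The model is written with a minus sign, so b is the negated slope.
    return intercept, -slope


# Synthetic, exactly Zipfian sizes: the value at rank r is 1000 / r (an assumption).
synthetic = [1000 / r for r in range(1, 51)]
a, b = fit_rank_size(synthetic)
print(f"a = {a:.4f}, b = {b:.4f}")
```

For exactly Zipfian input the fitted b is 1 up to rounding error, and a equals the logarithm of the largest value. Real data will generally give a value of b that only approximates 1.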
Theoretical review
Zipf's law is most easily observed by plotting the data on a log-log graph, with the axes being log(rank order) and log(frequency). For example, the word "the" (as described above) would appear at x = log(1), y = log(69971). The data conform to Zipf's law to the extent that the plot is linear.
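The log-log coordinates described above can be computed directly, and the slope between consecutive points gives a rough indication of linearity without drawing a plot. The sketch below is illustrative: loglog_points and segment_slopes are hypothetical helper names, natural logarithms are used (the slopes do not depend on the base), and the ideal series is synthetic.

```python
import math


def loglog_points(frequencies):
    """Return (log rank, log frequency) pairs, ranking from most to least frequent."""
    ordered = sorted(frequencies, reverse=True)
    return [(math.log(rank), math.log(freq))
            for rank, freq in enumerate(ordered, start=1)]


def segment_slopes(points):
    """Slope of the straight line joining each pair of consecutive points."""
    return [(y2 - y1) / (x2 - x1)
            for (x1, y1), (x2, y2) in zip(points, points[1:])]


# The three Brown Corpus counts quoted in the text.
brown = [69971, 36411, 28852]
points = loglog_points(brown)
slopes = segment_slopes(points)

# A synthetic, exactly Zipfian series for comparison (an assumption).
ideal_slopes = segment_slopes(loglog_points([1200 / r for r in range(1, 6)]))

print(points[0])      # the point for "the": x = log(1), y = log(69971)
print(slopes)         # slopes between the Brown Corpus points
print(ideal_slopes)   # close to -1 throughout for the synthetic series
```

An exactly Zipfian series lies on a straight line of slope −1, so the closer the segment slopes stay to a common value, the more linear the plot.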
