Zipf Law Word Frequency Calculator
Estimates the relative frequency of a word in a corpus using Zipf's law (frequency ∝ 1/rank), given the word's rank in the chosen language.
Zipf's law: frequency ∝ 1/rank; top 100 words cover ~50% of any text
Zipf's law says a word's frequency is inversely proportional to its rank. The 2nd word shows up half as often as the 1st, the 10th roughly a tenth as often, and the pattern continues down the list. The formula is f(rank) ≈ C / rank, where C ≈ 0.1 × N for a corpus of N tokens. In practice the coverage thresholds are easy to remember: the top 100 lemmas cover ~50% of running text, 1,000 cover ~80%, and 5,000 cover ~95%. Individual corpora can pull the constant off, though. In Estonian, the single word "ja" ("and") accounts for around 5% of all tokens, roughly half of the 10% that C ≈ 0.1 × N would predict for the top-ranked word.
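The formula above can be turned into a few lines of Python. This is a minimal sketch, not the calculator's actual implementation: the function names are made up for illustration, and the default constant of 0.1 is simply the C ≈ 0.1 × N rule of thumb expressed as a share of all tokens.

```python
def zipf_relative_frequency(rank: int, c: float = 0.1) -> float:
    """Estimated share of all tokens taken by the word at `rank` (1-based).

    `c` is the assumed share of the top-ranked word (C / N in the text).
    """
    if rank < 1:
        raise ValueError("rank must be >= 1")
    return c / rank


def zipf_coverage(top_n: int, c: float = 0.1) -> float:
    """Estimated share of running text covered by the top_n ranked words."""
    return sum(zipf_relative_frequency(r, c) for r in range(1, top_n + 1))


if __name__ == "__main__":
    for n in (100, 1_000, 5_000):
        print(f"top {n}: {zipf_coverage(n):.1%}")
```

Under these assumptions the pure 1/rank model gives roughly 52%, 75%, and 91% for the top 100, 1,000, and 5,000 ranks. That is close to the 50% figure and somewhat below the other two quoted above, which is a reminder that the thresholds are empirical rules of thumb rather than exact consequences of the formula.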
Applications
People use this to decide which vocabulary to drill first in Anki and other SRS decks, to order wordlists for language-learning curricula, and to build comprehensible input materials. It also helps when you want to estimate how much of a text a learner can already read, or when tuning smoothing in n-gram or transformer language models.
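To make the "how much of a text can a learner already read" use case concrete, here is a rough sketch that measures the share of tokens in a text that appear in a known-word list. The function name and the simple regex tokenizer are assumptions for illustration; it matches surface forms only, so a real tool would likely lemmatize first.

```python
import re
from collections import Counter


def known_token_share(text: str, known_words: set) -> float:
    """Share of tokens in `text` whose lowercased form is in `known_words`."""
    # Naive tokenizer (an assumption): runs of word characters, lowercased.
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    known = sum(n for word, n in counts.items() if word in known_words)
    return known / len(tokens)


if __name__ == "__main__":
    sample = "The cat and the dog."
    print(known_token_share(sample, {"the", "and"}))
```

In the sample sentence, three of the five tokens are in the known list, so the estimated coverage is 0.6.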
FAQ
Does Zipf apply to all languages? For most languages the slope sits near -1. Agglutinative languages like Finnish and Turkish flatten the curve, since heavy inflection spreads each word across many distinct forms.
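The slope mentioned here is the slope of log(frequency) plotted against log(rank). A minimal sketch of how one might estimate it from a list of word counts, using an ordinary least-squares fit in pure Python (the function name is hypothetical, and serious work would usually restrict the fit to a rank range or use a maximum-likelihood estimator instead):

```python
import math


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    # Rank words by descending frequency; zero counts cannot be log-transformed.
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two non-zero frequencies")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


if __name__ == "__main__":
    ideal = [1000 / rank for rank in range(1, 201)]  # synthetic 1/rank data
    print(round(zipf_slope(ideal), 3))
```

On synthetic data that follows 1/rank exactly, the fit returns a slope of -1; a flatter curve shows up as a slope closer to 0.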
Lemmas or forms? The coverage numbers above count lemmas. If you count surface forms instead, the wordlist you need grows by roughly 2–5×.
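Why does the curve break at the tail? The long tail is mostly hapax legomena, words that appear only once. To fit real data better, Mandelbrot's modification adds a constant to the rank in the denominator, f(rank) ∝ 1 / (rank + b)^a. The offset b mostly adjusts the top ranks, while a free exponent a lets the tail fall faster or slower than 1/rank.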
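The Zipf-Mandelbrot form is usually written f(rank) ∝ 1 / (rank + b)^a, with the constant b added to the rank. A small sketch of the normalized distribution follows; the function name is hypothetical and the default offset and exponent are placeholder values for illustration, not fitted parameters for any language.

```python
def zipf_mandelbrot_shares(n_ranks, offset=2.7, exponent=1.0):
    """Normalized Zipf-Mandelbrot shares for ranks 1..n_ranks.

    Each weight is 1 / (rank + offset) ** exponent; the weights are then
    scaled so they sum to 1. With offset=0 and exponent=1 this is plain Zipf.
    """
    if n_ranks < 1:
        raise ValueError("n_ranks must be >= 1")
    weights = [1.0 / (rank + offset) ** exponent for rank in range(1, n_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    plain = zipf_mandelbrot_shares(1000, offset=0.0)
    shifted = zipf_mandelbrot_shares(1000)
    # Ratio between rank 1 and rank 2: exactly 2 for plain Zipf, smaller with an offset.
    print(plain[0] / plain[1], shifted[0] / shifted[1])
```

Comparing the two ratios shows what the offset does: it narrows the gap between the very top ranks, while the exponent controls how quickly the rest of the curve falls away.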
How big a corpus do I need? Around 1M tokens is enough to stabilize the top 5,000 ranks. For academic-grade frequency lists, aim for 10M or more.