1001Ferramentas

Zipf Law Word Frequency Calculator

Computes the estimated relative frequency of a word in a corpus using Zipf's law (frequency ∝ 1/rank), given the word's rank in the chosen language.


Zipf's law: frequency ∝ 1/rank. Top 100 words cover ~50% of any text

Zipf's law says a word's frequency is inversely proportional to its rank. The 2nd word shows up half as often as the 1st, the 10th roughly a tenth as often, and the pattern continues down the list. The formula is f(rank) ≈ C / rank, where f is the expected number of occurrences and C ≈ 0.1 × N for a corpus of N tokens. In practice the coverage thresholds are easy to remember: the top 100 lemmas cover ~50% of running text, 1,000 cover ~80%, and 5,000 cover ~95%. Individual corpora can shift the constant, though. In Estonian, the single word "ja" ("and") accounts for around 5% of all tokens, well above what Zipf alone would predict.
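The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual implementation: the function names are hypothetical, and the constant 0.1 is simply the C ≈ 0.1 × N approximation from the text, expressed as a share of all tokens.

```python
def zipf_relative_frequency(rank: int, c: float = 0.1) -> float:
    """Estimated share of all tokens taken by the word at `rank` (1-based).

    Assumes the pure Zipf form f(rank) = C / rank with C = c * N,
    so the share of the corpus is simply c / rank.
    """
    if rank < 1:
        raise ValueError("rank must be >= 1")
    return c / rank


def zipf_coverage(top_n: int, c: float = 0.1) -> float:
    """Cumulative share of tokens covered by the top `top_n` ranks."""
    return sum(zipf_relative_frequency(r, c) for r in range(1, top_n + 1))


if __name__ == "__main__":
    # Print the share predicted for a few ranks and coverage thresholds.
    for rank in (1, 2, 10):
        print(f"rank {rank}: {zipf_relative_frequency(rank):.2%}")
    for n in (100, 1_000, 5_000):
        print(f"top {n}: {zipf_coverage(n):.1%}")
```

With these assumptions the pure 1/rank curve gives roughly 52%, 75%, and 91% coverage for the top 100, 1,000, and 5,000 ranks. That is close to the rule-of-thumb thresholds at the head of the list and somewhat below them further down, which is one reason real frequency lists are preferred over the idealized formula when precise coverage matters.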

Applications

People use this to decide which vocabulary to drill first in Anki and other SRS decks, to order wordlists for language-learning curricula, and to build comprehensible input materials. It also helps when you want to estimate how much of a text a learner can already read, or when tuning smoothing in n-gram or transformer language models.
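As one way to estimate how much of a text a learner can already read, the sketch below counts the share of running tokens that appear in a known-word set. The tokenizer and function names are hypothetical, and the example matches surface forms rather than lemmas, so it will understate coverage for heavily inflected languages unless the text is lemmatized first.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into runs of letters (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def text_coverage(text: str, known_words: set[str]) -> float:
    """Share of running tokens in `text` that appear in `known_words`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)


if __name__ == "__main__":
    sample = "The cat sat on the mat"
    vocabulary = {"the", "cat", "on"}  # e.g. the top ranks of a frequency list
    print(f"coverage: {text_coverage(sample, vocabulary):.0%}")
```

In practice `known_words` would be the first k entries of a frequency-ranked wordlist, so the result shows how far a given deck size gets a learner through a specific text.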

FAQ

Does Zipf apply to all languages? For most languages the slope of the log-log rank-frequency curve sits near -1. Agglutinative languages like Finnish and Turkish flatten the curve, since heavy inflection spreads each lemma across many distinct surface forms.
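To check the slope for a particular corpus, one simple approach is an ordinary least-squares fit of log(frequency) against log(rank). The sketch below assumes the corpus is already tokenized into a list of strings; `zipf_slope` is a hypothetical helper name, and fitting over every rank (including the long tail) is a simplification that can bias the estimate.

```python
import math
from collections import Counter


def zipf_slope(tokens: list[str]) -> float:
    """Least-squares slope of log(count) versus log(rank) for a token list."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct word types")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


if __name__ == "__main__":
    # Synthetic corpus whose counts follow 60 / rank exactly: 60, 30, 20, 15, 12, 10.
    tokens = [f"w{rank}" for rank in range(1, 7) for _ in range(60 // rank)]
    print(f"slope: {zipf_slope(tokens):.3f}")
```

A slope near -1 on real data matches the classic form; a noticeably shallower slope is what the flattening described above looks like when surface forms are counted.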

Lemmas or forms? The coverage numbers above count lemmas. If you count surface forms instead, the wordlist you need grows by roughly 2–5×.

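Why does the curve break at the tail? The long tail is mostly hapax legomena, words that appear only once, so many ranks share the same count and the straight 1/rank line no longer fits. Mandelbrot's modification adds a constant to the rank in the denominator, f(rank) ≈ C / (rank + b)^a, which gives the fit more freedom. Note that the shift b mostly changes the first few ranks, while the exponent a controls how fast the tail falls off.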
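For comparison with the plain 1/rank form, here is a minimal sketch of the Zipf-Mandelbrot variant, f(rank) ∝ 1 / (rank + b)^a, normalized over a fixed number of ranks so the shares sum to 1. The function name and the default values for the shift and exponent are illustrative placeholders, not fitted values for any particular language.

```python
def zipf_mandelbrot_shares(
    n_ranks: int, shift: float = 2.7, exponent: float = 1.0
) -> list[float]:
    """Normalized shares for ranks 1..n_ranks under 1 / (rank + shift) ** exponent.

    `shift` is the constant added to the denominator; shift=0 and exponent=1
    reduce to plain Zipf. Both defaults are placeholders for illustration.
    """
    if n_ranks < 1:
        raise ValueError("n_ranks must be >= 1")
    weights = [1.0 / (rank + shift) ** exponent for rank in range(1, n_ranks + 1)]
    total = sum(weights)
    return [weight / total for weight in weights]


if __name__ == "__main__":
    plain = zipf_mandelbrot_shares(5_000, shift=0.0)
    shifted = zipf_mandelbrot_shares(5_000, shift=2.7)
    # Compare how steeply the first two ranks drop off in each variant.
    print(f"plain   rank1/rank2 ratio: {plain[0] / plain[1]:.2f}")
    print(f"shifted rank1/rank2 ratio: {shifted[0] / shifted[1]:.2f}")
```

With a positive shift the ratio between the first two ranks drops below 2, so the head of the curve is flatter than pure Zipf while the overall power-law shape is kept. In a real fit, both parameters would be estimated from corpus counts.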

How big a corpus do I need? Around 1M tokens is enough to stabilize the top 5,000 ranks. For academic-grade frequency lists, aim for 10M or more.
