
Portuguese Common Word Frequency Calculator

Shows the estimated relative frequency of very common Portuguese words, based on corpus frequency lists, and indicates where each word falls within the top thousand.

Word frequency in Portuguese: Zipf's law and coverage curves

Run the numbers on Portuguese text and the top 100 words cover roughly 50% of the running words in a typical corpus. The top 1,000 push that to about 80%, and the top 5,000 reach about 95%. Most of the highest-frequency tokens are function words: de, a, o, que, e, do, da, em, um, para. Because the distribution follows Zipf's law (a word's frequency falls roughly in inverse proportion to its rank), a learner pulls huge comprehension out of a tiny starting vocabulary. The catch is the long tail: the last 5% of running words is spread over the vast majority of word types, and that is where most of the topic-specific meaning actually lives. The reference here is the Corpus do Português (Mark Davies, BYU), about 1 billion words spanning the Brazilian and European variants. Lemmatised frequency lists fold fui, vou and vão under ir, while unlemmatised lists count each form on its own.
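
The coverage figures above can be reproduced on any tokenised text. The following is a minimal sketch, not the calculator's actual implementation: the function name coverage_curve and the toy sample are assumptions made for illustration, and it counts unlemmatised word forms.

```python
from collections import Counter

def coverage_curve(tokens, cutoffs):
    """Return {cutoff: share of running tokens covered by the top-`cutoff` word forms}."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        # No tokens at all: nothing can be covered.
        return {k: 0.0 for k in cutoffs}
    # Frequencies ordered from most to least common (rank 1 first).
    ranked = [n for _, n in counts.most_common()]
    return {k: sum(ranked[:k]) / total for k in cutoffs}

# Toy sample (hypothetical); a real run would use a large tokenised corpus.
sample = "o gato de casa e o cão de rua e o rato de campo".split()
curve = coverage_curve(sample, [1, 3, 5])
for cutoff, share in curve.items():
    print(f"top {cutoff}: {share:.0%} of running words")
```

On a real corpus, passing cutoffs such as 100, 1,000 and 5,000 would show how quickly the curve flattens once the function words are exhausted.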

Applications

Frequency data shows up in curriculum design for Portuguese as a foreign language, in Anki and other SRS decks ordered by frequency rank, and in NLP preprocessing, where stopword lists feed tokenisation and TF-IDF. Educational publishers use it for readability scoring, and it also serves as a prior for OCR post-correction.
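
As an illustration of the NLP use, here is a rough sketch of stopword filtering followed by a plain TF-IDF weighting, using only the standard library. The STOPWORDS set is an assumption (just the ten function words named earlier, not a complete list), and tf * log(N / df) is one common TF-IDF variant among several.

```python
import math
from collections import Counter

# Assumed stopword list: only the ten function words named above.
STOPWORDS = {"de", "a", "o", "que", "e", "do", "da", "em", "um", "para"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def tf_idf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        length = len(tokens) or 1  # avoid dividing by zero on empty documents
        weights.append(
            {t: (c / length) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights

docs = ["o gato dorme em casa", "o cão dorme na rua"]
scores = tf_idf(docs)
print(scores[0])
```

A term that occurs in every document (dorme in this toy example) gets weight zero, which is the same effect a stopword list achieves up front for function words.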

FAQ

Brazilian or European Portuguese? The top 1,000 lists overlap by about 95%. Where they part ways is vocabulary like ônibus/autocarro or celular/telemóvel. The Corpus do Português keeps the two variants separate.
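
An overlap figure like this can be checked on any pair of ranked lists. The sketch below assumes each list is already sorted by descending frequency; the function name top_n_overlap and the two toy lists are hypothetical and are not taken from the corpus.

```python
def top_n_overlap(list_a, list_b, n):
    """Share of the top-n words in list_a that also appear in the top-n of list_b."""
    top_a = set(list_a[:n])
    top_b = set(list_b[:n])
    if not top_a:
        return 0.0
    return len(top_a & top_b) / len(top_a)

# Toy ranked lists (hypothetical): shared function words, divergent vocabulary.
br = ["de", "a", "o", "que", "ônibus", "celular"]
pt = ["de", "a", "o", "que", "autocarro", "telemóvel"]

print(f"top 4 overlap: {top_n_overlap(br, pt, 4):.0%}")
print(f"top 6 overlap: {top_n_overlap(br, pt, 6):.0%}")
```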

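Why study by frequency instead of theme? Going by frequency squeezes the most text coverage out of each word you memorise. Roughly 1,000 words already cover about 80% of running text, whereas themed lists take much longer to reach the same coverage. Coverage is not the same as comprehension, though: the words still missing are often the ones that carry the topic.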

Are stopwords useless? In search and other NLP retrieval tasks they mostly add noise, so they are usually filtered out. For comprehension and grammar it's the opposite: prepositions and articles are what hold a sentence's structure together.

How big a vocabulary for fluency? An educated native speaker actively uses something like 15,000–20,000 lemmas, with passive recognition climbing past 40,000. C1 learners usually get by on 8,000–10,000.
