1001Ferramentas

PT Bigram Frequency

Shows the typical frequency of the most common letter pairs (bigrams) in a representative Brazilian Portuguese corpus.

Bigram Frequency in Brazilian Portuguese

A bigram is just a pair of letters sitting next to each other in a text. To get how often each pair shows up, you divide its count by the total number of bigrams, which is what f(xy) = count(xy) / total_bigrams says. The denominator counts every overlapping letter pair in the corpus. A string of n letters (with n at least 1) always gives you n - 1 bigrams: "casa" has 4 letters and 3 bigrams, "ca", "as" and "sa".
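The formula above can be sketched in a few lines of Python. This is a minimal illustration rather than the tool's actual implementation; the function name bigram_frequencies and the sample string are assumptions made for the example.

```python
from collections import Counter


def bigram_frequencies(text: str) -> dict[str, float]:
    """Return f(xy) = count(xy) / total_bigrams for every overlapping pair.

    Hypothetical helper for illustration; not taken from the original tool.
    """
    # Slide a window of width 2 over the string: n characters give n - 1 pairs.
    pairs = [text[i:i + 2] for i in range(len(text) - 1)]
    total = len(pairs)
    if total == 0:
        # Strings shorter than 2 characters contain no bigrams.
        return {}
    counts = Counter(pairs)
    return {pair: count / total for pair, count in counts.items()}


# "arara" has 5 letters, so 4 overlapping bigrams: ar, ra, ar, ra.
freqs = bigram_frequencies("arara")
```

Here "ar" and "ra" each occur twice out of four bigrams, so both get a relative frequency of 0.5.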

Because Brazilian Portuguese leans on open syllables and endings full of vowels, the most frequent bigrams tend to mirror that. Measured against the Câmara Cascudo Brazilian Corpus and a few other reference collections, the top ten usually come out as “ar”, “es”, “de”, “os”, “do”, “ra”, “te”, “to”, “na” and “as”.

Applications

Classical n-gram language models lean heavily on bigram statistics, and you find them all over the place: NLP pipelines, the autocomplete and predictive text running on your phone, OCR cleanup, and statistical cryptanalysis. They also show up as features when a system tries to guess which language a text is in, or who wrote it.

FAQ

Why are bigrams more useful than single letters? A single letter only tells you how often that letter appears on its own. A bigram carries some local context and a hint of syllable structure, which makes a big difference for tasks like guessing the language or predicting the next character.

Do bigrams cross word boundaries? That comes down to your tokenizer. If it pays attention to whitespace, it keeps bigrams inside a single word. A character-level extractor will happily include space-letter pairs, and those can actually help with detecting prefixes.
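The difference between the two approaches is easy to see side by side. The sketch below is illustrative only; both function names are assumptions, and splitting on whitespace stands in for whatever tokenizer a real pipeline would use.

```python
def bigrams_within_words(text: str) -> list[str]:
    """Whitespace-aware extraction: pairs never span two words."""
    pairs: list[str] = []
    for word in text.split():
        pairs.extend(word[i:i + 2] for i in range(len(word) - 1))
    return pairs


def bigrams_across_boundaries(text: str) -> list[str]:
    """Character-level extraction: spaces count as ordinary characters."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


sample = "de novo"
inside = bigrams_within_words(sample)
across = bigrams_across_boundaries(sample)
```

For "de novo", the whitespace-aware version yields "de", "no", "ov", "vo", while the character-level version also yields "e " and " n". The space-letter pair " n" is the kind of word-initial signal mentioned above.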

How are accented letters handled? Modern Brazilian Portuguese corpora treat accented characters (á, é, ç) as their own thing. Strip the accents and you end up lumping together bigrams that sound different, so accented and unaccented letters are counted separately.
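A small sketch with Python's standard unicodedata module shows what is lost when accents are stripped. The helper names are hypothetical, and the choice of NFC normalization plus lowercasing is an assumption about a sensible preprocessing step, not a description of any particular corpus.

```python
import unicodedata


def normalize_keep_accents(text: str) -> str:
    """Lowercase and compose to NFC so each accented letter is one code point."""
    return unicodedata.normalize("NFC", text.lower())


def strip_accents(text: str) -> str:
    """Decompose to NFD and drop combining marks (shown only for contrast)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# "avó" (grandmother) and "avô" (grandfather) differ only in the accent.
kept = (normalize_keep_accents("avó"), normalize_keep_accents("avô"))
stripped = (strip_accents("avó"), strip_accents("avô"))
```

With accents kept, the two words contribute the distinct bigrams "vó" and "vô". After stripping, both collapse to "avo" and the distinction disappears from the counts.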
