1001Ferramentas

SSML Builder (Speech)

Build SSML (Speech Synthesis Markup Language) documents compatible with Alexa, Google, and Polly, using break, prosody, emphasis, phoneme, and voice tags.


  

SSML — the markup language that makes synthetic voices sound human

SSML (Speech Synthesis Markup Language) is a W3C standard (version 1.1 became a Recommendation in 2010) for telling a text-to-speech engine how to read a string. Plain text gives the engine a single signal, the words; SSML adds pauses, emphasis, pitch, rate, phonemes, character-by-character spelling, substitutions, and explicit interpretation of dates, currencies, and numbers. It is the difference between a contact-centre bot that may read "Dr. Silva, account A123" as "D R Silva, account A one hundred twenty-three" and one that reads it as "Doctor Silva, account A-one-two-three".

Anatomy of an SSML document

Every document is wrapped in a <speak> root element. Inside, you mix plain text with tags that modify the surrounding speech:

<speak>
  Welcome to <emphasis level="strong">Amazon</emphasis>.
  Please wait <break time="500ms"/> while I connect you.
  <prosody rate="slow" pitch="+2st">Slowly and high.</prosody>
  Your account number is <say-as interpret-as="characters">A123</say-as>.
</speak>
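The same document can also be assembled programmatically, which guarantees well-formed output. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the function name build_greeting is hypothetical, and the spacing between sentences is an assumption rather than something an engine requires.

```python
import xml.etree.ElementTree as ET


def build_greeting() -> str:
    """Rebuild the example document element by element (illustrative only)."""
    speak = ET.Element("speak")
    speak.text = "Welcome to "

    emphasis = ET.SubElement(speak, "emphasis", level="strong")
    emphasis.text = "Amazon"
    emphasis.tail = ". Please wait "  # text that follows the closing tag

    pause = ET.SubElement(speak, "break", time="500ms")
    pause.tail = " while I connect you. "

    prosody = ET.SubElement(speak, "prosody", rate="slow", pitch="+2st")
    prosody.text = "Slowly and high."
    prosody.tail = " Your account number is "

    # Hyphenated attribute names must be passed as a dict, not as keywords.
    say_as = ET.SubElement(speak, "say-as", {"interpret-as": "characters"})
    say_as.text = "A123"
    say_as.tail = "."

    return ET.tostring(speak, encoding="unicode")


ssml = build_greeting()
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
print(ssml)
```

Building from elements rather than string concatenation means special characters in the text are escaped automatically.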

The tags you will actually use

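  • <break time="500ms"/>: inserts a pause. The time attribute takes milliseconds (ms) or seconds (s); alternatively, the strength attribute takes none, x-weak, weak, medium, strong, or x-strong.
  • <emphasis level="strong">: stresses a word. The W3C levels are strong, moderate, none, and reduced; support varies by engine and voice.
  • <prosody rate="slow" pitch="+2st" volume="loud">: fine-grained control of speed, pitch (in semitones or %), and volume.
  • <say-as interpret-as="characters">: forces an interpretation. Common values include characters, digits, date, time, currency, and telephone, but the exact set differs per provider. With characters, "A123" reads as "A-one-two-three"; with date, "2025-12-31" reads as a date (many engines also expect a format attribute, whose values are engine-specific).
  • <phoneme alphabet="ipa" ph="ˈnaɪki">Nike</phoneme>: custom pronunciation in IPA; some engines also accept X-SAMPA.
  • <sub alias="Doctor">Dr.</sub>: substitutes a spoken form for the written one. Widely supported across providers.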
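When user-supplied text is interpolated into these tags, characters such as & and < have to be escaped or the engine will reject the document. The sketch below shows one way to wrap three of the tags above in small helpers; the function names (pause, say_as, sub, speak) are hypothetical, and only Python's standard library is assumed.

```python
from xml.sax.saxutils import escape, quoteattr


def pause(ms: int) -> str:
    """Return a <break> tag for a pause of `ms` milliseconds."""
    return f'<break time="{int(ms)}ms"/>'


def say_as(text: str, interpret_as: str) -> str:
    """Wrap `text` in <say-as>; the accepted values depend on the engine."""
    return f"<say-as interpret-as={quoteattr(interpret_as)}>{escape(text)}</say-as>"


def sub(text: str, alias: str) -> str:
    """Speak `alias` in place of the written `text`."""
    return f"<sub alias={quoteattr(alias)}>{escape(text)}</sub>"


def speak(*parts: str) -> str:
    """Join already-escaped fragments inside a <speak> root."""
    return "<speak>" + "".join(parts) + "</speak>"


prompt = speak(
    sub("Dr.", "Doctor"),
    escape(" Silva, your account number is "),
    say_as("A123", "characters"),
    ".",
    pause(300),
    escape("Terms & conditions apply."),  # the ampersand becomes &amp;
)
print(prompt)
```

Keeping escaping inside the helpers means callers pass raw text and never hand-write entities.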

Engines and providers

SSML support is not uniform. Amazon Polly implements a broad subset plus extensions (<amazon:effect>, a newscaster style, breathing sounds), although not every tag is available for every voice engine (standard versus neural). Google Cloud Text-to-Speech (WaveNet, Neural2, Studio voices) is stricter and tends to reject invalid markup outright. Microsoft Azure Speech requires its own namespace declarations on <speak> and adds <mstts:express-as style="cheerful"> for emotional styles. IBM Watson and Amazon Connect (the contact-centre and IVR product) both consume SSML. For Brazilian Portuguese, commonly used neural voices include Polly's Camila and Vitória, Google's pt-BR-Wavenet-C, and Azure's Francisca. Apple ships the pt-BR system voices Felipe and Luciana on macOS/iOS, which VoiceOver can use.

Where SSML is worth the effort

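  • Contact-centre IVR: account numbers, currency amounts, and dates need say-as tags to avoid embarrassing misreadings.
  • Audiobooks and podcasts: long-form synthesis with deliberate pacing and emphasis.
  • Voice assistants: Alexa skills accept SSML in their responses, and it is the practical way to control pauses and pronunciation in any non-trivial response; Google's conversational Actions (since discontinued) accepted it as well.
  • Accessibility: mainstream screen readers generally do not interpret SSML in web content, and ARIA attributes such as aria-label carry plain text only; SSML helps mainly when you generate the audio yourself.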

FAQ

Is SSML portable between providers? Partly. The W3C core elements (break, emphasis, prosody, say-as, sub, phoneme) are accepted almost everywhere, although the supported attribute values, especially for say-as, vary by engine and voice type. Provider-specific extensions (amazon:effect, mstts:express-as) are not portable. Test in the target engine before you ship.
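A quick pre-flight check can flag markup that is unlikely to travel between engines. The sketch below is an assumption-laden illustration: the function name non_portable_tags is hypothetical, and the allowlist contains only the core elements named in this answer plus the speak root, so other standard elements would be flagged too. It uses a regular expression rather than an XML parser because vendor prefixes such as amazon: are often left undeclared, which a strict parser would refuse.

```python
import re

# Core elements named above, plus the <speak> root (an illustrative allowlist).
PORTABLE_TAGS = {"speak", "break", "emphasis", "prosody", "say-as", "sub", "phoneme"}

# Captures the name of each opening or self-closing tag, e.g. "amazon:effect".
# Closing tags start with "</" and therefore do not match.
TAG_NAME = re.compile(r"<([A-Za-z][\w:.-]*)")


def non_portable_tags(ssml: str) -> list[str]:
    """Return the sorted, de-duplicated tag names outside the allowlist."""
    found = {name for name in TAG_NAME.findall(ssml) if name not in PORTABLE_TAGS}
    return sorted(found)


sample = (
    '<speak><amazon:effect name="whispered">psst</amazon:effect>'
    '<break time="1s"/>'
    '<mstts:express-as style="cheerful">Hello!</mstts:express-as></speak>'
)
print(non_portable_tags(sample))
```

A check like this only narrows down what to test; it does not replace listening to the output in the target engine.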

Are there neural voices for Brazilian Portuguese? Yes. Polly's Camila (neural) is a popular choice for natural-sounding pt-BR. Google has pt-BR-Neural2-A through pt-BR-Neural2-C. Azure offers Francisca and Antônio. Neural voices cost roughly US$ 16 per 1 M characters and standard voices about US$ 4; check each provider's current pricing before budgeting.
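As a rough budgeting aid, the per-character rates quoted above translate into cost like this. The rates are the approximate figures from this answer, not authoritative prices, and the function name estimate_cost_usd is hypothetical; how each provider counts SSML tags toward billed characters is not modelled here.

```python
# Approximate rates quoted in the text, in US dollars per million characters.
NEURAL_USD_PER_MILLION = 16.0
STANDARD_USD_PER_MILLION = 4.0


def estimate_cost_usd(characters: int, usd_per_million: float) -> float:
    """Estimate synthesis cost for a given number of billed characters."""
    return characters / 1_000_000 * usd_per_million


script_length = 250_000  # hypothetical script size, in characters
print(estimate_cost_usd(script_length, NEURAL_USD_PER_MILLION))    # 4.0
print(estimate_cost_usd(script_length, STANDARD_USD_PER_MILLION))  # 1.0
```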

How many tags should I use? Fewer than you think. Engines are good at default prosody; over-tagging produces robotic results. Use say-as where ambiguity exists, break for deliberate beats, and stop there. Resist the urge to micro-manage every word with prosody.

What is a PLS lexicon? Pronunciation Lexicon Specification — a separate XML file mapping written forms to phonetic spellings. Useful for brand names and jargon you reuse across many SSML documents (define "Nike" once, reference everywhere).
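For illustration, a small PLS file for the "Nike" example could be generated as follows. This is a sketch: the function name build_lexicon and the default language en-US are assumptions, and how the resulting file is uploaded or referenced differs per provider and is not shown.

```python
from xml.sax.saxutils import escape, quoteattr

# Namespace defined by the W3C Pronunciation Lexicon Specification 1.0.
PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_lexicon(entries: dict[str, str], lang: str = "en-US") -> str:
    """Render {written form: IPA pronunciation} pairs as a PLS document."""
    lexemes = "".join(
        f"  <lexeme><grapheme>{escape(word)}</grapheme>"
        f"<phoneme>{escape(ipa)}</phoneme></lexeme>\n"
        for word, ipa in entries.items()
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<lexicon version="1.0" xmlns="{PLS_NS}" '
        f'alphabet="ipa" xml:lang={quoteattr(lang)}>\n'
        f"{lexemes}</lexicon>\n"
    )


pls = build_lexicon({"Nike": "ˈnaɪki"})
print(pls)
```

Each lexeme pairs a grapheme (the written form) with a phoneme (its pronunciation), so one file can serve many SSML documents.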
