Built for AWS Polly TTS

What is AWS Polly?

AWS Polly is a text-to-speech (TTS) web service. Many such services exist, but Polly integrates well with the technical requirements of enabling Digital Humans to speak.

A didimo can be requested from our API with AWS Polly compatibility.

You send AWS Polly a web request containing the text input from the user (for a chatbot, for instance). Polly returns an audio file of the speech together with a text file of timing data, which is used to animate the lips into the correct shape for each viseme at the correct time.
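As a rough illustration of how that timing data can be consumed, the sketch below parses viseme entries and looks up which viseme is active at a given playback time. It assumes the timing data arrives as newline-delimited JSON objects with `time` (milliseconds), `type`, and `value` fields, which is how Polly speech marks are documented. The sample data and the helper names `parse_viseme_marks` and `viseme_at` are made up for this example and are not part of any didimo or AWS SDK.

```python
import json
from bisect import bisect_right


def parse_viseme_marks(speech_marks_text):
    """Return (time_ms, viseme) pairs from newline-delimited JSON speech marks."""
    marks = []
    for line in speech_marks_text.splitlines():
        line = line.strip()
        if not line:
            continue
        mark = json.loads(line)
        # Other mark types (word, sentence, ssml) are ignored here.
        if mark.get("type") == "viseme":
            marks.append((mark["time"], mark["value"]))
    marks.sort(key=lambda m: m[0])
    return marks


def viseme_at(marks, time_ms):
    """Return the viseme active at time_ms, or None before the first mark."""
    times = [t for t, _ in marks]
    index = bisect_right(times, time_ms) - 1
    return marks[index][1] if index >= 0 else None


# Illustrative sample only; real timings and values come from Polly.
SAMPLE = """
{"time":0,"type":"word","start":0,"end":3,"value":"pet"}
{"time":0,"type":"viseme","value":"p"}
{"time":120,"type":"viseme","value":"E"}
{"time":250,"type":"viseme","value":"t"}
{"time":400,"type":"viseme","value":"sil"}
"""

timeline = parse_viseme_marks(SAMPLE)
```

An animation loop could call `viseme_at(timeline, playback_ms)` on each frame and drive the matching blendshape. How Polly's viseme symbols map onto didimo pose names is not covered by this sketch.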

You can read more in the AWS Polly Developer Documentation.

Quick start

Didimo compatibility with AWS Polly

Didimo's facial blendshapes are built to be compatible with AWS Polly. You can request that a didimo be generated with these additional blendshapes.

See the AWS Polly viseme mapping table below.

How do I get visemes for TTS generated in the didimo package?

When generating a didimo, set the flag aws_polly=true.

Head over to the API reference section under Generate a New didimo to see a curl example of this.

How to integrate AWS Polly?

More info can be found at:

Orientation

What is a Viseme?

A viseme represents the position of the face and mouth when saying a word. It is the visual equivalent of a phoneme, which is the basic acoustic unit from which a word is formed. Visemes are the basic visual building blocks of speech.

Each language has a set of visemes that correspond to their specific phonemes. In a language, each phoneme has a corresponding viseme that represents the shape that the mouth makes when forming the sound. However, not all visemes can be mapped to a particular phoneme because numerous phonemes appear the same when spoken, even though they sound different. For example, in English, the words "pet" and "bet" are acoustically different. However, when observed visually (without sound), they look exactly the same.

The above is an extract from the AWS Polly documentation, where you can read more.

What is a Phoneme?

A phoneme is the smallest unit of sound in speech. When we teach reading we teach children which letters represent those sounds. For example – the word ‘hat’ has 3 phonemes – ‘h’ ‘a’ and ‘t’.

What's the difference between a phoneme and viseme?

Phonemes are specific to sounds; visemes are specific to the shapes made by the mouth.

| Type | Example |
| Phoneme | The word ‘hat’ has 3 phonemes: ‘h’, ‘a’ and ‘t’. |
| Viseme | The words "pet" and "bet" are acoustically different, so they have different phonemes. However, when observed visually (without sound), they look exactly the same, so they share the same visemes. |

AWS Polly viseme mapping

When you generate a new didimo with the aws_polly flag set to true, the package you receive contains the following 21 phoneme blendshapes, which follow the AWS Polly specification.

| AWS Polly Phoneme | Didimo Pose |
| phoneme_aa | phoneme_aa |
| phoneme_ae_ax_ah | phoneme_ae_ax_ah |
| phoneme_ao | phoneme_ao |
| phoneme_aw | phoneme_aw |
| phoneme_ay | phoneme_ay |
| phoneme_d_t_n | phoneme_d_t_n |
| phoneme_er | phoneme_er |
| phoneme_ey_eh_uh | phoneme_ey_eh_uh |
| phoneme_f_v | phoneme_f_v |
| phoneme_h | phoneme_h |
| phoneme_k_g_ng | phoneme_k_g_ng |
| phoneme_l | phoneme_l |
| phoneme_ow | phoneme_ow |
| phoneme_oy | phoneme_oy |
| phoneme_p_b_m | phoneme_p_b_m |
| phoneme_r | phoneme_r |
| phoneme_s_z | phoneme_s_z |
| phoneme_sh_ch_jh_zh | phoneme_sh_ch_jh_zh |
| phoneme_th_dh | phoneme_th_dh |
| phoneme_w_uw | phoneme_w_uw |
| phoneme_y_iy_ih_ix | phoneme_y_iy_ih_ix |
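
The pose names above appear to encode which phonemes share a mouth shape. The sketch below, under that assumption, builds a phoneme-to-pose lookup by splitting each pose name on underscores, then shows that "pet" and "bet" resolve to the same pose sequence. The ARPAbet-style phoneme spellings (`p`, `eh`, `t`) and the helper names `build_phoneme_to_pose` and `poses_for` are illustrative and are not part of the didimo API.

```python
# The 21 pose names listed in the mapping table.
DIDIMO_POSES = [
    "phoneme_aa", "phoneme_ae_ax_ah", "phoneme_ao", "phoneme_aw",
    "phoneme_ay", "phoneme_d_t_n", "phoneme_er", "phoneme_ey_eh_uh",
    "phoneme_f_v", "phoneme_h", "phoneme_k_g_ng", "phoneme_l",
    "phoneme_ow", "phoneme_oy", "phoneme_p_b_m", "phoneme_r",
    "phoneme_s_z", "phoneme_sh_ch_jh_zh", "phoneme_th_dh",
    "phoneme_w_uw", "phoneme_y_iy_ih_ix",
]

PREFIX = "phoneme_"


def build_phoneme_to_pose(poses):
    """Map each phoneme label to the pose whose name contains it."""
    lookup = {}
    for pose in poses:
        # Assumption: the text after "phoneme_" is a list of
        # underscore-separated phoneme labels sharing one mouth shape.
        for phoneme in pose[len(PREFIX):].split("_"):
            lookup[phoneme] = pose
    return lookup


PHONEME_TO_POSE = build_phoneme_to_pose(DIDIMO_POSES)


def poses_for(phonemes):
    """Return the pose name for each phoneme label, in order."""
    return [PHONEME_TO_POSE[p] for p in phonemes]


# "pet" and "bet" differ in their first phoneme but not in their poses.
pet = poses_for(["p", "eh", "t"])
bet = poses_for(["b", "eh", "t"])
```

Because several phonemes collapse onto one pose, the lookup is many-to-one: the pose sequence alone cannot be used to recover the original phonemes.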