AWS Polly is a "Text to Speech" web service. Many such services exist, but this one fits the technical requirements for enabling Digital Humans to speak.
You send AWS Polly a web request containing the input text (for example, text coming from a chatbot conversation with the user). Polly returns an audio file of the speech, along with a text file of timing data that is used to animate the lips into the correct shape for each viseme at the correct time.
For more detail, see the AWS Polly Developer Documentation.
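The request described above can be sketched in Python. This is a minimal sketch, not Didimo's own integration: it assumes a boto3-style Polly client is passed in, and the function name `fetch_speech_and_visemes` and the default voice `Joanna` are illustrative choices. With boto3, the audio and the viseme timing data come from two separate `synthesize_speech` calls, because speech marks are only returned when the output format is `json`.

```python
import json


def fetch_speech_and_visemes(polly, text, voice_id="Joanna"):
    """Request MP3 audio and viseme speech marks for the same text.

    `polly` is assumed to be a boto3 Polly client, or any object that
    exposes a compatible synthesize_speech() method.
    """
    # First request: the spoken audio as MP3 bytes.
    audio = polly.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice_id,
    )["AudioStream"].read()

    # Second request: viseme speech marks, one JSON object per line.
    marks_raw = polly.synthesize_speech(
        Text=text,
        OutputFormat="json",
        VoiceId=voice_id,
        SpeechMarkTypes=["viseme"],
    )["AudioStream"].read()

    # Each line looks like {"time": 125, "type": "viseme", "value": "p"},
    # where "time" is the offset in milliseconds from the start of the audio.
    visemes = [
        json.loads(line)
        for line in marks_raw.decode("utf-8").splitlines()
        if line.strip()
    ]
    return audio, visemes
```

With AWS credentials configured, this could be called as `fetch_speech_and_visemes(boto3.client("polly"), "Hello")`. Passing the client in as a parameter keeps the function easy to exercise with a stub when no AWS account is available.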
- How to Generate a didimo with TTS capability
- Unity Example of AWS Polly Integration
- AWS Polly Developer Documentation
Didimo's facial blendshapes are built to be compatible with AWS Polly. You can request a didimo to be generated with these additional blendshapes.
See the AWS Polly Viseme Mapping Table below.
When generating a didimo, turn the flag
Head over to the API reference section under Generate a New didimo to see a curl example of this.
More info can be found at:
- Integration: Text-To-Speech
- Facial Animations
- [AWS iOS Example](https://docs.aws.amazon.com/polly/latest/dg/examples-ios.html)
- AWS Android Example
A viseme represents the position of the face and mouth when saying a word. It is the visual equivalent of a phoneme, which is the basic acoustic unit from which a word is formed. Visemes are the basic visual building blocks of speech.
Each language has a set of visemes that correspond to their specific phonemes. In a language, each phoneme has a corresponding viseme that represents the shape that the mouth makes when forming the sound. However, not all visemes can be mapped to a particular phoneme because numerous phonemes appear the same when spoken, even though they sound different. For example, in English, the words "pet" and "bet" are acoustically different. However, when observed visually (without sound), they look exactly the same.
The above is an extract from the AWS Polly documentation, where you can read more.
A phoneme is the smallest unit of sound in speech. When we teach reading we teach children which letters represent those sounds. For example – the word ‘hat’ has 3 phonemes – ‘h’ ‘a’ and ‘t’.
Phonemes are specific to sounds; visemes are specific to the shapes made by the mouth.
The word ‘pet’ has 3 phonemes – ‘p’ ‘e’ and ‘t’.
The words "pet" and "bet" are acoustically different, so they have different phonemes. However, when observed visually (without sound), they look exactly the same, so they share the same visemes.
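The "pet" versus "bet" case can be illustrated with a small lookup. This is a sketch under assumptions: the table below is only an illustrative subset of an English phoneme-to-viseme grouping (phonemes written in X-SAMPA, so `E` is the vowel in "pet"), and the names `PHONEME_TO_VISEME` and `visemes_for` are made up for this example. Consult the AWS Polly documentation for the full mapping.

```python
# Illustrative subset only: several phonemes collapse onto one viseme.
PHONEME_TO_VISEME = {
    "p": "p", "b": "p", "m": "p",  # lips pressed together
    "t": "t", "d": "t",            # tongue behind the teeth
    "E": "E",                      # the vowel in "pet" and "bet"
}


def visemes_for(phonemes):
    """Map a sequence of phonemes to the visemes shown on the face."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]


pet = visemes_for(["p", "E", "t"])
bet = visemes_for(["b", "E", "t"])

# Different phonemes, identical mouth shapes.
print(pet == bet)
```

Because the mapping is many-to-one, it cannot be reversed: a viseme sequence alone does not tell you which word was spoken.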
When you generate a new didimo and set the aws_polly flag to TRUE, the package you receive will contain the following 21 different phoneme blendshapes, which follow the AWS Polly Specification:
AWS Polly Phoneme
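To show how viseme timing data could drive these blendshapes, here is a minimal sketch that turns Polly viseme speech marks into timed spans. The blendshape naming scheme (a `viseme_` prefix followed by the Polly viseme value) and the function name `viseme_keyframes` are hypothetical; use the blendshape names delivered in your didimo package instead.

```python
def viseme_keyframes(speech_marks, blendshape_prefix="viseme_"):
    """Turn viseme speech marks into (start_s, end_s, blendshape_name) spans.

    `speech_marks` is a list of dicts such as
    {"time": 125, "type": "viseme", "value": "p"}, with "time" in milliseconds.
    The prefix-based blendshape name is a placeholder, not Didimo's naming.
    """
    marks = sorted(
        (m for m in speech_marks if m.get("type") == "viseme"),
        key=lambda m: m["time"],
    )
    frames = []
    for current, following in zip(marks, marks[1:] + [None]):
        start = current["time"] / 1000.0
        # A viseme is held until the next one starts; the last is open-ended.
        end = following["time"] / 1000.0 if following else None
        frames.append((start, end, blendshape_prefix + current["value"]))
    return frames
```

An animation system would typically blend between consecutive spans rather than switch abruptly, but the spans give it the target shape and timing for each viseme.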