Beyond SSML: Controlling Whisper, Shout, and Cry via API
If you tell a standard TTS engine to say 'I am so angry right now', it will say it with the same flat calmness as 'The weather is nice today'. This dissonance breaks user trust, because much of what humans communicate is carried by tone (prosody) rather than by the words alone. (See also: [Medical Voice Banking](/blog/medical-voice-banking-als) for how we preserve this identity.)
MorVoice allows you to control this prosody not just with broad categories, but with **Scalar Style Vectors**. You can mix emotions like paint.
The Style Vector API
Basic Emotion Tagging
// Simple request
{
  "text": "Get out of my office!",
  "emotion": "anger"
}
This picks a default 'angry' style. But human emotion is nuanced. Maybe you want 'cold, quiet fury' rather than 'hot shouting'.
Advanced Scalar Mixing
We expose 6 core emotion dimensions: Happiness, Sadness, Anger, Fear, Disgust, and Surprise. You can set each one to a value from 0.0 to 1.0.
// Complex 'passive-aggressive' mix
{
  "text": "Oh, sure, that's a great idea.",
  "emotion": {
    "anger": 0.3,     // Underlying tension
    "happiness": 0.6, // Fake politeness
    "disgust": 0.2    // Subtle judgment
  },
  "voice_settings": {
    "speed": 0.9,     // Slightly slower for emphasis
    "pitch": -1.0     // Lower tone
  }
}
The result is a chillingly sarcastic delivery that no standard model could produce.
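Since each dimension is documented as a scalar between 0.0 and 1.0, it can be useful to check a mix before sending it. The following is a minimal sketch, not part of the MorVoice SDK: `normalizeEmotion` and `buildRequest` are hypothetical helper names, and the request shape simply mirrors the JSON shown above.

```javascript
// The six core dimensions named in this article.
const EMOTION_DIMENSIONS = ["happiness", "sadness", "anger", "fear", "disgust", "surprise"];

// Hypothetical helper: returns a copy of `emotion` that contains only known
// dimensions, with every value clamped to the range 0.0 to 1.0.
function normalizeEmotion(emotion) {
  const result = {};
  for (const [name, value] of Object.entries(emotion)) {
    if (!EMOTION_DIMENSIONS.includes(name)) {
      throw new Error(`Unknown emotion dimension: ${name}`);
    }
    if (typeof value !== "number" || Number.isNaN(value)) {
      throw new Error(`Emotion value for ${name} must be a number`);
    }
    result[name] = Math.min(1, Math.max(0, value));
  }
  return result;
}

// Hypothetical helper: builds a request body in the same shape as the
// JSON example above (text, emotion, voice_settings).
function buildRequest(text, emotion, voiceSettings = {}) {
  return { text, emotion: normalizeEmotion(emotion), voice_settings: voiceSettings };
}

const request = buildRequest(
  "Oh, sure, that's a great idea.",
  { anger: 0.3, happiness: 0.6, disgust: 0.2 },
  { speed: 0.9, pitch: -1.0 }
);
console.log(JSON.stringify(request, null, 2));
```

Clamping out-of-range values while rejecting unknown dimension names is a design choice made for this sketch; a stricter variant could throw on out-of-range values as well.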
Dynamic Contextual AI
The real power comes when you connect this to an LLM. Ask GPT-4 to output the JSON style vector alongside the text response.
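Model output is not guaranteed to be valid JSON or to stay within the 0.0 to 1.0 range, so it may be worth validating the reply before it reaches the TTS request. Here is a minimal sketch, assuming the reply carries `message` and `emotion` keys like the example output in this section. `parseStyledReply` is a hypothetical helper, not a MorVoice or OpenAI function.

```javascript
// Hypothetical helper: parses the model's raw reply and returns
// { message, emotion }. Falls back to the raw text with an empty
// emotion object when the reply is not usable JSON.
function parseStyledReply(raw) {
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    // Not JSON at all: treat the whole reply as plain text.
    return { message: raw, emotion: {} };
  }
  if (parsed === null || typeof parsed !== "object" || typeof parsed.message !== "string") {
    return { message: raw, emotion: {} };
  }
  const scores =
    parsed.emotion && typeof parsed.emotion === "object" && !Array.isArray(parsed.emotion)
      ? parsed.emotion
      : {};
  const emotion = {};
  for (const [name, value] of Object.entries(scores)) {
    // Keep only finite numeric scores, clamped to 0.0 to 1.0.
    if (typeof value === "number" && Number.isFinite(value)) {
      emotion[name] = Math.min(1, Math.max(0, value));
    }
  }
  return { message: parsed.message, emotion };
}

const reply = parseStyledReply(
  '{"message": "I\'m so sorry to hear that.", "emotion": {"sadness": 0.7, "anger": 0.1}}'
);
console.log(reply.message, reply.emotion);
```

Note that strict JSON does not allow `//` comments, so annotations like the ones in the example output would have to be absent from the real model reply for `JSON.parse` to accept it.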
# System Prompt for GPT-4
SYSTEM_PROMPT = """
You are a helpful assistant.
Analyze the sentiment of your reply and score each relevant emotion from 0.0 to 1.0.
Respond with a JSON object containing "message" (your reply) and "emotion" (the scores).
"""
# GPT-4 Output:
{
  "message": "I'm so sorry to hear that your account was locked. That must be frustrating.",
  "emotion": {
    "sadness": 0.7, // Empathy
    "anger": 0.1    // Mirroring user frustration
  }
}
Use Cases
1. Audiobooks
Characters can whisper during stealth scenes or scream during battles. The 'Projection' parameter controls the simulated distance from the microphone.
2. Therapy Bots
A bot dealing with sensitive topics needs to sound gentle and reassuring (high 'Warmth', low 'Speed') rather than upbeat and energetic.
Code Example: The 'Shout' Function
// Assumes `morvoice` is an already-initialized MorVoice client.
async function shout(text) {
  return await morvoice.generate({
    text,
    style: {
      projection: "shout", // Special mode for loud projection
      anger: 0.8,
      excitement: 0.5
    },
    // IMPORTANT: Turn on clipping protection for loud audio
    post_processing: { normalize: true }
  });
}
Conclusion
Emotion is the API of humanity. By giving you fine-grained control over the emotional vector of the voice, MorVoice turns a 'text reader' into a 'digital actor'. Start experimenting with the Style Lab in your dashboard today.