Tutorial: Building Conversational NPCs in Unity 6 with MorVoice SDK (Zero-Latency Setup)
The holy grail of modern gaming is the 'Smart NPC': a character you can talk to that replies intelligently. LLMs (like GPT-4) have largely solved the brain; the voice remains the bottleneck. Traditional TTS is too slow (high latency) and too robotic (it breaks immersion).
This tutorial shows you how to implement **MorVoice Streaming SDK** in Unity 6. We will achieve a voice-response latency of under 200ms (see our [Latency Benchmark](/blog/websocket-vs-http-tts-latency-benchmark-2026)), making the conversation feel instantaneous.
Prerequisites
- Unity 2022.3 LTS or higher (Unity 6 recommended)
- MorVoice SDK (Install via Package Manager: https://npm.morvoice.com)
- An API Key from dashboard.morvoice.com
- A basic NPC GameObject with an AudioSource component
Architecture: The Streaming Pipeline
Do NOT save audio to disk. File I/O adds 50-100ms of lag. We will stream raw PCM data directly from the WebSocket memory buffer into the AudioSource's clip buffer.
Step 1: The NPC Voice Controller
Create a new script `NPCVoiceController.cs` and attach it to your character.
```csharp
using UnityEngine;
using MorVoice.SDK;
using System.Collections;

[RequireComponent(typeof(AudioSource))]
public class NPCVoiceController : MonoBehaviour
{
    [SerializeField] private string voiceId = "orc_warrior_v2";

    private MorVoiceClient _client;
    private AudioSource _audioSource;

    void Start()
    {
        _client = new MorVoiceClient(ApiKey.LoadFromEnv());
        _audioSource = GetComponent<AudioSource>();
    }

    public async void Speak(string text)
    {
        // 1. Start the stream. This returns as soon as the connection is active,
        //    not when the full audio has been generated.
        var stream = await _client.StreamSpeechAsync(text, voiceId);

        // 2. Create a streaming AudioClip: Unity pulls samples on demand through the
        //    PCMReaderCallback, and we fill them straight from the WebSocket buffer.
        var clip = AudioClip.Create("VoiceStream", 44100 * 60, 1, 44100, true,
            (float[] data) => stream.ReadBuffer(data));

        _audioSource.clip = clip;
        _audioSource.Play();
    }
}
```
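To trigger speech, call `Speak()` from your dialogue logic once the LLM reply arrives. A minimal sketch (the `DialogueManager` class and its `OnLLMReplyReceived` hook are placeholders for your own chat code, not part of the SDK):

```csharp
using UnityEngine;

// Placeholder dialogue hook: wire this up wherever your LLM response lands.
public class DialogueManager : MonoBehaviour
{
    [SerializeField] private NPCVoiceController npcVoice;

    // Called by your chat/LLM system when a reply is ready.
    public void OnLLMReplyReceived(string reply)
    {
        npcVoice.Speak(reply);
    }
}
```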
Step 2: Lipsync Integration
Audio isn't enough; the mouth must move. MorVoice sends 'viseme' events (mouth shapes) alongside the audio chunks via the WebSocket. This is much faster than analyzing the audio on the client side.
```csharp
// Inside the Speak() method, subscribe to viseme events before calling Play().
// Assumes the controller also holds a reference to the character's face mesh, e.g.:
// [SerializeField] private SkinnedMeshRenderer faceMesh;
stream.OnViseme += (visemeCode, duration) => {
    // Map MorVoice viseme codes to your character's BlendShapes.
    // Example: code 4 = 'Ah' sound -> set the 'MouthOpen' BlendShape to 100.
    float intensity = 100f;
    faceMesh.SetBlendShapeWeight(visemeCode, intensity);

    // Auto-close the mouth after the viseme's duration.
    StartCoroutine(ResetMouth(visemeCode, duration));
};
```
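The handler above references a `ResetMouth` coroutine that the snippet leaves out. A minimal sketch, assuming the SDK reports `visemeCode` as a BlendShape index (int) and `duration` in seconds:

```csharp
// Returns the given BlendShape to 0 once the viseme's duration has elapsed.
private IEnumerator ResetMouth(int visemeCode, float duration)
{
    yield return new WaitForSeconds(duration);
    faceMesh.SetBlendShapeWeight(visemeCode, 0f);
}
```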
Optimization Tips
1. Pre-warming Connection
Establish the WebSocket connection when the player enters the room, not when they start talking. This saves the initial SSL handshake time (approx 100ms).
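For example, a trigger volume around the NPC can open the connection as the player approaches. This sketch lives in `NPCVoiceController` (which already holds `_client`); `ConnectAsync()` is an assumed method name, so substitute whatever connect or pre-warm call the MorVoice client actually exposes:

```csharp
// Hypothetical pre-warm: open the voice connection when the player gets close,
// so the handshake is already done by the time dialogue starts.
private async void OnTriggerEnter(Collider other)
{
    if (other.CompareTag("Player"))
    {
        await _client.ConnectAsync(); // assumed SDK call; check the MorVoice API
    }
}
```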
2. Caching Common Phrases
For standard replies like 'Hello', 'What do you want?', or 'Goodbye', generate them once and cache them locally. Only use Streaming TTS for dynamic LLM responses.
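A simple cache is a dictionary of pre-generated AudioClips keyed by phrase. How you produce the cached clips (offline, at build time, or via a one-off synthesis call) is up to you; the sketch below only illustrates the lookup-then-fallback structure:

```csharp
using System.Collections.Generic;
using UnityEngine;

// Illustration only: play canned phrases from memory, fall back to streaming TTS.
public class PhraseCache : MonoBehaviour
{
    [SerializeField] private NPCVoiceController voice;
    [SerializeField] private AudioSource audioSource;

    // Pre-generated clips for stock lines ("Hello", "What do you want?", "Goodbye").
    private readonly Dictionary<string, AudioClip> _cache = new Dictionary<string, AudioClip>();

    public void Say(string text)
    {
        if (_cache.TryGetValue(text, out var clip))
        {
            audioSource.PlayOneShot(clip); // instant, no network round-trip
        }
        else
        {
            voice.Speak(text); // dynamic LLM line -> streaming TTS
        }
    }
}
```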
Common Pitfalls
❌ NEVER call .ToArray() on the stream. That waits for the full audio to download.
✅ ALWAYS use the streaming callback or buffer reader.
❌ WARNING: Don't use standard HTTP requests. They block the main thread in WebGL builds.
✅ ALWAYS use the async/await pattern shown above.
Conclusion
With this setup, your NPCs can interrupt players, react to game events in real time, and whisper or shout dynamically. The MorVoice SDK handles the heavy lifting of buffering and decoding, letting you focus on the gameplay logic.
Download the complete Unity Project example from our GitHub repository.