04. Voice Pipeline (TTS/STT/LipSync) / 음성 파이프라인 (TTS/STT/LipSync)

Last verified: 2026년 2월 / February 2026

한국어

개요

음성 파이프라인은 사용자의 음성을 인식(STT)하고, AI NPC의 답변을 자연스러운 목소리로 합성(TTS)하며, 이를 캐릭터의 입 모양과 동기화(LipSync)하는 핵심 기술 요소입니다. 혜경궁 홍씨(Lady Hyegyong) AI NPC 프로젝트에서는 조선 시대 왕실의 격조 있는 말투와 감정을 생생하게 전달하기 위해 고품질의 한국어 음성 합성과 실시간 반응성이 필수적입니다.

본 문서는 다양한 TTS, STT, LipSync 솔루션을 비교 분석하고, Unity MR 환경(Quest 3)에서 최적의 사용자 경험을 제공하기 위한 기술적 전략을 제시합니다. 특히 한국어와 영어 이중언어 지원 및 스트리밍 기반의 레이턴시(Latency) 최적화 방안을 중점적으로 다룹니다.

핵심 발견

TTS 성능: ElevenLabs Turbo v2.5는 약 75ms~250ms의 매우 낮은 지연 시간(TTFB)을 기록하며, Typecast의 ssfm-v30 모델은 7가지 감정 표현과 한국어 격식체 구현에 강점을 보입니다.
STT 온디바이스화: Whisper Sentis를 활용하면 Quest 3 기기 내에서 네트워크 없이 실시간 음성 인식이 가능하여 개인정보 보호와 안정성을 동시에 확보할 수 있습니다.
립싱크 최적화: Quest 3 환경에서는 Meta의 OVRLipSync가 가장 낮은 레이턴시와 높은 최적화 수준을 보이며, SALSA LipSync v2는 오디오 파형 분석을 통해 범용적인 Unity 연동을 지원합니다.
스트리밍 패턴: 문장 단위 버퍼링(Sentence-level buffering)과 WebSocket 연결을 통해 전체 체감 지연 시간(TTFA)을 1초 미만으로 단축할 수 있습니다.
알려진 버그: 오픈소스 엔진인 Coqui는 한국어 특정 발음에서 부자연스러운 기계음이나 발음 오류가 발생하는 갭이 확인되었습니다.

비교 분석

1. TTS (Text-to-Speech) 비교표

서비스	한국어 품질	감정 표현	격식체	스트리밍 지연	Unity 연동	커스텀 음성	비용 (2026.02)	오프라인
Typecast	최상 (Native)	7종 (ssfm-v30)	우수	~300ms	SDK 지원	가능	$0.05/자	불가
ElevenLabs	우수 (Multilingual)	우수	보통	75~250ms (Turbo)	SDK/API	가능	$0.30/1k자	불가
Azure Speech	우수	멀티 스타일	우수	~200ms	SDK 지원	가능	$16/1M자	일부 가능
NAVER CLOVA	최상 (Native)	4단계 (vyuna)	우수	~250ms	API 지원	가능	₩0.1/자	불가
Google Cloud	보통	제한적	보통	~300ms	API 지원	불가	$16/1M자	불가
Coqui	보통 (버그 존재)	가능	보통	가변적	플러그인	가능	무료 (오픈소스)	가능

2. STT (Speech-to-Text) 비교표

서비스	유형	한국어 정확도	지연 시간	Unity 연동	비용 (2026.02)	오프라인
Whisper Sentis	온디바이스	우수	낮음 (로컬)	Sentis SDK	무료	가능
Azure Speech	클라우드	최상	보통	SDK 지원	$1.00/시간	불가
Google Speech	클라우드	우수	보통	API 지원	$0.024/분	불가

3. LipSync (립싱크) 비교표

솔루션	접근 방식	품질	Unity 연동	비용 (2026.02)	지연 시간
OVRLipSync	Viseme 기반	우수	Meta XR SDK	무료	최저 (~100ms)
SALSA v2	파형 분석	보통	에셋 스토어	$45 (1회)	보통 (~150ms)
Audio2Face	AI 기반	최상	Omniverse	무료 (NVIDIA)	높음 (~300ms)

TTS 평가 모순 해결 (TTS Evaluation Contradiction Resolution)

본 프로젝트의 초기 리서치(Research 2)와 심층 리서치(Research 7) 결과 사이에서 TTS 서비스 순위에 대한 모순이 발견되었습니다.

Research 2 (초기 평가): 범용적인 한국어 자연스러움과 접근성을 기준으로 NAVER CLOVA와 Google Cloud TTS를 높게 평가했습니다.
Research 7 (심층 평가): 실시간 대화형 NPC 구현을 위한 ‘스트리밍 레이턴시’와 ‘감정 제어의 세밀함’을 기준으로 ElevenLabs와 Typecast를 최상위로 재평가했습니다.
해결 방안: 본 프로젝트는 실시간 상호작용이 핵심이므로, Research 7의 결과를 우선순위로 채택합니다. 다만, 안정적인 한국어 격식체 표현이 필요한 특정 시나리오(예: 긴 나레이션)에서는 Research 2에서 높게 평가된 NAVER CLOVA를 폴백(Fallback) 옵션으로 고려합니다.

알려진 갭 및 향후 과제

Coqui 한국어 발음: 오픈소스 Coqui 엔진 사용 시 한국어 특정 음절에서 금속성 기계음이 섞이는 버그가 확인되었습니다. 오프라인 환경 구축 시 이 문제의 해결이 선행되어야 합니다.
Samsung Galaxy XR: Android XR 기반의 OpenXR 호환성은 이론적으로 가능하나, 전용 음성 SDK의 최적화 수준은 아직 검증되지 않은 알려진 갭(Known Gap)입니다.
이중언어 전환: 대화 도중 한국어와 영어를 혼용할 때 발생하는 음색 변화를 최소화하기 위한 단일 보이스 클로닝 기술 적용이 필요합니다.

출처 및 참고문헌

ElevenLabs API Documentation (2026)
Typecast SSFM v3.0 Technical Whitepaper (2026)
Microsoft Azure AI Speech Service Docs (2026)
Unity Sentis & Whisper Integration Guide (2025)
Meta XR SDK: OVRLipSync Reference (2026)

English

Overview

The voice pipeline is the core component that recognizes the user’s speech (STT), synthesizes the AI NPC’s reply in a natural voice (TTS), and synchronizes that audio with the character’s mouth shapes (LipSync). In the Lady Hyegyong (혜경궁 홍씨) AI NPC project, high-quality Korean speech synthesis and real-time responsiveness are both required to convey the dignified speech and emotions of the Joseon (조선) royal court.

This document compares TTS, STT, and LipSync solutions and proposes a technical strategy for the best user experience in a Unity MR environment (Quest 3). It focuses on Korean/English bilingual support and on streaming-based latency optimization.

Key Findings

TTS Performance: ElevenLabs Turbo v2.5 shows a very low time to first byte (TTFB) of roughly 75ms to 250ms, while Typecast’s ssfm-v30 model is strongest at expressing 7 emotion types and formal Korean speech.
On-device STT: Running Whisper through Unity Sentis enables real-time speech recognition on the Quest 3 itself with no network connection, which protects privacy and keeps recognition stable regardless of connectivity.
LipSync Optimization: On Quest 3, Meta’s OVRLipSync has the lowest latency and is the best optimized for the device, while SALSA LipSync v2 offers general-purpose Unity integration based on audio waveform analysis.
Streaming Patterns: Sentence-level buffering combined with WebSocket connections can bring the total perceived latency, measured as time to first audio (TTFA), below 1 second.
Known Bug: The open-source Coqui engine produces unnatural mechanical sounds or mispronunciations on certain Korean syllables.
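The sentence-level buffering mentioned under Streaming Patterns can be sketched as follows. This is a minimal, engine-neutral illustration in Python rather than Unity C#; the function name `buffer_sentences` and the set of sentence terminators are assumptions made for the example, not part of any SDK named in this document. The idea is to forward each sentence to TTS as soon as it is complete instead of waiting for the whole reply.

```python
import re
from typing import Iterable, Iterator

# Assumed sentence terminators; real dialogue may need a richer rule set.
_SENTENCE_END = re.compile(r"[.!?…]+\s*")


def buffer_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of text chunks."""
    pending = ""
    for chunk in chunks:
        pending += chunk
        # Emit every finished sentence currently sitting in the buffer.
        while True:
            match = _SENTENCE_END.search(pending)
            if match is None:
                break
            sentence = pending[: match.end()].strip()
            pending = pending[match.end():]
            if sentence:
                yield sentence
    # Flush whatever is left when the stream ends.
    tail = pending.strip()
    if tail:
        yield tail


if __name__ == "__main__":
    stream = ["안녕하", "세요. 반갑", "습니다! How are", " you"]
    for sentence in buffer_sentences(stream):
        print(sentence)  # each sentence would be handed to the TTS request here
```

The sketch splits on punctuation only, so decimals such as "3.5" or terminators split across two chunks would be cut early; a production version would need to handle those cases.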

Comparative Analysis

1. TTS (Text-to-Speech) Comparison Table

Service	Korean Quality	Emotion	Formal Speech	Streaming Latency	Unity Integration	Custom Voice	Cost (As of 2026.02)	Offline
Typecast	Best (Native)	7 types (ssfm-v30)	Excellent	~300ms	SDK Support	Available	$0.05/char	No
ElevenLabs	Excellent (Multilingual)	Excellent	Average	75-250ms (Turbo)	SDK/API	Available	$0.30/1k chars	No
Azure Speech	Excellent	Multi-style	Excellent	~200ms	SDK Support	Available	$16/1M chars	Partial
NAVER CLOVA	Best (Native)	4 levels (vyuna)	Excellent	~250ms	API Support	Available	₩0.1/char	No
Google Cloud	Average	Limited	Average	~300ms	API Support	N/A	$16/1M chars	No
Coqui	Average (Bugs)	Available	Average	Variable	Plugin	Available	Free (OSS)	Yes
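Because the table quotes prices per character, per 1k characters, and per 1M characters, the costs are hard to compare at a glance. The sketch below normalizes them to a common character count. The prices are copied from the table as-is; the names `PRICES` and `cost_for` are illustrative only, and no currency conversion is attempted for the KRW figure.

```python
from typing import Dict, Tuple

# (currency, price, characters covered by that price), copied from the TTS table.
PRICES: Dict[str, Tuple[str, float, int]] = {
    "Typecast": ("USD", 0.05, 1),
    "ElevenLabs": ("USD", 0.30, 1_000),
    "Azure Speech": ("USD", 16.0, 1_000_000),
    "NAVER CLOVA": ("KRW", 0.1, 1),
    "Google Cloud": ("USD", 16.0, 1_000_000),
}


def cost_for(service: str, characters: int) -> Tuple[str, float]:
    """Return (currency, cost) for synthesizing the given number of characters."""
    currency, price, per_chars = PRICES[service]
    return currency, round(price * characters / per_chars, 6)


if __name__ == "__main__":
    for name in PRICES:
        currency, cost = cost_for(name, 1_000)
        print(f"{name}: {cost} {currency} per 1,000 characters")
```

Normalized this way, the listed Typecast price comes to 50 USD per 1,000 characters, more than two orders of magnitude above the ElevenLabs figure, so that entry is worth re-checking against the vendor's current price list before budgeting.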

2. STT (Speech-to-Text) Comparison Table

Service	Type	Korean Accuracy	Latency	Unity Integration	Cost (As of 2026.02)	Offline
Whisper Sentis	On-device	Excellent	Low (Local)	Sentis SDK	Free	Yes
Azure Speech	Cloud	Best	Average	SDK Support	$1.00/hour	No
Google Speech	Cloud	Excellent	Average	API Support	$0.024/min	No

3. LipSync Comparison Table

Solution	Approach	Quality	Unity Integration	Cost (As of 2026.02)	Latency
OVRLipSync	Viseme-based	Excellent	Meta XR SDK	Free	Lowest (~100ms)
SALSA v2	Waveform Analysis	Average	Asset Store	$45 (One-time)	Average (~150ms)
Audio2Face	AI-based	Best	Omniverse	Free (NVIDIA)	High (~300ms)
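To see how the table figures relate to the sub-1-second TTFA target, a rough budget check can help. The sketch below assumes, purely for illustration, that TTS time to first byte and LipSync latency add up serially before the character visibly starts speaking; in practice the stages may overlap. It uses the best-case 75ms figure for ElevenLabs, and the helper name `remaining_budget_ms` is hypothetical. The result is the time left over for STT and response generation, which this document does not quantify.

```python
TARGET_TTFA_MS = 1000  # "less than 1 second" target from Key Findings

# Approximate latencies in milliseconds, taken from the comparison tables.
TTS_LATENCY_MS = {
    "Typecast": 300,
    "ElevenLabs": 75,
    "Azure Speech": 200,
    "NAVER CLOVA": 250,
    "Google Cloud": 300,
}
LIPSYNC_LATENCY_MS = {"OVRLipSync": 100, "SALSA v2": 150, "Audio2Face": 300}


def remaining_budget_ms(tts: str, lipsync: str) -> int:
    """Time left for STT and response generation, assuming serial stages."""
    return TARGET_TTFA_MS - TTS_LATENCY_MS[tts] - LIPSYNC_LATENCY_MS[lipsync]


if __name__ == "__main__":
    for tts in TTS_LATENCY_MS:
        for lipsync in LIPSYNC_LATENCY_MS:
            print(f"{tts} + {lipsync}: {remaining_budget_ms(tts, lipsync)} ms left")
```

Under these assumptions, ElevenLabs with OVRLipSync leaves 825ms for the remaining stages, while Typecast with Audio2Face leaves 400ms.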

TTS Evaluation Contradiction Resolution

The project’s initial research (Research 2) and its later in-depth research (Research 7) ranked the TTS services differently. The two rankings used different criteria, as summarized below.

Research 2 (Initial Evaluation): Ranked NAVER CLOVA and Google Cloud TTS highly based on general Korean naturalness and accessibility.
Research 7 (Deep Evaluation): Re-evaluated ElevenLabs and Typecast as top-tier based on ‘streaming latency’ and ‘granularity of emotion control’ for real-time conversational NPC implementation.
Resolution: Because real-time interaction is central to this project, the Research 7 ranking takes priority. For specific scenarios that need stable formal Korean delivery (e.g., long narrations), NAVER CLOVA, which Research 2 rated highly, is kept as a fallback option.
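One way to express this resolution in code is a small routing step with a fallback chain. The sketch below is an assumption-laden illustration: the `Synthesizer` adapters are hypothetical wrappers around each vendor's client, the 200-character narration threshold is invented for the example, and ElevenLabs is used as the default primary only because Research 7 placed it in the top tier.

```python
from typing import Callable, Dict, List, Tuple

Synthesizer = Callable[[str], bytes]  # hypothetical adapter: text in, audio bytes out

LONG_NARRATION_CHARS = 200  # assumed threshold, not taken from the research


def provider_order(text: str, is_narration: bool, primary: str = "ElevenLabs") -> List[str]:
    """Prefer NAVER CLOVA for long narration, otherwise the real-time primary."""
    if is_narration and len(text) >= LONG_NARRATION_CHARS:
        return ["NAVER CLOVA", primary]
    return [primary, "NAVER CLOVA"]


def synthesize(text: str, is_narration: bool,
               adapters: Dict[str, Synthesizer]) -> Tuple[str, bytes]:
    """Try providers in order and return (provider name, audio) from the first success."""
    errors = []
    for name in provider_order(text, is_narration):
        try:
            return name, adapters[name](text)
        except Exception as exc:  # any provider failure moves on to the next one
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("all TTS providers failed: " + "; ".join(errors))
```

Keeping the ordering logic separate from the retry loop makes it easy to change the primary (for example to Typecast) without touching the fallback behavior.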

Recommendations & Trade-off Analysis

Option 1: ElevenLabs Turbo v2.5 (Responsiveness-focused)

Pros: The lowest latency of the compared services (about 75ms TTFB at best), Korean/English bilingual output from a single model, and natural breathing sounds in the generated speech.
Cons: Some archaic Korean (고어) pronunciations may sound slightly awkward compared with Korean-only engines.
Recommended for: Scenarios where fast back-and-forth dialogue and global expansion are important.

Option 2: Typecast ssfm-v30 (Emotion & Cultural Context-focused)

Pros: As a native Korean engine, it excels at formal speech and at emotional expression (sadness, firmness, etc.), which makes it well suited to Lady Hyegyong’s complex feelings.
Cons: Higher latency than ElevenLabs (~300ms in the comparison table), and its English quality does not match its Korean quality.
Recommended for: Scenarios where historical accuracy and emotional depth of the character are the top priority.

Known Gaps & Future Work

Coqui Korean Pronunciation: The open-source Coqui engine has a confirmed bug in which metallic, mechanical artifacts appear on certain Korean syllables. This must be fixed before Coqui can serve as the basis for an offline setup.
Samsung Galaxy XR: OpenXR compatibility on Android XR is possible in theory, but how well the platform’s dedicated voice SDK is optimized has not yet been verified and remains a known gap.
Bilingual Switching: A single cloned voice shared across both languages is needed to minimize the change in timbre when Korean and English are mixed within a conversation.
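For the bilingual switching gap, one preparatory step is to split a mixed reply into per-language runs so that every run can be sent to the same cloned voice with the appropriate language setting. The sketch below covers only that text segmentation and does not perform voice cloning. The function name `split_by_language`, the Unicode ranges used to detect Hangul, and the default language for neutral text are all assumptions.

```python
from typing import List, Tuple


def _script(ch: str) -> str:
    """Classify a character as Korean, English, or neutral (empty string)."""
    if "\uac00" <= ch <= "\ud7a3" or "\u1100" <= ch <= "\u11ff" or "\u3130" <= ch <= "\u318f":
        return "ko"
    if ch.isascii() and ch.isalpha():
        return "en"
    return ""  # digits, spaces, and punctuation stay with the current run


def split_by_language(text: str, default: str = "ko") -> List[Tuple[str, str]]:
    """Split mixed text into (language, segment) runs in their original order."""
    runs: List[Tuple[str, str]] = []
    current_lang, current = "", ""
    for ch in text:
        lang = _script(ch)
        if lang and current_lang and lang != current_lang:
            # Language changed: close the current run and start a new one.
            runs.append((current_lang, current.strip()))
            current, current_lang = "", lang
        elif lang and not current_lang:
            current_lang = lang
        current += ch
    if current.strip():
        runs.append((current_lang or default, current.strip()))
    return runs


if __name__ == "__main__":
    for lang, segment in split_by_language("안녕하세요 Hello world 반갑습니다"):
        print(lang, segment)
```

Each run could then be synthesized with the same voice identifier, which is the property the single-voice approach relies on to keep the timbre consistent.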

Sources & References

ElevenLabs API Documentation (2026)
Typecast SSFM v3.0 Technical Whitepaper (2026)
Microsoft Azure AI Speech Service Docs (2026)
Unity Sentis & Whisper Integration Guide (2025)
Meta XR SDK: OVRLipSync Reference (2026)