System Architecture Overview

Last verified: February 2026

English

Overview

This document defines the overall system architecture and data flow for the Lady Hyegyong (혜경궁 홍씨) AI NPC project. The project primarily targets the Meta Quest 3 device, aiming to implement natural interactions between users and the historical figure Lady Hyegyong within a Mixed Reality (MR) environment.

The architecture integrates three real-time functions: dialogue generation, high-quality speech synthesis, and character animation synchronized to that speech. Because the system runs in an exhibition environment, its two core design principles are a 4-tier fallback strategy that tolerates network instability and explicit low-latency performance targets.

The pipeline from user voice input to character response output combines three processing locations: high-performance cloud services, an on-site edge server, and on-device processing. The goal is for users to converse with Lady Hyegyong at a perceived latency of less than 1.5 seconds while experiencing Joseon (조선) dynasty culture.

Key Findings

1. Overall System Data Flow

User interactions are processed through the following sequential pipeline:

User Voice Input: Audio data collected through the Quest 3 microphone.
STT (Speech-to-Text): Converts speech to text. On-device processing via Whisper Sentis or cloud API usage.
LLM (Large Language Model): Analyzes text and generates responses consistent with Lady Hyegyong’s persona. Utilizes GPT-4o or the Convai engine.
TTS (Text-to-Speech): Synthesizes the generated text into speech. Utilizes ElevenLabs or Typecast.
LipSync: Synchronizes the character’s mouth shapes by analyzing audio waveforms. Utilizes SALSA LipSync v2.
Animation: Controls full-body animation and facial expressions matching the emotion and context of the response. Utilizes the Unity Mecanim system.
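The six stages above can be sketched as a chain of interchangeable functions. This is a minimal illustration in Python (the Unity client itself would be written in C#), not the project's implementation: the `Pipeline` and `TurnResult` names and the stub stages are assumptions made for the example, and the real stages would wrap Whisper Sentis, GPT-4o or Convai, ElevenLabs or Typecast, SALSA LipSync, and Mecanim.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TurnResult:
    """Everything produced for one user utterance (hypothetical container)."""
    transcript: str
    reply_text: str
    audio: bytes
    visemes: list[str]
    animation: str


@dataclass
class Pipeline:
    """Each stage is a plain callable so a backend can be swapped per stage."""
    stt: Callable[[bytes], str]
    llm: Callable[[str], str]
    tts: Callable[[str], bytes]
    lipsync: Callable[[bytes], list[str]]
    animate: Callable[[str], str]

    def run_turn(self, mic_audio: bytes) -> TurnResult:
        transcript = self.stt(mic_audio)   # speech to text
        reply = self.llm(transcript)       # persona-consistent reply
        audio = self.tts(reply)            # reply text to speech audio
        visemes = self.lipsync(audio)      # mouth shapes derived from the audio
        animation = self.animate(reply)    # body/face animation chosen from the reply
        return TurnResult(transcript, reply, audio, visemes, animation)


# Stub stages standing in for the real services (illustrative only).
pipeline = Pipeline(
    stt=lambda audio: audio.decode("utf-8"),
    llm=lambda text: f"(reply to) {text}",
    tts=lambda text: text.encode("utf-8"),
    lipsync=lambda audio: ["open"] * (len(audio) // 8),
    animate=lambda text: "greet" if "hello" in text else "idle",
)
```

Keeping each stage behind a simple callable is one way to let the same turn logic run against cloud, edge, or on-device backends.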

2. Component Communication Patterns

Unity ↔ Cloud: REST API or WebSocket communication for accessing high-performance LLMs (GPT-4o) and high-quality TTS (ElevenLabs).
Unity ↔ Edge Server: Requests are handled over a local network (VLAN) inside the exhibition site. The edge server runs a local LLM (Llama 3.2) and a local TTS engine so that the experience continues when the cloud is unreachable.
Unity ↔ On-device: STT processing using Whisper Sentis and running lightweight LLMs within the Quest 3.

3. 4-Tier Fallback Architecture

To ensure the stability of exhibition operations, a hierarchical response system is established:

Cloud (Tier 1): GPT-4o + ElevenLabs. The default mode providing the highest quality and naturalness.
Edge (Tier 2): Local LLM + local TTS running on a Jetson AGX Orin. Handles requests on the on-site server when the internet connection is unstable.
On-device (Tier 3): Whisper Sentis + a lightweight LLM running on the Quest 3 itself. Maintains minimal conversation on the headset when the edge server cannot be reached.
Pre-scripted (Tier 4): Responses based on hard-coded scenarios. Basic guidance and activities can be performed even when all AI systems are inoperable.
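The tier order above amounts to trying each backend in turn and dropping to the next one on failure. The following Python sketch shows that control flow only; `TierUnavailable`, `respond_with_fallback`, the handler functions, and the scripted line are hypothetical names and content invented for the example, and a real client would also need timeouts and health checks.

```python
from typing import Callable, Sequence


class TierUnavailable(Exception):
    """Raised by a tier that cannot answer (timeout, no network, model not loaded)."""


Tier = tuple[str, Callable[[str], str]]


def respond_with_fallback(user_text: str, tiers: Sequence[Tier]) -> tuple[str, str]:
    """Return (tier_name, reply) from the first tier that succeeds."""
    for name, handler in tiers:
        try:
            return name, handler(user_text)
        except TierUnavailable:
            continue  # drop to the next, lower-quality tier
    raise RuntimeError("no tier produced a response")


# Stub handlers simulating an outage of the cloud and edge tiers.
def cloud(text: str) -> str:
    raise TierUnavailable("internet connection lost")


def edge(text: str) -> str:
    raise TierUnavailable("edge server unreachable")


def on_device(text: str) -> str:
    return f"[on-device] short answer to: {text}"


def pre_scripted(text: str) -> str:
    # Tier 4 never raises: it always returns a fixed, hard-coded line.
    return "Welcome. Please follow the guide to the next room."


TIERS = [
    ("cloud", cloud),
    ("edge", edge),
    ("on-device", on_device),
    ("pre-scripted", pre_scripted),
]
```

With both upper tiers failing, `respond_with_fallback("hi", TIERS)` is answered by the on-device tier. Because the pre-scripted tier cannot fail, the final `RuntimeError` is only reachable if the tier list is misconfigured.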

4. Performance Targets and Constraints (As of February 2026)

Time to First Token (TTFT): <200ms
Sentence-level TTS Generation Time: <500ms
Overall Perceived Latency: <1.5s
Rendering Performance: Maintain 90fps
Quest 3 Constraints: IL2CPP scripting backend, .NET Standard 2.1, and no more than 300 draw calls per frame.
Character Model: Budget of 20,000 to 50,000 vertices.
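A little arithmetic shows what these targets leave for the rest of the pipeline. The sketch below assumes, as a simplification not stated in the source, that the stages up to the first audible word run one after another, so their times add up. The dictionary keys and function names are illustrative.

```python
# Targets taken from the list above, in milliseconds.
TARGETS_MS = {"ttft": 200, "sentence_tts": 500, "perceived_total": 1500}


def remaining_budget_ms(targets: dict[str, int] = TARGETS_MS) -> int:
    """Time left for STT, network round trips, and playback start."""
    return targets["perceived_total"] - targets["ttft"] - targets["sentence_tts"]


def frame_budget_ms(fps: float = 90.0) -> float:
    """Per-frame CPU/GPU time available at the target frame rate."""
    return 1000.0 / fps


def within_budget(measured_ms: dict[str, float],
                  targets: dict[str, int] = TARGETS_MS) -> bool:
    """True if the summed stage timings stay under the perceived-latency target."""
    return sum(measured_ms.values()) < targets["perceived_total"]
```

Under that assumption, 1500 - 200 - 500 leaves 800 ms for everything other than first-token and TTS generation, and 90fps leaves roughly 11.1 ms per frame for rendering and animation.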

5. MR→VR Transition Sequence

Moving the user from mixed reality (MR), where the physical room is visible, into a fully virtual (VR) scene involves the following technical steps:

Passthrough Opacity Fade: Gradually lowers the opacity of the passthrough (real-world) layer so that the virtual scene takes over.
Async Scene Loading: Loads the virtual environment (e.g., Hanjungnok background) asynchronously in the background to prevent frame drops.
Spatial Anchor Alignment: Naturally transitions virtual objects fixed to specific real-world locations into the VR environment.
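The three steps can run as overlapping asynchronous tasks: the fade and the scene load proceed together, and anchored objects are then re-positioned in the new scene. The Python sketch below illustrates that ordering with `asyncio`. All function names are hypothetical, the scene load is a stand-in, and the alignment handles translation only (a real spatial anchor also carries rotation). In Unity the equivalent would use the engine's own async scene loading and the passthrough layer's opacity setting.

```python
import asyncio
from typing import Callable

Vec3 = tuple[float, float, float]


async def fade_passthrough(set_opacity: Callable[[float], None],
                           steps: int = 10, step_delay: float = 0.0) -> None:
    """Lower passthrough opacity from just under 1.0 down to 0.0 in equal steps."""
    for i in range(1, steps + 1):
        set_opacity(1.0 - i / steps)
        await asyncio.sleep(step_delay)  # yield so loading can progress


async def load_scene(name: str) -> str:
    """Stand-in for a non-blocking scene load; returns the loaded scene name."""
    await asyncio.sleep(0)
    return name


def align_to_anchor(object_pos: Vec3, anchor_real: Vec3, anchor_virtual: Vec3) -> Vec3:
    """Keep the object's offset from the anchor when the anchor moves into VR space."""
    return tuple(v + (o - r) for o, r, v in zip(object_pos, anchor_real, anchor_virtual))


async def transition(scene_name: str, set_opacity: Callable[[float], None]) -> str:
    """Run the fade and the scene load concurrently and return the loaded scene."""
    _, scene = await asyncio.gather(
        fade_passthrough(set_opacity),
        load_scene(scene_name),
    )
    return scene
```

Running the fade and the load concurrently is what keeps the load from stalling the frame loop during the transition.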

Comparative Analysis

Architecture Path Comparison (Path A vs Path B)

Comparison Item	Path A: Convai-based Integrated Stack	Path B: Custom Stack (GPT-4o + ElevenLabs)
Integration Difficulty	Low (Integrated dialogue, voice, and lipsync with one SDK)	High (Requires individual API integration and synchronization for each component)
Customization	Limited (Adjustments within platform-provided features)	Very High (Freedom in prompt, memory structure, and pipeline optimization)
Korean Quality	Excellent (Supports 60+ languages, continuous improvement)	Top-tier (Strong Korean performance of GPT-4o and ElevenLabs)
Latency	Optimized (Minimized delay through integrated pipeline)	Variable (Sum of delays for each service, requires additional optimization)
Cost (As of Feb 2026)	Monthly subscription-based (~$2,994/mo for Scale plan)	Usage-based API costs (Billing per token and voice length)
Offline Support	Limited (High cloud dependency)	Flexible (Easy to design transition to local models)
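To compare the two cost models in the table, it helps to find the monthly interaction count at which usage-based billing matches the subscription fee. The sketch below uses the approximate Scale plan figure quoted above. The per-interaction cost is deliberately left as a parameter because the source gives no API prices, and the function names are illustrative.

```python
import math

# Approximate monthly fee for the Convai Scale plan, as quoted in the table above.
CONVAI_SCALE_MONTHLY_USD = 2994.0


def usage_cost_usd(interactions: int, cost_per_interaction_usd: float) -> float:
    """Monthly Path B cost for a given number of dialogue turns."""
    return interactions * cost_per_interaction_usd


def break_even_interactions(monthly_fee_usd: float,
                            cost_per_interaction_usd: float) -> int:
    """Smallest number of turns at which usage billing reaches the subscription fee."""
    if cost_per_interaction_usd <= 0:
        raise ValueError("cost per interaction must be positive")
    return math.ceil(monthly_fee_usd / cost_per_interaction_usd)
```

The per-interaction cost should come from measured token counts and audio lengths for this project. Below the break-even volume Path B is cheaper per month, and above it the subscription is.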

Recommendations & Trade-off Analysis

Recommendation 1: Path A (Convai) for Initial Prototype and Rapid Deployment

Condition: When the development schedule is tight or the team lacks engineers to build a custom AI pipeline.
Pros: Development is fast because STT, LLM, TTS, and LipSync are all handled by the Convai SDK. The SDK integrates well with Unity and includes built-in synchronization with character animation.
Cons (Trade-off): Platform dependency is strong, and it may be hard to control precisely how Lady Hyegyong's detailed historical knowledge (e.g., Hanjungnok) is retrieved through RAG (Retrieval-Augmented Generation). Subscription costs also accumulate over the long term.

Recommendation 2: Path B (Custom Stack) for Long-term Production and High-quality Experience

Condition: When the highest Korean dialogue quality and precise control for historical verification are required.
Pros: Provides the best immersion by combining the powerful reasoning capabilities of GPT-4o with the natural Korean voice of ElevenLabs. Enables deep interactions, such as remembering past conversations with the user, by building an independent memory system (Semantic/Episodic Memory).
Cons (Trade-off): High development effort as each component must be directly integrated and synchronized. In particular, the technical difficulty of matching the timing between voice data streaming and lip-sync animation is high.
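One common way to reduce the streaming-versus-lip-sync timing problem in a custom stack is to cut the LLM token stream into sentences and send each sentence to TTS as soon as it is complete. This matches the sentence-level TTS target listed earlier. The Python sketch below shows only the chunking step. The function name and the set of sentence-ending characters are assumptions, and the rule is naive (a token ending in "1." of "1.5" would split early).

```python
from typing import Iterable, Iterator

# Characters treated as sentence endings (illustrative; extend as needed).
SENTENCE_END = (".", "!", "?", "。", "…")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM tokens into sentences, yielding each as soon as it ends."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        # Flush any trailing text that never reached a sentence boundary.
        yield buffer.strip()
```

Each yielded sentence can be synthesized and queued for playback while the LLM is still generating the next one, and lip-sync then only has to stay aligned within one sentence-length audio clip at a time.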

Known Gaps & Future Work

1. Samsung Galaxy XR Compatibility (Known Gap)

Status: This architecture is designed based on the Meta XR SDK. While the Samsung Galaxy XR is based on Android XR and follows OpenXR standards, dedicated SDK integration and performance testing have not yet been performed.
Task: Future work should involve strengthening AR Foundation or OpenXR abstraction layers to ensure inter-device compatibility.

2. Edge Server Optimization

Status: Running a local LLM on the Jetson AGX Orin is possible, but Korean-specialized models still need further optimization of inference speed and memory footprint.
Task: Apply an inference-optimization library such as vLLM so that the edge tier reaches a TTFT of less than 1 second (a looser bound than the <200ms TTFT target in section 4).
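Checking progress against this goal requires measuring TTFT the same way on every backend. The sketch below is a minimal, backend-agnostic timer in Python. `measure_ttft` and its parameters are hypothetical names, and `start_stream` stands for whatever call opens the token stream on the edge server.

```python
import time
from typing import Callable, Iterable


def measure_ttft(start_stream: Callable[[], Iterable[str]],
                 clock: Callable[[], float] = time.perf_counter) -> tuple[float, str]:
    """Return (seconds until the first token arrived, that first token).

    The timer starts before the request is issued, so connection setup
    and prompt processing are both included in the measurement.
    """
    start = clock()
    for token in start_stream():
        return clock() - start, token
    raise ValueError("stream produced no tokens")
```

Passing the clock as a parameter keeps the function testable with a fake clock. On real hardware, repeated measurements and a percentile (rather than a single run) give a more reliable picture.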

Sources & References

Meta Quest 3 Developer Documentation (2026).
Convai SDK for Unity Documentation (2026).
OpenAI API Reference - GPT-4o (2026).
ElevenLabs API Documentation (2026).
Unity 6 Manual - XR Architecture (2026).