Animation System

Last verified: February 2026

Overview

This document summarizes research findings on the animation system, a core component of the Lady Hyegyong AI NPC project. For the NPC to feel convincingly present in Virtual Reality (VR) and Mixed Reality (MR), animation has to go beyond simple clip playback: it must account for the context of the conversation, the NPC’s emotional state, and the system’s technical constraints, above all response latency.

The animation system has four primary objectives. First, mask the AI’s thinking time naturally so that the flow of conversation is not interrupted. Second, reproduce complex traditional Joseon Dynasty activities such as the tea ceremony (Dagwa), calligraphy (Seoye), and etiquette (Yejeol) accurately, step by step. Third, build an Inverse Kinematics (IK) system that tracks the user’s position and precisely locates the objects the NPC interacts with. Fourth, maintain the best achievable visual quality within the hardware performance limits of Meta Quest 3.

This research proposes a 4-layer structure based on Unity’s Mecanim system and includes a markup system for synchronizing dialogue and animation, as well as procedural animation techniques. Through these methods, we aim to minimize the ‘Uncanny Valley’ effect and effectively convey the dignity and emotions of the historical figure, Lady Hyegyong.

Key Findings

4-Layer Mecanim Structure: The system adopts four layers for efficient animation management and per-layer blending:
1. Base Body Layer: Responsible for full-body movements such as basic locomotion, idle, and sitting poses.
2. Upper Body Layer: Handles upper-body-centric movements like gestures during conversation and activity-specific arm movements, blending with the Base Layer.
3. Face Layer: Manages facial expression changes and emotion blending, integrated with the lip-sync system.
4. Eyes Layer: Independently controls gaze tracking, blinking, and micro-eye movements.
Latency Masking Animation Design: Masking strategies based on psychological evidence keep AI response generation delays from degrading the user experience.
- Natural Pause (<500ms): Considered a normal conversational flow; no special masking required.
- Uncomfortable Pause (800-1200ms): The stage where users begin to perceive system delay; requires light nodding or eye contact.
- Breaking Pause (>1200ms): Must be masked with a ‘Thinking’ animation.
- Thinking Animation: Lasts at least 2 seconds and at most 5 to 6 seconds, and includes head tilts, a thoughtful gaze, subtle breathing changes, and hand gestures.
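The thresholds above can be sketched as a small classifier. This is a language-neutral Python illustration rather than Unity code; the names `MaskingAction` and `masking_action` are hypothetical. The listed bands leave 500-800ms unclassified, so treating that range as needing no masking is an assumption of this sketch.

```python
from enum import Enum


class MaskingAction(Enum):
    NONE = "none"            # natural pause, nothing to play
    LIGHT_CUE = "light_cue"  # light nod or eye contact
    THINKING = "thinking"    # full 'Thinking' animation


UNCOMFORTABLE_START_MS = 800
BREAKING_PAUSE_MS = 1200


def masking_action(elapsed_ms: float) -> MaskingAction:
    """Map the time since the user stopped speaking to a masking action."""
    if elapsed_ms < 0:
        raise ValueError("elapsed_ms must be non-negative")
    if elapsed_ms > BREAKING_PAUSE_MS:
        return MaskingAction.THINKING
    if elapsed_ms >= UNCOMFORTABLE_START_MS:
        return MaskingAction.LIGHT_CUE
    # Below 500ms is a natural pause. The 500-800ms band is not specified
    # in the findings; this sketch assumes it also needs no masking.
    return MaskingAction.NONE
```

Calling `masking_action` once per frame with the running delay lets the NPC escalate from no reaction to a light cue to the full animation as the wait grows.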
Activity-Specific Animation Sequences:
- Tea Ceremony (Dagwa): 5-stage sequence (prepare tray → pour tea → offer cup → receive → bow).
- Calligraphy (Seoye): 2 modes (demonstration mode: NPC writing; guidance mode: guiding the user’s writing).
- Etiquette/Bow (Yejeol): 5-stage sequence (stand → bow preparation → bow → hold → rise).
- Letter Writing: Writing gestures, reading gestures, and sealing gestures.
IK (Inverse Kinematics) System:
- Look-at IK: NPC’s gaze naturally tracks the user’s HMD position to maintain eye contact.
- Hand IK: Accurately corrects hand positions when picking up or placing interaction objects like tea cups, brushes, and paper.
Dialogue-Animation Sync: Uses an [ANIMATION:tag] markup system to match LLM responses with specific actions (e.g., [ANIMATION:bow] Welcome.).
Procedural Idle: Keeps the NPC looking alive even when it is standing still, through subtle chest movement from breathing, randomly timed blinks, and small weight shifts.
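A minimal sketch of how the [ANIMATION:tag] markup might be separated from the spoken text before it is sent to TTS. It is written in Python as a language-neutral illustration; the function name and the assumption that tag names contain only letters, digits, and underscores are hypothetical.

```python
import re
from typing import List, Tuple

# Assumption: tag names are made of letters, digits, and underscores.
TAG_PATTERN = re.compile(r"\[ANIMATION:([A-Za-z0-9_]+)\]")


def split_animation_tags(response: str) -> Tuple[List[str], str]:
    """Return (animation tags in order, text with the tags removed)."""
    tags = TAG_PATTERN.findall(response)
    spoken = TAG_PATTERN.sub("", response)
    # Collapse the whitespace left behind where tags were removed.
    spoken = " ".join(spoken.split())
    return tags, spoken
```

For the example above, `split_animation_tags("[ANIMATION:bow] Welcome.")` yields the tag list `["bow"]` and the spoken text `"Welcome."`, so the bow can be triggered while only the greeting is voiced.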
Quest 3 Performance Budget (As of February 2026):
- Animation Layers: No more than 4 per character.
- Bones: No more than 70 per character.
- GPU Skinning: Must be enabled.
- Draw Calls: Allocate at most 50 of the total budget of 300 to the NPC.
- Vertex Count: Recommended 20,000 to 50,000 vertices per NPC model.
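The budget above lends itself to an automated check, for example in a build or asset-import step. The following Python sketch is illustrative only; `NpcRenderStats` and `budget_violations` are hypothetical names, and how the statistics are collected from the engine is outside its scope. Because the vertex range is a recommendation, the sketch flags only models above the upper bound.

```python
from dataclasses import dataclass
from typing import List

MAX_LAYERS = 4
MAX_BONES = 70
MAX_NPC_DRAW_CALLS = 50
MAX_VERTICES = 50_000  # upper end of the recommended 20,000-50,000 range


@dataclass
class NpcRenderStats:
    animation_layers: int
    bone_count: int
    gpu_skinning: bool
    draw_calls: int
    vertex_count: int


def budget_violations(stats: NpcRenderStats) -> List[str]:
    """Return a human-readable message for each exceeded budget item."""
    problems = []
    if stats.animation_layers > MAX_LAYERS:
        problems.append(f"animation layers {stats.animation_layers} > {MAX_LAYERS}")
    if stats.bone_count > MAX_BONES:
        problems.append(f"bones {stats.bone_count} > {MAX_BONES}")
    if not stats.gpu_skinning:
        problems.append("GPU skinning is disabled")
    if stats.draw_calls > MAX_NPC_DRAW_CALLS:
        problems.append(f"draw calls {stats.draw_calls} > {MAX_NPC_DRAW_CALLS}")
    if stats.vertex_count > MAX_VERTICES:
        problems.append(f"vertices {stats.vertex_count} > {MAX_VERTICES}")
    return problems
```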

Comparative Analysis

A comparison of major animation control techniques within Unity.

Feature	Mecanim (Animator)	Playables API	Timeline
Key Characteristics	Visual control based on State Machines	Low-level scripting control based on graphs	Visual editing based on sequences and events
Pros	Intuitive state transition management, easy layer blending	Excellent for dynamic animation generation and mixing at runtime	Suitable for complex cinematic and activity sequences
Cons	Graphs can become complex (Spaghetti) as states increase	High learning curve, difficult visual debugging	Limited dynamic conditional branching at runtime
Best Use Case	Basic movement, idle, and emotion expression layers	Real-time motion synthesis and fine-tuning based on AI responses	Activity sequences with fixed steps like tea ceremonies
Performance (Quest 3)	Moderate (proportional to layer count)	Excellent (calculates only necessary nodes)	Moderate (overhead when loading sequences)

Recommendations & Trade-off Analysis

Option 1: Hybrid Control System (Mecanim + Playables API)

Mecanim’s layer system manages basic body movements and emotional expressions, while the Playables API handles real-time motion synthesis and fine IK adjustments driven by the AI dialogue.

Pros: Combines the intuitiveness of Mecanim with the flexibility of the Playables API, making it the most capable way to implement complex AI NPC reactions. In particular, the Playables API allows fine control such as mixing animation clips dynamically at runtime or adjusting animation speed to match dialogue length.
Cons: Has the highest implementation difficulty, and the weight adjustment and synchronization logic between the two systems can become complex. Visual debugging is also harder because the Playables graph must be managed directly in code.
Recommendation: When high-quality interaction and natural motion synthesis are the top priorities.
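To illustrate what adjusting animation speed to match dialogue length could mean in practice, the sketch below computes a playback speed multiplier for a gesture clip so that it ends together with the spoken line. It is a Python illustration of the arithmetic only, not Playables API code; the function name and the clamp range of 0.75 to 1.25 are assumptions chosen so the motion never looks obviously slowed down or sped up.

```python
def playback_speed(clip_seconds: float, speech_seconds: float,
                   min_speed: float = 0.75, max_speed: float = 1.25) -> float:
    """Speed multiplier that makes the clip last about as long as the speech.

    A clip of length L played at speed s lasts L / s seconds, so matching
    a speech duration D requires s = L / D. The result is clamped to
    [min_speed, max_speed] (assumed limits).
    """
    if clip_seconds <= 0 or speech_seconds <= 0:
        raise ValueError("durations must be positive")
    raw = clip_seconds / speech_seconds
    return max(min_speed, min(max_speed, raw))
```

When the clamp is hit, the remaining mismatch would have to be absorbed some other way, for example by looping an idle segment or cutting the gesture short.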

Option 2: Mecanim-Centric System (Animator + Timeline)

All animation states are managed in the Animator, and fixed activity sequences such as the tea ceremony or calligraphy are invoked through Timeline.

Pros: Follows standard Unity workflows, making development fast and maintenance easy. Non-programmers such as animators can follow the logic through the state machines. Timeline is convenient for visually editing complex multi-stage activities and synchronizing them with events (sounds, particles, etc.).
Cons: Fine-grained control, such as dynamically extending or shortening animations based on AI response length or content, is difficult. ‘Spaghetti node’ issues may occur if the state machine becomes bloated.
Recommendation: When the development period is tight or the proportion of fixed activity sequences is high.

Detailed Technical Design

1. Latency Masking Animation

Technical delays (STT -> LLM -> TTS) in AI NPC conversations are a major factor hindering user immersion. To address this, we visually represent a ‘Thinking’ state to convert technical delay into cognitive waiting time.

Motion Design:
- Thinking_Start: Immediately after the user finishes speaking (VAD detection), the NPC lightly nods or makes an “Hmm…” expression, slightly shifting their gaze up or to the side.
- Thinking_Loop: While the response is being generated, the NPC repeats static waiting motions such as putting a hand to their chin or adjusting their clothes (maintained for at least 2 seconds).
- Thinking_End: Just before the response output begins, the NPC returns their gaze to the user and transitions to a bright expression.
Timing Strategy: If the delay exceeds 1.2 seconds, the system must immediately trigger the [ANIMATION:think] tag to execute the masking animation.
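The timing strategy and the minimum loop duration can be combined into a small state machine. The Python sketch below is a simplified, engine-independent illustration: the class and state names other than Thinking_Loop and Thinking_End are hypothetical, and Thinking_Start is omitted for brevity. It assumes `update` is called once per frame with the frame time and a flag saying whether the response is ready.

```python
from enum import Enum


class ThinkState(Enum):
    WAITING = "waiting"
    THINKING_LOOP = "Thinking_Loop"
    THINKING_END = "Thinking_End"
    SPEAKING = "speaking"


class ThinkingMasker:
    def __init__(self, trigger_delay_s: float = 1.2, min_loop_s: float = 2.0):
        self.trigger_delay_s = trigger_delay_s
        self.min_loop_s = min_loop_s
        self.state = ThinkState.WAITING
        self.waited_s = 0.0
        self.looped_s = 0.0

    def update(self, dt_s: float, response_ready: bool) -> ThinkState:
        if self.state is ThinkState.WAITING:
            self.waited_s += dt_s
            if response_ready:
                # Fast answer: no masking needed.
                self.state = ThinkState.SPEAKING
            elif self.waited_s > self.trigger_delay_s:
                self.state = ThinkState.THINKING_LOOP
        elif self.state is ThinkState.THINKING_LOOP:
            self.looped_s += dt_s
            # Hold the loop for its minimum duration even if the answer
            # arrives earlier, so the gesture is not cut off abruptly.
            if response_ready and self.looped_s >= self.min_loop_s:
                self.state = ThinkState.THINKING_END
        elif self.state is ThinkState.THINKING_END:
            self.state = ThinkState.SPEAKING
        return self.state
```

One consequence of the 2-second minimum is visible here: a response that arrives just after the 1.2-second trigger is still delayed until the loop has played for 2 seconds.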

2. Activity-Specific Animation Sequence Details

Tea Ceremony (Dagwa):
1. Preparation: Bringing the tray in front of the user.
2. Pouring: Sophisticated hand movements lifting the teapot and pouring tea into the cup (integrated with Hand IK).
3. Offering: Holding the teacup with both hands and presenting it to the user.
4. Waiting: Waiting politely while the user drinks the tea.
5. Conclusion: Collecting the empty cup and giving a light bow.
Calligraphy (Seoye):
- Demonstration: Arm and wrist movements writing characters on paper with a brush (using Timeline-based baked animations).
- Guidance: Gestures correcting the user’s posture or pointing in a direction when the user holds the brush.

3. Procedural Idle and IK System

Procedural Idle: To avoid the monotony of simple repetitive animations, Mathf.PerlinNoise is utilized to subtly vary the depth and speed of breathing. Eye blinking is generated at irregular intervals between 2 and 5 seconds using Random.Range.
Look-at IK: Uses the LookAt component of the Meta XR SDK so that the NPC’s head and eyes track the user’s HMD. Neck rotation limits (e.g., 60 degrees left/right, 30 degrees up/down) keep the movement anatomically natural.
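The effect of the rotation limits can be shown with plain trigonometry. The sketch below is Python, not SDK code, and does not use any Meta XR or Unity call; it assumes Unity-style coordinates (y up, z forward) and a hypothetical function name. It computes the yaw and pitch from the NPC’s head to the HMD and clamps them to the example limits.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def clamped_look_angles(head: Vec3, target: Vec3, forward_yaw_deg: float = 0.0,
                        max_yaw_deg: float = 60.0,
                        max_pitch_deg: float = 30.0) -> Tuple[float, float]:
    """Return (yaw, pitch) in degrees toward target, within the limits.

    Yaw is measured relative to the NPC's facing direction, positive to
    the right; pitch is positive upward.
    """
    dx = target[0] - head[0]
    dy = target[1] - head[1]
    dz = target[2] - head[2]
    yaw = math.degrees(math.atan2(dx, dz)) - forward_yaw_deg
    yaw = (yaw + 180.0) % 360.0 - 180.0  # wrap into [-180, 180)
    pitch = math.degrees(math.atan2(dy, math.hypot(dx, dz)))
    yaw = max(-max_yaw_deg, min(max_yaw_deg, yaw))
    pitch = max(-max_pitch_deg, min(max_pitch_deg, pitch))
    return yaw, pitch
```

When the user walks outside the clamped range, the head stops at the limit; turning the body or letting the eyes cover the remainder would be a separate design decision.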

Known Gaps & Future Work

Samsung Galaxy XR Compatibility: This research targets Meta Quest 3. Samsung Galaxy XR is based on Android XR and supports OpenXR, but the animation optimization features of its dedicated SDK (e.g., specific shaders or skinning methods) have not yet been verified on actual hardware.
Hanbok Physics Simulation: Natural movement of Lady Hyegyong’s hanbok (skirt, sleeves) calls for real-time physics simulation, but running it within the computational power of Quest 3 while holding a stable 90fps is a technical challenge. (For now, baked physics driven by bone animation is recommended.)
Multilingual Lip-Sync Precision: Additional tuning is required to minimize the discrepancy in lip-sync data caused by the structural differences in pronunciation between Korean and English.
