734 changes: 734 additions & 0 deletions explore.md

Large diffs are not rendered by default.

@@ -1,2 +1,2 @@
 schema: spec-driven
-created: 2026-05-13
+created: 2026-05-20
68 changes: 68 additions & 0 deletions openspec/changes/ai-live2d-runtime-hooks/design.md
@@ -0,0 +1,68 @@
## Context

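The plugin-live2d frontend already implements PixiJS v8-driven Live2D rendering through `untitled-pixi-live2d-engine` and supports models from every Cubism version, 2 through 5. AI chat is also wired in: the frontend `ChatApi` receives the backend's streamed response over SSE, and the backend `AiChatEndpoint` implements the conversation on top of the OpenAI API. However, AI output is currently shown as plain text only, and the Live2D model has no awareness of the AI's emotional state, speaking rhythm, or content characteristics.

This design builds on the following prerequisite infrastructure (provided by separate low-level changes):
- **Semantic Parameter Layer**: a unified semantic parameter API, so the AI layer does not need to know the underlying parameter names
- **Motion Layer System**: a layered animation system that supports AI-triggered gesture/override layers
- **Runtime Filter Pipeline**: a PixiJS v8 filter pipeline that supports emotion-driven rendering effects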

## Goals / Non-Goals

**Goals:**
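- Build a complete data pipeline from the AI's streamed output to real-time reactions of the Live2D model
- Implement the end-to-end chain: emotion marker extraction → parameter transition → lip sync → motion triggering
- Stay fully compatible with the existing AI chat feature (the hooks can be switched on and off)
- Keep latency bounded: emotional reactions under 100ms, lip sync in step with text display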

**Non-Goals:**
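- No lip sync based on real audio analysis (such as VAD or TTS audio stream analysis); text-rhythm inference serves as a lightweight substitute
- No changes to the LLM itself (the output format is influenced only through prompt engineering)
- No complex face capture or camera input
- No replacement of the existing motion file playback system; this is only a runtime enhancement layer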

## Decisions

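### Use a prompt-embedded marker protocol rather than a separate emotion analysis request

The system prompt asks the LLM to output emotion markers (such as `[happy]` and `[shy]`), and the frontend StreamParser extracts them in real time. Compared with a separate emotion analysis request, this has lower latency (zero extra network requests) and stays fully under user control (the system prompt can be adjusted freely).

**Alternative**: after receiving the complete reply, the frontend calls a separate emotion classification API. This adds 100-300ms of latency but yields more precise markers. **The prompt approach was chosen** because real-time response is critical to the Live2D experience, and the user has confirmed full control over the backend prompt.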

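### Text-rhythm lip sync rather than audio-analysis lip sync

Mouth rhythm is inferred from the text's character stream and punctuation (short characters = mouth open, punctuation = mouth closed) instead of waiting for TTS audio output and running FFT analysis on it. This avoids introducing a TTS dependency and audio-processing complexity.

**Alternative**: integrate the Web Speech API or WASM speech synthesis to obtain real phoneme timing. **The text approach was chosen** because the backend currently returns only a text stream, with no audio channel. The text approach works immediately and can later be upgraded seamlessly to the audio approach (same LipSyncFrame interface).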

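### The emotion state machine uses a discrete emotion plus continuous intensity rather than a multi-emotion blend

At any moment there is only one dominant emotion (such as `happy`), paired with an intensity value from 0 to 1. Transitioning to a new emotion interpolates the parameters. This is simpler than blending multi-emotion vectors, and the results are good enough.

**Alternative**: multi-dimensional vector blending based on Plutchik's wheel of emotions. **The discrete approach was chosen** because a Live2D model usually has only one set of predefined expression parameters; multi-dimensional blending would require a more complex parameter mapping for limited benefit.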

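### AiCommandBus uses publish-subscribe rather than direct coupling

The AI parser publishes commands to the bus, and the Semantic Layer, Motion Layer, and Filter Pipeline each subscribe to the command types they care about. This avoids circular dependencies between the AI layer and the runtime systems.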

## Risks / Trade-offs

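- **[Risk]** The position of an emotion marker in the LLM output may be out of sync with the spoken content (the marker sits at the start of a sentence, but the emotion should cover the whole sentence). → **Mitigation**: a marker takes effect from the position where it appears and lasts until the next marker or the end of the sentence. A `duration` override mechanism is provided.
- **[Risk]** On low-end devices, multi-layer animation plus filters may hurt the frame rate. → **Mitigation**: every AI-linked effect exposes a `quality` setting (low/medium/high), and low-end devices can degrade to parameter changes only, with no filters.
- **[Risk]** Non-standard Live2D models lack common parameters (such as `PARAM_ANGLE_X`), and expressions do not take effect when the Semantic Layer cannot detect them. → **Mitigation**: the Semantic Layer's `detectFromModel` reports missing parameters at initialization, and the AI layer adapts dynamically to the available capabilities (linked with Capability Detection).
- **[Risk]** Overly frequent emotion switches make the model jitter. → **Mitigation**: the Emotion Timeline enforces a minimum transition duration of 300ms, and markers within the same sentence are merged in a batch.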

## Migration Plan

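1. The backend updates the system prompt template (adding the emotion marker output specification)
2. The frontend adds a `runtime/ai/` module (a separate directory that does not affect existing code)
3. `ChatApi` gains an optional `aiHooks` callback configuration (off by default, backward compatible)
4. The AI linkage layer is initialized in `Live2dCanvas` (only when `config.aiHooksEnabled` is true)
5. The AI linkage toggle and advanced settings are exposed through the Halo plugin settings panel

Rollback: turning off the `aiHooksEnabled` setting reverts to plain-text chat mode.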

## Open Questions

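- Does the full set of emotion markers need to be user-configurable? (Suggestion: ship 6 to 8 built-in basic emotions first and extend later)
- Should a "return to idle" behavior after each sentence be supported? (Suggestion: enable it by default, with a configurable idle delay)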
32 changes: 32 additions & 0 deletions openspec/changes/ai-live2d-runtime-hooks/proposal.md
@@ -0,0 +1,32 @@
## Why

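plugin-live2d already has AI streaming chat (SSE backend plus frontend message display), but the Live2D model and the AI output are completely disconnected: the model does not react in any way to the AI's emotion, speaking rhythm, or content. This leaves the "AI mascot" experience at the level of "text chat plus a static character illustration", unable to make use of Live2D's dynamic expressiveness. Building an AI Runtime Hooks layer that lets LLM output directly drive the model's expressions, motions, and rendering effects is a key step toward a true "AI companion" experience.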

## What Changes

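- Inject an emotion marker protocol (`[happy]`, `[shy]`, `[surprised]`, etc.) into the backend AI Chat system prompt
- Frontend streaming response parser: extracts emotion markers from SSE chunks and converts them into structured `AiCommand` objects
- AI Command Bus: uniformly dispatches four kinds of runtime commands: emotion, lip sync, motion, and filter
- Emotion Timeline: supports smooth transitions of expression parameters instead of instant switches
- Lightweight lip sync (Text-based Lip Sync): infers mouth-shape changes from text rhythm and punctuation
- Integration point with the Motion Layer System: AI-triggered high-priority motion layers (gesture/override)
- Integration point with the Runtime Filter Pipeline: AI emotion drives rendering effects (color temperature, glow)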

## Capabilities

### New Capabilities
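- `ai-stream-parser`: frontend parsing of streamed SSE text, extracting emotion markers and plain text content
- `ai-command-bus`: AI command bus that receives runtime commands and dispatches them to the subsystems
- `emotion-timeline`: emotion state machine and parameter interpolation/transition system
- `text-lip-sync`: lightweight lip sync generator based on text rhythm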

### Modified Capabilities
- `ai-chat`: the system prompt of the backend ChatCompletion interface needs an emotion marker output specification

## Impact

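- **Backend**: the system prompt template in `AiChatEndpoint.java` needs updating
- **Frontend runtime**: new `packages/live2d/src/runtime/ai/` directory and modules
- **Frontend chat**: the streaming logic in `ChatApi` needs to hook into `AiStreamParser`
- **Dependencies**: requires the Semantic Parameter Layer and Motion Layer System as prerequisite infrastructure (both belong to separate low-level changes)
- **Configuration**: new runtime settings for AI linkage (emotion marker toggle, transition duration, etc.)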
24 changes: 24 additions & 0 deletions openspec/changes/ai-live2d-runtime-hooks/specs/ai-chat/spec.md
@@ -0,0 +1,24 @@
## MODIFIED Requirements

### Requirement: Backend system prompt includes emotion markers
The backend `AiChatEndpoint` SHALL include emotion marker guidance in the system prompt sent to the LLM when AI hooks are enabled.

#### Scenario: Emotion marker output
- **WHEN** a chat request is processed with AI hooks enabled
- **THEN** the system prompt instructs the LLM to embed emotion markers like `[happy]`, `[shy]`, `[surprised]` at appropriate positions in the response text

#### Scenario: Backward compatibility when disabled
- **WHEN** AI hooks are disabled
- **THEN** the system prompt does not include emotion marker instructions
- **AND** the LLM responds with plain text only
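
The two scenarios above can be sketched as a small prompt builder. The real template lives in `AiChatEndpoint.java`, which this diff does not show; the sketch is written in TypeScript only to keep the examples in one language. `buildSystemPrompt`, `DEFAULT_EMOTIONS`, and the wording of the guidance text are all hypothetical.

```typescript
// Hypothetical helper: not the backend's actual template code.
const DEFAULT_EMOTIONS: readonly string[] = ["happy", "shy", "surprised"];

function buildSystemPrompt(
  basePrompt: string,
  aiHooksEnabled: boolean,
  emotions: readonly string[] = DEFAULT_EMOTIONS,
): string {
  // Hooks disabled: return the base prompt untouched (backward compatible).
  if (!aiHooksEnabled || emotions.length === 0) return basePrompt;

  const markers = emotions.map((name) => `[${name}]`).join(", ");
  // Assumed guidance wording; adjust freely, since the prompt is user-controlled.
  const guidance = [
    "When your mood changes, place one emotion marker before the affected sentence.",
    `Allowed markers: ${markers}.`,
    "Do not invent other markers and do not explain them.",
  ].join("\n");

  return `${basePrompt}\n\n${guidance}`;
}
```

Keeping the guidance as an appended block means the disabled path returns the base prompt byte for byte, which makes the backward-compatibility scenario easy to test.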

## ADDED Requirements

### Requirement: Frontend configuration for AI hooks
The public runtime config SHALL include an `aiHooksEnabled` boolean field defaulting to `false`.

#### Scenario: Config-driven enablement
- **WHEN** the config contains `aiHooksEnabled: true`
- **THEN** the frontend initializes the AI Runtime Hooks layer
- **WHEN** the config contains `aiHooksEnabled: false`
- **THEN** the AI Runtime Hooks layer is not initialized and chat behaves as before
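
A minimal sketch of the config-driven enablement, assuming the flag arrives as part of a loosely typed public config object. `resolveAiConfig` and `initAiRuntimeHooks` are hypothetical names; only `aiHooksEnabled` and its `false` default come from the requirement.

```typescript
interface Live2dAiConfig {
  aiHooksEnabled: boolean;
}

// Anything other than an explicit `true` keeps the hooks off, so a missing
// or malformed field falls back to the documented default of `false`.
function resolveAiConfig(raw?: { aiHooksEnabled?: unknown }): Live2dAiConfig {
  return { aiHooksEnabled: raw !== undefined && raw.aiHooksEnabled === true };
}

// Runs `init` only when the hooks are enabled; returns whether it ran.
function initAiRuntimeHooks(
  raw: { aiHooksEnabled?: unknown } | undefined,
  init: () => void,
): boolean {
  const config = resolveAiConfig(raw);
  if (config.aiHooksEnabled) init();
  return config.aiHooksEnabled;
}
```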
@@ -0,0 +1,35 @@
## ADDED Requirements

### Requirement: Publish-subscribe command dispatch
The `AiCommandBus` SHALL support publishing commands and subscribing handlers by command type. Multiple subscribers SHALL receive each matching command.

#### Scenario: Subscribe and receive emotion command
- **WHEN** a handler subscribes to command type `"emotion"`
- **AND** a command `{ type: "emotion", emotion: "happy", intensity: 0.8 }` is published
- **THEN** the handler receives the command

#### Scenario: Multiple subscribers for same type
- **WHEN** two handlers subscribe to command type `"lipSync"`
- **AND** a lip sync command is published
- **THEN** both handlers receive the command

### Requirement: Support command types
The command bus SHALL support at minimum these command types: `emotion`, `lipSync`, `motion`, `filter`.

#### Scenario: Motion command dispatch
- **WHEN** a command `{ type: "motion", layer: "gesture", motion: "nod" }` is published
- **THEN** the Motion Layer subscriber receives and processes it

### Requirement: Commands carry timing metadata
Every command SHALL carry an `estimatedTime` field representing the estimated display time offset from the start of the AI response.

#### Scenario: Lip sync timing
- **WHEN** a lip sync command with `estimatedTime: 1200` is published
- **THEN** the Lip Sync subscriber schedules the mouth shape change at 1200ms from response start

### Requirement: Async-safe publishing
The command bus SHALL handle commands published during an active transition without dropping or corrupting state.

#### Scenario: Rapid emotion changes
- **WHEN** three emotion commands are published within 50ms
- **THEN** all three are queued and processed in order by the Emotion Timeline
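
The requirements above can be sketched as a small synchronous bus with a FIFO queue. This is one possible shape, not the project's implementation: the `emotion` and `motion` fields and `estimatedTime` follow the scenarios, while `mouthOpen`, `durationMs`, `effect`, and `strength` are assumed field names for the command types the spec does not detail.

```typescript
type AiCommand =
  | { type: "emotion"; emotion: string; intensity: number; estimatedTime: number }
  | { type: "lipSync"; mouthOpen: number; durationMs: number; estimatedTime: number }
  | { type: "motion"; layer: "gesture" | "override"; motion: string; estimatedTime: number }
  | { type: "filter"; effect: string; strength: number; estimatedTime: number };

type CommandType = AiCommand["type"];
type CommandHandler = (command: AiCommand) => void;

class AiCommandBus {
  private handlers = new Map<CommandType, Set<CommandHandler>>();
  private queue: AiCommand[] = [];
  private draining = false;

  // Registers a handler for one command type and returns an unsubscribe function.
  subscribe(type: CommandType, handler: CommandHandler): () => void {
    let set = this.handlers.get(type);
    if (set === undefined) {
      set = new Set<CommandHandler>();
      this.handlers.set(type, set);
    }
    set.add(handler);
    const owner = set;
    return () => { owner.delete(handler); };
  }

  // Commands are queued and delivered in publish order. A publish made from
  // inside a handler is appended and handled after the current command.
  publish(command: AiCommand): void {
    this.queue.push(command);
    if (this.draining) return;
    this.draining = true;
    try {
      while (this.queue.length > 0) {
        const next = this.queue.shift() as AiCommand;
        const set = this.handlers.get(next.type);
        if (set === undefined) continue;
        // Copy first so that unsubscribing during dispatch skips nobody.
        for (const handler of Array.from(set)) handler(next);
      }
    } finally {
      this.draining = false;
    }
  }
}
```

Handlers receive the full `AiCommand` union and narrow on `command.type`. A generic `subscribe<T>` overload would give each subscriber the narrowed type directly, at the cost of a cast inside the bus.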
@@ -0,0 +1,42 @@
## ADDED Requirements

### Requirement: Extract emotion markers from SSE chunks
The parser SHALL scan incoming SSE text chunks for embedded emotion markers in the format `[emotion-name]` where `emotion-name` is a registered emotional state key.

#### Scenario: Single emotion marker in chunk
- **WHEN** an SSE chunk contains text `你好呀![happy] 今天真开心~`
- **THEN** the parser emits `{ text: "你好呀! 今天真开心~", commands: [{ type: "emotion", emotion: "happy", position: 4 }] }`

#### Scenario: Multiple emotion markers in chunk
- **WHEN** an SSE chunk contains `[happy]欢迎![shy]人家有点紧张……`
- **THEN** the parser emits two emotion commands with their respective positions and the stripped text `欢迎!人家有点紧张……`

### Requirement: Support configurable emotion vocabulary
The parser SHALL accept a configurable set of valid emotion marker names at initialization time. Unrecognized markers SHALL be ignored and left in the text.

#### Scenario: Unknown marker is preserved
- **WHEN** the configured vocabulary is `["happy", "shy", "sad"]` and a chunk contains `[unknown]hello`
- **THEN** the parser emits `{ text: "[unknown]hello", commands: [] }`

### Requirement: Estimate timestamp for each command
The parser SHALL assign an estimated display timestamp to each extracted command based on character position within the cumulative text stream.

#### Scenario: Timestamp estimation
- **WHEN** 10 characters have already been processed and a new chunk of `hello[happy]world` arrives
- **THEN** the `happy` command receives a position-based estimate of `15`: the 10 characters already processed plus the marker's offset of `5` within the chunk
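
The three parser requirements can be sketched together as below. This is a hedged illustration, not the contents of `AiStreamParser.ts`: the method names `push` and `flush` are assumptions, `position` counts UTF-16 code units in the cumulative stripped text (as the timestamp requirement words it), and turning that position into milliseconds would need a display-speed assumption that is left out. The sketch also holds back an unclosed `[` at the end of a chunk, because an SSE chunk boundary can fall inside a marker.

```typescript
interface EmotionCommand {
  type: "emotion";
  emotion: string;
  position: number; // index in the cumulative stripped text
}

interface ParsedChunk {
  text: string;
  commands: EmotionCommand[];
}

class AiStreamParser {
  private processed = 0; // stripped characters emitted so far
  private pending = "";  // unclosed "[..." tail carried to the next chunk
  private readonly vocabulary: Set<string>;

  constructor(vocabulary: readonly string[]) {
    this.vocabulary = new Set(vocabulary);
  }

  push(chunk: string): ParsedChunk {
    let input = this.pending + chunk;
    this.pending = "";
    // Hold back a trailing "[" that has no "]" yet; it may be half a marker.
    const open = input.lastIndexOf("[");
    if (open !== -1 && input.indexOf("]", open) === -1) {
      this.pending = input.slice(open);
      input = input.slice(0, open);
    }
    return this.parse(input);
  }

  // Call at end of stream: a leftover fragment is emitted as plain text.
  flush(): ParsedChunk {
    const rest = this.pending;
    this.pending = "";
    return this.parse(rest);
  }

  private parse(input: string): ParsedChunk {
    const commands: EmotionCommand[] = [];
    const pattern = /\[([^\[\]]+)\]/g;
    let text = "";
    let last = 0;
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(input)) !== null) {
      const name = match[1];
      if (!this.vocabulary.has(name)) continue; // unknown marker stays in the text
      text += input.slice(last, match.index);
      commands.push({ type: "emotion", emotion: name, position: this.processed + text.length });
      last = match.index + match[0].length;
    }
    text += input.slice(last);
    this.processed += text.length;
    return { text, commands };
  }
}
```

One consequence of the hold-back rule is that a literal `[` that is never closed delays the text after it until `flush()` is called; a production parser would probably cap the pending length.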

## MODIFIED Requirements

### Requirement: Chat stream processing supports AI hooks
The `ChatApi.sendMessage` method SHALL optionally invoke an `AiStreamParser` to process each SSE chunk before displaying text, when AI hooks are enabled in configuration.

#### Scenario: AI hooks enabled
- **WHEN** `ChatApi` is configured with `aiHooksEnabled: true`
- **AND** an SSE chunk arrives containing emotion markers
- **THEN** the parsed text (with markers removed) is displayed to the user
- **AND** extracted commands are dispatched to the `AiCommandBus`

#### Scenario: AI hooks disabled (backward compatibility)
- **WHEN** `ChatApi` is configured with `aiHooksEnabled: false` (default)
- **AND** an SSE chunk arrives
- **THEN** the raw text is displayed without parsing
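
The two scenarios above amount to a single branch in the chunk handler. The sketch below assumes hypothetical shapes for the collaborators (`AiHooks`, `handleSseChunk`, and a `display` callback); the real `ChatApi` internals are not part of this diff. Passing no hooks object stands in for `aiHooksEnabled: false`.

```typescript
interface ParsedSseChunk<C> {
  text: string;
  commands: C[];
}

interface AiHooks<C> {
  parser: { push(chunk: string): ParsedSseChunk<C> };
  bus: { publish(command: C): void };
}

// Routes one SSE text chunk either straight to the display (hooks off)
// or through the parser and command bus (hooks on).
function handleSseChunk<C>(
  chunk: string,
  display: (text: string) => void,
  hooks?: AiHooks<C>,
): void {
  if (hooks === undefined) {
    display(chunk); // disabled: raw text, exactly as before
    return;
  }
  const parsed = hooks.parser.push(chunk);
  if (parsed.text.length > 0) display(parsed.text);
  for (const command of parsed.commands) hooks.bus.publish(command);
}
```

Injecting the parser and bus through one optional object keeps the disabled path free of any AI-specific code, which is what the backward-compatibility scenario asks for.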
@@ -0,0 +1,39 @@
## ADDED Requirements

### Requirement: Discrete emotional states with intensity
The system SHALL maintain a single active emotional state consisting of a named emotion and an intensity value in the range [0, 1].

#### Scenario: Set emotion state
- **WHEN** the emotion timeline receives `{ emotion: "happy", intensity: 0.8 }`
- **THEN** the active state becomes `happy` at intensity `0.8`

### Requirement: Smooth parameter interpolation between emotions
The system SHALL interpolate semantic parameter values between the outgoing and incoming emotional states over a configurable duration.

#### Scenario: Transition from neutral to happy
- **WHEN** the current state is `neutral` and a transition to `happy` with duration `800ms` is requested
- **THEN** over the next 800ms, the model's expression parameters smoothly blend from neutral values to happy values

### Requirement: Minimum transition duration
The system SHALL enforce a minimum transition duration of 300ms to prevent visual jitter from rapid emotion changes.

#### Scenario: Rapid successive emotions
- **WHEN** an emotion change is requested while another transition is in progress
- **AND** the remaining time of the current transition is less than 300ms
- **THEN** the system extends or completes the current transition before starting the new one

### Requirement: Emotion parameter mapping registry
The system SHALL use a configurable mapping from emotion names to sets of semantic parameter values.

#### Scenario: Happy emotion parameters
- **WHEN** the "happy" emotion is mapped to `{ mouthOpen: 0.3, eyeSmile: 0.7, cheek: 0.4 }`
- **AND** the emotion timeline transitions to "happy"
- **THEN** those semantic parameters are driven to the mapped values via the Semantic Parameter Layer

### Requirement: Auto-return to idle
The system SHALL automatically transition back to a configurable default emotion (typically "neutral") after a configurable idle timeout following the last emotion command.

#### Scenario: Return to idle after timeout
- **WHEN** the idle timeout is configured to 2000ms
- **AND** no new emotion command arrives within 2000ms after the last one
- **THEN** the system transitions back to the default emotion
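
A minimal sketch of the timeline, under several stated assumptions: it is advanced by an explicit `update(dtMs)` call (for example from the render loop) rather than timers, interpolation is linear, intensity scales each parameter between the default emotion's value and the mapped value, and the return-to-idle transition uses the minimum duration. A new command arriving mid-transition starts from the currently blended values instead of being queued, which is a simplification of the "extend or complete" scenario. All names other than the emotion and parameter keys from the scenarios are hypothetical.

```typescript
type ParamValues = Record<string, number>;

interface EmotionTimelineOptions {
  registry: Record<string, ParamValues>; // emotion name -> semantic parameter targets
  defaultEmotion?: string;
  idleTimeoutMs?: number;
  minDurationMs?: number;
}

class EmotionTimeline {
  active: { emotion: string; intensity: number };
  private readonly registry: Record<string, ParamValues>;
  private readonly defaultEmotion: string;
  private readonly idleTimeoutMs: number;
  private readonly minDurationMs: number;
  private from: ParamValues = {};
  private to: ParamValues = {};
  private current: ParamValues;
  private elapsedMs = 0;
  private durationMs = 0;
  private idleMs = 0;

  constructor(options: EmotionTimelineOptions) {
    this.registry = options.registry;
    this.defaultEmotion = options.defaultEmotion ?? "neutral";
    this.idleTimeoutMs = options.idleTimeoutMs ?? 2000;
    this.minDurationMs = options.minDurationMs ?? 300;
    this.current = { ...(this.registry[this.defaultEmotion] ?? {}) };
    this.active = { emotion: this.defaultEmotion, intensity: 1 };
  }

  // Returns false (and changes nothing) for emotions missing from the registry.
  setEmotion(emotion: string, intensity: number, durationMs = 800): boolean {
    const target = this.registry[emotion];
    if (target === undefined) return false;
    this.begin(emotion, Math.min(1, Math.max(0, intensity)), target, durationMs);
    this.idleMs = 0;
    return true;
  }

  // Advances the transition and the idle timer; returns a copy of the blended values.
  update(dtMs: number): ParamValues {
    if (this.elapsedMs < this.durationMs) {
      this.elapsedMs = Math.min(this.durationMs, this.elapsedMs + dtMs);
      const progress = this.elapsedMs / this.durationMs;
      for (const key of Object.keys(this.to)) {
        const start = this.from[key] ?? 0;
        this.current[key] = start + (this.to[key] - start) * progress;
      }
    }
    this.idleMs += dtMs;
    if (this.active.emotion !== this.defaultEmotion && this.idleMs >= this.idleTimeoutMs) {
      this.begin(this.defaultEmotion, 1, this.registry[this.defaultEmotion] ?? {}, this.minDurationMs);
    }
    return { ...this.current };
  }

  private begin(emotion: string, intensity: number, target: ParamValues, durationMs: number): void {
    const baseline = this.registry[this.defaultEmotion] ?? {};
    const keys = [...Object.keys(this.current), ...Object.keys(target), ...Object.keys(baseline)];
    this.from = { ...this.current };
    this.to = {};
    for (const key of keys) {
      const base = baseline[key] ?? 0;
      this.to[key] = base + ((target[key] ?? base) - base) * intensity;
    }
    this.active = { emotion, intensity };
    this.elapsedMs = 0;
    this.durationMs = Math.max(this.minDurationMs, durationMs); // enforce the 300ms floor
  }
}
```

The values returned by `update` would then be written to the model through the Semantic Parameter Layer, whose API is outside this diff.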
@@ -0,0 +1,29 @@
## ADDED Requirements

### Requirement: Generate lip sync frames from text stream
The system SHALL generate a sequence of lip sync frames from incoming text chunks, where each frame specifies a mouth shape value and duration.

#### Scenario: Simple text lip sync
- **WHEN** the text chunk `"hello"` arrives
- **THEN** the system generates frames that open the mouth during characters and briefly close at the end

### Requirement: Map text patterns to mouth shapes
The system SHALL use a simple heuristic mapping: alphabetic characters → open mouth, whitespace/punctuation → closed or partially open mouth.

#### Scenario: Punctuation causes mouth close
- **WHEN** the text chunk `"Hi! How are you?"` arrives
- **THEN** the mouth opens for `H` and `i`, closes at `!` and the following space, reopens at `H`, closes at the next space, reopens at `a`, and so on

### Requirement: Synchronize with text display timing
The generated lip sync frames SHALL align with the estimated text display timing so the model's mouth movements appear synchronized with the text streaming onto screen.

#### Scenario: Timing alignment
- **WHEN** text chunks arrive with delays of 100-200ms between them
- **THEN** the lip sync frames are paced to match those delays

### Requirement: Graceful fallback when mouth parameter unavailable
If the current model does not expose a mouth-related semantic parameter, the lip sync system SHALL silently disable itself without errors.

#### Scenario: Missing mouth parameter
- **WHEN** `SemanticLayer.hasSemantic("mouthOpen")` returns `false`
- **THEN** lip sync frames are generated but not applied to the model
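
The heuristic and the fallback can be sketched as two small functions. Everything here is illustrative: the 60ms-per-character pace, the `0.8` open value, the pause character set, and the `setSemantic` method are assumptions, and only `hasSemantic("mouthOpen")` is named in the spec. Any character that is not whitespace or listed punctuation counts as "speaking", so CJK text opens the mouth as well.

```typescript
interface LipSyncFrame {
  mouthOpen: number;  // 0 = closed, 1 = fully open
  durationMs: number;
}

interface TextLipSyncOptions {
  msPerChar?: number; // assumed display pace per character
  openValue?: number;
}

// Whitespace plus common ASCII and CJK punctuation close the mouth.
const PAUSE_CHARS = /[\s.,!?;:~…、。!?,;:~]/;

function generateLipSyncFrames(text: string, options: TextLipSyncOptions = {}): LipSyncFrame[] {
  const msPerChar = options.msPerChar ?? 60;
  const openValue = options.openValue ?? 0.8;
  const frames: LipSyncFrame[] = [];
  for (let i = 0; i < text.length; i++) {
    const mouthOpen = PAUSE_CHARS.test(text.charAt(i)) ? 0 : openValue;
    const previous = frames[frames.length - 1];
    if (previous !== undefined && previous.mouthOpen === mouthOpen) {
      previous.durationMs += msPerChar; // merge runs into one longer frame
    } else {
      frames.push({ mouthOpen, durationMs: msPerChar });
    }
  }
  // Briefly close the mouth at the end of the chunk.
  if (frames.length > 0 && frames[frames.length - 1].mouthOpen !== 0) {
    frames.push({ mouthOpen: 0, durationMs: msPerChar });
  }
  return frames;
}

// Hypothetical slice of the Semantic Parameter Layer used by the fallback.
interface SemanticLayerLike {
  hasSemantic(name: string): boolean;
  setSemantic(name: string, value: number): void;
}

// Applies a frame only when the model exposes a mouth parameter; otherwise
// it does nothing and reports false, without throwing.
function applyLipSyncFrame(layer: SemanticLayerLike, frame: LipSyncFrame): boolean {
  if (!layer.hasSemantic("mouthOpen")) return false;
  layer.setSemantic("mouthOpen", frame.mouthOpen);
  return true;
}
```

Because the generator returns plain `LipSyncFrame` values, an audio-driven generator could later produce the same frames without changing the code that applies them.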
50 changes: 50 additions & 0 deletions openspec/changes/ai-live2d-runtime-hooks/tasks.md
@@ -0,0 +1,50 @@
## 1. Backend Prompt Update

- [ ] 1.1 Update `AiChatEndpoint.java` system prompt template to include emotion marker instructions when AI hooks are enabled
- [ ] 1.2 Add `aiHooksEnabled` flag to backend AI chat configuration model
- [ ] 1.3 Ensure backward compatibility: prompt without markers when hooks disabled

## 2. Frontend AI Stream Parser

- [ ] 2.1 Create `packages/live2d/src/runtime/ai/AiStreamParser.ts` with emotion marker extraction
- [ ] 2.2 Implement configurable emotion vocabulary at parser initialization
- [ ] 2.3 Implement estimated timestamp calculation for extracted commands
- [ ] 2.4 Add unit tests for marker extraction edge cases (multiple markers, unknown markers, nested brackets)

## 3. AI Command Bus

- [ ] 3.1 Create `packages/live2d/src/runtime/ai/AiCommandBus.ts` with pub-sub interface
- [ ] 3.2 Define TypeScript types for all command types: `emotion`, `lipSync`, `motion`, `filter`
- [ ] 3.3 Implement command queuing for rapid successive publishes
- [ ] 3.4 Wire Command Bus into `ChatApi.handleStreamResponse` (behind `aiHooksEnabled` flag)

## 4. Emotion Timeline System

- [ ] 4.1 Create `packages/live2d/src/runtime/ai/EmotionTimeline.ts`
- [ ] 4.2 Implement discrete emotional state with intensity [0, 1]
- [ ] 4.3 Implement smooth parameter interpolation with configurable duration (min 300ms)
- [ ] 4.4 Create emotion-to-parameter mapping registry (happy, shy, sad, surprised, angry, thinking, neutral)
- [ ] 4.5 Implement auto-return to idle after configurable timeout
- [ ] 4.6 Integrate with Semantic Parameter Layer (via command bus subscription)

## 5. Text Lip Sync Generator

- [ ] 5.1 Create `packages/live2d/src/runtime/ai/TextLipSync.ts`
- [ ] 5.2 Implement text-to-mouth-shape heuristic mapping
- [ ] 5.3 Implement frame timing synchronized with SSE chunk arrival timing
- [ ] 5.4 Add graceful fallback when `mouthOpen` semantic parameter is unavailable
- [ ] 5.5 Integrate with Command Bus (publish `lipSync` commands)

## 6. Integration & Configuration

- [ ] 6.1 Add `aiHooksEnabled` and related fields to `Live2dConfig` interface
- [ ] 6.2 Initialize AI runtime layer in `Live2dCanvas` when config enables it
- [ ] 6.3 Ensure ChatApi backward compatibility (hooks disabled by default)
- [ ] 6.4 Add Halo plugin settings UI fields for AI hooks toggle and emotion timeout

## 7. Testing & Validation

- [ ] 7.1 End-to-end test: AI response with `[happy]` marker triggers model smile
- [ ] 7.2 End-to-end test: Rapid emotion changes do not cause visual jitter
- [ ] 7.3 Test backward compatibility: disabled hooks mode works identically to before
- [ ] 7.4 Performance test: AI hooks do not drop frame rate below 30fps on mid-tier devices