feat: support pretrained_to_huggingface functionality for CosyVoice3 RL training #1890
Open
Sakkana wants to merge 1 commit into
support pretrained_to_huggingface functionality for CosyVoice3 RL training
Summary
Support converting a pretrained PyTorch model to a Hugging Face model for RL training.
1. Token Design
CV2: <|eos1|> <|eos2|> <|eos3|> <|sos|> <|task_id|> (+5)
CV3: <|sos|> <|eos|> <|task_id|> (+203)
CV3 folds control tokens into the speech token space and uses an alias map to redirect them. CV2 simply appends them after the vocab.
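The difference between the two layouts can be sketched in a few lines. This is an illustration only, not the code in this PR: the function names, the vocabulary sizes, and the slot numbers are hypothetical, and only the token strings and the `speech_start_idx` / `speech_end_idx` names come from the description.

```python
def appended_control_ids(text_vocab_size, speech_vocab_size, control_tokens):
    """CV2-style layout: control tokens get fresh IDs after the speech tokens."""
    first_free = text_vocab_size + speech_vocab_size
    return {tok: first_free + i for i, tok in enumerate(control_tokens)}


def aliased_control_ids(speech_start_idx, speech_end_idx, control_slots):
    """CV3-style layout: each control token is an alias for a slot that
    already lies inside the speech token range [speech_start_idx, speech_end_idx)."""
    ids = {}
    for tok, slot in control_slots.items():
        token_id = speech_start_idx + slot
        if not speech_start_idx <= token_id < speech_end_idx:
            raise ValueError(f"{tok} maps outside the speech token range")
        ids[tok] = token_id
    return ids


# Toy sizes (hypothetical): 100 text tokens, 8 speech tokens for the CV2 case.
cv2_ids = appended_control_ids(
    100, 8, ["<|eos1|>", "<|eos2|>", "<|eos3|>", "<|sos|>", "<|task_id|>"]
)

# Toy CV3 case: the speech range spans IDs 100..111 and slots 8..10 are
# assumed to be reserved for control tokens.
cv3_ids = aliased_control_ids(
    100, 112, {"<|sos|>": 8, "<|eos|>": 9, "<|task_id|>": 10}
)
```

In the appended layout the vocabulary grows by one ID per control token, while in the aliased layout the control tokens reuse IDs that the speech range already covers.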
2. Special Token Vocabulary
CV3 introduces phoneme-level tokens absent in CV2:
[AA], [AE], [AH], [B], [CH] ...
[ā], [ǎo], [iāng], [uán] ...
<|endofsystem|>
3. lm_head Construction
-inf
bias=False)
slice(speech_start_idx, speech_end_idx)
4. Input Embeddings
CV2 explicitly copies llm_embedding weights for <|sos|> and <|task_id|> into the input embedding table. CV3 drops llm_embedding entirely and handles everything through the alias mechanism.
5. EOS Token Configuration
CV2 registers three separate EOS token IDs:
CV3 uses both alias and real IDs as a dual fallback:
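As a rough sketch of how the two EOS configurations might behave at decode time, assuming "alias ID" means the ID of the `<|eos|>` alias token and "real ID" means the ID of the underlying speech-range token. The helper names and all numeric IDs below are hypothetical and are not taken from this PR.

```python
def make_stop_check(eos_token_ids):
    """Return a predicate that is True for any configured EOS token ID."""
    stop_ids = frozenset(eos_token_ids)

    def should_stop(token_id):
        return token_id in stop_ids

    return should_stop


def dual_eos_ids(alias_id, real_id):
    """CV3-style fallback: register both IDs, dropping the duplicate
    if the alias and the real token happen to share one ID."""
    return list(dict.fromkeys([alias_id, real_id]))


# CV2-style: three separate EOS IDs (hypothetical values).
cv2_should_stop = make_stop_check([108, 109, 110])

# CV3-style: alias ID plus real ID (hypothetical values).
cv3_eos_ids = dual_eos_ids(alias_id=209, real_id=109)
cv3_should_stop = make_stop_check(cv3_eos_ids)
```

With both IDs registered, generation stops whichever of the two the model emits, which is the point of a dual fallback.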
Test
GPU: 8 x B200 + Triton reward server (SenseVoice) + Verl + GRPO adv_estimator
Note: