Commit 7ea23dd
vocab : add Carbon-3B (HybridDNATokenizer) support (#23410)
* vocab : add Carbon-3B (HybridDNATokenizer) support
Adds a new BPE pre-type LLAMA_VOCAB_PRE_TYPE_CARBON for the
HybridDNATokenizer used by HuggingFaceBio/Carbon-{500M,3B,8B}.
The base BPE is Qwen3-4B-Base's; what differs is that text inside
<dna>...</dna> regions is chunked into fixed 6-mers (right-padded
with 'A' on the trailing partial), and any base outside ACGT maps
to <oov>.
* src/llama-vocab.{h,cpp}: new pre-type, dispatched from
llm_tokenizer_bpe_session::tokenize.
* src/llama-vocab-carbon.h: pure helpers (tokenize_carbon,
emit_dna_kmers) factored out for unit testing — no llama_vocab
dependency, vocab access goes through a std::function.
* conversion/base.py: detect HybridDNATokenizer by class name in
get_vocab_base_pre (chktxt collides with Qwen3 base since it
has no <dna>), and pass trust_remote_code=True in get_vocab_base
so the custom tokenizer class can load.
* tests/test-tokenizer-carbon.cpp: 12 cases covering single 6-mer,
multi 6-mer, lowercase, invalid base -> <oov>, partial k-mer
right-pad, mixed text+DNA, empty <dna></dna>, unterminated <dna>,
two regions, vocab miss.
* vocab : align Carbon-3B changes with llama.cpp conventions
* Fold tokenize_carbon + emit_dna_kmers inline into
llm_tokenizer_bpe_session (drop src/llama-vocab-carbon.h),
matching how every other tokenizer keeps its helpers inside
llama-vocab.cpp.
* Replace the standalone unit test with the conventional
test-tokenizer-0 row backed by models/ggml-vocab-carbon.gguf
(vocab-only conversion) + .inp/.out fixtures covering single
6-mer, multi 6-mer, lowercase, invalid base -> <oov>, partial
right-pad, mixed text+DNA, empty <dna></dna>, unterminated <dna>,
two regions.
* Register "carbon" in convert_hf_to_gguf_update.py's model list
(pointing at HuggingFaceBio/Carbon-3B) and teach both
AutoTokenizer call sites in the updater to pass
trust_remote_code=True for it, matching how t5 is special-cased.
* vocab : move Carbon dispatch to _set_vocab_carbon + LlamaModel branch
Refactor the conversion-side changes to follow the per-tokenizer-family
convention used by _set_vocab_qwen, _set_vocab_interns1, _set_vocab_glm,
etc. instead of conditionalising the shared get_vocab_base /
get_vocab_base_pre paths.
* conversion/base.py: add _set_vocab_carbon — self-contained, loads
with trust_remote_code=True so HybridDNATokenizer's merged Qwen3 + DNA
vocab is visible, writes tokenizer.ggml.pre = "carbon" directly.
* conversion/llama.py: branch in LlamaModel.set_vocab on
tokenizer_config.json["tokenizer_class"] == "HybridDNATokenizer" and
dispatch to _set_vocab_carbon. Same precedent as conversion/bert.py
(tokenizer_class branch between BertTokenizer / RobertaTokenizer) and
conversion/phi.py.
* conversion/base.py: revert the conditional in get_vocab_base and the
class-name short-circuit in the auto-generated get_vocab_base_pre.
* tests : expand ggml-vocab-carbon.gguf fixtures with model-card examples
Add 6 cases from the Carbon-3B model card on top of the existing edge
coverage: the unterminated basic-completion prompt, the closed 33-bp
example, the metadata-conditioned prompt (with <vertebrate_mammalian>
and <protein_coding_region> which BPE-decompose since they are not in
the vocab), the documented anti-pattern of raw DNA without <dna> tags,
and the two likelihood-scoring examples. Brings the suite to 19 cases.
* vocab : promote HybridDNATokenizer to its own LLAMA_VOCAB_TYPE
Refactor per upstream review:
> This should be its own tokenizer model, ie. carbonhybriddna instead
> of gpt2 and not carbon pre-tokenizer. That way you can keep the
> correct pre-tokenizer, in case that ever changes.
Previously the tokenizer was modelled as LLAMA_VOCAB_TYPE_BPE plus a
new LLAMA_VOCAB_PRE_TYPE_CARBON, which (a) put a CARBON-specific
branch inside llm_tokenizer_bpe_session::tokenize (only existing
pre-types differ in regex, not dispatch logic), and (b) conflated
"hybrid DNA tokenization" with "Qwen3 BPE pre-tokenizer".
This change moves it to its own vocab type, peer to PLAMO2, with the
GGUF model name matching the HF tokenizer class (HybridDNATokenizer):
* include/llama.h: new LLAMA_VOCAB_TYPE_HYBRIDDNA = 7.
* src/llama-vocab.cpp: new llm_tokenizer_hybriddna + session that
owns std::unique_ptr<llm_tokenizer_bpe> for non-<dna> text and
routes raw text through a DNA-aware splitter; wired into
init_tokenizer, tokenize, type_name, byte_to_token, and the
BPE-style token_to_piece case (DNA k-mers + <dna>/</dna>/<oov>
are pure ASCII, so byte-level BPE decoding handles them).
LLAMA_VOCAB_TYPE_HYBRIDDNA gets its own branch in the vocab-type
config block alongside SPM/WPM/UGM/RWKV, where pre_type is set
to QWEN2 and the matching add_space_prefix / escape_whitespaces /
clean_spaces flags are applied — mirroring qwen2's BPE path so
byte-level BPE merging stays bit-identical to the Python
reference for non-DNA text.
* src/llama-vocab.h: drop the short-lived LLAMA_VOCAB_PRE_TYPE_CARBON.
* conversion/base.py: _set_vocab_hybriddna writes
tokenizer.ggml.model = "hybriddna" (no separate pre).
* conversion/llama.py: dispatch on tokenizer_class ==
"HybridDNATokenizer" same as bert.py / phi.py do.
* models/ggml-vocab-hybriddna.gguf{,.inp,.out}: renamed fixture +
regenerated metadata.
* convert_hf_to_gguf_update.py: drop the stale chkhsh entry and
trust_remote_code special-case (no longer needed since dispatch
is now class-name driven, not chkhsh).
Verified end-to-end against HuggingFaceBio/Carbon-{500M,3B,8B}:
tokenization is bit-identical to the Python HybridDNATokenizer for
all 19 test fixtures plus the model-card metadata-conditioned
prompt; greedy completion produces the same DNA continuation as
the Python reference; spec-dec with 500M as draft for 8B still
works.
* vocab : relax llm_tokenizer_bpe assert to allow HYBRIDDNA
* vocab : drop llm_tokenizer_bpe vocab-type assert
* vocab : write tokenizer.ggml.pre for HYBRIDDNA, share BPE dispatch
* vocab : assert BPE or HYBRIDDNA in llm_tokenizer_bpe
* vocab : annotate #endif with PRETOKENIZERDEBUG
* vocab : drop local hybriddna fixture (moves to ggml-org/vocabs)
* deduplicate
* simplify
* simplify
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>1 parent 2fc8d18 commit 7ea23dd
3 files changed
Lines changed: 152 additions & 15 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1610 | 1610 | | |
1611 | 1611 | | |
1612 | 1612 | | |
| 1613 | + | |
| 1614 | + | |
| 1615 | + | |
| 1616 | + | |
| 1617 | + | |
| 1618 | + | |
| 1619 | + | |
| 1620 | + | |
| 1621 | + | |
| 1622 | + | |
| 1623 | + | |
| 1624 | + | |
| 1625 | + | |
| 1626 | + | |
| 1627 | + | |
| 1628 | + | |
| 1629 | + | |
| 1630 | + | |
| 1631 | + | |
| 1632 | + | |
| 1633 | + | |
| 1634 | + | |
| 1635 | + | |
| 1636 | + | |
| 1637 | + | |
| 1638 | + | |
| 1639 | + | |
| 1640 | + | |
| 1641 | + | |
| 1642 | + | |
| 1643 | + | |
| 1644 | + | |
| 1645 | + | |
| 1646 | + | |
| 1647 | + | |
| 1648 | + | |
1613 | 1649 | | |
1614 | 1650 | | |
1615 | 1651 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
51 | 51 | | |
52 | 52 | | |
53 | 53 | | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
54 | 63 | | |
55 | 64 | | |
56 | 65 | | |
| |||
72 | 81 | | |
73 | 82 | | |
74 | 83 | | |
75 | | - | |
76 | | - | |
77 | | - | |
78 | | - | |
79 | | - | |
80 | | - | |
81 | | - | |
82 | 84 | | |
83 | 85 | | |
84 | 86 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
530 | 530 | | |
531 | 531 | | |
532 | 532 | | |
| 533 | + | |
| 534 | + | |
533 | 535 | | |
534 | 536 | | |
535 | 537 | | |
| |||
567 | 569 | | |
568 | 570 | | |
569 | 571 | | |
570 | | - | |
| 572 | + | |
571 | 573 | | |
572 | 574 | | |
573 | 575 | | |
| |||
1579 | 1581 | | |
1580 | 1582 | | |
1581 | 1583 | | |
| 1584 | + | |
| 1585 | + | |
| 1586 | + | |
| 1587 | + | |
| 1588 | + | |
| 1589 | + | |
| 1590 | + | |
| 1591 | + | |
| 1592 | + | |
| 1593 | + | |
| 1594 | + | |
| 1595 | + | |
| 1596 | + | |
| 1597 | + | |
| 1598 | + | |
| 1599 | + | |
| 1600 | + | |
| 1601 | + | |
| 1602 | + | |
| 1603 | + | |
| 1604 | + | |
| 1605 | + | |
| 1606 | + | |
| 1607 | + | |
| 1608 | + | |
| 1609 | + | |
| 1610 | + | |
| 1611 | + | |
| 1612 | + | |
| 1613 | + | |
| 1614 | + | |
| 1615 | + | |
| 1616 | + | |
| 1617 | + | |
| 1618 | + | |
| 1619 | + | |
| 1620 | + | |
| 1621 | + | |
| 1622 | + | |
| 1623 | + | |
| 1624 | + | |
| 1625 | + | |
| 1626 | + | |
| 1627 | + | |
| 1628 | + | |
| 1629 | + | |
| 1630 | + | |
| 1631 | + | |
| 1632 | + | |
| 1633 | + | |
| 1634 | + | |
| 1635 | + | |
| 1636 | + | |
| 1637 | + | |
| 1638 | + | |
| 1639 | + | |
| 1640 | + | |
| 1641 | + | |
| 1642 | + | |
| 1643 | + | |
| 1644 | + | |
| 1645 | + | |
| 1646 | + | |
| 1647 | + | |
| 1648 | + | |
| 1649 | + | |
| 1650 | + | |
| 1651 | + | |
| 1652 | + | |
| 1653 | + | |
| 1654 | + | |
| 1655 | + | |
| 1656 | + | |
| 1657 | + | |
| 1658 | + | |
| 1659 | + | |
| 1660 | + | |
| 1661 | + | |
| 1662 | + | |
| 1663 | + | |
| 1664 | + | |
| 1665 | + | |
| 1666 | + | |
| 1667 | + | |
| 1668 | + | |
| 1669 | + | |
| 1670 | + | |
| 1671 | + | |
| 1672 | + | |
1582 | 1673 | | |
1583 | 1674 | | |
1584 | 1675 | | |
| |||
1808 | 1899 | | |
1809 | 1900 | | |
1810 | 1901 | | |
1811 | | - | |
| 1902 | + | |
1812 | 1903 | | |
1813 | 1904 | | |
1814 | 1905 | | |
| |||
3144 | 3235 | | |
3145 | 3236 | | |
3146 | 3237 | | |
3147 | | - | |
3148 | 3238 | | |
3149 | 3239 | | |
| 3240 | + | |
| 3241 | + | |
| 3242 | + | |
| 3243 | + | |
| 3244 | + | |
| 3245 | + | |
| 3246 | + | |
| 3247 | + | |
| 3248 | + | |
3150 | 3249 | | |
3151 | | - | |
| 3250 | + | |
3152 | 3251 | | |
3153 | 3252 | | |
3154 | 3253 | | |
| |||
3161 | 3260 | | |
3162 | 3261 | | |
3163 | 3262 | | |
3164 | | - | |
| 3263 | + | |
3165 | 3264 | | |
3166 | | - | |
| 3265 | + | |
3167 | 3266 | | |
3168 | 3267 | | |
3169 | 3268 | | |
3170 | 3269 | | |
3171 | | - | |
3172 | | - | |
| 3270 | + | |
| 3271 | + | |
3173 | 3272 | | |
3174 | 3273 | | |
3175 | 3274 | | |
| |||
0 commit comments