Commit 7ea23dd

kashif and CISC authored
vocab : add Carbon-3B (HybridDNATokenizer) support (#23410)
* vocab : add Carbon-3B (HybridDNATokenizer) support

Adds a new BPE pre-type LLAMA_VOCAB_PRE_TYPE_CARBON for the HybridDNATokenizer used by HuggingFaceBio/Carbon-{500M,3B,8B}. The base BPE is Qwen3-4B-Base's; what differs is that text inside <dna>...</dna> regions is chunked into fixed 6-mers (right-padded with 'A' on the trailing partial), and any base outside ACGT maps to <oov>.

  * src/llama-vocab.{h,cpp}: new pre-type, dispatched from llm_tokenizer_bpe_session::tokenize.
  * src/llama-vocab-carbon.h: pure helpers (tokenize_carbon, emit_dna_kmers) factored out for unit testing; no llama_vocab dependency, vocab access goes through a std::function.
  * conversion/base.py: detect HybridDNATokenizer by class name in get_vocab_base_pre (chktxt collides with Qwen3 base since it has no <dna>), and pass trust_remote_code=True in get_vocab_base so the custom tokenizer class can load.
  * tests/test-tokenizer-carbon.cpp: 12 cases covering single 6-mer, multi 6-mer, lowercase, invalid base -> <oov>, partial k-mer right-pad, mixed text+DNA, empty <dna></dna>, unterminated <dna>, two regions, vocab miss.

* vocab : align Carbon-3B changes with llama.cpp conventions

  * Fold tokenize_carbon + emit_dna_kmers inline into llm_tokenizer_bpe_session (drop src/llama-vocab-carbon.h), matching how every other tokenizer keeps its helpers inside llama-vocab.cpp.
  * Replace the standalone unit test with the conventional test-tokenizer-0 row backed by models/ggml-vocab-carbon.gguf (vocab-only conversion) + .inp/.out fixtures covering single 6-mer, multi 6-mer, lowercase, invalid base -> <oov>, partial right-pad, mixed text+DNA, empty <dna></dna>, unterminated <dna>, two regions.
  * Register "carbon" in convert_hf_to_gguf_update.py's model list (pointing at HuggingFaceBio/Carbon-3B) and teach both AutoTokenizer call sites in the updater to pass trust_remote_code=True for it, matching how t5 is special-cased.

* vocab : move Carbon dispatch to _set_vocab_carbon + LlamaModel branch

Refactor the conversion-side changes to follow the per-tokenizer-family convention used by _set_vocab_qwen, _set_vocab_interns1, _set_vocab_glm, etc. instead of conditionalising the shared get_vocab_base / get_vocab_base_pre paths.

  * conversion/base.py: add _set_vocab_carbon: self-contained, loads with trust_remote_code=True so HybridDNATokenizer's merged Qwen3 + DNA vocab is visible, writes tokenizer.ggml.pre = "carbon" directly.
  * conversion/llama.py: branch in LlamaModel.set_vocab on tokenizer_config.json["tokenizer_class"] == "HybridDNATokenizer" and dispatch to _set_vocab_carbon. Same precedent as conversion/bert.py (tokenizer_class branch between BertTokenizer / RobertaTokenizer) and conversion/phi.py.
  * conversion/base.py: revert the conditional in get_vocab_base and the class-name short-circuit in the auto-generated get_vocab_base_pre.

* tests : expand ggml-vocab-carbon.gguf fixtures with model-card examples

Add 6 cases from the Carbon-3B model card on top of the existing edge coverage: the unterminated basic-completion prompt, the closed 33-bp example, the metadata-conditioned prompt (with <vertebrate_mammalian> and <protein_coding_region> which BPE-decompose since they are not in the vocab), the documented anti-pattern of raw DNA without <dna> tags, and the two likelihood-scoring examples. Brings the suite to 19 cases.

* vocab : promote HybridDNATokenizer to its own LLAMA_VOCAB_TYPE

Refactor per upstream review:

> This should be its own tokenizer model, ie. carbonhybriddna instead
> of gpt2 and not carbon pre-tokenizer. That way you can keep the
> correct pre-tokenizer, in case that ever changes.

Previously the tokenizer was modelled as LLAMA_VOCAB_TYPE_BPE plus a new LLAMA_VOCAB_PRE_TYPE_CARBON, which (a) put a CARBON-specific branch inside llm_tokenizer_bpe_session::tokenize (existing pre-types differ only in regex, not in dispatch logic), and (b) conflated "hybrid DNA tokenization" with "Qwen3 BPE pre-tokenizer". This change moves it to its own vocab type, peer to PLAMO2, with the GGUF model name matching the HF tokenizer class (HybridDNATokenizer):

  * include/llama.h: new LLAMA_VOCAB_TYPE_HYBRIDDNA = 7.
  * src/llama-vocab.cpp: new llm_tokenizer_hybriddna + session that owns std::unique_ptr<llm_tokenizer_bpe> for non-<dna> text and routes raw text through a DNA-aware splitter; wired into init_tokenizer, tokenize, type_name, byte_to_token, and the BPE-style token_to_piece case (DNA k-mers + <dna>/</dna>/<oov> are pure ASCII, so byte-level BPE decoding handles them). LLAMA_VOCAB_TYPE_HYBRIDDNA gets its own branch in the vocab-type config block alongside SPM/WPM/UGM/RWKV, where pre_type is set to QWEN2 and the matching add_space_prefix / escape_whitespaces / clean_spaces flags are applied, mirroring qwen2's BPE path so byte-level BPE merging stays bit-identical to the Python reference for non-DNA text.
  * src/llama-vocab.h: drop the short-lived LLAMA_VOCAB_PRE_TYPE_CARBON.
  * conversion/base.py: _set_vocab_hybriddna writes tokenizer.ggml.model = "hybriddna" (no separate pre).
  * conversion/llama.py: dispatch on tokenizer_class == "HybridDNATokenizer" same as bert.py / phi.py do.
  * models/ggml-vocab-hybriddna.gguf{,.inp,.out}: renamed fixture + regenerated metadata.
  * convert_hf_to_gguf_update.py: drop the stale chkhsh entry and trust_remote_code special-case (no longer needed since dispatch is now class-name driven, not chkhsh).

Verified end-to-end against HuggingFaceBio/Carbon-{500M,3B,8B}: tokenization is bit-identical to the Python HybridDNATokenizer for all 19 test fixtures plus the model-card metadata-conditioned prompt; greedy completion produces the same DNA continuation as the Python reference; spec-dec with 500M as draft for 8B still works.

* vocab : relax llm_tokenizer_bpe assert to allow HYBRIDDNA
* vocab : drop llm_tokenizer_bpe vocab-type assert
* vocab : write tokenizer.ggml.pre for HYBRIDDNA, share BPE dispatch
* vocab : assert BPE or HYBRIDDNA in llm_tokenizer_bpe
* vocab : annotate #endif with PRETOKENIZERDEBUG
* vocab : drop local hybriddna fixture (moves to ggml-org/vocabs)
* deduplicate
* simplify
* simplify

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
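
The 6-mer rule described in the message (uppercase, fixed-size chunks, trailing partial right-padded with 'A', anything outside ACGT mapped to <oov>) can be illustrated with a small Python sketch. The function name `chunk_dna_kmers` is hypothetical, and the sketch returns token strings rather than vocabulary IDs, so the vocab-miss fallback in the C++ code is not modelled here.

```python
VALID_BASES = frozenset("ACGT")


def chunk_dna_kmers(seq: str, k: int = 6, pad: str = "A") -> list[str]:
    """Split a DNA string into k-mers; chunks with a non-ACGT base become "<oov>".

    Hypothetical helper for illustration only; it returns strings, not token IDs.
    """
    # Uppercase ASCII letters only, like the `c - 32` loop in the diff.
    seq = "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in seq)

    pieces = []
    for i in range(0, len(seq), k):
        # Right-pad the trailing partial chunk up to k characters.
        kmer = seq[i:i + k].ljust(k, pad)
        pieces.append(kmer if set(kmer) <= VALID_BASES else "<oov>")
    return pieces


print(chunk_dna_kmers("acgtacgtac"))
```

A 10-base input yields one full 6-mer and one padded 6-mer; an empty string yields no pieces at all, which is why an empty <dna></dna> region contributes only the two tag tokens.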
1 parent 2fc8d18 commit 7ea23dd

3 files changed

Lines changed: 152 additions & 15 deletions

conversion/base.py

Lines changed: 36 additions & 0 deletions
@@ -1610,6 +1610,42 @@ def _set_vocab_gpt2(self) -> None:
         special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True)
         special_vocab.add_to_gguf(self.gguf_writer)

+    def _set_vocab_hybriddna(self):
+        from transformers import AutoTokenizer
+        tokenizer = AutoTokenizer.from_pretrained(self.dir_model, trust_remote_code=True)
+        vocab_size = self.hparams.get("vocab_size", len(tokenizer.vocab)) # ty: ignore[unresolved-attribute]
+        assert max(tokenizer.vocab.values()) < vocab_size # ty: ignore[unresolved-attribute]
+
+        reverse_vocab = {id_: encoded_tok for encoded_tok, id_ in tokenizer.vocab.items()} # ty: ignore[unresolved-attribute]
+        added_vocab = tokenizer.get_added_vocab() # ty: ignore[unresolved-attribute]
+        added_tokens_decoder = tokenizer.added_tokens_decoder # ty: ignore[unresolved-attribute]
+
+        tokens: list[str] = []
+        toktypes: list[int] = []
+        for i in range(vocab_size):
+            if i not in reverse_vocab:
+                tokens.append(f"[PAD{i}]")
+                toktypes.append(gguf.TokenType.UNUSED)
+            else:
+                token: str = reverse_vocab[i]
+                if token in added_vocab:
+                    if added_tokens_decoder[i].special or self.does_token_look_special(token):
+                        toktypes.append(gguf.TokenType.CONTROL)
+                    else:
+                        toktypes.append(gguf.TokenType.USER_DEFINED)
+                else:
+                    toktypes.append(gguf.TokenType.NORMAL)
+                tokens.append(token)
+
+        tokpre = self.get_vocab_base_pre(tokenizer)
+        self.gguf_writer.add_tokenizer_model("hybriddna")
+        self.gguf_writer.add_tokenizer_pre(tokpre)
+        self.gguf_writer.add_token_list(tokens)
+        self.gguf_writer.add_token_types(toktypes)
+
+        special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True)
+        special_vocab.add_to_gguf(self.gguf_writer)
+
     def _set_vocab_qwen(self):
         from .qwen import QwenModel
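
The token-table loop in `_set_vocab_hybriddna` can be exercised without `transformers` or `gguf` by feeding it plain dictionaries. The following is a minimal sketch under stated assumptions: `TokenType` is a local stand-in for `gguf.TokenType` (its numeric values are illustrative), `build_token_table` is a hypothetical name, and the `does_token_look_special` heuristic from the diff is replaced by an explicit set of special IDs.

```python
from enum import IntEnum


class TokenType(IntEnum):
    # Local stand-in for gguf.TokenType; values here are illustrative only.
    NORMAL = 1
    CONTROL = 3
    USER_DEFINED = 4
    UNUSED = 5


def build_token_table(vocab, added_vocab, special_ids, vocab_size):
    """Return (tokens, toktypes) with one entry per ID in range(vocab_size).

    vocab maps token string -> ID, added_vocab is the set of added token
    strings, special_ids is the set of IDs flagged as special.
    """
    reverse_vocab = {id_: tok for tok, id_ in vocab.items()}
    tokens, toktypes = [], []
    for i in range(vocab_size):
        if i not in reverse_vocab:
            # Holes in the ID space get a placeholder so indices stay aligned.
            tokens.append(f"[PAD{i}]")
            toktypes.append(TokenType.UNUSED)
            continue
        token = reverse_vocab[i]
        if token in added_vocab:
            is_special = i in special_ids
            toktypes.append(TokenType.CONTROL if is_special else TokenType.USER_DEFINED)
        else:
            toktypes.append(TokenType.NORMAL)
        tokens.append(token)
    return tokens, toktypes


tokens, toktypes = build_token_table(
    vocab={"a": 0, "<dna>": 1, "ACGTAC": 3},
    added_vocab={"<dna>", "ACGTAC"},
    special_ids={1},
    vocab_size=4,
)
print(list(zip(tokens, toktypes)))
```

The placeholder entries matter because the model's embedding table is indexed by ID: every ID below `vocab_size` needs a token string and a type, even when the tokenizer defines nothing for it.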

conversion/llama.py

Lines changed: 9 additions & 7 deletions
@@ -51,6 +51,15 @@ def set_vocab(self):
         if path_tekken_json.is_file() and not path_tokenizer_json.is_file():
             self._set_vocab_mistral()

+        tokenizer_config_file = self.dir_model / 'tokenizer_config.json'
+        if tokenizer_config_file.is_file():
+            with open(tokenizer_config_file, "r", encoding="utf-8") as f:
+                tokenizer_config_json = json.load(f)
+                if (add_prefix_space := tokenizer_config_json.get("add_prefix_space")) is not None:
+                    self.gguf_writer.add_add_space_prefix(add_prefix_space)
+                if tokenizer_config_json.get("tokenizer_class") == "HybridDNATokenizer":
+                    return self._set_vocab_hybriddna()
+
         try:
             self._set_vocab_sentencepiece()
         except FileNotFoundError:
@@ -72,13 +81,6 @@ def set_vocab(self):
             special_vocab._set_special_token("eot", 32010)
             special_vocab.add_to_gguf(self.gguf_writer)

-        tokenizer_config_file = self.dir_model / 'tokenizer_config.json'
-        if tokenizer_config_file.is_file():
-            with open(tokenizer_config_file, "r", encoding="utf-8") as f:
-                tokenizer_config_json = json.load(f)
-                if "add_prefix_space" in tokenizer_config_json:
-                    self.gguf_writer.add_add_space_prefix(tokenizer_config_json["add_prefix_space"])
-
         # Apply to granite small models only
         if self.hparams.get("vocab_size", 32000) == 49152:
             self.gguf_writer.add_add_bos_token(False)
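
The class-name dispatch in `set_vocab` amounts to reading two optional keys from `tokenizer_config.json` before any other vocab path is tried. A minimal standalone sketch, assuming a hypothetical helper named `read_tokenizer_hints` that returns both values instead of writing to a GGUF writer:

```python
import json
from pathlib import Path
from typing import Optional, Tuple


def read_tokenizer_hints(model_dir: Path) -> Tuple[Optional[bool], bool]:
    """Return (add_prefix_space, is_hybrid_dna) from tokenizer_config.json.

    Hypothetical helper for illustration; a missing file yields (None, False).
    """
    config_file = model_dir / "tokenizer_config.json"
    if not config_file.is_file():
        return None, False
    with open(config_file, "r", encoding="utf-8") as f:
        config = json.load(f)
    # .get() returns None for an absent key, so "absent" and "null" are treated alike.
    add_prefix_space = config.get("add_prefix_space")
    is_hybrid_dna = config.get("tokenizer_class") == "HybridDNATokenizer"
    return add_prefix_space, is_hybrid_dna
```

One visible consequence of the diff's switch from `"add_prefix_space" in tokenizer_config_json` to `.get(...) is not None` is that an explicit JSON `null` no longer reaches `add_add_space_prefix`.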

src/llama-vocab.cpp

Lines changed: 107 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -530,6 +530,8 @@ struct llm_tokenizer_bpe : llm_tokenizer {
 struct llm_tokenizer_bpe_session {
     llm_tokenizer_bpe_session(const llama_vocab & vocab, const llm_tokenizer_bpe & tokenizer) : vocab(vocab), tokenizer(tokenizer) {}

+    virtual ~llm_tokenizer_bpe_session() = default;
+
     static void append(const llama_token token_id, std::vector<llama_token> & output) {
         output.push_back(token_id);
     }
@@ -567,7 +569,7 @@ struct llm_tokenizer_bpe_session {
         }
     }

-    void tokenize(const std::string & text, std::vector<llama_token> & output) {
+    virtual void tokenize(const std::string & text, std::vector<llama_token> & output) {
         int final_prev_index = -1;
         const auto word_collection = unicode_regex_split(text, tokenizer.regex_exprs, tokenizer.byte_encode);

@@ -1579,6 +1581,95 @@ struct llm_tokenizer_plamo2_session {
     const llm_tokenizer_plamo2 & tokenizer;
 };

+struct llm_tokenizer_hybriddna_session : llm_tokenizer_bpe_session {
+    llm_tokenizer_hybriddna_session(const llama_vocab & vocab, const llm_tokenizer_bpe & tokenizer) : llm_tokenizer_bpe_session{vocab, tokenizer}, vocab{vocab} {}
+
+    void tokenize(const std::string & text, std::vector<llama_token> & output) override {
+        static const std::string open_tag = "<dna>";
+        static const std::string close_tag = "</dna>";
+
+        const auto dna_begin_id = vocab.text_to_token(open_tag);
+        const auto dna_end_id = vocab.text_to_token(close_tag);
+        const auto dna_oov_id = vocab.text_to_token("<oov>");
+
+        // Fall back to plain BPE if the DNA pieces aren't in the vocab.
+        if (dna_begin_id == LLAMA_TOKEN_NULL || dna_end_id == LLAMA_TOKEN_NULL || dna_oov_id == LLAMA_TOKEN_NULL) {
+            llm_tokenizer_bpe_session::tokenize(text, output);
+            return;
+        }
+
+        const size_t k = 6;
+        size_t pos = 0;
+
+        while (pos < text.size()) {
+            const size_t start = text.find(open_tag, pos);
+            if (start == std::string::npos) {
+                if (pos < text.size()) {
+                    llm_tokenizer_bpe_session::tokenize(text.substr(pos), output);
+                }
+                break;
+            }
+            if (start > pos) {
+                llm_tokenizer_bpe_session::tokenize(text.substr(pos, start - pos), output);
+            }
+            output.push_back(dna_begin_id);
+
+            const size_t content_start = start + open_tag.size();
+            const size_t end = text.find(close_tag, content_start);
+            const size_t content_end = (end == std::string::npos) ? text.size() : end;
+
+            emit_dna_kmers(text.substr(content_start, content_end - content_start), k, dna_oov_id, output);
+
+            if (end == std::string::npos) {
+                break;
+            }
+            output.push_back(dna_end_id);
+            pos = end + close_tag.size();
+        }
+    }
+
+private:
+    void emit_dna_kmers(const std::string & raw, size_t k, llama_token oov_id, std::vector<llama_token> & output) {
+        std::string seq = raw;
+        for (char & c : seq) {
+            if (c >= 'a' && c <= 'z') {
+                c = char(c - 32);
+            }
+        }
+        auto is_valid_kmer = [](const std::string & s) {
+            for (char c : s) {
+                if (c != 'A' && c != 'C' && c != 'G' && c != 'T') {
+                    return false;
+                }
+            }
+            return true;
+        };
+
+        size_t i = 0;
+        for (; i + k <= seq.size(); i += k) {
+            const std::string kmer = seq.substr(i, k);
+            if (is_valid_kmer(kmer)) {
+                const auto tok = vocab.text_to_token(kmer);
+                output.push_back(tok != LLAMA_TOKEN_NULL ? tok : oov_id);
+            } else {
+                output.push_back(oov_id);
+            }
+        }
+        if (i < seq.size()) {
+            std::string kmer = seq.substr(i);
+            kmer.append(k - kmer.size(), 'A');
+            if (is_valid_kmer(kmer)) {
+                const auto tok = vocab.text_to_token(kmer);
+                output.push_back(tok != LLAMA_TOKEN_NULL ? tok : oov_id);
+            } else {
+                output.push_back(oov_id);
+            }
+        }
+    }
+
+    const llama_vocab & vocab;
+};
+
 //
 // impl
 //
@@ -1808,7 +1899,7 @@ void llama_vocab::impl::load(llama_model_loader & ml, const LLM_KV & kv) {
             special_mask_id = 103;

             add_sep = true;
-        } else if (tokenizer_model == "gpt2") {
+        } else if (tokenizer_model == "gpt2" || tokenizer_model == "hybriddna") {
             type = LLAMA_VOCAB_TYPE_BPE;

             // read bpe merges and populate bpe ranks
@@ -3144,11 +3235,19 @@ std::vector<llama_token> llama_vocab::impl::tokenize(
                 } break;
             case LLAMA_VOCAB_TYPE_BPE:
                 {
-                    llm_tokenizer_bpe_session session(vocab, *static_cast<const llm_tokenizer_bpe *>(tokenizer.get()));
                     // it calls some other methods that are not exist in llm_tokenizer,
                     // here just cast it to bpe tokenizer object
+                    const llm_tokenizer_bpe * tok_bpe = static_cast<const llm_tokenizer_bpe *>(tokenizer.get());
+
+                    std::unique_ptr<llm_tokenizer_bpe_session> session;
+                    if (vocab.get_tokenizer_model() == "hybriddna") {
+                        session = std::make_unique<llm_tokenizer_hybriddna_session>(vocab, *tok_bpe);
+                    } else {
+                        session = std::make_unique<llm_tokenizer_bpe_session>(vocab, *tok_bpe);
+                    }
+
                     if (add_special) {
-                        session.append_bos(output);
+                        session->append_bos(output);
                     }
                     for (const auto & fragment : fragment_buffer) {
                         if (fragment.type == FRAGMENT_BUFFER_VARIANT_TYPE_RAW_TEXT) {
@@ -3161,15 +3260,15 @@ std::vector<llama_token> llama_vocab::impl::tokenize(
 #ifdef PRETOKENIZERDEBUG
                             LLAMA_LOG_WARN("TT: (%ld %ld %ld) '%s'\n", text.length(), fragment.offset, fragment.length, text.c_str());
 #endif
-                            session.tokenize(text, output);
+                            session->tokenize(text, output);
                         } else { // if (fragment.type == FRAGMENT_BUFFER_VARIANT_TYPE_TOKEN)
-                            session.append(fragment.token, output);
+                            session->append(fragment.token, output);
                         }
                     }

                     if (add_special) {
-                        session.append_eos(output);
-                        session.check_double_bos_eos(output);
+                        session->append_eos(output);
+                        session->check_double_bos_eos(output);
                     }
                 } break;
             case LLAMA_VOCAB_TYPE_WPM:
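
The control flow of `llm_tokenizer_hybriddna_session::tokenize` (plain text goes to BPE, `<dna>` regions go to the k-mer path, an unterminated region runs to the end of the input) can be traced with a short Python sketch. The function `split_dna_regions` and its segment labels are hypothetical; it only labels spans and does not produce token IDs, and it omits the fall-back to plain BPE taken when the tag tokens are missing from the vocab.

```python
def split_dna_regions(text, open_tag="<dna>", close_tag="</dna>"):
    """Label spans of text as ("text" | "open" | "dna" | "close", string).

    Hypothetical helper for illustration. "text" spans would go to BPE and
    "dna" spans to the 6-mer chunker; the tags map to single tokens.
    """
    segments = []
    pos = 0
    while pos < len(text):
        start = text.find(open_tag, pos)
        if start == -1:
            # No further DNA region: the remainder is ordinary text.
            segments.append(("text", text[pos:]))
            break
        if start > pos:
            segments.append(("text", text[pos:start]))
        segments.append(("open", open_tag))

        content_start = start + len(open_tag)
        end = text.find(close_tag, content_start)
        content_end = len(text) if end == -1 else end
        if content_end > content_start:
            segments.append(("dna", text[content_start:content_end]))

        if end == -1:
            # Unterminated region: no closing tag is emitted.
            break
        segments.append(("close", close_tag))
        pos = end + len(close_tag)
    return segments


for segment in split_dna_regions("hi <dna>ACGT</dna> there <dna>GG"):
    print(segment)
```

Leaving the closing tag out for an unterminated region is what lets a completion prompt end inside `<dna>` so the model continues the sequence.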
