Commit 65ac541

refactor!: vendor the tokenizer stack into lance (#6512)
This PR vendors the tokenizer stack Lance actually uses into a new `rust/lance-tokenizer` crate and rewires FTS and inverted-index code to depend on it instead of `tantivy` and `lindera-tantivy`. It keeps the existing document and query tokenization semantics in-tree, renames the old FTS document adapter module to `document_tokenizer`, and preserves upstream license headers on vendored code.
1 parent e5ceacb commit 65ac541

37 files changed

Lines changed: 3664 additions & 1041 deletions

.typos.toml

Lines changed: 1 addition & 0 deletions
@@ -23,5 +23,6 @@ nprob = "nprobe"
 extend-exclude = [
 "notebooks/*.ipynb",
 "*_THIRD_PARTY_LICENSES.*",
+"rust/lance-tokenizer/src/stop_word_filter/stopwords.rs",
 ]
 # If a line ends with # or // and has spellchecker:disable-line, ignore it
