Commit 6bbba18

Merge origin/main into pr-34
2 parents 9e02042 + 53c5a61 commit 6bbba18

6 files changed

Lines changed: 287 additions & 187 deletions

.github/workflows/ci.yaml

Lines changed: 27 additions & 15 deletions
@@ -12,9 +12,16 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        backend: # Different backends for Tokenizers
-          - default
-          - fancy-regex
+        include:
+          - name: default
+            build: cargo build --verbose
+            test: cargo test --verbose
+          - name: fancy-regex
+            build: cargo build --no-default-features --features fancy-regex --verbose
+            test: cargo test --no-default-features --features fancy-regex --verbose
+          - name: local-only
+            build: cargo build --no-default-features --features onig,local-only --verbose
+            test: cargo test --no-default-features --features onig,local-only --verbose
 
     steps:
       - name: Checkout
@@ -26,23 +33,28 @@ jobs:
           components: clippy, rustfmt
 
       - name: Build
-        run: |
-          if [ "${{ matrix.backend }}" = "default" ]; then
-            cargo build --verbose
-          else
-            cargo build --no-default-features --features fancy-regex --verbose
-          fi
+        run: ${{ matrix.build }}
 
       - name: Test
-        run: |
-          if [ "${{ matrix.backend }}" = "default" ]; then
-            cargo test --verbose
-          else
-            cargo test --no-default-features --features fancy-regex --verbose
-          fi
+        run: ${{ matrix.test }}
 
       - name: Lint (clippy)
         run: cargo clippy --all-targets --all-features -- -D warnings
 
       - name: Format check
         run: cargo fmt -- --check
+
+  wasm-check:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Set up Rust
+        uses: dtolnay/rust-toolchain@stable
+        with:
+          target: wasm32-unknown-unknown
+
+      - name: Check wasm feature set
+        run: RUSTFLAGS='--cfg getrandom_backend="wasm_js"' cargo check --no-default-features --features wasm --target wasm32-unknown-unknown

Cargo.lock

Lines changed: 2 additions & 1 deletion

Cargo.toml

Lines changed: 9 additions & 7 deletions
@@ -14,23 +14,25 @@ categories = ["science", "text-processing"]
 exclude = ["tests/*"]
 
 [features]
-default = ["onig"]
+default = ["onig", "hf-hub"]
+hf-hub = ["dep:hf-hub", "dep:ureq"]
+local-only = []
 onig = ["tokenizers/onig",
         "tokenizers/progressbar",
-        "tokenizers/esaxx_fast",
-        "tokenizers/http"]
+        "tokenizers/esaxx_fast"]
 
 fancy-regex = ["tokenizers/fancy-regex",
                "tokenizers/progressbar",
-               "tokenizers/esaxx_fast",
-               "tokenizers/http"]
+               "tokenizers/esaxx_fast"]
+wasm = ["local-only",
+        "tokenizers/unstable_wasm"]
 
 [dependencies]
 tokenizers = { version = "0.21", default-features = false }
 safetensors = "0.5"
 ndarray = "0.15"
-hf-hub = { version = "0.4", default-features = false, features = ["ureq"] }
-ureq = "2"
+hf-hub = { version = "0.4", default-features = false, features = ["ureq"], optional = true }
+ureq = { version = "2", optional = true }
 clap = { version = "4.0", features = ["derive"] }
 anyhow = "1.0"
 serde_json = "1.0"
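
Note on the feature split in Cargo.toml: `hf-hub` and `ureq` are now optional and only pulled in by the `hf-hub` feature, so any code path that downloads from the Hub has to be compiled conditionally. The crate's own gating code is not part of this diff. The following is only a minimal sketch of one way a loader could branch on the feature, where `ModelSource` and `resolve_source` are hypothetical names introduced for illustration:

```rust
/// Hypothetical description of where a model will be loaded from.
#[derive(Debug, PartialEq)]
pub enum ModelSource {
    LocalPath(String),
    HubRepo(String),
}

/// Hypothetical helper: decide how to load `spec`.
/// `exists_locally` is passed in so the sketch never touches the filesystem.
#[allow(unexpected_cfgs)] // `feature` cfgs are normally declared by Cargo
pub fn resolve_source(spec: &str, exists_locally: bool) -> Result<ModelSource, String> {
    if exists_locally {
        return Ok(ModelSource::LocalPath(spec.to_string()));
    }
    // `cfg!` evaluates at compile time: true only when built with `--features hf-hub`
    // (or with default features, which now include it).
    if cfg!(feature = "hf-hub") {
        Ok(ModelSource::HubRepo(spec.to_string()))
    } else {
        Err(format!(
            "'{}' is not a local path and Hub downloads are disabled",
            spec
        ))
    }
}
```

Built with `--no-default-features --features onig,local-only`, a call such as `resolve_source("org/model", false)` would return the error branch, which matches the behaviour the `local-only` CI job is meant to exercise.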

README.md

Lines changed: 30 additions & 0 deletions
@@ -145,6 +145,36 @@ cargo build --release
 * **Batch Processing:** Encodes multiple sentences in batches.
 * **Configurable Encoding:** Allows customization of maximum sequence length and batch size during encoding.
 
+### Feature flags
+
+The crate exposes a few feature combinations for different runtimes:
+
+* `default`: native build with `onig` tokenization and optional Hugging Face Hub downloads
+* `fancy-regex`: alternative tokenizer backend for native builds
+* `local-only`: disable remote model downloads and restrict loading to local paths or `from_bytes(...)`
+* `wasm`: minimal WebAssembly-oriented feature set for in-memory loading via `from_bytes(...)`
+
+Typical invocations are:
+
+* native local-only build:
+  `cargo build --no-default-features --features onig,local-only`
+* wasm check:
+  `RUSTFLAGS='--cfg getrandom_backend="wasm_js"' cargo check --no-default-features --features wasm --target wasm32-unknown-unknown`
+
+The `wasm` feature is intended for `wasm32-unknown-unknown` builds that load models
+from in-memory bytes, for example after fetching assets over HTTP or embedding them
+into the binary. Direct filesystem access is usually not available in browser-style
+WebAssembly environments, so callers should pass file contents through `from_bytes(...)`.
+Remote Hugging Face downloads are not available in this mode.
+
+For `wasm32-unknown-unknown`, `getrandom` also requires a target-specific backend
+configuration. The minimal check command is:
+
+```bash
+RUSTFLAGS='--cfg getrandom_backend="wasm_js"' \
+cargo check --no-default-features --features wasm --target wasm32-unknown-unknown
+```
+
 ## What is Model2Vec?
 
 Model2Vec is a technique to distill large sentence transformer models into highly efficient static embedding models. This process significantly reduces model size and computational requirements for inference. For a detailed understanding of how Model2Vec works, including the distillation process and model training, please refer to the [main Model2Vec Python repository](https://github.com/MinishLab/model2vec) and its [documentation](https://github.com/MinishLab/model2vec/blob/main/docs/what_is_model2vec.md).
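
Note on the in-memory loading path described in the README addition: the signature of the crate's `from_bytes(...)` is not shown in this commit, so it is not reproduced here. As a rough illustration of what working from bytes instead of file paths involves, the sketch below reads the JSON header of a safetensors buffer using only the standard library. It relies on the safetensors layout (an 8-byte little-endian header length followed by the JSON header); the function name `safetensors_header` is hypothetical.

```rust
/// Hypothetical helper: return the JSON header of a safetensors buffer
/// that is already in memory (fetched over HTTP, embedded in the binary, etc.).
pub fn safetensors_header(bytes: &[u8]) -> Result<&str, String> {
    // The buffer starts with the header length as a little-endian u64.
    let prefix = bytes.get(..8).ok_or("buffer shorter than 8 bytes")?;
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(prefix);
    let header_len = u64::from_le_bytes(len_bytes);

    // Compare in u64 first so the cast below cannot truncate on 32-bit targets
    // such as wasm32.
    let available = (bytes.len() - 8) as u64;
    if header_len > available {
        return Err("buffer shorter than declared header".to_string());
    }
    let end = 8 + header_len as usize;

    // The header itself is UTF-8 encoded JSON.
    std::str::from_utf8(&bytes[8..end]).map_err(|e| e.to_string())
}
```

Because the function only borrows a byte slice, the same code compiles for native targets and for `wasm32-unknown-unknown` without any filesystem or network dependency.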
