Commit df712d9

Merge branch 'main' into aneubeck/sampling
2 parents 04da223 + f460d14 commit df712d9

61 files changed

Lines changed: 3472 additions & 882 deletions

.github/dependabot.yaml

Lines changed: 4 additions & 2 deletions
@@ -4,9 +4,11 @@ updates:
   - package-ecosystem: "cargo"
     directory: "/"
     schedule:
-      interval: "weekly"
+      interval: "cron"
+      cronjob: "0 5 2 * *" # Second day of each month at 05:00 UTC

   - package-ecosystem: "github-actions"
     directory: "/"
     schedule:
-      interval: "weekly"
+      interval: "cron"
+      cronjob: "0 5 2 * *" # Second day of each month at 05:00 UTC
Lines changed: 289 additions & 0 deletions
@@ -0,0 +1,289 @@
---
name: update-deps
description: Keep dependencies up-to-date. Discovers outdated deps via dependabot alerts/PRs, creates one PR per ecosystem, iterates until CI is green, then assigns for review.
user-invocable: true
---

# Update Dependencies

Automate the full dependency update lifecycle: discover what's outdated, apply updates grouped by ecosystem, fix breakage, get CI green, and hand off for human review.

## Repository context

This is a Rust workspace containing utility crates published to crates.io. All dependency update PRs target the **`main`** branch.

Dependabot is configured (`.github/dependabot.yaml`) to open PRs against `main` on the 2nd of each month. This skill gathers individual dependabot PRs, combines updates by ecosystem, fixes any breakage, gets CI green, and creates consolidated PRs for human review.

### Crates in this workspace

| Crate | Description |
|---|---|
| **bpe** | Fast byte-pair encoding |
| **bpe-openai** | OpenAI tokenizers built on bpe |
| **geo_filters** | Probabilistic cardinality estimation |
| **string-offsets** | UTF-8/UTF-16/Unicode position conversion (with WASM/JS bindings) |

Supporting packages (not published): `bpe-tests`, `bpe-benchmarks`.

### Ecosystems in this repo

| Ecosystem | Directories | Notes |
|---|---|---|
| **cargo** | `/` (workspace root) | Deps declared per-crate; `Cargo.lock` at workspace root pins versions |
| **github-actions** | `.github/workflows/` | CI and publish workflows |
| **npm** | `crates/string-offsets/js/` | JS bindings for string-offsets (WASM) |

### Build and validation commands

```bash
make build     # cargo build --all-targets --all-features
make build-js  # npm run compile in crates/string-offsets/js
make lint      # cargo fmt --check + cargo clippy (deny warnings, forbid unwrap_used)
make test      # cargo test + doc tests
```

CI runs on `ubuntu-latest` with the `mold` linker. The lint job depends on build.

## Workflow

### 1. Assess repo state

Determine the repo identity and confirm the target branch.

```bash
git remote get-url origin  # extract owner/repo
git fetch origin main
git rev-parse --verify origin/main
```
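
The "extract owner/repo" step above is left to the reader. One way to do it is sketched below, assuming the remote uses one of the two common GitHub URL forms. The function name `parse_owner_repo` and the `octo-org/example` sample are illustrative and not part of this repository.

```bash
# Hypothetical helper: derive "owner/repo" from a Git remote URL.
# Assumes one of these forms:
#   https://github.com/OWNER/REPO(.git)
#   git@github.com:OWNER/REPO(.git)
parse_owner_repo() (
  url=$1
  url=${url%/}           # drop a trailing slash, if any
  url=${url%.git}        # drop a trailing ".git", if any
  repo=${url##*/}        # last path segment
  rest=${url%/*}         # everything before the last "/"
  owner=${rest##*[:/]}   # segment after the last ":" or "/"
  printf '%s/%s\n' "$owner" "$repo"
)
```

It could be called as `parse_owner_repo "$(git remote get-url origin)"`, with the result substituted for `{owner}/{repo}` in the later `gh api` calls.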
58+
59+
Detect which ecosystems have pending updates:
60+
61+
```bash
62+
[ -f Cargo.toml ] && echo "cargo"
63+
ls .github/workflows/*.yml .github/workflows/*.yaml 2>/dev/null && echo "github-actions"
64+
[ -f crates/string-offsets/js/package.json ] && echo "npm"
65+
```
66+
67+
Report discovered ecosystems to the user.
68+
69+
### 2. Gather dependency intelligence
70+
71+
Fetch open dependabot PRs:
72+
73+
```bash
74+
gh pr list --author 'app/dependabot' --base main --state open --json number,title,headRefName,labels --limit 100
75+
```
76+
77+
Fetch open dependabot alerts:
78+
79+
```bash
80+
gh api --paginate /repos/{owner}/{repo}/dependabot/alerts --jq '[.[] | select(.state=="open") | {number: .number, package: .security_vulnerability.package.name, ecosystem: .security_vulnerability.package.ecosystem, severity: .security_advisory.severity, summary: .security_advisory.summary}]'
81+
```
82+
83+
For ecosystems without dependabot coverage or when running ad-hoc, use native tooling:
84+
85+
- **cargo:** `cargo update --dry-run`
86+
- **npm:** find directories containing `package.json`, then run `npm outdated --json || true` in each (npm exits non-zero when updates exist)
87+
88+
Also fetch the advisory URLs for any security-related updates. Individual alert details are at `https://github.com/{owner}/{repo}/security/dependabot/{alert_number}`. Fetch alert numbers and GHSA IDs via:
89+
90+
```bash
91+
gh api --paginate /repos/{owner}/{repo}/dependabot/alerts --jq '[.[] | {number: .number, state, package: .security_vulnerability.package.name, ecosystem: .security_vulnerability.package.ecosystem, severity: .security_advisory.severity, ghsa_id: .security_advisory.ghsa_id, summary: .security_advisory.summary}]'
92+
```
93+
94+
Include both open and auto_dismissed/dismissed alerts — the update may resolve alerts in any state.
95+
96+
Cross-reference and group all updates by ecosystem. Present a summary to the user:
97+
98+
- How many updates per ecosystem
99+
- Which have security alerts (with severity, GHSA IDs, and advisory links)
100+
- Which dependabot PRs already exist
101+
102+
**Flag high-risk upgrades.** Before proceeding, explicitly call out upgrades that carry elevated risk:
103+
104+
- **Major version bumps** — likely contain breaking API changes
105+
- **Packages with wide blast radius** — for this repo, pay special attention to: `serde`, `itertools`, `regex-automata`, `wasm-bindgen`, `criterion`, and the Rust toolchain itself
106+
- **Multiple major bumps in the same PR** — each major bump multiplies the risk; consider splitting them
107+
108+
Present the risk assessment to the user and recommend which upgrades to include vs. defer. When in doubt, prefer a smaller, safe update over an ambitious one that might break.
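
The "major version bump" check above can be approximated mechanically. The sketch below assumes plain `MAJOR.MINOR.PATCH` version strings and follows the caret-requirement convention used by Cargo and npm, where a minor bump on a `0.x` version also counts as breaking. The name `is_breaking_bump` is hypothetical, and the `0.0.z` case (where every release is breaking) is not handled.

```bash
# Hypothetical helper: exit 0 if moving from version $1 to version $2 is
# likely breaking under caret-requirement rules, exit 1 otherwise.
# Assumption: versions look like MAJOR.MINOR.PATCH, optionally prefixed with "v".
is_breaking_bump() (
  old=${1#v}; new=${2#v}                        # tolerate a leading "v"
  old_major=${old%%.*}; new_major=${new%%.*}
  [ "$old_major" != "$new_major" ] && exit 0    # 1.x -> 2.x
  if [ "$old_major" = "0" ]; then
    old_rest=${old#*.}; new_rest=${new#*.}
    # For 0.x versions, a change in the minor component is breaking.
    [ "${old_rest%%.*}" != "${new_rest%%.*}" ] && exit 0
  fi
  exit 1
)
```

For example, `is_breaking_bump 0.4.1 0.5.0 && echo "flag for review"` would flag the upgrade, while `3.1.0` to `3.2.0` would not be flagged.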

### 3. Create branch and apply updates

For each selected ecosystem, starting from `main`:

```bash
git checkout main
git pull origin main
git checkout -b deps/{ecosystem}-updates-$(date +%Y-%m-%d)
```

Apply updates using ecosystem-appropriate tooling:

**cargo:**

```bash
cargo update
# For major bumps, edit Cargo.toml version constraints then:
cargo check
```

This is a Cargo workspace — always run from the repo root. All crate `Cargo.toml` files are in `crates/`. The `Cargo.lock` at the root is the single source of truth.

**npm:**

```bash
cd crates/string-offsets/js
npm update
npm install
```

**github-actions:**

- Parse workflow YAML files in `.github/workflows/` for `uses:` directives
- For each action with an outdated version (from dependabot PRs/alerts), update the SHA or version tag
- Be careful to preserve comments and formatting

### 4. Build, lint, and test locally

Always run:

```bash
make lint   # cargo fmt --check + clippy with deny warnings
make test   # cargo test with backtrace
make build  # full workspace build (all targets, all features)
```

If npm dependencies changed:

```bash
make build-js  # npm compile for string-offsets JS binding
```

**If the build/lint/test fails:**

1. Read the error output carefully
2. Analyze what broke — likely API changes, type errors, or deprecation removals
3. Make the necessary code changes to fix the breakage
4. Run the pipeline again
5. Repeat up to 3 times

If still failing after 3 iterations, report the situation to the user and ask for guidance. Do not push broken code.
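
The bounded fix-and-retry loop described above can be sketched as a small shell function. This is only an illustration of the control flow: the name `validate_with_retries` and the fix hook are hypothetical, and in practice the "fix" step is the manual analysis and code change from steps 1 to 3.

```bash
# Hypothetical helper: run a validation command up to MAX times, calling a
# fix hook between failed attempts.
# Usage: validate_with_retries MAX FIX_HOOK COMMAND [ARGS...]
# Exits 0 on the first successful run, 1 if every attempt fails.
validate_with_retries() (
  max=$1; fix_hook=$2; shift 2
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    "$@" && exit 0
    echo "validation attempt $attempt of $max failed" >&2
    if [ "$attempt" -lt "$max" ]; then
      "$fix_hook" "$attempt" || exit 1   # give up if the fix step itself fails
    fi
    attempt=$((attempt + 1))
  done
  exit 1
)
```

A call such as `validate_with_retries 3 apply_fixes make lint` (where `apply_fixes` is a placeholder for whatever repairs the breakage) stops after three failed runs, which is the point at which the skill asks the user for guidance.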

### 5. Commit and push

Stage all changes and commit with a descriptive message:

```bash
git add -A
git commit -m "chore(deps): update {ecosystem} dependencies

Updated packages:
- package-a: 1.0.0 → 2.0.0
- package-b: 3.1.0 → 3.2.0

{If code changes were needed:}
Fixed breaking changes:
- Updated X API usage for package-a v2

Supersedes: #{dependabot_pr_1}, #{dependabot_pr_2}
"
```
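
The "Updated packages" section of the message above can be generated rather than typed by hand. The sketch below assumes the updates are already available as whitespace-separated `name old new` triples, one per line. Both the input format and the name `format_updated_packages` are assumptions made for illustration.

```bash
# Hypothetical helper: read "name old new" triples on stdin and print the
# "Updated packages" section of the commit message.
format_updated_packages() {
  echo "Updated packages:"
  while read -r name old new; do
    [ -n "$name" ] || continue                 # skip blank lines
    printf '%s\n' "- $name: $old → $new"
  done
}
```

The output can be written to a file and passed to `git commit -F`, which avoids quoting problems with multi-line `-m` arguments.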

Push the branch:

```bash
git push -u origin HEAD
```

### 6. Create the PR

**Title:** `chore(deps): update {ecosystem} dependencies`

**Body should include:**

- List of updated dependencies with version changes (old → new)
- Any security alerts resolved — for each, link to the specific dependabot alert (`https://github.com/{owner}/{repo}/security/dependabot/{alert_number}`) and the GHSA advisory (`https://github.com/advisories/GHSA-xxxx-xxxx-xxxx`), along with severity and summary
- **High-risk changes flagged for reviewer attention** (major version bumps, wide-blast-radius packages)
- Code changes made to fix breakage (if any)
- References to superseded dependabot PRs
- Note that this was generated by the update-deps skill

Write the body to a temp file and create the PR **targeting `main`**:

```bash
gh pr create --title "chore(deps): update {ecosystem} dependencies" --body-file /tmp/deps-pr-body.md --base main
rm /tmp/deps-pr-body.md
```

### 7. Monitor CI and iterate on failures

Watch the PR's checks:

```bash
gh pr checks {pr_number} --watch --fail-fast
```

**If checks fail:**

1. Get the failed run details:

   ```bash
   gh run list --branch {branch} --status failure --json databaseId,name --limit 1
   gh run view {run_id} --log-failed
   ```

2. Analyze the failure — CI runs on `ubuntu-latest` with `mold` linker, which may differ from local builds.

3. Fix the issue locally, commit, and push:

   ```bash
   git add -A
   git commit -m "fix: resolve CI failure in {ecosystem} dep update

   {Brief description of what failed and why}"
   git push
   ```

4. Monitor again. Repeat up to 3 iterations total.

5. If still failing after 3 pushes, report to the user with the failure details and ask for help.

### 8. Close superseded dependabot PRs

For each dependabot PR that this update supersedes:

```bash
gh pr close {dependabot_pr_number} --comment "Superseded by #{new_pr_number} which includes this update along with other {ecosystem} dependency updates."
```

### 9. Assign for review

Request review from CODEOWNERS or a user-provided reviewer (not the PR author):

```bash
gh pr edit {pr_number} --add-reviewer {reviewer_login}
```

Report the final PR URL and a summary of what was done.

## Guidelines

- **All PRs target `main`.** There is no separate dev branch.
- **Never push to `main` directly.** Always work on a feature branch.
- **Never push code that doesn't pass `make lint` and `make test`.** If you can't fix it in 3 tries, stop and ask.
- **Be conservative with major version bumps.** If a major version update breaks things and the fix isn't obvious, skip that package and note it in the PR description.
- **Regenerate lockfiles.** Always regenerate `Cargo.lock` and `package-lock.json` after updating — don't just edit manifests.
- **One ecosystem at a time.** Complete the full cycle (update → build → push → PR → CI green) for one ecosystem before moving to the next.
- **If no updates are needed** for an ecosystem, skip it and tell the user.
- **Security alerts take priority.** Address security alerts first within each ecosystem.
- **Clippy is strict.** This repo forbids `unwrap_used` outside tests and denies all warnings. New dependency versions may trigger new clippy lints — fix them.

## Edge cases

- **Cargo workspace:** Dependencies are declared per-crate but share a single `Cargo.lock` at the workspace root. Always run `cargo update` and `cargo check` from the repo root.
- **npm:** Look for `package.json` files to discover npm packages rather than hardcoding paths — the repo layout may change.
- **WASM builds:** After updating `wasm-bindgen` or related deps, verify `make build-js` still works — WASM toolchain version mismatches are common.
- **Rate limits:** If `gh api` hits rate limits, wait and retry. Report to user if persistent.
- **Nothing to update:** Report cleanly and move to the next ecosystem (or exit).
- **Merge conflicts on push:** Rebase on `main` and retry: `git fetch origin main && git rebase origin/main`.
- **Branch already exists:** If `deps/{ecosystem}-updates-{date}` already exists, append a counter or ask user.
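
For the "branch already exists" case, appending a counter could look like the sketch below. It assumes the existing branch names are supplied on stdin, one per line. The name `next_branch_name` and the `-2`, `-3` suffix scheme are illustrative choices, not something this skill prescribes.

```bash
# Hypothetical helper: print the first unused branch name.
# $1 is the desired base name; existing branch names arrive on stdin.
next_branch_name() (
  base=$1
  existing=$(cat)
  candidate=$base
  n=2
  # -x matches whole lines only, -F treats the name as a literal string.
  while printf '%s\n' "$existing" | grep -qxF -- "$candidate"; do
    candidate="$base-$n"
    n=$((n + 1))
  done
  printf '%s\n' "$candidate"
)
```

One possible invocation is `git branch --list --format='%(refname:short)' | next_branch_name "deps/cargo-updates-$(date +%Y-%m-%d)"`. This only considers local branches, so remote branches would need to be listed as well if they matter.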

.github/workflows/ci.yaml

Lines changed: 6 additions & 6 deletions
@@ -21,9 +21,9 @@ jobs:
     name: Build
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v6

-      - uses: rui314/setup-mold@702b1908b5edf30d71a8d1666b724e0f0c6fa035
+      - uses: rui314/setup-mold@725a8794d15fc7563f59595bd9556495c0564878

       - name: Build
         run: make build
@@ -36,9 +36,9 @@ jobs:
     runs-on: ubuntu-latest
     needs: build
     steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v6

-      - uses: rui314/setup-mold@702b1908b5edf30d71a8d1666b724e0f0c6fa035
+      - uses: rui314/setup-mold@725a8794d15fc7563f59595bd9556495c0564878

       - name: Check formatting and clippy
         run: make lint
@@ -47,9 +47,9 @@ jobs:
     name: Test
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v6

-      - uses: rui314/setup-mold@702b1908b5edf30d71a8d1666b724e0f0c6fa035
+      - uses: rui314/setup-mold@725a8794d15fc7563f59595bd9556495c0564878

       - name: Run unit tests
         run: make test

.github/workflows/publish.yaml

Lines changed: 4 additions & 5 deletions
@@ -5,6 +5,7 @@ on:

 permissions:
   contents: read
+  id-token: write

 jobs:
   publish-npm:
@@ -13,8 +14,8 @@ jobs:
       run:
         working-directory: crates/string-offsets/js
     steps:
-      - uses: actions/checkout@v4
-      - uses: actions/setup-node@v4
+      - uses: actions/checkout@v6
+      - uses: actions/setup-node@v6
         with:
           node-version: 22
           registry-url: https://registry.npmjs.org/
@@ -24,6 +25,4 @@ jobs:
       - run: npm run compile
       - run: npm test
       - run: echo "Publishing string-offsets"
-      - run: npm whoami; npm --ignore-scripts publish
-        env:
-          NODE_AUTH_TOKEN: ${{secrets.NPM_TOKEN}}
+      - run: npm --ignore-scripts publish --provenance

Makefile

Lines changed: 6 additions & 0 deletions
@@ -24,6 +24,7 @@ build:

 .PHONY: build-js
 build-js:
+	which wasm-pack || cargo install wasm-pack
 	npm --prefix crates/string-offsets/js install
 	npm --prefix crates/string-offsets/js run compile

@@ -32,6 +33,11 @@ test:
 	RUST_BACKTRACE=1 cargo test
 	# Amazingly, `--all-targets` causes doc-tests not to run.
 	RUST_BACKTRACE=1 cargo test --doc
+	# Check that geo_filters compiles with each feature in isolation and with no features
+	cargo check -p geo_filters --no-default-features
+	cargo check -p geo_filters --features test-support
+	cargo check -p geo_filters --features serde
+	cargo check -p geo_filters --features evaluation

 .PHONY: test-ignored
 test-ignored:

README.md

Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,7 @@ A collection of useful algorithms written in Rust. Currently contains:
 - [`bpe`](crates/bpe): fast, correct, and novel algorithms for the [Byte Pair Encoding Algorithm](https://en.wikipedia.org/wiki/Large_language_model#BPE) which are particularly useful for chunking of documents.
 - [`bpe-openai`](crates/bpe-openai): Fast tokenizers for OpenAI token sets based on the `bpe` crate.
 - [`consistent-hashing`](crates/consistent-hashing): constant time consistent hashing algorithms with support for replication and bounded load.
+- [`sparse-ngrams`](crates/sparse-ngrams): fast sparse n-gram extraction from byte slices. Selects variable-length n-grams (2–8 bytes) deterministically using bigram frequency priorities, suitable for substring search indexes.
 - [`string-offsets`](crates/string-offsets): converts string positions between bytes, chars, UTF-16 code units, and line numbers. Useful when sending string indices across language boundaries.

 ## Background
## Background
