feat: tokenize endpoints + Property.TextAnalyzer + StopwordPresets (Weaviate 1.37.0+)#381

Merged
bevzzz merged 7 commits into main from feat/tokenize-endpoint on Apr 27, 2026

@mpartipilo mpartipilo commented Apr 21, 2026

Summary

Two related Weaviate 1.37.0 ports, folded into this one PR to keep the 1.37
schema surface together:

1. Tokenize endpoints — port of python-client PR #2012.

Exposes POST /v1/tokenize and POST /v1/schema/{class}/properties/{prop}/tokenize. Public surface mirrors the TS client's tokenize namespace design (weaviate/typescript-client 1.37/introduce-tokenization-namespace), adapted to Go's fluent builder idiom:

  • client.Tokenize().Text().WithText(...).WithTokenization(...).Do(ctx)
  • client.Tokenize().Property().WithClassName(...).WithPropertyName(...).WithText(...).Do(ctx)
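
The two chains above can be sketched with a pair of toy builders. This is only an illustration of the fluent shape and of the endpoint paths named in this description, not the client's actual implementation: the `request`, `textBuilder`, and `propertyBuilder` types, the JSON keys `text` and `tokenization`, and the `"word"` value are all assumptions made for the example.

```go
package main

import (
	"context"
	"fmt"
	"net/url"
)

// request is a hypothetical stand-in for what a builder hands to the HTTP layer.
type request struct {
	Method, Path string
	Body         map[string]string
}

// textBuilder sketches the Text() chain. Field names and JSON keys are assumptions.
type textBuilder struct{ text, tokenization string }

func (b *textBuilder) WithText(s string) *textBuilder         { b.text = s; return b }
func (b *textBuilder) WithTokenization(s string) *textBuilder { b.tokenization = s; return b }

// Do only assembles the request; a real client would send it and decode the result.
func (b *textBuilder) Do(ctx context.Context) (request, error) {
	if err := ctx.Err(); err != nil {
		return request{}, err
	}
	body := map[string]string{"text": b.text, "tokenization": b.tokenization}
	return request{Method: "POST", Path: "/v1/tokenize", Body: body}, nil
}

// propertyBuilder sketches the Property() chain, which targets an existing property.
type propertyBuilder struct{ className, propertyName, text string }

func (b *propertyBuilder) WithClassName(s string) *propertyBuilder    { b.className = s; return b }
func (b *propertyBuilder) WithPropertyName(s string) *propertyBuilder { b.propertyName = s; return b }
func (b *propertyBuilder) WithText(s string) *propertyBuilder         { b.text = s; return b }

// Do fills the {class} and {prop} path segments, escaping each one.
func (b *propertyBuilder) Do(ctx context.Context) (request, error) {
	if err := ctx.Err(); err != nil {
		return request{}, err
	}
	path := fmt.Sprintf("/v1/schema/%s/properties/%s/tokenize",
		url.PathEscape(b.className), url.PathEscape(b.propertyName))
	return request{Method: "POST", Path: path, Body: map[string]string{"text": b.text}}, nil
}

// demo runs both chains and returns "METHOD PATH" for each request.
func demo() []string {
	ctx := context.Background()
	t, _ := new(textBuilder).WithText("Hello World").WithTokenization("word").Do(ctx)
	p, _ := new(propertyBuilder).WithClassName("Article").WithPropertyName("title").
		WithText("Hello World").Do(ctx)
	return []string{t.Method + " " + t.Path, p.Method + " " + p.Path}
}

func main() {
	for _, line := range demo() {
		fmt.Println(line)
	}
}
```

Each `With...` method returns the builder itself, which is what lets the calls chain before the terminal `Do(ctx)`.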

New package weaviate/tokenize/ with 9 Tokenization constants (Word, Lowercase, Whitespace, Field, Trigram, Gse, GseCh, KagomeJa, KagomeKr), AnalyzerConfig, StopwordConfig, TokenizeResult. AnalyzerConfig.AsciiFold is a *AsciiFoldConfig (nil = disabled, non-nil = enabled with optional Ignore list) so the invalid "ignore without fold" state is unrepresentable.
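
To illustrate why a pointer makes the invalid state unrepresentable, here is a minimal sketch using locally defined types. `AnalyzerConfig` and `AsciiFoldConfig` are re-declared here for the example and may differ from the real definitions; the flat `wireAnalyzer` shape is an assumption that borrows the `asciiFold` / `asciiFoldIgnore` field names mentioned for `Property.TextAnalyzer`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AsciiFoldConfig being present means folding is enabled.
// Ignore lists characters exempt from folding and may be empty.
type AsciiFoldConfig struct {
	Ignore []string
}

// AnalyzerConfig is a local sketch, not the client's actual type.
type AnalyzerConfig struct {
	AsciiFold *AsciiFoldConfig // nil = folding disabled
}

// wireAnalyzer is an ASSUMED flat wire shape with a bool plus an ignore list.
// In this flat form, "ignore set but fold off" would be expressible.
type wireAnalyzer struct {
	AsciiFold       bool     `json:"asciiFold,omitempty"`
	AsciiFoldIgnore []string `json:"asciiFoldIgnore,omitempty"`
}

// toWire flattens the nested pointer. An ignore list can only reach the
// wire together with asciiFold=true, because it lives inside the pointer.
func (c AnalyzerConfig) toWire() wireAnalyzer {
	if c.AsciiFold == nil {
		return wireAnalyzer{}
	}
	return wireAnalyzer{AsciiFold: true, AsciiFoldIgnore: c.AsciiFold.Ignore}
}

// encode returns the JSON form of the flattened config.
func encode(c AnalyzerConfig) string {
	b, err := json.Marshal(c.toWire())
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	fmt.Println(encode(AnalyzerConfig{}))                              // folding disabled
	fmt.Println(encode(AnalyzerConfig{AsciiFold: &AsciiFoldConfig{}})) // folding enabled, no ignore list
	fmt.Println(encode(AnalyzerConfig{AsciiFold: &AsciiFoldConfig{Ignore: []string{"é"}}}))
}
```

With this layout a caller has no field in which to set an ignore list unless folding is enabled, so no runtime validation of that combination is needed.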

2. Property.TextAnalyzer + InvertedIndexConfig.StopwordPresets — port of python-client PR #2006.

The vendored weaviate module is bumped 1.36.0 → 1.37.1, which lights up two new model fields that round-trip through the existing ClassCreator / ClassUpdater / ClassGetter builders with no new API:

  • models.Property.TextAnalyzer: asciiFold, asciiFoldIgnore, stopwordPreset
  • models.InvertedIndexConfig.StopwordPresets — named preset → word-list map

Client-side preflight in weaviate/internal/textAnalyzerCheck.go rejects schemas that use these fields when connected to Weaviate < 1.37.0, with a typed WeaviateClientError that names the offending field. schema.New now takes a *db.VersionProvider so the preflight can read the connected server version.
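
A version-gated preflight of this kind might look roughly like the sketch below. It is not the code in weaviate/internal/textAnalyzerCheck.go: the `class`, `property`, and `textAnalyzer` types are trimmed-down hypothetical stand-ins for the schema models, plain `error` values replace the typed WeaviateClientError, and the version parsing deliberately ignores pre-release suffixes.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Hypothetical, trimmed-down stand-ins for the schema models.
type textAnalyzer struct {
	AsciiFold      bool
	StopwordPreset string
}

type property struct {
	Name         string
	TextAnalyzer *textAnalyzer
}

type class struct {
	Name            string
	Properties      []property
	StopwordPresets map[string][]string
}

// minVersion is the first server version that accepts the new fields.
var minVersion = [3]int{1, 37, 0}

// parseVersion reads "major.minor.patch", dropping a leading "v" and any
// "-rc.0" style suffix (a simplification: 1.37.0-rc.0 is treated as 1.37.0).
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	v = strings.TrimPrefix(v, "v")
	if i := strings.IndexAny(v, "-+"); i >= 0 {
		v = v[:i]
	}
	parts := strings.Split(v, ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("unexpected version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("unexpected version %q: %w", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// atLeast reports whether v >= min, comparing major, then minor, then patch.
func atLeast(v, min [3]int) bool {
	for i := range v {
		if v[i] != min[i] {
			return v[i] > min[i]
		}
	}
	return true
}

// checkTextAnalyzer rejects a class that uses the 1.37 fields on an older
// server and names the offending field in the returned error.
func checkTextAnalyzer(serverVersion string, c class) error {
	v, err := parseVersion(serverVersion)
	if err != nil {
		return err
	}
	if atLeast(v, minVersion) {
		return nil
	}
	if len(c.StopwordPresets) > 0 {
		return fmt.Errorf("class %q: invertedIndexConfig.stopwordPresets requires Weaviate >= 1.37.0 (connected to %s)",
			c.Name, serverVersion)
	}
	for _, p := range c.Properties {
		if p.TextAnalyzer != nil {
			return fmt.Errorf("class %q: property %q textAnalyzer requires Weaviate >= 1.37.0 (connected to %s)",
				c.Name, p.Name, serverVersion)
		}
	}
	return nil
}

func main() {
	c := class{
		Name:       "Article",
		Properties: []property{{Name: "title", TextAnalyzer: &textAnalyzer{AsciiFold: true}}},
	}
	fmt.Println(checkTextAnalyzer("1.36.0", c))
	fmt.Println(checkTextAnalyzer("1.37.1", c))
}
```

Failing on the client side gives the caller an error that points at the specific field, rather than whatever an older server would return for a field it does not recognise.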

Out of scope

  • gRPC tokenize (no proto as of 1.37.0).

Test plan

  • go vet -mod=mod ./... clean
  • go build -mod=mod ./... clean
  • go test -mod=mod -count=1 ./test/tokenize/... → passing against Weaviate 1.37.1
  • go test -mod=mod -count=1 ./test/schema/... -run TestTextAnalyzer_integration → 6/6 passing against Weaviate 1.37.1
  • CI green

🤖 Generated with Claude Code

Port of python-client PR #2012, aligned with the TS client's `tokenize`
namespace design. Adds:

- `client.Tokenize().Text()...Do(ctx)` → POST /v1/tokenize
- `client.Tokenize().Property()...Do(ctx)` →
  POST /v1/schema/{class}/properties/{prop}/tokenize

Builder chains follow existing repo conventions (WithText, WithTokenization,
WithAnalyzerConfig, WithStopwordPresets, etc.). `AnalyzerConfig.AsciiFold` is
a nested struct pointer (nil = disabled, non-nil = enabled with optional
Ignore list) so the invalid "ignore without fold" state is unrepresentable.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@orca-security-eu orca-security-eu Bot left a comment

Orca Security Scan Summary

| Status | Check | High | Medium | Low | Info |
| --- | --- | --- | --- | --- | --- |
| Passed | Infrastructure as Code | 0 | 0 | 0 | 0 |
| Passed | SAST | 0 | 0 | 0 | 0 |
| Passed | Secrets | 0 | 0 | 0 | 0 |
| Passed | Vulnerabilities | 0 | 0 | 0 | 0 |

@mpartipilo mpartipilo marked this pull request as ready for review April 21, 2026 17:48
… (Weaviate 1.37.0+)

Port of python-client PR #2006, folded into the tokenize-endpoint PR to
keep the 1.37.0 schema features together.

- `models.Property.TextAnalyzer` (vendored weaviate v1.37.1) — asciiFold,
  asciiFoldIgnore, stopwordPreset — round-trips via ClassCreator/ClassGetter.
- `models.InvertedIndexConfig.StopwordPresets` — named preset → word-list
  map — round-trips via ClassCreator/ClassUpdater.
- Client-side preflight: ClassCreator.Do / ClassUpdater.Do reject schemas
  that use these fields when connected to Weaviate < 1.37.0, with a
  typed WeaviateClientError that names the offending field. Lives in
  weaviate/internal/ so it's reusable but not part of the public API.
- schema.New now takes *db.VersionProvider (was no extra arg) so the
  preflight can read the connected server version.
- go.mod: bump github.com/weaviate/weaviate 1.36.0 → 1.37.1 for the new
  model fields.
- Integration coverage under test/schema/text_analyzer_test.go: stopword
  presets round-trip, update-replaces-preset, remove-in-use rejection,
  combined asciifold + stopword preset, asciifold ignore round-trip, and
  nested property text analyzer.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@mpartipilo mpartipilo changed the title from "feat: add tokenize endpoint support (Weaviate 1.37.0+)" to "feat: tokenize endpoints + Property.TextAnalyzer + StopwordPresets (Weaviate 1.37.0+)" Apr 21, 2026
mpartipilo and others added 5 commits April 21, 2026 23:52
Weaviate v1.37.x declares `go 1.26` in go.mod, which propagates to this
client after `go mod tidy`. The CI workflow pinned Go 1.25, causing
`go: go.mod requires go >= 1.26 (running go 1.25.9; GOTOOLCHAIN=local)`
in the unit-tests, tests-deprecated, and auth-integration jobs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Match the minimum server version the go.mod dep already pins to, and pick
up bug fixes shipped since the rc.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- text_analyzer_test.go: set Vectorizer: "none" on all classes. The
  docker-compose default is text2vec-contextionary, which rejects class
  names like "TestStopwordPresets1" because "stopword" isn't in its
  dictionary.
- docker-compose.yml: opt in to
  ENABLE_EXPERIMENTAL_ALTER_SCHEMA_DROP_VECTOR_INDEX_ENDPOINT so the
  existing TestSchema_integration/DELETE_/schema/.../vectors/.../index
  test keeps working after the server tag bump to 1.37.1 (the endpoint
  became flag-gated in 1.37.1).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
docker-compose-wcs.yml (used by auth_enabled integration runs on :8085)
needs the same ENABLE_EXPERIMENTAL_ALTER_SCHEMA_DROP_VECTOR_INDEX_ENDPOINT
flag as docker-compose.yml; the schema delete-vector-index test runs
against whichever compose the current matrix selects.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@bevzzz bevzzz merged commit b3e417b into main Apr 27, 2026
10 checks passed
@bevzzz bevzzz deleted the feat/tokenize-endpoint branch April 27, 2026 11:41