
feat: native tokenization engine for BaseModel#195

Draft
ptomecek wants to merge 2 commits into main from pit/tokenization

Conversation

@ptomecek
Collaborator

Summary

Replace dask-based tokenization with a standalone, composable engine that works with any pydantic BaseModel.

API

  • model.model_token — deterministic hex digest of a model's data (and optionally behavior)
  • __ccflow_tokenizer__ ClassVar — swap the tokenizer engine per-class or globally
  • __ccflow_tokenizer_deps__ ClassVar — declare standalone function dependencies for behavior hashing

Architecture

Two independent axes of composition:

  • FunctionCollector (which functions to hash) × SourceTokenizer (how to hash them)
  • DefaultTokenizer() — data only
  • DefaultTokenizer.with_bytecode() — data + bytecode-hashed own methods (default for behavior)
  • DefaultTokenizer.with_ast() — data + AST-normalized own methods
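As a rough stdlib illustration of the two axes (a sketch of the idea, not the ccflow engine itself): a data-only token hashes field values, while a bytecode-flavored token additionally hashes a method's compiled code.

```python
import hashlib

# Illustrative sketch only -- NOT the ccflow implementation. It mimics
# "data only" vs "data + bytecode-hashed methods" composition.

def data_token(fields: dict) -> str:
    # Deterministic hex digest over sorted field items.
    payload = repr(sorted(fields.items())).encode()
    return hashlib.sha256(payload).hexdigest()

def bytecode_token(func) -> str:
    # Hash raw bytecode plus constants. As the PR notes, this kind of
    # token changes across Python minor versions.
    h = hashlib.sha256()
    h.update(func.__code__.co_code)
    h.update(repr(func.__code__.co_consts).encode())
    return h.hexdigest()

class Model:
    def __init__(self, x: int):
        self.x = x

    def compute(self) -> int:
        return self.x * 2

a, b = Model(1), Model(1)
assert data_token(a.__dict__) == data_token(b.__dict__)        # same data
assert data_token(Model(2).__dict__) != data_token(a.__dict__) # data change
assert bytecode_token(Model.compute) == bytecode_token(Model.compute)
```

The real engine composes these strategies as pluggable objects (`FunctionCollector` × `SourceTokenizer`) rather than free functions.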

What changed

| File | Change |
| --- | --- |
| `ccflow/utils/tokenize.py` | Core engine — `SourceTokenizer`/`FunctionCollector` ABCs, `normalize_token` singledispatch, `DefaultTokenizer` |
| `ccflow/base.py` | `model_token` property, `__ccflow_tokenizer__` ClassVar, cache invalidation |
| `ccflow/callable.py` | `ModelEvaluationContext.model_token` — strips transparent evaluator layers |
| `ccflow/evaluators/common.py` | Simplified `cache_key()` to use `model_token` directly |
| `ccflow/utils/__init__.py` | Export new types |
| `pyproject.toml` | Remove dask from tokenization path |
| `docs/wiki/Tokenization.md` | Comprehensive wiki page |
| `ccflow/tests/utils/test_tokenize.py` | 182 test cases |

Extension points

  1. normalize_token.register(MyType) — custom type handlers via singledispatch
  2. __ccflow_tokenize__() — custom canonical form for any object
  3. FunctionCollector / SourceTokenizer subclasses — custom strategies
  4. Field(default_factory=...) — inject external state (package versions, file checksums, env vars) into tokens
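Extension point 1 follows the standard `functools.singledispatch` pattern. A self-contained sketch of that pattern (the real `normalize_token` lives in `ccflow.utils.tokenize` and ships with numpy/pandas handlers):

```python
from functools import singledispatch

# Sketch of the singledispatch extension pattern described above; the
# handlers here are illustrative, not ccflow's actual handler set.

@singledispatch
def normalize_token(obj):
    # Fallback: rely on repr for simple builtins.
    return repr(obj)

@normalize_token.register(set)
def _(obj):
    # Sets are unordered; sort members so the token is deterministic.
    return ("set", tuple(sorted(normalize_token(x) for x in obj)))

@normalize_token.register(dict)
def _(obj):
    # Canonicalize key order for stable hashing.
    return ("dict", tuple(sorted((k, normalize_token(v)) for k, v in obj.items())))

assert normalize_token({3, 1, 2}) == normalize_token({2, 3, 1})
assert normalize_token({"b": 2, "a": 1}) == normalize_token({"a": 1, "b": 2})
```

Registering a handler for `MyType` would follow the same shape: `@normalize_token.register(MyType)`.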

Limitations

  • No transitive dependency tracking (only own methods + explicit deps)
  • Bytecode tokens change across Python minor versions
  • AST tokens change on variable renames
  • See wiki page for full false-hit/false-miss tables
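The rename sensitivity above can be seen directly with the standard library (an illustration, not the ccflow tokenizer): a raw AST dump keeps identifier names, while CPython bytecode refers to locals by slot index.

```python
import ast
import hashlib

# Illustration of the limitations above (not ccflow code): renaming a
# local changes an AST dump but leaves the compiled bytecode untouched.

SRC_A = "def f(x):\n    y = x + 1\n    return y\n"
SRC_B = "def f(x):\n    z = x + 1\n    return z\n"  # local renamed y -> z

def ast_digest(src: str) -> str:
    # ast.dump includes identifier names, so a pure rename changes it.
    return hashlib.sha256(ast.dump(ast.parse(src)).encode()).hexdigest()

def bytecode_digest(src: str) -> str:
    ns = {}
    exec(compile(src, "<sketch>", "exec"), ns)
    # co_code addresses locals by index, so the rename is invisible here.
    return hashlib.sha256(ns["f"].__code__.co_code).hexdigest()

assert ast_digest(SRC_A) != ast_digest(SRC_B)             # AST token changes
assert bytecode_digest(SRC_A) == bytecode_digest(SRC_B)   # bytecode token stable
```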

…table cache keys

Evaluators that don't modify return values can now override is_transparent()
to return True, which causes make_evaluation_context() to create
TransparentModelEvaluationContext layers. cache_key() strips these layers so
that wrapping a model with different transparent evaluators does not change
its cache identity or dependency graph node identity.

The is_transparent() method accepts the ModelEvaluationContext, allowing
evaluators to be conditionally transparent based on context.
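The layer-stripping idea can be sketched as follows (class and attribute names here are hypothetical stand-ins; the real logic lives in `ccflow/callable.py` and `ccflow/evaluators/common.py`):

```python
# Hypothetical sketch of stripping transparent evaluation-context layers
# so transparent wrappers do not change cache identity. Names are
# illustrative, not ccflow's actual API.

class Ctx:
    def __init__(self, token, transparent=False, inner=None):
        self.token = token            # stands in for model_token
        self.transparent = transparent
        self.inner = inner            # wrapped context, if any

def cache_key(ctx: "Ctx") -> str:
    # Walk past transparent layers to the underlying context.
    while ctx.transparent and ctx.inner is not None:
        ctx = ctx.inner
    return ctx.token

base = Ctx("deadbeef")
logged = Ctx("ignored", transparent=True, inner=base)    # e.g. a logging evaluator
timed = Ctx("ignored", transparent=True, inner=logged)   # e.g. a timing evaluator

# Wrapping with transparent evaluators leaves the cache key unchanged.
assert cache_key(base) == cache_key(logged) == cache_key(timed) == "deadbeef"
```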

Closes #192

Signed-off-by: Pascal Tomecek <pascal.tomecek@cubistsystematic.com>
@ptomecek ptomecek force-pushed the pit/tokenization branch 9 times, most recently from c18ae02 to 64b564c on April 14, 2026 at 18:46
@github-actions
Contributor

github-actions Bot commented Apr 14, 2026

Test Results

854 tests  +206   852 ✅ +206   1m 44s ⏱️ -2s
  1 suites ±  0     2 💤 ±  0 
  1 files   ±  0     0 ❌ ±  0 

Results for commit bac1375. ± Comparison against base commit 5adf933.

♻️ This comment has been updated with latest results.

@codecov

codecov Bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 93.63745% with 106 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.65%. Comparing base (5adf933) to head (bac1375).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ccflow/tests/utils/test_tokenize.py | 93.99% | 74 Missing and 5 partials ⚠️ |
| ccflow/utils/tokenize.py | 91.91% | 18 Missing and 6 partials ⚠️ |
| ccflow/base.py | 96.29% | 0 Missing and 1 partial ⚠️ |
| ccflow/callable.py | 95.65% | 0 Missing and 1 partial ⚠️ |
| ccflow/evaluators/common.py | 50.00% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@                     Coverage Diff                     @@
##           pit/evaluator_cache_key     #195      +/-   ##
===========================================================
- Coverage                    95.98%   95.65%   -0.34%     
===========================================================
  Files                          140      141       +1     
  Lines                         9797    11445    +1648     
  Branches                       568      619      +51     
===========================================================
+ Hits                          9404    10948    +1544     
- Misses                         275      367      +92     
- Partials                       118      130      +12     

☔ View full report in Codecov by Sentry.

@ptomecek ptomecek force-pushed the pit/tokenization branch 4 times, most recently from 7bccbeb to 1910344 on April 14, 2026 at 19:35
Replace dask-based tokenization with a standalone, composable engine.

- model_token property on BaseModel with automatic cache invalidation
- DefaultTokenizer with pluggable SourceTokenizer × FunctionCollector
- AST and bytecode source hashing (bytecode default)
- Singledispatch normalize_token with handlers for numpy, pandas, etc.
- __ccflow_tokenizer__ ClassVar to swap tokenizer per class or globally
- __ccflow_tokenizer_deps__ ClassVar for standalone function dependencies
- Simplified cache_key() to use model_token directly
- Removed dask dependency from tokenization path
- Comprehensive tests (185 cases) and wiki documentation

Signed-off-by: Pascal Tomecek <pascal.tomecek@cubistsystematic.com>
@NeejWeej
Collaborator

Maybe we could narrow model_token down. Something like:

  • model_token means data token only
  • behavior becomes separate: behavior_version() or model_behavior_token
  • cache key combines them

That removes a lot of conceptual confusion. So for a callable model:

    model.model_token        # explicit field-data snapshot
    model.behavior_version() # code semantics

Then the cache-key generator composes both.
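The composition step at the end of this proposal could look roughly like the following (a hypothetical helper sketched with hashlib; no such API exists in ccflow today):

```python
import hashlib

# Hypothetical composition of the two tokens proposed above -- the
# cache-key generator combines a data token and a behavior token so
# that a change to either invalidates the cache entry.

def compose_cache_key(data_token: str, behavior_token: str) -> str:
    return hashlib.sha256(f"{data_token}:{behavior_token}".encode()).hexdigest()

same = compose_cache_key("d1", "b1")
assert compose_cache_key("d1", "b1") == same   # deterministic
assert compose_cache_key("d1", "b2") != same   # behavior change busts cache
assert compose_cache_key("d2", "b1") != same   # data change busts cache
```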

@ptomecek
Collaborator Author

> Maybe we could narrow model_token down. Something like:
>
>   • model_token means data token only
>   • behavior becomes separate: behavior_version() or model_behavior_token
>   • cache key combines them
>
> That removes a lot of conceptual confusion. So for a callable model:
>
>     model.model_token        # explicit field-data snapshot
>     model.behavior_version() # code semantics
>
> Then the cache-key generator composes both.

Part of me wonders if we should even bother with caching/having it on the model. We could remove any access from the model itself, and just have everything go through the existing tokenizer classes. It would then be the responsibility of the evaluator to do any caching.

The drawback of that is that I think it's then more difficult for people to debug issues, because it's harder to reproduce exactly what was used (e.g., from a notebook)...

Base automatically changed from pit/evaluator_cache_key to main April 16, 2026 13:20
