
feat: native tokenization engine for BaseModel#195

Draft
ptomecek wants to merge 2 commits into main from pit/tokenization

Conversation

@ptomecek
Collaborator

Summary

Replace dask-based tokenization with a standalone, composable engine that works with any pydantic BaseModel.

API

  • model.model_token — deterministic hex digest of a model's data (and optionally behavior)
  • __ccflow_tokenizer__ ClassVar — swap the tokenizer engine per-class or globally
  • __ccflow_tokenizer_deps__ ClassVar — declare standalone function dependencies for behavior hashing

Architecture

Two independent axes of composition:

  • FunctionCollector (which functions to hash) × SourceTokenizer (how to hash them)
  • DefaultTokenizer() — data only
  • DefaultTokenizer.with_bytecode() — data + bytecode-hashed own methods (default for behavior)
  • DefaultTokenizer.with_ast() — data + AST-normalized own methods
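As a rough stdlib illustration of the two axes (a sketch of the idea, not the ccflow engine itself): a data-only token hashes field values, while a bytecode-flavored token additionally hashes a method's compiled code.

```python
import hashlib

# Illustrative sketch only -- NOT the ccflow implementation. It mimics
# "data only" vs "data + bytecode-hashed methods" composition.

def data_token(fields: dict) -> str:
    # Deterministic hex digest over sorted field items.
    payload = repr(sorted(fields.items())).encode()
    return hashlib.sha256(payload).hexdigest()

def bytecode_token(func) -> str:
    # Hash raw bytecode plus constants. As the PR notes, this kind of
    # token changes across Python minor versions.
    h = hashlib.sha256()
    h.update(func.__code__.co_code)
    h.update(repr(func.__code__.co_consts).encode())
    return h.hexdigest()

class Model:
    def __init__(self, x: int):
        self.x = x

    def compute(self) -> int:
        return self.x * 2

a, b = Model(1), Model(1)
assert data_token(a.__dict__) == data_token(b.__dict__)        # same data
assert data_token(Model(2).__dict__) != data_token(a.__dict__) # data change
assert bytecode_token(Model.compute) == bytecode_token(Model.compute)
```

The real engine composes these strategies as pluggable objects (`FunctionCollector` × `SourceTokenizer`) rather than free functions.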

What changed

| File | Change |
| --- | --- |
| `ccflow/utils/tokenize.py` | Core engine — `SourceTokenizer`/`FunctionCollector` ABCs, `normalize_token` singledispatch, `DefaultTokenizer` |
| `ccflow/base.py` | `model_token` property, `__ccflow_tokenizer__` ClassVar, cache invalidation |
| `ccflow/callable.py` | `ModelEvaluationContext.model_token` — strips transparent evaluator layers |
| `ccflow/evaluators/common.py` | Simplified `cache_key()` to use `model_token` directly |
| `ccflow/utils/__init__.py` | Export new types |
| `pyproject.toml` | Remove dask from tokenization path |
| `docs/wiki/Tokenization.md` | Comprehensive wiki page |
| `ccflow/tests/utils/test_tokenize.py` | 182 test cases |

Extension points

  1. normalize_token.register(MyType) — custom type handlers via singledispatch
  2. __ccflow_tokenize__() — custom canonical form for any object
  3. FunctionCollector / SourceTokenizer subclasses — custom strategies
  4. Field(default_factory=...) — inject external state (package versions, file checksums, env vars) into tokens
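Extension point 1 follows the standard `functools.singledispatch` pattern. A self-contained sketch of that pattern (the real `normalize_token` lives in `ccflow.utils.tokenize` and ships with numpy/pandas handlers):

```python
from functools import singledispatch

# Sketch of the singledispatch extension pattern described above; the
# handlers here are illustrative, not ccflow's actual handler set.

@singledispatch
def normalize_token(obj):
    # Fallback: rely on repr for simple builtins.
    return repr(obj)

@normalize_token.register(set)
def _(obj):
    # Sets are unordered; sort members so the token is deterministic.
    return ("set", tuple(sorted(normalize_token(x) for x in obj)))

@normalize_token.register(dict)
def _(obj):
    # Canonicalize key order for stable hashing.
    return ("dict", tuple(sorted((k, normalize_token(v)) for k, v in obj.items())))

assert normalize_token({3, 1, 2}) == normalize_token({2, 3, 1})
assert normalize_token({"b": 2, "a": 1}) == normalize_token({"a": 1, "b": 2})
```

Registering a handler for `MyType` would follow the same shape: `@normalize_token.register(MyType)`.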

Limitations

  • No transitive dependency tracking (only own methods + explicit deps)
  • Bytecode tokens change across Python minor versions
  • AST tokens change on variable renames
  • See wiki page for full false-hit/false-miss tables
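The rename sensitivity above can be seen directly with the standard library (an illustration, not the ccflow tokenizer): a raw AST dump keeps identifier names, while CPython bytecode refers to locals by slot index.

```python
import ast
import hashlib

# Illustration of the limitations above (not ccflow code): renaming a
# local changes an AST dump but leaves the compiled bytecode untouched.

SRC_A = "def f(x):\n    y = x + 1\n    return y\n"
SRC_B = "def f(x):\n    z = x + 1\n    return z\n"  # local renamed y -> z

def ast_digest(src: str) -> str:
    # ast.dump includes identifier names, so a pure rename changes it.
    return hashlib.sha256(ast.dump(ast.parse(src)).encode()).hexdigest()

def bytecode_digest(src: str) -> str:
    ns = {}
    exec(compile(src, "<sketch>", "exec"), ns)
    # co_code addresses locals by index, so the rename is invisible here.
    return hashlib.sha256(ns["f"].__code__.co_code).hexdigest()

assert ast_digest(SRC_A) != ast_digest(SRC_B)             # AST token changes
assert bytecode_digest(SRC_A) == bytecode_digest(SRC_B)   # bytecode token stable
```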

…table cache keys

Evaluators that don't modify return values can now override is_transparent()
to return True, which causes make_evaluation_context() to create
TransparentModelEvaluationContext layers. cache_key() strips these layers so
that wrapping a model with different transparent evaluators does not change
its cache identity or dependency graph node identity.

The is_transparent() method accepts the ModelEvaluationContext, allowing
evaluators to be conditionally transparent based on context.
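The layer-stripping idea can be sketched as follows (class and attribute names here are hypothetical stand-ins; the real logic lives in `ccflow/callable.py` and `ccflow/evaluators/common.py`):

```python
# Hypothetical sketch of stripping transparent evaluation-context layers
# so transparent wrappers do not change cache identity. Names are
# illustrative, not ccflow's actual API.

class Ctx:
    def __init__(self, token, transparent=False, inner=None):
        self.token = token            # stands in for model_token
        self.transparent = transparent
        self.inner = inner            # wrapped context, if any

def cache_key(ctx: "Ctx") -> str:
    # Walk past transparent layers to the underlying context.
    while ctx.transparent and ctx.inner is not None:
        ctx = ctx.inner
    return ctx.token

base = Ctx("deadbeef")
logged = Ctx("ignored", transparent=True, inner=base)    # e.g. a logging evaluator
timed = Ctx("ignored", transparent=True, inner=logged)   # e.g. a timing evaluator

# Wrapping with transparent evaluators leaves the cache key unchanged.
assert cache_key(base) == cache_key(logged) == cache_key(timed) == "deadbeef"
```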

Closes #192

Signed-off-by: Pascal Tomecek <pascal.tomecek@cubistsystematic.com>
@ptomecek ptomecek force-pushed the pit/tokenization branch 9 times, most recently from c18ae02 to 64b564c on April 14, 2026 at 18:46
@github-actions
Contributor

github-actions Bot commented Apr 14, 2026

Test Results

854 tests  +206   852 ✅ +206   1m 44s ⏱️ -2s
  1 suites ±  0     2 💤 ±  0 
  1 files   ±  0     0 ❌ ±  0 

Results for commit bac1375. ± Comparison against base commit 5adf933.

♻️ This comment has been updated with latest results.

@codecov

codecov Bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 93.63745% with 106 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.65%. Comparing base (5adf933) to head (bac1375).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ccflow/tests/utils/test_tokenize.py | 93.99% | 74 Missing and 5 partials ⚠️ |
| ccflow/utils/tokenize.py | 91.91% | 18 Missing and 6 partials ⚠️ |
| ccflow/base.py | 96.29% | 0 Missing and 1 partial ⚠️ |
| ccflow/callable.py | 95.65% | 0 Missing and 1 partial ⚠️ |
| ccflow/evaluators/common.py | 50.00% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@                     Coverage Diff                     @@
##           pit/evaluator_cache_key     #195      +/-   ##
===========================================================
- Coverage                    95.98%   95.65%   -0.34%     
===========================================================
  Files                          140      141       +1     
  Lines                         9797    11445    +1648     
  Branches                       568      619      +51     
===========================================================
+ Hits                          9404    10948    +1544     
- Misses                         275      367      +92     
- Partials                       118      130      +12     

☔ View full report in Codecov by Sentry.

@ptomecek ptomecek force-pushed the pit/tokenization branch 4 times, most recently from 7bccbeb to 1910344 on April 14, 2026 at 19:35
Replace dask-based tokenization with a standalone, composable engine.

- model_token property on BaseModel with automatic cache invalidation
- DefaultTokenizer with pluggable SourceTokenizer × FunctionCollector
- AST and bytecode source hashing (bytecode default)
- Singledispatch normalize_token with handlers for numpy, pandas, etc.
- __ccflow_tokenizer__ ClassVar to swap tokenizer per class or globally
- __ccflow_tokenizer_deps__ ClassVar for standalone function dependencies
- Simplified cache_key() to use model_token directly
- Removed dask dependency from tokenization path
- Comprehensive tests (185 cases) and wiki documentation

Signed-off-by: Pascal Tomecek <pascal.tomecek@cubistsystematic.com>
@NeejWeej
Collaborator

Maybe we could narrow model_token down. Something like:

  • model_token means data token only
  • behavior becomes separate: behavior_version() or model_behavior_token
  • cache key combines them

That removes a lot of conceptual confusion. So for a callable model:

    model.model_token        # explicit field-data snapshot
    model.behavior_version() # code semantics

Then the cache-key generator composes both.
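The composition step at the end of this proposal could look roughly like the following (a hypothetical helper sketched with hashlib; no such API exists in ccflow today):

```python
import hashlib

# Hypothetical composition of the two tokens proposed above -- the
# cache-key generator combines a data token and a behavior token so
# that a change to either invalidates the cache entry.

def compose_cache_key(data_token: str, behavior_token: str) -> str:
    return hashlib.sha256(f"{data_token}:{behavior_token}".encode()).hexdigest()

same = compose_cache_key("d1", "b1")
assert compose_cache_key("d1", "b1") == same   # deterministic
assert compose_cache_key("d1", "b2") != same   # behavior change busts cache
assert compose_cache_key("d2", "b1") != same   # data change busts cache
```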

@ptomecek
Collaborator Author

> Maybe we could narrow model_token down. Something like:
>
>   • model_token means data token only
>   • behavior becomes separate: behavior_version() or model_behavior_token
>   • cache key combines them
>
> That removes a lot of conceptual confusion. So for a callable model:
>
>     model.model_token        # explicit field-data snapshot
>     model.behavior_version() # code semantics
>
> Then the cache-key generator composes both.

Part of me wonders if we should even bother with caching/having it on the model. We could remove any access from the model itself, and just have everything go through the existing tokenizer classes. It would then be the responsibility of the evaluator to do any caching.

The drawback of that is that I think it's then more difficult for people to debug issues, because it's harder to reproduce exactly what was used (e.g., from a notebook)...

Base automatically changed from pit/evaluator_cache_key to main April 16, 2026 13:20
