feat: native tokenization engine for BaseModel #195
Conversation
…table cache keys

Evaluators that don't modify return values can now override `is_transparent()` to return `True`, which causes `make_evaluation_context()` to create `TransparentModelEvaluationContext` layers. `cache_key()` strips these layers so that wrapping a model with different transparent evaluators does not change its cache identity or dependency-graph node identity. The `is_transparent()` method accepts the `ModelEvaluationContext`, allowing evaluators to be conditionally transparent based on context.

Closes #192

Signed-off-by: Pascal Tomecek <pascal.tomecek@cubistsystematic.com>
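The layer-stripping idea can be sketched in a few lines of stand-alone Python. The names below (`Evaluator`, `EvaluationContext`, `cache_key`) mirror the description above, but this is a toy illustration of the concept, not ccflow's actual implementation:

```python
class Evaluator:
    """Toy evaluator; by default assumed to affect results."""

    def is_transparent(self, context) -> bool:
        return False


class LoggingEvaluator(Evaluator):
    """Only observes evaluation; never modifies return values."""

    def is_transparent(self, context) -> bool:
        return True


class EvaluationContext:
    def __init__(self, model, evaluator, parent=None):
        self.model = model
        self.evaluator = evaluator
        self.parent = parent  # the context this evaluator wraps, if any


def cache_key(ctx):
    # Strip transparent layers so that wrapping a model with a transparent
    # evaluator does not change its cache identity.
    while ctx.parent is not None and ctx.evaluator.is_transparent(ctx):
        ctx = ctx.parent
    return (ctx.model, type(ctx.evaluator).__name__)


model = "my_model"
base = EvaluationContext(model, Evaluator())
wrapped = EvaluationContext(model, LoggingEvaluator(), parent=base)
assert cache_key(wrapped) == cache_key(base)
```

Because `is_transparent()` receives the context, a real evaluator could also return `True` or `False` conditionally, e.g. depending on configuration carried by the context.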
Force-pushed c18ae02 to 64b564c
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```diff
@@              Coverage Diff               @@
##   pit/evaluator_cache_key    #195   +/-  ##
=============================================
- Coverage      95.98%   95.65%   -0.34%
=============================================
  Files            140      141       +1
  Lines           9797    11445    +1648
  Branches         568      619      +51
=============================================
+ Hits            9404    10948    +1544
- Misses           275      367      +92
- Partials         118      130      +12
```

☔ View full report in Codecov by Sentry.
Force-pushed 7bccbeb to 1910344
Replace dask-based tokenization with a standalone, composable engine.

- `model_token` property on `BaseModel` with automatic cache invalidation
- `DefaultTokenizer` with pluggable `SourceTokenizer` × `FunctionCollector`
- AST and bytecode source hashing (bytecode default)
- Singledispatch `normalize_token` with handlers for numpy, pandas, etc.
- `__ccflow_tokenizer__` ClassVar to swap tokenizer per class or globally
- `__ccflow_tokenizer_deps__` ClassVar for standalone function dependencies
- Simplified `cache_key()` to use `model_token` directly
- Removed dask dependency from tokenization path
- Comprehensive tests (185 cases) and wiki documentation

Signed-off-by: Pascal Tomecek <pascal.tomecek@cubistsystematic.com>
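The singledispatch normalization mentioned above can be illustrated with a self-contained toy. This `normalize_token` is a stand-in for the one in `ccflow/utils/tokenize.py`, showing only the registration pattern, not the library's handlers:

```python
import hashlib
from functools import singledispatch


@singledispatch
def normalize_token(obj):
    # Fallback: rely on the object's repr.
    return repr(obj)


@normalize_token.register(dict)
def _(obj):
    # Sort keys so that dict insertion order never changes the token.
    return tuple(sorted((k, normalize_token(v)) for k, v in obj.items()))


@normalize_token.register(list)
@normalize_token.register(tuple)
def _(obj):
    return tuple(normalize_token(v) for v in obj)


def tokenize(obj) -> str:
    """Deterministic hex digest of an object's normalized form."""
    return hashlib.sha256(repr(normalize_token(obj)).encode()).hexdigest()


# Equivalent dicts with different insertion order tokenize identically.
assert tokenize({"b": 2, "a": 1}) == tokenize({"a": 1, "b": 2})
```

New types plug in the same way the PR describes for `normalize_token.register(MyType)`: register a handler that returns a canonical, hashable form.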
Force-pushed 1910344 to bac1375
Maybe we could narrow
Part of me wonders if we should even bother with caching/having it on the model. We could remove any access from the model itself and just have everything go through the existing tokenizer classes. It would then be the responsibility of the evaluator to do any caching. The drawback is that I think it's then more difficult for people to debug issues, because it's harder to reproduce exactly what was used (i.e. from a notebook)...
Summary
Replace dask-based tokenization with a standalone, composable engine that works with any pydantic BaseModel.
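The cached-token behavior the summary describes can be sketched in pure Python. Everything here (the `default_tokenizer` helper, the `_token` slot, the `BaseModel` stand-in) is a toy under assumed semantics; the real implementation lives in `ccflow/base.py` and `ccflow/utils/tokenize.py`:

```python
import hashlib
import json


def default_tokenizer(model) -> str:
    # Hash the model's public fields in a canonical (key-sorted) form.
    fields = {k: v for k, v in model.__dict__.items() if not k.startswith("_")}
    payload = json.dumps(fields, sort_keys=True, default=repr)
    return hashlib.sha256(payload.encode()).hexdigest()


class BaseModel:
    # Swappable engine, per-class or globally, as in __ccflow_tokenizer__.
    __ccflow_tokenizer__ = staticmethod(default_tokenizer)

    def __init__(self, **fields):
        self.__dict__.update(fields)

    def __setattr__(self, name, value):
        # Any field mutation invalidates the cached token.
        object.__setattr__(self, "_token", None)
        object.__setattr__(self, name, value)

    @property
    def model_token(self) -> str:
        if self.__dict__.get("_token") is None:
            object.__setattr__(self, "_token", self.__ccflow_tokenizer__(self))
        return self._token


m = BaseModel(x=1, y="a")
t1 = m.model_token   # computed once, then cached
m.x = 2              # mutation invalidates the cache
t2 = m.model_token   # recomputed
assert t1 != t2
```

A subclass can override `__ccflow_tokenizer__` to change how its token is computed without touching the property itself, which is the composition point the API section below relies on.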
API
- `model.model_token` — deterministic hex digest of a model's data (and optionally behavior)
- `__ccflow_tokenizer__` ClassVar — swap the tokenizer engine per-class or globally
- `__ccflow_tokenizer_deps__` ClassVar — declare standalone function dependencies for behavior hashing

Architecture
Two independent axes of composition:
- `DefaultTokenizer()` — data only
- `DefaultTokenizer.with_bytecode()` — data + bytecode-hashed own methods (default for behavior)
- `DefaultTokenizer.with_ast()` — data + AST-normalized own methods

What changed
- `ccflow/utils/tokenize.py`
- `ccflow/base.py` — `model_token` property, `__ccflow_tokenizer__` ClassVar, cache invalidation
- `ccflow/callable.py` — `ModelEvaluationContext.model_token` strips transparent evaluator layers
- `ccflow/evaluators/common.py` — `cache_key()` to use `model_token` directly
- `ccflow/utils/__init__.py`
- `pyproject.toml`
- `docs/wiki/Tokenization.md`
- `ccflow/tests/utils/test_tokenize.py`

Extension points
- `normalize_token.register(MyType)` — custom type handlers via singledispatch
- `__ccflow_tokenize__()` — custom canonical form for any object
- `FunctionCollector`/`SourceTokenizer` subclasses — custom strategies
- `Field(default_factory=...)` — inject external state (package versions, file checksums, env vars) into tokens

Limitations