Skip to content

Commit 8f7a001

Browse files
author
Brad Kinnard
committed
fix(tokenizer): guard the tiktoken cache with a threading.Lock
The lazy-init code read and wrote two module-level globals without synchronization. Two threads racing on first init could both observe the untested state and both call tiktoken.get_encoding, then write different encoding objects to the cache. The double-checked-locking pattern keeps the fast path lock-free after init while making the first-init transition atomic. The new concurrent-first-init test runs eight workers through _get_tiktoken_enc and asserts they all see the same cached object.
1 parent 5e7d6fc commit 8f7a001

4 files changed

Lines changed: 37 additions & 10 deletions

File tree

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -26,6 +26,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
2626

2727
### Fixed
2828

29+
- `tokenizer._get_tiktoken_enc` is now thread-safe. The first-init fast path takes a module-level lock and re-checks the cached state, so concurrent callers (e.g. editor plugins running in a worker pool) cannot race on `tiktoken.get_encoding`.
2930
- `_is_action_verb` now recognizes `-ing` and `-ed` inflections, including the e-drop ("validating" -> "validate") and doubled-consonant ("scanning" -> "scan") forms. Descriptions like "Validating skills..." or "Used for..." used to score 0 on the action axis even though the leading word was a clear action verb.
3031
- `--format` error message now lists `github` as a valid choice (`cli.py`). The argparse `choices=` set on the `--format` definition has accepted `github` since v1.2.3, but the post-parse runtime check still printed the pre-v1.2.3 four-value list.
3132

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,7 @@
1414

1515
Static analyzer for `SKILL.md` files. Validates frontmatter, body sizing, file references, and cross-agent compatibility against the [agentskills.io specification](https://agentskills.io/specification). No network calls. No LLM API calls. No file mutations.
1616

17-
785 tests cover all rule modules.
17+
786 tests cover all rule modules.
1818

1919
## Install
2020

src/skillcheck/tokenizer.py

Lines changed: 18 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,5 @@
11
import re
2+
import threading
23
from typing import Any
34

45
# Two patterns that cover the token-relevant structure of BPE tokenization:
@@ -18,22 +19,30 @@
1819

1920
# Lazy-cached tiktoken encoding. The BPE merge table is allocated once on
2021
# first use and reused for all subsequent calls, avoiding the per-call
21-
# overhead of ``tiktoken.get_encoding()``.
22+
# overhead of ``tiktoken.get_encoding()``. The lock guards the first-init
23+
# fast path so concurrent worker threads (editor plugins planned for
24+
# future use) cannot both observe the untested state and race on
25+
# ``tiktoken.get_encoding``.
2226
_tiktoken_enc: Any | None = None
2327
_tiktoken_available: bool | None = None # tri-state: None = untested
28+
_tiktoken_lock = threading.Lock()
2429

2530

2631
def _get_tiktoken_enc() -> Any | None:
2732
"""Return a cached tiktoken ``Encoding``, or *None* if unavailable."""
2833
global _tiktoken_enc, _tiktoken_available # noqa: PLW0603
29-
if _tiktoken_available is None:
30-
try:
31-
import tiktoken # type: ignore[import-untyped]
32-
_tiktoken_enc = tiktoken.get_encoding("cl100k_base")
33-
_tiktoken_available = True
34-
except ImportError:
35-
_tiktoken_available = False
36-
return _tiktoken_enc
34+
if _tiktoken_available is not None:
35+
return _tiktoken_enc
36+
with _tiktoken_lock:
37+
# Re-check under the lock so the second arrival sees the cached value.
38+
if _tiktoken_available is None:
39+
try:
40+
import tiktoken # type: ignore[import-untyped]
41+
_tiktoken_enc = tiktoken.get_encoding("cl100k_base")
42+
_tiktoken_available = True
43+
except ImportError:
44+
_tiktoken_available = False
45+
return _tiktoken_enc
3746

3847

3948
def estimate_tokens(text: str) -> int:

tests/test_tokenizer.py

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,7 @@
11
"""Tests for Fix 1: Tokenizer lazy caching."""
22

3+
from concurrent.futures import ThreadPoolExecutor
4+
35
from skillcheck.tokenizer import _get_tiktoken_enc, estimate_tokens
46

57

@@ -30,3 +32,18 @@ def test_tiktoken_enc_cached():
3032
def test_empty_string_returns_one():
3133
"""Empty string returns at least 1 (the floor)."""
3234
assert estimate_tokens("") >= 1
35+
36+
37+
def test_tokenizer_concurrent_first_init_is_consistent():
38+
"""Concurrent first-init calls return the same cached encoding instance.
39+
40+
The first observer pays the tiktoken.get_encoding cost; subsequent
41+
observers find the cache populated under the lock and return the
42+
same object (or all None when tiktoken is not installed).
43+
"""
44+
with ThreadPoolExecutor(max_workers=8) as pool:
45+
results = list(pool.map(lambda _: _get_tiktoken_enc(), range(8)))
46+
distinct = {id(r) for r in results}
47+
assert len(distinct) == 1, (
48+
f"Concurrent first-init produced different encoding objects: {distinct}"
49+
)

0 commit comments

Comments
 (0)