Commit 18ad00b

chore: improve skills to 100% review score and bump to v0.2.0
- Add trigger hints and code snippets to both skills
- Add checkpoints after each step
- Extract module reference and troubleshooting into linked files
- Bump codeflash-skills tile to 0.2.0
1 parent 6718e66 commit 18ad00b

6 files changed

Lines changed: 173 additions & 72 deletions

tessl.json

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -71,7 +71,7 @@
7171
"version": "0.1.0"
7272
},
7373
"codeflash/codeflash-skills": {
74-
"version": "0.1.0"
74+
"version": "0.2.0"
7575
}
7676
}
7777
}
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
1+
# Module Reference
2+
3+
| Feature area | Primary module | Key files |
4+
|-------------|----------------|-----------|
5+
| New optimization strategy | `optimization/` | `function_optimizer.py`, `optimizer.py` |
6+
| New test type | `verification/`, `models/` | `test_runner.py`, `pytest_plugin.py`, `test_type.py` |
7+
| New AI service endpoint | `api/` | `aiservice.py` |
8+
| New language support | `languages/` | Create new `languages/<lang>/support.py` |
9+
| Context extraction change | `context/` | `code_context_extractor.py` |
10+
| New CLI command | `cli_cmds/` | `cli.py` |
11+
| New config option | `setup/`, `code_utils/` | `config_consts.py`, `setup/detector.py` |
12+
| Discovery filter | `discovery/` | `functions_to_optimize.py` |
13+
| PR/result changes | `github/`, `result/` | Relevant handlers |
Lines changed: 76 additions & 26 deletions
@@ -1,27 +1,23 @@
11
---
22
name: add-codeflash-feature
3-
description: Step-by-step workflow for adding a new feature to the codeflash codebase
3+
description: >
4+
Guides implementation of new functionality in the codeflash optimization engine.
5+
Use when adding a feature, building new functionality, implementing a new
6+
optimization strategy, adding a language backend, creating an API endpoint,
7+
extending the verification pipeline, or developing any new codeflash capability.
8+
Covers module identification, Result type patterns, config, types, tests, and
9+
quality checks.
410
---
511

612
# Add Codeflash Feature
713

8-
Use this workflow when implementing a new feature in the codeflash codebase.
14+
Use this workflow when implementing new functionality in the codeflash codebase — new optimization strategies, language backends, API endpoints, CLI commands, config options, or pipeline extensions.
915

1016
## Step 1: Identify Target Modules
1117

12-
Determine which module(s) need modification based on the feature:
18+
Determine which module(s) need modification. See [MODULE_REFERENCE.md](MODULE_REFERENCE.md) for the full mapping of feature areas to modules and key files.
1319

14-
| Feature area | Primary module | Key files |
15-
|-------------|----------------|-----------|
16-
| New optimization strategy | `optimization/` | `function_optimizer.py`, `optimizer.py` |
17-
| New test type | `verification/`, `models/` | `test_runner.py`, `pytest_plugin.py`, `test_type.py` |
18-
| New AI service endpoint | `api/` | `aiservice.py` |
19-
| New language support | `languages/` | Create new `languages/<lang>/support.py` |
20-
| Context extraction change | `context/` | `code_context_extractor.py` |
21-
| New CLI command | `cli_cmds/` | `cli.py` |
22-
| New config option | `setup/`, `code_utils/` | `config_consts.py`, `setup/detector.py` |
23-
| Discovery filter | `discovery/` | `functions_to_optimize.py` |
24-
| PR/result changes | `github/`, `result/` | Relevant handlers |
20+
**Checkpoint**: Read the target files and understand existing patterns before writing any code. Look for similar features already implemented as reference.
2521

2622
## Step 2: Follow Result Type Pattern
2723

@@ -43,33 +39,76 @@ if not is_successful(result):
4339
value = result.unwrap()
4440
```
4541

42+
**Checkpoint**: Verify your function signatures match the `Result` pattern used in surrounding code. Not all functions use `Result` — match the convention of the module you're modifying.
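
The calling convention this checkpoint refers to can be sketched as below. This is a minimal illustration only: `Success`, `Failure`, and `parse_positive` are hypothetical stand-ins, and the `Result` type that codeflash actually imports may differ, so follow whatever the surrounding module uses.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar, Union

T = TypeVar("T")
E = TypeVar("E")


@dataclass(frozen=True)
class Success(Generic[T]):
    """Hypothetical success wrapper (not codeflash's own class)."""

    value: T

    def unwrap(self) -> T:
        return self.value


@dataclass(frozen=True)
class Failure(Generic[E]):
    """Hypothetical failure wrapper carrying an error description."""

    error: E

    def failure(self) -> E:
        return self.error


Result = Union[Success[T], Failure[E]]


def is_successful(result: object) -> bool:
    # A result is successful only when it is the Success variant.
    return isinstance(result, Success)


def parse_positive(text: str) -> Result[int, str]:
    # Return a Failure instead of raising, so callers branch explicitly.
    try:
        number = int(text)
    except ValueError:
        return Failure(f"not an integer: {text!r}")
    if number <= 0:
        return Failure(f"not positive: {number}")
    return Success(number)
```

Callers check `is_successful(result)` first and only then call `unwrap()`, which mirrors the snippet shown in Step 2.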
43+
4644
## Step 3: Add Configuration Constants
4745

4846
If the feature needs configurable thresholds or limits:
4947

5048
1. Add constants to `code_utils/config_consts.py`
51-
2. If effort-dependent, add to `EFFORT_VALUES` dict with values for `LOW`, `MEDIUM`, `HIGH`
52-
3. Add a corresponding `EffortKeys` enum entry
53-
4. Access via `get_effort_value(EffortKeys.MY_KEY, effort_level)`
49+
2. If effort-dependent, add to `EFFORT_VALUES` dict with values for all three levels:
50+
```python
51+
# In config_consts.py:
52+
class EffortKeys(str, Enum):
53+
    MY_NEW_KEY = "MY_NEW_KEY"
54+
55+
EFFORT_VALUES: dict[str, dict[EffortLevel, Any]] = {
56+
# ... existing entries ...
57+
    EffortKeys.MY_NEW_KEY.value: {
58+
        EffortLevel.LOW: 1,
59+
        EffortLevel.MEDIUM: 3,
60+
        EffortLevel.HIGH: 5,
61+
    },
62+
}
63+
```
64+
3. Access via `get_effort_value(EffortKeys.MY_NEW_KEY, effort_level)`
65+
66+
**Checkpoint**: Skip this step if the feature doesn't need configuration. Not every feature requires new constants.
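
A self-contained sketch of how the lookup in this step might fit together. The `EffortLevel` member values, the body of `get_effort_value`, and the `missing_levels` helper are assumptions made for illustration; the real definitions live in `code_utils/config_consts.py` and may differ.

```python
from enum import Enum
from typing import Any


class EffortLevel(str, Enum):
    # Member values are assumed for this sketch.
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class EffortKeys(str, Enum):
    MY_NEW_KEY = "MY_NEW_KEY"


EFFORT_VALUES: dict[str, dict[EffortLevel, Any]] = {
    EffortKeys.MY_NEW_KEY.value: {
        EffortLevel.LOW: 1,
        EffortLevel.MEDIUM: 3,
        EffortLevel.HIGH: 5,
    },
}


def get_effort_value(key: EffortKeys, effort: EffortLevel) -> Any:
    # Hypothetical body: a plain two-level dictionary lookup.
    return EFFORT_VALUES[key.value][effort]


def missing_levels() -> dict[str, set[EffortLevel]]:
    # Hypothetical helper: report keys that lack a value for some effort level.
    gaps: dict[str, set[EffortLevel]] = {}
    for key, levels in EFFORT_VALUES.items():
        absent = set(EffortLevel) - set(levels)
        if absent:
            gaps[key] = absent
    return gaps
```

A check like `missing_levels()` is one way to catch an entry that defines only some of the three levels before it surfaces as a `KeyError` at runtime.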
5467

5568
## Step 4: Add Domain Types
5669

5770
If new data structures are needed:
5871

5972
1. Add Pydantic models or frozen dataclasses to `models/models.py` or `models/function_types.py`
60-
2. Use `@dataclass(frozen=True)` for immutable data
61-
3. Use `BaseModel` for models that need serialization
62-
4. Keep `function_types.py` dependency-free (no imports from other codeflash modules)
73+
2. Use `@dataclass(frozen=True)` for immutable data, `BaseModel` for models that need serialization
74+
3. Keep `function_types.py` dependency-free — no imports from other codeflash modules
75+
76+
Example following existing patterns:
77+
```python
78+
# In models/models.py:
79+
@dataclass(frozen=True)
80+
class MyNewType:
81+
    name: str
81+
    value: int
82+
    source: OptimizedCandidateSource
84+
85+
# For serializable models:
86+
class MyNewModel(BaseModel):
87+
    items: list[MyNewType] = []
88+
```
89+
90+
**Checkpoint**: Skip this step if you can reuse existing types. Check `models/models.py` for types that already fit your needs.
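
To show what `frozen=True` buys in practice, here is a small standard-library sketch. `CandidateSummary` is a hypothetical type used only for this example and is not part of codeflash.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class CandidateSummary:
    """Hypothetical immutable record used only for illustration."""

    name: str
    value: int


summary = CandidateSummary(name="loop-unroll", value=3)

# Assigning to a field of a frozen dataclass raises FrozenInstanceError.
try:
    summary.value = 4  # type: ignore[misc]
except FrozenInstanceError:
    mutated = False
else:
    mutated = True

# Build a modified copy instead of mutating in place.
updated = replace(summary, value=4)
```

Because frozen dataclasses with the default `eq=True` are hashable, instances can also be used as dictionary keys or set members.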
6391

6492
## Step 5: Write Tests
6593

6694
Follow existing test patterns:
6795

68-
1. Create test files in the `tests/` directory mirroring the source structure
69-
2. Use pytest's `tmp_path` fixture for temp directories
70-
3. Always call `.resolve()` on Path objects
96+
1. Create test files in `tests/` mirroring the source structure (e.g., `tests/test_optimization/test_my_feature.py`)
97+
2. Use pytest's `tmp_path` fixture for temp directories — never `NamedTemporaryFile`
98+
3. Always call `.resolve()` on Path objects and `.as_posix()` for string conversion
7199
4. Assert full string equality for code context tests — no substring matching
72-
5. Remember the pytest plugin patches `time`, `random`, `uuid`, `datetime` — don't rely on real values
100+
5. The pytest plugin patches `time`, `random`, `uuid`, `datetime` — never rely on real values in verification tests
101+
102+
```python
103+
def test_my_feature(tmp_path: Path) -> None:
104+
    test_file = tmp_path / "test_module.py"
105+
    test_file.write_text("def foo(): return 1", encoding="utf-8")
106+
    result = my_operation(test_file.resolve())
107+
    assert is_successful(result)
108+
    assert result.unwrap() == expected_value
109+
```
110+
111+
**Checkpoint**: Run the new tests in isolation before proceeding: `uv run pytest tests/path/to/test_file.py -x`
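
The path rules in this step (`.resolve()` plus `.as_posix()`) can be sketched with a small helper. `normalized` is a hypothetical name introduced here, not an existing codeflash function.

```python
from pathlib import Path


def normalized(path: Path) -> str:
    # resolve() makes the path absolute and collapses ".." segments;
    # as_posix() renders it with forward slashes on every platform.
    return path.resolve().as_posix()
```

Comparing `normalized(a) == normalized(b)` avoids spurious mismatches between relative and absolute spellings of the same file, which is the kind of difference that tends to show up only in CI.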
73112

74113
## Step 6: Run Quality Checks
75114

@@ -86,11 +125,22 @@ uv run mypy codeflash/
86125
uv run pytest tests/path/to/relevant/tests -x
87126
```
88127

128+
**If checks fail**:
129+
- `prek run` failures: Fix formatting/lint issues reported by ruff, then re-run
130+
- `mypy` failures: Fix type errors — common issues are missing return types, wrong `Optional` usage, or missing imports in `TYPE_CHECKING` block
131+
- Test failures: Fix the failing test or the implementation, then re-run
132+
89133
## Step 7: Language Support Considerations
90134

91135
If the feature needs to work across languages:
92136

93-
1. Check if the feature uses language-specific APIs — use `get_language_support(identifier)` from `languages/registry.py`
137+
1. Use `get_language_support(identifier)` from `languages/registry.py` — never import language classes directly
94138
2. Current language is a singleton: `set_current_language()` / `current_language()` from `languages/current.py`
95139
3. Use `is_python()` / `is_javascript()` guards for language-specific branches
96-
4. New language support classes must use `@register_language` decorator
140+
4. New language support classes must use `@register_language` decorator and be instantiable without arguments
141+
142+
**Checkpoint**: Skip this step if the feature is Python-only. Most features don't need multi-language support.
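
The registry rules above can be illustrated with a toy decorator-based registry. Everything in this block is a stand-in written for this example: the real `register_language` and `get_language_support` live in `languages/registry.py`, and the `identifier` class attribute, the `_REGISTRY` dict, and `ToySupport` are assumptions.

```python
class UnsupportedLanguageError(Exception):
    """Raised when no support class is registered for an identifier."""


_REGISTRY: dict[str, type] = {}


def register_language(cls: type) -> type:
    # Instantiate once so a constructor that needs arguments fails fast
    # with TypeError at registration time rather than at lookup time.
    cls()
    _REGISTRY[cls.identifier] = cls
    return cls


def get_language_support(identifier: str) -> object:
    try:
        support_cls = _REGISTRY[identifier]
    except KeyError:
        raise UnsupportedLanguageError(identifier) from None
    # Callers never import the class; they only receive an instance.
    return support_cls()


@register_language
class ToySupport:
    identifier = "toy"

    def file_extensions(self) -> tuple[str, ...]:
        return (".toy",)
```

Routing every lookup through one function is what lets callers stay unaware of concrete language classes, which is why direct imports are discouraged in item 1.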
143+
144+
## Troubleshooting
145+
146+
If you run into issues, see [TROUBLESHOOTING.md](TROUBLESHOOTING.md) for common problems and fixes (circular imports, `UnsupportedLanguageError`, CI path failures, Pydantic validation errors, token limit exceeded).
Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,9 @@
1+
# Troubleshooting
2+
3+
| Problem | Likely cause | Fix |
4+
|---------|-------------|-----|
5+
| Circular import at startup | Importing from `models/` in a module loaded early | Move import into `TYPE_CHECKING` block or use lazy import |
6+
| `UnsupportedLanguageError` | Language modules not registered yet | Call `_ensure_languages_registered()` or use `get_language_support()` which does it automatically |
7+
| Tests pass locally but fail in CI | Path differences (absolute vs relative) | Always use `.resolve()` on Path objects |
8+
| `ValidationError` from Pydantic | Invalid code passed to `CodeString` | Check that generated code passes syntax validation for the target language |
9+
| `encoded_tokens_len` exceeds limit | Context too large | Reduce helper functions or split into read-only vs read-writable |
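
The circular-import fix in the first row combines two standard Python techniques, sketched below. `fractions.Fraction` stands in for a codeflash model class purely so the example runs on its own; `halve` and `make_half` are hypothetical names.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen by type checkers only; never executed at runtime,
    # so it cannot take part in an import cycle.
    from fractions import Fraction


def halve(value: "Fraction") -> "Fraction":
    # String annotations keep the name unresolved at import time.
    return value / 2


def make_half() -> "Fraction":
    # Lazy import: deferred until the function is actually called.
    from fractions import Fraction

    return halve(Fraction(1))
```

Use the `TYPE_CHECKING` block when the name is needed only in annotations, and the lazy import when the module is needed at runtime but not at import time.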
Lines changed: 73 additions & 44 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,10 @@
11
---
22
name: debug-optimization-failure
3-
description: Debug why a codeflash optimization failed at any pipeline stage
3+
description: >
4+
Diagnose why a codeflash optimization produced no results or failed silently.
5+
Use when an optimization run errors out, returns no candidates, or all candidates
6+
are rejected. Walks through discovery, ranking, context limits, AI service,
7+
test verification, deduplication, and repair stages.
48
---
59

610
# Debug Optimization Failure
@@ -11,85 +15,110 @@ Use this workflow when an optimization run fails or produces no results. Work th
1115

1216
Determine if the function was discovered by `FunctionVisitor`.
1317

14-
1. Look at the discovery output or logs for the function name
15-
2. Check `discovery/functions_to_optimize.py` — the `FunctionVisitor` filters out:
16-
- Functions that are too small or trivial
17-
- Functions matching exclude patterns in config
18-
- Functions already optimized (`was_function_previously_optimized()`)
19-
3. Verify the function file is under the configured `module-root`
18+
1. Search logs for the function name in discovery output:
19+
```python
20+
# In discovery/functions_to_optimize.py, FunctionVisitor filters out:
21+
# - Functions matching exclude patterns in pyproject.toml [tool.codeflash]
22+
# - Functions already optimized (was_function_previously_optimized())
23+
# - Functions outside the configured module-root
24+
```
25+
2. Verify the function file is under the configured `module-root` in `pyproject.toml`
26+
3. Check if the function was previously optimized — look for it in the optimization history
2027

21-
**If not discovered**: Check config patterns, file location, and function size.
28+
**Checkpoint**: If the function doesn't appear in discovery output, fix config patterns or file location before proceeding.
2229

2330
## Step 2: Check Ranking
2431

2532
If trace data is used, check if the function was ranked high enough.
2633

27-
1. Look at `benchmarking/function_ranker.py` output
28-
2. The function's **addressable time** must exceed `DEFAULT_IMPORTANCE_THRESHOLD=0.001`
29-
3. Addressable time = own time + callee time / call count
34+
1. Look at `benchmarking/function_ranker.py` output for the function's addressable time
35+
2. The function must exceed `DEFAULT_IMPORTANCE_THRESHOLD=0.001`:
36+
```python
37+
# Addressable time = own time + callee time / call count
38+
# Grep for the function in ranking output:
39+
# grep -i "function_name" in ranking logs
40+
```
41+
3. Functions below the threshold are silently skipped
3042

31-
**If ranked too low**: The function doesn't spend enough time to be worth optimizing.
43+
**Checkpoint**: If ranked too low, the function doesn't spend enough time to be worth optimizing. No fix needed — this is expected.
3244

3345
## Step 3: Check Context Token Limits
3446

3547
Verify the function's context fits within token limits.
3648

37-
1. Check `OPTIMIZATION_CONTEXT_TOKEN_LIMIT=16000` and `TESTGEN_CONTEXT_TOKEN_LIMIT=16000` in `code_utils/config_consts.py`
38-
2. Token counting is done by `encoded_tokens_len()` in `code_utils/code_utils.py`
39-
3. Large helper function chains or deep dependency trees can blow the limit
49+
1. Check thresholds in `code_utils/config_consts.py`:
50+
```python
51+
OPTIMIZATION_CONTEXT_TOKEN_LIMIT = 16000 # tokens
52+
TESTGEN_CONTEXT_TOKEN_LIMIT = 16000 # tokens
53+
```
54+
2. Token counting uses `encoded_tokens_len()` from `code_utils/code_utils.py`
55+
3. Common causes: large helper function chains, deep dependency trees, large class hierarchies
4056

41-
**If context too large**: The function has too many dependencies. Consider refactoring to reduce context size.
57+
**Checkpoint**: If context exceeds limits, the function is rejected. Consider refactoring to reduce dependencies or splitting large modules.
4258

4359
## Step 4: Check AI Service Response
4460

4561
Verify the AI service returned valid candidates.
4662

47-
1. Check logs for `AiServiceClient` request/response
48-
2. Look for HTTP errors (non-200 status codes)
49-
3. Verify `_get_valid_candidates()` parsed the response — empty `code_strings` means invalid markdown code blocks
50-
4. Check if all candidates were filtered out during parsing
63+
1. Look for HTTP errors in logs:
64+
```
65+
# Error patterns to search for:
66+
"Error generating optimized candidates"
67+
"Error generating jit rewritten candidate"
68+
"cli-optimize-error-caught"
69+
"cli-optimize-error-response"
70+
```
71+
2. Check `_get_valid_candidates()` in `api/aiservice.py` — empty `code_strings` after `CodeStringsMarkdown.parse_markdown_code()` means the LLM returned malformed code blocks
72+
3. Verify API key is valid (`get_codeflash_api_key()`)
5173

52-
**If no candidates returned**: Check API key, network connectivity, and service status.
74+
**Checkpoint**: If no candidates returned, check API key, network, and service status before proceeding.
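
One way to automate the log search in item 1 is a small scanner over log lines. The four patterns are the ones listed above; the function name `find_ai_service_errors` and the way logs are supplied (an iterable of strings) are assumptions for this sketch.

```python
from typing import Iterable

ERROR_PATTERNS = (
    "Error generating optimized candidates",
    "Error generating jit rewritten candidate",
    "cli-optimize-error-caught",
    "cli-optimize-error-response",
)


def find_ai_service_errors(log_lines: Iterable[str]) -> list[tuple[int, str]]:
    """Return (line number, matched pattern) pairs, with 1-based line numbers."""
    hits: list[tuple[int, str]] = []
    for number, line in enumerate(log_lines, start=1):
        for pattern in ERROR_PATTERNS:
            if pattern in line:
                hits.append((number, pattern))
    return hits
```

An empty result suggests the request itself succeeded, in which case the parsing path in item 2 is the next place to look.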
5375

5476
## Step 5: Check Test Failures
5577

5678
Determine if candidates failed behavioral or benchmark tests.
5779

58-
1. **Behavioral failures**: Compare return values, stdout, pass/fail status between original baseline and candidate
59-
- Check `TestDiffScope`: `RETURN_VALUE`, `STDOUT`, `DID_PASS`
60-
- Look at JUnit XML results for specific test failures
61-
2. **Benchmark failures**: Check if candidate met `MIN_IMPROVEMENT_THRESHOLD=0.05` (5% speedup)
62-
3. **Stability failures**: Check if timing was stable within `STABILITY_WINDOW_SIZE=0.35`
80+
1. **Behavioral failures** — compare return values, stdout, pass/fail between baseline and candidate:
81+
```python
82+
# TestDiffScope enum values to look for:
83+
# RETURN_VALUE - function returned different value
84+
# STDOUT - different stdout output
85+
# DID_PASS - test passed/failed differently
86+
```
87+
2. **Benchmark failures** — candidate must beat `MIN_IMPROVEMENT_THRESHOLD=0.05` (5% speedup)
88+
3. **Stability failures** — timing must be stable within `STABILITY_WINDOW_SIZE=0.35` (35% of iterations)
89+
4. Check JUnit XML test results in the temp directory for specific failure messages
6390

64-
**If behavioral failure**: The optimization changed the function's behavior. Check test diffs for specific mismatches.
65-
**If benchmark failure**: The optimization didn't provide enough speedup.
91+
**Checkpoint**: Behavioral failure = optimization changed behavior (check test diffs). Benchmark failure = not fast enough. Stability failure = noisy timing environment.
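
The benchmark check in item 2 can be sketched as follows. The gain formula (time saved relative to the candidate's runtime) and the strict `>` comparison are assumptions for illustration; confirm both against the codeflash source before relying on them.

```python
MIN_IMPROVEMENT_THRESHOLD = 0.05  # value quoted in item 2 above


def performance_gain(original_ns: int, candidate_ns: int) -> float:
    # Assumed definition: time saved, relative to the candidate's runtime.
    if candidate_ns <= 0:
        raise ValueError("candidate runtime must be positive")
    return (original_ns - candidate_ns) / candidate_ns


def beats_threshold(
    original_ns: int,
    candidate_ns: int,
    threshold: float = MIN_IMPROVEMENT_THRESHOLD,
) -> bool:
    # Assumed comparison: the gain must strictly exceed the threshold.
    return performance_gain(original_ns, candidate_ns) > threshold
```

Under these assumptions, a candidate that runs in 100 ns against a 102 ns baseline is rejected as a benchmark failure even though it is nominally faster.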
6692

6793
## Step 6: Check Deduplication
6894

6995
Verify candidates weren't deduplicated away.
7096

71-
1. `CandidateEvaluationContext.ast_code_to_id` tracks normalized code → candidate mapping
72-
2. `normalize_code()` from `code_utils/deduplicate_code.py` normalizes AST for comparison
73-
3. If all candidates normalize to the same code, only one is actually tested
97+
1. `CandidateEvaluationContext.ast_code_to_id` tracks normalized AST → candidate mapping
98+
2. `normalize_code()` from `code_utils/deduplicate_code.py` strips comments/whitespace and normalizes the AST
99+
3. If all candidates normalize to identical code, only the first is tested — the rest copy its results
74100

75-
**If all duplicates**: The LLM generated the same optimization multiple times. Try higher effort level.
101+
**Checkpoint**: If all duplicates, the LLM generated the same optimization repeatedly. Try a higher effort level for more diverse candidates.
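
The deduplication idea can be approximated with the standard `ast` module. This is not codeflash's `normalize_code()`; it is a simplified stand-in that treats two candidates as duplicates when their parsed ASTs dump to the same string, and `normalize` and `first_owner` are hypothetical names.

```python
import ast


def normalize(code: str) -> str:
    # Parsing discards comments and formatting; ast.dump omits line and
    # column attributes by default, so layout differences disappear.
    return ast.dump(ast.parse(code))


def first_owner(candidates: dict[str, str]) -> dict[str, str]:
    """Map each candidate id to the id of the first candidate with the same AST."""
    seen: dict[str, str] = {}
    owner: dict[str, str] = {}
    for candidate_id, code in candidates.items():
        owner[candidate_id] = seen.setdefault(normalize(code), candidate_id)
    return owner
```

If every id maps to the same owner, all candidates were structurally identical and only one was effectively evaluated.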
76102

77103
## Step 7: Check Repair/Refinement
78104

79105
If initial candidates failed, check repair and refinement stages.
80106

81-
1. Repair only runs if fewer than `MIN_CORRECT_CANDIDATES=2` passed
82-
2. Repair sends `AIServiceCodeRepairRequest` with test diffs
83-
3. Check `REPAIR_UNMATCHED_PERCENTAGE_LIMIT` — if too many tests failed, repair is skipped
84-
4. Refinement only runs on top valid candidates
107+
1. Repair only triggers if fewer than `MIN_CORRECT_CANDIDATES=2` passed behavioral tests
108+
2. Repair sends `AIServiceCodeRepairRequest` with `TestDiff` objects showing what went wrong
109+
3. Check `REPAIR_UNMATCHED_PERCENTAGE_LIMIT` (effort-dependent: 0.2/0.3/0.4) — if too many tests failed, repair is skipped entirely
110+
4. Refinement only runs on the top valid candidates (count depends on effort level)
85111

86-
**If repair also failed**: The optimization approach may not work for this function.
112+
**Checkpoint**: If repair also fails, the optimization approach likely doesn't work for this function. The function may rely on side effects or external state that the LLM can't safely optimize.
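
The repair gating described in items 1 and 3 can be sketched as a single predicate. The mapping of 0.2/0.3/0.4 to low/medium/high effort, the string effort keys, and the inclusive `<=` comparison are all assumptions made for this example; `should_attempt_repair` is a hypothetical name.

```python
MIN_CORRECT_CANDIDATES = 2  # value quoted in item 1 above

# Assumed mapping of the 0.2/0.3/0.4 values to effort levels.
REPAIR_UNMATCHED_PERCENTAGE_LIMIT = {"low": 0.2, "medium": 0.3, "high": 0.4}


def should_attempt_repair(num_correct: int, unmatched_fraction: float, effort: str) -> bool:
    # Enough candidates already passed behavioral tests: no repair needed.
    if num_correct >= MIN_CORRECT_CANDIDATES:
        return False
    # Too many mismatching tests: repair is skipped entirely.
    return unmatched_fraction <= REPAIR_UNMATCHED_PERCENTAGE_LIMIT[effort]
```

Reading the two conditions together explains why a run can end with neither a passing candidate nor a repair attempt.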
87113

88-
## Key Files to Check
114+
## Key Files Reference
89115

90-
- `optimization/function_optimizer.py` — Main optimization loop, `determine_best_candidate()`
91-
- `verification/test_runner.py` — Test execution
92-
- `api/aiservice.py` — AI service communication
93-
- `code_utils/config_consts.py` — Thresholds
94-
- `context/code_context_extractor.py` — Context extraction
95-
- `models/models.py` — `CandidateEvaluationContext`, `TestResults`
116+
| File | What to check |
117+
|------|---------------|
118+
| `optimization/function_optimizer.py` | Main loop, `determine_best_candidate()` |
119+
| `verification/test_runner.py` | Test subprocess execution |
120+
| `api/aiservice.py` | AI service requests/responses |
121+
| `code_utils/config_consts.py` | All thresholds and limits |
122+
| `context/code_context_extractor.py` | Context extraction and token counting |
123+
| `models/models.py` | `CandidateEvaluationContext`, `TestResults`, `TestDiff` |
124+
| `code_utils/deduplicate_code.py` | AST normalization for deduplication |

tiles/codeflash-skills/tile.json

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
{
22
"name": "codeflash/codeflash-skills",
3-
"version": "0.1.0",
3+
"version": "0.2.0",
44
"summary": "Procedural workflows for developing and debugging codeflash",
55
"private": true,
66
"skills": {
