Commit 1d4da22

Harden multilingual decoder verification and restore follow-up test coverage (#82)
## Summary

Follow-up to #67. This PR hardens the multilingual decoder/codegen pipeline by tightening interface completeness checks, generated-artifact hygiene, C/C++ verification semantics, and planner prompt safety. It also restores the extracted multilingual regression tests that were moved out of the original #67 branch for a follow-up PR.

## What changed

### Decoder and verification hardening

- Build C/C++ CMake targets before running `ctest`, so verification does not falsely pass or fail because test executables were never built.
- Treat C/C++ `make test` targets that only compile objects, without actually running tests, as verification errors instead of successful test runs.
- Skip generated/build/cache directories when collecting C/C++ source files for syntax and verification commands.
- Improve C/C++ prompt rules to:
  - avoid editing build/cache/generated artifacts,
  - use full CMake build + CTest commands,
  - avoid relying on undeclared/transitively included helper functions,
  - report explicit syntax-check summaries.

### Generated artifact hygiene

- Add a shared generated-artifact classifier and prompt rule helper.
- Install local `.git/info/exclude` hygiene rules during batch startup.
- Reject persisted generated artifacts before post-verify and after verification runs.
- Prevent batch branches containing generated artifacts from being merged.

### Interface and planner robustness

- Deduplicate repeated whole-file `file_code` blocks in `interfaces.json` before serialization and before planner prompt construction.
- Add interface coverage validation so `plan_tasks` fails fast when `interfaces.json` does not cover all skeleton features.
- Improve Python dependency collection for same-file calls and `self.method()` invocations.
- Save in-progress interface generation to `interfaces.json.partial` and only overwrite the canonical `interfaces.json` after successful completion.
- Allow interface review additions to scaffold missing file entries under existing feature subtrees.

### Parser and language detection fixes

- Classify header-heavy mixed C/C++ repositories as C++ when C and C++ votes appear together.
- Harden fallback string-literal stripping against catastrophic regex backtracking on unterminated escaped strings.

### Final validation behavior

- Propagate smoke-test failures into the final validation result instead of allowing a successful unit-test result to mask a failed smoke check.
- Clarify that `plan --check-only` warning states are not complete/done states and should not allow downstream stages to proceed.

## Test coverage

This PR adds extensive regression coverage for the multilingual pipeline, including:

- generated artifact hygiene,
- interface source deduplication,
- skeleton/interface coverage validation,
- multilingual dependency graph behavior,
- multilingual encoder/codegen behavior,
- planner language support and prompt deduplication,
- C/C++/Go/Rust/TypeScript/JavaScript/Python parser behavior,
- decoder language backends and planning phases,
- zero-test guard behavior,
- final test repair,
- repo language resolution,
- orphan/test/build exclusion handling,
- smoke multilingual coverage.

The diff adds 33 new test files and restores the extracted multilingual tests intended for this follow-up PR.

## Notes for reviewers

- This branch was rebased on top of the squash-merged #67 commit, so the PR diff should now represent only the follow-up hardening and restored tests.
- `run_batch` / `post_verify` now update `.git/info/exclude` with local generated-artifact exclusions. This is intentionally local-only and non-destructive.
- `plan_tasks` now fails on incomplete interface coverage instead of silently planning from stale or partial interfaces.
- C/C++ projects whose `make test` target only compiles objects but does not execute tests will now be rejected as invalid verification results.
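The `interfaces.json.partial` behavior listed under "Interface and planner robustness" can be pictured as a write-then-promote pattern. The following is a minimal sketch under stated assumptions, not the code from this commit: `save_partial` and `promote_partial` are hypothetical names, and using `os.replace` to promote the partial file is an assumption about how the canonical file gets swapped in.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict


def save_partial(canonical: Path, data: Dict[str, Any]) -> Path:
    """Write in-progress data next to the canonical file (hypothetical helper)."""
    partial = canonical.with_name(canonical.name + ".partial")
    partial.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return partial


def promote_partial(canonical: Path) -> None:
    """Swap the partial file in for the canonical one after a successful run."""
    partial = canonical.with_name(canonical.name + ".partial")
    # os.replace overwrites the destination in a single rename step.
    os.replace(partial, canonical)


def demo() -> Dict[str, Any]:
    with tempfile.TemporaryDirectory() as tmp:
        canonical = Path(tmp) / "interfaces.json"
        canonical.write_text(json.dumps({"features": ["old"]}), encoding="utf-8")
        save_partial(canonical, {"features": ["old", "new"]})
        # An interrupted run would stop here, leaving the canonical file untouched.
        before = json.loads(canonical.read_text(encoding="utf-8"))
        promote_partial(canonical)
        after = json.loads(canonical.read_text(encoding="utf-8"))
        return {"before": before, "after": after}
```

The point of the pattern is that a crash between `save_partial` and `promote_partial` leaves the last complete `interfaces.json` in place, which is what lets a later coverage check trust the canonical file.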
1 parent 2265cf2 commit 1d4da22

57 files changed

Lines changed: 8676 additions & 70 deletions

CoderMind/scripts/code_gen/batch_prompts.py

Lines changed: 77 additions & 2 deletions
@@ -26,6 +26,7 @@
 from typing import Any, Dict, List, Optional
 
 from common.execution_state import BatchExecutionState, load_code_gen_state
+from common.generated_artifacts import generated_artifact_prompt_rule
 from common.import_normalizer import build_import_convention_snippet
 from common.paths import (
     CODE_GEN_STATE_FILE as STATE_FILE,
@@ -212,6 +213,7 @@
 for example `5 passed in 0.42s`, `ok ./...`, or `test result: ok`. Copy it
 verbatim from the run you just performed; do NOT invent it. This lets the
 runner cross-check your claim against an independent re-run.
+{summary_fallback_rule}
 
 ## ── Capabilities ─────────────────────────────────────────
 
@@ -413,6 +415,49 @@ def _fallback_test_command(backend: LanguageBackend) -> List[str]:
     return list(_FALLBACK_TEST_COMMANDS.get(backend.name, [backend.prompt_hints().test_framework_name]))
 
 
+def _dynamic_c_family_syntax_command(
+    backend: LanguageBackend,
+    command: List[str],
+) -> str:
+    compiler = shlex.quote(str(command[0]))
+    include_flags: List[str] = []
+    for index, part in enumerate(command):
+        if part == "-I" and index + 1 < len(command):
+            include_flags.append('-I "$PWD"')
+    standard = "-std=c++17" if backend.name == "cpp" else "-std=c99"
+    patterns = (
+        r'\( -name "*.cpp" -o -name "*.cc" -o -name "*.cxx" \)'
+        if backend.name == "cpp"
+        else r'-name "*.c"'
+    )
+    include_text = " ".join(include_flags)
+    return (
+        "bash -lc "
+        + shlex.quote(
+            "mapfile -d '' sources < <(find . "
+            r"\( -path './.git' -o -path './.cmind' -o -path './build' "
+            r"-o -path './node_modules' -o -path './target' "
+            r"-o -path './dist' -o -path './coverage' -o -path './.venv' "
+            r"-o -path './venv' -o -path './CMakeFiles' \) -prune "
+            f"-o -type f {patterns} -print0); "
+            f"if (( ${{#sources[@]}} == 0 )); then echo 'No {backend.prompt_hints().display_name} source files found' >&2; exit 1; fi; "
+            f"{compiler} {standard} {include_text} -Wall -Wextra -fsyntax-only \"${{sources[@]}}\""
+        )
+    )
+
+
+def _cmake_c_family_test_command(command: List[str]) -> str:
+    ctest = shlex.quote(str(command[0]))
+    return (
+        "bash -lc "
+        + shlex.quote(
+            "cmake -S . -B build && "
+            "cmake --build build && "
+            f"{ctest} --test-dir build --output-on-failure"
+        )
+    )
+
+
 def _build_backend_test_cmd(
     backend: LanguageBackend,
     repo_path: Path,
@@ -425,7 +470,12 @@ def _build_backend_test_cmd(
 
     env = backend.detect_env(repo_path) or EnvHandle(project_root=repo_path.resolve())
     try:
-        return _shell_join(backend.test_command(env))
+        command = backend.test_command(env)
+        if backend.name in {"c", "cpp"} and command and "ctest" in Path(str(command[0])).name:
+            return _cmake_c_family_test_command(command)
+        if backend.name in {"c", "cpp"} and "-fsyntax-only" in command:
+            return _dynamic_c_family_syntax_command(backend, command)
+        return _shell_join(command)
     except (ToolchainUnavailable, NotImplementedError, OSError):
         return _shell_join(_fallback_test_command(backend))
 
@@ -513,6 +563,16 @@ def _test_timeout_rule(backend: LanguageBackend) -> str:
     return "- Run long-lived servers, watchers, or interactive commands instead of the exact test command"
 
 
+def _summary_fallback_rule(backend: LanguageBackend, test_command: str) -> str:
+    if backend.name in {"c", "cpp"} and "-fsyntax-only" in test_command:
+        return (
+            "\nFor C/C++ syntax-only commands: if the exact command exits 0 "
+            "and prints no summary line, use exactly "
+            "`PYTEST_SUMMARY: syntax check passed`.\n"
+        )
+    return ""
+
+
 def _build_language_context(backend: LanguageBackend, test_command: str) -> str:
     """Build the target-language prompt section."""
     hints = backend.prompt_hints()
@@ -526,6 +586,13 @@ def _build_language_context(backend: LanguageBackend, test_command: str) -> str:
         f"- Module naming: {hints.module_naming_rule}\n"
         f"- Style: {hints.style_directive}\n"
     )
+    artifact_extra = ""
+    if backend.name in {"c", "cpp"}:
+        artifact_extra = (
+            "If CTest needs arguments or target wiring, change source files "
+            "such as `CMakeLists.txt` or the test source instead."
+        )
+    context += generated_artifact_prompt_rule(artifact_extra)
     if backend.name != "python":
         # The decoder's defaults are Python-centric; without an explicit
         # prohibition the sub-agent tends to add Python helpers (a main.py
@@ -542,6 +609,13 @@ def _build_language_context(backend: LanguageBackend, test_command: str) -> str:
             f"- Run tests ONLY with `{test_command}` ({hints.test_framework_name}). Do NOT wrap, "
             "re-implement, or drive the test suite through pytest or any Python script.\n"
         )
+        if backend.name in {"c", "cpp"}:
+            context += (
+                "- C/C++ tests and examples must be valid standalone translation units. "
+                "If a test or example calls a helper implemented in another `.c`/`.cpp` file, "
+                "create or update a matching header and include that header; do NOT rely on "
+                "transitive `.cpp` inclusion or undeclared functions.\n"
+            )
     else:
         context += (
             "- Do NOT introduce Python-specific files, packages, or pytest conventions unless this is a Python project.\n"
@@ -886,6 +960,7 @@ def build_tdd_prompt(
         dependency_install_capability=_dependency_install_capability(backend, repo_path),
         dependency_management=_dependency_management_text(backend, repo_path),
         test_timeout_rule=_test_timeout_rule(backend),
+        summary_fallback_rule=_summary_fallback_rule(backend, pytest_cmd),
         import_convention=import_convention,
         language_context=_build_language_context(backend, pytest_cmd),
         dependency_context=dep_ctx_str,
@@ -938,7 +1013,7 @@ def build_resume_prompt(
         post_verify_section = (
             "\n\n## ⚠ False-positive PASS detected\n"
             "Your previous attempt ended with `BATCH_RESULT: PASS` and the\n"
-            "PYTEST_SUMMARY line {agent_summary_repr}, but the runner's\n"
+            f"PYTEST_SUMMARY line {agent_summary_repr}, but the runner's\n"
             "independent test-command re-run reported the failure shown below.\n"
             "Possible causes you must investigate:\n"
             "* You did not actually run the exact test command before declaring PASS.\n"

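To see what `_cmake_c_family_test_command` hands to the shell, it can help to quote the script and split it back apart. The snippet below is a minimal sketch that repeats the same quoting approach in a standalone function; `cmake_then_ctest` is a hypothetical name used only for illustration, and the `["ctest"]` input is an assumed example command.

```python
import shlex
from typing import List


def cmake_then_ctest(command: List[str]) -> str:
    """Standalone restatement of the quoting used for the CMake + CTest command."""
    ctest = shlex.quote(str(command[0]))
    script = (
        "cmake -S . -B build && "
        "cmake --build build && "
        f"{ctest} --test-dir build --output-on-failure"
    )
    # The whole script becomes a single argument to `bash -lc`.
    return "bash -lc " + shlex.quote(script)


wrapped = cmake_then_ctest(["ctest"])
argv = shlex.split(wrapped)  # what a POSIX shell would see as argv
```

Because the configure, build, and test steps are chained with `&&`, a failed build stops the chain before `ctest` runs, which matches the stated goal of not reporting results for executables that were never built.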
CoderMind/scripts/code_gen/final_validation.py

Lines changed: 55 additions & 5 deletions
@@ -46,6 +46,32 @@
 )
 
 
+def _fail_final_test_for_smoke_error(
+    result_dict: Dict[str, Any],
+    message: str,
+    *,
+    smoke_dict: Optional[Dict[str, Any]] = None,
+) -> None:
+    """Mark final validation failed because smoke validation failed."""
+    result_dict["success"] = False
+    result_dict["errors"] = max(int(result_dict.get("errors", 0) or 0), 1)
+    result_dict["output"] = message
+    result_dict["next_action"] = (
+        "Unit tests passed, but smoke validation failed. Fix the smoke "
+        "failure and re-run final validation."
+    )
+    result_dict["smoke_test_error"] = message
+    if smoke_dict is None:
+        smoke_dict = {
+            "success": False,
+            "type": "smoke_test",
+            "findings": [{"severity": "error", "message": message}],
+            "error_count": 1,
+            "warning_count": 0,
+        }
+    result_dict["smoke_test"] = smoke_dict
+
+
 def final_test(
     repo_path: Optional[Path] = None,
     state_path: Path = STATE_FILE,
@@ -238,6 +264,8 @@ def final_test(
         actionable = [f for f in smoke_result.findings if f.severity == "error"]
 
         if actionable:
+            remaining = actionable
+            recheck_success = True
             findings_desc = "\n".join(
                 f"- [{f.severity}] {f.message}" for f in actionable
             )
@@ -293,6 +321,7 @@ def final_test(
                 result_dict["smoke_test"] = smoke_result_2.to_dict()
                 result_dict["smoke_repair_attempted"] = True
                 result_dict["post_repair_tests_pass"] = recheck.success
+                recheck_success = recheck.success
                 remaining = [
                     f for f in smoke_result_2.findings
                     if f.severity == "error"
@@ -303,18 +332,39 @@ def final_test(
                     len(remaining), len(actionable),
                     "PASS" if recheck.success else "FAIL",
                 )
+            if remaining or not recheck_success:
+                smoke_dict = result_dict.get("smoke_test")
+                if not isinstance(smoke_dict, dict):
+                    smoke_dict = {}
+                message = (
+                    "Smoke validation failed after unit tests passed. "
+                    f"Remaining smoke errors: {len(remaining)}; "
+                    f"post-repair tests pass: {recheck_success}."
+                )
+                _fail_final_test_for_smoke_error(
+                    result_dict,
+                    message,
+                    smoke_dict=smoke_dict,
+                )
     except ImportError:
         logger.debug("smoke_test module not available, skipping")
     except Exception as exc:
         logger.warning("Smoke test / repair failed: %s", exc)
+        _fail_final_test_for_smoke_error(
+            result_dict,
+            f"Smoke test failed to run: {exc}",
+        )
 
     # Save per-stage results for global_review context
     save_stage_result("final_test", {
-        "success": result.success,
-        "passed": result.passed,
-        "failed": result.failed,
-        "errors": result.errors,
-        "output_tail": "\n".join(result.output.splitlines()[-40:]) if not result.success else "",
+        "success": bool(result_dict.get("success")),
+        "passed": result_dict.get("passed", result.passed),
+        "failed": result_dict.get("failed", result.failed),
+        "errors": result_dict.get("errors", result.errors),
+        "output_tail": (
+            "\n".join(str(result_dict.get("output", "")).splitlines()[-40:])
+            if not result_dict.get("success") else ""
+        ),
     })
     smoke_data = result_dict.get("smoke_test")
     if isinstance(smoke_data, dict):
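The effect of the smoke-failure propagation is easiest to see on a plain dictionary. The following is a condensed restatement of `_fail_final_test_for_smoke_error` for illustration only: `mark_smoke_failure` is a hypothetical name, the `next_action` field is omitted for brevity, and the `unit_result` values are made-up sample data rather than output from a real run.

```python
from typing import Any, Dict, Optional


def mark_smoke_failure(
    result: Dict[str, Any],
    message: str,
    smoke: Optional[Dict[str, Any]] = None,
) -> None:
    """Flip a passing final-validation result to failed (illustrative copy)."""
    result["success"] = False
    # Keep any existing error count, but report at least one error.
    result["errors"] = max(int(result.get("errors", 0) or 0), 1)
    result["output"] = message
    result["smoke_test_error"] = message
    if smoke is None:
        smoke = {
            "success": False,
            "type": "smoke_test",
            "findings": [{"severity": "error", "message": message}],
            "error_count": 1,
            "warning_count": 0,
        }
    result["smoke_test"] = smoke


# Hypothetical unit-test result that passed cleanly before the smoke check.
unit_result: Dict[str, Any] = {"success": True, "passed": 12, "failed": 0, "errors": 0}
mark_smoke_failure(unit_result, "Smoke test failed to run: boom")
```

After the call, `unit_result["success"]` is `False` while `passed` still records the 12 passing unit tests, so the saved stage result reflects the smoke failure without discarding the unit-test counts.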

CoderMind/scripts/code_gen/git_ops.py

Lines changed: 13 additions & 0 deletions
@@ -20,6 +20,10 @@
 from pathlib import Path
 from typing import List, Optional, Tuple
 
+from common.generated_artifacts import (
+    find_persisted_generated_artifact_changes,
+    format_generated_artifact_violation,
+)
 from common.git_utils import GitRunner, sanitize_branch_component
 
 logger = logging.getLogger(__name__)
@@ -141,6 +145,15 @@ def merge_batch_branch(
         )
         return False, "branch_missing"
 
+    generated_artifact_changes = find_persisted_generated_artifact_changes(
+        git.repo_path,
+        base_ref=git.main_branch,
+    )
+    if generated_artifact_changes:
+        summary = format_generated_artifact_violation(generated_artifact_changes)
+        logger.error("Cannot merge generated artifact changes:\n%s", summary)
+        return False, summary
+
     # Commit any leftover changes
     if git.has_uncommitted_changes():
         git.stage_and_commit(f"batch: final changes for {batch_id}")
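The merge guard depends on a classifier in `common.generated_artifacts` whose body is not part of the visible diff. As a rough sketch of the idea only, the example below flags paths that sit under well-known build or cache directories. Everything here is an assumption: `is_generated_artifact`, `blocked_changes`, and `GENERATED_DIRS` are hypothetical names, and the directory set is borrowed from the `find ... -prune` list in `batch_prompts.py`, not from the real classifier.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

# Assumed directory names, taken from the prune list in batch_prompts.py;
# the real classifier may use a different or larger set.
GENERATED_DIRS = frozenset(
    {"build", "node_modules", "target", "dist", "coverage", ".venv", "venv", "CMakeFiles"}
)


def is_generated_artifact(path: str) -> bool:
    """True when any parent directory of the path is a generated/cache dir."""
    parts = PurePosixPath(path).parts
    return any(part in GENERATED_DIRS for part in parts[:-1])


def blocked_changes(changed_paths: Iterable[str]) -> List[str]:
    """Return the changed paths that should block a merge, sorted for stable output."""
    return sorted(p for p in changed_paths if is_generated_artifact(p))


changed = ["src/main.cpp", "build/CMakeCache.txt", "tests/test_main.cpp"]
violations = blocked_changes(changed)
can_merge = not violations
```

A guard of this shape returns early with the list of offending paths, which mirrors how `merge_batch_branch` returns `False` together with a formatted summary instead of merging.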

CoderMind/scripts/code_gen/post_verify.py

Lines changed: 24 additions & 5 deletions
@@ -24,9 +24,13 @@
 from pathlib import Path
 from typing import Tuple
 
+from common.generated_artifacts import (
+    ensure_generated_artifact_excludes,
+    find_persisted_generated_artifact_changes,
+    format_generated_artifact_violation,
+)
 from common.git_utils import GitRunner
 from common.task_batch import PlannedTask
-from code_gen.prompts import is_project_docs_batch
 from code_gen.test_runner import (
     ensure_deps_installed,
     find_related_test_files,
@@ -61,10 +65,16 @@ def post_verify(
     Returns:
         ``(passed, test_output_summary)``
     """
-    # Skip verification for docs batches
-    if is_project_docs_batch(task):
-        logger.info("Skipping post-verification for docs batch")
-        return True, "Documentation batch — no tests."
+    ensure_generated_artifact_excludes(repo_path)
+
+    generated_artifact_changes = find_persisted_generated_artifact_changes(
+        repo_path,
+        base_ref=GitRunner.MAIN_BRANCH,
+    )
+    if generated_artifact_changes:
+        summary = format_generated_artifact_violation(generated_artifact_changes)
+        logger.warning("Post-verification rejected generated artifact changes:\n%s", summary)
+        return False, summary
 
     # Use the global safety-net timeout for all task types.
     # Per-test hang prevention is handled by pytest-timeout (--timeout=DEFAULT_TEST_TIMEOUT).
@@ -137,6 +147,15 @@ def _git_diff_test_files(prefix: str = "tests/") -> list:
         backend=backend,
     )
 
+    generated_artifact_changes = find_persisted_generated_artifact_changes(
+        repo_path,
+        base_ref=GitRunner.MAIN_BRANCH,
+    )
+    if generated_artifact_changes:
+        summary = format_generated_artifact_violation(generated_artifact_changes)
+        logger.warning("Post-verification rejected generated artifact changes:\n%s", summary)
+        return False, summary
+
     # Build summary
     summary_lines = [
         f"passed={result.passed} failed={result.failed} "
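`post_verify` now calls `ensure_generated_artifact_excludes(repo_path)`, whose implementation is not shown in this diff. One plausible shape for a local-only, non-destructive update of `.git/info/exclude` is an idempotent append, sketched below. This is an assumption, not the commit's code: `ensure_excludes` is a hypothetical helper, and the patterns passed in the demo are example values.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List, Tuple


def ensure_excludes(repo_path: Path, patterns: Iterable[str]) -> List[str]:
    """Add missing patterns to .git/info/exclude and return the ones added."""
    exclude_file = repo_path / ".git" / "info" / "exclude"
    exclude_file.parent.mkdir(parents=True, exist_ok=True)
    existing = (
        exclude_file.read_text(encoding="utf-8").splitlines()
        if exclude_file.exists()
        else []
    )
    present = {line.strip() for line in existing}
    added = [p for p in patterns if p not in present]
    if added:
        # Existing lines are kept as-is; new patterns go at the end.
        exclude_file.write_text("\n".join(existing + added) + "\n", encoding="utf-8")
    return added


def demo() -> Tuple[List[str], List[str], List[str]]:
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        info = repo / ".git" / "info"
        info.mkdir(parents=True)
        (info / "exclude").write_text("# local excludes\n", encoding="utf-8")
        first = ensure_excludes(repo, ["build/", "CMakeFiles/"])
        second = ensure_excludes(repo, ["build/", "CMakeFiles/"])  # nothing new to add
        content = (info / "exclude").read_text(encoding="utf-8").splitlines()
        return first, second, content
```

Writing to `.git/info/exclude` rather than `.gitignore` keeps the rules out of version control, which fits the reviewer note that the update is intentionally local-only.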
Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+"""Shared helpers for collapsing duplicated interface source blocks.
+
+Interface synthesis stores each unit's code as the whole-file text for
+non-Python units (``LPCodeUnit`` has no ``count_lines`` slicing), so a
+file with N units repeats the entire file N times when those blocks are
+joined into ``file_code``. These helpers collapse identical blocks so the
+joined source reconstructs the original single file (imports plus each
+unit once) instead of an O(units x file_size) blow-up.
+"""
+from __future__ import annotations
+
+from typing import Iterable, List
+
+
+def dedup_code_blocks(codes: Iterable[str]) -> List[str]:
+    """Return ``codes`` with blank and duplicate blocks removed.
+
+    Order of first appearance is preserved. Duplicates are detected on the
+    whitespace-stripped block so trivially different indentation does not
+    defeat dedup, but each surviving block keeps its own leading indentation
+    (only trailing whitespace is trimmed) so indented unit slices stay valid
+    when joined into ``file_code``.
+    """
+    seen: set[str] = set()
+    unique: List[str] = []
+    for code in codes:
+        key = code.strip()
+        if key and key not in seen:
+            seen.add(key)
+            unique.append(code.rstrip())
+    return unique
+
+
+def dedup_file_code(unit_codes: Iterable[str], fallback: str = "") -> str:
+    """Build ``file_code`` from per-unit code blocks with duplication removed.
+
+    ``unit_codes`` are the values of ``units_to_code``. When every block is
+    an identical whole-file copy, the result is that single file; when
+    blocks are genuinely distinct per-unit slices they are all kept. Falls
+    back to ``fallback`` when no non-empty block survives.
+    """
+    unique = dedup_code_blocks(unit_codes)
+    if not unique:
+        return fallback
+    return "\n\n".join(unique)
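A short usage sketch shows the three cases the docstrings describe: identical whole-file copies, distinct per-unit slices, and an all-blank input. The two helpers are repeated here in compact form so the snippet runs on its own; the C snippets used as input are made-up sample data.

```python
from typing import Iterable, List


def dedup_code_blocks(codes: Iterable[str]) -> List[str]:
    """Drop blank and duplicate blocks, keeping first-appearance order."""
    seen: set = set()
    unique: List[str] = []
    for code in codes:
        key = code.strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(code.rstrip())
    return unique


def dedup_file_code(unit_codes: Iterable[str], fallback: str = "") -> str:
    """Join the surviving blocks, or return the fallback when none survive."""
    unique = dedup_code_blocks(unit_codes)
    return "\n\n".join(unique) if unique else fallback


# Case 1: three units that each carry the same whole-file text.
whole_file = "#include <stdio.h>\nint add(int a, int b) { return a + b; }\n"
collapsed = dedup_file_code([whole_file, whole_file, whole_file])

# Case 2: distinct slices, one repeated with different trailing whitespace.
slices = dedup_file_code(["int f(void);", "    int g(void);  \n", "int f(void);\n"])

# Case 3: nothing but blank blocks, so the fallback is used.
empty = dedup_file_code(["", "   \n"], fallback="// no source")
```

In case 2 the indented block keeps its leading spaces because only trailing whitespace is trimmed from survivors, while the duplicate check itself uses the fully stripped text.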
