Harden multilingual decoder verification and restore follow-up test coverage by QingtaoLi1 · Pull Request #82 · microsoft/RPG-ZeroRepo

QingtaoLi1 · 2026-06-25T11:03:17Z

Summary

Follow-up to #67. This PR hardens the multilingual decoder/codegen pipeline by tightening interface completeness checks, generated-artifact hygiene, C/C++ verification semantics, and planner prompt safety. It also restores the extracted multilingual regression tests that were moved out of the original #67 branch for a follow-up PR.

What changed

Decoder and verification hardening

Build C/C++ CMake targets before running ctest, so verification does not falsely pass or fail because test executables were never built.
Treat C/C++ make test targets that only compile objects, without actually running tests, as verification errors instead of successful test runs.
Skip generated/build/cache directories when collecting C/C++ source files for syntax and verification commands.
Improve C/C++ prompt rules to:
- avoid editing build/cache/generated artifacts,
- use full CMake build + CTest commands,
- avoid relying on undeclared/transitively included helper functions,
- report explicit syntax-check summaries.
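
The source-collection rule above can be sketched as follows. This is a minimal illustration, not the code in c_backend.py or cpp_backend.py: `SKIP_DIRS`, `SOURCE_SUFFIXES`, and `collect_sources` are hypothetical names, and the directory list is an assumption.

```python
import os
from pathlib import Path

# Assumed skip list for illustration; the real backends may use a different set.
SKIP_DIRS = {"build", "CMakeFiles", "dist", "coverage", ".venv", "venv", ".git"}
SOURCE_SUFFIXES = {".c", ".cc", ".cpp", ".cxx"}


def collect_sources(root: str) -> list[str]:
    """Return C/C++ sources under root, skipping generated/build/cache dirs."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into skipped directories.
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        for name in sorted(filenames):
            if Path(name).suffix in SOURCE_SUFFIXES:
                found.append(os.path.relpath(os.path.join(dirpath, name), root))
    return found
```

Pruning `dirnames` in place, rather than filtering results afterwards, keeps the walk from reading large build trees at all.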

Generated artifact hygiene

Add a shared generated-artifact classifier and prompt rule helper.
Install local .git/info/exclude hygiene rules during batch startup.
Reject persisted generated artifacts before post-verify and after verification runs.
Prevent batch branches containing generated artifacts from being merged.
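
A local-only exclude installer along these lines might look like the sketch below. It relies on git's `.git/info/exclude` file, which uses gitignore syntax and is never committed. `GENERATED_PATTERNS` and `install_local_excludes` are hypothetical names, and the pattern list is an assumption rather than the PR's actual policy.

```python
from pathlib import Path

# Hypothetical patterns; a leading "/" anchors a pattern at the work-tree root.
GENERATED_PATTERNS = ["/build/", "/.venv/", "/venv/", "__pycache__/"]


def install_local_excludes(repo_root: str, patterns=GENERATED_PATTERNS) -> list[str]:
    """Append missing patterns to .git/info/exclude; return the ones added."""
    exclude = Path(repo_root) / ".git" / "info" / "exclude"
    exclude.parent.mkdir(parents=True, exist_ok=True)
    text = exclude.read_text() if exclude.exists() else ""
    existing = set(text.splitlines())
    missing = [p for p in patterns if p not in existing]
    if missing:
        # Append only, so rules the user already wrote are left untouched.
        prefix = "" if text == "" or text.endswith("\n") else "\n"
        with exclude.open("a") as fh:
            fh.write(prefix + "\n".join(missing) + "\n")
    return missing
```

Because the function only appends lines that are not already present, calling it on every batch startup is idempotent.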

Interface and planner robustness

Deduplicate repeated whole-file file_code blocks in interfaces.json before serialization and before planner prompt construction.
Add interface coverage validation so plan_tasks fails fast when interfaces.json does not cover all skeleton features.
Improve Python dependency collection for same-file calls and self.method() invocations.
Save in-progress interface generation to interfaces.json.partial and only overwrite the canonical interfaces.json after successful completion.
Allow interface review additions to scaffold missing file entries under existing feature subtrees.
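
The partial-file behavior described above can be illustrated with a small write-then-promote sketch. `save_partial` and `finalize` are hypothetical names, and the use of `os.replace` for promotion is an assumption, not a confirmed detail of interface_agent.py.

```python
import json
import os
from pathlib import Path


def save_partial(out_dir: str, interfaces: dict) -> Path:
    """Write in-progress results to interfaces.json.partial."""
    partial = Path(out_dir) / "interfaces.json.partial"
    partial.write_text(json.dumps(interfaces, indent=2))
    return partial


def finalize(out_dir: str) -> Path:
    """Promote the partial file to interfaces.json after a successful run."""
    partial = Path(out_dir) / "interfaces.json.partial"
    final = Path(out_dir) / "interfaces.json"
    # os.replace overwrites the destination; it is atomic when both paths
    # live on the same filesystem.
    os.replace(partial, final)
    return final
```

If a run is interrupted between the two calls, the canonical interfaces.json still holds the last complete result.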

Parser and language detection fixes

Classify header-heavy mixed C/C++ repositories as C++ when C and C++ votes appear together.
Harden fallback string-literal stripping against catastrophic regex backtracking on unterminated escaped strings.

Final validation behavior

Propagate smoke-test failures into the final validation result instead of allowing a successful unit-test result to mask a failed smoke check.
Clarify that plan --check-only warning states are not complete/done states and should not allow downstream stages to proceed.
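
As a rough illustration of the smoke-test propagation rule, the final result can be modeled as passing only when every stage passes. `StageResult` and `combine_final_validation` are hypothetical names; the actual result types in final_validation.py are not shown in this PR description.

```python
from dataclasses import dataclass


@dataclass
class StageResult:
    name: str
    passed: bool
    detail: str = ""


def combine_final_validation(unit: StageResult, smoke: StageResult) -> StageResult:
    """Fail the final result if either the unit tests or the smoke check failed."""
    failed = [s for s in (unit, smoke) if not s.passed]
    if failed:
        detail = "; ".join(f"{s.name}: {s.detail or 'failed'}" for s in failed)
        return StageResult("final", False, detail)
    return StageResult("final", True)
```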

Test coverage

This PR adds extensive regression coverage for the multilingual pipeline, including:

generated artifact hygiene,
interface source deduplication,
skeleton/interface coverage validation,
multilingual dependency graph behavior,
multilingual encoder/codegen behavior,
planner language support and prompt deduplication,
C/C++/Go/Rust/TypeScript/JavaScript/Python parser behavior,
decoder language backends and planning phases,
zero-test guard behavior,
final test repair,
repo language resolution,
orphan/test/build exclusion handling,
smoke multilingual coverage.

The diff adds 33 new test files and restores the extracted multilingual tests intended for this follow-up PR.

Notes for reviewers

This branch was rebased on top of the squash-merged feat: multi-language support for the encoder and decoder pipelines #67 commit, so the PR diff should now represent only the follow-up hardening and restored tests.
run_batch / post_verify now update .git/info/exclude with local generated-artifact exclusions. This is intentionally local-only and non-destructive.
plan_tasks now fails on incomplete interface coverage instead of silently planning from stale or partial interfaces.
C/C++ projects whose make test target only compiles objects but does not execute tests will now be rejected as invalid verification results.

Testing

Not run as part of this PR-description preparation.

Stage the 35 test files that were split out of the code PR (commit 5e02cfc) on top of the latest pipeline code, including the review-fix commits. Basing this branch on the current pipeline HEAD keeps the eventual follow-up PR's diff limited to the test files once the code PR has merged to main.

Non-Python interface units are LPCodeUnit instances, which have no count_lines method. The except branch in interface synthesis then stores the whole interface block as every unit's code, so interfaces.json's file_code embeds the entire file once per unit. On large modules that O(units x file_size) blow-up pushes the plan_tasks prompt past the 128 KB single-argument limit, crashing the planner with "Argument list too long" and producing an incomplete tasks.json. Collapse identical per-unit blocks before building the planner prompt. Keeping one copy reconstructs the original complete file (imports plus each unit once), so the planner sees valid source while distinct per-unit slices are preserved. Measured 55-67% reduction on a real Rust subtree.

dominant_language votes one ballot per file and detect_language maps every .h to C (the only config owning .h). A C++ repo that uses .h headers — googletest has 2018 .h vs 1062 .cc — therefore gets more C votes than C++ and is misclassified as C, which fails the encoder's dominant_language expectation and poisons every downstream language decision (backend, test_command, entry_point). Fold C votes into C++ whenever both appear: a pure C repo never carries .cc/.cpp/.hpp sources, so the presence of any C++-only extension means the repo is C++ and its .h files are C++ headers. Pure C repos (only .c/.h) and C mixed with unrelated languages are unaffected.
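
The vote-folding rule can be sketched as below. This is a simplified stand-in for the registry logic: `EXT_TO_LANG` is an assumed, abbreviated extension map, and the function signature is hypothetical.

```python
from collections import Counter
from pathlib import Path

# Assumed, abbreviated extension map; .h is owned by C, as described above.
EXT_TO_LANG = {
    ".c": "c", ".h": "c",
    ".cc": "cpp", ".cpp": "cpp", ".hpp": "cpp",
    ".py": "python",
}


def dominant_language(paths):
    """Return the language with the most file votes, or None if nothing matched."""
    votes = Counter()
    for path in paths:
        lang = EXT_TO_LANG.get(Path(path).suffix.lower())
        if lang:
            votes[lang] += 1
    # Any C++-only extension means the .h files are C++ headers: fold C into C++.
    if votes["c"] and votes["cpp"]:
        votes["cpp"] += votes.pop("c")
    return votes.most_common(1)[0][0] if votes else None
```

With three `.h` files and one `.cc` file the result is `"cpp"`, while a repository containing only `.c` and `.h` files still resolves to `"c"`.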

The planner-side dedup (1f22944) only repaired the prompt; interfaces.json itself still stored file_code as the whole file repeated once per unit, and context_collector writes that file_code straight to disk as the code-gen seed source — so generated repos were seeded with N duplicate definitions. Add a shared common/code_dedup helper and apply it at every point that rebuilds file_code from per-unit code: InterfacesStore.to_interfaces_json (serialization) and interface_review prune (regeneration). The planner helper now reuses it as a consumer-side safety net for older artifacts. Keeping one copy reconstructs the original single file (imports plus each unit once); units_to_code is left untouched so per-unit stubs stay valid.
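
A minimal sketch of the dedup idea follows. The name `dedup_code_blocks` appears in a later commit message of this PR, but the signature and body here are assumptions, and `rebuild_file_code` is a hypothetical helper. The sketch compares blocks on their stripped text and keeps each kept block's leading indentation, matching the behavior described in that later review fix.

```python
def dedup_code_blocks(blocks):
    """Drop blocks whose stripped text was already seen, preserving order."""
    seen = set()
    kept = []
    for block in blocks:
        key = block.strip()
        if not key or key in seen:
            continue
        seen.add(key)
        # Trim only trailing whitespace so indented unit slices stay valid.
        kept.append(block.rstrip())
    return kept


def rebuild_file_code(blocks):
    """Join the surviving blocks into a single file_code string."""
    return "\n\n".join(dedup_code_blocks(blocks))
```

When every unit carries the same whole-file block, the rebuilt `file_code` collapses back to one copy of that file.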

Exclude backslashes from the normal string-literal branch so escaped characters have one matching path. This prevents zod-like commented regex literals from hanging TypeScript encoding.
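
To illustrate the fix: in a pattern such as `"(?:[^"]|\\.)*"`, a backslash can match either branch, so an unterminated string with many escapes gives the regex engine exponentially many ways to fail. Excluding the backslash from the plain-character branch leaves one path per character. The pattern below is an assumed approximation of the one in fallback.py, and `strip_string_literals` is a hypothetical name.

```python
import re

# Plain characters exclude both the quote and the backslash, so an escape
# sequence can only be consumed by the second branch.
STRING_LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"')


def strip_string_literals(line: str) -> str:
    """Replace double-quoted string literals with empty ones."""
    return STRING_LITERAL.sub('""', line)
```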

Allow Python add_interface fixes to materialize a new interface file when the feature root maps to an existing subtree. Add deterministic same-file Python invocation edges for self calls, private helper fan-out, and constructor-composed class dependencies so review orphan checks can converge.

Run C++ ctest with --test-dir build so post-verification sees the CMake-generated test registry. Also make plan_tasks reject interfaces.json files that do not cover skeleton.json features, and update the plan command template so warning states are treated as incomplete.
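
The coverage check can be pictured as a set difference that raises before any planning starts. The schemas of skeleton.json and interfaces.json are not shown here, so this sketch assumes features are plain strings; `missing_interface_coverage`, `require_full_coverage`, and `InterfaceCoverageError` are hypothetical names.

```python
class InterfaceCoverageError(RuntimeError):
    """Raised when interfaces do not cover every skeleton feature."""


def missing_interface_coverage(skeleton_features, interface_features):
    """Return skeleton features that have no matching interface entry."""
    covered = set(interface_features)
    return sorted(f for f in set(skeleton_features) if f not in covered)


def require_full_coverage(skeleton_features, interface_features):
    """Fail fast instead of planning from stale or partial interfaces."""
    missing = missing_interface_coverage(skeleton_features, interface_features)
    if missing:
        raise InterfaceCoverageError(
            "interfaces.json does not cover skeleton features: " + ", ".join(missing)
        )
```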

Keep interface generation progress in a .partial file until complete, add design_interfaces heartbeat output, exclude skipped directories from C/C++ syntax source discovery, and make C/C++ prompt test commands discover sources at run time with standalone translation-unit guidance.

Keep interface generation progress in a .partial file until complete, add design_interfaces heartbeat output, exclude skipped directories from C/C++ syntax source discovery, make C/C++ prompt test commands discover sources at run time, and document the syntax-only success summary fallback.

Run cmake --build build during C/C++ test environment preparation so ctest --test-dir build can execute generated test binaries instead of failing on missing executables.
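
The configure, build, and test order can be sketched as below. `cmake -S . -B build`, `cmake --build build`, and `ctest --test-dir build` are standard CMake/CTest invocations (`--test-dir` needs CMake 3.20 or newer). The function name, the injectable `run` callable, and the choice to still invoke ctest after a failed configure are assumptions made for illustration.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def _run(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


def prepare_and_test(run: Runner = _run, build_dir: str = "build") -> list[str]:
    """Configure, build, then run ctest; return the names of the steps executed."""
    executed = ["configure"]
    configured = run(["cmake", "-S", ".", "-B", build_dir]) == 0
    if configured:
        # Build only after a successful configure, so a stale build
        # directory is never rebuilt from a broken configuration.
        executed.append("build")
        run(["cmake", "--build", build_dir])
    executed.append("ctest")
    run(["ctest", "--test-dir", build_dir, "--output-on-failure"])
    return executed
```

Passing the runner as a parameter keeps the step ordering testable without CMake installed.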

Add a shared generated-artifact policy for prompt guidance, local git excludes, post-verification, and merge-time checks. Align CMake prompt verification with configure, build, and ctest so agents repair source and build configuration instead of generated build files.

Run post-verification for documentation batches so README and docs edits cannot bypass native test suites. Treat smoke validation failures and C/C++ no-op make test output as validation failures. Narrow generated artifact matching to root build directories so source paths such as configs/build remain valid.
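
Root anchoring of build directories might be expressed as in the following sketch, where `build` counts as generated only as the first path component. The directory sets and `is_generated_artifact` are hypothetical; the real classifier lives in common/generated_artifacts.py and may differ.

```python
from pathlib import PurePosixPath

# Assumed sets for illustration only.
ROOT_ONLY_DIRS = {"build", ".venv", "venv"}
ANY_DEPTH_DIRS = {"__pycache__", "CMakeFiles"}


def is_generated_artifact(rel_path: str) -> bool:
    """Classify a repo-relative POSIX path as a generated artifact or not."""
    parts = PurePosixPath(rel_path).parts
    if len(parts) < 2:
        return False  # a bare top-level file is never inside a generated dir
    if parts[0] in ROOT_ONLY_DIRS:
        return True
    # Only directory components are checked, not the file name itself.
    return any(part in ANY_DEPTH_DIRS for part in parts[:-1])
```

Under these assumptions `build/CMakeCache.txt` is rejected while `configs/build/settings.py` remains a valid source path.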

Copilot

Pull request overview

This PR is a follow-up hardening pass on the multilingual decoder/codegen pipeline, primarily tightening verification correctness (especially for C/C++), improving generated-artifact hygiene, strengthening interface/planner robustness, and restoring/adding extensive regression coverage across the multilingual stack.

Changes:

Harden C/C++ verification semantics (build before ctest, detect “compile-only” make test, skip build/generated dirs when collecting sources) and improve prompt rules/summaries for compiled backends.
Add shared helpers for interface-source deduplication and generated-artifact hygiene, and enforce generated-artifact rejection at key pipeline gates (run_batch startup, post_verify, merge).
Restore/add broad multilingual regression tests covering language parsing, repo language resolution, entry-point reconciliation, smoke/final validation behavior, and zero-test guarding.

Reviewed changes

Copilot reviewed 55 out of 56 changed files in this pull request and generated 2 comments.

File	Description
CoderMind/tests/test_zero_test_guard.py	Adds regression coverage for “exit-0 but no tests ran” detection across backends.
CoderMind/tests/test_smoke_multilang.py	Adds smoke-test coverage for clean env execution and skipping Python-only layers for non-Python repos.
CoderMind/tests/test_rpg_builder.py	Ensures RPG builder preserves target language metadata.
CoderMind/tests/test_repo_language_resolution.py	Adds tests for on-disk language resolution fallback and wrapper behavior.
CoderMind/tests/test_plan_prompt_dedup.py	Covers planner prompt dedup + interface/skeleton coverage validation behavior.
CoderMind/tests/test_orphan_test_build_exclusion.py	Adds regression coverage for excluding test/build units from orphan detection.
CoderMind/tests/test_multilingual_prompt_safety.py	Guards encoder prompts against Python-only wording regressions.
CoderMind/tests/test_multilingual_encoder_pipeline.py	Tests multilingual encoder discovery + semantic parsing entry behavior.
CoderMind/tests/test_multilingual_code_unit.py	Tests multilingual ParsedFile and snippet fencing behavior.
CoderMind/tests/test_lang_parser_typescript.py	Adds TypeScript parser tests including hang regression coverage.
CoderMind/tests/test_lang_parser_rust.py	Adds Rust parser unit/dependency/invocation behavior tests.
CoderMind/tests/test_lang_parser_registry.py	Expands registry tests (configs, dominant language, fences, API).
CoderMind/tests/test_lang_parser_python_parity.py	Ensures Python parser/ParsedFile parity with legacy AST behavior.
CoderMind/tests/test_lang_parser_javascript.py	Adds JavaScript parser tests.
CoderMind/tests/test_lang_parser_go.py	Adds Go parser tests (units, deps, invocation extraction).
CoderMind/tests/test_lang_parser_fallback.py	Adds fallback delimiter/syntax scanner regression tests.
CoderMind/tests/test_lang_parser_cpp.py	Adds C++ parser tests (units, invokes, syntax errors).
CoderMind/tests/test_lang_parser_c.py	Adds C parser tests (includes, invokes, syntax errors).
CoderMind/tests/test_interface_source_dedup.py	Tests shared interface-source dedup and serialization wiring.
CoderMind/tests/test_init_codebase_gitignore.py	Tests `.gitignore` update behavior for dev env entries.
CoderMind/tests/test_generated_artifact_hygiene.py	Adds end-to-end tests for generated-artifact exclude/install + rejection gates.
CoderMind/tests/test_final_test_repair.py	Tests bounded final-test repair loop + smoke failure propagation.
CoderMind/tests/test_feature_build.py	Tests feature-build tree promotion + target language preservation.
CoderMind/tests/test_entry_reconciliation.py	Adds cross-language entry-point reconciliation protocol coverage.
CoderMind/tests/test_code_gen_multilingual.py	Adds broad multilingual codegen prompt/test runner/static check coverage.
CoderMind/tests/test_branch_name_sanitization.py	Tests branch name sanitization hardening and shared sanitizer usage.
CoderMind/templates/commands/plan.md	Clarifies `warning` is “not done” in `plan --check-only` semantics.
CoderMind/scripts/run_batch.py	Installs local generated-artifact excludes during batch startup.
CoderMind/scripts/plan_tasks.py	Adds interface source dedup in planner prompts + validates interface coverage vs skeleton.
CoderMind/scripts/lang_parser/registry.py	Fixes dominant language classification for header-heavy mixed C/C++ repos.
CoderMind/scripts/lang_parser/extractors/fallback.py	Hardens string-literal stripping regex against catastrophic backtracking.
CoderMind/scripts/func_design/interfaces_store.py	Deduplicates `file_code` during interfaces serialization.
CoderMind/scripts/func_design/interface_review.py	Adds file-entry scaffolding on interface review + uses shared dedup when regenerating `file_code`.
CoderMind/scripts/func_design/interface_agent.py	Improves Python dependency collection (same-file calls, `self.method()` calls) and saves partial interfaces output.
CoderMind/scripts/design_interfaces.py	Adds a periodic heartbeat while long interface design runs.
CoderMind/scripts/decoder_lang/tests/test_unit_kind.py	Adds tests for shared unit-kind classification and backend callable/type behavior.
CoderMind/scripts/decoder_lang/tests/test_python_backend.py	Expands Python backend/registry protocol invariants and behavior tests.
CoderMind/scripts/decoder_lang/tests/test_phase5_prompt_directive.py	Tests language directive preamble behavior (no-op for Python).
CoderMind/scripts/decoder_lang/tests/test_phase3_code_structure.py	Tests backend code-structure helpers and cross-language behavior.
CoderMind/scripts/decoder_lang/tests/test_phase2_skeleton.py	Tests backend-aware skeleton behavior and identifier validation.
CoderMind/scripts/decoder_lang/tests/test_phase1_propagation.py	Tests language propagation/resolution through decoder entry points.
CoderMind/scripts/decoder_lang/tests/test_javascript_backend.py	Adds JavaScript backend contract tests.
CoderMind/scripts/decoder_lang/tests/test_c_cpp_backend.py	Adds C/C++ backend tests for build + `ctest` semantics and “compile-only make test” detection.
CoderMind/scripts/decoder_lang/tests/__init__.py	Package marker for decoder_lang tests.
CoderMind/scripts/decoder_lang/cpp_backend.py	Skips build/generated dirs in source collection + strengthens no-test/compile-only detection.
CoderMind/scripts/decoder_lang/c_backend.py	Skips build/generated dirs in source collection + strengthens no-test/compile-only detection.
CoderMind/scripts/decoder_lang/backend.py	Ensures CMake projects are built before `ctest` in C/C++ verification prep.
CoderMind/scripts/common/generated_artifacts.py	Adds shared generated-artifact classifier, local exclude installer, and persisted-change detection.
CoderMind/scripts/common/code_dedup.py	Adds shared helpers for deduplicating repeated interface code blocks.
CoderMind/scripts/code_gen/post_verify.py	Enforces generated-artifact rejection before and after post-verify test runs.
CoderMind/scripts/code_gen/git_ops.py	Rejects merging batch branches containing persisted generated artifacts.
CoderMind/scripts/code_gen/final_validation.py	Propagates smoke-test failures into final validation outcome.
CoderMind/scripts/code_gen/batch_prompts.py	Adds generated-artifact prompt rules and improves C/C++ verification command construction in prompts.

Keep shell expansion for C/C++ syntax-check include paths and root-anchor virtualenv artifact directories to avoid false positives. Also cover the false-positive resume prompt summary interpolation with a regression test.

prune_orphan_interfaces is a deprecated helper with no production callers; the active design flow uses InterfacesStore.find_orphan_units and prune_units, and file_code is already deduplicated at the serialization source. Restore the plain join so the legacy helper no longer carries duplicate dedup logic, and drop the now-unused dedup_file_code import.

The re-added follow-up tests encoded pre-refactor behavior that no longer matches the merged decoder pipeline:
- add_interface for a non-Python project is recorded as an advisory manual follow-up rather than silently skipped.
- correct_intra_subtree_file_order reports reason "backend_file_dependencies", and Go ordering resolves module imports through go.mod.
- Go command-path resolution moved to go_backend.find_existing_entry.

Update the assertions to match, and remove the standalone C++ ctest backend test, whose behavior is fully covered by the comprehensive decoder_lang backend suite.

Copilot

Pull request overview

Copilot reviewed 56 out of 57 changed files in this pull request and generated 3 comments.

Second Copilot review pass flagged three correctness issues in the restored hardening code:
- code_dedup: dedup_code_blocks returned the stripped block, dropping leading indentation from indented unit slices and corrupting the file_code used as a codegen seed. Dedup on the stripped key but keep the block's own indentation (trim only trailing whitespace).
- interface_agent: _analyze_python_invocations walked the whole AST, attributing calls inside nested def/class bodies to the enclosing caller. Walk only the caller's own scope.
- decoder_lang.backend: cmake_reconfigure now skips the build step when configure fails so ctest surfaces the real error instead of running against a stale build directory.

Copilot

Pull request overview

Copilot reviewed 56 out of 57 changed files in this pull request and generated 3 comments.

A second Copilot review pass surfaced three correctness gaps in the restored C/C++ verification and Python invocation analysis:
- c/cpp backend: the compile-only `make test` guard regex did not match version-suffixed compilers (gcc-13, clang-18, g++-13, clang++-18, c++-14), so a compile-only run could be reported as a passing test run. Allow an optional version suffix.
- batch_prompts: the C/C++ syntax-only find now also prunes dist/coverage/.venv/venv/CMakeFiles so generated sources are not pulled into the syntax check.
- interface_agent: lambdas are not separate units, so keep their bodies in the enclosing caller's scope rather than dropping their same-file calls (nested def/class scopes are still excluded).
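
A compile-only guard that tolerates version-suffixed compilers might look like the sketch below. The regex and the rule "every non-blank output line is a compiler invocation" are assumptions for illustration, not the pattern used in c_backend.py or cpp_backend.py.

```python
import re

# Compiler name, optionally preceded by a path and followed by a version
# suffix such as -13 or -18.1, then whitespace before the arguments.
COMPILER_LINE = re.compile(
    r"^\s*(?:\S*/)?(?:gcc|g\+\+|clang\+\+|clang|cc|c\+\+)(?:-\d+(?:\.\d+)*)?\s"
)


def is_compile_only(output: str) -> bool:
    """True when the output has only compiler invocations and nothing else."""
    lines = [line for line in output.splitlines() if line.strip()]
    return bool(lines) and all(COMPILER_LINE.match(line) for line in lines)
```

A `make test` run whose output satisfies `is_compile_only` would then be treated as a verification error rather than a passing test run.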

Copilot

Pull request overview

Copilot reviewed 56 out of 57 changed files in this pull request and generated 1 comment.

HuYaSen added 12 commits June 25, 2026 10:48

fix(parser): Prevent regex backtracking in string stripping

4ba22f5

Exclude backslashes from the normal string-literal branch so escaped characters have one matching path. This prevents zod-like commented regex literals from hanging TypeScript encoding.

fix(decoder): build CMake targets before CTest verification

aa88b4c

Run cmake --build build during C/C++ test environment preparation so ctest --test-dir build can execute generated test binaries instead of failing on missing executables.

QingtaoLi1 requested a review from HuYaSen June 25, 2026 11:09

HuYaSen requested a review from Copilot June 25, 2026 11:16

Copilot started reviewing on behalf of HuYaSen June 25, 2026 11:16

Copilot AI reviewed Jun 25, 2026

Comment thread CoderMind/scripts/code_gen/batch_prompts.py

Comment thread CoderMind/scripts/common/generated_artifacts.py

HuYaSen added 3 commits June 25, 2026 20:04

fix(verify): Address review feedback

c7a6e6c

Keep shell expansion for C/C++ syntax-check include paths and root-anchor virtualenv artifact directories to avoid false positives. Also cover the false-positive resume prompt summary interpolation with a regression test.

HuYaSen requested a review from Copilot June 26, 2026 09:29

Copilot started reviewing on behalf of HuYaSen June 26, 2026 09:30

Copilot AI reviewed Jun 26, 2026

Comment thread CoderMind/scripts/common/code_dedup.py

Comment thread CoderMind/scripts/func_design/interface_agent.py

Comment thread CoderMind/scripts/decoder_lang/backend.py Outdated

HuYaSen requested a review from Copilot June 26, 2026 09:50

Copilot started reviewing on behalf of HuYaSen June 26, 2026 09:52

Copilot AI reviewed Jun 26, 2026

Comment thread CoderMind/scripts/decoder_lang/c_backend.py Outdated

Comment thread CoderMind/scripts/decoder_lang/cpp_backend.py

Comment thread CoderMind/scripts/code_gen/batch_prompts.py

HuYaSen requested a review from Copilot June 26, 2026 10:11

Copilot started reviewing on behalf of HuYaSen June 26, 2026 10:13

Copilot AI reviewed Jun 26, 2026

Comment thread CoderMind/scripts/func_design/interface_review.py

Copilot started work on behalf of HuYaSen June 26, 2026 10:26

Copilot stopped work on behalf of HuYaSen due to an error June 26, 2026 10:26
Request session.create failed with message: Model "gpt-5.3-codex" is not available.

HuYaSen approved these changes Jun 26, 2026

HuYaSen merged commit 1d4da22 into main Jun 26, 2026
3 of 4 checks passed

HuYaSen mentioned this pull request Jun 26, 2026

Fix multilingual pipeline bugs and restore extracted tests #81

Closed

HuYaSen merged 17 commits into main from dev/decoder-multilang-pipeline-fix2

3 participants
