Contributing to CodeClone

Thank you for your interest in contributing to CodeClone.

CodeClone provides structural code quality analysis for Python, including clone detection, quality metrics, baseline-aware CI governance, and an optional MCP agent interface.

Contributions are welcome — especially those that improve signal quality, CFG semantics, and real-world CI usability.


Project Philosophy

Core principles:

  • Low noise over high recall
  • Structural and control-flow similarity, not semantic equivalence
  • Deterministic and explainable behavior
  • Optimized for CI usage and architectural analysis

If a change increases false positives, reduces determinism, or weakens explainability, it is unlikely to be accepted.


Areas Open for Contribution

We especially welcome contributions in the following areas:

  • Control Flow Graph (CFG) construction and semantics
  • AST normalization improvements
  • Segment-level clone detection and reporting
  • Quality metrics (complexity, coupling, cohesion, dead-code, dependencies)
  • False-positive reduction
  • HTML report UX improvements
  • MCP server tools and agent workflows
  • GitHub Action improvements
  • Performance optimizations
  • Documentation and real-world examples

Reporting Bugs

Please use the appropriate GitHub Issue Template.

When reporting issues related to clone detection, include:

  • minimal reproducible code snippets (preferred over screenshots);
  • the CodeClone version;
  • the Python version (python_tag, e.g. cp313);
  • whether the issue is primarily:
    • AST-related,
    • CFG-related,
    • normalization-related,
    • metrics-related,
    • MCP-related,
    • reporting / UI-related.
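
To report the Python tag, a short script along these lines may help. It is only a sketch: it assumes the tag is a short implementation prefix followed by the major and minor version digits (as in cp313), and the helper name python_tag is hypothetical, not part of CodeClone.

```python
import sys


def python_tag() -> str:
    """Return a tag such as 'cp313' for CPython 3.13."""
    # Assumption: short implementation prefix + major/minor digits.
    prefixes = {"cpython": "cp", "pypy": "pp"}
    name = sys.implementation.name
    prefix = prefixes.get(name, name)
    return f"{prefix}{sys.version_info.major}{sys.version_info.minor}"


if __name__ == "__main__":
    print(python_tag())
```

Paste the printed value into the issue next to the CodeClone version.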

Screenshots alone are usually insufficient for analysis.


False Positives

False positives are expected edge cases, not necessarily bugs.

When reporting a false positive:

  • explain why the detected code is architecturally distinct;
  • avoid arguments based solely on naming, comments, or formatting;
  • focus on control-flow, responsibilities, or structural differences.

Well-argued false-positive reports are valuable and appreciated.


CFG Semantics Discussions

If proposing changes to CFG semantics, include:

  • a description of the current behavior;
  • the proposed new behavior;
  • the expected impact on clone detection quality (noise/recall);
  • concrete code examples;
  • a note on determinism implications.

Such changes often require design-level discussion and may be staged across versions.


Security & Safety Expectations

  • Assume untrusted input (paths and source code).
  • Prefer fail-closed behavior in gating modes; allow fail-open behavior in normal modes only when it is explicitly intended.
  • Add negative tests for any normalization/CFG change.
  • Changes must preserve determinism and avoid introducing new false positives.

Baseline & CI

Baseline contract (v2)

  • The baseline schema is versioned via meta.schema_version (see the Versioned schemas table below for the current version).
  • Compatibility/trust gates include schema_version, fingerprint_version, python_tag, and meta.generator.name.
  • Integrity is tamper-evident via meta.payload_sha256 over canonical payload.
  • The baseline may embed a metrics section for metrics-baseline-aware CI gating.
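
The tamper-evidence idea can be illustrated with a minimal sketch. The canonical form used here (sorted keys, compact separators, UTF-8) and the top-level "payload" key are assumptions made for illustration; the actual canonicalization rules live in the baseline code and may differ.

```python
import hashlib
import json
from typing import Any


def canonical_payload_sha256(payload: dict[str, Any]) -> str:
    """Hash a payload in a key-order-independent way."""
    # Assumed canonical form: sorted keys, compact separators, UTF-8 bytes.
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_untampered(baseline: dict[str, Any]) -> bool:
    """Compare the stored digest with a freshly computed one."""
    expected = baseline.get("meta", {}).get("payload_sha256")
    payload = baseline.get("payload", {})  # hypothetical key name
    return expected == canonical_payload_sha256(payload)
```

Because the digest is computed over a canonical serialization, reordering keys does not change it, while any edit to the payload content does.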

When baseline regeneration is required

  • Regenerate the baseline with codeclone . --update-baseline whenever fingerprint_version or python_tag changes.
  • Regeneration is not required for UI/report/CLI/cache/performance-only changes if both fingerprint_version and python_tag are unchanged.
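
The regeneration rule above reduces to a comparison of two fields. The sketch below uses a hypothetical BaselineIdentity type to make that explicit; it is not the project's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineIdentity:
    """Hypothetical pair of fields that decide baseline compatibility."""

    fingerprint_version: str
    python_tag: str


def needs_regeneration(stored: BaselineIdentity, current: BaselineIdentity) -> bool:
    # Regenerate only if one of the two identity fields differs;
    # UI, report, CLI, cache, or performance changes do not affect them.
    return (
        stored.fingerprint_version != current.fingerprint_version
        or stored.python_tag != current.python_tag
    )
```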

Gating behavior

  • In --ci (or explicit gating flags), untrusted baseline states fail fast as a contract error (exit 2).
  • Outside gating mode, an untrusted/missing baseline is ignored with a warning and comparison proceeds against an empty baseline.

Exit codes contract

  • 0 — success
  • 2 — contract error (e.g., missing/untrusted baseline in gating, invalid output path/extension, incompatible versions)
  • 3 — gating failure (new clones detected, --fail-threshold exceeded)
  • 5 — internal error (unexpected exception; please report)
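
As a rough illustration of how the gating rules and exit codes fit together, consider the sketch below. The enum member names and the resolve_exit_code helper are hypothetical. The sketch also simplifies gating failure to "new clones found" and assumes that a non-gating run exits with 0, which the contract above does not state explicitly.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    SUCCESS = 0
    CONTRACT_ERROR = 2
    GATING_FAILURE = 3
    INTERNAL_ERROR = 5


def resolve_exit_code(
    *, gating: bool, baseline_trusted: bool, new_clones: int
) -> ExitCode:
    # Gating mode fails closed: an untrusted baseline is a contract error.
    if gating and not baseline_trusted:
        return ExitCode.CONTRACT_ERROR
    # Gating mode with a trusted baseline fails on new clones.
    if gating and new_clones > 0:
        return ExitCode.GATING_FAILURE
    # Assumption: outside gating, findings are reported but do not fail the run.
    return ExitCode.SUCCESS
```

Checking the contract error before the gating failure mirrors the fail-fast behavior described for --ci: an untrusted baseline is rejected before any comparison result is considered.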

Versioned schemas

CodeClone maintains several versioned schema contracts:

Schema            Current version   Owner
Baseline          2.1               codeclone/baseline.py
Report            2.8               codeclone/report/json_contract.py
Cache             2.4               codeclone/cache_io.py
Metrics baseline  1.2               codeclone/metrics_baseline.py

Any change to schema shape or semantics requires version review, documentation, and tests.


MCP Interface

CodeClone includes an optional read-only MCP server (codeclone[mcp]) for AI agents.

When contributing to MCP:

  • MCP must remain read-only — it must never mutate baselines, source files, or repo state.
  • Session-local review markers are the only allowed mutable state (in-memory, ephemeral).
  • MCP reuses pipeline/report contracts — do not create a second analysis truth path.
  • Tool names, resource URIs, and response shapes are public surfaces — changes require tests and docs.

See docs/mcp.md and docs/book/20-mcp-interface.md for details.


GitHub Action

CodeClone ships a composite GitHub Action (.github/actions/codeclone/).

When contributing to the Action:

  • Never inline ${{ inputs.* }} in shell scripts — pass through env: variables.
  • Prefer major-tag pinning for actions (e.g., actions/setup-python@v5).
  • Add timeouts to all subprocess.run calls.

Development Setup

git clone https://github.com/orenlab/codeclone.git
cd codeclone
uv sync --all-extras --dev
uv run pre-commit install

Run tests:

uv run pytest

Static checks:

uv run pre-commit run --all-files

Build documentation (if you touched docs/ or mkdocs.yml):

uv run --with mkdocs --with mkdocs-material mkdocs build --strict

Run MCP tests (if you touched mcp_service.py or mcp_server.py):

uv run pytest -q tests/test_mcp_service.py tests/test_mcp_server.py

Commit Messages

Use the repository's existing Conventional Commits style:

  • format: type(scope): imperative summary
  • keep type lowercase (feat, fix, docs, chore, ...)
  • keep the summary short, imperative, and specific to the user-visible change
  • use a narrow scope when it helps (metrics, mcp, vscode, core,ci, ...)
  • split unrelated changes into separate commits instead of writing one broad summary

Examples from the current history:

  • fix(core,ci): harden git diff validation, make segment digests canonical, and align CI policy
  • feat(metrics): add adoption and public API baselines with compact schema-aware storage
  • chore(docs): align AGENTS and contract docs with current code

If a commit needs extra context, keep the subject line concise and explain the rest in the commit body.
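
To check a subject line against the type(scope): summary format locally, a small script like this can help. The regular expression is an approximation written for illustration: it accepts lowercase types, an optional comma-separated scope, and a non-empty summary, and it does not cover every Conventional Commits feature (such as the breaking-change marker).

```python
import re

# Approximate pattern for "type(scope): summary"; scope is optional.
SUBJECT_RE = re.compile(
    r"^(?P<type>[a-z]+)"
    r"(?:\((?P<scope>[a-z0-9_-]+(?:,[a-z0-9_-]+)*)\))?"
    r": (?P<summary>\S.*)$"
)


def is_conventional_subject(subject: str) -> bool:
    """Return True if the subject line matches the expected shape."""
    return SUBJECT_RE.match(subject) is not None
```

For example, is_conventional_subject("chore(docs): align AGENTS and contract docs with current code") returns True, while a subject with an uppercase type or a missing colon is rejected.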


Code Style

  • Python 3.10 – 3.14
  • Type annotations are required
  • Any should be minimized; prefer precise types and small typed helpers
  • mypy must pass
  • ruff check must pass
  • Code must be formatted with ruff format
  • Prefer explicit, readable logic over clever or implicit constructs
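
As an example of preferring precise types over Any, a dictionary-shaped value can be described with a TypedDict and handled by a small typed helper. The CloneGroup shape and its field names are hypothetical and are not taken from the CodeClone codebase.

```python
from typing import TypedDict


class CloneGroup(TypedDict):
    """Hypothetical record shape used only for illustration."""

    fingerprint: str
    files: list[str]


def group_size(group: CloneGroup) -> int:
    # mypy can verify the key and its type; with Any it could not.
    return len(group["files"])
```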

Versioning

CodeClone follows semantic versioning:

  • MAJOR: fundamental detection model changes
  • MINOR: new detection capabilities (e.g., new detectors or major CFG/normalization behavior shifts)
  • PATCH: bug fixes, performance improvements, and UI/UX polish

Any change that affects detection behavior must include documentation and tests, and may require a fingerprint_version bump (and thus baseline regeneration).


License

By contributing code to CodeClone, you agree that your contributions will be licensed under MPL-2.0.

Documentation contributions are licensed under MIT.