Skip to content

feat: add CelestialBody ingestion layer to A2A handoff server#73

Open
Igor Holt (igor-holt) wants to merge 1 commit into
mainfrom
codex/document-test-results-for-yennefer-sync
Open

feat: add CelestialBody ingestion layer to A2A handoff server#73
Igor Holt (igor-holt) wants to merge 1 commit into
mainfrom
codex/document-test-results-for-yennefer-sync

Conversation

@igor-holt
Copy link
Copy Markdown
Member

Motivation

  • Provide an A2A ingestion pipeline to convert a raw repository into a standardized CelestialBody JSON object for downstream Yennefer/hand-off workflows.
  • Surface simple heuristics (mass, atmosphere, gravity, orbital_state) and a deterministic "seismic test" so external agents can validate repository invariants before handoff.
  • Expose ingestion over the existing A2A service so other agents can programmatically register or inspect repository artifacts.

Description

  • Added genesis-q-mem/celestial_ingestion.py containing a CelestialBody dataclass and ingestion pipeline that infers language counts, LOC, summary, mass, atmosphere, gravity, orbital_state, and a seismic_test (schema check + deterministic hash citation).
  • Added POST /api/a2a/ingest to genesis-q-mem/a2a_handoff_server.py which validates input (repo_path), calls ingest_repository, and returns the celestial_body payload or a 400 error on invalid input.
  • Integrated ingest_repository and IngestionError into the A2A server and kept all existing endpoints unchanged.
  • Added unit tests at genesis-q-mem/tests/test_celestial_ingestion.py to verify successful ingestion and missing-path error behavior.

Testing

  • Ran the unit tests with cd genesis-q-mem && pytest -q tests/test_celestial_ingestion.py and the final run reported 2 passed (tests cover successful ingestion and missing-path rejection).
  • An initial test collection error due to import path was observed and resolved in the test file (adjusted sys.path), after which automated tests passed.
  • No other automated test suites were modified; the A2A service continues to listen on port 8200 and only the single new endpoint POST /api/a2a/ingest was added.

Codex Task

@cloudflare-workers-and-pages
Copy link
Copy Markdown

cloudflare-workers-and-pages Bot commented Apr 10, 2026

Deploying with  Cloudflare Workers  Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status Name Latest Commit Updated (UTC)
❌ Deployment failed
View logs
yennefer c22db0a Apr 10 2026, 12:36 AM

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c22db0ab2a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "Codex (@codex) review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "Codex (@codex) address that feedback".

seismic_test={},
)
payload = body.to_dict()
payload["seismic_test"] = _run_seismic_test(payload)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Compute seismic citation from the returned payload

The citation is generated before seismic_test is populated (payload["seismic_test"] is still {} at hash time), and then the payload is mutated with the final seismic block. As a result, consumers cannot validate the returned object by hashing the response body: the advertised deterministic citation will always disagree with the final payload content, breaking downstream verification workflows.

Useful? React with 👍 / 👎.

mass = _compute_mass(file_count=file_count, loc=loc)
gravity = _compute_gravity(mass, atmosphere, language_counts)
orbital_state = {
"phase": metadata.get("phase", "stable"),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Validate metadata type before field access

ingest_repository assumes metadata is a mapping and calls .get(...), but the new /api/a2a/ingest endpoint accepts arbitrary JSON values for metadata. If a client sends a non-object (for example a string or array), this raises AttributeError and escapes as a 500 because the route only catches IngestionError; invalid client input should be rejected as 400 instead of crashing the request.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new API endpoint for repository ingestion and adds utility functions to process repository metadata into CelestialBody objects. My review highlights a critical path traversal vulnerability in the API endpoint, as user-provided paths are not validated. Additionally, I have suggested optimizing the line counting logic to avoid loading entire files into memory, simplifying the redundant hash calculation in the seismic test, and refactoring the filesystem traversal to reduce I/O overhead.

def ingest_celestial_body():
"""Convert a raw repository path into a CelestialBody payload."""
data = request.json or {}
repo_path = data.get("repo_path")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

The repo_path is taken directly from user input and used to access the filesystem without validation. This presents a significant security risk as it allows for Path Traversal attacks. An attacker could provide paths like /etc/passwd or other sensitive directories, which the server would then attempt to process and potentially expose through the ingestion metadata. You should validate that the provided path is within a designated, safe base directory.

continue
try:
if file.suffix.lower() in CODE_EXTENSIONS:
total += len(file.read_text(encoding="utf-8", errors="ignore").splitlines())
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using file.read_text().splitlines() loads the entire content of every source file into memory at once. For repositories containing very large files, this can lead to excessive memory consumption or Out-Of-Memory (OOM) errors. It is more efficient to iterate over the file object to count lines.

Suggested change
total += len(file.read_text(encoding="utf-8", errors="ignore").splitlines())
with file.open("r", encoding="utf-8", errors="ignore") as f:
total += sum(1 for _ in f)

Comment on lines +112 to +114
digest_a = hashlib.sha256(canonical.encode()).hexdigest()
digest_b = hashlib.sha256(canonical.encode()).hexdigest()
invariant = digest_a == digest_b
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Calculating the same SHA-256 hash twice (digest_a and digest_b) from the same input and then comparing them is redundant. This does not provide a meaningful "invariance" check or stress test. If the intent is to provide a deterministic citation for the payload, a single calculation is sufficient.

Suggested change
digest_a = hashlib.sha256(canonical.encode()).hexdigest()
digest_b = hashlib.sha256(canonical.encode()).hexdigest()
invariant = digest_a == digest_b
digest_a = hashlib.sha256(canonical.encode()).hexdigest()
invariant = True

Comment on lines +131 to +133
language_counts = _infer_language_counts(root)
file_count = sum(language_counts.values())
loc = _line_count(root)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The repository is traversed twice: once in _infer_language_counts and again in _line_count. This doubles the I/O overhead, which can be significant for large repositories. Consider refactoring these into a single pass over the filesystem that collects both language statistics and line counts simultaneously.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a repository “ingestion” pipeline that infers heuristic metadata and exposes it via a new A2A API endpoint for downstream handoff workflows.

Changes:

  • Introduces CelestialBody + ingest_repository() to derive language/LOC signals, heuristics, and a deterministic “seismic test” hash.
  • Adds POST /api/a2a/ingest to call ingestion and return a celestial_body payload (or 400 on invalid input).
  • Adds pytest coverage for successful ingestion and missing-path behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 10 comments.

File Description
genesis-q-mem/celestial_ingestion.py Implements CelestialBody model + repository ingestion heuristics and seismic test output.
genesis-q-mem/a2a_handoff_server.py Adds a new API endpoint that validates input and returns ingestion results.
genesis-q-mem/tests/test_celestial_ingestion.py Adds unit tests for ingestion success and error handling.

class CelestialBody:
body_id: str
name: str
source_path: str
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ingestion payload includes source_path as an absolute resolved server filesystem path. Since this payload is returned by the API, it can unintentionally disclose server directory structure. Consider omitting source_path from the returned payload, or returning only a sanitized identifier (e.g., repo basename) or a path relative to a configured allowed base directory.

Copilot uses AI. Check for mistakes.
Comment on lines +148 to +151
body = CelestialBody(
body_id=body_id,
name=body_name,
source_path=str(root),
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ingestion payload includes source_path as an absolute resolved server filesystem path. Since this payload is returned by the API, it can unintentionally disclose server directory structure. Consider omitting source_path from the returned payload, or returning only a sanitized identifier (e.g., repo basename) or a path relative to a configured allowed base directory.

Copilot uses AI. Check for mistakes.
Comment on lines +250 to +254
@app.route('/api/a2a/ingest', methods=['POST'])
def ingest_celestial_body():
"""Convert a raw repository path into a CelestialBody payload."""
data = request.json or {}
repo_path = data.get("repo_path")
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This endpoint accepts an arbitrary repo_path and passes it to ingest_repository, which walks the filesystem and reads files (README + code files). Without an allowlist/sandbox, a client can point this at any readable directory on the server, causing unintended data exposure. Recommend restricting ingestion to a configured base directory (and verifying the resolved path stays within it), or requiring a pre-registered repo identifier instead of a raw filesystem path.

Copilot uses AI. Check for mistakes.
Comment on lines +260 to +261
try:
body = ingest_repository(repo_path, metadata)
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This endpoint accepts an arbitrary repo_path and passes it to ingest_repository, which walks the filesystem and reads files (README + code files). Without an allowlist/sandbox, a client can point this at any readable directory on the server, causing unintended data exposure. Recommend restricting ingestion to a configured base directory (and verifying the resolved path stays within it), or requiring a pre-registered repo identifier instead of a raw filesystem path.

Copilot uses AI. Check for mistakes.
Comment on lines +253 to +255
data = request.json or {}
repo_path = data.get("repo_path")
metadata = data.get("metadata", {})
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

metadata is assumed to be a dict but isn’t validated. If a client sends a non-object JSON value (e.g., string/list/null) for metadata, ingest_repository will later call metadata.get(...) and throw AttributeError, returning a 500 instead of a 400. Add input validation here (ensure metadata is a dict, otherwise return 400) and/or defensive validation inside ingest_repository.

Copilot uses AI. Check for mistakes.

if not repo_path:
return jsonify({"error": "repo_path is required"}), 400

Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

metadata is assumed to be a dict but isn’t validated. If a client sends a non-object JSON value (e.g., string/list/null) for metadata, ingest_repository will later call metadata.get(...) and throw AttributeError, returning a 500 instead of a 400. Add input validation here (ensure metadata is a dict, otherwise return 400) and/or defensive validation inside ingest_repository.

Suggested change
if not isinstance(metadata, dict):
return jsonify({"error": "metadata must be an object"}), 400

Copilot uses AI. Check for mistakes.
@app.route('/api/a2a/ingest', methods=['POST'])
def ingest_celestial_body():
"""Convert a raw repository path into a CelestialBody payload."""
data = request.json or {}
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using request.json can raise a BadRequest exception for invalid JSON with an application/json content-type, which would bypass this handler and return a default error. Prefer request.get_json(silent=True) or {} so you can consistently return a structured 400 response for invalid bodies.

Copilot uses AI. Check for mistakes.
def _run_seismic_test(payload: dict[str, Any]) -> dict[str, Any]:
canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
digest_a = hashlib.sha256(canonical.encode()).hexdigest()
digest_b = hashlib.sha256(canonical.encode()).hexdigest()
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The seismic test hashes the same canonical string twice, so invariant will always be true unless there’s an internal hashing failure. This is redundant work and doesn’t add validation value. Consider computing a single digest and basing the test on that, or (if the intent is to detect non-determinism) re-canonicalize from independently re-materialized data rather than re-hashing the same bytes.

Suggested change
digest_b = hashlib.sha256(canonical.encode()).hexdigest()
rematerialized = json.loads(canonical)
canonical_rematerialized = json.dumps(rematerialized, sort_keys=True, separators=(",", ":"))
digest_b = hashlib.sha256(canonical_rematerialized.encode()).hexdigest()

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +6
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).resolve().parents[1]))

from celestial_ingestion import IngestionError, ingest_repository
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modifying sys.path in the test file is a brittle workaround and can cause import-order issues as the test suite grows. Prefer configuring imports via packaging (e.g., making genesis-q-mem importable) or pytest configuration (e.g., pythonpath/PYTHONPATH in a pytest.ini/conftest.py) so tests don’t need to mutate global interpreter state.

Suggested change
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
from celestial_ingestion import IngestionError, ingest_repository
import importlib.util
from pathlib import Path
_MODULE_PATH = Path(__file__).resolve().parents[1] / "celestial_ingestion.py"
_SPEC = importlib.util.spec_from_file_location("celestial_ingestion", _MODULE_PATH)
_MODULE = importlib.util.module_from_spec(_SPEC)
assert _SPEC is not None and _SPEC.loader is not None
_SPEC.loader.exec_module(_MODULE)
IngestionError = _MODULE.IngestionError
ingest_repository = _MODULE.ingest_repository

Copilot uses AI. Check for mistakes.
Comment on lines +27 to +32
def test_ingest_repository_rejects_missing_path():
try:
ingest_repository("/tmp/does-not-exist-123")
assert False, "Expected IngestionError"
except IngestionError:
assert True
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is likely to be flaky because it hardcodes a path under /tmp that could exist on some systems. Also, the try/except + assert True/False pattern is harder to read and can hide unexpected exceptions. Prefer using a tmp-based guaranteed-missing path (e.g., from tmp_path) and pytest.raises(IngestionError).

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants