feat(lifecycle): add forget flag + TTL to v2 ingest API (#166)#228
feat(lifecycle): add forget flag + TTL to v2 ingest API (#166)#228chiragkhatri19 wants to merge 6 commits into
Conversation
Add src/schemas/memory_lifecycle.py with the forget/TTL boundary model (MemoryLifecycle + RESERVED_LIFECYCLE_KEYS) and src/pipelines/lifecycle.py with pure, side-effect-free helpers: - is_retrievable(metadata, now): retrieval-time gate; handles forgotten state, expired TTL, and legacy records (no lifecycle keys → always retrievable). - build_lifecycle_metadata(now, ttl_days, reason): stamps forget + expires_at onto a metadata dict for storage at v2 ingest time. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ough Weaver - src/config/settings.py: add memory_forget_default_ttl_days (default 30, configurable via MEMORY_FORGET_DEFAULT_TTL_DAYS env var). - src/pipelines/weaver.py: add extra_metadata=None to execute() and _execute_batched_vector(); inject _now callable for deterministic tests. extra_metadata is merged into vector metadata at both ADD sites (batched flush + _vector_add). Temporal, code, and snippet paths are unaffected. extra_metadata=None is a no-op — existing behaviour is byte-identical. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e_metadata
- src/api/schemas.py: add forget: bool = False to IngestRequest. Additive,
default off; accepted by v1 but only acted on in the v2 path.
- src/api/routes/v2/memory.py: when forget=true, compute lifecycle_metadata
{forget, expires_at, lifecycle_state} using memory_forget_default_ttl_days
and store in payload. Include forget in idempotency fields so a forget vs
non-forget request of the same content are treated as distinct jobs.
- src/api/routes/v2/activities.py: pass lifecycle_metadata from payload into
per-domain state dicts (profile, temporal, summary, image).
- src/pipelines/ingest.py: pass extra_metadata=state.get("lifecycle_metadata")
to weaver.execute() in the three vector-domain nodes (_node_extract_profile,
_node_extract_image, _node_extract_summary). Temporal, code, snippet deferred.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…h_summary + profile catalog Apply is_retrievable(metadata, now) in the two vector read seams: - _search_summary: expired/forgotten records excluded from summary results. - _fetch_profile_catalog: expired/forgotten records stripped from the profile catalog and cached records before the LLM sees them. Legacy records (no lifecycle keys) always pass through — no behavior change for any memory not ingested with forget=true. _now is injectable for tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ftest merge bug - tests/conftest.py: fix InMemoryVectorStore.update() to merge metadata instead of replacing it. Prod backends (SQLite local.py, Pinecone) merge; the old fake replaced, creating false-green test risk. A regression test (test_fake_update_merges_metadata) locks the fix. - tests/unit/test_memory_lifecycle.py: 13 unit tests covering is_retrievable (legacy/forgotten/TTL expired/live/no-expires_at), build_lifecycle_metadata, Weaver extra_metadata merge on ADD, extra_metadata=None no-op, and _search_summary filtering (expired hidden, non-forget unaffected). - tests/integration/test_weaver_pipeline.py: two new lifecycle round-trip tests — forget metadata persisted on every ADD; extra_metadata=None no-op. - tests/integration/test_ingest_pipeline.py: update RecordingWeaver.execute() to accept extra_metadata=None so existing ingest-pipeline tests keep passing. - CHANGELOG.md: document v2 forget flag + TTL. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a memory forget/TTL lifecycle feature for the v2 ingest API, allowing memories ingested with forget=true to be tagged with an expiration time and automatically filtered out at retrieval time. The feedback highlights several critical improvements: addressing a potential security vulnerability in weaver.py where system metadata fields could be overwritten by filtering extra_metadata with RESERVED_LIFECYCLE_KEYS; utilizing the new build_lifecycle_metadata helper in the API route to avoid duplication; making is_retrievable and metadata retrievals defensive against None values to prevent runtime crashes; and optimizing the datetime parsing logic in the lifecycle helper.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if extra_metadata: | ||
| meta.update(extra_metadata) |
There was a problem hiding this comment.
Merging extra_metadata directly into meta without sanitization or filtering can allow overwriting critical system metadata fields like user_id or domain. We should use the newly defined RESERVED_LIFECYCLE_KEYS constant to ensure only valid lifecycle fields are merged, preventing cross-user data contamination or privilege escalation.
| if extra_metadata: | |
| meta.update(extra_metadata) | |
| if extra_metadata: | |
| from src.schemas.memory_lifecycle import RESERVED_LIFECYCLE_KEYS | |
| meta.update({k: v for k, v in extra_metadata.items() if k in RESERVED_LIFECYCLE_KEYS}) |
| if extra_metadata: | ||
| metadata.update(extra_metadata) |
There was a problem hiding this comment.
Merging extra_metadata directly into metadata in _vector_add without filtering can allow overwriting critical system metadata fields like user_id or domain. We should filter extra_metadata using RESERVED_LIFECYCLE_KEYS to ensure only valid lifecycle fields are merged.
| if extra_metadata: | |
| metadata.update(extra_metadata) | |
| if extra_metadata: | |
| from src.schemas.memory_lifecycle import RESERVED_LIFECYCLE_KEYS | |
| metadata.update({k: v for k, v in extra_metadata.items() if k in RESERVED_LIFECYCLE_KEYS}) |
| if req.forget: | ||
| now = datetime.now(timezone.utc) | ||
| expires_at = now + timedelta(days=settings.memory_forget_default_ttl_days) | ||
| payload["lifecycle_metadata"] = { | ||
| "forget": True, | ||
| "expires_at": expires_at.isoformat(), | ||
| "lifecycle_state": "active", | ||
| } |
There was a problem hiding this comment.
Instead of manually constructing the lifecycle_metadata dictionary, use the newly introduced build_lifecycle_metadata helper function from src.pipelines.lifecycle. This avoids code duplication and ensures that the created_at and updated_at fields are consistently populated on ingested records as designed.
if req.forget:
from src.pipelines.lifecycle import build_lifecycle_metadata
now = datetime.now(timezone.utc)
payload["lifecycle_metadata"] = build_lifecycle_metadata(
now=now,
ttl_days=settings.memory_forget_default_ttl_days,
)| def is_retrievable(metadata: Mapping[str, Any], now: datetime) -> bool: | ||
| """Return True when a record should appear in retrieval results. | ||
|
|
||
| Rules (applied in order): | ||
| 1. ``lifecycle_state == "forgotten"`` → hidden (manual soft-forget). | ||
| 2. ``forget is True`` and ``expires_at`` is present and in the past → hidden (TTL expired). | ||
| 3. Everything else (including all legacy records with no lifecycle keys) → retrievable. | ||
|
|
||
| Missing keys default to the legacy-safe value so records stored before | ||
| lifecycle was introduced are never hidden. | ||
| """ | ||
| if metadata.get("lifecycle_state", "active") == "forgotten": | ||
| return False |
There was a problem hiding this comment.
The is_retrievable function should be defensive against metadata being None or not a mapping. If metadata is None, calling metadata.get will raise an AttributeError. Adding a guard clause to return True (treating it as a legacy-safe active record) prevents potential runtime crashes.
| def is_retrievable(metadata: Mapping[str, Any], now: datetime) -> bool: | |
| """Return True when a record should appear in retrieval results. | |
| Rules (applied in order): | |
| 1. ``lifecycle_state == "forgotten"`` → hidden (manual soft-forget). | |
| 2. ``forget is True`` and ``expires_at`` is present and in the past → hidden (TTL expired). | |
| 3. Everything else (including all legacy records with no lifecycle keys) → retrievable. | |
| Missing keys default to the legacy-safe value so records stored before | |
| lifecycle was introduced are never hidden. | |
| """ | |
| if metadata.get("lifecycle_state", "active") == "forgotten": | |
| return False | |
| def is_retrievable(metadata: Optional[Mapping[str, Any]], now: datetime) -> bool: | |
| """Return True when a record should appear in retrieval results. | |
| Rules (applied in order): | |
| 1. ``lifecycle_state == "forgotten"`` → hidden (manual soft-forget). | |
| 2. ``forget is True`` and ``expires_at`` is present and in the past → hidden (TTL expired). | |
| 3. Everything else (including all legacy records with no lifecycle keys) → retrievable. | |
| Missing keys default to the legacy-safe value so records stored before | |
| lifecycle was introduced are never hidden. | |
| """ | |
| if not metadata: | |
| return True | |
| if metadata.get("lifecycle_state", "active") == "forgotten": | |
| return False |
| expires_at = datetime.fromisoformat(str(expires_raw)) | ||
| # Make both sides timezone-aware or both naive for comparison | ||
| if expires_at.tzinfo is None and now.tzinfo is not None: | ||
| from datetime import timezone | ||
| expires_at = expires_at.replace(tzinfo=timezone.utc) | ||
| elif expires_at.tzinfo is not None and now.tzinfo is None: | ||
| expires_at = expires_at.replace(tzinfo=None) | ||
| if expires_at < now: | ||
| return False | ||
| except (ValueError, TypeError): | ||
| pass |
There was a problem hiding this comment.
If expires_raw is already a datetime object (which can happen depending on the vector store backend or deserialization), calling str(expires_raw) and then datetime.fromisoformat is highly inefficient and redundant. We should check if it is already a datetime instance before parsing.
| expires_at = datetime.fromisoformat(str(expires_raw)) | |
| # Make both sides timezone-aware or both naive for comparison | |
| if expires_at.tzinfo is None and now.tzinfo is not None: | |
| from datetime import timezone | |
| expires_at = expires_at.replace(tzinfo=timezone.utc) | |
| elif expires_at.tzinfo is not None and now.tzinfo is None: | |
| expires_at = expires_at.replace(tzinfo=None) | |
| if expires_at < now: | |
| return False | |
| except (ValueError, TypeError): | |
| pass | |
| if isinstance(expires_raw, datetime): | |
| expires_at = expires_raw | |
| else: | |
| expires_at = datetime.fromisoformat(str(expires_raw)) | |
| # Make both sides timezone-aware or both naive for comparison | |
| if expires_at.tzinfo is None and now.tzinfo is not None: | |
| from datetime import timezone | |
| expires_at = expires_at.replace(tzinfo=timezone.utc) | |
| elif expires_at.tzinfo is not None and now.tzinfo is None: | |
| expires_at = expires_at.replace(tzinfo=None) | |
| if expires_at < now: | |
| return False | |
| except (ValueError, TypeError): | |
| pass |
| for r in results: | ||
| if not is_retrievable(r.metadata, now): | ||
| continue | ||
| live_results.append(r) | ||
| main_content = r.metadata.get("main_content", "") | ||
| if not main_content or main_content in seen: | ||
| continue |
There was a problem hiding this comment.
If r.metadata is None (which is possible in some backends or mock tests, and is now handled safely by is_retrievable), calling r.metadata.get("main_content", "") on line 525 will raise an AttributeError. We should make this access defensive.
| for r in results: | |
| if not is_retrievable(r.metadata, now): | |
| continue | |
| live_results.append(r) | |
| main_content = r.metadata.get("main_content", "") | |
| if not main_content or main_content in seen: | |
| continue | |
| for r in results: | |
| if not is_retrievable(r.metadata, now): | |
| continue | |
| live_results.append(r) | |
| main_content = r.metadata.get("main_content", "") if r.metadata else "" | |
| if not main_content or main_content in seen: | |
| continue |
|
| Filename | Overview |
|---|---|
| src/pipelines/lifecycle.py | New pure helper module — is_retrievable() handles naive/aware datetime normalization and legacy-safe defaults; build_lifecycle_metadata() builds the full TTL dict. Well-tested with 13 unit tests. |
| src/pipelines/weaver.py | extra_metadata is sanitized once at execute() and forwarded through every write path (batched ADD, UPDATE, UPDATE→ADD fallback); previously reported UPDATE regression is now fixed. Weaver._now is stored but never called — unused seam. |
| src/pipelines/retrieval.py | is_retrievable() filter correctly applied in _search_summary and _fetch_profile_catalog; _search_profile inherits filtering via cached live_results. Docstring for _fetch_profile_catalog still says 'raw_results — the full SearchResult list', which is now stale. |
| src/api/routes/v2/memory.py | build_lifecycle_metadata() called before idempotency check; batch-ingest guard added correctly. Inline lifecycle dict in the route still diverges from build_lifecycle_metadata() (flagged in prior review). |
| src/schemas/memory_lifecycle.py | New schema with RESERVED_LIFECYCLE_KEYS and PROTECTED_METADATA_KEYS constants; clean separation of lifecycle schema from denylist logic. RESERVED_LIFECYCLE_KEYS is exported but not imported by any changed file — currently unused. |
| tests/conftest.py | InMemoryVectorStore.update() now merges metadata (matching prod backends) instead of replacing — this was a false-green risk for lifecycle tests; a dedicated unit test locks the fix. |
| tests/integration/test_weaver_pipeline.py | 10 new integration tests cover ADD/UPDATE/batch/fallback paths, protected-key denylist, unknown-key passthrough, and the full expired→hidden round-trip via RetrievalPipeline. |
| src/pipelines/ingest.py | lifecycle_metadata threaded into profile, summary, and image nodes via extra_metadata=state.get('lifecycle_metadata'); temporal/code/snippet nodes remain out of scope per PR description. |
| src/api/routes/v2/activities.py | lifecycle_metadata extracted once from payload and threaded into all four domain node dicts (profile, temporal, summary, image); temporal will silently ignore it until a follow-up PR, which is intentional. |
Sequence Diagram
sequenceDiagram
participant Client
participant Route as v2/memory.py
participant Activity as activities.py
participant Pipeline as ingest.py
participant Weaver as weaver.py
participant Store as VectorStore
Client->>Route: "POST /v2/memory/ingest (forget=true)"
Route->>Route: build_lifecycle_metadata(now, ttl_days)
Route->>Activity: enqueue job with lifecycle_metadata
Activity->>Pipeline: "_node_extract_*(state with lifecycle_metadata)"
Pipeline->>Weaver: execute(judge_result, domain, user_id, extra_metadata)
Weaver->>Weaver: sanitize PROTECTED_METADATA_KEYS
Weaver->>Store: add/update with forget+expires_at merged
Note over Client,Store: Retrieval time
Client->>Pipeline: retrieve(query)
Pipeline->>Store: search results
Store-->>Pipeline: raw results
Pipeline->>Pipeline: is_retrievable(metadata, now())
Pipeline-->>Client: filtered results
Reviews (2): Last reviewed commit: "fix(lifecycle): patch UPDATE data-leak, ..." | Re-trigger Greptile
|
Hi @chiragkhatri19 have a look on the suggestions |
…en lifecycle guards Addresses P1 data-leak (Greptile) and two Gemini security-high findings from PR XortexAI#228 review. `_vector_update` + batched UPDATE path now carry `extra_metadata` so forget/TTL is preserved when the Judge emits an UPDATE op. `extra_metadata` is sanitized once at `execute()` via a PROTECTED_METADATA_KEYS denylist so callers cannot overwrite `user_id`/`domain`. Route refactored to use `build_lifecycle_metadata()`; batch ingest rejects `forget=true` with HTTP 400 (was a silent no-op on a privacy path). Defensive guards added to `is_retrievable(None)` and datetime parse. 15 new regression + route tests; 92/92 passing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Review remediation — commit
|
Summary
Closes part of #166 (memory lifecycle — PR #1 of a planned series).
This PR ships the forget + TTL feature on the v2 ingest path, as agreed with @ishaanxgupta before opening. Versioning and decay are the follow-up PRs.
What this adds
forget: boolflag onPOST /v2/memory/ingest— whentrue, the stored memory is tagged withforget=trueandexpires_at = now + TTL(default 30 days, configurable viaMEMORY_FORGET_DEFAULT_TTL_DAYS)._search_summaryand the profile catalog. Legacy records (no lifecycle keys) are never affected.Architecture
v2andv1share the sameIngestPipelineandWeaver; the v2 layer is only the route + Temporal workflow/activities. The feature is wired additively viaextra_metadata=Noneall the way down — v1 callers and any ingest withoutforget=trueare byte-identical to before.Files changed
src/api/schemas.pyforget: bool = FalsetoIngestRequestsrc/api/routes/v2/memory.pylifecycle_metadatawhenforget=true; include in idempotencysrc/api/routes/v2/activities.pylifecycle_metadatainto domain/pipeline activity statesrc/pipelines/ingest.pyextra_metadatatoweaver.execute()in profile/image/summary nodessrc/pipelines/weaver.pyexecute(extra_metadata=None)merged at both ADD sites;_nowinjectablesrc/pipelines/retrieval.pyis_retrievablefilter in_search_summary+_fetch_profile_catalogsrc/pipelines/lifecycle.pyis_retrievable(meta, now)+build_lifecycle_metadata()src/schemas/memory_lifecycle.pyMemoryLifecyclePydantic model +RESERVED_LIFECYCLE_KEYSsrc/config/settings.pymemory_forget_default_ttl_days: float = 30.0tests/conftest.pyInMemoryVectorStore.update()to merge metadata (not replace) — prod backends merge; fake was replacing, a false-green risktests/unit/test_memory_lifecycle.pytests/integration/test_weaver_pipeline.pytests/integration/test_ingest_pipeline.pyRecordingWeaver.execute()signatureCHANGELOG.mdTest plan
ruff check— zero errors on all changed filesuv run python -m pytest tests/unit/test_memory_lifecycle.py tests/integration/test_weaver_pipeline.py tests/integration/test_ingest_pipeline.py tests/e2e/test_memory_flow.py -v # → 25 passedOn versioning (answering @ishaanxgupta's question)
Yes — the
Weaver._vector_updateseam is exactly right. That is the one place the pipeline currently overwrites and loses prior memory content. The plan for PR #2: before the overwrite, read the existing record, write a new version record withversion = old_version + 1andparent_memory_id, then flip the old record tois_current = False. Readers pickmax(version)amongis_currentrecords. No cross-store transaction needed.Out of scope (follow-up PRs)
Weaver._vector_updateparent lineage)🤖 Generated with Claude Code