feat(#752): Embed LLM traces in sync log entries#784
feat(#752): Embed LLM traces in sync log entries#784niti-go wants to merge 13 commits intopromptdriven:mainfrom
Conversation
2f29976 to
da5810a
Compare
…n#752) - Move sync log from .pdd/meta/ to .pdd/logs/ (gitignored); migrate legacy files on both read and write inside get_log_path() - Store all LLM prompt/response pairs per operation in llm_traces list (was: only last pair, only on failure) - Add thinking field to LLMTracePair; normalize list-of-objects, int, and None inputs to string-or-null before storing - Rename pop_last_pair to pop_all_pairs (backward-compat alias kept) - Pass thinking from LiteLLM responses to record_llm_pair at all 3 call sites in llm_invoke.py - Truncate prompt/response/thinking at 20k chars; redact Bearer tokens, api_key, token, secret, password from all text fields - Add agentic_trace pointer (session_file, provider, format) to log entries when a CLI agent runs; replaces inlined session content - Add session discovery for Claude (construct path from session_id), Gemini (glob by first-8-chars of UUID via projects.json slug), and Codex (filesystem snapshot-and-diff of rollout files) - Session discovery is best-effort: never raises, never warns, returns None on any failure, logs at DEBUG only - Remove _log_agentic_interaction, AGENTIC_LOG_DIR, _AGENTIC_SESSION_ID from agentic_common.py; add .pdd/agentic-logs/ to .gitignore - Core dump no longer embeds *_sync.log in file_contents; adds sync_log_refs array and reads sync_steps directly from disk - _display_sync_log (--dry-run) now uses get_log_path() instead of hardcoded .pdd/meta/ path - Add .pdd/meta/*_sync.log, .pdd/logs/, .pdd/agentic-logs/ to .gitignore - Add 82 Tier 1 mocked tests (sections A-Q) in test_llm_traces.py - Add Tier 2 E2E tests (sections S-Y) in test_llm_traces_e2e.py, gated by PDD_RUN_REAL_LLM_TESTS and PDD_RUN_AGENTIC_TESTS env vars - Update existing tests for new log path and 4-tuple return type Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The test mocked pop_last_pair but sync_orchestration now calls pop_all_pairs and writes traces to llm_traces (top-level list) instead of details.llm_trace (single pair). Update the mock and assertion to match. Same regression coverage, new API shape. This test was already failing before promptdriven#752 (the drain call on line 2170 consumed the mock before the attach call could use it). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Merge test_agentic_sync_writes_llm_traces, test_format_matches_provider, test_claude_path_from_session_id, and test_gemini_path_by_prefix into a single test_agentic_sync_clean_path that runs one pdd sync --agentic and checks everything: llm_traces on generate, agentic_trace structure, format/extension match, provider-specific path patterns, and session file exists on disk with content. Cuts agentic clean-path tests from 5 separate syncs to 1 per provider. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
test_test_operation_has_traces failed because sync reused existing tests (cost 0, no LLM calls) instead of generating new ones. Only assert llm_traces on entries that actually called an LLM. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The test operation on this minimal greeter project always has cost 0 and no LLM calls — sync runs existing tests rather than generating new ones. The guarded assertion never executed, making it a vacuous pass. Coverage is already provided by test_generate_writes_traces (verifies llm_traces on a real LLM-calling operation) and test_every_llm_operation_has_traces (catches any operation with LLM calls that fails to capture traces). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Merge 5 separate LiteLLM tests (generate traces, example traces, model field, no log in meta, fingerprint, orphaned calls) into one test_litellm_sync_traces that runs a single pdd sync. Merge skip, dry-run, and core-dump tests into one test_skip_dryrun_coredump that syncs twice then checks all three. Total LLM sync runs drops from 11 to 5. Same assertions, less cost. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LiteLLM trace tests (sync traces, skip/dry-run/core-dump, migration) now live in test_sync_orchestration.py alongside the other sync tests, gated by PDD_RUN_REAL_LLM_TESTS=1. This matches the existing pattern where that file already has an @pytest.mark.e2e test. test_llm_traces_e2e.py now only contains agentic provider-parametrized tests (Claude/Gemini/Codex), gated by PDD_RUN_AGENTIC_TESTS=1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…cation The crash tests tried to force a "crash" operation by breaking code after initial sync, but operation classification is unreliable in this context. The clean-path test already covers the actual feature (traces in sync log). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…on, stale trace leak - Add JSON-quoted-value redaction pattern so "api_key": "sk-..." in LLM traces gets scrubbed (the old regex stopped at quote characters) - Change _last_agentic_trace from plain global to ContextVar for thread safety, matching llm_trace.py's existing pattern - Add pre-operation drain of stale agentic traces in sync_orchestration so an unconsumed trace from a prior exception path can't leak forward - Add defensive newline guard in append migration so JSONL lines don't merge when the target file lacks a trailing newline - Remove unused pop_last_pair import from sync_orchestration - Rewrite Sections C, D, and I.test_failed_discovery tests to exercise actual production code (discovery functions, sync_orchestration) instead of asserting on hand-built dict literals Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… imports, destructive alias
- Replace shutil.move with os.rename (atomic on POSIX) for the fast
migration path; handle FileNotFoundError/OSError for concurrent access.
Append path uses a single r+b file handle to check trailing newline and
write in one shot, eliminating TOCTOU between separate read and append.
- Change _pairs_by_operation ContextVar default from {} (shared mutable)
to None; all call sites already use `or {}` copy pattern.
- Move two mid-file `import logging as ...` statements to a single
top-level `import logging` in agentic_common.py.
- Make pop_last_pair non-destructive: peek at last pair instead of
draining all pairs via pop_all_pairs, so backward-compat callers
don't silently destroy the trace list.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The comment said .pdd/meta was used for "logs and locks". Since sync logs moved to .pdd/logs/, clarify that meta holds fingerprints and run reports while logs/ holds the sync logs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The .pdd directory listing said meta/ held sync logs. Since promptdriven#752 moved them to .pdd/logs/ (gitignored), update the listing and the version control note accordingly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Greptile SummaryThis PR embeds LLM traces (all prompt/response pairs per operation) and agentic session pointers directly into sync log entries, and moves the sync log from Confidence Score: 5/5Safe to merge; all remaining findings are minor style or edge-case quality issues with no impact on the primary sync path. The implementation is thorough and well-tested (90 mocked + several real-LLM tests). All three findings are P2: an empty-list thinking normalisation edge case, a cosmetic redaction UX issue, and a partial-write recovery path that silently continues. None affect correctness under normal operation. pdd/core/llm_trace.py (empty-list thinking normalizes to empty string instead of None), pdd/operation_log.py (silent swallow of write errors in slow-path migration).
|
| Filename | Overview |
|---|---|
| pdd/core/llm_trace.py | Accumulates all LLM pairs per operation instead of keeping only the last; adds thinking normalization and JSON-quoted redaction. Two minor issues: empty-list thinking normalizes to "" instead of None; new redaction pattern strips key names alongside values. |
| pdd/operation_log.py | Moves sync log from .pdd/meta/ to .pdd/logs/ with best-effort migration. Atomic rename fast path is correct; slow-path append silently swallows write errors without resetting to a clean state, risking JSONL corruption on retry. |
| pdd/agentic_common.py | Adds session discovery for Claude/Gemini/Codex, ContextVar-based _last_agentic_trace, replaces _log_agentic_interaction with the new trace pointer scheme. All failure paths return None and never raise. 4-tuple return is consistent throughout. |
| pdd/sync_orchestration.py | Attaches llm_traces and agentic_trace to log entries on both success and failure; pre-op drain correctly flushes stale traces. Import of get_last_agentic_trace is guarded by ImportError for environments without agentic support. |
| pdd/core/dump.py | Replaces file_contents-based sync log embedding with direct disk reads; adds sync_log_refs. Correctly deduplicates logs_dir vs meta_dir entries and strips trace data from step records. |
| pdd/llm_invoke.py | Passes thinking output to record_llm_pair at all three call sites (cloud, litellm, retry); best-effort extraction wrapped in try/except at each site. |
Sequence Diagram
sequenceDiagram
participant SO as sync_orchestration
participant LT as llm_trace (ContextVar)
participant LI as llm_invoke
participant AC as agentic_common
participant OL as operation_log
SO->>LT: set_current_operation(op)
SO->>LT: pop_all_pairs(op) [drain stale]
SO->>AC: get_last_agentic_trace() [drain stale]
SO->>LI: code_generator_main / test / fix
LI->>LT: record_llm_pair(prompt, response, model, thinking)
Note over LT: accumulated in _pairs_by_operation[op]
SO->>AC: run_agentic_task(provider, ...)
AC->>AC: _run_with_provider()
AC->>AC: _discover_session()
AC->>LT: _last_agentic_trace.set(trace)
SO->>LT: pop_all_pairs(op) -> llm_traces list
SO->>AC: get_last_agentic_trace() -> agentic_trace
SO->>OL: append_log_entry(entry + llm_traces + agentic_trace)
OL->>OL: get_log_path() -> .pdd/logs/ (migrates legacy .pdd/meta/)
Prompt To Fix All With AI
This is a comment left during a code review.
Path: pdd/core/llm_trace.py
Line: 75-86
Comment:
**Empty-list thinking normalizes to `""` instead of `None`**
When `thinking` is `[]` or a list where every dict has empty `"thinking"`/`"text"`/`"content"`, `_normalize_thinking` returns `"\n".join([])` = `""`. Back in `record_llm_pair`, the guard `if thinking_text is not None` is satisfied by `""`, so `LLMTracePair.thinking` is stored as an empty string rather than `None`. Callers doing `thinking is None` to mean "absent" will silently get `False` and treat it as a non-null field with no content.
```suggestion
return "\n".join(parts) or None
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: pdd/core/llm_trace.py
Line: 30
Comment:
**JSON-quoted redaction drops the key name**
The new pattern replaces the full `"api_key": "sk-..."` span with `<redacted>`, erasing the key name. A prompt snippet like `{"api_key": "sk-abc...", "model": "gpt-4"}` becomes `{<redacted>, "model": "gpt-4"}`, which gives developers no hint that a secret key field was present. Consider preserving the key while only scrubbing the value by using a replacement like `"\1": "<redacted>"`.
This is only a usability concern — the redaction itself is secure either way.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: pdd/operation_log.py
Line: 83-102
Comment:
**Partial write leaves `new_path` corrupted without clearing `old_path`**
If `f.write(old_content)` fails mid-write (e.g., disk full), the broad `except Exception: pass` swallows the error and `old_path.unlink()` never runs. On the next call to `get_log_path`, `old_path` still exists and `new_path` exists with a partial/corrupted tail, so the slow path fires again and appends `old_content` a second time. The partial first write can corrupt the JSONL line at the boundary, not merely duplicate it.
The doc comment acknowledges that "the worst case is a duplicate append (idempotent JSONL lines)" — but that covers the concurrent-process race, not the partial-write case. Logging the failure at DEBUG level (without re-raising) would at least surface the issue without breaking the workflow.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "Update README: sync logs moved from .pdd..." | Re-trigger Greptile
- Update test_issue_902 mock return values to 4-tuple (added agentic_trace) - Update test_e2e_issue_894 unpacking to 4-tuple for _run_with_provider - Fix _normalize_thinking returning "" instead of None for empty lists - Preserve key names in JSON-quoted secret redaction - Log migration failures at DEBUG instead of silently swallowing Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gltanaka
left a comment
There was a problem hiding this comment.
Thanks for taking this on. The motivation here is absolutely useful: having better traces around agentic and LLM-driven workflows would make prompt iteration and debugging much easier.
I am going to request changes because I think this PR currently solves a different problem than #752, and it also removes some existing debug coverage that we still need.
Issue #752 asks for compact, standardized agentic trace files under .pdd/agentic-traces/, with provider-specific raw logs distilled into a common, redacted format. This PR instead stores LiteLLM prompt/response pairs in sync logs and adds best-effort pointers to provider-owned session files. That is useful in places, but it is not the same artifact the issue describes.
A few concerns I would like addressed before merge:
- Please preserve or replace the existing agentic failure logging path. Today
mainrecords agentic failures via.pdd/agentic-logs/; this PR removes that, while the new provider-session discovery misses cases like timeout, CLI not found, non-zero exit, invalid JSON, and parse failures. - Please wire agentic trace capture at the
run_agentic_task()level, not only throughsync_orchestration, since bug/change/fix/update/split/checkup workflows also use agentic tasks. - Please generate project-local standardized trace files, preferably under
.pdd/agentic-traces/, rather than only storing absolute pointers to provider-private session files. - Please keep the sync-log relocation / migration / core-dump behavior separate unless it is strictly required. That part adds a lot of blast radius and seems independent from the trace feature.
The LiteLLM llm_traces work may be worth splitting into a smaller PR, since collecting all prompt/response pairs for sync operations is useful on its own. For #752 specifically, I would like to see this refocused around the standardized agentic trace artifact described in the issue.
Thanks again for pushing this forward. The direction is valuable; I think it just needs to be realigned with the requested trace format and avoid regressing the existing debug logs.
Summary
Closes #752
Store LLM trace data in sync log entries so developers can see every prompt, response, and reasoning step when a pdd operation gives unexpected results.
It will be helpful to have a standardized way of storing LLM internal thinking processes so we can reference them when iterating upon pdd agentic LLM prompts.
Before this PR, the only method to view agentic traces was to manually dig into provider-specific folders to search for session files (~/.claude/projects/, ~/.gemini/tmp/). And for non-agentic pdd operations, the LLM input/output pairs were not saved entirely.
.pdd/meta/to.pdd/logs/(gitignored). Migrate legacy files automatically on read and write.llm_traces: Every LiteLLM operation (generate, test, example, fix) now records all prompt/response pairs in anllm_traceslist on the log entry — on both success and failure. Each item hasprompt,response,model, andthinking(string or null).agentic_trace: When a CLI agent runs (Claude, Gemini, Codex), the log entry gets anagentic_tracepointer to the provider's session file on disk.session_id), Gemini (glob by first-8-chars of UUID viaprojects.jsonslug), Codex (filesystem snapshot-and-diff). Best-effort, never raises, never warns.api_key,token,secret,passwordscrubbed from all text fields — including JSON-quoted values like"api_key": "sk-...".*_sync.loginfile_contents. Addssync_log_refsarray. Readssync_stepsdirectly from disk._log_agentic_interaction,AGENTIC_LOG_DIR,_AGENTIC_SESSION_IDdeleted..pdd/agentic-logs/added to.gitignore._last_agentic_tracechanged from plain global toContextVar, matchingllm_trace.py's pattern.os.rename(atomic on POSIX) withFileNotFoundError/OSErrorhandling for concurrent access. Append path uses a singler+bfile handle to check trailing newline and write in one shot, eliminating TOCTOU._pairs_by_operationdefault changed from{}(shared mutable) toNoneto prevent cross-context mutation.pop_last_pairnon-destructive: Backward-compat alias now peeks at the last pair without draining the list, so it doesn't destroy traces thatpop_all_pairsexpects to collect.import logging as ...inagentic_common.pyconsolidated to a single top-levelimport logging.Files changed
pdd/core/llm_trace.pythinkingfield,pop_all_pairs, JSON-quoted redaction pattern,ContextVardefaultNoneinstead of{},pop_last_pairis now non-destructive peekpdd/llm_invoke.pythinkingtorecord_llm_pairat all 3 call sitespdd/operation_log.py.pdd/logs/path, migration via atomicos.renamewith race-safe fallback,r+bnewline guardpdd/sync_orchestration.pyllm_tracesandagentic_traceafter each op, pre-op drain of stale agentic tracepdd/agentic_common.py_last_agentic_traceasContextVar, top-levelimport loggingpdd/core/dump.pysync_log_refs, read steps from disk, stop embedding logs.gitignore.pdd/meta/*_sync.log,.pdd/logs/,.pdd/agentic-logs/Tests
test_llm_traces.py— 90 mocked testsNo LLM calls. Fast, deterministic, runs in CI.
test_new_write_path,test_migration_on_write,test_migration_on_read,test_already_migrated,test_only_sync_logs_migrate,test_directory_creation_safe_twiceget_log_path()returns.pdd/logs/, legacy files move from.pdd/meta/, fingerprint and run report stay puttest_generate_gets_llm_traces_list,test_crash_retries_collect_all,test_trace_item_schema,test_traces_on_success_and_failure,test_failed_operation_keeps_all_traces,test_thinking_populated,test_thinking_null_when_absenttest_claude_discovery_returns_correct_schema,test_gemini_discovery_returns_json_format,test_codex_discovery_returns_correct_schema,test_both_llm_traces_and_agentic_trace_on_same_entry{session_file, provider, format}shape; sync_orchestration attaches bothllm_tracesandagentic_traceto the same entrytest_skip_entry_has_no_trace_keys,test_error_decision_entry_has_no_trace_keys,test_event_logged_via_log_event_has_no_trace_keyslog_event()directly — verifies logged entries have nollm_tracesoragentic_tracetest_prompt_truncation,test_response_truncation,test_thinking_truncation,test_secret_redactiontest_finds_session_by_id,test_slug_has_leading_dash,test_slug_with_spaces,test_deep_path,test_missing_session_id,test_file_not_found,test_missing_claude_dir,test_empty_session_file,test_parallel_safetytest_happy_path,test_falls_back_to_jsonl,test_prefers_json_over_jsonl,test_multiple_json_picks_newest,test_uses_first_8_chars,test_slug_with_special_chars,test_rejects_subdirectory,test_bad_projects_json_schema,test_missing_projects_json,test_no_session_id,test_corrupt_projects_json,test_cwd_not_in_projects,test_parallel_safetytest_happy_path_one_new_file,test_multiple_new_files,test_pre_existing_files_ignored,test_respects_codex_home,test_missing_codex_dir,test_only_rollout_files,test_no_new_filestest_never_raises_claude,test_never_raises_gemini,test_never_raises_codex,test_no_warnings,test_failed_discovery_does_not_block_logging,test_run_with_provider_sets_last_agentic_traceagentic_trace(via sync_orchestration),_run_with_providerstores trace in module statetest_reads_sync_steps_from_logs_dir,test_includes_sync_log_refs,test_does_not_embed_sync_logs,test_sync_steps_excludes_trace_data,test_falls_back_to_meta_for_unmigratedtest_reads_from_new_location,test_triggers_migration,test_does_not_show_trace_content_display_sync_logusesget_log_path(), doesn't print prompt/response texttest_each_operation_gets_own_traces,test_no_trace_leakage,test_mixed_entriespop_all_pairsdrains correctlytest_returns_all_recorded,test_empty_when_nothing_recorded,test_scoped_by_operationtest_separate_log_filestest_log_agentic_interaction_removed,test_agentic_log_dir_removed,test_agentic_session_id_removed,test_no_agentic_logs_dir_createdtest_response_redacted,test_thinking_redactedtest_thinking_list_of_objects,test_thinking_unexpected_type,test_thinking_nonetest_json_api_key_redacted,test_json_token_redacted,test_json_secret_in_response,test_json_secret_in_thinking"api_key": "sk-..."style JSON values redacted across prompt, response, and thinking fieldstest_agentic_trace_is_context_var,test_agentic_trace_isolated_across_threads_last_agentic_traceis aContextVar, traces set in one thread are invisible in anothertest_append_migration_preserves_newline_boundarytest_sync_orchestration.py— 4 new tests (1 mocked + 3 real LLM)The mocked test runs with all other sync orchestration tests. The 3 real LLM tests are gated by
PDD_RUN_REAL_LLM_TESTS=1.test_sync_orchestration_attaches_llm_trace_on_failed_operationpop_all_pairsandcode_generator_main, verifies failed generate entry hasllm_tracestest_e2e_litellm_sync_tracespdd sync. Checks: log in.pdd/logs/not.pdd/meta/, generate entry hasllm_traceswith valid prompt/response/model/thinking, example has traces, model field is recognizable, every traced operation has >= 2 items (no orphaned calls), fingerprint intacttest_e2e_skip_dryrun_coredump--dry-runand--core-dump. Checks: skip entries have no traces, dry-run shows operation names. Unconditionally asserts core dump file exists, containssync_log_refs, and all refs point to.pdd/logs/test_e2e_legacy_sync_log_migrates.pdd/meta/, syncs. Checks: old file gone, new file in.pdd/logs/, old entries preservedtest_llm_traces_e2e.py— 3 agentic tests (3 providers x 1 test)Parametrized over
[anthropic, google, openai]. Skips if the provider's CLI isn't installed.test_agentic_sync_clean_pathpdd sync --agenticon a fresh project. Asserts: log in.pdd/logs/, generate entry hasllm_traces, anyagentic_traceentries have valid session file on disk, provider-specific path patterns (Claude/.claude/projects/, Gemini/.gemini/tmp/.../chats/, Codexjsonl)Test results
test_llm_traces.py(90 mocked)test_sync_orchestration.pymocked trace testtest_e2e_litellm_sync_tracestest_e2e_skip_dryrun_coredumptest_e2e_legacy_sync_log_migratestest_agentic_sync_clean_path[google]test_agentic_*[anthropic]test_agentic_*[openai]Existing test suites — all green
test_operation_log.pytest_core_dump.pytest_agentic_common.pytest_llm_invoke.pyPre-existing test failures (not related to this PR)
This PR has 10 CI failures, but these tests also fail on upstream
main(Unit Tests run 24949525104). These failures are unrelated to this PR, but described below so they can be fixed on main:Group 1:
test_ci_detect_changed_modules.py— 5 failuresRoot cause: Test loader references
scripts/ci_detect_changed_modules.pybut the file lives atpdd/ci_detect_changed_modules.py. Wrong path in the test.test_basename_from_nested_code_pathsFileNotFoundError: scripts/ci_detect_changed_modules.pytest_basename_from_nested_context_and_teststest_reverse_dep_detects_recursive_nested_includetest_reverse_dep_detects_include_manytest_detect_combines_nested_direct_and_reverse_dependenciesFix (not this PR): Change
"scripts"to"pdd"on line 11 of the test.Group 2:
test_agentic_e2e_fix_orchestrator.py— 3 failuresRoot cause: Tests assert the orchestrator prompt documents safety features (
_detect_meaningful_changes,_is_noise_path,cycle_start_hashes), but the prompt file was never updated to include these tokens.test_orchestrator_prompt_documents_direct_edit_guard_detect_meaningful_changes,has_direct_editstest_orchestrator_prompt_documents_noise_filter_is_noise_path,__pycache__,.pytest_cache,node_modulestest_orchestrator_prompt_documents_cycle_waste_breakercycle_start_hashes,current_cycle > 1Fix (not this PR): Add the required documentation tokens to
prompts/agentic_e2e_fix_orchestrator_python.prompt.Group 3:
test_issue_1240_generate_prompt_meta_framing.py— 2 failuresRoot cause: Tests expect
utils/mcp/prompts/generate_prompt.promptandutils/mcp/prompts/*_python.promptfiles, but the MCP prompts directory was never committed to main.test_mcp_generate_prompt_template_is_not_meta_framedFileNotFoundError: utils/mcp/prompts/generate_prompt.prompttest_mcp_checked_in_python_prompts_are_not_meta_framedAssertionError: no *_python.prompt files foundFix (not this PR): Commit the MCP prompt files or add skip guards for when the directory is missing.