feat(logs): redact PII from workflow logs via configurable rules#5136
feat(logs): redact PII from workflow logs via configurable rules#5136TheodoreSpeaks wants to merge 6 commits into
Conversation
Enterprise PII redaction for workflow execution logs, configured under Data Retention as org-scoped rules (each rule picks entity types + which workspaces it applies to). Reuses the guardrails Presidio engine in mask mode at the log-persist choke point, with a check-digit-validated VIN recognizer. Also adds per-workspace data-retention-hours overrides.
|
The latest updates on your projects. Learn more about Vercel for GitHub. |
PR SummaryHigh Risk Overview Persistence path: Guardrails gain a shared client-safe PII entity catalog, a VIN recognizer with check-digit validation, and refactored Python subprocess helpers. Schema/types formalize Reviewed by Cursor Bugbot for commit a5ca0f4. Bugbot is set up for automated code reviews on this repo. Configure here. |
|
@greptile review |
…rkflow-log # Conflicts: # packages/db/migrations/meta/0241_snapshot.json # packages/db/migrations/meta/_journal.json # scripts/check-api-validation-contracts.ts
Greptile SummaryAdds enterprise PII redaction for workflow execution logs using Microsoft Presidio, configured via org-scoped rules in the Data Retention settings UI. Redaction is applied at the single
Confidence Score: 3/5The log-persist hot path now has an unguarded async call that, on any transient DB error, will throw and drop the entire workflow execution log rather than degrading gracefully. Several issues raised in prior review threads remain unaddressed: the apps/sim/lib/logs/execution/logger.ts — the Important Files Changed
Sequence Diagram%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant EL as ExecutionLogger
participant APR as applyPiiRedaction
participant DB as Database
participant RER as redactPIIFromExecution
participant MPIB as maskPIIBatch
participant PY as validate_pii.py
EL->>APR: workspaceId, payload
APR->>DB: SELECT orgSettings FROM workspace LEFT JOIN organization
DB-->>APR: row
APR->>APR: resolveEffectivePiiRedaction
alt enabled false
APR-->>EL: payload unchanged
else enabled true
APR->>RER: payload, entityTypes
RER->>RER: collect eligible strings pass 1
RER->>MPIB: collected strings
loop per 256 KB chunk
MPIB->>PY: texts via stdin
PY-->>MPIB: masked results
end
MPIB-->>RER: masked array
RER->>RER: substitute masked values pass 2
RER-->>APR: redacted payload
APR-->>EL: redacted payload
end
EL->>DB: persist cleanExecutionData
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
participant EL as ExecutionLogger
participant APR as applyPiiRedaction
participant DB as Database
participant RER as redactPIIFromExecution
participant MPIB as maskPIIBatch
participant PY as validate_pii.py
EL->>APR: workspaceId, payload
APR->>DB: SELECT orgSettings FROM workspace LEFT JOIN organization
DB-->>APR: row
APR->>APR: resolveEffectivePiiRedaction
alt enabled false
APR-->>EL: payload unchanged
else enabled true
APR->>RER: payload, entityTypes
RER->>RER: collect eligible strings pass 1
RER->>MPIB: collected strings
loop per 256 KB chunk
MPIB->>PY: texts via stdin
PY-->>MPIB: masked results
end
MPIB-->>RER: masked array
RER->>RER: substitute masked values pass 2
RER-->>APR: redacted payload
APR-->>EL: redacted payload
end
EL->>DB: persist cleanExecutionData
Reviews (4): Last reviewed commit: "refactor(logs): drop per-workspace reten..." | Re-trigger Greptile |
| } | ||
| } | ||
|
|
||
| /** | ||
| * Mask PII from log content before persistence when the execution's workspace | ||
| * (via workspace override or org default) has enterprise PII redaction enabled. | ||
| * Resolved at persist time so both the inline and externalized write paths are | ||
| * covered. Returns the payload unchanged when disabled or non-enterprise. | ||
| */ | ||
| private async applyPiiRedaction( | ||
| workspaceId: string | null, | ||
| payload: RedactablePayload | ||
| ): Promise<RedactablePayload> { | ||
| if (!workspaceId) return payload | ||
|
|
||
| const [row] = await db | ||
| .select({ orgSettings: organization.dataRetentionSettings }) | ||
| .from(workspace) | ||
| .leftJoin(organization, eq(organization.id, workspace.organizationId)) | ||
| .where(eq(workspace.id, workspaceId)) | ||
| .limit(1) | ||
| if (!row) return payload | ||
|
|
||
| const config = resolveEffectivePiiRedaction({ orgSettings: row.orgSettings, workspaceId }) | ||
| if (!config.enabled) return payload | ||
|
|
||
| // Settings are only writable by enterprise orgs, but re-verify at read time | ||
| // (e.g. a plan downgrade) before doing the work. | ||
| if (!(await isWorkspaceOnEnterprisePlan(workspaceId))) return payload | ||
|
|
||
| return redactPIIFromExecution(payload, { entityTypes: config.entityTypes }) | ||
| } | ||
|
|
||
| async completeWorkflowExecution(params: { | ||
| executionId: string |
There was a problem hiding this comment.
Unconditional DB join on every workflow completion
applyPiiRedaction always fires a workspace → LEFT JOIN organization query, plus a second isWorkspaceOnEnterprisePlan call when redaction is configured, regardless of whether the org has any PII rules or is even on an enterprise plan. For deployments where the majority of executions are personal/non-enterprise workspaces this adds two round-trips to the hot path of every workflow. A lightweight guard (e.g. caching the org's enterprise status or checking a feature-flag ahead of the query) would let the common case skip both calls entirely.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Greptile SummaryThis PR adds enterprise-grade PII redaction for workflow execution logs, a new per-workspace data-retention-hours override, and a check-digit-validated VIN recognizer. The core redaction pipeline is well-engineered: a deterministic two-pass collect/substitute approach batches all strings from a single execution into one chunked Presidio call, and a fail-safe ensures PII is never persisted when masking fails.
Confidence Score: 3/5The redaction pipeline and workspace-override logic are structurally sound, but three copies of a wrong comment on a security-sensitive field create a genuine risk of operators deploying empty-entity-type rules under the false impression that they cover all PII. The core machinery — two-pass collect/substitute, byte-chunked Presidio calls, fail-safe scrubbing, and the workspace ?? org ?? plan-default retention resolution — is implemented correctly. Three conflicting entityTypes comments in schema.ts, primitives.ts, and pii-redaction.ts say the opposite of what retention.ts implements. An enterprise admin who follows those comments and creates a rule with no entity types selected will have no PII masked at all — a silent failure in a privacy-protection feature. packages/db/schema.ts, apps/sim/lib/api/contracts/primitives.ts, and apps/sim/lib/logs/execution/pii-redaction.ts all carry the wrong entityTypes comment that contradicts retention.ts. Important Files Changed
Sequence Diagram%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant E as Executor
participant L as ExecutionLogger
participant R as retention.ts
participant P as pii-redaction.ts
participant V as validate_pii.ts
participant Py as validate_pii.py
participant DB as Postgres
E->>L: completeWorkflowExecution
L->>L: filterForDisplay + redactApiKeys
L->>DB: SELECT org.dataRetentionSettings
DB-->>L: orgSettings
L->>R: resolveEffectivePiiRedaction
R-->>L: enabled + entityTypes
alt PII enabled
L->>DB: isWorkspaceOnEnterprisePlan
DB-->>L: true
L->>P: redactPIIFromExecution
P->>P: collect string leaves
loop Per 256KB chunk
P->>V: maskPIIBatch
V->>Py: spawn subprocess
Py-->>V: masked strings
V-->>P: masked strings
end
P-->>L: redacted payload
end
L->>DB: UPDATE workflowExecutionLogs
L-->>E: WorkflowExecutionLog
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
participant E as Executor
participant L as ExecutionLogger
participant R as retention.ts
participant P as pii-redaction.ts
participant V as validate_pii.ts
participant Py as validate_pii.py
participant DB as Postgres
E->>L: completeWorkflowExecution
L->>L: filterForDisplay + redactApiKeys
L->>DB: SELECT org.dataRetentionSettings
DB-->>L: orgSettings
L->>R: resolveEffectivePiiRedaction
R-->>L: enabled + entityTypes
alt PII enabled
L->>DB: isWorkspaceOnEnterprisePlan
DB-->>L: true
L->>P: redactPIIFromExecution
P->>P: collect string leaves
loop Per 256KB chunk
P->>V: maskPIIBatch
V->>Py: spawn subprocess
Py-->>V: masked strings
V-->>P: masked strings
end
P-->>L: redacted payload
end
L->>DB: UPDATE workflowExecutionLogs
L-->>E: WorkflowExecutionLog
Reviews (2): Last reviewed commit: "Merge remote-tracking branch 'origin/sta..." | Re-trigger Greptile |
| private async applyPiiRedaction( | ||
| workspaceId: string | null, | ||
| payload: RedactablePayload | ||
| ): Promise<RedactablePayload> { | ||
| if (!workspaceId) return payload | ||
|
|
||
| const [row] = await db | ||
| .select({ orgSettings: organization.dataRetentionSettings }) | ||
| .from(workspace) | ||
| .leftJoin(organization, eq(organization.id, workspace.organizationId)) | ||
| .where(eq(workspace.id, workspaceId)) | ||
| .limit(1) | ||
| if (!row) return payload | ||
|
|
||
| const config = resolveEffectivePiiRedaction({ orgSettings: row.orgSettings, workspaceId }) | ||
| if (!config.enabled) return payload | ||
|
|
||
| // Settings are only writable by enterprise orgs, but re-verify at read time | ||
| // (e.g. a plan downgrade) before doing the work. | ||
| if (!(await isWorkspaceOnEnterprisePlan(workspaceId))) return payload | ||
|
|
||
| return redactPIIFromExecution(payload, { entityTypes: config.entityTypes }) | ||
| } |
There was a problem hiding this comment.
Two extra DB round-trips per execution on the hot completion path
applyPiiRedaction always issues a JOIN query for org settings, then — if rules are active — calls isWorkspaceOnEnterprisePlan (another query). For the vast majority of executions (no PII rules configured, or non-enterprise workspace), this is wasted work on the critical log-persist path. The existing existingLog query at the top of completeWorkflowExecution already loads the workspace row; adding organization.dataRetentionSettings to that initial SELECT would remove the need for this second query entirely.
| /** Presidio entity types to mask. Empty = redact all detected PII. */ | ||
| entityTypes: z.array(z.string().min(1, 'Entity type cannot be empty')).max(100), |
There was a problem hiding this comment.
Same misleading comment as on
PiiRedactionRule in schema.ts. resolveEffectivePiiRedaction treats an empty entityTypes array on a rule as contributing nothing to the union, resulting in enabled: false (no redaction). The comment should reflect that.
| /** Presidio entity types to mask. Empty = redact all detected PII. */ | |
| entityTypes: z.array(z.string().min(1, 'Entity type cannot be empty')).max(100), | |
| /** Presidio entity types to mask. Empty array = rule contributes no types (redact nothing). */ | |
| entityTypes: z.array(z.string().min(1, 'Entity type cannot be empty')).max(100), |
|
@greptile review |
| private async applyPiiRedaction( | ||
| workspaceId: string | null, | ||
| payload: RedactablePayload | ||
| ): Promise<RedactablePayload> { | ||
| if (!workspaceId) return payload | ||
|
|
||
| const [row] = await db | ||
| .select({ orgSettings: organization.dataRetentionSettings }) | ||
| .from(workspace) | ||
| .leftJoin(organization, eq(organization.id, workspace.organizationId)) | ||
| .where(eq(workspace.id, workspaceId)) | ||
| .limit(1) | ||
| if (!row) return payload | ||
|
|
||
| const config = resolveEffectivePiiRedaction({ orgSettings: row.orgSettings, workspaceId }) | ||
| if (!config.enabled) return payload | ||
|
|
||
| // Settings are only writable by enterprise orgs, but re-verify at read time | ||
| // (e.g. a plan downgrade) before doing the work. | ||
| if (!(await isWorkspaceOnEnterprisePlan(workspaceId))) return payload | ||
|
|
||
| return redactPIIFromExecution(payload, { entityTypes: config.entityTypes }) | ||
| } |
There was a problem hiding this comment.
applyPiiRedaction DB failure propagates to log persistence
The db.select() inside applyPiiRedaction has no error handling. A transient DB error (connection timeout, pool exhaustion) while looking up org PII settings will throw from this function, propagate through completeWorkflowExecution — which has no local catch around the applyPiiRedaction call — and cause the entire workflow execution log to fail to persist. The PR's own fail-safe principle ("scrubs rather than leaks on error") in redactPIIFromExecution is not applied here to the settings-lookup path. A try/catch around the entire method body returning payload on any error would prevent PII configuration lookup failures from destroying execution history.
…t lazy - Extend PII redaction to span error/errorMessage/toolCalls and top-level error/completionFailure/trigger/executionState (Bugbot: PII in execution metadata). executionState is safe to redact — resume reads from the separate pausedExecutions table, not the log copy. - Lazy-import validate_pii in pii-redaction so the Python/child_process guardrails module stays out of the static middleware/RSC graph. - Type the org retention mutation to the contract body (optional, non-null).
…stays org-scoped - Remove the unused per-workspace data-retention-hours override (no UI; superseded by workspace-scoped PII rules). Reverts cleanup-dispatcher to org-only retention, drops resolveEffectiveRetentionHours, the workspace.dataRetentionSettings column + migration, and the workspace data-retention route/contract/hooks. Fixes Bugbot's null-as-unset finding by removing the buggy path entirely; org retention behavior is unchanged. - Stop re-checking isWorkspaceOnEnterprisePlan at persist time (it returns false on transient errors, which would fail-open and leak PII). Enabled rules already imply entitlement; redact whenever rules apply (fail-safe).
|
@greptile review |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit e5dfe05. Configure here.
- Drop the per-string size cap in PII redaction: oversized strings were left unmasked (leak). Nothing is skipped now; large payloads still fail-safe via the total-bytes ceiling + per-chunk timeout (scrub, never leak). - Add executionData.environment (incl. variables) to the redaction set.
|
@greptile review |

Summary
maskmode, applied at the single log-persist choke point (completeWorkflowExecution), covering both the inline and externalized write paths.structuredClone), and fail-safe (scrubs rather than leaks on error).workspace ?? org ?? plan default) via a new nullableworkspace.data_retention_settingscolumn.ChipModal(grouped checkbox grid for entity types, workspace multiselect), persisting on save with an unsaved-changes guard.Type of Change
Testing
bun run lint,bun run check:api-validation:strict, andbun run check:migrations origin/stagingall pass. Migration is a single additive nullable column (expand-phase, backward-compatible).Checklist