Commit 8f312d2

feat(guardrails): PII redaction via Presidio sidecar (native VIN, per-rule language) (#5174)

* fix(logs): run PII redaction over HTTP and fix Presidio provisioning
  - resolve the guardrails venv via candidate paths and fail fast instead of silently falling back to system python3 (the misleading "Presidio not installed" that broke redaction and the guardrails block in deployed runtimes)
  - install the en_core_web_lg spaCy model in setup.sh and app.Dockerfile
  - route log redaction through an internal /api/guardrails/mask-batch endpoint so Presidio always runs in the app container, including async executions that persist inside the trigger.dev runtime
* fix(guardrails): chunk + time-bound internal PII mask requests
  - chunk maskPIIBatchViaHttp by count (2000) and bytes (256KB) so large executions split across requests and never hit the contract's 100k cap
  - add AbortSignal.timeout(45s) per request so a slow/unreachable app container aborts and the caller scrubs, instead of hanging the trigger.dev job
  - catch maskPIIBatch failures in the route: log and return a structured 500 (broken venv fails loudly server-side; caller still scrubs, no leak)
  - add mask-client tests (order across chunks, count split, non-2xx, empty)
* fix(guardrails): mint internal token per mask request
  A single token (5min TTL) could expire mid-batch when a large execution fans out into many sequential chunk requests; mint one per request instead.
* feat(guardrails): run PII via Presidio sidecars + TS recognizer registry
  - replace the per-call python3 subprocess (cold spaCy load every call) with two long-lived Presidio sidecars (analyzer + anonymizer) reached over HTTP; the app image no longer carries Python/Presidio/venv
  - add PRESIDIO_ANALYZER_URL / PRESIDIO_ANONYMIZER_URL
  - move VIN out of Python into a TS recognizer (check-digit validated) behind a CUSTOM_RECOGNIZERS registry so new custom detectors are one entry; masking is handled uniformly by the anonymizer
  - drive the guardrails block's PII type picker from the shared pii-entities catalog (adds VIN, fixes drift) so block + Data Retention never diverge
  - delete validate_pii.py, requirements.txt, setup.sh and the Dockerfile venv step
* fix(guardrails): bound-parallelize mask batch; refresh stale comments
  - maskPIIBatch runs per-string sidecar calls with bounded concurrency (8) via mapWithConcurrency, so a chunk of many small leaves finishes within the 45s request timeout instead of aborting and scrubbing; order + fail-on-error kept
  - drop stale comments referencing the deleted Python venv / 30s subprocess timeout
* refactor(guardrails): single Presidio image, native VIN, per-rule redaction language
  - collapse the analyzer/anonymizer URLs into one PRESIDIO_URL (combined image serves /analyze + /anonymize)
  - remove the TS VIN recognizer (vin.ts, recognizers.ts) — VIN is now native + multi-language in the image; validate_pii is a thin analyze→anonymize client
  - trim KR_RRN/TH_TNIN from the catalog (no Korean/Thai model in the image)
  - add per-rule redaction language: PII_LANGUAGES catalog drives the contract enum, the Data Retention rule modal, and the guardrails block dropdown; resolver + logger thread it through to maskPIIBatch (default en), so non-English entity rules (e.g. ES_NIF) actually fire instead of silently no-op'ing under en
* fix(guardrails): correct sidecar port (5001) + README for combined image
  The combined Presidio image (docker/pii.Dockerfile) serves /analyze + /anonymize on a single port 5001 with native VIN + multi-language recognizers. Fix the PRESIDIO_URL default (was 5002) and rewrite the README, which still described two stock containers and a TS VIN recognizer.
* fix(guardrails): coerce stored redaction language in the resolver
  The persist-path resolver accepted any stored language string, so a stale/invalid code (e.g. a dropped locale) would reach Presidio and scrub the log even though the admin UI shows English. Coerce against the supported set via a shared coercePiiLanguage helper (now reused by the data-retention route too), falling back to en for unknown values.
* fix(guardrails): rename PRESIDIO_URL env var to PII_URL
  Match the infra taskdef, which sets PII_URL on the app container for the combined Presidio sidecar.

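
The bounded-concurrency step in the commit message (per-string sidecar calls, limit 8, order and fail-on-error kept) can be pictured with a small worker-pool helper. This is a minimal sketch only: the commit names `mapWithConcurrency`, but its body is not shown on this page, so the signature and implementation below are assumptions.

```typescript
// Hypothetical implementation: the commit names `mapWithConcurrency`, but its
// real signature and body are not part of this diff.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results = new Array<R>(items.length)
  let next = 0

  // Each worker claims the next unprocessed index until the input runs out.
  // Results are written by index, so output order matches input order.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++
      results[index] = await fn(items[index], index)
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length))
  // Promise.all rejects as soon as any worker rejects, so one failed item
  // fails the whole batch instead of returning partial output.
  await Promise.all(Array.from({ length: workerCount }, () => worker()))
  return results
}
```

With a limit of 8, a chunk of many short strings keeps at most eight sidecar calls in flight at a time, which is the property the commit relies on to stay inside the 45s request timeout.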
1 parent 4d2e7d5 commit 8f312d2

26 files changed

Lines changed: 702 additions & 684 deletions

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
+/**
+ * @vitest-environment node
+ */
+import { createMockRequest } from '@sim/testing'
+import { beforeEach, describe, expect, it, vi } from 'vitest'
+
+const { mockCheckInternalAuth, mockMaskPIIBatch } = vi.hoisted(() => ({
+  mockCheckInternalAuth: vi.fn(),
+  mockMaskPIIBatch: vi.fn(),
+}))
+
+vi.mock('@/lib/auth/hybrid', () => ({
+  checkInternalAuth: mockCheckInternalAuth,
+}))
+
+vi.mock('@/lib/guardrails/validate_pii', () => ({
+  maskPIIBatch: mockMaskPIIBatch,
+}))
+
+import { POST } from '@/app/api/guardrails/mask-batch/route'
+
+describe('POST /api/guardrails/mask-batch', () => {
+  beforeEach(() => {
+    vi.clearAllMocks()
+    mockCheckInternalAuth.mockResolvedValue({ success: true })
+    mockMaskPIIBatch.mockImplementation(async (texts: string[]) => texts.map((t) => `M(${t})`))
+  })
+
+  it('returns 401 without internal auth', async () => {
+    mockCheckInternalAuth.mockResolvedValue({
+      success: false,
+      error: 'Internal authentication required',
+    })
+
+    const res = await POST(
+      createMockRequest('POST', { texts: ['a@b.com'], entityTypes: ['EMAIL_ADDRESS'] })
+    )
+
+    expect(res.status).toBe(401)
+    expect(mockMaskPIIBatch).not.toHaveBeenCalled()
+  })
+
+  it('masks the batch in-process and preserves order', async () => {
+    const res = await POST(
+      createMockRequest('POST', {
+        texts: ['a@b.com', 'hello'],
+        entityTypes: ['EMAIL_ADDRESS'],
+        language: 'en',
+      })
+    )
+
+    expect(res.status).toBe(200)
+    const json = await res.json()
+    expect(json.masked).toEqual(['M(a@b.com)', 'M(hello)'])
+    expect(mockMaskPIIBatch).toHaveBeenCalledWith(['a@b.com', 'hello'], ['EMAIL_ADDRESS'], 'en')
+  })
+
+  it('rejects an invalid body with 400', async () => {
+    const res = await POST(createMockRequest('POST', { texts: 'not-an-array', entityTypes: [] }))
+
+    expect(res.status).toBe(400)
+    expect(mockMaskPIIBatch).not.toHaveBeenCalled()
+  })
+})
Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+import { createLogger } from '@sim/logger'
+import { getErrorMessage } from '@sim/utils/errors'
+import { type NextRequest, NextResponse } from 'next/server'
+import { guardrailsMaskBatchContract } from '@/lib/api/contracts'
+import { parseRequest } from '@/lib/api/server'
+import { checkInternalAuth } from '@/lib/auth/hybrid'
+import { withRouteHandler } from '@/lib/core/utils/with-route-handler'
+import { maskPIIBatch } from '@/lib/guardrails/validate_pii'
+
+const logger = createLogger('GuardrailsMaskBatchAPI')
+
+/**
+ * Internal batch PII masking. The log-redaction persist path runs in both the
+ * Next.js server and the trigger.dev runtime, but the Presidio sidecars live only
+ * in the app task — so redaction calls this endpoint server-to-server (internal
+ * JWT) to keep Presidio centralized here.
+ */
+export const POST = withRouteHandler(async (request: NextRequest) => {
+  const auth = await checkInternalAuth(request, { requireWorkflowId: false })
+  if (!auth.success) {
+    return NextResponse.json({ error: 'Unauthorized' }, { status: 401 })
+  }
+
+  const parsed = await parseRequest(guardrailsMaskBatchContract, request, {})
+  if (!parsed.success) return parsed.response
+
+  const { texts, entityTypes, language } = parsed.data.body
+
+  try {
+    const masked = await maskPIIBatch(texts, entityTypes, language)
+    logger.info('Masked PII batch', { count: texts.length })
+    return NextResponse.json({ masked })
+  } catch (error) {
+    // An unreachable/misconfigured Presidio sidecar makes maskPIIBatch throw; fail
+    // loudly here (the caller scrubs to REDACTION_FAILED, so PII is never leaked).
+    logger.error('PII batch masking failed', {
+      error: getErrorMessage(error),
+      count: texts.length,
+    })
+    return NextResponse.json(
+      { error: getErrorMessage(error, 'PII masking failed') },
+      { status: 500 }
+    )
+  }
+})
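
The route's catch block relies on the caller scrubbing when masking fails. The caller is not part of this page, so the following is only a sketch of that contract under stated assumptions: `maskOrScrub`, the `FetchLike` type, the injected `fetchImpl`, and the placeholder string are all hypothetical; only the endpoint path, the request body shape, the `masked` response field and the 45s `AbortSignal.timeout` come from the commit.

```typescript
// Assumed placeholder: the source names REDACTION_FAILED but not its value.
const REDACTION_FAILED = '[REDACTION_FAILED]'
const REQUEST_TIMEOUT_MS = 45_000 // 45s per request, per the commit message

// Minimal structural types so the sketch works with the global fetch or a stub.
type MaskResponse = { ok: boolean; status: number; json(): Promise<unknown> }
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string; signal: AbortSignal }
) => Promise<MaskResponse>

// Hypothetical caller: posts one batch and scrubs every entry on any failure.
async function maskOrScrub(
  fetchImpl: FetchLike,
  baseUrl: string,
  texts: string[],
  entityTypes: string[],
  language: string,
  headers: Record<string, string> = {} // e.g. the internal token header (assumed)
): Promise<string[]> {
  try {
    const res = await fetchImpl(`${baseUrl}/api/guardrails/mask-batch`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', ...headers },
      body: JSON.stringify({ texts, entityTypes, language }),
      // Abort a slow or unreachable app container instead of hanging the job.
      signal: AbortSignal.timeout(REQUEST_TIMEOUT_MS),
    })
    if (!res.ok) throw new Error(`mask-batch responded ${res.status}`)
    const json = (await res.json()) as { masked?: unknown }
    if (!Array.isArray(json.masked) || json.masked.length !== texts.length) {
      throw new Error('mask-batch returned an unexpected payload')
    }
    return json.masked as string[]
  } catch {
    // Never fall back to the raw text: a failed request scrubs the whole batch.
    return texts.map(() => REDACTION_FAILED)
  }
}
```

Passing the fetch implementation as a parameter is a choice made for this sketch so it can be exercised without a network; a real caller could pass the global `fetch`.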

apps/sim/app/api/organizations/[id]/data-retention/route.ts

Lines changed: 9 additions & 1 deletion
@@ -16,6 +16,7 @@ import { isOrganizationOnEnterprisePlan } from '@/lib/billing/core/subscription'
 import { isBillingEnabled } from '@/lib/core/config/env-flags'
 import { isFeatureEnabled } from '@/lib/core/config/feature-flags'
 import { withRouteHandler } from '@/lib/core/utils/with-route-handler'
+import { coercePiiLanguage } from '@/lib/guardrails/pii-entities'

 const logger = createLogger('DataRetentionAPI')

@@ -35,7 +36,14 @@ function normalizeConfigured(
     logRetentionHours: settings?.logRetentionHours ?? null,
     softDeleteRetentionHours: settings?.softDeleteRetentionHours ?? null,
     taskCleanupHours: settings?.taskCleanupHours ?? null,
-    piiRedaction: settings?.piiRedaction?.rules ? { rules: settings.piiRedaction.rules } : null,
+    piiRedaction: settings?.piiRedaction?.rules
+      ? {
+          rules: settings.piiRedaction.rules.map((rule) => ({
+            ...rule,
+            language: coercePiiLanguage(rule.language),
+          })),
+        }
+      : null,
   }
 }
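
The `coercePiiLanguage` helper imported above is defined in `@/lib/guardrails/pii-entities`, which is not shown on this page. A minimal sketch of the behavior the commit message describes (accept a supported code, fall back to `en` for anything else) might look like this. The language codes listed are an assumption, copied from the dropdown that this commit removes from the guardrails block; the real `PII_LANGUAGE_CODES` catalog may differ.

```typescript
// Assumed codes: taken from the hard-coded dropdown this commit deletes from
// guardrails.ts. The real PII_LANGUAGE_CODES catalog is not shown in the diff.
const PII_LANGUAGE_CODES = ['en', 'es', 'it', 'pl', 'fi'] as const
type PIILanguage = (typeof PII_LANGUAGE_CODES)[number]
const DEFAULT_PII_LANGUAGE: PIILanguage = 'en'

// Hypothetical body for the helper named in the commit: return the value when
// it is a supported code, otherwise the default.
function coercePiiLanguage(value: unknown): PIILanguage {
  return typeof value === 'string' && (PII_LANGUAGE_CODES as readonly string[]).includes(value)
    ? (value as PIILanguage)
    : DEFAULT_PII_LANGUAGE
}
```

Under these assumptions, `coercePiiLanguage('es')` returns `'es'`, while a stale code such as `'ko'`, or a missing value, resolves to `'en'`, so the stored rule and what the admin UI shows cannot disagree.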

apps/sim/blocks/blocks/guardrails.ts

Lines changed: 11 additions & 66 deletions
@@ -1,4 +1,5 @@
 import { ShieldCheckIcon } from '@/components/icons'
+import { PII_ENTITY_GROUPS, PII_LANGUAGES } from '@/lib/guardrails/pii-entities'
 import type { BlockConfig } from '@/blocks/types'
 import {
   getModelOptions,
@@ -170,65 +171,15 @@ Return ONLY the regex pattern - no explanations, no quotes, no forward slashes,
       title: 'PII Types to Detect',
       type: 'grouped-checkbox-list',
       maxHeight: 400,
-      options: [
-        // Common PII types
-        { label: 'Person name', id: 'PERSON', group: 'Common' },
-        { label: 'Email address', id: 'EMAIL_ADDRESS', group: 'Common' },
-        { label: 'Phone number', id: 'PHONE_NUMBER', group: 'Common' },
-        { label: 'Location', id: 'LOCATION', group: 'Common' },
-        { label: 'Date or time', id: 'DATE_TIME', group: 'Common' },
-        { label: 'IP address', id: 'IP_ADDRESS', group: 'Common' },
-        { label: 'URL', id: 'URL', group: 'Common' },
-        { label: 'Credit card number', id: 'CREDIT_CARD', group: 'Common' },
-        { label: 'International bank account number (IBAN)', id: 'IBAN_CODE', group: 'Common' },
-        { label: 'Cryptocurrency wallet address', id: 'CRYPTO', group: 'Common' },
-        { label: 'Medical license number', id: 'MEDICAL_LICENSE', group: 'Common' },
-        { label: 'Nationality / religion / political group', id: 'NRP', group: 'Common' },
-
-        // USA
-        { label: 'US bank account number', id: 'US_BANK_NUMBER', group: 'USA' },
-        { label: 'US driver license number', id: 'US_DRIVER_LICENSE', group: 'USA' },
-        {
-          label: 'US individual taxpayer identification number (ITIN)',
-          id: 'US_ITIN',
-          group: 'USA',
-        },
-        { label: 'US passport number', id: 'US_PASSPORT', group: 'USA' },
-        { label: 'US Social Security number', id: 'US_SSN', group: 'USA' },
-
-        // UK
-        { label: 'UK National Insurance number', id: 'UK_NINO', group: 'UK' },
-        { label: 'UK NHS number', id: 'UK_NHS', group: 'UK' },
-
-        // Spain
-        { label: 'Spanish NIF number', id: 'ES_NIF', group: 'Spain' },
-        { label: 'Spanish NIE number', id: 'ES_NIE', group: 'Spain' },
-
-        // Italy
-        { label: 'Italian fiscal code', id: 'IT_FISCAL_CODE', group: 'Italy' },
-        { label: 'Italian driver license', id: 'IT_DRIVER_LICENSE', group: 'Italy' },
-        { label: 'Italian identity card', id: 'IT_IDENTITY_CARD', group: 'Italy' },
-        { label: 'Italian passport', id: 'IT_PASSPORT', group: 'Italy' },
-
-        // Poland
-        { label: 'Polish PESEL', id: 'PL_PESEL', group: 'Poland' },
-
-        // Singapore
-        { label: 'Singapore NRIC/FIN', id: 'SG_NRIC_FIN', group: 'Singapore' },
-
-        // Australia
-        { label: 'Australian business number (ABN)', id: 'AU_ABN', group: 'Australia' },
-        { label: 'Australian company number (ACN)', id: 'AU_ACN', group: 'Australia' },
-        { label: 'Australian tax file number (TFN)', id: 'AU_TFN', group: 'Australia' },
-        { label: 'Australian Medicare number', id: 'AU_MEDICARE', group: 'Australia' },
-
-        // India
-        { label: 'Indian Aadhaar', id: 'IN_AADHAAR', group: 'India' },
-        { label: 'Indian PAN', id: 'IN_PAN', group: 'India' },
-        { label: 'Indian vehicle registration', id: 'IN_VEHICLE_REGISTRATION', group: 'India' },
-        { label: 'Indian voter number', id: 'IN_VOTER', group: 'India' },
-        { label: 'Indian passport', id: 'IN_PASSPORT', group: 'India' },
-      ],
+      // Driven by the shared catalog (includes VIN and custom recognizers) so the
+      // block and the Data Retention settings never drift.
+      options: PII_ENTITY_GROUPS.flatMap((group) =>
+        group.entities.map((entity) => ({
+          label: entity.label,
+          id: entity.value,
+          group: group.label,
+        }))
+      ),
       condition: {
         field: 'validationType',
         value: ['pii'],
@@ -255,13 +206,7 @@ Return ONLY the regex pattern - no explanations, no quotes, no forward slashes,
       id: 'piiLanguage',
       title: 'Language',
       type: 'dropdown',
-      options: [
-        { label: 'English', id: 'en' },
-        { label: 'Spanish', id: 'es' },
-        { label: 'Italian', id: 'it' },
-        { label: 'Polish', id: 'pl' },
-        { label: 'Finnish', id: 'fi' },
-      ],
+      options: PII_LANGUAGES.map((language) => ({ label: language.label, id: language.value })),
       defaultValue: 'en',
       condition: {
         field: 'validationType',
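
The new `options` expression only depends on the catalog exposing groups with a `label` and a list of `entities`, each with a `label` and a `value`. The catalog itself lives in `@/lib/guardrails/pii-entities` and is not shown here, so the interfaces and the three sample entries below are assumptions, reusing labels from the removed hard-coded list purely for illustration.

```typescript
// Assumed shape, inferred from how the block reads the catalog in this diff.
interface PiiEntity {
  value: string
  label: string
}
interface PiiEntityGroup {
  label: string
  entities: PiiEntity[]
}

// Illustrative subset only; the real catalog is not part of this page.
const PII_ENTITY_GROUPS: PiiEntityGroup[] = [
  {
    label: 'Common',
    entities: [
      { value: 'PERSON', label: 'Person name' },
      { value: 'EMAIL_ADDRESS', label: 'Email address' },
    ],
  },
  { label: 'USA', entities: [{ value: 'US_SSN', label: 'US Social Security number' }] },
]

// Same flattening as the block: one option per entity, tagged with its group.
const options = PII_ENTITY_GROUPS.flatMap((group) =>
  group.entities.map((entity) => ({
    label: entity.label,
    id: entity.value,
    group: group.label,
  }))
)
```

For this sample input, `options` holds three entries in catalog order, the first being `{ label: 'Person name', id: 'PERSON', group: 'Common' }`, which matches the shape of the entries the hard-coded list used to provide.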

apps/sim/ee/data-retention/components/data-retention-settings.tsx

Lines changed: 35 additions & 3 deletions
@@ -21,7 +21,13 @@ import {
 } from '@/components/emcn'
 import { useSession } from '@/lib/auth/auth-client'
 import { isBillingEnabled } from '@/lib/core/config/env-flags'
-import { PII_ENTITY_GROUPS, SUPPORTED_PII_ENTITIES } from '@/lib/guardrails/pii-entities'
+import {
+  DEFAULT_PII_LANGUAGE,
+  PII_ENTITY_GROUPS,
+  PII_LANGUAGES,
+  type PIILanguage,
+  SUPPORTED_PII_ENTITIES,
+} from '@/lib/guardrails/pii-entities'
 import { getUserRole } from '@/lib/workspaces/organization/utils'
 import { SettingsSection } from '@/app/workspace/[workspaceId]/settings/components/settings-section/settings-section'
 import { InfoNote } from '@/ee/components/info-note'
@@ -59,6 +65,7 @@ interface RuleDraft {
   id: string
   entityTypes: string[]
   workspaceId: string | null
+  language: PIILanguage
 }

 function hoursToDisplayDays(hours: number | null): string {
@@ -75,6 +82,7 @@ function normalizeRule(rule: RuleDraft): string {
   return JSON.stringify({
     entityTypes: [...rule.entityTypes].sort(),
     workspaceId: rule.workspaceId,
+    language: rule.language,
   })
 }

@@ -227,6 +235,18 @@ function RuleModal({
             onChange={(entityTypes) => onChange({ ...draft, entityTypes })}
           />
         </ChipModalField>
+        <ChipModalField
+          type='custom'
+          title='Language'
+          hint='Detection runs with this language’s recognizers — match it to your log content.'
+        >
+          <ChipSelect
+            value={draft.language}
+            onChange={(language) => onChange({ ...draft, language: language as PIILanguage })}
+            options={PII_LANGUAGES.map((l) => ({ value: l.value, label: l.label }))}
+            align='start'
+          />
+        </ChipModalField>
       </ChipModalBody>
       <ChipModalFooter
         onCancel={onClose}
@@ -291,6 +311,7 @@ export function DataRetentionSettings() {
           id: r.id,
           entityTypes: r.entityTypes,
           workspaceId: r.workspaceId,
+          language: r.language ?? DEFAULT_PII_LANGUAGE,
         }))
       )
       hydratedOrgRef.current = orgId
@@ -327,6 +348,7 @@ export function DataRetentionSettings() {
               id: r.id,
               entityTypes: r.entityTypes,
               workspaceId: r.workspaceId,
+              language: r.language,
             })),
           },
         },
@@ -335,7 +357,12 @@ export function DataRetentionSettings() {
   }

   function openEditDefault() {
-    const rule: RuleDraft = defaultRule ?? { id: generateId(), entityTypes: [], workspaceId: null }
+    const rule: RuleDraft = defaultRule ?? {
+      id: generateId(),
+      entityTypes: [],
+      workspaceId: null,
+      language: DEFAULT_PII_LANGUAGE,
+    }
     setModalIsNew(defaultRule === null)
     setModalOriginal(rule)
     setModalDraft({ ...rule })
@@ -344,7 +371,12 @@ export function DataRetentionSettings() {
   function openAddOverride() {
     const workspaceId = freeWorkspaces[0]?.value
     if (!workspaceId) return
-    const blank: RuleDraft = { id: generateId(), entityTypes: [], workspaceId }
+    const blank: RuleDraft = {
+      id: generateId(),
+      entityTypes: [],
+      workspaceId,
+      language: DEFAULT_PII_LANGUAGE,
+    }
     setModalIsNew(true)
     setModalOriginal(blank)
     setModalDraft(blank)

apps/sim/lib/api/contracts/hotspots.ts

Lines changed: 28 additions & 0 deletions
@@ -45,6 +45,34 @@ export const guardrailsValidateContract = defineRouteContract({
   },
 })

+const guardrailsMaskBatchBodySchema = z.object({
+  texts: z.array(z.string()).max(100_000),
+  entityTypes: z.array(z.string().min(1, 'Entity type cannot be empty')).max(200),
+  language: z.string().min(1).max(20).optional(),
+})
+
+const guardrailsMaskBatchResponseSchema = z.object({
+  masked: z.array(z.string()),
+})
+
+/**
+ * Internal batch PII masking. Called server-to-server (internal JWT) from the
+ * log-redaction persist path so Presidio always runs in the app container,
+ * including for async executions that persist inside the trigger.dev runtime.
+ */
+export const guardrailsMaskBatchContract = defineRouteContract({
+  method: 'POST',
+  path: '/api/guardrails/mask-batch',
+  body: guardrailsMaskBatchBodySchema,
+  response: {
+    mode: 'json',
+    schema: guardrailsMaskBatchResponseSchema,
+  },
+})
+
+export type GuardrailsMaskBatchBody = z.input<typeof guardrailsMaskBatchBodySchema>
+export type GuardrailsMaskBatchResult = z.output<typeof guardrailsMaskBatchResponseSchema>
+
 const chatMessageSchema = z.object({
   role: z.enum(['user', 'assistant', 'system']),
   content: z.string(),
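
The body schema caps `texts` at 100,000 entries, and the commit message says the client splits large executions by count (2000) and bytes (256KB) so a single request stays well under that cap. The client is not shown on this page; the sketch below illustrates one way such a split could work. `chunkTexts` is a hypothetical name, and measuring size as the UTF-8 length of each raw string is an assumption (the real client may measure the serialized request instead).

```typescript
const MAX_CHUNK_COUNT = 2000 // per the commit message
const MAX_CHUNK_BYTES = 256 * 1024 // 256KB, per the commit message

// Hypothetical helper: split texts into ordered chunks that respect both limits.
function chunkTexts(
  texts: string[],
  maxCount: number = MAX_CHUNK_COUNT,
  maxBytes: number = MAX_CHUNK_BYTES
): string[][] {
  const encoder = new TextEncoder()
  const chunks: string[][] = []
  let current: string[] = []
  let currentBytes = 0

  for (const text of texts) {
    // Assumption: size is the UTF-8 byte length of the raw string.
    const size = encoder.encode(text).length
    const wouldOverflow = current.length >= maxCount || currentBytes + size > maxBytes
    // Close the current chunk before it would exceed either limit. A single
    // string larger than maxBytes still gets a chunk of its own.
    if (current.length > 0 && wouldOverflow) {
      chunks.push(current)
      current = []
      currentBytes = 0
    }
    current.push(text)
    currentBytes += size
  }

  if (current.length > 0) chunks.push(current)
  return chunks
}
```

Because chunks are produced in input order, concatenating the masked results of each chunk request restores the original ordering, which is what the "order across chunks" test mentioned in the commit message checks for.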

apps/sim/lib/api/contracts/primitives.ts

Lines changed: 3 additions & 0 deletions
@@ -1,4 +1,5 @@
 import { z } from 'zod'
+import { PII_LANGUAGE_CODES } from '@/lib/guardrails/pii-entities'

 export const unknownRecordSchema = z.record(z.string(), z.unknown())

@@ -93,6 +94,8 @@ export const piiRedactionRuleSchema = z.object({
   entityTypes: z.array(z.string().min(1, 'Entity type cannot be empty')).max(100),
   /** null = all workspaces; otherwise the single targeted workspace. */
   workspaceId: z.string().min(1).nullable(),
+  /** Language whose Presidio recognizers apply; defaults to English. */
+  language: z.enum(PII_LANGUAGE_CODES).optional(),
 })

 export type PiiRedactionRule = z.output<typeof piiRedactionRuleSchema>
