Skip to content

Commit 4f4ff53

Browse files
fix(uploads): authorize internal file URLs before download (#5049)
* fix(uploads): authorize internal file URLs before download downloadFileFromUrl treated any URL containing /api/files/serve/ as trusted-internal and read the object straight from storage by key with no access check, while every other resolution path in the file calls verifyFileAccess. Reachable during workflow execution via file[] inputs (type: 'url'), letting an authenticated user read arbitrary storage objects across tenants by supplying a storage key. Thread the caller's userId into downloadFileFromUrl and run verifyFileAccess(key, userId, undefined, context, false) on the resolved key before downloadFile; fail closed when no userId is present. Update all callers (execution file inputs, tool file outputs, KB ingestion); webhook and chat inputs already thread userId via processExecutionFiles. * chore(uploads): log denied internal file downloads for rollout telemetry * fix(uploads): derive internal file context from key, not query param Cursor Bugbot flagged a context-spoofing bypass: downloadFileFromUrl resolved context via parseInternalFileUrl, which honors a caller-controlled ?context= query param. An attacker could label a private storage key with a world-readable context (profile-pictures/og-images/workspace-logos) so verifyFileAccess short-circuits to granted while downloadFile still reads the private object. Infer context from the key only (inferContextFromKey), mirroring how /api/files/serve resolves it; ignore the query param. Also move the userId guard ahead of key resolution so auth failures surface first. * docs(uploads): move context-derivation rationale into TSDoc * fix(uploads): match internal file marker in URL path only isInternalFileUrl matched the /api/files/serve/ substring anywhere in the string, so a crafted URL could carry it in a query string or fragment and skip DNS/SSRF validation. Match it in the path component only. The raw path is checked without URL normalization on purpose: the files parse route relies on traversal sequences surviving this check (an absolute https://host/api/files/serve/../.. URL must classify as internal so the '..' check rejects it, rather than being normalized to /etc/... and waved through as external). Host is intentionally not gated — cross-tenant reads are prevented at the storage sink by verifyFileAccess, and host-allowlisting would break self-hosted/multi-domain deployments. Adds unit tests. * consolidate access, billing principals --------- Co-authored-by: Vikhyath Mondreti <vikhyath@simstudio.ai>
1 parent dd32abe commit 4f4ff53

7 files changed

Lines changed: 150 additions & 32 deletions

File tree

apps/sim/executor/utils/file-tool-processor.ts

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -138,7 +138,7 @@ export class FileToolProcessor {
138138
}
139139

140140
if (!buffer && data.url) {
141-
buffer = await downloadFileFromUrl(data.url)
141+
buffer = await downloadFileFromUrl(data.url, { userId: context.userId })
142142
}
143143

144144
if (buffer) {

apps/sim/lib/execution/files.ts

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -57,7 +57,7 @@ export async function processExecutionFile(
5757

5858
if (file.type === 'url' && file.data) {
5959
const { downloadFileFromUrl } = await import('@/lib/uploads/utils/file-utils.server')
60-
const buffer = await downloadFileFromUrl(file.data)
60+
const buffer = await downloadFileFromUrl(file.data, { userId })
6161

6262
if (buffer.length > MAX_FILE_SIZE) {
6363
const fileSizeMB = (buffer.length / (1024 * 1024)).toFixed(2)

apps/sim/lib/knowledge/documents/document-processor.ts

Lines changed: 28 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -295,7 +295,7 @@ async function parseDocument(
295295
if (isPDF && (hasAzureMistralOCR || hasMistralOCR)) {
296296
if (hasAzureMistralOCR) {
297297
logger.info(`Using Azure Mistral OCR: ${filename}`)
298-
return parseWithAzureMistralOCR(fileUrl, filename, mimeType)
298+
return parseWithAzureMistralOCR(fileUrl, filename, mimeType, userId)
299299
}
300300

301301
if (hasMistralOCR) {
@@ -305,7 +305,7 @@ async function parseDocument(
305305
}
306306

307307
logger.info(`Using file parser: ${filename}`)
308-
return parseWithFileParser(fileUrl, filename, mimeType)
308+
return parseWithFileParser(fileUrl, filename, mimeType, userId)
309309
}
310310

311311
async function handleFileForOCR(
@@ -321,7 +321,7 @@ async function handleFileForOCR(
321321
if (mimeType === 'application/pdf') {
322322
logger.info(`handleFileForOCR: Downloading external PDF to check page count`)
323323
try {
324-
const buffer = await downloadFileWithTimeout(fileUrl)
324+
const buffer = await downloadFileWithTimeout(fileUrl, userId)
325325
logger.info(`handleFileForOCR: Downloaded external PDF: ${buffer.length} bytes`)
326326
return { httpsUrl: fileUrl, buffer }
327327
} catch (error) {
@@ -340,7 +340,7 @@ async function handleFileForOCR(
340340

341341
logger.info(`Uploading "${filename}" to cloud storage for OCR`)
342342

343-
const buffer = await downloadFileWithTimeout(fileUrl)
343+
const buffer = await downloadFileWithTimeout(fileUrl, userId)
344344

345345
logger.info(`Downloaded ${filename}: ${buffer.length} bytes`)
346346

@@ -380,11 +380,11 @@ async function handleFileForOCR(
380380
}
381381
}
382382

383-
async function downloadFileWithTimeout(fileUrl: string): Promise<Buffer> {
384-
return downloadFileFromUrl(fileUrl, TIMEOUTS.FILE_DOWNLOAD)
383+
async function downloadFileWithTimeout(fileUrl: string, userId?: string): Promise<Buffer> {
384+
return downloadFileFromUrl(fileUrl, { timeoutMs: TIMEOUTS.FILE_DOWNLOAD, userId })
385385
}
386386

387-
async function downloadFileForBase64(fileUrl: string): Promise<Buffer> {
387+
async function downloadFileForBase64(fileUrl: string, userId?: string): Promise<Buffer> {
388388
if (/^data:/i.test(fileUrl)) {
389389
const [, base64Data] = fileUrl.split(',')
390390
if (!base64Data) {
@@ -393,7 +393,7 @@ async function downloadFileForBase64(fileUrl: string): Promise<Buffer> {
393393
return Buffer.from(base64Data, 'base64')
394394
}
395395
if (/^https?:\/\//i.test(fileUrl)) {
396-
return downloadFileWithTimeout(fileUrl)
396+
return downloadFileWithTimeout(fileUrl, userId)
397397
}
398398
throw new Error('Unsupported fileUrl scheme: only data: URIs and http(s):// URLs are allowed')
399399
}
@@ -468,15 +468,20 @@ async function makeOCRRequest(
468468
}
469469
}
470470

471-
async function parseWithAzureMistralOCR(fileUrl: string, filename: string, mimeType: string) {
471+
async function parseWithAzureMistralOCR(
472+
fileUrl: string,
473+
filename: string,
474+
mimeType: string,
475+
userId?: string
476+
) {
472477
validateOCRConfig(
473478
env.OCR_AZURE_API_KEY,
474479
env.OCR_AZURE_ENDPOINT,
475480
env.OCR_AZURE_MODEL_NAME,
476481
'Azure Mistral OCR'
477482
)
478483

479-
const fileBuffer = await downloadFileForBase64(fileUrl)
484+
const fileBuffer = await downloadFileForBase64(fileUrl, userId)
480485

481486
if (mimeType === 'application/pdf') {
482487
const pageCount = await getPdfPageCount(fileBuffer)
@@ -485,7 +490,7 @@ async function parseWithAzureMistralOCR(fileUrl: string, filename: string, mimeT
485490
`PDF has ${pageCount} pages, exceeds Azure OCR limit of ${MISTRAL_MAX_PAGES}. ` +
486491
`Falling back to file parser.`
487492
)
488-
return parseWithFileParser(fileUrl, filename, mimeType)
493+
return parseWithFileParser(fileUrl, filename, mimeType, userId)
489494
}
490495
logger.info(`Azure Mistral OCR: PDF page count for ${filename}: ${pageCount}`)
491496
}
@@ -529,7 +534,7 @@ async function parseWithAzureMistralOCR(fileUrl: string, filename: string, mimeT
529534
})
530535

531536
logger.info(`Falling back to file parser: ${filename}`)
532-
return parseWithFileParser(fileUrl, filename, mimeType)
537+
return parseWithFileParser(fileUrl, filename, mimeType, userId)
533538
}
534539
}
535540

@@ -589,7 +594,7 @@ async function parseWithMistralOCR(
589594
})
590595

591596
logger.info(`Falling back to file parser: ${filename}`)
592-
return parseWithFileParser(fileUrl, filename, mimeType)
597+
return parseWithFileParser(fileUrl, filename, mimeType, userId)
593598
}
594599
}
595600

@@ -773,15 +778,20 @@ async function processMistralOCRInBatches(
773778
}
774779
}
775780

776-
async function parseWithFileParser(fileUrl: string, filename: string, mimeType: string) {
781+
async function parseWithFileParser(
782+
fileUrl: string,
783+
filename: string,
784+
mimeType: string,
785+
userId?: string
786+
) {
777787
try {
778788
let content: string
779789
let metadata: FileParseMetadata = {}
780790

781791
if (/^data:/i.test(fileUrl)) {
782792
content = await parseDataURI(fileUrl, filename, mimeType)
783793
} else if (/^https?:\/\//i.test(fileUrl)) {
784-
const result = await parseHttpFile(fileUrl, filename, mimeType)
794+
const result = await parseHttpFile(fileUrl, filename, mimeType, userId)
785795
content = result.content
786796
metadata = result.metadata || {}
787797
} else {
@@ -820,9 +830,10 @@ async function parseDataURI(fileUrl: string, filename: string, mimeType: string)
820830
async function parseHttpFile(
821831
fileUrl: string,
822832
filename: string,
823-
mimeType?: string
833+
mimeType?: string,
834+
userId?: string
824835
): Promise<{ content: string; metadata?: FileParseMetadata }> {
825-
const buffer = await downloadFileWithTimeout(fileUrl)
836+
const buffer = await downloadFileWithTimeout(fileUrl, userId)
826837

827838
const extension = resolveParserExtension(filename, mimeType)
828839
const result = await parseBuffer(buffer, extension)

apps/sim/lib/knowledge/documents/service.ts

Lines changed: 6 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -515,7 +515,6 @@ export async function processDocumentAsync(
515515
// KB config + workspace billing + doc tags in one JOIN (was 3 SELECTs).
516516
const contextRows = await db
517517
.select({
518-
userId: knowledgeBase.userId,
519518
workspaceId: knowledgeBase.workspaceId,
520519
chunkingConfig: knowledgeBase.chunkingConfig,
521520
embeddingModel: knowledgeBase.embeddingModel,
@@ -644,7 +643,12 @@ export async function processDocumentAsync(
644643
kbConfig.maxSize,
645644
kbConfig.overlap,
646645
kbConfig.minSize,
647-
ctx.userId,
646+
// Authorize the source file (and run OCR/processing) as the billed
647+
// actor — the uploader when known, else the workspace billed account —
648+
// the same principal embeddings are billed to. Using the KB owner here
649+
// would authorize an attacker-supplied internal fileUrl against the
650+
// owner, letting a KB write-member ingest a file only the owner can read.
651+
billingUserId,
648652
ctx.workspaceId,
649653
rawConfig?.strategy,
650654
rawConfig?.strategyOptions

apps/sim/lib/uploads/utils/file-utils.server.ts

Lines changed: 53 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
'use server'
22

3-
import type { Logger } from '@sim/logger'
3+
import { createLogger, type Logger } from '@sim/logger'
44
import { getErrorMessage } from '@sim/utils/errors'
55
import { getMaxExecutionTimeout } from '@/lib/core/execution-limits'
66
import {
@@ -25,6 +25,8 @@ import {
2525
import { verifyFileAccess } from '@/app/api/files/authorization'
2626
import type { UserFile } from '@/executor/types'
2727

28+
const logger = createLogger('FileUtilsServer')
29+
2830
/**
2931
* Result type for file input resolution
3032
*/
@@ -138,19 +140,62 @@ export async function resolveFileInputToUrl(
138140
}
139141

140142
/**
141-
* Download a file from a URL (internal or external)
142-
* For internal URLs, uses direct storage access (server-side only)
143-
* For external URLs, validates DNS/SSRF and uses secure fetch with IP pinning
143+
* Options for {@link downloadFileFromUrl}.
144+
*/
145+
export interface DownloadFileFromUrlOptions {
146+
/** Download timeout for external URLs. Defaults to the max execution timeout. */
147+
timeoutMs?: number
148+
/** Hard cap on the number of bytes read from the source. */
149+
maxBytes?: number
150+
/**
151+
* Principal the download is performed on behalf of. Required to authorize
152+
* internal (`/api/files/serve/...`) URLs: the resolved storage key is checked
153+
* with {@link verifyFileAccess} before any bytes are read. Without it, internal
154+
* URLs are rejected (fail closed) so a `/api/files/serve/` substring can never
155+
* be treated as implicitly trusted.
156+
*/
157+
userId?: string
158+
}
159+
160+
/**
161+
* Download a file from a URL (internal or external).
162+
*
163+
* For internal URLs, uses direct storage access (server-side only) after
164+
* authorizing the resolved storage key against `userId`. Context is derived
165+
* from the key via {@link inferContextFromKey}, never from a caller-controlled
166+
* `?context=` query param — trusting the param would let a private key be
167+
* labeled with a world-readable context (e.g. profile-pictures) so
168+
* {@link verifyFileAccess} short-circuits to granted while the private object is
169+
* still read. This mirrors how `/api/files/serve` resolves context.
170+
*
171+
* For external URLs, validates DNS/SSRF and uses secure fetch with IP pinning.
144172
*/
145173
export async function downloadFileFromUrl(
146174
fileUrl: string,
147-
timeoutMs = getMaxExecutionTimeout(),
148-
maxBytes?: number
175+
options: DownloadFileFromUrlOptions = {}
149176
): Promise<Buffer> {
150-
const { parseInternalFileUrl } = await import('./file-utils')
177+
const { timeoutMs = getMaxExecutionTimeout(), maxBytes, userId } = options
151178

152179
if (isInternalFileUrl(fileUrl)) {
153-
const { key, context } = parseInternalFileUrl(fileUrl)
180+
if (!userId) {
181+
logger.warn('Internal file download denied: no userId provided', { fileUrl })
182+
throw new Error('Access denied: internal file URL requires an authenticated user')
183+
}
184+
185+
const key = extractStorageKey(fileUrl)
186+
if (!key) {
187+
logger.warn('Internal file download denied: could not resolve storage key', { fileUrl })
188+
throw new Error('Access denied: could not resolve internal file key')
189+
}
190+
191+
const context = inferContextFromKey(key)
192+
193+
const hasAccess = await verifyFileAccess(key, userId, undefined, context, false)
194+
if (!hasAccess) {
195+
logger.warn('Internal file download denied: access check failed', { key, context, userId })
196+
throw new Error('Access denied: file not found or insufficient permissions')
197+
}
198+
154199
const { downloadFile } = await import('@/lib/uploads/core/storage-service')
155200
return downloadFile({ key, context, maxBytes })
156201
}

apps/sim/lib/uploads/utils/file-utils.test.ts

Lines changed: 36 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,42 @@
22
* @vitest-environment node
33
*/
44
import { describe, expect, it } from 'vitest'
5-
import { isAbortError, isNetworkError } from '@/lib/uploads/utils/file-utils'
5+
import { isAbortError, isInternalFileUrl, isNetworkError } from '@/lib/uploads/utils/file-utils'
6+
7+
describe('isInternalFileUrl', () => {
8+
it('classifies relative serve paths as internal', () => {
9+
expect(isInternalFileUrl('/api/files/serve/kb/123-file.pdf')).toBe(true)
10+
expect(isInternalFileUrl('/api/files/serve/workspace/ws-1/file.txt?context=workspace')).toBe(
11+
true
12+
)
13+
})
14+
15+
it('classifies absolute serve URLs as internal regardless of host', () => {
16+
expect(isInternalFileUrl('https://www.sim.ai/api/files/serve/kb/x.pdf')).toBe(true)
17+
expect(isInternalFileUrl('http://localhost:3000/api/files/serve/blob/kb/x')).toBe(true)
18+
// Host is not used to gate (self-hosted/multi-domain); the storage sink authorizes.
19+
expect(isInternalFileUrl('https://other-host/api/files/serve/workspace/v/x')).toBe(true)
20+
})
21+
22+
it('does not match the marker outside the path (query/fragment)', () => {
23+
expect(isInternalFileUrl('https://evil.com/x?next=/api/files/serve/secret')).toBe(false)
24+
expect(isInternalFileUrl('https://evil.com/page#/api/files/serve/secret')).toBe(false)
25+
expect(isInternalFileUrl('https://evil.com/redirect?u=/api/files/serve/kb/x')).toBe(false)
26+
})
27+
28+
it('preserves traversal sequences so they survive downstream rejection', () => {
29+
// Must stay internal (not normalized away) so the parse route applies its `..` check.
30+
expect(isInternalFileUrl('https://attacker.com/api/files/serve/../../../etc/passwd')).toBe(true)
31+
expect(isInternalFileUrl('/api/files/serve/../../app.js')).toBe(true)
32+
})
33+
34+
it('returns false for non-internal and non-string inputs', () => {
35+
expect(isInternalFileUrl('https://example.com/file.pdf')).toBe(false)
36+
expect(isInternalFileUrl('data:text/plain;base64,abc')).toBe(false)
37+
// @ts-expect-error verifying runtime guard
38+
expect(isInternalFileUrl(undefined)).toBe(false)
39+
})
40+
})
641

742
describe('isAbortError', () => {
843
it('returns true for AbortError-named errors', () => {

apps/sim/lib/uploads/utils/file-utils.ts

Lines changed: 25 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -534,10 +534,33 @@ export function extractStorageKey(filePath: string): string {
534534
}
535535

536536
/**
537-
* Check if a URL is an internal file serve URL
537+
* Whether a URL targets the internal file-serve endpoint (`/api/files/serve/`).
538+
*
539+
* The marker is matched only in the URL's path component, so it cannot be
540+
* smuggled through a query string or fragment (e.g.
541+
* `https://evil.com/x?next=/api/files/serve/...`) to skip DNS/SSRF validation.
542+
*
543+
* The raw path is inspected without URL normalization on purpose: callers such
544+
* as the files parse route rely on traversal sequences (`..`) surviving this
545+
* check so they are rejected downstream rather than collapsed away. A path-only
546+
* marker still classifies any host as internal (e.g.
547+
* `https://other-host/api/files/serve/<key>`); cross-tenant reads are prevented
548+
* at the storage sink by {@link verifyFileAccess}, not by host matching, which
549+
* would break self-hosted and multi-domain deployments.
538550
*/
539551
export function isInternalFileUrl(fileUrl: string): boolean {
540-
return fileUrl.includes('/api/files/serve/')
552+
if (typeof fileUrl !== 'string') {
553+
return false
554+
}
555+
556+
let path = fileUrl
557+
const scheme = /^[a-z][a-z0-9+.-]*:\/\/[^/?#]*/i.exec(path)
558+
if (scheme) {
559+
path = path.slice(scheme[0].length)
560+
}
561+
path = path.split(/[?#]/, 1)[0]
562+
563+
return path.startsWith('/api/files/serve/')
541564
}
542565

543566
/**

0 commit comments

Comments
 (0)