Commit 3c6ac28

Merge pull request #103 from firecrawl/mog/parse-api
feat: parse command (ENG-4830)
2 parents d0145f4 + aaca75f commit 3c6ac28

6 files changed

Lines changed: 439 additions & 1 deletion

package.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
   "name": "firecrawl-cli",
-  "version": "1.15.2",
+  "version": "1.16.0",
   "description": "Command-line interface for Firecrawl. Scrape, crawl, and extract data from any website directly from your terminal.",
   "main": "dist/index.js",
   "bin": {

skills/firecrawl-cli/SKILL.md

Lines changed: 2 additions & 0 deletions
@@ -62,6 +62,7 @@ Follow this escalation pattern:
 | AI-powered data extraction | `agent` | Need structured data from complex sites |
 | Interact with a page | `scrape` + `interact` | Content requires clicks, form fills, pagination, or login |
 | Download a site to files | `download` | Save an entire site as local files |
+| Parse a local file | `parse` | File on disk (PDF, DOCX, XLSX, etc.) — not a URL |
 
 For detailed command reference, run `firecrawl <command> --help`.
 
@@ -85,6 +86,7 @@ For detailed command reference, run `firecrawl <command> --help`.
 - **AI-powered structured extraction from complex sites** -> [firecrawl-agent](../firecrawl-agent/SKILL.md)
 - **Clicks, forms, login, pagination, or post-scrape browser actions** -> [firecrawl-interact](../firecrawl-interact/SKILL.md)
 - **Downloading a site to local files** -> [firecrawl-download](../firecrawl-download/SKILL.md)
+- **Parsing a local file (PDF, DOCX, XLSX, HTML, etc.)** -> [firecrawl-parse](../firecrawl-parse/SKILL.md)
 - **Install, auth, or setup problems** -> [rules/install.md](rules/install.md)
 - **Output handling and safe file-reading patterns** -> [rules/security.md](rules/security.md)
 - **Integrating Firecrawl into an app, adding `FIRECRAWL_API_KEY` to `.env`, or choosing endpoint usage in product code** -> use the `firecrawl-build` skills (already installed alongside this CLI skill)

skills/firecrawl-parse/SKILL.md

Lines changed: 61 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,61 @@
1+
---
2+
name: firecrawl-parse
3+
description: |
4+
Efficiently extract and convert the contents of any local file—such as PDF, DOCX, DOC, ODT, RTF, XLSX, XLS, or HTML—into clean, well-formatted markdown saved to disk. Use this skill whenever the user requests to parse, read, or extract information from a file on their computer, including phrases like “parse this PDF”, “convert this document”, “read this file”, “extract text from”, or when a local file path (not a URL) is provided. This skill offers advanced options like generating AI-powered summaries and answering questions based on the file's content. Prefer this tool over `scrape` when handling local files to deliver precise, structured outputs for downstream tasks.
5+
allowed-tools:
6+
- Bash(firecrawl *)
7+
- Bash(npx firecrawl *)
8+
---
9+
10+
# firecrawl parse
11+
12+
Turn a local document into clean markdown on disk. Supports **PDF, DOCX, DOC, ODT, RTF, XLSX, XLS, HTML/HTM/XHTML**.
13+
14+
## When to use
15+
16+
- You have a file on disk (not a URL) and want its text as markdown
17+
- User drops a PDF/DOCX and asks what it says, or to summarize it
18+
- Use `scrape` instead when the source is a URL
19+
20+
## Quick start
21+
22+
Always save to `.firecrawl/` with `-o` — parsed docs can be hundreds of KB and blow up context if streamed to stdout. Add `.firecrawl/` to `.gitignore`.
23+
24+
```bash
25+
mkdir -p .firecrawl
26+
27+
# File → markdown
28+
firecrawl parse ./paper.pdf -o .firecrawl/paper.md
29+
30+
# AI summary
31+
firecrawl parse ./paper.pdf -S -o .firecrawl/paper-summary.md
32+
33+
# Ask a question about the doc
34+
firecrawl parse ./paper.pdf -Q "What are the main conclusions?" \
35+
-o .firecrawl/paper-qa.md
36+
```
37+
38+
Then `head`, `grep`, `rg` etc., or incrementally read the file - don't load the whole thing at once.
39+
40+
## Options
41+
42+
| Option | Description |
43+
| ---------------------- | --------------------------------------- |
44+
| `-S, --summary` | AI-generated summary |
45+
| `-Q, --query <prompt>` | Ask a question about the parsed content |
46+
| `-o, --output <path>` | Output file path — **always use this** |
47+
| `-f, --format <fmt>` | `markdown` (default), `html`, `summary` |
48+
| `--timeout <ms>` | Timeout for the parse job |
49+
| `--timing` | Show request duration |
50+
51+
## Tips
52+
53+
- Quote paths with spaces: `firecrawl parse "./My Doc.pdf" -o .firecrawl/mydoc.md`.
54+
- Max upload size: **50 MB** per file.
55+
- Credits: ~1 per PDF page; HTML is 1 flat.
56+
- Check `.firecrawl/` before re-parsing the same file.
57+
- To check your credit balance (recommended for batch processing and similar workflows), use the `firecrawl credit-usage` command.
58+
59+
## See also
60+
61+
- [firecrawl-scrape](../firecrawl-scrape/SKILL.md) — same idea for URLs
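
Note on the skill doc above: it lists XHTML as supported and gives a 50 MB upload limit, while the `SUPPORTED_EXTENSIONS` set in `src/commands/parse.ts` (as shown in this commit) has no `.xhtml` entry and the command does no local size check. A minimal pre-flight sketch of those two constraints follows. `preflight` and `extensionOf` are hypothetical helpers, not part of the CLI, and reading "50 MB" as 50 * 1024 * 1024 bytes is an assumption.

```typescript
// Extensions copied from SUPPORTED_EXTENSIONS in src/commands/parse.ts.
const SUPPORTED = new Set([
  '.html', '.htm', '.pdf', '.docx', '.doc', '.odt', '.rtf', '.xlsx', '.xls',
]);

// Assumption: the documented "50 MB" limit is treated as 50 * 1024 * 1024 bytes.
const MAX_UPLOAD_BYTES = 50 * 1024 * 1024;

type Preflight = { ok: true } | { ok: false; reason: string };

// Hypothetical helper: lower-cased extension of the last path segment,
// or '' when there is none (dotfiles such as ".gitignore" count as none).
function extensionOf(filePath: string): string {
  const base = filePath.split(/[\\/]/).pop() ?? '';
  const dot = base.lastIndexOf('.');
  return dot > 0 ? base.slice(dot).toLowerCase() : '';
}

// Hypothetical helper: checks the extension and size before any upload.
function preflight(filePath: string, sizeBytes: number): Preflight {
  const ext = extensionOf(filePath);
  if (!SUPPORTED.has(ext)) {
    return { ok: false, reason: `unsupported extension "${ext || '(none)'}"` };
  }
  if (sizeBytes > MAX_UPLOAD_BYTES) {
    return { ok: false, reason: `file is ${sizeBytes} bytes, over the 50 MB limit` };
  }
  return { ok: true };
}
```

Under these assumptions, `preflight('./page.xhtml', 1024)` is rejected on extension, which matches what the CLI's own extension check would do with such a file.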

src/commands/parse.ts

Lines changed: 239 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,239 @@
1+
/**
2+
* Parse command implementation
3+
*
4+
* Uploads a local file to the Firecrawl /v2/parse endpoint and returns the
5+
* parsed document in the requested format(s). Supported file types:
6+
* .html, .htm, .pdf, .docx, .doc, .odt, .rtf, .xlsx, .xls
7+
*/
8+
9+
import * as fs from 'fs';
10+
import * as path from 'path';
11+
import type { FormatOption } from '@mendable/firecrawl-js';
12+
import type { ParseOptions, ParseResult } from '../types/parse';
13+
import type { ScrapeFormat } from '../types/scrape';
14+
import { getClient } from '../utils/client';
15+
import { getConfig, validateConfig } from '../utils/config';
16+
import { handleScrapeOutput } from '../utils/output';
17+
18+
const DEFAULT_API_URL = 'https://api.firecrawl.dev';
19+
20+
/** File extensions accepted by /v2/parse (mirrors the API controller). */
21+
const SUPPORTED_EXTENSIONS = new Set([
22+
'.html',
23+
'.htm',
24+
'.pdf',
25+
'.docx',
26+
'.doc',
27+
'.odt',
28+
'.rtf',
29+
'.xlsx',
30+
'.xls',
31+
]);
32+
33+
/**
34+
* Best-effort content-type lookup so the API's kind detector has a hint
35+
* even if the extension is ambiguous.
36+
*/
37+
const CONTENT_TYPE_BY_EXT: Record<string, string> = {
38+
'.html': 'text/html',
39+
'.htm': 'text/html',
40+
'.pdf': 'application/pdf',
41+
'.docx':
42+
'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
43+
'.doc': 'application/msword',
44+
'.odt': 'application/vnd.oasis.opendocument.text',
45+
'.rtf': 'application/rtf',
46+
'.xlsx': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
47+
'.xls': 'application/vnd.ms-excel',
48+
};
49+
+function outputTiming(
+  options: ParseOptions,
+  requestStartTime: number,
+  requestEndTime: number,
+  error?: Error | unknown
+): void {
+  if (!options.timing) return;
+
+  const duration = requestEndTime - requestStartTime;
+  const info: Record<string, string> = {
+    file: options.file,
+    requestTime: new Date(requestStartTime).toISOString(),
+    duration: `${duration}ms`,
+    status: error ? 'error' : 'success',
+  };
+  if (error) {
+    info.error = error instanceof Error ? error.message : 'Unknown error';
+  }
+  console.error('Timing:', JSON.stringify(info, null, 2));
+}
+
+/**
+ * Build the `formats` array sent to the API (mirrors scrape's behavior).
+ */
+function buildFormats(options: ParseOptions): FormatOption[] {
+  const formats: FormatOption[] = [];
+
+  if (options.formats && options.formats.length > 0) {
+    formats.push(...options.formats);
+  }
+
+  if (options.query) {
+    formats.push({ type: 'query', prompt: options.query } as any);
+  }
+
+  if (formats.length === 0) {
+    formats.push('markdown');
+  }
+
+  return formats;
+}
+
+/**
+ * Build the JSON `options` payload uploaded alongside the file.
+ */
+function buildOptionsPayload(options: ParseOptions): Record<string, unknown> {
+  const payload: Record<string, unknown> = {
+    formats: buildFormats(options),
+    integration: 'cli',
+  };
+
+  if (options.onlyMainContent !== undefined) {
+    payload.onlyMainContent = options.onlyMainContent;
+  }
+  if (options.includeTags && options.includeTags.length > 0) {
+    payload.includeTags = options.includeTags;
+  }
+  if (options.excludeTags && options.excludeTags.length > 0) {
+    payload.excludeTags = options.excludeTags;
+  }
+  if (options.timeout !== undefined) {
+    payload.timeout = options.timeout;
+  }
+  if (options.location) {
+    payload.location = options.location;
+  }
+
+  return payload;
+}
+
+/**
+ * Execute the parse command by POSTing a multipart upload to /v2/parse.
+ */
+export async function executeParse(
+  options: ParseOptions
+): Promise<ParseResult> {
+  const filePath = path.resolve(options.file);
+
+  if (!fs.existsSync(filePath)) {
+    return {
+      success: false,
+      error: `File not found: ${options.file}`,
+    };
+  }
+
+  const stat = fs.statSync(filePath);
+  if (!stat.isFile()) {
+    return {
+      success: false,
+      error: `Not a file: ${options.file}`,
+    };
+  }
+
+  const ext = path.extname(filePath).toLowerCase();
+  if (!SUPPORTED_EXTENSIONS.has(ext)) {
+    return {
+      success: false,
+      error:
+        `Unsupported file type "${ext || '(none)'}". ` +
+        `Supported extensions: ${[...SUPPORTED_EXTENSIONS].join(', ')}`,
+    };
+  }
+
+  // Ensure auth/url is resolved through the same config pipeline scrape uses.
+  if (options.apiKey || options.apiUrl) {
+    getClient({ apiKey: options.apiKey, apiUrl: options.apiUrl });
+  }
+
+  const config = getConfig();
+  const apiKey = options.apiKey || config.apiKey;
+  validateConfig(apiKey);
+
+  const apiUrl = (options.apiUrl || config.apiUrl || DEFAULT_API_URL).replace(
+    /\/$/,
+    ''
+  );
+
+  const buffer = fs.readFileSync(filePath);
+  const filename = path.basename(filePath);
+  const contentType = CONTENT_TYPE_BY_EXT[ext] ?? 'application/octet-stream';
+
+  const form = new FormData();
+  form.append(
+    'file',
+    new Blob([new Uint8Array(buffer)], { type: contentType }),
+    filename
+  );
+  form.append('options', JSON.stringify(buildOptionsPayload(options)));
+
+  const requestStartTime = Date.now();
+
+  try {
+    const response = await fetch(`${apiUrl}/v2/parse`, {
+      method: 'POST',
+      headers: apiKey ? { Authorization: `Bearer ${apiKey}` } : undefined,
+      body: form,
+    });
+
+    const requestEndTime = Date.now();
+    outputTiming(options, requestStartTime, requestEndTime);
+
+    const payload = (await response.json().catch(() => ({}))) as any;
+
+    if (!response.ok || payload?.success === false) {
+      const message =
+        payload?.error ||
+        `HTTP ${response.status}: ${response.statusText || 'Request failed'}`;
+      return { success: false, error: message };
+    }
+
+    return {
+      success: true,
+      data: payload?.data ?? payload,
+    };
+  } catch (error) {
+    const requestEndTime = Date.now();
+    outputTiming(options, requestStartTime, requestEndTime, error);
+    return {
+      success: false,
+      error: error instanceof Error ? error.message : 'Unknown error occurred',
+    };
+  }
+}
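
Note on the response handling in `executeParse`: the branch structure can be isolated as a pure function to see how each kind of response is classified. This is only a sketch. `interpretResponse` and the `Result` type are hypothetical stand-ins (the real `ParseResult` lives in `../types/parse`, which is not shown here), and the sample payloads are invented for illustration.

```typescript
// Simplified stand-in for ParseResult; the real type is not shown in this commit view.
type Result =
  | { success: true; data: unknown }
  | { success: false; error: string };

// Hypothetical pure helper mirroring the success/error branches in executeParse.
function interpretResponse(
  ok: boolean,
  status: number,
  statusText: string,
  payload: any
): Result {
  if (!ok || payload?.success === false) {
    const message =
      payload?.error || `HTTP ${status}: ${statusText || 'Request failed'}`;
    return { success: false, error: message };
  }
  return { success: true, data: payload?.data ?? payload };
}

// An API-reported failure is treated as an error even on HTTP 200.
const apiFailure = interpretResponse(true, 200, 'OK', {
  success: false,
  error: 'Unsupported file',
});

// A non-JSON error body becomes {} upstream, so the HTTP status line is used.
const httpFailure = interpretResponse(false, 502, '', {});

// A non-JSON body on a 2xx response is reported as success with empty data.
const emptySuccess = interpretResponse(true, 200, 'OK', {});
```

The last case follows from the `.catch(() => ({}))` fallback in the diff: an unparseable 2xx body would come back as `{ success: true, data: {} }`. Whether the API can actually return such a body is not something this commit shows.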
+
+/**
+ * Handle parse command output. Reuses the scrape output formatter since the
+ * /v2/parse response shape matches /v2/scrape.
+ */
+export async function handleParseCommand(options: ParseOptions): Promise<void> {
+  const result = await executeParse(options);
+
+  if (options.query && result.success && result.data?.answer) {
+    const { writeOutput } = await import('../utils/output');
+    writeOutput(result.data.answer, options.output, !!options.output);
+    return;
+  }
+
+  const effectiveFormats: ScrapeFormat[] =
+    options.formats && options.formats.length > 0
+      ? [...options.formats]
+      : ['markdown'];
+
+  handleScrapeOutput(
+    result,
+    effectiveFormats,
+    options.output,
+    options.pretty,
+    options.json
+  );
+}
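
Note on format selection: `buildFormats` and `handleParseCommand` each compute a format list, and the two can differ when only a query is given. The sketch below re-implements both rules as standalone functions to make that visible. `Format`, `requestedFormats`, and `displayFormats` are hypothetical names used only for illustration, with simplified types in place of `FormatOption` and `ScrapeFormat`.

```typescript
// Simplified stand-in for FormatOption from '@mendable/firecrawl-js'.
type Format = string | { type: 'query'; prompt: string };

interface FormatInput {
  formats?: string[];
  query?: string;
}

// Mirrors buildFormats: what is sent to the API in the `formats` array.
function requestedFormats(opts: FormatInput): Format[] {
  const formats: Format[] = [];
  if (opts.formats && opts.formats.length > 0) formats.push(...opts.formats);
  if (opts.query) formats.push({ type: 'query', prompt: opts.query });
  if (formats.length === 0) formats.push('markdown');
  return formats;
}

// Mirrors effectiveFormats in handleParseCommand: what the output formatter
// is asked to print when no query answer short-circuits the output.
function displayFormats(opts: FormatInput): string[] {
  return opts.formats && opts.formats.length > 0
    ? [...opts.formats]
    : ['markdown'];
}

const queryOnly: FormatInput = { query: 'What are the main conclusions?' };
const sent = requestedFormats(queryOnly);
const shown = displayFormats(queryOnly);
```

With a query and no explicit formats, `sent` holds only the query entry while `shown` is `['markdown']`. So if the response carried no `answer`, the formatter would be asked for markdown that was never requested. Whether that path can occur depends on API behavior not visible in this commit.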
