
Commit 88707a4

feat: implement CLI v0.1.0 — full ScrapeGraph API integration
Complete rewrite from scaffolding to working CLI tool:

- 10 commands: smart-scraper, search-scraper, markdownify, crawl, sitemap, scrape, agentic-scraper, generate-schema, credits, validate
- SDK layer with zod validation, async polling, debug logging, elapsed time tracking
- Config: env var → .env (dotenv) → ~/.scrapegraphai/config.json → interactive prompt
- Syntax-highlighted JSON output via chalk
- TypeScript + tsup ESM build targeting Node 22

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 1430251 commit 88707a4

28 files changed

Lines changed: 1424 additions & 103 deletions

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,4 @@
+.ai/docs
 # Dependencies
 node_modules/
 npm-debug.log*
@@ -40,3 +41,6 @@ coverage/
 tmp/
 temp/
 *.tmp
+
+# Bun
+bun.lock

CLAUDE.md

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+
+@/README.md
+
+
+### Development
+
+**important** Always update README.md after every change in the library
+
+
+@/SPEC.md

README.md

Lines changed: 222 additions & 48 deletions
@@ -1,81 +1,255 @@
 # ScrapeGraph CLI
 
-A basic command-line interface (CLI) tool built with Node.js, following best practices for CLI development.
+Command-line interface for [ScrapeGraph AI](https://scrapegraphai.com) — AI-powered web scraping, data extraction, search, and crawling.
 
-## Features
+## Tech Stack
 
-- 🎨 Beautiful terminal output with colors and boxes using `chalk` and `boxen`
-- 📝 Command-line argument parsing with `yargs`
-- 🚀 Easy to install and use globally
-- 🔧 Extensible architecture
+| Concern | Tool |
+|---|---|
+| Language | **TypeScript 5.8** |
+| Dev Runtime | **Bun** |
+| Build | **tsup** (esbuild) |
+| CLI Framework | **citty** (unjs) |
+| Prompts | **@clack/prompts** |
+| Styling | **chalk** v5 (ESM) |
+| Validation | **zod** v4 |
+| Env | **dotenv** |
+| Lint / Format | **Biome** |
+| Target | **Node.js 22+**, ESM-only |
 
-## Installation
+## Setup
 
-### Local Development
+```bash
+bun install
+```
+
+## Configuration
+
+The CLI needs a ScrapeGraph API key. Get one at [dashboard.scrapegraphai.com](https://dashboard.scrapegraphai.com).
+
+Four ways to provide it (checked in order):
+
+1. **Environment variable**: `export SGAI_API_KEY="sgai-..."`
+2. **`.env` file**: create a `.env` file in the project root with `SGAI_API_KEY=sgai-...`
+3. **Config file**: stored in `~/.scrapegraphai/config.json`
+4. **Interactive prompt**: if none of the above are set, the CLI prompts you and saves it to the config file
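
The lookup order in the list above can be sketched as a small pure function. This is only an illustration of the precedence, not the code in `src/lib/folders.ts`: the `KeyLookup` shape, the `resolveApiKey` name, and the `apiKey` field inside the config file are all assumptions.

```typescript
// Hypothetical sketch of the four-step lookup; all names here are illustrative.
type KeySource = "env" | "dotenv" | "config" | "prompt";

interface KeyLookup {
  env: Record<string, string | undefined>; // e.g. process.env
  dotenv: Record<string, string | undefined>; // values parsed from a .env file
  config: { apiKey?: string }; // assumed shape of ~/.scrapegraphai/config.json
}

function resolveApiKey(lookup: KeyLookup): { source: KeySource; key?: string } {
  // 1. A key already exported in the environment wins.
  const fromEnv = lookup.env.SGAI_API_KEY;
  if (fromEnv) return { source: "env", key: fromEnv };

  // 2. Otherwise fall back to the .env file.
  const fromDotenv = lookup.dotenv.SGAI_API_KEY;
  if (fromDotenv) return { source: "dotenv", key: fromDotenv };

  // 3. Then the saved config file.
  if (lookup.config.apiKey) return { source: "config", key: lookup.config.apiKey };

  // 4. Nothing found: the caller would have to prompt interactively.
  return { source: "prompt" };
}
```

Returning the source alongside the key makes it easy for a caller to decide whether the key still needs to be saved to the config file.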
+
+### Timeout
+
+Set `SGAI_CLI_TIMEOUT_S` to override the default 120s request/polling timeout:
+
+```bash
+export SGAI_CLI_TIMEOUT_S=300
+```
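
As a rough sketch of how such an override might be read, the helper below converts the variable to milliseconds and falls back to the documented 120s default. The `timeoutMs` name and the choice to fall back silently on invalid input are assumptions; the actual CLI may reject bad values instead.

```typescript
// Documented default from the README: 120 seconds.
const DEFAULT_TIMEOUT_S = 120;

// Hypothetical helper: turn the raw SGAI_CLI_TIMEOUT_S value into milliseconds.
function timeoutMs(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return DEFAULT_TIMEOUT_S * 1000;
  const seconds = Number(raw);
  // Assumption: non-numeric or non-positive values fall back to the default.
  if (!Number.isFinite(seconds) || seconds <= 0) return DEFAULT_TIMEOUT_S * 1000;
  return seconds * 1000;
}
```

A caller would pass `process.env.SGAI_CLI_TIMEOUT_S` and use the result as the deadline for both the request and the polling loop.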
+
+### Debug Logging
+
+Set `SGAI_CLI_DEBUG=1` to enable debug logging (outputs to stderr):
+
+```bash
+SGAI_CLI_DEBUG=1 scrapegraphai smart-scraper https://example.com -p "Extract data"
+```
+
+## Commands
+
+### `smart-scraper` — Extract structured data from a URL
+
+```bash
+scrapegraphai smart-scraper <url> -p "Extract all product names and prices"
+
+# With JSON schema
+scrapegraphai smart-scraper https://example.com/products -p "Extract products" \
+  --schema '{"type":"object","properties":{"products":{"type":"array","items":{"type":"object","properties":{"name":{"type":"string"},"price":{"type":"number"}}}}}}'
+
+# With options
+scrapegraphai smart-scraper https://example.com -p "Extract data" \
+  --stealth --render-js --scrolls 10 --pages 5
+```
+
+| Option | Description |
+|---|---|
+| `-p, --prompt` | Extraction prompt (required) |
+| `--schema` | Output JSON schema (JSON string) |
+| `--scrolls` | Infinite scroll count (0-100) |
+| `--pages` | Total pages to scrape (1-100) |
+| `--render-js` | Enable JS rendering (+1 credit) |
+| `--stealth` | Bypass bot detection (+4 credits) |
+| `--cookies` | Cookies as JSON object string |
+| `--headers` | Custom headers as JSON object string |
+| `--plain-text` | Return plain text instead of JSON |
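
The constraints in the option table (a required prompt, `--scrolls` in 0-100, `--pages` in 1-100, JSON object strings for `--headers`) could be checked along the following lines. The commit message says the real implementation validates with zod; this dependency-free sketch only illustrates the same rules, and the `SmartScraperOptions` shape and function names are assumptions.

```typescript
// Hypothetical option shape; field names mirror the flags in the table above.
interface SmartScraperOptions {
  prompt: string;
  scrolls?: number;
  pages?: number;
  headers?: string; // raw JSON object string from --headers
}

function inRange(value: number, min: number, max: number): boolean {
  return Number.isInteger(value) && value >= min && value <= max;
}

// True only when the string parses to a plain JSON object (not an array or scalar).
function isJsonObject(raw: string): boolean {
  try {
    const parsed: unknown = JSON.parse(raw);
    return typeof parsed === "object" && parsed !== null && !Array.isArray(parsed);
  } catch {
    return false;
  }
}

// Collects every violated rule instead of stopping at the first one.
function validateSmartScraperOptions(opts: SmartScraperOptions): string[] {
  const errors: string[] = [];
  if (opts.prompt.trim() === "") errors.push("--prompt is required");
  if (opts.scrolls !== undefined && !inRange(opts.scrolls, 0, 100)) {
    errors.push("--scrolls must be an integer from 0 to 100");
  }
  if (opts.pages !== undefined && !inRange(opts.pages, 1, 100)) {
    errors.push("--pages must be an integer from 1 to 100");
  }
  if (opts.headers !== undefined && !isJsonObject(opts.headers)) {
    errors.push("--headers must be a JSON object string");
  }
  return errors;
}
```

An empty array means the options pass; reporting all errors at once saves the user a round trip per mistake.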
+
+### `search-scraper` — Search the web and extract data
+
+```bash
+scrapegraphai search-scraper "What are the top Python web frameworks?"
+
+# Markdown only (cheaper)
+scrapegraphai search-scraper "Python frameworks" --no-extraction --num-results 5
+```
+
+| Option | Description |
+|---|---|
+| `--num-results` | Number of websites (3-20, default 3) |
+| `--no-extraction` | Markdown only (2 credits/site vs 10) |
+| `--schema` | Output JSON schema (JSON string) |
+| `--stealth` | Bypass bot detection (+4 credits) |
+| `--headers` | Custom headers as JSON object string |
+
+### `markdownify` — Convert a webpage to markdown
 
-1. Clone the repository:
-   ```bash
-   git clone <repository-url>
-   cd scrapegraph-cli
-   ```
+```bash
+scrapegraphai markdownify https://example.com/article
+scrapegraphai markdownify https://example.com --render-js --stealth
+```
+
+| Option | Description |
+|---|---|
+| `--render-js` | Enable JS rendering (+1 credit) |
+| `--stealth` | Bypass bot detection (+4 credits) |
+| `--headers` | Custom headers as JSON object string |
+
+### `crawl` — Crawl and extract from multiple pages
+
+```bash
+scrapegraphai crawl https://example.com -p "Extract article titles" --max-pages 5 --depth 2
+
+# Markdown only
+scrapegraphai crawl https://example.com --no-extraction --max-pages 10
+
+# With crawl rules
+scrapegraphai crawl https://example.com -p "Extract data" \
+  --rules '{"include_paths":["/blog/*"],"same_domain":true}'
+```
 
-2. Install dependencies:
-   ```bash
-   npm install
-   ```
+| Option | Description |
+|---|---|
+| `-p, --prompt` | Extraction prompt (required when extraction is on) |
+| `--no-extraction` | Markdown only (2 credits/page vs 10) |
+| `--max-pages` | Max pages to crawl (default 10) |
+| `--depth` | Crawl depth (default 1) |
+| `--schema` | Output JSON schema (JSON string) |
+| `--rules` | Crawl rules as JSON object string |
+| `--no-sitemap` | Disable sitemap-based discovery |
+| `--render-js` | Enable JS rendering (+1 credit/page) |
+| `--stealth` | Bypass bot detection (+4 credits) |
 
-3. Install globally (from the project root):
-   ```bash
-   npm install -g .
-   ```
+### `sitemap` — Get all URLs from a website's sitemap
 
-## Usage
+```bash
+scrapegraphai sitemap https://example.com
+```
 
-After installation, you can use the CLI from anywhere in your terminal:
+### `scrape` — Get raw HTML content
 
 ```bash
-# Run the CLI
-scrapegraphai
+scrapegraphai scrape https://example.com
+scrapegraphai scrape https://example.com --stealth --branding --country-code US
+```
 
-# Show help
-scrapegraphai --help
+| Option | Description |
+|---|---|
+| `--render-js` | Enable JS rendering (+1 credit) |
+| `--stealth` | Bypass bot detection (+4 credits) |
+| `--branding` | Extract branding info (+2 credits) |
+| `--country-code` | ISO country code for geo-targeting |
+
+### `agentic-scraper` — Browser automation with AI
+
+```bash
+scrapegraphai agentic-scraper https://example.com/login \
+  -s "Fill email with user@test.com,Fill password with pass123,Click Sign In" \
+  --ai-extraction -p "Extract dashboard data"
 ```
 
-### Command Options
+| Option | Description |
+|---|---|
+| `-s, --steps` | Comma-separated browser steps |
+| `-p, --prompt` | Extraction prompt (with `--ai-extraction`) |
+| `--schema` | Output JSON schema (JSON string) |
+| `--ai-extraction` | Enable AI extraction after steps |
+| `--use-session` | Persist browser session |
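
Since `--steps` takes a single comma-separated string, the command presumably has to split it into individual steps before sending them to the API. A minimal sketch of that split, with `parseSteps` as a hypothetical name:

```typescript
// Hypothetical helper: split the --steps value into trimmed, non-empty steps.
function parseSteps(raw: string): string[] {
  return raw
    .split(",")
    .map((step) => step.trim())
    .filter((step) => step.length > 0);
}

// The example from the README yields three steps.
const steps = parseSteps(
  "Fill email with user@test.com,Fill password with pass123,Click Sign In",
);
```

One consequence of a plain comma split is that a step whose text itself contains a comma would be broken in two, so step descriptions need to avoid commas under this scheme.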
 
-- `--help`: Show help message
-- `--version`: Show version number
+### `generate-schema` — Generate JSON schema from a prompt
+
+```bash
+scrapegraphai generate-schema "Schema for an e-commerce product with name, price, and reviews"
+```
+
+| Option | Description |
+|---|---|
+| `--existing-schema` | Existing schema to modify (JSON string) |
+
+### `credits` — Check credit balance
+
+```bash
+scrapegraphai credits
+```
+
+### `validate` — Validate your API key
+
+```bash
+scrapegraphai validate
+```
 
 ## Project Structure
 
 ```
 scrapegraph-cli/
-├── bin/
-│   └── index.js          # CLI entry point
-├── package.json          # Project configuration
-└── README.md             # This file
+├── src/
+│   ├── cli.ts                 # Entry point, citty main command + subcommands
+│   ├── lib/
+│   │   ├── env.ts             # Zod-parsed env config (API key, debug, timeout)
+│   │   ├── folders.ts         # API key resolution + interactive prompt
+│   │   ├── scrapegraphai.ts   # SDK layer — all API functions
+│   │   ├── schemas.ts         # Zod validation schemas
+│   │   └── log.ts             # Syntax-highlighted JSON output
+│   ├── types/
+│   │   └── index.ts           # Zod-derived types + ApiResult
+│   ├── commands/
+│   │   ├── smart-scraper.ts
+│   │   ├── search-scraper.ts
+│   │   ├── markdownify.ts
+│   │   ├── crawl.ts
+│   │   ├── sitemap.ts
+│   │   ├── scrape.ts
+│   │   ├── agentic-scraper.ts
+│   │   ├── generate-schema.ts
+│   │   ├── credits.ts
+│   │   └── validate.ts
+│   └── utils/
+│       └── banner.ts          # ASCII banner display
+├── dist/                      # Build output (git-ignored)
+│   └── cli.mjs                # Bundled ESM with shebang
+├── package.json
+├── tsconfig.json
+├── tsup.config.ts
+├── biome.json
+└── .gitignore
 ```
 
-## Development
+## Scripts
 
-The CLI is built using:
+| Script | Command | Description |
+|---|---|---|
+| `dev` | `bun run src/cli.ts` | Run CLI from TS source |
+| `build` | `tsup` | Bundle ESM to `dist/cli.mjs` |
+| `lint` | `biome check .` | Lint + format check |
+| `format` | `biome format . --write` | Auto-format |
+| `test` | `bun test` | Run tests |
+| `check` | `tsc --noEmit && biome check .` | Type-check + lint |
 
-- **[yargs](https://www.npmjs.com/package/yargs)** - Command-line argument parsing
-- **[chalk](https://www.npmjs.com/package/chalk)** - Terminal string styling
-- **[boxen](https://www.npmjs.com/package/boxen)** - Create boxes in terminal
+## Output
 
-## Customization
+All commands output pretty-printed JSON to stdout (pipeable). Errors go to stderr via `@clack/prompts`.
 
-To customize the CLI:
+```bash
+# Pipe output to jq
+scrapegraphai credits | jq '.remaining_credits'
 
-1. **Change the command name**: Edit the `bin` field in `package.json`
-2. **Add new options**: Modify the `yargs` configuration in `bin/index.js`
-3. **Update functionality**: Extend the main CLI logic in `bin/index.js`
+# Save to file
+scrapegraphai smart-scraper https://example.com -p "Extract data" > result.json
+```
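
The stdout/stderr split described in the Output section is what keeps the `jq` pipe clean: only the JSON payload reaches stdout. A minimal sketch of that separation follows; the `Result` shape, the `report` name, and the returned exit code are assumptions, since the commit's own `ApiResult` type and `log.ts` are not shown in this diff.

```typescript
// Hypothetical result shape; the real ApiResult type is not visible here.
type Result = { ok: true; data: unknown } | { ok: false; error: string };
type Writer = (line: string) => void;

// Writes pretty-printed JSON to the "out" writer on success and the error
// message to the "err" writer on failure; returns a suggested exit code.
function report(result: Result, writeOut: Writer, writeErr: Writer): number {
  if (result.ok) {
    writeOut(JSON.stringify(result.data, null, 2));
    return 0;
  }
  writeErr(result.error);
  return 1;
}
```

In a Node CLI the writers would typically be `console.log` (stdout) and `console.error` (stderr); passing them in keeps the function easy to test.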
 
 ## License
 
 ISC
-
-## Contributing
-
-Contributions are welcome! Please feel free to submit a Pull Request.

SPEC.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+when in trouble read Documents/scrapegraph-py for references and docs/scrapegraphai.md

bin/index.js

Lines changed: 0 additions & 32 deletions
This file was deleted.

biome.json

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+{
+  "$schema": "https://biomejs.dev/schemas/1.9.4/schema.json",
+  "organizeImports": {
+    "enabled": true
+  },
+  "formatter": {
+    "enabled": true,
+    "indentStyle": "tab",
+    "lineWidth": 100
+  },
+  "linter": {
+    "enabled": true,
+    "rules": {
+      "recommended": true
+    }
+  },
+  "files": {
+    "ignore": ["node_modules", "dist", "bun.lock"]
+  }
+}
