|
1 | 1 | # ScrapeGraph CLI |
2 | 2 |
|
3 | | -A basic command-line interface (CLI) tool built with Node.js, following best practices for CLI development. |
| 3 | +Command-line interface for [ScrapeGraph AI](https://scrapegraphai.com) — AI-powered web scraping, data extraction, search, and crawling. |
4 | 4 |
|
5 | | -## Features |
| 5 | +## Tech Stack |
6 | 6 |
|
7 | | -- 🎨 Beautiful terminal output with colors and boxes using `chalk` and `boxen` |
8 | | -- 📝 Command-line argument parsing with `yargs` |
9 | | -- 🚀 Easy to install and use globally |
10 | | -- 🔧 Extensible architecture |
| 7 | +| Concern | Tool | |
| 8 | +|---|---| |
| 9 | +| Language | **TypeScript 5.8** | |
| 10 | +| Dev Runtime | **Bun** | |
| 11 | +| Build | **tsup** (esbuild) | |
| 12 | +| CLI Framework | **citty** (unjs) | |
| 13 | +| Prompts | **@clack/prompts** | |
| 14 | +| Styling | **chalk** v5 (ESM) | |
| 15 | +| Validation | **zod** v4 | |
| 16 | +| Env | **dotenv** | |
| 17 | +| Lint / Format | **Biome** | |
| 18 | +| Target | **Node.js 22+**, ESM-only | |
11 | 19 |
|
12 | | -## Installation |
| 20 | +## Setup |
13 | 21 |
|
14 | | -### Local Development |
| 22 | +```bash |
| 23 | +bun install |
| 24 | +``` |
| 25 | + |
| 26 | +## Configuration |
| 27 | + |
| 28 | +The CLI needs a ScrapeGraph API key. Get one at [dashboard.scrapegraphai.com](https://dashboard.scrapegraphai.com). |
| 29 | + |
| 30 | +Four ways to provide it (checked in order): |
| 31 | + |
| 32 | +1. **Environment variable**: `export SGAI_API_KEY="sgai-..."` |
| 33 | +2. **`.env` file**: create a `.env` file in the project root with `SGAI_API_KEY=sgai-...` |
| 34 | +3. **Config file**: stored in `~/.scrapegraphai/config.json` |
| 35 | +4. **Interactive prompt**: if none of the above is set, the CLI prompts for the key and saves it to the config file |
| 36 | + |
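The lookup order listed above can be sketched as a small pure function. This is only an illustration of the precedence, not the code in `src/lib/folders.ts`: the `KeySources` shape, the `apiKey` field name in the config file, and the synchronous `prompt`/`save` callbacks are all assumptions made for the example.

```typescript
// Hypothetical sketch of the documented precedence; all names are illustrative.
type KeySources = {
  env: Record<string, string | undefined>; // e.g. process.env
  dotenv: Record<string, string | undefined>; // values parsed from a .env file
  configFile: { apiKey?: string } | null; // assumed shape of ~/.scrapegraphai/config.json
  prompt: () => string; // stand-in for the interactive prompt
  save: (key: string) => void; // stand-in for writing the config file
};

type Resolved = { key: string; source: "env" | "dotenv" | "config" | "prompt" };

function resolveApiKey(s: KeySources): Resolved {
  // 1. Environment variable
  const fromEnv = s.env["SGAI_API_KEY"];
  if (fromEnv) return { key: fromEnv, source: "env" };

  // 2. .env file
  const fromDotenv = s.dotenv["SGAI_API_KEY"];
  if (fromDotenv) return { key: fromDotenv, source: "dotenv" };

  // 3. Config file
  const fromConfig = s.configFile?.apiKey;
  if (fromConfig) return { key: fromConfig, source: "config" };

  // 4. Interactive prompt, then persist the answer for next time
  const entered = s.prompt();
  s.save(entered);
  return { key: entered, source: "prompt" };
}
```

Passing the sources in as arguments keeps the precedence testable without touching the real environment or home directory.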
| 37 | +### Timeout |
| 38 | + |
| 39 | +Set `SGAI_CLI_TIMEOUT_S` (in seconds) to override the default 120-second request/polling timeout: |
| 40 | + |
| 41 | +```bash |
| 42 | +export SGAI_CLI_TIMEOUT_S=300 |
| 43 | +``` |
| 44 | + |
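A minimal sketch of how a seconds-valued variable like `SGAI_CLI_TIMEOUT_S` might be read, assuming a hand-rolled parser rather than the Zod-based config the project structure mentions. The helper names are hypothetical, and falling back to the default on invalid input is an assumption; the real CLI may reject bad values instead.

```typescript
const DEFAULT_TIMEOUT_S = 120; // default stated in the README

// Hypothetical helper: returns the default for missing or invalid input.
function parseTimeoutSeconds(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return DEFAULT_TIMEOUT_S;
  const n = Number(raw);
  if (!Number.isFinite(n) || n <= 0) return DEFAULT_TIMEOUT_S;
  return n;
}

// Timer APIs such as setTimeout expect milliseconds, so convert once here.
function timeoutMs(raw: string | undefined): number {
  return parseTimeoutSeconds(raw) * 1000;
}
```

In a Node.js or Bun process the argument would typically be `process.env.SGAI_CLI_TIMEOUT_S`.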
| 45 | +### Debug Logging |
| 46 | + |
| 47 | +Set `SGAI_CLI_DEBUG=1` to enable debug logging (outputs to stderr): |
| 48 | + |
| 49 | +```bash |
| 50 | +SGAI_CLI_DEBUG=1 scrapegraphai smart-scraper https://example.com -p "Extract data" |
| 51 | +``` |
| 52 | + |
| 53 | +## Commands |
| 54 | + |
| 55 | +### `smart-scraper` — Extract structured data from a URL |
| 56 | + |
| 57 | +```bash |
| 58 | +scrapegraphai smart-scraper <url> -p "Extract all product names and prices" |
| 59 | + |
| 60 | +# With JSON schema |
| 61 | +scrapegraphai smart-scraper https://example.com/products -p "Extract products" \ |
| 62 | + --schema '{"type":"object","properties":{"products":{"type":"array","items":{"type":"object","properties":{"name":{"type":"string"},"price":{"type":"number"}}}}}}' |
| 63 | + |
| 64 | +# With options |
| 65 | +scrapegraphai smart-scraper https://example.com -p "Extract data" \ |
| 66 | + --stealth --render-js --scrolls 10 --pages 5 |
| 67 | +``` |
| 68 | + |
| 69 | +| Option | Description | |
| 70 | +|---|---| |
| 71 | +| `-p, --prompt` | Extraction prompt (required) | |
| 72 | +| `--schema` | Output JSON schema (JSON string) | |
| 73 | +| `--scrolls` | Infinite scroll count (0-100) | |
| 74 | +| `--pages` | Total pages to scrape (1-100) | |
| 75 | +| `--render-js` | Enable JS rendering (+1 credit) | |
| 76 | +| `--stealth` | Bypass bot detection (+4 credits) | |
| 77 | +| `--cookies` | Cookies as JSON object string | |
| 78 | +| `--headers` | Custom headers as JSON object string | |
| 79 | +| `--plain-text` | Return plain text instead of JSON | |
| 80 | + |
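Several options in the table above take a JSON object as a string (`--schema`, `--cookies`, `--headers`) or an integer within a range (`--scrolls`, `--pages`). The sketch below shows one way such values could be checked before a request is sent. Both helpers are hypothetical and are not the Zod schemas the project uses.

```typescript
// Hypothetical helper: parse a flag value that must be a JSON object.
function parseJsonObjectFlag(flag: string, raw: string): Record<string, unknown> {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    throw new Error(`${flag} must be valid JSON`);
  }
  // Arrays and null are valid JSON but not objects in the sense the flag needs.
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    throw new Error(`${flag} must be a JSON object`);
  }
  return value as Record<string, unknown>;
}

// Hypothetical helper: parse an integer flag and enforce an inclusive range.
function parseIntInRange(flag: string, raw: string, min: number, max: number): number {
  const n = raw.trim() === "" ? NaN : Number(raw);
  if (!Number.isInteger(n) || n < min || n > max) {
    throw new Error(`${flag} must be an integer between ${min} and ${max}`);
  }
  return n;
}
```

For example, `parseIntInRange("--pages", "5", 1, 100)` returns `5`, while `parseJsonObjectFlag("--headers", "[1]")` throws because an array is not an object.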
| 81 | +### `search-scraper` — Search the web and extract data |
| 82 | + |
| 83 | +```bash |
| 84 | +scrapegraphai search-scraper "What are the top Python web frameworks?" |
| 85 | + |
| 86 | +# Markdown only (cheaper) |
| 87 | +scrapegraphai search-scraper "Python frameworks" --no-extraction --num-results 5 |
| 88 | +``` |
| 89 | + |
| 90 | +| Option | Description | |
| 91 | +|---|---| |
| 92 | +| `--num-results` | Number of websites (3-20, default 3) | |
| 93 | +| `--no-extraction` | Markdown only (2 credits/site vs 10) | |
| 94 | +| `--schema` | Output JSON schema (JSON string) | |
| 95 | +| `--stealth` | Bypass bot detection (+4 credits) | |
| 96 | +| `--headers` | Custom headers as JSON object string | |
| 97 | + |
| 98 | +### `markdownify` — Convert a webpage to markdown |
15 | 99 |
|
16 | | -1. Clone the repository: |
17 | | - ```bash |
18 | | - git clone <repository-url> |
19 | | - cd scrapegraph-cli |
20 | | - ``` |
| 100 | +```bash |
| 101 | +scrapegraphai markdownify https://example.com/article |
| 102 | +scrapegraphai markdownify https://example.com --render-js --stealth |
| 103 | +``` |
| 104 | + |
| 105 | +| Option | Description | |
| 106 | +|---|---| |
| 107 | +| `--render-js` | Enable JS rendering (+1 credit) | |
| 108 | +| `--stealth` | Bypass bot detection (+4 credits) | |
| 109 | +| `--headers` | Custom headers as JSON object string | |
| 110 | + |
| 111 | +### `crawl` — Crawl and extract from multiple pages |
| 112 | + |
| 113 | +```bash |
| 114 | +scrapegraphai crawl https://example.com -p "Extract article titles" --max-pages 5 --depth 2 |
| 115 | + |
| 116 | +# Markdown only |
| 117 | +scrapegraphai crawl https://example.com --no-extraction --max-pages 10 |
| 118 | + |
| 119 | +# With crawl rules |
| 120 | +scrapegraphai crawl https://example.com -p "Extract data" \ |
| 121 | + --rules '{"include_paths":["/blog/*"],"same_domain":true}' |
| 122 | +``` |
21 | 123 |
|
22 | | -2. Install dependencies: |
23 | | - ```bash |
24 | | - npm install |
25 | | - ``` |
| 124 | +| Option | Description | |
| 125 | +|---|---| |
| 126 | +| `-p, --prompt` | Extraction prompt (required when extraction is on) | |
| 127 | +| `--no-extraction` | Markdown only (2 credits/page vs 10) | |
| 128 | +| `--max-pages` | Max pages to crawl (default 10) | |
| 129 | +| `--depth` | Crawl depth (default 1) | |
| 130 | +| `--schema` | Output JSON schema (JSON string) | |
| 131 | +| `--rules` | Crawl rules as JSON object string | |
| 132 | +| `--no-sitemap` | Disable sitemap-based discovery | |
| 133 | +| `--render-js` | Enable JS rendering (+1 credit/page) | |
| 134 | +| `--stealth` | Bypass bot detection (+4 credits) | |
26 | 135 |
|
27 | | -3. Install globally (from the project root): |
28 | | - ```bash |
29 | | - npm install -g . |
30 | | - ``` |
| 136 | +### `sitemap` — Get all URLs from a website's sitemap |
31 | 137 |
|
32 | | -## Usage |
| 138 | +```bash |
| 139 | +scrapegraphai sitemap https://example.com |
| 140 | +``` |
33 | 141 |
|
34 | | -After installation, you can use the CLI from anywhere in your terminal: |
| 142 | +### `scrape` — Get raw HTML content |
35 | 143 |
|
36 | 144 | ```bash |
37 | | -# Run the CLI |
38 | | -scrapegraphai |
| 145 | +scrapegraphai scrape https://example.com |
| 146 | +scrapegraphai scrape https://example.com --stealth --branding --country-code US |
| 147 | +``` |
39 | 148 |
|
40 | | -# Show help |
41 | | -scrapegraphai --help |
| 149 | +| Option | Description | |
| 150 | +|---|---| |
| 151 | +| `--render-js` | Enable JS rendering (+1 credit) | |
| 152 | +| `--stealth` | Bypass bot detection (+4 credits) | |
| 153 | +| `--branding` | Extract branding info (+2 credits) | |
| 154 | +| `--country-code` | ISO country code for geo-targeting | |
| 155 | + |
| 156 | +### `agentic-scraper` — Browser automation with AI |
| 157 | + |
| 158 | +```bash |
| 159 | +scrapegraphai agentic-scraper https://example.com/login \ |
| 160 | + -s "Fill email with user@test.com,Fill password with pass123,Click Sign In" \ |
| 161 | + --ai-extraction -p "Extract dashboard data" |
42 | 162 | ``` |
43 | 163 |
|
44 | | -### Command Options |
| 164 | +| Option | Description | |
| 165 | +|---|---| |
| 166 | +| `-s, --steps` | Comma-separated browser steps | |
| 167 | +| `-p, --prompt` | Extraction prompt (with `--ai-extraction`) | |
| 168 | +| `--schema` | Output JSON schema (JSON string) | |
| 169 | +| `--ai-extraction` | Enable AI extraction after steps | |
| 170 | +| `--use-session` | Persist browser session | |
45 | 171 |
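Since `--steps` is described as comma-separated, the value presumably has to be split into individual instructions at some point. A minimal sketch, assuming a plain split with trimming; `parseSteps` is a hypothetical name, not the CLI's function.

```typescript
// Hypothetical helper: split on commas, trim whitespace, drop empty entries.
function parseSteps(raw: string): string[] {
  return raw
    .split(",")
    .map((step) => step.trim())
    .filter((step) => step.length > 0);
}

const steps = parseSteps(
  "Fill email with user@test.com,Fill password with pass123,Click Sign In",
);
```

One consequence of this format is that a single step cannot itself contain a comma, at least under the plain-split assumption used here.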
|
46 | | -- `--help`: Show help message |
47 | | -- `--version`: Show version number |
| 172 | +### `generate-schema` — Generate JSON schema from a prompt |
| 173 | + |
| 174 | +```bash |
| 175 | +scrapegraphai generate-schema "Schema for an e-commerce product with name, price, and reviews" |
| 176 | +``` |
| 177 | + |
| 178 | +| Option | Description | |
| 179 | +|---|---| |
| 180 | +| `--existing-schema` | Existing schema to modify (JSON string) | |
| 181 | + |
| 182 | +### `credits` — Check credit balance |
| 183 | + |
| 184 | +```bash |
| 185 | +scrapegraphai credits |
| 186 | +``` |
| 187 | + |
| 188 | +### `validate` — Validate your API key |
| 189 | + |
| 190 | +```bash |
| 191 | +scrapegraphai validate |
| 192 | +``` |
48 | 193 |
|
49 | 194 | ## Project Structure |
50 | 195 |
|
51 | 196 | ``` |
52 | 197 | scrapegraph-cli/ |
53 | | -├── bin/ |
54 | | -│ └── index.js # CLI entry point |
55 | | -├── package.json # Project configuration |
56 | | -└── README.md # This file |
| 198 | +├── src/ |
| 199 | +│ ├── cli.ts # Entry point, citty main command + subcommands |
| 200 | +│ ├── lib/ |
| 201 | +│ │ ├── env.ts # Zod-parsed env config (API key, debug, timeout) |
| 202 | +│ │ ├── folders.ts # API key resolution + interactive prompt |
| 203 | +│ │ ├── scrapegraphai.ts # SDK layer — all API functions |
| 204 | +│ │ ├── schemas.ts # Zod validation schemas |
| 205 | +│ │ └── log.ts # Syntax-highlighted JSON output |
| 206 | +│ ├── types/ |
| 207 | +│ │ └── index.ts # Zod-derived types + ApiResult |
| 208 | +│ ├── commands/ |
| 209 | +│ │ ├── smart-scraper.ts |
| 210 | +│ │ ├── search-scraper.ts |
| 211 | +│ │ ├── markdownify.ts |
| 212 | +│ │ ├── crawl.ts |
| 213 | +│ │ ├── sitemap.ts |
| 214 | +│ │ ├── scrape.ts |
| 215 | +│ │ ├── agentic-scraper.ts |
| 216 | +│ │ ├── generate-schema.ts |
| 217 | +│ │ ├── credits.ts |
| 218 | +│ │ └── validate.ts |
| 219 | +│ └── utils/ |
| 220 | +│ └── banner.ts # ASCII banner display |
| 221 | +├── dist/ # Build output (git-ignored) |
| 222 | +│ └── cli.mjs # Bundled ESM with shebang |
| 223 | +├── package.json |
| 224 | +├── tsconfig.json |
| 225 | +├── tsup.config.ts |
| 226 | +├── biome.json |
| 227 | +└── .gitignore |
57 | 228 | ``` |
58 | 229 |
|
59 | | -## Development |
| 230 | +## Scripts |
60 | 231 |
|
61 | | -The CLI is built using: |
| 232 | +| Script | Command | Description | |
| 233 | +|---|---|---| |
| 234 | +| `dev` | `bun run src/cli.ts` | Run CLI from TS source | |
| 235 | +| `build` | `tsup` | Bundle ESM to `dist/cli.mjs` | |
| 236 | +| `lint` | `biome check .` | Lint + format check | |
| 237 | +| `format` | `biome format . --write` | Auto-format | |
| 238 | +| `test` | `bun test` | Run tests | |
| 239 | +| `check` | `tsc --noEmit && biome check .` | Type-check + lint | |
62 | 240 |
|
63 | | -- **[yargs](https://www.npmjs.com/package/yargs)** - Command-line argument parsing |
64 | | -- **[chalk](https://www.npmjs.com/package/chalk)** - Terminal string styling |
65 | | -- **[boxen](https://www.npmjs.com/package/boxen)** - Create boxes in terminal |
| 241 | +## Output |
66 | 242 |
|
67 | | -## Customization |
| 243 | +All commands output pretty-printed JSON to stdout (pipeable). Errors go to stderr via `@clack/prompts`. |
68 | 244 |
|
69 | | -To customize the CLI: |
| 245 | +```bash |
| 246 | +# Pipe output to jq |
| 247 | +scrapegraphai credits | jq '.remaining_credits' |
70 | 248 |
|
71 | | -1. **Change the command name**: Edit the `bin` field in `package.json` |
72 | | -2. **Add new options**: Modify the `yargs` configuration in `bin/index.js` |
73 | | -3. **Update functionality**: Extend the main CLI logic in `bin/index.js` |
| 249 | +# Save to file |
| 250 | +scrapegraphai smart-scraper https://example.com -p "Extract data" > result.json |
| 251 | +``` |
74 | 252 |
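Because the output is JSON on stdout, it can also be consumed from a script instead of `jq`. The sketch below parses captured output from the `credits` command. The only field it assumes is `remaining_credits`, taken from the `jq` example above; the rest of the response shape is unknown, and `readRemainingCredits` is a hypothetical helper.

```typescript
// Hypothetical helper: extract `remaining_credits` from captured stdout.
function readRemainingCredits(stdout: string): number {
  const parsed: unknown = JSON.parse(stdout);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("expected a JSON object on stdout");
  }
  const value = (parsed as Record<string, unknown>)["remaining_credits"];
  if (typeof value !== "number") {
    throw new Error("remaining_credits is missing or not a number");
  }
  return value;
}
```

Keeping errors on stderr, as the README states, is what makes this safe: stdout can be handed straight to `JSON.parse` without filtering out log lines.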
|
75 | 253 | ## License |
76 | 254 |
|
77 | 255 | ISC |
78 | | - |
79 | | -## Contributing |
80 | | - |
81 | | -Contributions are welcome! Please feel free to submit a Pull Request. |
|