---
layout: default
title: "Chapter 1: Getting Started"
parent: Crawl4AI Tutorial
nav_order: 1
---
Welcome to Crawl4AI — the open-source web crawler built specifically for feeding clean data into Large Language Models. In this chapter you will install the library, run your first crawl, and understand every field in the result object that comes back.
Traditional web scrapers return raw HTML that requires extensive post-processing before an LLM can use it. Crawl4AI takes a fundamentally different approach:
```mermaid
flowchart LR
    A[URL] --> B[Crawl4AI]
    B --> C[Browser renders page]
    C --> D[Extracts main content]
    D --> E[Generates clean Markdown]
    E --> F[Ready for LLM / RAG]

    classDef process fill:#e1f5fe,stroke:#01579b
    classDef output fill:#e8f5e8,stroke:#1b5e20
    class B,C,D process
    class E,F output
```
Key advantages over generic scrapers:
- Real browser rendering — JavaScript-heavy sites work out of the box
- Automatic boilerplate removal — strips navigation, ads, footers
- Markdown-first output — headings, lists, links preserved with structure
- Async-native — built on `asyncio` for high-throughput crawling
- Zero configuration — sensible defaults get you started in three lines
```bash
# Install Crawl4AI from PyPI
pip install crawl4ai

# After install, set up the browser engine (downloads Chromium)
crawl4ai-setup
```

The `crawl4ai-setup` command downloads a Chromium binary via Playwright. This is a one-time step (~150 MB download).
```bash
# Install with LLM integration, PDF support, and all optional deps
pip install "crawl4ai[all]"

# Run setup
crawl4ai-setup
```

To verify the installation:

```python
import crawl4ai
print(crawl4ai.__version__)
```

Crawl4AI is also available as a Docker image:

```bash
# Pull the official image
docker pull unclecode/crawl4ai

# Run with default settings
docker run -p 11235:11235 unclecode/crawl4ai
```

See Chapter 8: Production Deployment for full Docker configuration.
Crawl4AI uses an async context manager pattern. Here is the simplest possible crawl:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")

        # Check if the crawl succeeded
        if result.success:
            print(result.markdown[:500])
        else:
            print(f"Crawl failed: {result.error_message}")

asyncio.run(main())
```

What happens under the hood:
- `AsyncWebCrawler()` launches a headless Chromium browser
- `arun()` navigates to the URL and waits for the page to load
- The engine extracts the main content area
- Content is converted to clean markdown
- The browser stays alive for the next crawl (connection reuse)
- Exiting the context manager closes the browser
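The lifecycle above is just Python's async context-manager protocol. As a minimal sketch, here is a toy stand-in (not Crawl4AI's actual implementation) showing how a crawler can open a browser once on entry, reuse it across calls, and close it on exit:

```python
import asyncio

class ToyCrawler:
    """Toy stand-in illustrating the warm-browser lifecycle pattern."""

    def __init__(self):
        self.browser_open = False
        self.pages_crawled = 0

    async def __aenter__(self):
        # Entering the context launches the (fake) browser once
        self.browser_open = True
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # Leaving the context closes it, even if an error was raised
        self.browser_open = False

    async def arun(self, url: str) -> str:
        # Each call reuses the already-open browser
        assert self.browser_open, "use inside 'async with'"
        self.pages_crawled += 1
        return f"# Markdown for {url}"

async def demo():
    async with ToyCrawler() as crawler:
        await crawler.arun("https://example.com/a")
        await crawler.arun("https://example.com/b")
        return crawler.pages_crawled, crawler.browser_open

print(asyncio.run(demo()))  # (2, True): two crawls, one still-open browser
```

The real crawler follows the same shape: setup cost is paid once per context, not once per page.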
Every call to `arun()` returns a `CrawlResult` with these key fields:
```python
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com")

    # --- Status ---
    print(result.success)        # bool: did the crawl succeed?
    print(result.status_code)    # int: HTTP status code (200, 404, etc.)
    print(result.error_message)  # str: error details if success is False

    # --- Content ---
    print(result.markdown)       # str: clean markdown of main content
    print(result.html)           # str: raw HTML of the full page
    print(result.cleaned_html)   # str: HTML with boilerplate removed
    print(result.text)           # str: plain text, no formatting

    # --- Metadata ---
    print(result.url)            # str: final URL (after redirects)
    print(result.title)          # str: page <title>
    print(result.links)          # dict: internal and external links found
    print(result.media)          # dict: images, videos, audio found

    # --- Extracted Data ---
    print(result.extracted_content)  # str: output from extraction strategy
```

The `links` and `media` fields are categorized dictionaries:

```python
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com")

    # Links are categorized
    for link in result.links.get("internal", []):
        print(f"Internal: {link['href']} - {link['text']}")
    for link in result.links.get("external", []):
        print(f"External: {link['href']} - {link['text']}")

    # Media assets are also extracted
    for img in result.media.get("images", []):
        print(f"Image: {img['src']} alt='{img.get('alt', '')}'")
```

You can reuse the same crawler instance for multiple URLs. The browser stays warm between calls, making subsequent crawls faster:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_multiple():
    urls = [
        "https://docs.python.org/3/tutorial/index.html",
        "https://docs.python.org/3/tutorial/introduction.html",
        "https://docs.python.org/3/tutorial/controlflow.html",
    ]
    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url=url)
            if result.success:
                print(f"[OK] {result.title} — {len(result.markdown)} chars")
            else:
                print(f"[FAIL] {url}: {result.error_message}")

asyncio.run(crawl_multiple())
```

For true parallel crawling (running many pages concurrently), see Chapter 7: Async & Parallel Crawling.
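A common pattern when crawling several pages is to feed each result's internal links back into the queue. As a sketch, here is a small helper that deduplicates internal links; the `links` dictionary below is a hand-built stand-in shaped like the `result.links` field shown earlier, not real crawler output:

```python
def next_urls(links: dict, seen: set) -> list:
    """Collect unseen internal link targets from a result.links-shaped dict."""
    out = []
    for link in links.get("internal", []):
        href = link["href"]
        if href not in seen:
            seen.add(href)     # remember it so we never queue it twice
            out.append(href)
    return out

# Stand-in for result.links from a crawled page
links = {
    "internal": [
        {"href": "https://docs.python.org/3/tutorial/index.html", "text": "Tutorial"},
        {"href": "https://docs.python.org/3/tutorial/index.html", "text": "Tutorial (again)"},
        {"href": "https://docs.python.org/3/tutorial/classes.html", "text": "Classes"},
    ],
    "external": [{"href": "https://example.org", "text": "Elsewhere"}],
}

seen = set()
print(next_urls(links, seen))  # two unique internal URLs; external links ignored
print(next_urls(links, seen))  # [] on the second pass, everything already seen
```

The `seen` set persists across calls, so pages linking back to each other do not cause re-crawls.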
While the defaults work for simple cases, you can tune behavior with `CrawlerRunConfig`:
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

config = CrawlerRunConfig(
    # Content control
    word_count_threshold=10,       # skip blocks with fewer words
    exclude_external_links=True,   # strip external links from markdown
    remove_overlay_elements=True,  # remove popups and modals
    # Performance
    page_timeout=30000,            # max ms to wait for page load
    verbose=True,                  # enable detailed logging
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com", config=config)
    print(result.markdown[:500])
```

We will explore browser-level configuration in Chapter 2 and extraction strategies in Chapter 3.
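To build intuition for what `word_count_threshold` does, here is a rough, hypothetical re-implementation of the idea on plain strings (not Crawl4AI's actual filter): short blocks such as navigation labels are dropped, while substantive paragraphs survive.

```python
def filter_blocks(blocks, word_count_threshold=10):
    """Keep only text blocks with at least `word_count_threshold` words."""
    return [b for b in blocks if len(b.split()) >= word_count_threshold]

blocks = [
    "Home",       # 1 word: typical nav-link noise
    "Sign in",    # 2 words: button text
    "Crawl4AI converts rendered pages into clean markdown so that "
    "large language models receive structure instead of raw HTML soup.",
]
print(filter_blocks(blocks))  # only the long paragraph survives
```

Raising the threshold trades recall for cleanliness: a higher value removes more boilerplate but risks dropping short, legitimate content such as captions.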
Always check `result.success` before using the content:
```python
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://nonexistent.example.com")

    if not result.success:
        print(f"Status: {result.status_code}")
        print(f"Error: {result.error_message}")
        # Decide: retry, skip, or raise
    else:
        # Safe to use result.markdown, result.html, etc.
        process_content(result.markdown)
```

Common failure modes:
| Scenario | `status_code` | `error_message` |
|---|---|---|
| DNS failure | None | Connection error details |
| HTTP 404 | 404 | Page not found |
| Timeout | None | Navigation timeout exceeded |
| JS error on page | 200 | Success (page still renders) |
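The table above suggests a simple retry policy: failures with no status code (DNS errors, timeouts) are often transient, while a 404 is permanent. A minimal sketch of that decision logic follows; `StubResult` is a hand-rolled stand-in mirroring the `CrawlResult` fields used above, and the thresholds are illustrative policy, not library behavior.

```python
from dataclasses import dataclass

@dataclass
class StubResult:
    # Mirrors the CrawlResult fields used for error handling
    success: bool
    status_code: object = None   # int or None
    error_message: str = ""

def decide(result) -> str:
    """Illustrative policy: retry transient failures, skip permanent ones."""
    if result.success:
        return "process"
    if result.status_code is None:
        return "retry"           # DNS failure or timeout: often transient
    if result.status_code >= 500:
        return "retry"           # server errors may clear up
    return "skip"                # 4xx: the page is gone; retrying won't help

print(decide(StubResult(True, 200)))                          # process
print(decide(StubResult(False, 404, "Page not found")))       # skip
print(decide(StubResult(False, None, "Navigation timeout")))  # retry
```

A production loop would wrap "retry" with a capped attempt count and backoff, but the classification step stays this simple.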
```python
# Minimal crawl
from crawl4ai import AsyncWebCrawler
import asyncio

async def quick():
    async with AsyncWebCrawler() as crawler:
        r = await crawler.arun(url="https://example.com")
        return r.markdown if r.success else r.error_message

print(asyncio.run(quick()))
```

You now know how to install Crawl4AI, run a basic crawl, and interpret every field in the result object. The library handles browser management, JavaScript execution, and content extraction behind a simple async API.
Key takeaways:
- Crawl4AI is async-first — use `async with` and `await`
- The `CrawlResult` object gives you markdown, HTML, text, links, and media
- Browser instances are reused across crawls within a context manager
- Always check `result.success` before processing content
Next up: Chapter 2: Browser Engine & Crawling — learn how to configure the browser, execute JavaScript, handle authentication, and interact with dynamic pages.
Back to Tutorial Home | Next: Chapter 2: Browser Engine & Crawling