
Commit f0ce7b2

feat: add v0.7.3 release notes, changelog updates, and documentation for new features
1 parent 21f79fe commit f0ce7b2

3 files changed

Lines changed: 341 additions & 7 deletions


CHANGELOG.md

Lines changed: 70 additions & 0 deletions
@@ -5,6 +5,76 @@ All notable changes to Crawl4AI will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [0.7.3] - 2025-08-09
+
+### Added
+- **🕵️ Undetected Browser Support**: New browser adapter pattern with stealth capabilities
+  - `browser_adapter.py` with undetected Chrome integration
+  - Bypasses sophisticated bot detection systems (Cloudflare, Akamai, custom solutions)
+  - Support for headless stealth mode with anti-detection techniques
+  - Human-like behavior simulation with random mouse movements and scrolling
+  - Comprehensive examples for anti-bot strategies and stealth crawling
+  - Full documentation guide for undetected browser usage
+
+- **🎨 Multi-URL Configuration System**: URL-specific crawler configurations for batch processing
+  - Different crawling strategies for different URL patterns in a single batch
+  - Support for string patterns with wildcards (`"*.pdf"`, `"*/blog/*"`)
+  - Lambda function matchers for complex URL logic
+  - Mixed matchers combining strings and functions with AND/OR logic
+  - Fallback configuration support when no patterns match
+  - First-match-wins configuration selection with optional fallback
+
+- **🧠 Memory Monitoring & Optimization**: Comprehensive memory usage tracking
+  - New `memory_utils.py` module for memory monitoring and optimization
+  - Real-time memory usage tracking during crawl sessions
+  - Memory leak detection and reporting
+  - Performance optimization recommendations
+  - Peak memory usage analysis and efficiency metrics
+  - Automatic cleanup suggestions for memory-intensive operations
+
+- **📊 Enhanced Table Extraction**: Improved table access and DataFrame conversion
+  - Direct `result.tables` interface replacing the generic `result.media` approach
+  - Instant pandas DataFrame conversion with `pd.DataFrame(table['data'])`
+  - Enhanced table detection algorithms for better accuracy
+  - Table metadata including source XPath and headers
+  - Improved table structure preservation during extraction
+
+- **💰 GitHub Sponsors Integration**: 4-tier sponsorship system
+  - Supporter ($5/month): Community support + early feature previews
+  - Professional ($25/month): Priority support + beta access
+  - Business ($100/month): Direct consultation + custom integrations
+  - Enterprise ($500/month): Dedicated support + feature development
+  - Custom arrangement options for larger organizations
+
+- **🐳 Docker LLM Provider Flexibility**: Environment-based LLM configuration
+  - `LLM_PROVIDER` environment variable support for dynamic provider switching
+  - `.llm.env` file support for secure configuration management
+  - Per-request provider override capabilities in API endpoints
+  - Support for OpenAI, Groq, and other providers without rebuilding images
+  - Enhanced Docker documentation with deployment examples
+
+### Fixed
+- **URL Matcher Fallback**: Resolved edge cases in URL pattern matching logic
+- **Memory Management**: Fixed memory leaks in long-running crawl sessions
+- **Sitemap Processing**: Improved redirect handling in sitemap fetching
+- **Table Extraction**: Enhanced table detection and extraction accuracy
+- **Error Handling**: Better error messages and recovery from network failures
+
+### Changed
+- **Architecture Refactoring**: Major cleanup and optimization
+  - Moved 2,450+ lines from the main `async_crawler_strategy.py` to a backup file
+  - Cleaner separation of concerns in the crawler architecture
+  - Better maintainability and code organization
+  - Preserved backward compatibility while improving performance
+
+### Documentation
+- **Comprehensive Examples**: Added real-world URLs and practical use cases
+- **API Documentation**: Complete CrawlResult field documentation with all available fields
+- **Migration Guides**: Updated table extraction patterns from `result.media` to `result.tables`
+- **Undetected Browser Guide**: Full documentation for stealth mode and anti-bot strategies
+- **Multi-Config Examples**: Detailed examples for URL-specific configurations
+- **Docker Deployment**: Enhanced Docker documentation with LLM provider configuration
+
 ## [0.7.x] - 2025-06-29

 ### Added

README.md

Lines changed: 87 additions & 3 deletions
@@ -27,9 +27,9 @@

 Crawl4AI turns the web into clean, LLM-ready Markdown for RAG, agents, and data pipelines. Fast, controllable, and battle-tested by a 50k+ star community.

-[✨ Check out latest update v0.7.0](#-recent-updates)
+[✨ Check out the latest update v0.7.3](#-recent-updates)

-✨ New in v0.7.0, Adaptive Crawling, Virtual Scroll, Link Preview scoring, Async URL Seeder, big performance gains. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.0.md)
+✨ New in v0.7.3: Undetected Browser Support, Multi-URL Configurations, Memory Monitoring, Enhanced Table Extraction, GitHub Sponsors. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.3.md)

 <details>
 <summary>🤓 <strong>My Personal Story</strong></summary>
@@ -542,7 +542,89 @@ async def test_news_crawl():

 ## ✨ Recent Updates

-### Version 0.7.0 Release Highlights - The Adaptive Intelligence Update
+<details>
+<summary><strong>Version 0.7.3 Release Highlights - The Multi-Config Intelligence Update</strong></summary>
+
+- **🕵️ Undetected Browser Support**: Bypass sophisticated bot detection systems:
+```python
+from crawl4ai import AsyncWebCrawler, BrowserConfig
+
+browser_config = BrowserConfig(
+    browser_type="undetected",  # Use undetected Chrome
+    headless=True,              # Can run headless with stealth
+    extra_args=[
+        "--disable-blink-features=AutomationControlled",
+        "--disable-web-security"
+    ]
+)
+
+async with AsyncWebCrawler(config=browser_config) as crawler:
+    result = await crawler.arun("https://protected-site.com")
+    # Successfully bypasses Cloudflare, Akamai, and custom bot detection
+```
+
+- **🎨 Multi-URL Configuration**: Different strategies for different URL patterns in one batch:
+```python
+from crawl4ai import CrawlerRunConfig, MatchMode
+
+configs = [
+    # Documentation sites - aggressive caching
+    CrawlerRunConfig(
+        url_matcher=["*docs*", "*documentation*"],
+        cache_mode="write",
+        markdown_generator_options={"include_links": True}
+    ),
+
+    # News/blog sites - fresh content
+    CrawlerRunConfig(
+        url_matcher=lambda url: 'blog' in url or 'news' in url,
+        cache_mode="bypass"
+    ),
+
+    # Fallback for everything else
+    CrawlerRunConfig()
+]
+
+results = await crawler.arun_many(urls, config=configs)
+# Each URL gets the perfect configuration automatically
+```
+
+- **🧠 Memory Monitoring**: Track and optimize memory usage during crawling:
+```python
+from crawl4ai.memory_utils import MemoryMonitor
+
+monitor = MemoryMonitor()
+monitor.start_monitoring()
+
+results = await crawler.arun_many(large_url_list)
+
+report = monitor.get_report()
+print(f"Peak memory: {report['peak_mb']:.1f} MB")
+print(f"Efficiency: {report['efficiency']:.1f}%")
+# Get optimization recommendations
+```
+
+- **📊 Enhanced Table Extraction**: Direct DataFrame conversion from web tables:
+```python
+result = await crawler.arun("https://site-with-tables.com")
+
+# New way - direct table access
+if result.tables:
+    import pandas as pd
+    for table in result.tables:
+        df = pd.DataFrame(table['data'])
+        print(f"Table: {df.shape[0]} rows × {df.shape[1]} columns")
+```
+
+- **💰 GitHub Sponsors**: 4-tier sponsorship system for project sustainability
+- **🐳 Docker LLM Flexibility**: Configure providers via environment variables
+
+[Full v0.7.3 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.3.md)
+
+</details>
+
+<details>
+<summary><strong>Version 0.7.0 Release Highlights - The Adaptive Intelligence Update</strong></summary>

 - **🧠 Adaptive Crawling**: Your crawler now learns and adapts to website patterns automatically:
 ```python

@@ -607,6 +689,8 @@ async def test_news_crawl():

 Read the full details in our [0.7.0 Release Notes](https://docs.crawl4ai.com/blog/release-v0.7.0) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).

+</details>
+
 ## Version Numbering in Crawl4AI

 Crawl4AI follows standard Python version numbering conventions (PEP 440) to help users understand the stability and features of each release.
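As a quick illustration of PEP 440 forms (example strings only, not a list of actual Crawl4AI releases):

```python
# Illustrative PEP 440 version strings, ordered from least to most stable
versions = [
    "0.8.0.dev1",  # development release
    "0.8.0a1",     # alpha pre-release
    "0.8.0b1",     # beta pre-release
    "0.8.0rc1",    # release candidate
    "0.8.0",       # final release
]
```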

docs/blog/release-v0.7.3.md

Lines changed: 184 additions & 4 deletions
@@ -8,10 +8,14 @@ Today I'm releasing Crawl4AI v0.7.3—the Multi-Config Intelligence Update. This

 ## 🎯 What's New at a Glance

-- **Multi-URL Configurations**: Different crawling strategies for different URL patterns in a single batch
-- **Flexible Docker LLM Providers**: Configure LLM providers via environment variables
-- **Bug Fixes**: Resolved several critical issues for better stability
-- **Documentation Updates**: Clearer examples and improved API documentation
+- **🕵️ Undetected Browser Support**: Stealth mode for bypassing bot detection systems
+- **🎨 Multi-URL Configurations**: Different crawling strategies for different URL patterns in a single batch
+- **🐳 Flexible Docker LLM Providers**: Configure LLM providers via environment variables
+- **🧠 Memory Monitoring**: Enhanced memory usage tracking and optimization tools
+- **📊 Enhanced Table Extraction**: Improved table access and DataFrame conversion
+- **💰 GitHub Sponsors**: 4-tier sponsorship system with custom arrangements
+- **🔧 Bug Fixes**: Resolved several critical issues for better stability
+- **📚 Documentation Updates**: Clearer examples and improved API documentation

 ## 🎨 Multi-URL Configurations: One Size Doesn't Fit All

@@ -78,6 +82,182 @@ async with AsyncWebCrawler() as crawler:
 - **Reduced Complexity**: No more if/else forests in your extraction code
 - **Better Performance**: Each URL gets exactly the processing it needs
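The changelog for this release also lists mixed matchers with AND/OR logic. A minimal sketch of that combination, assuming `MatchMode` exposes `AND`/`OR` and that `CrawlerRunConfig` accepts it via a `match_mode` parameter (the parameter name is inferred from the feature list, not verified against the shipped API):

```python
from crawl4ai import CrawlerRunConfig, MatchMode

# Combine a wildcard string and a lambda so BOTH must match:
# only PDF URLs that also live under /reports/ get this config.
pdf_reports_config = CrawlerRunConfig(
    url_matcher=["*.pdf", lambda url: "/reports/" in url],
    match_mode=MatchMode.AND,  # MatchMode.OR would accept either condition
)
```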
+## 🕵️ Undetected Browser Support: Stealth Mode Activated
+
+**The Problem:** Modern websites employ sophisticated bot detection systems. Cloudflare, Akamai, and custom solutions block automated crawlers, limiting access to valuable content.
+
+**My Solution:** I implemented undetected browser support with a flexible adapter pattern. Now Crawl4AI can bypass most bot detection systems using stealth techniques.
+
+### Technical Implementation
+
+```python
+from crawl4ai import AsyncWebCrawler, BrowserConfig
+
+# Enable undetected mode for stealth crawling
+browser_config = BrowserConfig(
+    browser_type="undetected",  # Use undetected Chrome
+    headless=True,              # Can run headless with stealth
+    extra_args=[
+        "--disable-blink-features=AutomationControlled",
+        "--disable-web-security",
+        "--disable-features=VizDisplayCompositor"
+    ]
+)
+
+async with AsyncWebCrawler(config=browser_config) as crawler:
+    # This will bypass most bot detection systems
+    result = await crawler.arun("https://protected-site.com")
+
+    if result.success:
+        print("✅ Successfully bypassed bot detection!")
+        print(f"Content length: {len(result.markdown)}")
+```
+
+**Advanced Anti-Bot Strategies:**
+
+```python
+# Combine multiple stealth techniques (reuses `crawler` from above)
+from crawl4ai import CrawlerRunConfig
+
+config = CrawlerRunConfig(
+    # Random user agents and headers
+    headers={
+        "Accept-Language": "en-US,en;q=0.9",
+        "Accept-Encoding": "gzip, deflate, br",
+        "DNT": "1"
+    },
+
+    # Human-like behavior simulation
+    js_code="""
+    // Random mouse movements
+    const simulateHuman = () => {
+        const event = new MouseEvent('mousemove', {
+            clientX: Math.random() * window.innerWidth,
+            clientY: Math.random() * window.innerHeight
+        });
+        document.dispatchEvent(event);
+    };
+    setInterval(simulateHuman, 100 + Math.random() * 200);
+
+    // Random scrolling
+    const randomScroll = () => {
+        const scrollY = Math.random() * (document.body.scrollHeight - window.innerHeight);
+        window.scrollTo(0, scrollY);
+    };
+    setTimeout(randomScroll, 500 + Math.random() * 1000);
+    """,
+
+    # Delay to appear more human
+    delay_before_return_html=2.0
+)
+
+result = await crawler.arun("https://bot-protected-site.com", config=config)
+```
+
+**Expected Real-World Impact:**
+- **Enterprise Scraping**: Access previously blocked corporate sites and databases
+- **Market Research**: Gather data from protected competitor sites
+- **Price Monitoring**: Track e-commerce sites that block automated access
+- **Content Aggregation**: Collect news and social media despite anti-bot measures
+- **Compliance Testing**: Verify your own site's bot protection effectiveness
+
+## 🧠 Memory Monitoring & Optimization
+
+**The Problem:** Long-running crawl sessions consume excessive memory, especially when processing large batches or heavy JavaScript sites.
+
+**My Solution:** I built comprehensive memory monitoring and optimization utilities that track usage patterns and provide actionable insights.
+
+### Memory Tracking Implementation
+
+```python
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.memory_utils import MemoryMonitor, get_memory_info
+
+# Monitor memory during crawling
+monitor = MemoryMonitor()
+
+async with AsyncWebCrawler() as crawler:
+    # Start monitoring
+    monitor.start_monitoring()
+
+    # Perform memory-intensive operations
+    results = await crawler.arun_many([
+        "https://heavy-js-site.com",
+        "https://large-images-site.com",
+        "https://dynamic-content-site.com"
+    ])
+
+    # Get detailed memory report
+    memory_report = monitor.get_report()
+    print(f"Peak memory usage: {memory_report['peak_mb']:.1f} MB")
+    print(f"Memory efficiency: {memory_report['efficiency']:.1f}%")
+
+    # Automatic cleanup suggestions
+    if memory_report['peak_mb'] > 1000:  # > 1 GB
+        print("💡 Consider batch size optimization")
+        print("💡 Enable aggressive garbage collection")
+```
+
+**Expected Real-World Impact:**
+- **Production Stability**: Prevent memory-related crashes in long-running services
+- **Cost Optimization**: Right-size server resources based on actual usage
+- **Performance Tuning**: Identify memory bottlenecks and optimization opportunities
+- **Scalability Planning**: Understand memory patterns for horizontal scaling
+
+## 📊 Enhanced Table Extraction
+
+**The Problem:** Table data was accessed through the generic `result.media` interface, making DataFrame conversion cumbersome and unclear.
+
+**My Solution:** A dedicated `result.tables` interface with direct DataFrame conversion and improved detection algorithms.
+
+### New Table Access Pattern
+
+```python
+# Old way (deprecated)
+# tables_data = result.media.get('tables', [])
+
+# New way (v0.7.3+)
+result = await crawler.arun("https://site-with-tables.com")
+
+# Direct table access
+if result.tables:
+    print(f"Found {len(result.tables)} tables")
+
+    # Convert to pandas DataFrames instantly
+    import pandas as pd
+
+    for i, table in enumerate(result.tables):
+        df = pd.DataFrame(table['data'])
+        print(f"Table {i}: {df.shape[0]} rows × {df.shape[1]} columns")
+        print(df.head())
+
+        # Table metadata
+        print(f"Source: {table.get('source_xpath', 'Unknown')}")
+        print(f"Headers: {table.get('headers', [])}")
+```
+
+**Expected Real-World Impact:**
+- **Data Analysis**: Faster transition from web data to analysis-ready DataFrames
+- **ETL Pipelines**: Cleaner integration with data processing workflows
+- **Reporting**: Simplified table extraction for automated reporting systems
+
+## 💰 Community Support: GitHub Sponsors
+
+I've launched GitHub Sponsors to ensure Crawl4AI's continued development and to support our growing community.
+
+**Sponsorship Tiers:**
+- **🌱 Supporter ($5/month)**: Community support + early feature previews
+- **🚀 Professional ($25/month)**: Priority support + beta access
+- **🏢 Business ($100/month)**: Direct consultation + custom integrations
+- **🏛️ Enterprise ($500/month)**: Dedicated support + feature development
+
+**Why Sponsor?**
+- Ensure continuous development and maintenance
+- Get priority support and feature requests
+- Access premium documentation and examples
+- Get a direct line to the development team
+
+[**Become a Sponsor →**](https://github.com/sponsors/unclecode)
+
 ## 🐳 Docker: Flexible LLM Provider Configuration

 **The Problem:** Hardcoded LLM providers in Docker deployments. Want to switch from OpenAI to Groq? Rebuild and redeploy. Testing different models? Multiple Docker images.
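The changelog grounds the fix: an `LLM_PROVIDER` environment variable, an optional `.llm.env` file, and per-request provider overrides. A minimal client-side sketch of the same pattern, assuming litellm-style provider strings and the `LLMConfig` helper (the `.llm.env` contents below are hypothetical):

```python
import os

from crawl4ai import LLMConfig

# .llm.env (passed to the container via `docker run --env-file .llm.env ...`)
# might contain, for example:
#   LLM_PROVIDER=groq/llama-3.3-70b-versatile
#   GROQ_API_KEY=...
#   OPENAI_API_KEY=...

# Pick the provider at runtime instead of baking it into the image.
provider = os.environ.get("LLM_PROVIDER", "openai/gpt-4o-mini")

# Resolve the matching API key from the environment as well.
key_var = "GROQ_API_KEY" if provider.startswith("groq/") else "OPENAI_API_KEY"
llm_config = LLMConfig(provider=provider, api_token=os.environ.get(key_var))
```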
