Created: 2025-10-10
Status: ✅ Integrated into Research Swarm
Purpose: Document LangChain web scraping and parsing capabilities
Our Web Research Swarm uses LangChain's powerful document loaders and transformers for robust web scraping and content extraction.
```python
from langchain_community.document_loaders import AsyncHtmlLoader

# Load multiple URLs asynchronously
loader = AsyncHtmlLoader(["https://example.com/page1", "https://example.com/page2"])
docs = loader.load()
```

Benefits:
- ✅ Async loading for multiple URLs
- ✅ Efficient batch processing
- ✅ Built-in error handling
- ✅ Returns structured Document objects (inspected in the snippet after this list)
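Each page comes back as a LangChain `Document`; a quick way to inspect what the loader returned (the URLs above are placeholders, and `AsyncHtmlLoader` leaves `page_content` as raw HTML until a transformer runs):

```python
# Each Document pairs the fetched page with metadata such as the source URL
for doc in docs:
    print(doc.metadata["source"])    # URL the page was fetched from
    print(doc.page_content[:200])    # first 200 characters of raw HTML
```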
```python
from langchain_community.document_loaders import WebBaseLoader

# Simple single URL loading
loader = WebBaseLoader("https://example.com")
docs = loader.load()
```

Benefits:
- ✅ Simple API for single URLs
- ✅ BeautifulSoup integration
- ✅ CSS selector support (see the sketch after this list)
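The BeautifulSoup integration means keyword arguments pass straight through to the parser; a sketch that restricts parsing to a single tag via `SoupStrainer` (the URL and tag choice are illustrative):

```python
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Parse only <article> elements; everything else is skipped at parse time
loader = WebBaseLoader(
    web_paths=("https://example.com/post",),
    bs_kwargs=dict(parse_only=bs4.SoupStrainer("article")),
)
docs = loader.load()
```

Narrowing the parse this way also keeps memory use down on large pages.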
```python
from langchain_community.document_transformers import Html2TextTransformer

# Transform HTML to clean markdown/text
html2text = Html2TextTransformer()
cleaned_docs = html2text.transform_documents(html_docs)
```

Benefits:
- ✅ Converts HTML to clean text
- ✅ Preserves structure (headers, lists)
- ✅ Removes scripts, styles, navigation
- ✅ Markdown output option
```python
from langchain_community.document_transformers import BeautifulSoupTransformer

# Advanced HTML parsing with tag filtering
bs_transformer = BeautifulSoupTransformer()
transformed = bs_transformer.transform_documents(
    docs,
    tags_to_extract=["article", "main", "div"],  # tag names, not CSS selectors
)
```

Benefits:
- ✅ Tag-based content extraction
- ✅ Flexible filtering via `tags_to_extract` and `unwanted_tags`
- ✅ Preserves semantic structure
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Smart text chunking
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""],
)
chunks = splitter.split_documents(docs)
```

Benefits:
- ✅ Semantic-aware splitting
- ✅ Configurable chunk size
- ✅ Overlap for context preservation
How these pieces are wired together in `content_parser_agent.py`:

```python
async def _extract_raw_content(self, result: Dict) -> str:
    """Extract content using LangChain loaders."""
    url = result.get("url", "")  # assumes the search result dict carries a "url" key
    if LANGCHAIN_PARSING_AVAILABLE:
        # Use LangChain's async HTML loader
        loader = AsyncHtmlLoader([url])
        docs = loader.load()

        # Transform HTML to clean text
        html2text = Html2TextTransformer()
        transformed_docs = html2text.transform_documents(docs)

        # Extract cleaned content
        content = transformed_docs[0].page_content
        return self._clean_content(content)
```

Fallback Hierarchy:
1. ✅ LangChain (AsyncHtmlLoader + Html2TextTransformer)
2. ⚠️ BeautifulSoup (direct HTML parsing)
3. 🔄 Simulated (for testing without network)
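A minimal sketch of how that hierarchy can be wired; the availability flags and the simulated fallback string are placeholders for the agent's actual plumbing:

```python
import logging

from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.document_transformers import Html2TextTransformer

logger = logging.getLogger(__name__)

LANGCHAIN_PARSING_AVAILABLE = True  # set at import time in the real agent
BS4_AVAILABLE = True                # likewise a placeholder flag

def extract_content(url: str) -> str:
    """Try LangChain first, fall back to BeautifulSoup, then to simulated content."""
    if LANGCHAIN_PARSING_AVAILABLE:
        try:
            docs = AsyncHtmlLoader([url]).load()
            return Html2TextTransformer().transform_documents(docs)[0].page_content
        except Exception as e:
            logger.warning(f"LangChain extraction failed: {e}")
    if BS4_AVAILABLE:
        try:
            import requests
            from bs4 import BeautifulSoup

            html = requests.get(url, timeout=10).text
            return BeautifulSoup(html, "html.parser").get_text(separator="\n")
        except Exception as e:
            logger.warning(f"BeautifulSoup extraction failed: {e}")
    return f"[simulated content for {url}]"  # offline/testing fallback
```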
```python
from langchain_community.document_loaders import AsyncChromiumLoader

# Load JavaScript-rendered pages
loader = AsyncChromiumLoader(["https://spa-website.com"])
docs = loader.load()
```

Setup:
```bash
pip install playwright
playwright install
```

```python
from langchain_community.document_loaders import SitemapLoader

# Scrape entire sitemap
loader = SitemapLoader("https://example.com/sitemap.xml")
docs = loader.load()
```
```python
from bs4 import BeautifulSoup
from langchain_community.document_loaders import RecursiveUrlLoader

# Recursively crawl website
loader = RecursiveUrlLoader(
    "https://example.com",
    max_depth=2,
    extractor=lambda html: BeautifulSoup(html, "html.parser").get_text(),
)
docs = loader.load()
```

| Component | Status | File | Notes |
|---|---|---|---|
| AsyncHtmlLoader | ✅ Integrated | content_parser_agent.py | Async URL loading |
| Html2TextTransformer | ✅ Integrated | content_parser_agent.py | HTML to clean text |
| BeautifulSoupTransformer | 📋 Available | - | For advanced parsing |
| RecursiveCharacterTextSplitter | 📋 Available | - | For chunking long content |
| AsyncChromiumLoader | 📋 Available | - | For JS-heavy sites |
| SitemapLoader | 📋 Available | - | For sitemap scraping |
| RecursiveUrlLoader | 📋 Available | - | For site crawling |
Installation:

```bash
# Core LangChain web scraping
pip install langchain-community

# HTML parsing
pip install beautifulsoup4 html2text lxml

# JavaScript rendering (optional)
pip install playwright
playwright install chromium

# Advanced scraping (optional)
pip install selenium
```

Environment Variables:

```bash
# Optional: API keys for commercial scrapers
OXYLABS_API_KEY=your_key_here
BRIGHTDATA_API_KEY=your_key_here

# User agent for polite scraping
USER_AGENT="Mozilla/5.0 (Research Bot)"
```

Best Practices:
- ✅ Check `robots.txt` before scraping (see the sketch after this list)
- ✅ Implement rate limiting (1-2 seconds between requests)
- ✅ Use meaningful User-Agent strings
- ✅ Cache results to avoid repeated requests
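A sketch of the first two habits combined, using the standard library's `urllib.robotparser` (the URL list and user agent string are illustrative):

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

from langchain_community.document_loaders import AsyncHtmlLoader

USER_AGENT = "Mozilla/5.0 (Research Bot)"

def can_fetch(url: str) -> bool:
    """Consult the site's robots.txt before fetching a URL."""
    parsed = urlparse(url)
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    robots.read()
    return robots.can_fetch(USER_AGENT, url)

pages = []
for url in ["https://example.com/page1", "https://example.com/page2"]:
    if can_fetch(url):
        pages += AsyncHtmlLoader([url]).load()
        time.sleep(1.5)  # rate limit: 1-2 seconds between requests
```

For batch loads, `AsyncHtmlLoader`'s `requests_per_second` parameter applies the same throttling internally.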
Error Handling:

```python
try:
    docs = loader.load()
except Exception as e:
    logger.error(f"Scraping failed: {e}")
    # Fall back to cached or simulated content
```

Content Cleaning:
- ✅ Remove navigation, ads, footers (see the sketch after this list)
- ✅ Extract main content only
- ✅ Preserve semantic structure
- ✅ Handle multiple encodings
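What such a cleaning step might look like with BeautifulSoup; the tag list is illustrative rather than exhaustive:

```python
from bs4 import BeautifulSoup

def clean_html(html: str) -> str:
    """Strip page chrome and return the main content as plain text."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove navigation, ads, footers, scripts, and styles
    for tag in soup(["nav", "aside", "footer", "header", "script", "style"]):
        tag.decompose()
    # Prefer the semantic main/article element when the page provides one
    main = soup.find("main") or soup.find("article") or soup
    return main.get_text(separator="\n", strip=True)
```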
Performance:
- ✅ Use async loaders for multiple URLs
- ✅ Batch process when possible
- ✅ Cache frequently accessed content (see the sketch after this list)
- ✅ Set reasonable timeouts
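One way to combine caching with timeouts, assuming an in-memory dict is enough for a single run (`fetch_cached` is a hypothetical helper; `requests_kwargs` is forwarded to the underlying HTTP client):

```python
from langchain_community.document_loaders import AsyncHtmlLoader

_cache: dict[str, str] = {}  # in-memory cache; swap for disk or Redis as needed

def fetch_cached(url: str) -> str:
    """Return page HTML, touching the network only on a cache miss."""
    if url not in _cache:
        loader = AsyncHtmlLoader([url], requests_kwargs={"timeout": 10})
        _cache[url] = loader.load()[0].page_content
    return _cache[url]
```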
Resources:
- LangChain Document Loaders
- LangChain Web Scraping Guide
- BeautifulSoup Documentation
- Playwright for Python
Next Steps:
- Enhance WebSearchAgent: integrate real search APIs (Google, Bing, DuckDuckGo)
- Add JavaScript Support: integrate AsyncChromiumLoader for SPA sites
- Implement Caching: cache scraped content to reduce network calls
- Add Sitemap Support: enable full site scraping via sitemaps
- Quality Filters: implement content quality scoring and filtering
Status: ✅ LangChain web scraping successfully integrated into Research Swarm
Next: Real search API integration and JavaScript rendering support