
LangChain Web Scraping Integration

Created: 2025-10-10
Status: ✅ Integrated into Research Swarm
Purpose: Document LangChain web scraping and parsing capabilities


🎯 Overview

Our Web Research Swarm uses LangChain's powerful document loaders and transformers for robust web scraping and content extraction.


📦 LangChain Components Used

1. Document Loaders

AsyncHtmlLoader ✅ (Currently Integrated)

from langchain_community.document_loaders import AsyncHtmlLoader

# Load multiple URLs asynchronously
loader = AsyncHtmlLoader(["https://example.com/page1", "https://example.com/page2"])
docs = loader.load()

Benefits:

  • ✅ Async loading for multiple URLs
  • ✅ Efficient batch processing
  • ✅ Built-in error handling
  • ✅ Returns structured Document objects
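
Each returned Document pairs the raw page HTML in page_content with request details in metadata. A quick sketch of inspecting the results from the loader above (AsyncHtmlLoader records the fetched URL under the "source" metadata key):

# Each Document pairs raw page content with its source metadata
for doc in docs:
    print(doc.metadata["source"])   # the URL the page was fetched from
    print(len(doc.page_content))    # raw HTML length, before any transformation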

WebBaseLoader (Alternative)

from langchain_community.document_loaders import WebBaseLoader

# Simple single URL loading
loader = WebBaseLoader("https://example.com")
docs = loader.load()

Benefits:

  • ✅ Simple API for single URLs
  • ✅ BeautifulSoup integration
  • ✅ CSS selector support
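
The BeautifulSoup integration means parsing can be narrowed to specific elements at load time. A minimal sketch using bs4's SoupStrainer (the class name "content" is a placeholder for whatever wraps the main content on the target site):

from bs4 import SoupStrainer
from langchain_community.document_loaders import WebBaseLoader

# Parse only elements matching the strainer; skip the rest of the page
loader = WebBaseLoader(
    "https://example.com",
    bs_kwargs={"parse_only": SoupStrainer(class_="content")},
)
docs = loader.load()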

2. Document Transformers

Html2TextTransformer ✅ (Currently Integrated)

from langchain_community.document_transformers import Html2TextTransformer

# Transform HTML to clean markdown/text
html2text = Html2TextTransformer()
cleaned_docs = html2text.transform_documents(html_docs)

Benefits:

  • ✅ Converts HTML to clean text
  • ✅ Preserves structure (headers, lists)
  • ✅ Removes scripts, styles, navigation
  • ✅ Markdown output option

BeautifulSoupTransformer (Available)

from langchain_community.document_transformers import BeautifulSoupTransformer

# Advanced HTML parsing with tag filtering
bs_transformer = BeautifulSoupTransformer()
transformed = bs_transformer.transform_documents(
    docs,
    tags_to_extract=["article", "main"],  # tag names (CSS selectors are not supported here)
    unwanted_tags=["script", "style"],    # stripped before extraction
)

Benefits:

  • ✅ Tag-based content extraction (e.g. article, main)
  • ✅ Flexible tag filtering
  • ✅ Preserves semantic structure

3. Text Splitters

RecursiveCharacterTextSplitter (Available for Integration)

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Smart text chunking
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_documents(docs)

Benefits:

  • ✅ Semantic-aware splitting
  • ✅ Configurable chunk size
  • ✅ Overlap for context preservation
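
The overlap means the tail of one chunk is repeated at the head of the next, so sentences that straddle a boundary keep their surrounding context. A small illustration continuing from the splitter above (long_article_text stands in for any scraped string):

# With chunk_size=1000 and chunk_overlap=200, consecutive chunks
# share up to 200 characters across each boundary
chunks = splitter.split_text(long_article_text)
print(f"{len(chunks)} chunks; first chunk ends with: {chunks[0][-60:]!r}")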

🏗️ Current Implementation

ContentParserAgent (agents/research/content_parser_agent.py)

async def _extract_raw_content(self, result: Dict) -> str:
    """Extract content using LangChain loaders."""
    
    if LANGCHAIN_PARSING_AVAILABLE:
        # Use LangChain's async HTML loader
        loader = AsyncHtmlLoader([url])
        docs = loader.load()
        
        # Transform HTML to clean text
        html2text = Html2TextTransformer()
        transformed_docs = html2text.transform_documents(docs)
        
        # Extract cleaned content
        content = transformed_docs[0].page_content
        return self._clean_content(content)

Fallback Hierarchy:

  1. ✅ LangChain (AsyncHtmlLoader + Html2TextTransformer)
  2. ⚠️ BeautifulSoup (Direct HTML parsing)
  3. 🔄 Simulated (For testing without network)
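
A condensed sketch of how this hierarchy can be wired. The function shape and fallback details are illustrative, not the exact code in content_parser_agent.py; only the LangChain and bs4 calls are real APIs:

def extract_content(url: str) -> str:
    """Try each extraction strategy in order of fidelity."""
    try:
        from langchain_community.document_loaders import AsyncHtmlLoader
        from langchain_community.document_transformers import Html2TextTransformer
        docs = Html2TextTransformer().transform_documents(AsyncHtmlLoader([url]).load())
        return docs[0].page_content
    except Exception:
        pass
    try:
        import requests
        from bs4 import BeautifulSoup
        html = requests.get(url, timeout=10).text
        return BeautifulSoup(html, "html.parser").get_text(separator="\n")
    except Exception:
        return f"[simulated content for {url}]"  # offline/testing fallback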

🚀 Advanced Features (Available for Future Integration)

1. JavaScript-Heavy Sites

AsyncChromiumLoader (Requires Playwright)

from langchain_community.document_loaders import AsyncChromiumLoader

# Load JavaScript-rendered pages
loader = AsyncChromiumLoader(["https://spa-website.com"])
docs = loader.load()

Setup:

pip install playwright
playwright install
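
Once the page is rendered, the output is still raw HTML, so it pairs naturally with the transformers above. A sketch of the combined pipeline:

from langchain_community.document_loaders import AsyncChromiumLoader
from langchain_community.document_transformers import Html2TextTransformer

# Render the SPA in headless Chromium, then strip markup to clean text
docs = AsyncChromiumLoader(["https://spa-website.com"]).load()
clean_docs = Html2TextTransformer().transform_documents(docs)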

2. Sitemap Scraping

SitemapLoader

from langchain_community.document_loaders import SitemapLoader

# Scrape entire sitemap
loader = SitemapLoader("https://example.com/sitemap.xml")
docs = loader.load()
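
Sitemaps for large sites can be huge; filter_urls accepts regex patterns to restrict which entries are fetched (the /blog/ pattern below is a placeholder):

# Only load sitemap entries whose URL matches one of the patterns
loader = SitemapLoader(
    "https://example.com/sitemap.xml",
    filter_urls=["https://example.com/blog/.*"],
)
docs = loader.load()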

3. Recursive Crawling

RecursiveUrlLoader

from bs4 import BeautifulSoup
from langchain_community.document_loaders import RecursiveUrlLoader

# Recursively crawl a website up to two links deep
loader = RecursiveUrlLoader(
    "https://example.com",
    max_depth=2,
    extractor=lambda html: BeautifulSoup(html, "html.parser").get_text(),
)
docs = loader.load()

📊 Integration Status

| Component | Status | File | Notes |
|-----------|--------|------|-------|
| AsyncHtmlLoader | ✅ Integrated | content_parser_agent.py | Async URL loading |
| Html2TextTransformer | ✅ Integrated | content_parser_agent.py | HTML to clean text |
| BeautifulSoupTransformer | 📋 Available | – | For advanced parsing |
| RecursiveCharacterTextSplitter | 📋 Available | – | For chunking long content |
| AsyncChromiumLoader | 📋 Available | – | For JS-heavy sites |
| SitemapLoader | 📋 Available | – | For sitemap scraping |
| RecursiveUrlLoader | 📋 Available | – | For site crawling |

🔧 Configuration

Install Full LangChain Web Tools

# Core LangChain web scraping
pip install langchain-community

# HTML parsing
pip install beautifulsoup4 html2text lxml

# JavaScript rendering (optional)
pip install playwright
playwright install chromium

# Advanced scraping (optional)
pip install selenium

Environment Variables

# Optional: API keys for commercial scrapers
OXYLABS_API_KEY=your_key_here
BRIGHTDATA_API_KEY=your_key_here

# User agent for polite scraping
USER_AGENT="Mozilla/5.0 (Research Bot)"
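
For a per-loader override rather than a process-wide environment variable, AsyncHtmlLoader accepts a header_template dict; a sketch (the header value mirrors the example above):

from langchain_community.document_loaders import AsyncHtmlLoader

# Override request headers for this loader only
loader = AsyncHtmlLoader(
    ["https://example.com"],
    header_template={"User-Agent": "Mozilla/5.0 (Research Bot)"},
)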

🎯 Best Practices

1. Respectful Scraping

  • ✅ Check robots.txt before scraping
  • ✅ Implement rate limiting (1-2 seconds between requests)
  • ✅ Use meaningful User-Agent strings
  • ✅ Cache results to avoid repeated requests
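
A minimal sketch combining the first two points, using only the standard library. The urls list and fetch function are placeholders for your own loader call:

import time
from urllib import robotparser

rp = robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

for url in urls:  # urls: pages on the same site
    if not rp.can_fetch("Mozilla/5.0 (Research Bot)", url):
        continue  # robots.txt disallows this path
    fetch(url)    # fetch: any of the loaders above
    time.sleep(1.5)  # polite delay between requests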

2. Error Handling

import logging

logger = logging.getLogger(__name__)

try:
    docs = loader.load()  # loader: any of the loaders above
except Exception as e:
    logger.error(f"Scraping failed: {e}")
    docs = []  # fall back to cached or simulated content here

3. Content Quality

  • ✅ Remove navigation, ads, footers
  • ✅ Extract main content only
  • ✅ Preserve semantic structure
  • ✅ Handle multiple encodings

4. Performance

  • ✅ Use async loaders for multiple URLs
  • ✅ Batch process when possible
  • ✅ Cache frequently accessed content
  • ✅ Set reasonable timeouts
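
A tiny in-memory cache sketch covering the caching point (the _cache dict and scrape function are illustrative, not part of the current implementation):

_cache: dict[str, str] = {}

def cached_scrape(url: str) -> str:
    """Return cached content when available; scrape otherwise."""
    if url not in _cache:
        _cache[url] = scrape(url)  # scrape: any of the extraction paths above
    return _cache[url]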

🔄 Next Steps

  1. Enhance WebSearchAgent - Integrate real search APIs (Google, Bing, DuckDuckGo)
  2. Add JavaScript Support - Integrate AsyncChromiumLoader for SPA sites
  3. Implement Caching - Cache scraped content to reduce network calls
  4. Add Sitemap Support - Enable full site scraping via sitemaps
  5. Quality Filters - Implement content quality scoring and filtering

Status: ✅ LangChain web scraping successfully integrated into Research Swarm
Next: Real search API integration and JavaScript rendering support