Commit d0d4717

Add SciELO repository implementation and documentation
- Add SciELO web scraper implementation (pygetpapers/repositories/scielo/)
- Add comprehensive documentation in docs/scielo/
- Add SciELO configuration to config.ini
- Add demonstration script (examples/scielo_demonstration.py)
- Configure flake8 to ignore E501 (line length) errors
- Move exploration files to docs/scielo/ for pedagogical value

Features:
- Web scraping with proper headers and rate limiting
- Metadata extraction (title, authors, abstract, DOI, PDF URLs)
- Article download functionality (HTML + PDFs)
- Multiple output formats (CSV, HTML, XML)
- Regional SciELO site support (9+ countries)
- Climate change example validation

Status: Basic implementation complete and tested
1 parent 240ac3f commit d0d4717
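
The commit message lists metadata extraction (title, authors, DOI, PDF URLs), but the scraper code is not shown in this view. As a rough illustration only, the sketch below reads those fields from `citation_*` meta tags in an article page using the standard library. It assumes the page exposes such tags, which this commit does not confirm. The names `CitationMetaParser` and `extract_metadata` are hypothetical and are not taken from pygetpapers. The abstract is left out because the commit does not say where it is found.

```python
from html.parser import HTMLParser


class CitationMetaParser(HTMLParser):
    """Collect <meta name="citation_*" content="..."> tags into a dict of lists."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags also reach this method.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        content = attr.get("content")
        if name.startswith("citation_") and content:
            self.fields.setdefault(name, []).append(content.strip())


def extract_metadata(html):
    """Return a small metadata dict; the tag names are an assumption."""
    parser = CitationMetaParser()
    parser.feed(html)
    fields = parser.fields

    def first(key):
        values = fields.get(key)
        return values[0] if values else None

    return {
        "title": first("citation_title"),
        "authors": fields.get("citation_author", []),
        "doi": first("citation_doi"),
        "pdf_url": first("citation_pdf_url"),
    }
```

Fields that are missing from the page come back as `None` (or an empty list for authors), so a caller can tell "not found" apart from an empty string before writing CSV or XML output.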

17 files changed

Lines changed: 17344 additions & 1 deletion

.flake8

Lines changed: 2 additions & 1 deletion
@@ -9,4 +9,5 @@ exclude =
     *.egg-info
 ignore =
     E203,
-    W503
+    W503,
+    E501

docs/scielo/README.md

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+# SciELO Repository Documentation
+
+This directory contains comprehensive documentation for the SciELO repository integration with pygetpapers.
+
+## Documentation Structure
+
+### Core Documentation
+- **[Development Session](scielo-development-session.md)** - Complete development process and findings
+- **[Site Exploration](site-exploration.md)** - Detailed analysis of SciELO's web interface
+- **[Implementation Guide](implementation-guide.md)** - Technical implementation details
+- **[Usage Tutorial](usage-tutorial.md)** - Complete user guide and examples
+
+### Technical Guides
+- **[Query Construction](query-construction.md)** - How to build effective search queries
+- **[Metadata Extraction](metadata-extraction.md)** - Understanding and processing metadata
+- **[Document Download](document-download.md)** - Fulltext download strategies
+- **[DataTables Integration](datatables-integration.md)** - Interactive HTML table generation
+
+### Production Guides
+- **[CLI Usage](cli-usage.md)** - Command-line interface examples
+- **[Jupyter Integration](jupyter-integration.md)** - Google Colab and Jupyter notebook usage
+- **[AMI Corpus Export](ami-corpus-export.md)** - Exporting to local AMI corpus format
+- **[Production Setup](production-setup.md)** - Deployment and configuration
+
+### Examples and Testing
+- **[Climate Change Examples](climate-change-examples.md)** - Real-world usage examples
+- **[Test Results](test-results.md)** - Comprehensive testing documentation
+- **[Performance Analysis](performance-analysis.md)** - Speed and efficiency metrics
+
+## Quick Start
+
+1. **Read the [Development Session](scielo-development-session.md)** for complete context
+2. **Follow the [Usage Tutorial](usage-tutorial.md)** for basic usage
+3. **Check [CLI Usage](cli-usage.md)** for command-line examples
+4. **Review [Test Results](test-results.md)** for validation
+
+## Key Features
+
+- **Multilingual Support**: English, Spanish, Portuguese content
+- **Web Scraping**: Robust scraping with rate limiting
+- **Selenium Support**: Dynamic content handling
+- **DataTables Integration**: Interactive HTML tables
+- **AMI Corpus Export**: Local corpus creation
+- **Jupyter/Colab Ready**: Notebook integration
+
+## Repository Status
+
+- **Development Phase**: Active development
+- **Target Date**: 2 days from start
+- **Testing Status**: Real integration tests only (no mocks)
+- **Documentation**: Comprehensive tutorial creation
+
+---
+
+*This documentation is being created as part of the SciELO repository development session for pygetpapers.*
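
The README names "scraping with rate limiting" as a key feature, and the commit message mentions "proper headers", but neither the delay nor the headers appear in this view. The following is a minimal sketch of one way to space out requests and attach explicit headers using only the standard library. Every name and value in it (`RateLimiter`, `build_request`, `fetch`, `DEFAULT_HEADERS`, the User-Agent string) is a hypothetical placeholder, not the implementation in pygetpapers/repositories/scielo/.

```python
import time
import urllib.request

# Hypothetical values; the commit does not show the headers it sends.
DEFAULT_HEADERS = {
    "User-Agent": "pygetpapers-scielo-sketch/0.1",
    "Accept": "text/html,application/xhtml+xml",
}


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the limiter can be tested offline
        self._sleep = sleep
        self._last = None  # time of the previous call; None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def build_request(url, headers=None):
    """Return a urllib Request carrying the default headers plus overrides."""
    merged = dict(DEFAULT_HEADERS)
    merged.update(headers or {})
    return urllib.request.Request(url, headers=merged)


def fetch(url, limiter, timeout=30):
    """Fetch one page, waiting first so that requests stay spaced out."""
    limiter.wait()
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()
```

Sharing one `RateLimiter` across all calls to `fetch` keeps the spacing global rather than per URL. Passing `clock` and `sleep` as parameters lets the limiter be exercised without real delays, although the README states that the project itself uses real integration tests only.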
