
Commit 1096068

feat: comprehensive repository field management and file size alerts
- Add file size alert system with configurable thresholds (100MB default)
- Implement ignored fields documentation for all repositories
- Create comprehensive repository documentation and metadata schemas
- Deprecate arXiv and CrossRef repositories (moved to IGNORED_REPOSITORIES)
- Add detailed individual repository metadata field documentation
- Implement file size alerts in Streamlit UI with visual indicators
- Create centralized configuration management for file size alerts
- Add utility functions for file size checking and formatting
- Update download tools to trigger file size alerts
- Create repository summary with comparison matrix and selection guide
- Document complete discussion and implementation in repository_fields.md

Files added:
- docs/IGNORED_REPOSITORIES.md - Repository deprecation documentation
- docs/file-size-alerts.md - File size alert feature documentation
- docs/repositories_summary.md - Comprehensive repository overview
- docs/repository_fields.md - Complete discussion and implementation record
- docs/metadata_fields/ - Individual repository field documentation
- pygetpapers/core/file_size_config.py - File size alert configuration
- IGNORED_FIELDS.md files for all repositories

Files modified:
- pygetpapers/core/download_tools.py - Added file size alert integration
- pygetpapers/core/file_utils.py - Added file size utility functions
- pygetpapers/streamlit_app.py - Added file size alerts to UI
- docs/styleguide.md - Updated documentation

This commit represents a major improvement in repository field management, user experience, and documentation completeness.
1 parent 79769cd commit 1096068

28 files changed

Lines changed: 4359 additions & 2 deletions

docs/IGNORED_REPOSITORIES.md

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
1+
# Ignored Repositories
2+
3+
**Date:** July 28, 2025 (system date of generation)
4+
**Purpose:** Document repositories that should be ignored in pygetpapers
5+
**Scope:** Repositories that are deprecated, problematic, or no longer supported
6+
7+
## Overview
8+
9+
This document lists repositories that pygetpapers should ignore. They may be deprecated, have technical issues, or no longer be actively maintained.
10+
11+
## Ignored Repositories
12+
13+
### 1. arXiv Repository
14+
15+
**Repository:** arxiv
16+
**Reason for Ignoring:** Deprecated in favor of more comprehensive alternatives
17+
**Status:** No longer actively maintained
18+
**Alternative:** Use BioRxiv for preprints or Europe PMC for published content
19+
20+
**Technical Issues:**
21+
- Limited metadata compared to other repositories
22+
- No structured XML support
23+
- Basic API functionality
24+
- Limited content formats
25+
26+
**Impact:** Removing arXiv reduces complexity and focuses on more feature-rich repositories.
27+
28+
### 2. CrossRef Repository
29+
30+
**Repository:** crossref
31+
**Reason for Ignoring:** Metadata-only repository with no full-text access
32+
**Status:** Limited utility for content analysis
33+
**Alternative:** Use OpenAlex for comprehensive metadata and citation data
34+
35+
**Technical Issues:**
36+
- No full-text content access
37+
- Metadata-only functionality
38+
- Limited research value for content analysis
39+
- Redundant with other metadata sources
40+
41+
**Impact:** Removing CrossRef simplifies the repository landscape and focuses on content-rich sources.
42+
43+
## Implementation Notes
44+
45+
### Repository Removal Process
46+
1. **Documentation Updates:** Update all documentation to exclude ignored repositories
47+
2. **Code Cleanup:** Remove repository-specific code and configurations
48+
3. **Testing Updates:** Update tests to exclude ignored repositories
49+
4. **User Communication:** Inform users about repository deprecation
50+
51+
### Migration Guidance
52+
- **From arXiv:** Use BioRxiv for biology preprints, Europe PMC for published content
53+
- **From CrossRef:** Use OpenAlex for comprehensive metadata and citation analysis
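
The migration guidance above could be expressed in code roughly as follows. This is a minimal sketch, not pygetpapers' actual implementation: the `IGNORED_REPOSITORIES` mapping, the replacement keys (`biorxiv`, `europe_pmc`, `openalex`), and the `resolve_repository` helper are hypothetical names chosen for illustration.

```python
import warnings

# Hypothetical mapping: ignored repository -> suggested replacements,
# taken from the migration guidance in this document.
IGNORED_REPOSITORIES = {
    "arxiv": ["biorxiv", "europe_pmc"],
    "crossref": ["openalex"],
}


def resolve_repository(name: str) -> list[str]:
    """Return the repositories to query for a requested repository name.

    Names that are not ignored are returned unchanged (normalised to
    lower case). Ignored names trigger a DeprecationWarning and are
    replaced by their suggested alternatives.
    """
    key = name.strip().lower()
    alternatives = IGNORED_REPOSITORIES.get(key)
    if alternatives is None:
        return [key]
    warnings.warn(
        f"Repository '{key}' is ignored; consider: {', '.join(alternatives)}",
        DeprecationWarning,
        stacklevel=2,
    )
    return list(alternatives)
```

Emitting a warning rather than raising an error keeps existing scripts running while still informing users of the deprecation, which matches the "User Communication" step listed earlier.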
54+
55+
### Future Considerations
56+
- Monitor for new repositories that may replace ignored ones
57+
- Consider re-evaluating ignored repositories if they improve significantly
58+
- Maintain documentation for historical reference
59+
60+
## Related Documentation
61+
- [Repository Summary](repositories_summary.md)
62+
- [Repository Fields Schema](repository_fields_schema.md)
63+
- [File Size Alerts](file-size-alerts.md)

docs/REPOSITORIES.md

Lines changed: 173 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,173 @@
1+
# Repository Output Capabilities Analysis
2+
3+
## Overview
4+
This document analyzes the eight repositories covered by pygetpapers (seven active, plus arXiv, which is disabled), including their output capabilities, rate limits, and key limitations.
5+
6+
## Summary Table
7+
8+
| Repository | API Access | Web Scraping | Metadata | PDF | XML/JATS | HTML | Figures | Tables | Supplementary | Rate Limits | Key Limitations |
|------------|------------|--------------|----------|-----|----------|------|---------|--------|---------------|-------------|-----------------|
| **Europe PMC** | ✅ REST API | ❌ | ✅ Complete | ✅ Direct | ✅ JATS | ✅ Generated | ❌ | ❌ | ✅ ZIP/PDF/TXT | 1000/hr | Biomedical focus only |
| **BioRxiv** | ✅ Limited API | ✅ Advanced | ✅ Complete | ✅ Direct | ❌ | ✅ Direct | ❌ | ❌ | ❌ | 1000/hr API, 1/sec scraping | API: date-only queries |
| **arXiv** | ❌ Disabled | ❌ Disabled | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | N/A | Policy prohibits automated access |
| **Crossref** | ✅ REST API | ❌ | ✅ Complete | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | 500/hr | Metadata only, no full-text |
| **OpenAlex** | ✅ REST API | ❌ | ✅ Complete | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | 100,000/day | Metadata only, no full-text |
| **Redalyc** | ❌ | ✅ Advanced | ✅ Complete | ✅ Direct | ✅ Direct | ✅ Direct | ❌ | ❌ | ✅ PDF/TXT | 1/sec | Spanish/Portuguese focus |
| **SciELO** | ❌ | ✅ Advanced | ✅ Complete | ✅ Direct | ✅ Direct | ✅ Direct | ❌ | ❌ | ✅ PDF/TXT | 2/sec | Latin America focus |
| **UPSpace** | ✅ REST API | ❌ | ✅ Complete | ✅ Direct | ❌ | ❌ | ❌ | ❌ | ❌ | 1/sec | Institutional repository |
18+
19+
## Detailed Analysis by Repository
20+
21+
### 1. **Europe PMC**
22+
**Strengths:**
23+
- Comprehensive REST API with 1000 requests/hour
24+
- Full JATS XML support with HTML conversion
25+
- Direct PDF downloads
26+
- Rich metadata (PMID, PMCID, DOI, citations, references)
27+
- Supplementary file support (ZIP, PDF, TXT)
28+
- Biomedical focus with extensive coverage
29+
30+
**Limitations:**
31+
- Biomedical/health sciences focus only
32+
- No explicit figure/table extraction
33+
- Requires specific query formats
34+
35+
### 2. **BioRxiv**
36+
**Strengths:**
37+
- Dual approach: API + advanced web scraping
38+
- Direct HTML and PDF downloads
39+
- Complete metadata extraction
40+
- Supports both bioRxiv and medRxiv
41+
- Rich content with full-text access
42+
43+
**Limitations:**
44+
- API limited to date-based queries only
45+
- Web scraping required for text searches
46+
- No XML/JATS support
47+
- No explicit figure/table extraction
48+
49+
### 3. **arXiv**
50+
**Status: DISABLED**
51+
- Completely disabled due to arXiv's policy against automated downloads
52+
- No access to any content
53+
- Policy prohibits scraping or bulk downloads
54+
55+
### 4. **Crossref**
56+
**Strengths:**
57+
- Comprehensive metadata API (500 requests/hour)
58+
- Rich bibliographic data
59+
- DOI-based access
60+
- Multiple export formats (JSON, XML)
61+
62+
**Limitations:**
63+
- Metadata only - no full-text content
64+
- No PDF, XML, or HTML downloads
65+
- No figure/table access
66+
- No supplementary files
67+
68+
### 5. **OpenAlex**
69+
**Strengths:**
70+
- Very high rate limit (100,000 requests/day)
71+
- Comprehensive metadata with citations
72+
- Open access indicators
73+
- Rich bibliographic relationships
74+
75+
**Limitations:**
76+
- Metadata only - no full-text content
77+
- No PDF, XML, or HTML downloads
78+
- No figure/table access
79+
- No supplementary files
80+
81+
### 6. **Redalyc**
82+
**Strengths:**
83+
- Advanced web scraping capabilities
84+
- Direct PDF and XML downloads
85+
- HTML content extraction
86+
- Multilingual support (Spanish, Portuguese, English)
87+
- Supplementary file support
88+
89+
**Limitations:**
90+
- Web scraping only (no API)
91+
- Rate limited to 1 request/second
92+
- Latin American focus
93+
- No explicit figure/table extraction
94+
95+
### 7. **SciELO**
96+
**Strengths:**
97+
- Advanced web scraping with encoding detection
98+
- Direct PDF and XML downloads
99+
- HTML content extraction
100+
- Multilingual support
101+
- Supplementary file support
102+
- Latin American and African focus
103+
104+
**Limitations:**
105+
- Web scraping only (no API)
106+
- Rate limited to 2 requests/second
107+
- Regional focus
108+
- No explicit figure/table extraction
109+
110+
### 8. **UPSpace**
111+
**Strengths:**
112+
- Modern DSpace REST API
113+
- Direct PDF downloads
114+
- Rich metadata with SDG classifications
115+
- Institutional repository with academic focus
116+
- Clean JSON data structure
117+
118+
**Limitations:**
119+
- No XML/JATS support
120+
- No HTML generation
121+
- No explicit figure/table extraction
122+
- No supplementary files
123+
- Institutional focus (University of Pretoria)
124+
125+
## Key Findings
126+
127+
### **Rate Limiting Considerations**
128+
- **Most Generous**: OpenAlex (100,000/day)
129+
- **Moderate**: Europe PMC (1000/hour), Crossref (500/hour)
130+
- **Conservative**: BioRxiv (1000/hour API + 1/sec scraping), Redalyc (1/sec), SciELO (2/sec), UPSpace (1/sec)
131+
- **Disabled**: arXiv
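
To compare limits stated in different units, it can help to convert each one to a minimum delay between requests. The sketch below does that and adds a simple client-side throttle. It is illustrative only: the `RATE_LIMITS` keys, `min_interval`, and the `Throttle` class are hypothetical names, not part of pygetpapers, and the figures are copied from this document rather than from the repositories' own API documentation.

```python
import time

# Limits as listed in this document: (max requests, window in seconds).
RATE_LIMITS = {
    "europe_pmc": (1000, 3600),
    "crossref": (500, 3600),
    "openalex": (100_000, 86_400),
    "biorxiv_api": (1000, 3600),
    "biorxiv_scraping": (1, 1),
    "redalyc": (1, 1),
    "scielo": (2, 1),
    "upspace": (1, 1),
}


def min_interval(repository: str) -> float:
    """Seconds to wait between requests to stay within the listed limit."""
    requests, window = RATE_LIMITS[repository]
    return window / requests


class Throttle:
    """Spaces successive calls to wait() at least `interval` seconds apart."""

    def __init__(self, interval: float, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0  # earliest time the next request may start

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        self._next_allowed = now + delay + self.interval
        return delay
```

A caller might create `Throttle(min_interval("europe_pmc"))` and call `wait()` before each request. Under the figures above, 1000 requests/hour works out to one request every 3.6 seconds, 500/hour to one every 7.2 seconds, and 100,000/day to one every 0.864 seconds, so the hourly API limits are in practice stricter than the 1/sec scraping limits.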
132+
133+
### **Content Access Patterns**
134+
- **Full-Text Champions**: Europe PMC, BioRxiv, Redalyc, SciELO
135+
- **Metadata Only**: Crossref, OpenAlex
136+
- **Institutional**: UPSpace
137+
- **Disabled**: arXiv
138+
139+
### **Repository-Specific Quirks**
140+
1. **BioRxiv**: API vs web scraping dichotomy
141+
2. **Europe PMC**: Biomedical focus with JATS support
142+
3. **Redalyc/SciELO**: Regional focus with multilingual content
143+
4. **UPSpace**: SDG classifications as unique feature
144+
5. **Crossref/OpenAlex**: Rich metadata but no full-text
145+
6. **arXiv**: Completely disabled due to policy
146+
147+
### **Recommended Usage Strategy**
148+
- **For full-text research**: Europe PMC, BioRxiv, Redalyc, SciELO
149+
- **For metadata analysis**: Crossref, OpenAlex
150+
- **For institutional content**: UPSpace
151+
- **Avoid**: arXiv (disabled)
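
The usage strategy above amounts to filtering repositories by the outputs they provide. A minimal sketch of that selection follows, encoding the Summary Table as data. A capability is treated as available only where the table marks it with ✅. The `Capabilities` dataclass, the `REPOSITORIES` mapping, and `select_repositories` are hypothetical names introduced for illustration, not pygetpapers code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    """Output types a repository provides, per the Summary Table."""

    metadata: bool = False
    pdf: bool = False
    xml: bool = False
    html: bool = False
    supplementary: bool = False
    enabled: bool = True


# Full-text sources that offer every listed output type.
_FULL = Capabilities(metadata=True, pdf=True, xml=True, html=True, supplementary=True)

REPOSITORIES = {
    "Europe PMC": _FULL,
    "BioRxiv": Capabilities(metadata=True, pdf=True, html=True),
    "arXiv": Capabilities(enabled=False),
    "Crossref": Capabilities(metadata=True),
    "OpenAlex": Capabilities(metadata=True),
    "Redalyc": _FULL,
    "SciELO": _FULL,
    "UPSpace": Capabilities(metadata=True, pdf=True),
}


def select_repositories(*required: str) -> list[str]:
    """Return enabled repositories that provide every required output type."""
    return [
        name
        for name, caps in REPOSITORIES.items()
        if caps.enabled and all(getattr(caps, field) for field in required)
    ]
```

For example, `select_repositories("pdf")` yields the four full-text sources plus UPSpace, while `select_repositories("xml", "supplementary")` narrows the result to Europe PMC, Redalyc, and SciELO.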
152+
153+
## Maintenance Notes
154+
155+
**Last Updated**: January 2025
156+
157+
**Update Frequency**: This document should be updated when:
158+
- New repositories are added
159+
- Rate limits change
160+
- API endpoints are modified
161+
- Repository policies change
162+
- New capabilities are implemented
163+
164+
**Information Sources**:
165+
- Repository configuration files (`config.ini`)
166+
- Implementation files (`*.py`)
167+
- Repository documentation
168+
- API documentation
169+
- Testing results
170+
171+
---
172+
173+
*This analysis shows that pygetpapers provides access to a diverse range of repositories with different strengths and limitations, allowing users to choose the most appropriate source based on their specific research needs.*
