Commit 5655a76

tidying and designing amilib interface
1 parent 2a1ded3 commit 5655a76

12 files changed

Lines changed: 3386 additions & 6 deletions
Lines changed: 243 additions & 0 deletions
@@ -0,0 +1,243 @@
# Amilib Integration Strategy

## Overview
This document outlines the strategy for integrating amilib (a sibling library with PDF, HTML, and Wikimedia parsing capabilities) with pygetpapers v2.0.

## Background

### Amilib Capabilities
Amilib provides advanced processing capabilities for:
- PDF parsing and text extraction
- HTML processing and cleaning
- Wikimedia content extraction
- Structured data extraction
- Content analysis and transformation

### Integration Challenge
The challenge was to integrate amilib's capabilities without:
- Creating tight coupling between libraries
- Complicating pygetpapers' core functionality
- Requiring all users to install amilib
- Compromising pygetpapers' simplicity

## Architectural Decision: Optional Post-Processing Enhancement

### Core Decision
**Treat amilib as an optional post-processing enhancement rather than a core dependency**

### Rationale

#### 1. Maintain Core Simplicity
- Pygetpapers remains focused on paper discovery and download
- Core functionality doesn't require amilib
- Users can start with basic functionality and add processing later

#### 2. Independent Development Cycles
- Both libraries can evolve independently
- Changes in amilib don't break pygetpapers
- Each library can optimize for its specific use case

#### 3. User Choice
- Users can choose their processing depth
- Lightweight users aren't forced to install heavy dependencies
- Advanced users can leverage full amilib capabilities

#### 4. File System as Integration Boundary
- Simple, reliable integration mechanism
- No complex API coupling
- Easy to debug and troubleshoot
- Supports batch processing workflows

## Implementation Strategy

### 1. File System Integration
```
pygetpapers_output/
├── repository/
│   ├── query/
│   │   ├── paper_id/
│   │   │   ├── metadata.json
│   │   │   ├── fulltext.pdf
│   │   │   ├── fulltext.html
│   │   │   └── processed/
│   │   │       ├── extracted_text.txt
│   │   │       ├── structured_data.json
│   │   │       └── analysis_results.json
```
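
Because the directory layout is the whole contract, a post-processor only needs to walk it. The following is a minimal sketch, not part of either library: the function names `iter_paper_dirs` and `needs_processing` are hypothetical, and it assumes that a paper directory is any directory holding a `metadata.json` and that the presence of `processed/` marks a paper as already handled.

```python
from pathlib import Path
from typing import Iterator


def iter_paper_dirs(root: Path) -> Iterator[Path]:
    """Yield every directory under root that contains a metadata.json file."""
    for metadata in sorted(root.rglob("metadata.json")):
        # Ignore anything a post-processor may have written under processed/.
        if "processed" in metadata.relative_to(root).parts:
            continue
        yield metadata.parent


def needs_processing(paper_dir: Path) -> bool:
    """Assumption: a paper is unprocessed when it has no processed/ directory."""
    return not (paper_dir / "processed").is_dir()
```

A caller would typically combine the two, for example `[d for d in iter_paper_dirs(Path("pygetpapers_output")) if needs_processing(d)]`.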

### 2. Integration Endpoints
Define clear integration points between the libraries:

#### Input Interface
- Pygetpapers provides standardized file structure
- Consistent metadata format (JSON)
- Predictable file naming conventions
- Clear content type indicators

#### Output Interface
- Amilib processes files in place
- Creates `processed/` subdirectories
- Maintains original files
- Provides processing metadata
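
The output side of this contract can be illustrated with a small writer that only ever touches `processed/`. This is a sketch under assumptions: `write_processed_text` is a hypothetical helper, not an amilib function, and the write-then-rename step is one possible way to avoid leaving half-written files behind.

```python
from pathlib import Path


def write_processed_text(paper_dir: Path, text: str) -> Path:
    """Write extracted text under paper_dir/processed/ and return its path."""
    out_dir = paper_dir / "processed"
    out_dir.mkdir(exist_ok=True)  # original files in paper_dir are not modified
    target = out_dir / "extracted_text.txt"
    # Write to a temporary name first so a crash cannot leave a partial file
    # under the final name.
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(text, encoding="utf-8")
    tmp.replace(target)
    return target
```

Keeping every derived file under `processed/` means the original download can always be reprocessed from scratch by deleting that one directory.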

### 3. Configuration-Driven Integration
```yaml
# Example integration configuration
amilib_integration:
  enabled: true
  processing_pipeline:
    - pdf_text_extraction
    - html_cleaning
    - metadata_enhancement
  output_formats:
    - txt
    - json
    - html
  batch_size: 100
```
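
A configuration like this is easier to trust when it is validated before any processing starts. The sketch below assumes the YAML has already been parsed into a plain dictionary by whatever loader the project uses; `AmilibIntegrationConfig`, `load_integration_config`, and the default values are hypothetical, and the sets of known steps and formats simply mirror the example above.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple

# Step and format names copied from the example configuration above.
KNOWN_STEPS = {"pdf_text_extraction", "html_cleaning", "metadata_enhancement"}
KNOWN_FORMATS = {"txt", "json", "html"}


@dataclass(frozen=True)
class AmilibIntegrationConfig:
    """Hypothetical typed view of the amilib_integration section."""

    enabled: bool = False  # assumed default: integration is opt-in
    processing_pipeline: Tuple[str, ...] = ()
    output_formats: Tuple[str, ...] = ("txt",)
    batch_size: int = 100


def load_integration_config(raw: Dict[str, Any]) -> AmilibIntegrationConfig:
    """Validate an already-parsed configuration mapping."""
    section = raw.get("amilib_integration") or {}
    config = AmilibIntegrationConfig(
        enabled=bool(section.get("enabled", False)),
        processing_pipeline=tuple(section.get("processing_pipeline", ())),
        output_formats=tuple(section.get("output_formats", ("txt",))),
        batch_size=int(section.get("batch_size", 100)),
    )
    unknown_steps = set(config.processing_pipeline) - KNOWN_STEPS
    if unknown_steps:
        raise ValueError(f"unknown pipeline steps: {sorted(unknown_steps)}")
    unknown_formats = set(config.output_formats) - KNOWN_FORMATS
    if unknown_formats:
        raise ValueError(f"unknown output formats: {sorted(unknown_formats)}")
    if config.batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return config
```

Defaulting `enabled` to `False` is a design choice that matches the decision to keep amilib optional: a missing section leaves pygetpapers behaving exactly as before.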

## Integration Workflows

### 1. Basic Workflow (Pygetpapers Only)
```bash
pygetpapers --query "climate change" --repository biorxiv --limit 10
```
- Downloads papers to file system
- Creates metadata.json files
- Provides basic content access

### 2. Enhanced Workflow (With Amilib)
```bash
# Step 1: Download papers
pygetpapers --query "climate change" --repository biorxiv --limit 10

# Step 2: Process with amilib
amilib_process --input-dir pygetpapers_output --pipeline full
```
- Downloads papers (same as basic workflow)
- Processes PDFs for text extraction
- Enhances metadata with extracted information
- Creates structured data outputs

### 3. Integrated Workflow (Future)
```bash
pygetpapers --query "climate change" --repository biorxiv --limit 10 --post-process amilib
```
- Single command execution
- Automatic amilib processing
- Unified output structure

## Technical Implementation

### 1. Schema Definition
Define clear schemas for integration:

#### Pygetpapers Output Schema
```json
{
  "paper_id": "string",
  "repository": "string",
  "query": "string",
  "download_timestamp": "ISO8601",
  "files": {
    "metadata": "metadata.json",
    "pdf": "fulltext.pdf",
    "html": "fulltext.html"
  },
  "processing_status": {
    "amilib_processed": false,
    "processing_timestamp": null
  }
}
```
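
A consumer can check a record against this schema before relying on it. The sketch below is a hand-rolled check using only the standard library, not a validator shipped by pygetpapers; `validate_paper_record` and its error messages are hypothetical, and it covers only the fields shown in the schema above.

```python
from datetime import datetime
from typing import Any, Dict, List

REQUIRED_KEYS = {
    "paper_id", "repository", "query",
    "download_timestamp", "files", "processing_status",
}


def validate_paper_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(record))]
    for key in ("paper_id", "repository", "query"):
        if key in record and not isinstance(record[key], str):
            problems.append(f"{key} must be a string")
    timestamp = record.get("download_timestamp")
    if timestamp is not None:
        try:
            # Before Python 3.11, fromisoformat rejects a trailing "Z";
            # an explicit offset such as "+00:00" is accepted.
            datetime.fromisoformat(timestamp)
        except (TypeError, ValueError):
            problems.append("download_timestamp must be an ISO 8601 string")
    status = record.get("processing_status")
    if isinstance(status, dict):
        if not isinstance(status.get("amilib_processed"), bool):
            problems.append("processing_status.amilib_processed must be a boolean")
    elif "processing_status" in record:
        problems.append("processing_status must be an object")
    return problems
```

Returning a list of problems rather than raising on the first one makes it practical to report every issue in a batch of records at once.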

#### Amilib Processing Schema
```json
{
  "processing_metadata": {
    "amilib_version": "string",
    "processing_timestamp": "ISO8601",
    "pipeline_steps": ["step1", "step2"],
    "processing_status": "success|failed|partial"
  },
  "extracted_content": {
    "text": "extracted_text.txt",
    "structured_data": "structured_data.json",
    "analysis": "analysis_results.json"
  }
}
```
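
The three status values suggest a simple rule for deriving `processing_status` from per-step outcomes. The sketch below assumes that rule (all steps succeeded, some succeeded, none succeeded); `build_processing_record` is a hypothetical helper and the mapping of step names to booleans is an illustrative input format, not something amilib is known to produce.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def build_processing_record(
    amilib_version: str, step_results: Dict[str, bool]
) -> Dict[str, Any]:
    """Build a record shaped like the processing schema above."""
    if not step_results:
        raise ValueError("at least one pipeline step is required")
    outcomes = list(step_results.values())
    if all(outcomes):
        status = "success"
    elif any(outcomes):
        status = "partial"
    else:
        status = "failed"
    return {
        "processing_metadata": {
            "amilib_version": amilib_version,
            "processing_timestamp": datetime.now(timezone.utc).isoformat(),
            "pipeline_steps": list(step_results),  # keeps insertion order
            "processing_status": status,
        },
        "extracted_content": {
            "text": "extracted_text.txt",
            "structured_data": "structured_data.json",
            "analysis": "analysis_results.json",
        },
    }
```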

### 2. Error Handling
- Graceful degradation when amilib is unavailable
- Clear error messages for missing dependencies
- Fallback to basic pygetpapers functionality
- Processing status tracking
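
One way to get this behaviour is to probe for the optional package before calling into it. The sketch below uses `importlib.util.find_spec` for the probe; `post_process_if_available` is a hypothetical wrapper, and the `process` callable stands in for whatever amilib entry point is eventually used, since that API is not defined in this document.

```python
import importlib.util
import logging
from pathlib import Path
from typing import Callable

logger = logging.getLogger(__name__)


def post_process_if_available(
    paper_dir: Path,
    process: Callable[[Path], None],
    module_name: str = "amilib",
) -> str:
    """Return "processed", "skipped", or "failed" for one paper directory."""
    # find_spec returns None when a top-level module is not installed.
    if importlib.util.find_spec(module_name) is None:
        logger.warning(
            "%s is not installed; keeping the basic pygetpapers output for %s",
            module_name, paper_dir,
        )
        return "skipped"
    try:
        process(paper_dir)
    except Exception:
        # Log the traceback but do not abort the rest of the batch.
        logger.exception("post-processing failed for %s", paper_dir)
        return "failed"
    return "processed"
```

The returned string can be stored alongside the paper to support the status tracking listed above.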

### 3. Performance Considerations
- Batch processing capabilities
- Incremental processing (skip already processed files)
- Parallel processing where appropriate
- Resource usage monitoring
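
Batching, skipping, and parallelism can be combined in a few lines. This is a sketch under assumptions: `process_pending` is a hypothetical helper, an existing `processed/` directory is taken as the marker for "already processed", and a thread pool is used for simplicity even though CPU-heavy parsing may be better served by processes.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable, List


def process_pending(
    paper_dirs: Iterable[Path],
    process: Callable[[Path], None],
    batch_size: int = 100,
    max_workers: int = 4,
) -> List[Path]:
    """Process only unprocessed paper directories; return the ones handled."""
    # Incremental: skip directories that already have a processed/ folder.
    pending = [d for d in paper_dirs if not (d / "processed").is_dir()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for start in range(0, len(pending), batch_size):
            batch = pending[start:start + batch_size]
            # list() waits for the whole batch and re-raises worker errors.
            list(pool.map(process, batch))
    return pending
```

Running the same command twice is then cheap: the second run finds nothing pending and returns an empty list.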

## Future Evolution

### Phase 1: File System Integration (Current)
- Basic file system integration
- Manual workflow coordination
- Clear separation of concerns

### Phase 2: Enhanced Integration (Future)
- Optional CLI integration
- Configuration-driven processing
- Unified output formats

### Phase 3: Deep Integration (Future)
- Shared configuration management
- Integrated processing pipelines
- Unified user interface

## Benefits of This Approach

### 1. Modularity
- Each library maintains its core purpose
- Easy to understand and maintain
- Clear boundaries and responsibilities

### 2. Flexibility
- Users can choose their processing depth
- Supports various use cases
- Enables experimentation and customization

### 3. Reliability
- Simple integration mechanism
- Easy to debug and troubleshoot
- Robust error handling

### 4. Scalability
- Supports large-scale processing
- Enables batch operations
- Efficient resource utilization

## Migration Path

### For Existing Users
- No breaking changes to pygetpapers
- Gradual adoption of amilib features
- Backward compatibility maintained

### For New Users
- Start with basic pygetpapers functionality
- Add amilib processing as needed
- Clear upgrade path available

## Conclusion

The file system-based integration strategy provides a robust, flexible, and maintainable approach to combining pygetpapers and amilib capabilities. This approach:

- Maintains the simplicity of pygetpapers
- Provides powerful processing capabilities through amilib
- Enables independent development and evolution
- Supports various user needs and use cases

The clear separation of concerns and file system integration boundary ensures that both libraries can evolve independently while providing users with powerful, integrated workflows when needed.
