Commit 330283f

committed
flkae8 and blabk
1 parent 446a803 commit 330283f

14 files changed

Lines changed: 4604 additions & 121 deletions

LOG.md

Lines changed: 158 additions & 4 deletions
@@ -1,5 +1,150 @@
 # Pygetpapers Streamlit UI Development Log

+## Latest Updates
+
+### 2024-07-08: Automatic Corpus Detection
+- **Added automatic corpus detection**: The app now automatically scans for existing pygetpapers output directories on startup
+- **Smart directory recognition**: Detects corpora by looking for characteristic files (eupmc_results.json, europe_pmc.csv, etc.) and paper ID patterns (PMC, arXiv:, etc.)
+- **Metadata extraction**: Automatically extracts corpus information including API type, paper count, creation date, and original query
+- **Manual refresh**: Added "🔄 Refresh Corpus List" button to manually scan for newly downloaded corpora
+- **Session state integration**: Auto-detected corpora are seamlessly integrated into the existing corpus management system
+- **Cross-version support**: Feature implemented in both main and no-dependencies Streamlit apps
+
+### 2024-07-08: Progress Display Improvements
+- **Complete tqdm filtering**: Completely removed tqdm progress bars from both stdout and stderr output display
+- **Clean output display**: Filtered output now shows only meaningful information without redundant progress bars
+- **Meaningful activity display**: Enhanced recent activity section to show only meaningful output lines
+- **Smart filtering**: Automatically filters out debug messages, timestamps, and other noise
+- **Better color coding**: Improved visual feedback with color-coded messages (green for success, red for errors, etc.)
+- **Fallback indicators**: Added fallback messages when no meaningful output is available
+- **Cross-version support**: Applied filtering to both main and no-dependencies Streamlit apps
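The filtering described in the bullets above could look roughly like the sketch below. The function names, the regular expressions, and the fallback text are all assumptions made for illustration; the commit's actual filter is not shown in this diff.

```python
import re

# Heuristic patterns (assumptions, not taken from the app's source).
# A tqdm bar typically looks like: " 40%|████      | 20/50 [00:04<00:06, 4.9it/s]"
TQDM_BAR = re.compile(r"\d{1,3}%\|.*\|\s*\d+/\d+")
DEBUG_PREFIX = re.compile(r"^\s*(DEBUG|TRACE)\b")


def is_meaningful_line(line: str) -> bool:
    """Return True if a captured output line is worth showing to the user."""
    text = line.strip()
    if not text:
        return False
    if TQDM_BAR.search(text):
        return False
    if DEBUG_PREFIX.match(text):
        return False
    return True


def filter_output(lines):
    """Keep meaningful lines; fall back to a placeholder when none remain."""
    kept = [ln.strip() for ln in lines if is_meaningful_line(ln)]
    return kept or ["(no meaningful output yet)"]
```

Matching on the `NN%|...| n/total` shape keeps ordinary lines that merely contain a percent sign, which a plain substring test would discard.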
+
+### 2024-07-08: Quick Stats Fix
+- **Fixed stats updating**: Quick stats now properly update when papers are downloaded
+- **Auto-detection stats**: Stats are updated when existing corpora are auto-detected on startup
+- **Accurate paper counts**: Corpus entries now use actual downloaded papers instead of requested limit
+- **Debug information**: Added debug output to show stats updates for troubleshooting
+- **Cross-version support**: Applied fixes to both main and no-dependencies Streamlit apps
+
+### 2024-07-08: Comprehensive HTML Management System & Naming Convention
+- **Multi-source HTML support**: Support for HTML from multiple sources with clear naming convention
+- **HTML file types**:
+  - `fulltext.raw.html` (provided by publisher/repo)
+  - `fulltext.xml.html` (converted from XML using JATS4R or Simple HTML Converter)
+  - `fulltext.pdf.html` (converted from PDF)
+  - `fulltext.doc.html` (converted from DOC)
+  - `html_with_ids.html` (cleaned and enhanced for analysis)
+- **JATS4R integration**: XML to HTML conversion with automatic setup and batch processing
+- **PDF/DOC conversion**: Multiple converter support (pdfplumber, PyMuPDF, pandoc, etc.)
+- **HTML enhancement**: Automatic creation of enhanced HTML with section IDs and cleaned structure
+- **CLI commands**:
+  - `--fulltext_html`: Convert XML to HTML during download
+  - `--convert_html`: Retrospective XML to HTML conversion
+  - `--process_html`: Process all HTML files in corpus (convert PDFs/DOCs, create enhanced versions)
+  - `--enhance_html`: Create enhanced HTML from existing HTML files
+- **Priority system**: Enhanced > XML > Raw > PDF > DOC (best HTML selection)
+- **Naming convention**: HTML files are named with their source type (e.g., `fulltext.xml.html` for XML conversions)
+- **Data tables integration**: Updated to show all HTML file types and enhanced HTML status
+- **Cross-version support**: Full functionality in main app, placeholder in no-dependencies version
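The priority order listed above (Enhanced > XML > Raw > PDF > DOC) can be expressed as an ordered lookup. This is a minimal sketch: `pick_best_html` and `HTML_PRIORITY` are hypothetical names, and only the filenames come from the log entry.

```python
from pathlib import Path
from typing import Iterable, Optional

# Highest priority first, following "Enhanced > XML > Raw > PDF > DOC".
HTML_PRIORITY = [
    "html_with_ids.html",  # enhanced
    "fulltext.xml.html",   # converted from XML
    "fulltext.raw.html",   # provided by publisher/repo
    "fulltext.pdf.html",   # converted from PDF
    "fulltext.doc.html",   # converted from DOC
]


def pick_best_html(files: Iterable[str]) -> Optional[str]:
    """Return the highest-priority HTML file among `files`, or None."""
    by_name = {Path(f).name: f for f in files}
    for name in HTML_PRIORITY:
        if name in by_name:
            return by_name[name]
    return None
```

Comparing base names exactly, not by substring, avoids confusing `fulltext.pdf` with `fulltext.pdf.html`.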
+
+### 2024-07-08: Download Safety Measures
+- **Enhanced validation**: Added comprehensive validation to compare requested vs actual papers downloaded
+- **Safety limits**: Implemented hard limits (max 200 papers) and warnings (above 100 papers) in the UI
+- **Debug output**: Added detailed command execution logging for troubleshooting
+- **Error handling**: Improved error messages and warnings for limit violations
+- **Cross-version support**: All safety measures implemented in both Streamlit app versions
+
+### 2024-07-08: Progress Tracking Enhancement
+- **Real-time progress**: Implemented live progress tracking with subprocess output capture
+- **Dynamic progress bar**: Beautiful gradient progress bar with animated emojis and color transitions
+- **File type counters**: Real-time counters for JSON, XML, PDF, and supplementary files
+- **Operation indicators**: Dynamic operation emojis (🔍 Searching, 📥 Downloading, 💾 Writing)
+- **Recent activity display**: Color-coded recent output with styled containers
+- **Cross-version support**: Progress tracking implemented in both main and no-dependencies apps
+
+### 2024-07-08: Repository Naming and Paper Count Improvements
+- **Repository naming**: Now uses directory names as default repository names for better identification
+- **Paper count distinction**: Clear distinction between downloaded papers and total available papers in repository
+- **Download progress tracking**: Shows download percentage (e.g., "50 / 100 papers" with 50% progress)
+- **Enhanced statistics**: Separate metrics for downloaded vs total available papers
+- **Multiple instance management**: Improved run script to handle port conflicts and manage multiple Streamlit instances
+- **Cross-version support**: All improvements implemented in both main and no-dependencies versions
+
+### 2024-07-07: Streamlit UI Development
+- **Initial implementation**: Created comprehensive Streamlit web interface for pygetpapers
+- **Repository support**: Full support for Europe PMC, arXiv, Crossref, OpenAlex, bioRxiv, medRxiv, Rxivist
+- **Query builder**: Advanced query builder with Boolean operators and field-specific search
+- **Corpus management**: Complete corpus management with statistics and visualization
+- **Data tables**: Interactive HTML tables with datatables integration
+- **Figures gallery**: Automatic figure extraction and thumbnail generation
+- **Fulltext search**: Advanced search within paper content
+- **Corpus comparison**: Multi-corpus comparison and overlap analysis
+- **Export functionality**: CSV export and data visualization
+- **Settings and help**: Comprehensive settings and help documentation
+
+### 2024-07-07: CI/CD and Import Fixes
+- **Fixed import errors**: Resolved subprocess path issues and sys.path hacks
+- **CI configuration**: Added pytest-cov installation and proper test configuration
+- **Test fixes**: Fixed Arxiv API usage and zip file handling tests
+- **Formatting**: Applied Black formatting to test files
+- **Streamlit CI**: Added syntax checking without server startup to prevent hanging
+- **CLI availability**: Fixed CLI command availability in CI environment
+
+## Technical Details
+
+### Progress Tracking Implementation
+- Uses `subprocess.Popen` with real-time output capture
+- Parses progress information from pygetpapers output
+- Updates Streamlit components with `placeholder.empty()` for smooth animations
+- Handles both tqdm progress bars and custom output parsing
+- Implements graceful error handling and timeout management
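The `subprocess.Popen` capture named in the list above might be sketched as follows. `run_with_live_output` and its `on_line` callback are hypothetical names. The sketch leaves Streamlit out so it runs standalone, and it omits the timeout handling the log mentions.

```python
import subprocess
import sys
from typing import Callable, List


def run_with_live_output(cmd: List[str], on_line: Callable[[str], None]) -> int:
    """Run `cmd`, passing each non-empty output line to `on_line` as it arrives."""
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so tqdm output is seen too
        text=True,
        bufsize=1,                 # line-buffered in text mode
    ) as proc:
        for raw in proc.stdout:
            line = raw.rstrip()
            if line:
                on_line(line)
    # Leaving the `with` block waits for the process, so returncode is set.
    return proc.returncode


if __name__ == "__main__":
    # Stand-in child process; a real call would pass the pygetpapers command.
    code = run_with_live_output(
        [sys.executable, "-c", "print('searching'); print('done')"],
        print,
    )
    print("exit code:", code)
```

In a Streamlit app, `on_line` could write into a placeholder created with `st.empty()`, so each new line replaces the previous display instead of appending to the page.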
+
+### Corpus Detection Algorithm
+- Scans current directory for pygetpapers output patterns
+- Identifies characteristic files (eupmc_results.json, europe_pmc.csv, etc.)
+- Recognizes paper ID patterns (PMC, arXiv:, doi_)
+- Extracts metadata from JSON files when available
+- Parses timestamps from directory names for creation dates
+- Integrates seamlessly with existing session state management
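The detection steps above can be sketched as a directory scan. The marker filenames and ID prefixes are taken from the list. `looks_like_corpus` and `find_corpora` are hypothetical names, and treating paper IDs as subdirectory name prefixes is an assumption about the on-disk layout.

```python
from pathlib import Path
from typing import List

# Characteristic files and paper-ID prefixes named in the log.
MARKER_FILES = ("eupmc_results.json", "europe_pmc.csv")
PAPER_ID_PREFIXES = ("PMC", "arXiv:", "doi_")


def looks_like_corpus(directory: Path) -> bool:
    """Heuristic: a marker file is present, or a subdirectory looks like a paper ID."""
    if any((directory / name).is_file() for name in MARKER_FILES):
        return True
    return any(
        child.is_dir() and child.name.startswith(PAPER_ID_PREFIXES)
        for child in directory.iterdir()
    )


def find_corpora(root: Path) -> List[Path]:
    """Return immediate subdirectories of `root` that look like pygetpapers output."""
    return sorted(d for d in root.iterdir() if d.is_dir() and looks_like_corpus(d))
```

Calling `find_corpora(Path.cwd())` would match the "scans current directory" behavior described above.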
+
+### Safety Measures
+- Validates actual vs requested paper counts
+- Implements UI limits (100 warning, 200 hard stop)
+- Provides detailed debug output for troubleshooting
+- Shows clear error messages for limit violations
+- Maintains user control while preventing accidental large downloads
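A minimal sketch of the limits listed above follows, assuming "above 100" triggers a warning and "max 200" means a request for exactly 200 is still allowed. The function names and message wording are hypothetical.

```python
from typing import Tuple

WARN_LIMIT = 100  # warn above this many papers
HARD_LIMIT = 200  # refuse above this many papers


def check_download_limit(requested: int) -> Tuple[bool, str]:
    """Return (allowed, message); the message is empty when nothing needs saying."""
    if requested <= 0:
        return False, "Requested paper count must be positive."
    if requested > HARD_LIMIT:
        return False, f"Refusing {requested} papers: the hard limit is {HARD_LIMIT}."
    if requested > WARN_LIMIT:
        return True, f"Warning: {requested} papers exceeds {WARN_LIMIT}; this may be slow."
    return True, ""


def validate_download(requested: int, actual: int) -> str:
    """Compare requested and actual counts and describe any mismatch."""
    if actual == requested:
        return ""
    if actual < requested:
        return f"Only {actual} of {requested} requested papers were downloaded."
    return f"{actual} papers were downloaded, more than the {requested} requested."
```

Returning a message alongside the boolean lets the UI decide whether to show it as a warning or an error.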
+
+## Migration Guide
+
+### For Existing Users
+1. **Automatic Detection**: Existing corpora will be automatically detected on next app startup
+2. **Manual Refresh**: Use "🔄 Refresh Corpus List" button to detect new downloads
+3. **Progress Tracking**: New downloads will show enhanced progress display
+4. **Safety Limits**: Be aware of new download limits (max 200 papers)
+
+### For New Users
+1. **Installation**: Follow standard pygetpapers installation
+2. **Streamlit Setup**: Install Streamlit and run `streamlit run streamlit_app.py`
+3. **First Use**: Download papers through the web interface
+4. **Corpus Management**: Use the Corpus Manager to view and analyze downloads
+
+## Future Enhancements
+
+### Planned Features
+- **Batch operations**: Select multiple papers for bulk operations
+- **Advanced filtering**: Filter papers by date, journal, author, etc.
+- **Citation export**: Export citations in various formats (BibTeX, EndNote, etc.)
+- **Collaborative features**: Share corpora and queries with other users
+- **Advanced analytics**: More sophisticated corpus analysis and visualization
+- **API integration**: Direct integration with external analysis tools
+
+### Technical Improvements
+- **Performance optimization**: Faster corpus scanning and data loading
+- **Memory management**: Better handling of large corpora
+- **Caching**: Implement intelligent caching for frequently accessed data
+- **Offline mode**: Support for offline corpus analysis
+- **Mobile optimization**: Better mobile device support
+
 ## Team Feedback - [Date]

 ### Issues to Address:
@@ -12,6 +157,7 @@
    - ❌ Highlight the query box for better visibility
    - ❌ Reduce default paper downloading from current limit to 10
    - ❌ Create version without external dependencies (no plotly requirement)
+   - ❌ Add visual progress indicators for downloads

 3. **Functionality Issues:**
    - ❌ Plot doesn't show numbers of papers and years
@@ -37,6 +183,7 @@
 - ✅ Highlighted query box with prominent styling
 - ✅ Created migration guide (MIGRATION_GUIDE.md)
 - ✅ Created no-dependencies version (streamlit_app_no_deps.py)
+- ✅ Added real-time progress tracking for downloads

 ### In Progress 🔄
 - Fixing journal name display issue
@@ -56,14 +203,21 @@
 - Corpus location: ✅ Files saved to `{repo_name}_{timestamp}` in current directory (e.g., `europe_pmc_20250121_143022`)
 - Journal name issue: ✅ Fixed - was looking for `journalTitle` but should look for `journalInfo.journal.title`
 - Query box: ✅ Highlighted with prominent styling and help text
+- Progress tracking: ✅ Real-time progress indicators showing:
+  - Current operation (Searching, Downloading, Writing)
+  - Overall progress bar with paper count
+  - File type counters (JSON, XML, PDF, Supplementary)
+  - Recent output lines
+  - Works in both main and no-dependencies versions
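The `{repo_name}_{timestamp}` directory naming noted above lends itself to a small parser, for example when deriving a corpus creation date. The sketch assumes the timestamp format is `YYYYMMDD_HHMMSS`, inferred only from the example `europe_pmc_20250121_143022`. `parse_corpus_dir` is a hypothetical name.

```python
import re
from datetime import datetime
from typing import Optional, Tuple

# Repo name, underscore, then an assumed YYYYMMDD_HHMMSS timestamp.
DIR_PATTERN = re.compile(r"^(?P<repo>.+)_(?P<stamp>\d{8}_\d{6})$")


def parse_corpus_dir(name: str) -> Optional[Tuple[str, datetime]]:
    """Split a corpus directory name into (repo_name, creation time), or None."""
    match = DIR_PATTERN.match(name)
    if match is None:
        return None
    try:
        created = datetime.strptime(match.group("stamp"), "%Y%m%d_%H%M%S")
    except ValueError:
        # Digits in the right shape but not a real date (e.g. month 13).
        return None
    return match.group("repo"), created
```

Anchoring the timestamp at the end of the name lets repository names that themselves contain underscores, such as `europe_pmc`, parse correctly.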

 ## Next Steps:
 1. Remove `# noqa` comments
 2. Create migration guide
 3. Highlight query box
 4. Reduce default limit to 10
 5. Create no-dependencies version
-6. Fix plot functionality
-7. Fix journal name display
-8. Document corpus location
-9. Plan filter implementation
+6. Add progress tracking for downloads
+7. Fix plot functionality
+8. Fix journal name display
+9. Document corpus location
+10. Plan filter implementation

datatables_integration.py

Lines changed: 37 additions & 3 deletions
@@ -167,14 +167,14 @@ def create_papers_table(
 # Extract key information
 title = metadata.get("title", paper["directory"])
 authors = metadata.get("authorString", "Unknown")
-
+
 # Extract journal name from nested structure
 journal = "Unknown"
 if "journalInfo" in metadata and "journal" in metadata["journalInfo"]:
     journal = metadata["journalInfo"]["journal"].get("title", "Unknown")
 elif "journalTitle" in metadata:
     journal = metadata["journalTitle"]
-
+
 doi = metadata.get("doi", "")
 pmid = metadata.get("pmid", "")
 pmcid = metadata.get("pmcid", "")
@@ -185,6 +185,13 @@ def create_papers_table(
 has_pdf = any("fulltext.pdf" in f for f in paper["files"])
 has_supp = any("supplementary" in f for f in paper["files"])

+# Check HTML file types
+has_raw_html = any("fulltext.raw.html" in f for f in paper["files"])
+has_xml_html = any("fulltext.xml.html" in f for f in paper["files"])
+has_pdf_html = any("fulltext.pdf.html" in f for f in paper["files"])
+has_doc_html = any("fulltext.doc.html" in f for f in paper["files"])
+has_enhanced_html = any("html_with_ids.html" in f for f in paper["files"])
+
 # Create hyperlinks
 doi_link = f"https://doi.org/{doi}" if doi else ""
 pmid_link = f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/" if pmid else ""
@@ -223,6 +230,18 @@ def create_papers_table(
     "XML": "✅" if has_xml else "❌",
     "PDF": "✅" if has_pdf else "❌",
     "Suppl": "✅" if has_supp else "❌",
+    "HTML": (
+        "✅"
+        if (
+            has_raw_html
+            or has_xml_html
+            or has_pdf_html
+            or has_doc_html
+            or has_enhanced_html
+        )
+        else "❌"
+    ),
+    "Enhanced": "✅" if has_enhanced_html else "❌",
     "Files": len(paper["files"]),
 }
 table_data.append(row)
@@ -449,7 +468,7 @@ def export_table_to_csv(
     journal = metadata["journalInfo"]["journal"].get("title", "")
 elif "journalTitle" in metadata:
     journal = metadata["journalTitle"]
-
+
 row = {
     "ID": paper["directory"],
     "Title": metadata.get("title", ""),
@@ -466,6 +485,21 @@ def export_table_to_csv(
     "Has_Supplementary": any(
         "supplementary" in f for f in paper["files"]
     ),
+    "Has_Raw_HTML": any(
+        "fulltext.raw.html" in f for f in paper["files"]
+    ),
+    "Has_XML_HTML": any(
+        "fulltext.xml.html" in f for f in paper["files"]
+    ),
+    "Has_PDF_HTML": any(
+        "fulltext.pdf.html" in f for f in paper["files"]
+    ),
+    "Has_DOC_HTML": any(
+        "fulltext.doc.html" in f for f in paper["files"]
+    ),
+    "Has_Enhanced_HTML": any(
+        "html_with_ids.html" in f for f in paper["files"]
+    ),
     "File_Count": len(paper["files"]),
 }
 csv_data.append(row)
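Both `create_papers_table` and `export_table_to_csv` repeat the same per-file checks, and substring tests such as `"fulltext.pdf" in f` also match `fulltext.pdf.html`. One possible consolidation, not part of this commit, is sketched below. `html_flags`, `has_exact`, and `HTML_VARIANTS` are hypothetical names, and the sketch assumes each entry in `paper["files"]` is a path string.

```python
from pathlib import Path
from typing import Dict, Iterable

# CSV column name -> exact filename, mirroring the keys added in the diff above.
HTML_VARIANTS = {
    "Has_Raw_HTML": "fulltext.raw.html",
    "Has_XML_HTML": "fulltext.xml.html",
    "Has_PDF_HTML": "fulltext.pdf.html",
    "Has_DOC_HTML": "fulltext.doc.html",
    "Has_Enhanced_HTML": "html_with_ids.html",
}


def html_flags(files: Iterable[str]) -> Dict[str, bool]:
    """Compute every HTML presence flag in one pass, using exact base names."""
    names = {Path(f).name for f in files}
    return {key: fname in names for key, fname in HTML_VARIANTS.items()}


def has_exact(files: Iterable[str], filename: str) -> bool:
    """Exact base-name match, so 'fulltext.pdf' does not match 'fulltext.pdf.html'."""
    return any(Path(f).name == filename for f in files)
```

With such a helper, the CSV row could be extended via `row.update(html_flags(paper["files"]))`, and the table's "HTML" column could use `any(html_flags(paper["files"]).values())`.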
