|
1 | 1 | # Pygetpapers Streamlit UI Development Log |
2 | 2 |
|
| 3 | +## Latest Updates |
| 4 | + |
| 5 | +### 2024-07-08: Automatic Corpus Detection |
| 6 | +- **Added automatic corpus detection**: The app now automatically scans for existing pygetpapers output directories on startup |
| 7 | +- **Smart directory recognition**: Detects corpora by looking for characteristic files (eupmc_results.json, europe_pmc.csv, etc.) and paper ID patterns (PMC, arXiv:, etc.) |
| 8 | +- **Metadata extraction**: Automatically extracts corpus information including API type, paper count, creation date, and original query |
| 9 | +- **Manual refresh**: Added "🔄 Refresh Corpus List" button to manually scan for newly downloaded corpora |
| 10 | +- **Session state integration**: Auto-detected corpora are seamlessly integrated into the existing corpus management system |
| 11 | +- **Cross-version support**: Feature implemented in both main and no-dependencies Streamlit apps |
| 12 | + |
| 13 | +### 2024-07-08: Progress Display Improvements |
| 14 | +- **Complete tqdm filtering**: Removed tqdm progress bars from the displayed stdout and stderr output |
| 15 | +- **Clean output display**: Filtered output now shows only meaningful information without redundant progress bars |
| 16 | +- **Meaningful activity display**: Enhanced recent activity section to show only meaningful output lines |
| 17 | +- **Smart filtering**: Automatically filters out debug messages, timestamps, and other noise |
| 18 | +- **Better color coding**: Improved visual feedback with color-coded messages (green for success, red for errors, etc.) |
| 19 | +- **Fallback indicators**: Added fallback messages when no meaningful output is available |
| 20 | +- **Cross-version support**: Applied filtering to both main and no-dependencies Streamlit apps |
| 21 | + |
| 22 | +### 2024-07-08: Quick Stats Fix |
| 23 | +- **Fixed stats updating**: Quick stats now properly update when papers are downloaded |
| 24 | +- **Auto-detection stats**: Stats are updated when existing corpora are auto-detected on startup |
| 25 | +- **Accurate paper counts**: Corpus entries now record the number of papers actually downloaded instead of the requested limit |
| 26 | +- **Debug information**: Added debug output to show stats updates for troubleshooting |
| 27 | +- **Cross-version support**: Applied fixes to both main and no-dependencies Streamlit apps |
| 28 | + |
| 29 | +### 2024-07-08: Comprehensive HTML Management System & Naming Convention |
| 30 | +- **Multi-source HTML support**: Support for HTML from multiple sources with clear naming convention |
| 31 | +- **HTML file types**: |
| 32 | + - `fulltext.raw.html` (provided by publisher/repo) |
| 33 | + - `fulltext.xml.html` (converted from XML using JATS4R or Simple HTML Converter) |
| 34 | + - `fulltext.pdf.html` (converted from PDF) |
| 35 | + - `fulltext.doc.html` (converted from DOC) |
| 36 | + - `html_with_ids.html` (cleaned and enhanced for analysis) |
| 37 | +- **JATS4R integration**: XML to HTML conversion with automatic setup and batch processing |
| 38 | +- **PDF/DOC conversion**: Multiple converter support (pdfplumber, PyMuPDF, pandoc, etc.) |
| 39 | +- **HTML enhancement**: Automatic creation of enhanced HTML with section IDs and cleaned structure |
| 40 | +- **CLI commands**: |
| 41 | + - `--fulltext_html`: Convert XML to HTML during download |
| 42 | + - `--convert_html`: Retrospective XML to HTML conversion |
| 43 | + - `--process_html`: Process all HTML files in corpus (convert PDFs/DOCs, create enhanced versions) |
| 44 | + - `--enhance_html`: Create enhanced HTML from existing HTML files |
| 45 | +- **Priority system**: Enhanced > XML > Raw > PDF > DOC (best HTML selection) |
| 46 | +- **Naming convention**: HTML files are named with their source type (e.g., `fulltext.xml.html` for XML conversions) |
| 47 | +- **Data tables integration**: Updated to show all HTML file types and enhanced HTML status |
| 48 | +- **Cross-version support**: Full functionality in main app, placeholder in no-dependencies version |
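
The priority order above (Enhanced > XML > Raw > PDF > DOC) can be illustrated with a minimal sketch. The function name `best_html` and the one-folder-per-paper layout are assumptions made for illustration; only the file names and their order come from this log.

```python
from pathlib import Path

# File names in priority order, as listed in this log:
# Enhanced > XML > Raw > PDF > DOC.
HTML_PRIORITY = [
    "html_with_ids.html",
    "fulltext.xml.html",
    "fulltext.raw.html",
    "fulltext.pdf.html",
    "fulltext.doc.html",
]


def best_html(paper_dir):
    """Return the highest-priority HTML file present in paper_dir, or None.

    Hypothetical helper: assumes each paper has its own directory.
    """
    paper_dir = Path(paper_dir)
    for name in HTML_PRIORITY:
        candidate = paper_dir / name
        if candidate.is_file():
            return candidate
    return None
```

Keeping the order in a single list means a new source type only needs one new entry at the right position.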
| 49 | + |
| 50 | +### 2024-07-08: Download Safety Measures |
| 51 | +- **Enhanced validation**: Added validation that compares the requested paper count against the number actually downloaded |
| 52 | +- **Safety limits**: Implemented hard limits (max 200 papers) and warnings (above 100 papers) in the UI |
| 53 | +- **Debug output**: Added detailed command execution logging for troubleshooting |
| 54 | +- **Error handling**: Improved error messages and warnings for limit violations |
| 55 | +- **Cross-version support**: All safety measures implemented in both Streamlit app versions |
| 56 | + |
| 57 | +### 2024-07-08: Progress Tracking Enhancement |
| 58 | +- **Real-time progress**: Implemented live progress tracking with subprocess output capture |
| 59 | +- **Dynamic progress bar**: Gradient progress bar with animated emojis and color transitions |
| 60 | +- **File type counters**: Real-time counters for JSON, XML, PDF, and supplementary files |
| 61 | +- **Operation indicators**: Dynamic operation emojis (🔍 Searching, 📥 Downloading, 💾 Writing) |
| 62 | +- **Recent activity display**: Color-coded recent output with styled containers |
| 63 | +- **Cross-version support**: Progress tracking implemented in both main and no-dependencies apps |
| 64 | + |
| 65 | +### 2024-07-08: Repository Naming and Paper Count Improvements |
| 66 | +- **Repository naming**: Now uses directory names as default repository names for better identification |
| 67 | +- **Paper count distinction**: Clear distinction between downloaded papers and total available papers in repository |
| 68 | +- **Download progress tracking**: Shows download percentage (e.g., "50 / 100 papers" with 50% progress) |
| 69 | +- **Enhanced statistics**: Separate metrics for downloaded vs total available papers |
| 70 | +- **Multiple instance management**: Improved run script to handle port conflicts and manage multiple Streamlit instances |
| 71 | +- **Cross-version support**: All improvements implemented in both main and no-dependencies versions |
| 72 | + |
| 73 | +### 2024-07-07: Streamlit UI Development |
| 74 | +- **Initial implementation**: Created comprehensive Streamlit web interface for pygetpapers |
| 75 | +- **Repository support**: Full support for Europe PMC, arXiv, Crossref, OpenAlex, bioRxiv, medRxiv, Rxivist |
| 76 | +- **Query builder**: Advanced query builder with Boolean operators and field-specific search |
| 77 | +- **Corpus management**: Complete corpus management with statistics and visualization |
| 78 | +- **Data tables**: Interactive HTML tables with DataTables integration |
| 79 | +- **Figures gallery**: Automatic figure extraction and thumbnail generation |
| 80 | +- **Fulltext search**: Advanced search within paper content |
| 81 | +- **Corpus comparison**: Multi-corpus comparison and overlap analysis |
| 82 | +- **Export functionality**: CSV export and data visualization |
| 83 | +- **Settings and help**: Comprehensive settings and help documentation |
| 84 | + |
| 85 | +### 2024-07-07: CI/CD and Import Fixes |
| 86 | +- **Fixed import errors**: Resolved subprocess path issues and removed sys.path hacks |
| 87 | +- **CI configuration**: Added pytest-cov installation and proper test configuration |
| 88 | +- **Test fixes**: Fixed arXiv API usage and zip file handling tests |
| 89 | +- **Formatting**: Applied Black formatting to test files |
| 90 | +- **Streamlit CI**: Added syntax checking without server startup to prevent hanging |
| 91 | +- **CLI availability**: Fixed CLI command availability in CI environment |
| 92 | + |
| 93 | +## Technical Details |
| 94 | + |
| 95 | +### Progress Tracking Implementation |
| 96 | +- Uses `subprocess.Popen` with real-time output capture |
| 97 | +- Parses progress information from pygetpapers output |
| 98 | +- Updates Streamlit components in place through `st.empty()` placeholders for smooth animations |
| 99 | +- Handles both tqdm progress bars and custom output parsing |
| 100 | +- Implements graceful error handling and timeout management |
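
As a rough illustration of the capture-and-filter approach described above, the sketch below streams a child process's output line by line and drops lines that look like tqdm bars. The names `stream_meaningful_lines` and `is_tqdm_line` and the regular expression are assumptions, not the app's actual code, and the Streamlit update step is left out.

```python
import re
import subprocess
import sys

# Assumed shape of a tqdm bar, e.g. "50%|#####     | 5/10 [00:01<00:01, 4.5it/s]".
TQDM_LINE = re.compile(r"^\s*\d{1,3}%\|.*\|\s*\d+/\d+")


def is_tqdm_line(line):
    """Return True if the line looks like a tqdm progress bar."""
    return bool(TQDM_LINE.search(line))


def stream_meaningful_lines(cmd):
    """Run cmd and yield its non-empty, non-tqdm output lines as they arrive."""
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so both streams are filtered
        text=True,
        bufsize=1,  # line-buffered reads on the parent side
    ) as proc:
        for raw in proc.stdout:
            line = raw.strip()
            if line and not is_tqdm_line(line):
                yield line


if __name__ == "__main__":
    # Stand-in child process; a real call would pass the pygetpapers command.
    demo = [sys.executable, "-c", "print('Searching'); print('done')"]
    for line in stream_meaningful_lines(demo):
        print(line)
```

In the app, each yielded line would be written into a placeholder rather than printed; a timeout would need extra handling that this sketch does not show.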
| 101 | + |
| 102 | +### Corpus Detection Algorithm |
| 103 | +- Scans current directory for pygetpapers output patterns |
| 104 | +- Identifies characteristic files (eupmc_results.json, europe_pmc.csv, etc.) |
| 105 | +- Recognizes paper ID patterns (PMC, arXiv:, doi_) |
| 106 | +- Extracts metadata from JSON files when available |
| 107 | +- Parses timestamps from directory names for creation dates |
| 108 | +- Integrates seamlessly with existing session state management |
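
The detection steps above could be sketched roughly as follows. The marker files, ID prefixes, and the `{repo_name}_{timestamp}` directory naming come from this log; the function names and the exact matching rules are assumptions.

```python
import re
from datetime import datetime
from pathlib import Path

# Characteristic files and paper-ID prefixes named in this log.
MARKER_FILES = {"eupmc_results.json", "europe_pmc.csv"}
ID_PREFIXES = ("PMC", "arXiv:", "doi_")
# Assumed timestamp suffix, e.g. "europe_pmc_20250121_143022".
TIMESTAMP = re.compile(r"_(\d{8}_\d{6})$")


def looks_like_corpus(directory):
    """Return True if directory holds a marker file or a paper-ID subfolder."""
    entries = list(directory.iterdir())
    if MARKER_FILES.intersection(p.name for p in entries):
        return True
    return any(p.is_dir() and p.name.startswith(ID_PREFIXES) for p in entries)


def creation_time(directory):
    """Parse the creation date from the directory name, or return None."""
    match = TIMESTAMP.search(directory.name)
    if not match:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")
    except ValueError:
        return None


def find_corpora(root):
    """Return the subdirectories of root that look like corpora, sorted."""
    root = Path(root)
    return sorted(d for d in root.iterdir() if d.is_dir() and looks_like_corpus(d))
```

Calling `find_corpora(".")` scans the current directory; metadata extraction from the JSON files would be a separate step.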
| 109 | + |
| 110 | +### Safety Measures |
| 111 | +- Validates actual vs requested paper counts |
| 112 | +- Implements UI limits (100 warning, 200 hard stop) |
| 113 | +- Provides detailed debug output for troubleshooting |
| 114 | +- Shows clear error messages for limit violations |
| 115 | +- Maintains user control while preventing accidental large downloads |
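
A minimal sketch of the limit check and the requested-versus-actual validation, using the thresholds stated above (warning above 100, hard stop at 200). The function names and message wording are hypothetical.

```python
WARN_ABOVE = 100  # from this log: warn above 100 papers
HARD_LIMIT = 200  # from this log: hard stop at 200 papers


def check_requested_limit(requested):
    """Classify a requested paper count as 'ok', 'warn', or 'block'."""
    if requested > HARD_LIMIT:
        return "block"
    if requested > WARN_ABOVE:
        return "warn"
    return "ok"


def validate_download(requested, downloaded):
    """Return warning messages when the downloaded count differs from the request."""
    messages = []
    if downloaded > requested:
        messages.append(
            f"Downloaded {downloaded} papers but only {requested} were requested"
        )
    elif downloaded < requested:
        messages.append(
            f"Only {downloaded} of {requested} requested papers were downloaded"
        )
    return messages
```

Returning a status string rather than raising keeps the decision about how to present the warning in the UI layer.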
| 116 | + |
| 117 | +## Migration Guide |
| 118 | + |
| 119 | +### For Existing Users |
| 120 | +1. **Automatic Detection**: Existing corpora will be automatically detected on next app startup |
| 121 | +2. **Manual Refresh**: Use "🔄 Refresh Corpus List" button to detect new downloads |
| 122 | +3. **Progress Tracking**: New downloads will show enhanced progress display |
| 123 | +4. **Safety Limits**: Be aware of new download limits (max 200 papers) |
| 124 | + |
| 125 | +### For New Users |
| 126 | +1. **Installation**: Follow standard pygetpapers installation |
| 127 | +2. **Streamlit Setup**: Install Streamlit and run `streamlit run streamlit_app.py` |
| 128 | +3. **First Use**: Download papers through the web interface |
| 129 | +4. **Corpus Management**: Use the Corpus Manager to view and analyze downloads |
| 130 | + |
| 131 | +## Future Enhancements |
| 132 | + |
| 133 | +### Planned Features |
| 134 | +- **Batch operations**: Select multiple papers for bulk operations |
| 135 | +- **Advanced filtering**: Filter papers by date, journal, author, etc. |
| 136 | +- **Citation export**: Export citations in various formats (BibTeX, EndNote, etc.) |
| 137 | +- **Collaborative features**: Share corpora and queries with other users |
| 138 | +- **Advanced analytics**: More sophisticated corpus analysis and visualization |
| 139 | +- **API integration**: Direct integration with external analysis tools |
| 140 | + |
| 141 | +### Technical Improvements |
| 142 | +- **Performance optimization**: Faster corpus scanning and data loading |
| 143 | +- **Memory management**: Better handling of large corpora |
| 144 | +- **Caching**: Implement intelligent caching for frequently accessed data |
| 145 | +- **Offline mode**: Support for offline corpus analysis |
| 146 | +- **Mobile optimization**: Better mobile device support |
| 147 | + |
3 | 148 | ## Team Feedback - [Date] |
4 | 149 |
|
5 | 150 | ### Issues to Address: |
|
12 | 157 | - ❌ Highlight the query box for better visibility |
13 | 158 | - ❌ Reduce default paper downloading from current limit to 10 |
14 | 159 | - ❌ Create version without external dependencies (no plotly requirement) |
| 160 | + - ❌ Add visual progress indicators for downloads |
15 | 161 |
|
16 | 162 | 3. **Functionality Issues:** |
17 | 163 | - ❌ Plot doesn't show numbers of papers and years |
|
37 | 183 | - ✅ Highlighted query box with prominent styling |
38 | 184 | - ✅ Created migration guide (MIGRATION_GUIDE.md) |
39 | 185 | - ✅ Created no-dependencies version (streamlit_app_no_deps.py) |
| 186 | +- ✅ Added real-time progress tracking for downloads |
40 | 187 |
|
41 | 188 | ### In Progress 🔄 |
42 | 189 | - Fixing journal name display issue |
|
56 | 203 | - Corpus location: ✅ Files saved to `{repo_name}_{timestamp}` in current directory (e.g., `europe_pmc_20250121_143022`) |
57 | 204 | - Journal name issue: ✅ Fixed - was looking for `journalTitle` but should look for `journalInfo.journal.title` |
58 | 205 | - Query box: ✅ Highlighted with prominent styling and help text |
| 206 | +- Progress tracking: ✅ Real-time progress indicators showing: |
| 207 | + - Current operation (Searching, Downloading, Writing) |
| 208 | + - Overall progress bar with paper count |
| 209 | + - File type counters (JSON, XML, PDF, Supplementary) |
| 210 | + - Recent output lines |
| 211 | + - Works in both main and no-dependencies versions |
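
The journal name fix noted above amounts to reading a nested field instead of a top-level one. A defensive lookup might look like the sketch below; the helper name `journal_title` and the `"Unknown"` fallback are assumptions, while the key path `journalInfo.journal.title` is taken from the note.

```python
def journal_title(record, default="Unknown"):
    """Read journalInfo.journal.title from a result record, with a fallback."""
    node = record
    for key in ("journalInfo", "journal", "title"):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node if isinstance(node, str) else default
```

Walking the path key by key avoids a `KeyError` for records that lack journal information.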
59 | 212 |
|
60 | 213 | ## Next Steps: |
61 | 214 | 1. Remove `# noqa` comments |
62 | 215 | 2. Create migration guide |
63 | 216 | 3. Highlight query box |
64 | 217 | 4. Reduce default limit to 10 |
65 | 218 | 5. Create no-dependencies version |
66 | | -6. Fix plot functionality |
67 | | -7. Fix journal name display |
68 | | -8. Document corpus location |
69 | | -9. Plan filter implementation |
| 219 | +6. Add progress tracking for downloads |
| 220 | +7. Fix plot functionality |
| 221 | +8. Fix journal name display |
| 222 | +9. Document corpus location |
| 223 | +10. Plan filter implementation |