petermr
diff --git a/LOG.md b/LOG.md
Lines changed: 158 additions & 4 deletions
diff --git a/datatables_integration.py b/datatables_integration.py
Lines changed: 37 additions & 3 deletions
@@ -1,5 +1,150 @@
 # Pygetpapers Streamlit UI Development Log
 
+## Latest Updates
+
+### 2024-07-08: Automatic Corpus Detection
+- **Added automatic corpus detection**: The app now automatically scans for existing pygetpapers output directories on startup
+- **Smart directory recognition**: Detects corpora by looking for characteristic files (eupmc_results.json, europe_pmc.csv, etc.) and paper ID patterns (PMC, arXiv:, etc.)
+- **Metadata extraction**: Automatically extracts corpus information including API type, paper count, creation date, and original query
+- **Manual refresh**: Added "🔄 Refresh Corpus List" button to manually scan for newly downloaded corpora
+- **Session state integration**: Auto-detected corpora are seamlessly integrated into the existing corpus management system
+- **Cross-version support**: Feature implemented in both main and no-dependencies Streamlit apps
+
+### 2024-07-08: Progress Display Improvements
+- **Complete tqdm filtering**: Removed tqdm progress bars from the displayed stdout and stderr output
+- **Clean output display**: Filtered output now shows only meaningful information without redundant progress bars
+- **Meaningful activity display**: Enhanced recent activity section to show only meaningful output lines
+- **Smart filtering**: Automatically filters out debug messages, timestamps, and other noise
+- **Better color coding**: Improved visual feedback with color-coded messages (green for success, red for errors, etc.)
+- **Fallback indicators**: Added fallback messages when no meaningful output is available
+- **Cross-version support**: Applied filtering to both main and no-dependencies Streamlit apps
+
+### 2024-07-08: Quick Stats Fix
+- **Fixed stats updating**: Quick stats now properly update when papers are downloaded
+- **Auto-detection stats**: Stats are updated when existing corpora are auto-detected on startup
+- **Accurate paper counts**: Corpus entries now use the actual number of downloaded papers instead of the requested limit
+- **Debug information**: Added debug output to show stats updates for troubleshooting
+- **Cross-version support**: Applied fixes to both main and no-dependencies Streamlit apps
+
+### 2024-07-08: Comprehensive HTML Management System & Naming Convention
+- **Multi-source HTML support**: Support for HTML from multiple sources with clear naming convention
+- **HTML file types**: 
+  - `fulltext.raw.html` (provided by publisher/repo)
+  - `fulltext.xml.html` (converted from XML using JATS4R or Simple HTML Converter)
+  - `fulltext.pdf.html` (converted from PDF)
+  - `fulltext.doc.html` (converted from DOC)
+  - `html_with_ids.html` (cleaned and enhanced for analysis)
+- **JATS4R integration**: XML to HTML conversion with automatic setup and batch processing
+- **PDF/DOC conversion**: Multiple converter support (pdfplumber, PyMuPDF, pandoc, etc.)
+- **HTML enhancement**: Automatic creation of enhanced HTML with section IDs and cleaned structure
+- **CLI commands**:
+  - `--fulltext_html`: Convert XML to HTML during download
+  - `--convert_html`: Retrospective XML to HTML conversion
+  - `--process_html`: Process all HTML files in corpus (convert PDFs/DOCs, create enhanced versions)
+  - `--enhance_html`: Create enhanced HTML from existing HTML files
+- **Priority system**: Enhanced > XML > Raw > PDF > DOC (best HTML selection)
+- **Naming convention**: HTML files are named with their source type (e.g., `fulltext.xml.html` for XML conversions)
+- **Data tables integration**: Updated to show all HTML file types and enhanced HTML status
+- **Cross-version support**: Full functionality in main app, placeholder in no-dependencies version
+
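The priority rule in the entry above (Enhanced > XML > Raw > PDF > DOC) can be sketched as a simple ordered lookup. This is a minimal illustration rather than the code in this PR: the helper name `select_best_html` is hypothetical, and it assumes each paper's files are available as a list of path strings ending in the file names listed above.

```python
from typing import Iterable, Optional

# Priority order taken from the log entry: Enhanced > XML > Raw > PDF > DOC.
HTML_PRIORITY = [
    "html_with_ids.html",
    "fulltext.xml.html",
    "fulltext.raw.html",
    "fulltext.pdf.html",
    "fulltext.doc.html",
]


def select_best_html(files: Iterable[str]) -> Optional[str]:
    """Return the highest-priority HTML file among `files`, or None.

    Hypothetical helper: the name and signature are assumptions made
    for illustration, not part of the PR.
    """
    files = list(files)
    for name in HTML_PRIORITY:
        for path in files:
            if path.endswith(name):
                return path
    return None
```

For a paper that has both `fulltext.pdf.html` and `fulltext.xml.html`, this returns the XML-derived file; a paper with no HTML at all yields `None`.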
+### 2024-07-08: Download Safety Measures
+- **Enhanced validation**: Added comprehensive validation to compare requested vs actual papers downloaded
+- **Safety limits**: Implemented hard limits (max 200 papers) and warnings (above 100 papers) in the UI
+- **Debug output**: Added detailed command execution logging for troubleshooting
+- **Error handling**: Improved error messages and warnings for limit violations
+- **Cross-version support**: All safety measures implemented in both Streamlit app versions
+
+### 2024-07-08: Progress Tracking Enhancement
+- **Real-time progress**: Implemented live progress tracking with subprocess output capture
+- **Dynamic progress bar**: Beautiful gradient progress bar with animated emojis and color transitions
+- **File type counters**: Real-time counters for JSON, XML, PDF, and supplementary files
+- **Operation indicators**: Dynamic operation emojis (🔍 Searching, 📥 Downloading, 💾 Writing)
+- **Recent activity display**: Color-coded recent output with styled containers
+- **Cross-version support**: Progress tracking implemented in both main and no-dependencies apps
+
+### 2024-07-08: Repository Naming and Paper Count Improvements
+- **Repository naming**: Now uses directory names as default repository names for better identification
+- **Paper count distinction**: Clear distinction between downloaded papers and total available papers in repository
+- **Download progress tracking**: Shows download percentage (e.g., "50 / 100 papers" with 50% progress)
+- **Enhanced statistics**: Separate metrics for downloaded vs total available papers
+- **Multiple instance management**: Improved run script to handle port conflicts and manage multiple Streamlit instances
+- **Cross-version support**: All improvements implemented in both main and no-dependencies versions
+
+### 2024-07-07: Streamlit UI Development
+- **Initial implementation**: Created comprehensive Streamlit web interface for pygetpapers
+- **Repository support**: Full support for Europe PMC, arXiv, Crossref, OpenAlex, bioRxiv, medRxiv, Rxivist
+- **Query builder**: Advanced query builder with Boolean operators and field-specific search
+- **Corpus management**: Complete corpus management with statistics and visualization
+- **Data tables**: Interactive HTML tables with datatables integration
+- **Figures gallery**: Automatic figure extraction and thumbnail generation
+- **Fulltext search**: Advanced search within paper content
+- **Corpus comparison**: Multi-corpus comparison and overlap analysis
+- **Export functionality**: CSV export and data visualization
+- **Settings and help**: Comprehensive settings and help documentation
+
+### 2024-07-07: CI/CD and Import Fixes
+- **Fixed import errors**: Resolved subprocess path issues and sys.path hacks
+- **CI configuration**: Added pytest-cov installation and proper test configuration
+- **Test fixes**: Fixed arXiv API usage and zip file handling tests
+- **Formatting**: Applied Black formatting to test files
+- **Streamlit CI**: Added syntax checking without server startup to prevent hanging
+- **CLI availability**: Fixed CLI command availability in CI environment
+
+## Technical Details
+
+### Progress Tracking Implementation
+- Uses `subprocess.Popen` with real-time output capture
+- Parses progress information from pygetpapers output
+- Updates Streamlit components in place through `st.empty()` placeholders for smooth animations
+- Handles both tqdm progress bars and custom output parsing
+- Implements graceful error handling and timeout management
+
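The capture-and-filter approach described above might look roughly like the following. It is a sketch under assumptions, not the app's actual code: `stream_output`, `is_progress_bar`, and the regular expression are hypothetical, and the pattern is only a heuristic for tqdm's default bar format.

```python
import re
import subprocess
from typing import Iterator, List

# Heuristic for tqdm-style bars such as " 50%|#####     | 5/10 [00:01<00:01]".
# Assumption: the default tqdm bar format; custom formats may not match.
TQDM_PATTERN = re.compile(r"\d+%\|.*\|\s*\d+/\d+")


def is_progress_bar(line: str) -> bool:
    """Return True if `line` looks like a tqdm progress bar."""
    return bool(TQDM_PATTERN.search(line))


def stream_output(cmd: List[str]) -> Iterator[str]:
    """Yield meaningful output lines of `cmd` as they arrive.

    stderr is merged into stdout because tqdm writes there by default.
    In text mode, line endings (including the carriage returns tqdm uses
    to redraw) are converted to newlines, so each redraw arrives as its
    own line and can be dropped individually.
    """
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        bufsize=1,  # line-buffered reads on our side of the pipe
    ) as proc:
        assert proc.stdout is not None
        for raw in proc.stdout:
            line = raw.strip()
            if line and not is_progress_bar(line):
                yield line
```

In a Streamlit app, each yielded line could be written into a placeholder as it arrives. Timeout handling, which the log mentions, is left out of this sketch.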
+### Corpus Detection Algorithm
+- Scans current directory for pygetpapers output patterns
+- Identifies characteristic files (eupmc_results.json, europe_pmc.csv, etc.)
+- Recognizes paper ID patterns (PMC, arXiv:, doi_)
+- Extracts metadata from JSON files when available
+- Parses timestamps from directory names for creation dates
+- Integrates seamlessly with existing session state management
+
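The detection steps above can be sketched as follows. This is an illustration only: `detect_corpora` and `parse_created` are hypothetical names, the marker files and ID prefixes are the ones named in this log, and the timestamp format is inferred from the `{repo_name}_{timestamp}` example elsewhere in the log (e.g. `europe_pmc_20250121_143022`). Metadata extraction from the JSON files is omitted because their layout is not described here.

```python
import re
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional

# Marker files and paper ID prefixes as listed in the log.
MARKER_FILES = ("eupmc_results.json", "europe_pmc.csv")
PAPER_ID_PREFIXES = ("PMC", "arXiv:", "doi_")
# Assumed directory suffix: _YYYYMMDD_HHMMSS
TIMESTAMP_RE = re.compile(r"_(\d{8}_\d{6})$")


def parse_created(dirname: str) -> Optional[datetime]:
    """Parse the creation time from a directory name, if it has one."""
    match = TIMESTAMP_RE.search(dirname)
    if not match:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")
    except ValueError:
        return None


def detect_corpora(root: Path) -> List[Dict]:
    """Return one record per subdirectory of `root` that looks like a corpus."""
    corpora = []
    for directory in sorted(p for p in root.iterdir() if p.is_dir()):
        papers = [
            child.name
            for child in directory.iterdir()
            if child.is_dir() and child.name.startswith(PAPER_ID_PREFIXES)
        ]
        has_marker = any((directory / name).is_file() for name in MARKER_FILES)
        if has_marker or papers:
            corpora.append(
                {
                    "name": directory.name,
                    "path": str(directory),
                    "paper_count": len(papers),
                    "created": parse_created(directory.name),
                }
            )
    return corpora
```

Calling `detect_corpora(Path.cwd())` would mirror the "scan current directory" behaviour; the returned records could then be merged into session state.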
+### Safety Measures
+- Validates actual vs requested paper counts
+- Implements UI limits (100 warning, 200 hard stop)
+- Provides detailed debug output for troubleshooting
+- Shows clear error messages for limit violations
+- Maintains user control while preventing accidental large downloads
+
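A minimal sketch of the two checks described above, using the thresholds stated in this log (warning above 100, hard stop above 200). The function names, return shapes, and message wording are assumptions made for illustration.

```python
from typing import Optional, Tuple

WARN_LIMIT = 100  # warn when more than this many papers are requested
HARD_LIMIT = 200  # refuse when more than this many papers are requested


def check_limit(requested: int) -> Tuple[str, str]:
    """Classify a requested download size as 'ok', 'warn', or 'error'."""
    if requested > HARD_LIMIT:
        return "error", f"Requested {requested} papers; the maximum is {HARD_LIMIT}."
    if requested > WARN_LIMIT:
        return "warn", f"Requested {requested} papers; this is above {WARN_LIMIT}."
    return "ok", ""


def validate_download(requested: int, actual: int) -> Optional[str]:
    """Compare requested and actual counts; return a message on mismatch."""
    if actual > requested:
        return f"Downloaded {actual} papers but only {requested} were requested."
    if actual < requested:
        return f"Downloaded {actual} of {requested} requested papers."
    return None
```

The UI could show the `warn` message and still proceed, while blocking the download on `error`, which keeps the user in control for mid-sized requests.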
+## Migration Guide
+
+### For Existing Users
+1. **Automatic Detection**: Existing corpora will be automatically detected on next app startup
+2. **Manual Refresh**: Use "🔄 Refresh Corpus List" button to detect new downloads
+3. **Progress Tracking**: New downloads will show enhanced progress display
+4. **Safety Limits**: Be aware of new download limits (max 200 papers)
+
+### For New Users
+1. **Installation**: Follow standard pygetpapers installation
+2. **Streamlit Setup**: Install Streamlit and run `streamlit run streamlit_app.py`
+3. **First Use**: Download papers through the web interface
+4. **Corpus Management**: Use the Corpus Manager to view and analyze downloads
+
+## Future Enhancements
+
+### Planned Features
+- **Batch operations**: Select multiple papers for bulk operations
+- **Advanced filtering**: Filter papers by date, journal, author, etc.
+- **Citation export**: Export citations in various formats (BibTeX, EndNote, etc.)
+- **Collaborative features**: Share corpora and queries with other users
+- **Advanced analytics**: More sophisticated corpus analysis and visualization
+- **API integration**: Direct integration with external analysis tools
+
+### Technical Improvements
+- **Performance optimization**: Faster corpus scanning and data loading
+- **Memory management**: Better handling of large corpora
+- **Caching**: Implement intelligent caching for frequently accessed data
+- **Offline mode**: Support for offline corpus analysis
+- **Mobile optimization**: Better mobile device support
+
 ## Team Feedback - [Date]
 
 ### Issues to Address:
@@ -12,6 +157,7 @@
    - ❌ Highlight the query box for better visibility
    - ❌ Reduce default paper downloading from current limit to 10
    - ❌ Create version without external dependencies (no plotly requirement)
+   - ❌ Add visual progress indicators for downloads
 
 3. **Functionality Issues:**
    - ❌ Plot doesn't show numbers of papers and years
@@ -37,6 +183,7 @@
 - ✅ Highlighted query box with prominent styling
 - ✅ Created migration guide (MIGRATION_GUIDE.md)
 - ✅ Created no-dependencies version (streamlit_app_no_deps.py)
+- ✅ Added real-time progress tracking for downloads
 
 ### In Progress 🔄
 - Fixing journal name display issue
@@ -56,14 +203,21 @@
 - Corpus location: ✅ Files saved to `{repo_name}_{timestamp}` in current directory (e.g., `europe_pmc_20250121_143022`)
 - Journal name issue: ✅ Fixed - was looking for `journalTitle` but should look for `journalInfo.journal.title`
 - Query box: ✅ Highlighted with prominent styling and help text
+- Progress tracking: ✅ Real-time progress indicators showing:
+  - Current operation (Searching, Downloading, Writing)
+  - Overall progress bar with paper count
+  - File type counters (JSON, XML, PDF, Supplementary)
+  - Recent output lines
+  - Works in both main and no-dependencies versions
 
 ## Next Steps:
 1. Remove `# noqa` comments
 2. Create migration guide
 3. Highlight query box
 4. Reduce default limit to 10
 5. Create no-dependencies version
-6. Fix plot functionality
-7. Fix journal name display
-8. Document corpus location
-9. Plan filter implementation 
+6. Add progress tracking for downloads
+7. Fix plot functionality
+8. Fix journal name display
+9. Document corpus location
+10. Plan filter implementation 
@@ -167,14 +167,14 @@ def create_papers_table(
             # Extract key information
             title = metadata.get("title", paper["directory"])
             authors = metadata.get("authorString", "Unknown")
-            
+
             # Extract journal name from nested structure
             journal = "Unknown"
             if "journalInfo" in metadata and "journal" in metadata["journalInfo"]:
                 journal = metadata["journalInfo"]["journal"].get("title", "Unknown")
             elif "journalTitle" in metadata:
                 journal = metadata["journalTitle"]
-            
+
             doi = metadata.get("doi", "")
             pmid = metadata.get("pmid", "")
             pmcid = metadata.get("pmcid", "")
@@ -185,6 +185,13 @@ def create_papers_table(
             has_pdf = any("fulltext.pdf" in f for f in paper["files"])
             has_supp = any("supplementary" in f for f in paper["files"])
 
+            # Check HTML file types
+            has_raw_html = any("fulltext.raw.html" in f for f in paper["files"])
+            has_xml_html = any("fulltext.xml.html" in f for f in paper["files"])
+            has_pdf_html = any("fulltext.pdf.html" in f for f in paper["files"])
+            has_doc_html = any("fulltext.doc.html" in f for f in paper["files"])
+            has_enhanced_html = any("html_with_ids.html" in f for f in paper["files"])
+
             # Create hyperlinks
             doi_link = f"https://doi.org/{doi}" if doi else ""
             pmid_link = f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/" if pmid else ""
@@ -223,6 +230,18 @@ def create_papers_table(
                 "XML": "✅" if has_xml else "❌",
                 "PDF": "✅" if has_pdf else "❌",
                 "Suppl": "✅" if has_supp else "❌",
+                "HTML": (
+                    "✅"
+                    if (
+                        has_raw_html
+                        or has_xml_html
+                        or has_pdf_html
+                        or has_doc_html
+                        or has_enhanced_html
+                    )
+                    else "❌"
+                ),
+                "Enhanced": "✅" if has_enhanced_html else "❌",
                 "Files": len(paper["files"]),
             }
             table_data.append(row)
@@ -449,7 +468,7 @@ def export_table_to_csv(
                     journal = metadata["journalInfo"]["journal"].get("title", "")
                 elif "journalTitle" in metadata:
                     journal = metadata["journalTitle"]
-                
+
                 row = {
                     "ID": paper["directory"],
                     "Title": metadata.get("title", ""),
@@ -466,6 +485,21 @@ def export_table_to_csv(
                     "Has_Supplementary": any(
                         "supplementary" in f for f in paper["files"]
                     ),
+                    "Has_Raw_HTML": any(
+                        "fulltext.raw.html" in f for f in paper["files"]
+                    ),
+                    "Has_XML_HTML": any(
+                        "fulltext.xml.html" in f for f in paper["files"]
+                    ),
+                    "Has_PDF_HTML": any(
+                        "fulltext.pdf.html" in f for f in paper["files"]
+                    ),
+                    "Has_DOC_HTML": any(
+                        "fulltext.doc.html" in f for f in paper["files"]
+                    ),
+                    "Has_Enhanced_HTML": any(
+                        "html_with_ids.html" in f for f in paper["files"]
+                    ),
                     "File_Count": len(paper["files"]),
                 }
                 csv_data.append(row)
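The journal-name lookup appears in both `create_papers_table` and `export_table_to_csv`, differing only in the fallback value ("Unknown" vs ""). One possible follow-up is to share it through a small helper. The sketch below is a suggestion rather than part of this diff: `extract_journal` is a hypothetical name, and the extra type checks (for a `journalInfo` that is missing, `None`, or not a mapping) go beyond what the current code does.

```python
from collections.abc import Mapping
from typing import Any


def extract_journal(metadata: Mapping, default: str = "Unknown") -> Any:
    """Return the journal title from Europe PMC style metadata.

    Looks at the nested `journalInfo.journal.title` first, then falls
    back to a flat `journalTitle`, then to `default`.
    """
    info = metadata.get("journalInfo")
    if isinstance(info, Mapping):
        journal = info.get("journal")
        if isinstance(journal, Mapping):
            return journal.get("title", default)
    if "journalTitle" in metadata:
        return metadata["journalTitle"]
    return default
```

The table code could then call `extract_journal(metadata)` and the CSV export `extract_journal(metadata, default="")`.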