added pdf_hyperlink_adder.py

petermr · petermr · commit e782363f18a9 · 2025-10-07T09:49:45.000+01:00
diff --git a/docs/metadata_fields/biorxiv_metadata_fields.md b/docs/metadata_fields/biorxiv_metadata_fields.md
@@ -220,6 +220,40 @@ for paper in results["papers"]:
 
 ## Implementation Notes
 
+### Repository Structure and Naming Conventions
+
+#### Directory Structure
+```
+{user_repo_directory}/
+├── datatables.html                    # Main DataTables interface
+├── lantana_papers_data.json           # Paper metadata
+└── pdfs/
+    ├── {paper_id_1}/
+    │   ├── fulltext.pdf               # Original PDF (manually downloaded)
+    │   ├── fulltext.html              # Downloaded HTML content
+    │   └── fulltext.pdf.html          # PDF converted to HTML (derived)
+    ├── {paper_id_2}/
+    │   ├── fulltext.pdf
+    │   ├── fulltext.html
+    │   └── fulltext.pdf.html
+    └── ...
+```
+
+#### Naming Conventions
+- **Repository Directory**: User-defined (e.g., `./examples/lantana_biorxiv/`)
+- **Article Subdirectories**: One per paper under `pdfs/`, named by paper ID (e.g., `pdfs/292722/`, `pdfs/126490/`)
+- **Content Files**: Reserved names indicating type and format:
+  - `fulltext.pdf` - Original PDF content
+  - `fulltext.html` - Downloaded HTML content
+  - `fulltext.pdf.html` - PDF converted to HTML (derived content)
+- **DataTables Location**: `{user_repo_directory}/datatables.html`
+
+#### PDF Download Workflow
+1. **DataTables Interface**: Located at `{user_repo_directory}/datatables.html`
+2. **PDF Links**: Point to BioRxiv repository URLs
+3. **Manual Download**: User clicks a PDF cell and saves the file to `pdfs/{paper_id}/fulltext.pdf`
+4. **Local Detection**: DataTables shows "Local PDF" in green when the file exists
+
 ### Web Scraping Selectors
 - **Title:** `span.highwire-cite-title a.highwire-cite-linked-title`
 - **Authors:** `div.highwire-cite-authors span.highwire-citation-author`
diff --git a/docs/repository_fields_schema.md b/docs/repository_fields_schema.md
@@ -44,7 +44,41 @@ Fields that require additional web scraping, file downloads, or extended API cal
 }
 ```
 
-**Notes:** BioRxiv uses web scraping for full text extraction and PDF downloads.
+### Repository Structure and Naming Conventions
+
+#### Directory Structure
+```
+{user_repo_directory}/
+├── datatables.html                    # Main DataTables interface
+├── {repository}_papers_data.json      # Paper metadata
+└── pdfs/
+    ├── {paper_id_1}/
+    │   ├── fulltext.pdf               # Original PDF (manually downloaded)
+    │   ├── fulltext.html              # Downloaded HTML content
+    │   └── fulltext.pdf.html          # PDF converted to HTML (derived)
+    ├── {paper_id_2}/
+    │   ├── fulltext.pdf
+    │   ├── fulltext.html
+    │   └── fulltext.pdf.html
+    └── ...
+```
+
+#### Naming Conventions
+- **Repository Directory**: User-defined (e.g., `./examples/lantana_biorxiv/`)
+- **Article Subdirectories**: One per paper under `pdfs/`, named by paper ID (e.g., `pdfs/292722/`, `pdfs/126490/`)
+- **Content Files**: Reserved names indicating type and format:
+  - `fulltext.pdf` - Original PDF content
+  - `fulltext.html` - Downloaded HTML content
+  - `fulltext.pdf.html` - PDF converted to HTML (derived content)
+- **DataTables Location**: `{user_repo_directory}/datatables.html`
+
+#### PDF Download Workflow
+1. **DataTables Interface**: Located at `{user_repo_directory}/datatables.html`
+2. **PDF Links**: Point to BioRxiv repository URLs
+3. **Manual Download**: User clicks a PDF cell and saves the file to `pdfs/{paper_id}/fulltext.pdf`
+4. **Local Detection**: DataTables shows "Local PDF" in green when the file exists
+
+**Notes:** BioRxiv uses web scraping for full text extraction. Because Cloudflare protection blocks automated PDF downloads, PDFs must be downloaded manually through the DataTables interface.
 
 ---
 
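The layout and reserved file names documented in this hunk can be checked with a few lines of Python. The following is only a minimal sketch of how a script might detect local content; the helper names `paper_dir` and `local_content` are hypothetical and do not appear in the repository.

```python
from pathlib import Path

# Reserved content file names, as listed under "Naming Conventions".
CONTENT_FILES = ("fulltext.pdf", "fulltext.html", "fulltext.pdf.html")


def paper_dir(repo_dir, paper_id):
    """Return the expected directory for one paper: {repo_dir}/pdfs/{paper_id}/."""
    return Path(repo_dir) / "pdfs" / str(paper_id)


def local_content(repo_dir, paper_id):
    """Map each reserved file name to True if that file exists locally."""
    base = paper_dir(repo_dir, paper_id)
    return {name: (base / name).is_file() for name in CONTENT_FILES}
```

For example, `local_content("./examples/lantana_biorxiv", 292722)["fulltext.pdf"]` would be `True` once the PDF has been saved manually to its reserved location.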
diff --git a/docs/styleguide.md b/docs/styleguide.md
@@ -163,6 +163,15 @@ This document records coding and naming conventions for the pygetpapers project.
 
 **Rationale**: This protocol ensures we can always revert to a working state and validates that changes don't break existing functionality.
 
+### STYLE: Never use destructive commands without explicit approval
+
+- ✅ **Good**: Use `git clean -n` to preview what would be deleted, then ask for approval
+- ✅ **Good**: Commit important work before using any destructive commands
+- ❌ **Bad**: Using `git clean -fd`, `rm -rf`, or other destructive commands without understanding consequences
+- ❌ **Bad**: Using force flags (`-f`) without checking what will be affected
+
+**Rationale**: Destructive commands can permanently delete hours of work. Always preview, understand, and get explicit approval before using them.
+
 ---
 
 *This style guide will be updated as new conventions are established.*
@@ -199,6 +208,12 @@ This document records coding and naming conventions for the pygetpapers project.
 **Why it's wrong:** User couldn't see exactly what would change
 **Impact:** Reduces transparency and user control
 
+#### **5. Used Destructive Git Clean Command (CRITICAL VIOLATION)**
+**Violation:** Used `git clean -fd` without understanding its destructive nature
+**Style Guide Rule Violated:** "Never use destructive commands without explicit approval"
+**Why it's wrong:** `git clean -fd` removes ALL untracked files and directories permanently
+**Impact:** Lost hours of work creating AtPoE package files, had to recreate everything
+
 ### 🛡️ Prevention Plan for Future
 
 #### **Before Any Code Changes:**
@@ -210,9 +225,16 @@ This document records coding and naming conventions for the pygetpapers project.
 #### **Path and Directory Rules:**
 1. **NEVER use `sys.path` or path manipulation**
 2. **ONLY work within current workspace directory**
-3. **Use proper package installation** - `pip install -e .`
+3. **Use proper package installation** - `pip install -e .`
 4. **Use relative paths within workspace**
 
+#### **Git Safety Rules:**
+1. **NEVER use `git clean -fd` without explicit user approval**
+2. **ALWAYS check what files will be deleted before using destructive commands**
+3. **Use `git clean -n` first to see what would be deleted (dry run)**
+4. **NEVER use force flags (`-f`) without understanding the consequences**
+5. **ALWAYS commit important work before using any destructive git commands**
+
 #### **Development Protocol Checklist:**
 - [ ] Propose changes first
 - [ ] Get explicit user approval
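The "preview first" rule added in this hunk can also be scripted. The sketch below assumes `git` is on the PATH and that `git clean -n` reports each candidate as a `Would remove <path>` line; the function names are hypothetical. It only lists files and never deletes anything.

```python
import subprocess


def parse_clean_preview(output):
    """Extract paths from `git clean -n` output lines ('Would remove <path>')."""
    prefix = "Would remove "
    return [line[len(prefix):] for line in output.splitlines()
            if line.startswith(prefix)]


def preview_clean(repo_dir="."):
    """Dry run: list untracked files and directories that `git clean -fd` would delete."""
    result = subprocess.run(
        ["git", "clean", "-n", "-d"],  # -n = dry run, -d = include directories
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return parse_clean_preview(result.stdout)
```

The returned list can be shown to the user for explicit approval before any destructive command is run.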
diff --git a/pdf_hyperlink_adder.py b/pdf_hyperlink_adder.py
@@ -0,0 +1,203 @@
+#!/usr/bin/env python3
+"""
+PDF Hyperlink Adder
+
+This script adds hyperlinks and visual cues to a PDF based on a word list.
+It processes a large PDF (e.g., 350 pages) and adds a clickable link with a blue
+underline for each matching word from the provided list.
+
+Requirements:
+    pip install PyMuPDF pdfplumber
+
+Usage:
+    python pdf_hyperlink_adder.py input.pdf word_list.csv output.pdf
+"""
+
+import fitz  # PyMuPDF
+import csv
+import re
+import sys
+from pathlib import Path
+from typing import Dict, List, Tuple
+
+class PDFHyperlinkAdder:
+    def __init__(self, input_pdf: str, word_list_file: str, output_pdf: str):
+        self.input_pdf = input_pdf
+        self.word_list_file = word_list_file
+        self.output_pdf = output_pdf
+        self.word_links: Dict[str, str] = {}
+        self.processed_words = 0
+        self.total_matches = 0
+        
+    def load_word_list(self) -> None:
+        """Load the word list with hyperlinks from CSV file"""
+        print(f"📖 Loading word list from {self.word_list_file}...")
+        
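+        with open(self.word_list_file, 'r', encoding='utf-8', newline='') as f:
+            for row in csv.reader(f):
+                if len(row) < 2 or not row[0].strip():
+                    continue  # skip blank or malformed rows
+                word, link = row[0].strip().lower(), row[1].strip()
+                if (word, link.lower()) != ("word", "hyperlink"):  # skip the header row
+                    self.word_links[word] = link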
+                    
+        print(f"✅ Loaded {len(self.word_links)} words with hyperlinks")
+        
+    def find_word_instances(self, doc: fitz.Document) -> List[Tuple[int, str, fitz.Rect, str]]:
+        """Find all instances of words in the PDF with their positions"""
+        print("🔍 Searching for word instances...")
+        
+        word_instances = []
+        
+        for page_num in range(len(doc)):
+            page = doc[page_num]
+            
+            # Get text blocks with positioning
+            text_dict = page.get_text("dict")
+            
+            for block in text_dict["blocks"]:
+                if "lines" in block:
+                    for line in block["lines"]:
+                        for span in line["spans"]:
+                            text = span["text"]
+                            bbox = fitz.Rect(span["bbox"])
+                            
+                            # Check each word in our list
+                            for word, link in self.word_links.items():
+                                # Use case-insensitive search
+                                pattern = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
+                                matches = pattern.finditer(text)
+                                
+                                for match in matches:
+                                    # Calculate position of this specific word
+                                    start_pos = match.start()
+                                    end_pos = match.end()
+                                    
+                                    # Calculate the bbox for this specific word
+                                    char_width = bbox.width / len(text)
+                                    word_bbox = fitz.Rect(
+                                        bbox.x0 + start_pos * char_width,
+                                        bbox.y0,
+                                        bbox.x0 + end_pos * char_width,
+                                        bbox.y1
+                                    )
+                                    
+                                    word_instances.append((page_num, word, word_bbox, link))
+                                    self.total_matches += 1
+                                    
+        print(f"✅ Found {self.total_matches} word instances across {len(doc)} pages")
+        return word_instances
+    
+    def add_hyperlinks_and_styling(self, doc: fitz.Document, word_instances: List[Tuple[int, str, fitz.Rect, str]]) -> None:
+        """Add hyperlinks and visual styling to the PDF"""
+        print("🎨 Adding hyperlinks and styling...")
+        
+        for page_num, word, bbox, link in word_instances:
+            page = doc[page_num]
+            
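+            # Add a clickable URI link over the word's bounding box.
+            # Page.insert_link() takes a link dict and returns None, so there is
+            # no annotation object on which to set a tooltip; PDF viewers
+            # typically show the target URI on hover instead.
+            page.insert_link({"kind": fitz.LINK_URI, "from": bbox, "uri": link})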
+            
+            # Add visual styling - blue underline
+            # Note: PyMuPDF doesn't directly modify text color, but we can add visual indicators
+            # We'll add a small blue rectangle under the text as a visual cue
+            underline_rect = fitz.Rect(bbox.x0, bbox.y1 - 1, bbox.x1, bbox.y1)
+            page.draw_rect(underline_rect, color=(0, 0, 1), width=1)  # Blue underline
+            
+            self.processed_words += 1
+            
+            if self.processed_words % 100 == 0:
+                print(f"   Processed {self.processed_words}/{self.total_matches} words...")
+    
+    def process_pdf(self) -> None:
+        """Main processing function"""
+        print(f"📄 Processing PDF: {self.input_pdf}")
+        print(f"📝 Word list: {self.word_list_file}")
+        print(f"💾 Output: {self.output_pdf}")
+        print("-" * 50)
+        
+        # Load word list
+        self.load_word_list()
+        
+        # Open PDF
+        doc = fitz.open(self.input_pdf)
+        print(f"📖 PDF opened: {len(doc)} pages")
+        
+        # Find all word instances
+        word_instances = self.find_word_instances(doc)
+        
+        if not word_instances:
+            print("❌ No matching words found in the PDF")
+            return
+        
+        # Add hyperlinks and styling
+        self.add_hyperlinks_and_styling(doc, word_instances)
+        
+        # Save the modified PDF
+        doc.save(self.output_pdf)
+        doc.close()
+        
+        print("-" * 50)
+        print(f"✅ Processing complete!")
+        print(f"📊 Statistics:")
+        print(f"   Total words processed: {self.processed_words}")
+        print(f"   Total matches found: {self.total_matches}")
+        print(f"   Output saved to: {self.output_pdf}")
+
+def create_sample_word_list(filename: str = "word_list.csv") -> None:
+    """Create a sample word list CSV file for testing"""
+    sample_words = [
+        ["python", "https://python.org"],
+        ["programming", "https://en.wikipedia.org/wiki/Programming"],
+        ["algorithm", "https://en.wikipedia.org/wiki/Algorithm"],
+        ["database", "https://en.wikipedia.org/wiki/Database"],
+        ["machine learning", "https://en.wikipedia.org/wiki/Machine_learning"],
+        ["artificial intelligence", "https://en.wikipedia.org/wiki/Artificial_intelligence"],
+        ["data science", "https://en.wikipedia.org/wiki/Data_science"],
+        ["web development", "https://en.wikipedia.org/wiki/Web_development"],
+        ["cloud computing", "https://en.wikipedia.org/wiki/Cloud_computing"],
+        ["cybersecurity", "https://en.wikipedia.org/wiki/Computer_security"]
+    ]
+    
+    with open(filename, 'w', newline='', encoding='utf-8') as f:
+        writer = csv.writer(f)
+        writer.writerow(["word", "hyperlink"])
+        writer.writerows(sample_words)
+    
+    print(f"📝 Created sample word list: {filename}")
+
+def main():
+    if sys.argv[1:] == ["--create-sample"]:  # must precede the argument-count check
+        create_sample_word_list()
+        return
+    
+    if len(sys.argv) != 4:
+        print("Usage: python pdf_hyperlink_adder.py input.pdf word_list.csv output.pdf")
+        print("\nExample:")
+        print("  python pdf_hyperlink_adder.py document.pdf word_list.csv document_with_links.pdf")
+        print("\nTo create a sample word list:")
+        print("  python pdf_hyperlink_adder.py --create-sample")
+        return
+    
+    input_pdf = sys.argv[1]
+    word_list_file = sys.argv[2]
+    output_pdf = sys.argv[3]
+    
+    # Check if files exist
+    if not Path(input_pdf).exists():
+        print(f"❌ Input PDF not found: {input_pdf}")
+        return
+    
+    if not Path(word_list_file).exists():
+        print(f"❌ Word list file not found: {word_list_file}")
+        return
+    
+    # Process the PDF
+    adder = PDFHyperlinkAdder(input_pdf, word_list_file, output_pdf)
+    adder.process_pdf()
+
+if __name__ == "__main__":
+    main() 
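The per-word bounding box in `find_word_instances` is an estimate: it divides the span width evenly across its characters, which is exact only for monospaced text. The standalone sketch below isolates that arithmetic with plain tuples so it can be checked without PyMuPDF; the function name `estimate_word_boxes` is hypothetical.

```python
import re


def estimate_word_boxes(text, span_bbox, word):
    """Estimate (x0, y0, x1, y1) boxes for whole-word matches of `word` in a span.

    Assumes every character in the span has the same width, as the script does.
    """
    if not text:
        return []
    x0, y0, x1, y1 = span_bbox
    char_width = (x1 - x0) / len(text)
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [
        (x0 + m.start() * char_width, y0, x0 + m.end() * char_width, y1)
        for m in pattern.finditer(text)
    ]
```

With proportional fonts the estimated boxes can drift from the rendered word, so links may sit slightly off target. PyMuPDF's `Page.search_for` returns rectangles for the matched text itself and may be a more precise alternative, although its substring matching differs from the `\b` word-boundary regex used here.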
diff --git a/requirements_pdf.txt b/requirements_pdf.txt
@@ -0,0 +1,2 @@
+PyMuPDF>=1.23.0
+pdfplumber>=0.9.0 
