Fix URL encoding issue with special characters

itskavin · itskavin · commit b91c87ba8cb4 · 2025-09-24T17:47:31.000+05:30
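
A minimal sketch of the cookie clean-up this commit adds to `thinkific_downloader/config.py`, pulled out into a standalone function for illustration. The names `sanitize_cookie` and `_REPLACEMENTS` are hypothetical; the commit applies the same replacements inline in `from_env`. The likely motivation (an assumption, not stated in the commit) is that a cookie value pasted from a browser or rich-text editor can pick up typographic characters that fail when the value is encoded into an HTTP header.

```python
# Hypothetical standalone helper; the commit inlines this logic in from_env().
_REPLACEMENTS = {
    "\u2026": "...",  # horizontal ellipsis
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
}


def sanitize_cookie(cookie_data: str) -> str:
    """Map common typographic characters to ASCII, then drop any other non-ASCII."""
    for bad, good in _REPLACEMENTS.items():
        cookie_data = cookie_data.replace(bad, good)
    # Anything still outside the ASCII range is removed, not replaced.
    return "".join(c for c in cookie_data if ord(c) < 128)


if __name__ == "__main__":
    raw = "session=abc\u2026def; name=\u201cx\u201d; caf\u00e9=1"
    print(sanitize_cookie(raw))
```

Note that the final filter silently drops characters it cannot map, so a cookie that legitimately depended on a non-ASCII byte would be altered without any warning.
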
diff --git a/.env.example b/.env.example
@@ -72,10 +72,10 @@ FFMPEG_PRESENTATION_MERGE=false
 # OPTIONAL FEATURES (LEGACY SUPPORT)
 # ===============================================
 
-# For selective content downloads, use the JSON file created from Thinki Parser. 
-# Copy the file to the Thinkifi Downloader root folder.
-# Specify the file name below. Ex. COURSE_DATA_FILE="modified-course.json" 
-# COURSE_DATA_FILE=""
+# For selective content downloads, use a custom JSON file listing only the lessons you want to download.
+# See SELECTIVE_DOWNLOAD.md for details and examples.
+# Specify the file name below. Ex. COURSE_DATA_FILE="my-selective-lessons.json"
+COURSE_DATA_FILE=""
 
 # Set to true to download all available video formats/qualities
 # Warning: This will significantly increase download size and time
diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
@@ -0,0 +1,119 @@
+# Thinkific Downloader - AI Agent Instructions
+
+## Architecture Overview
+
+This is a modern Python 3.8+ application that downloads educational content from Thinkific-based learning platforms. The architecture follows a modular design with clear separation of concerns:
+
+- **Entry Points**: `thinkificdownloader.py` (legacy) and `python -m thinkific_downloader` (preferred)
+- **Core Engine**: `thinkific_downloader/downloader.py` - main orchestrator with global state management
+- **Specialized Downloaders**: `wistia_downloader.py` for Wistia video content, standard HTTP for other media
+- **Download Management**: `download_manager.py` with parallel processing, rate limiting, and resume capabilities
+- **Configuration**: Environment-driven via `.env` files with `config.py` handling validation
+
+## Critical Developer Workflows
+
+### Running the Application
+```bash
+# Preferred modern approach
+python -m thinkific_downloader
+
+# Legacy approach (still supported)
+python thinkificdownloader.py
+
+# Docker deployment
+docker-compose up
+```
+
+### Environment Setup
+- **Authentication is MANDATORY** - requires `COOKIE_DATA` and `CLIENT_DATE` from browser DevTools
+- Configuration lives in `.env` files (workspace root takes precedence over package directory)
+- See `ENV_SETUP.md` for detailed browser cookie extraction process
+- Missing auth data causes immediate `SystemExit` with clear error message
+
+### Package Management
+- Minimal dependencies: `requests`, `rich`, `tqdm` (see `requirements.txt`)
+- Optional extras: `brotli` support, `beautifulsoup4` for enhanced parsing
+- Multi-stage Docker build optimizes for Alpine Linux + FFmpeg
+
+## Project-Specific Patterns
+
+### Global State Management
+The `downloader.py` module uses globals to mirror PHP-style behavior:
+```python
+# Critical singletons initialized in init_settings()
+SETTINGS: Optional[Settings] = None
+DOWNLOAD_MANAGER: Optional[DownloadManager] = None
+COURSE_CONTENTS: List[Dict[str, Any]] = []
+```
+
+### Authentication Headers Pattern
+All HTTP requests use consistent Thinkific-compatible headers:
+```python
+# Standard auth header pattern used throughout
+request_headers = {
+    'x-thinkific-client-date': SETTINGS.client_date,
+    'cookie': SETTINGS.cookie_data,
+    'User-Agent': USER_AGENT,  # Chrome-based UA string
+}
+```
+
+### Filename Sanitization
+File naming follows strict cross-platform rules via `file_utils.py`:
+- Reserved characters (`<>:"/\|?*`) replaced with hyphens
+- Unicode escapes properly decoded
+- UTF-8 byte limits enforced (255 bytes max)
+- Filename beautification (lowercase, dash consolidation)
+
+### Progress Monitoring
+Rich terminal UI with custom columns for download status:
+- `QueuedSpeedColumn` shows "Queued" instead of unrealistic speeds
+- `QueuedTimeColumn` handles pending downloads gracefully
+- Progress state persists across application restarts
+
+## Integration Points
+
+### Wistia Video Processing
+Special handling for Wistia-hosted videos via regex extraction:
+```python
+# Extract Wistia ID from JSONP URLs
+VIDEO_PROXY_JSONP_ID_PATTERN = re.compile(r"medias/(\w+)\.jsonp")
+```
+- Supports compressed responses (brotli/gzip/deflate)
+- Falls back to first available quality if requested quality unavailable
+- Quality options: 720p (default), 1080p, etc.
+
+### Rate Limiting & Concurrency
+Token bucket rate limiter prevents server overload:
+- Configurable concurrent downloads (default: 3 threads)
+- Exponential backoff with jitter for failed requests
+- Optional bandwidth limiting via `RATE_LIMIT_MB_S`
+
+### Resume Functionality
+Atomic progress tracking across application restarts:
+- Download state persisted in JSON files
+- Cross-platform safe backup system
+- Partial download detection and continuation
+
+## Docker Considerations
+
+- Multi-stage build (builder + runtime) for minimal image size
+- Non-root user (`thinkific`) for security
+- FFmpeg included for presentation merging when `FFMPEG_PRESENTATION_MERGE=true`
+- Volume mounting for persistent downloads: `./downloads:/app/downloads`
+
+## Key Configuration Options
+
+Essential `.env` variables (see `.env.example`):
+- `COURSE_LINK`: Target course URL
+- `COOKIE_DATA`: Browser session cookies (required)
+- `CLIENT_DATE`: API timestamp (required)
+- `CONCURRENT_DOWNLOADS`: Parallel download threads (1-10)
+- `VIDEO_DOWNLOAD_QUALITY`: Preferred video quality
+- `RESUME_PARTIAL`: Enable download resumption (default: true)
+
+## Testing & Debugging
+
+- Set `DEBUG=true` for verbose logging
+- Progress validation via Rich UI real-time monitoring
+- File integrity checking with size/checksum validation
+- Network retry logic automatically handles transient failures
diff --git a/.gitignore b/.gitignore
@@ -23,6 +23,7 @@ wheels/
 .installed.cfg
 *.egg
 MANIFEST
+php sample/
 
 # PyInstaller
 *.manifest
diff --git a/ENV_SETUP.md b/ENV_SETUP.md
@@ -225,6 +225,7 @@ If authentication works, you should see course information being fetched. If not
 - 🐛 **Issues**: [Report Problems](https://github.com/ByteTrix/Thinkific-Downloader/issues)
 - 💬 **Questions**: [Community Discussions](https://github.com/ByteTrix/Thinkific-Downloader/discussions)
 - 📚 **Documentation**: [Main README](README.md)
+- 🎯 **Selective Downloads**: [Download Specific Lessons Only](SELECTIVE_DOWNLOAD.md)
 
 ---
 
diff --git a/README.md b/README.md
@@ -167,6 +167,8 @@ python thinkificdownloader.py
 
 > 👨‍💻 **Developer?** Visit [**DEVELOPMENT.md**](DEVELOPMENT.md) for architecture overview, API reference, and contribution guidelines.
 
+> 🎯 **Want to download specific lessons only?** See our [**SELECTIVE_DOWNLOAD.md**](SELECTIVE_DOWNLOAD.md) guide for downloading individual chapters or lessons instead of the entire course.
+
 ## ⚙️ **Enhanced Configuration**
 
 **🚨 BEFORE YOU START:** Follow our **[Complete Environment Setup Guide](ENV_SETUP.md)** for step-by-step instructions on extracting authentication data from your browser.
diff --git a/SELECTIVE_DOWNLOAD.md b/SELECTIVE_DOWNLOAD.md
@@ -0,0 +1,166 @@
+# 🎯 Selective Download Guide
+
+This guide explains how to download only specific lessons from a Thinkific course instead of downloading the entire course.
+
+## Overview
+
+The Thinkific Downloader supports selective downloads using a JSON configuration file that specifies exactly which lessons you want to download. This is useful when you only need certain chapters or want to avoid downloading large files you don't need.
+
+## Methods Available
+
+### Method 1: Using Environment Variable (Recommended)
+
+1. **Create or obtain a course data JSON file** (see options below)
+2. **Set the file path in your `.env` file:**
+   ```bash
+   COURSE_DATA_FILE="my-selective-lessons.json"
+   ```
+3. **Run the downloader:**
+   ```bash
+   python -m thinkific_downloader
+   ```
+
+### Method 2: Using Command Line Flag
+
+You can specify the JSON file directly via command line:
+```bash
+python -m thinkific_downloader --json my-selective-lessons.json
+```
+
+### Method 3: Docker with Selective Downloads
+
+```bash
+# Set in your .env file
+COURSE_DATA_FILE="selective-lessons.json"
+
+# Run with Docker
+docker-compose up
+```
+
+## Creating Your Selective JSON File
+
+### Option A: From Existing Progress File
+
+If you've already run the downloader once:
+
+1. **Copy the generated progress file:**
+   ```bash
+   # Windows
+   copy "downloads\your-course-name\.thinkific_progress.json" "selective-lessons.json"
+   
+   # Linux/Mac
+   cp "downloads/your-course-name/.thinkific_progress.json" "selective-lessons.json"
+   ```
+
+2. **Edit the file** to remove unwanted lessons from the `download_tasks` array
+
+3. **Keep the structure intact** - only modify the contents of the arrays
+
+### Option B: Create from Scratch
+
+Create a JSON file with this structure:
+
+```json
+{
+  "analyzed_chapters": ["chapter_1", "chapter_3", "chapter_5"],
+  "download_tasks": [
+    {
+      "url": "https://embed-ssl.wistia.com/deliveries/video-id-here.bin",
+      "dest_path": "1. chapter-name\\1.lesson-name\\lesson-file.mp4",
+      "content_type": "video"
+    },
+    {
+      "url": "https://course-files.thinkific.com/document.pdf",
+      "dest_path": "1. chapter-name\\2.document-lesson\\document.pdf",
+      "content_type": "document"
+    }
+  ]
+}
+```
+
+## Understanding the JSON Structure
+
+### `analyzed_chapters`
+- Array of chapter IDs that have been processed
+- Format: `["chapter_1", "chapter_2", "chapter_N"]`
+- Used to track which chapters have been analyzed
+
+### `download_tasks`
+Each task has three required fields:
+
+- **`url`**: Direct download URL for the content
+- **`dest_path`**: Local file path where content will be saved
+- **`content_type`**: Type of content (`video`, `document`, `html`, `audio`)
+
+## Content Types Supported
+
+| Content Type | Description | File Extensions |
+|--------------|-------------|-----------------|
+| `video` | Video lessons (Wistia, MP4) | `.mp4`, `.mov`, `.avi` |
+| `document` | PDF documents, slides | `.pdf`, `.ppt`, `.pptx` |
+| `html` | Text lessons, notes | `.html` |
+| `audio` | Audio files | `.mp3`, `.m4a`, `.wav` |
+
+## Example Workflows
+
+### Download Only Videos from Specific Chapters
+
+1. Run the full download once to get the complete JSON
+2. Copy the progress file to `videos-only.json`
+3. Edit to keep only entries where `content_type` is `"video"`
+4. Remove chapters you don't want from `analyzed_chapters`
+5. Set `COURSE_DATA_FILE="videos-only.json"` in `.env`
+
+### Download Only First 5 Lessons
+
+1. Get the complete JSON file
+2. Keep only the first 5 entries in `download_tasks`
+3. Update `analyzed_chapters` to match
+4. Use the modified file
+
+### Skip Large Video Files
+
+1. Edit the JSON to remove large video entries
+2. Keep documents, text, and smaller videos
+3. Use the filtered JSON for download
+
+## Tips and Best Practices
+
+### File Path Management
+- Use forward slashes (`/`) or doubled backslashes (`\\`) in paths, since a JSON string needs each backslash escaped
+- Ensure destination directories exist or will be created
+- Keep the original folder structure for organization
+
+### Performance Optimization
+- Fewer tasks = faster completion
+- Remove unnecessary content types to save bandwidth
+- Use `CONCURRENT_DOWNLOADS=1` for selective downloads to avoid rate limiting
+
+### Validation
+- Ensure all URLs are accessible
+- Verify file paths are valid for your operating system
+- Test with a small subset first
+
+## Troubleshooting
+
+### "File not found" Error
+```
+COURSE_DATA_FILE env var not set.
+```
+**Solution:** Set `COURSE_DATA_FILE` in `.env`, and make sure the JSON file it names exists in the project root directory.
+
+### "Invalid JSON" Error
+**Solution:** Validate your JSON syntax using an online JSON validator or text editor with JSON support.
+
+### Missing Downloads
+**Solution:** Check that the `url` fields in your JSON are still valid and accessible.
+
+### Permission Errors
+**Solution:** Ensure the destination directories are writable and you have sufficient disk space.
+
+## Getting Help
+
+- See [ENV_SETUP.md](ENV_SETUP.md) for authentication setup
+- See [README.md](README.md) for general usage
+- Enable `DEBUG=true` for detailed logging
+- Check the generated progress files for examples of proper JSON structure
diff --git a/thinkific_downloader/config.py b/thinkific_downloader/config.py
@@ -68,17 +68,30 @@ def from_env(cls):
         resume_partial = os.getenv('RESUME_PARTIAL', 'true').lower() in ('1', 'true', 'yes', 'on')
         debug = os.getenv('DEBUG', 'false').lower() in ('1', 'true', 'yes', 'on')
         
+        # Clean cookie data to remove Unicode characters that cause encoding issues
+        if cookie_data:
+            # Replace problematic Unicode characters with safe alternatives
+            cookie_data = cookie_data.replace('\u2026', '...')  # Replace horizontal ellipsis
+            cookie_data = cookie_data.replace('\u2018', "'")    # Replace left single quotation mark
+            cookie_data = cookie_data.replace('\u2019', "'")    # Replace right single quotation mark
+            cookie_data = cookie_data.replace('\u201c', '"')    # Replace left double quotation mark
+            cookie_data = cookie_data.replace('\u201d', '"')    # Replace right double quotation mark
+            cookie_data = cookie_data.replace('\u2013', '-')    # Replace en dash
+            cookie_data = cookie_data.replace('\u2014', '-')    # Replace em dash
+            # Remove any remaining non-ASCII characters
+            cookie_data = ''.join(c for c in cookie_data if ord(c) < 128)
+
         # Validation
         if not client_date or not cookie_data:
             raise SystemExit('Cookie data and Client Date not set. Use the ReadMe file first before using this script.')
-            
+
         # Basic directory permissions check
         cwd = Path.cwd()
         if not os.access(cwd, os.W_OK):
             raise SystemExit('Current directory is not writable.')
         return cls(
-            client_date=client_date, 
-            cookie_data=cookie_data, 
+            client_date=client_date,
+            cookie_data=cookie_data,
             video_download_quality=video_download_quality,
             output_dir=output_dir,
             ffmpeg_presentation_merge=ffmpeg_merge,
diff --git a/thinkific_downloader/downloader.py b/thinkific_downloader/downloader.py
diff --git a/thinkific_downloader/wistia_downloader.py b/thinkific_downloader/wistia_downloader.py