Commit b91c87b

fixed url encode issue with special characters
1 parent dce671c commit b91c87b

9 files changed

Lines changed: 452 additions & 71 deletions

.env.example

Lines changed: 4 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -72,10 +72,10 @@ FFMPEG_PRESENTATION_MERGE=false
7272
# OPTIONAL FEATURES (LEGACY SUPPORT)
7373
# ===============================================
7474

75-
# For selective content downloads, use the JSON file created from Thinki Parser.
76-
# Copy the file to the Thinkifi Downloader root folder.
77-
# Specify the file name below. Ex. COURSE_DATA_FILE="modified-course.json"
78-
# COURSE_DATA_FILE=""
75+
# For selective content downloads, use a custom JSON file listing only the lessons you want to download.
76+
# See SELECTIVE_DOWNLOAD.md for details and examples.
77+
# Specify the file name below. Ex. COURSE_DATA_FILE="my-selective-lessons.json"
78+
COURSE_DATA_FILE=""
7979

8080
# Set to true to download all available video formats/qualities
8181
# Warning: This will significantly increase download size and time

.github/copilot-instructions.md

Lines changed: 119 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,119 @@
1+
# Thinkific Downloader - AI Agent Instructions
2+
3+
## Architecture Overview
4+
5+
This is a modern Python 3.8+ application that downloads educational content from Thinkific-based learning platforms. The architecture follows a modular design with clear separation of concerns:
6+
7+
- **Entry Points**: `thinkificdownloader.py` (legacy) and `python -m thinkific_downloader` (preferred)
8+
- **Core Engine**: `thinkific_downloader/downloader.py` - main orchestrator with global state management
9+
- **Specialized Downloaders**: `wistia_downloader.py` for Wistia video content, standard HTTP for other media
10+
- **Download Management**: `download_manager.py` with parallel processing, rate limiting, and resume capabilities
11+
- **Configuration**: Environment-driven via `.env` files with `config.py` handling validation
12+
13+
## Critical Developer Workflows
14+
15+
### Running the Application
16+
```bash
17+
# Preferred modern approach
18+
python -m thinkific_downloader
19+
20+
# Legacy approach (still supported)
21+
python thinkificdownloader.py
22+
23+
# Docker deployment
24+
docker-compose up
25+
```
26+
27+
### Environment Setup
28+
- **Authentication is MANDATORY** - requires `COOKIE_DATA` and `CLIENT_DATE` from browser DevTools
29+
- Configuration lives in `.env` files (workspace root takes precedence over package directory)
30+
- See `ENV_SETUP.md` for detailed browser cookie extraction process
31+
- Missing auth data causes immediate `SystemExit` with clear error message
32+
33+
### Package Management
34+
- Minimal dependencies: `requests`, `rich`, `tqdm` (see `requirements.txt`)
35+
- Optional extras: `brotli` support, `beautifulsoup4` for enhanced parsing
36+
- Multi-stage Docker build optimizes for Alpine Linux + FFmpeg
37+
38+
## Project-Specific Patterns
39+
40+
### Global State Management
41+
The `downloader.py` module uses globals to mirror PHP-style behavior:
42+
```python
43+
# Critical singletons initialized in init_settings()
44+
SETTINGS: Optional[Settings] = None
45+
DOWNLOAD_MANAGER: Optional[DownloadManager] = None
46+
COURSE_CONTENTS: List[Dict[str, Any]] = []
47+
```
48+
49+
### Authentication Headers Pattern
50+
All HTTP requests use consistent Thinkific-compatible headers:
51+
```python
52+
# Standard auth header pattern used throughout
53+
request_headers = {
54+
'x-thinkific-client-date': SETTINGS.client_date,
55+
'cookie': SETTINGS.cookie_data,
56+
'User-Agent': USER_AGENT, # Chrome-based UA string
57+
}
58+
```
59+
60+
### Filename Sanitization
61+
File naming follows strict cross-platform rules via `file_utils.py`:
62+
- Reserved characters (`<>:"/\|?*`) replaced with hyphens
63+
- Unicode escapes properly decoded
64+
- UTF-8 byte limits enforced (255 bytes max)
65+
- Filename beautification (lowercase, dash consolidation)
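A minimal sketch of how the sanitization rules above could fit together. It is not the code in `file_utils.py`: the helper name `sanitize_filename` and the constants are hypothetical, and Unicode-escape decoding is left out for brevity.

```python
import re

# Characters that are reserved on at least one major platform.
RESERVED_CHARS = re.compile(r'[<>:"/\\|?*]')
MAX_FILENAME_BYTES = 255  # byte limit named in the rules above


def sanitize_filename(name: str, max_bytes: int = MAX_FILENAME_BYTES) -> str:
    """Hypothetical helper: apply the listed rules to one path component."""
    # Replace every reserved character with a hyphen.
    cleaned = RESERVED_CHARS.sub("-", name)
    # Beautify: lowercase and collapse runs of dashes into one.
    cleaned = re.sub(r"-{2,}", "-", cleaned.lower())
    # Enforce the UTF-8 byte limit without leaving a split multi-byte character.
    encoded = cleaned.encode("utf-8")
    if len(encoded) > max_bytes:
        cleaned = encoded[:max_bytes].decode("utf-8", errors="ignore")
    return cleaned


print(sanitize_filename("Lesson<>One"))  # lesson-one
```

Truncating on the encoded bytes rather than on characters is what keeps names with non-ASCII text under the filesystem limit.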
66+
67+
### Progress Monitoring
68+
Rich terminal UI with custom columns for download status:
69+
- `QueuedSpeedColumn` shows "Queued" instead of unrealistic speeds
70+
- `QueuedTimeColumn` handles pending downloads gracefully
71+
- Progress state persists across application restarts
72+
73+
## Integration Points
74+
75+
### Wistia Video Processing
76+
Special handling for Wistia-hosted videos via regex extraction:
77+
```python
78+
# Extract Wistia ID from JSONP URLs
79+
VIDEO_PROXY_JSONP_ID_PATTERN = re.compile(r"medias/(\w+)\.jsonp")
80+
```
81+
- Supports compressed responses (brotli/gzip/deflate)
82+
- Falls back to first available quality if requested quality unavailable
83+
- Quality options: 720p (default), 1080p, etc.
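The pattern and the fallback rule above can be exercised in isolation. In this sketch the helper names `extract_wistia_id` and `pick_quality` and the sample URL are hypothetical; only the regular expression comes from the instructions.

```python
import re
from typing import List, Optional

VIDEO_PROXY_JSONP_ID_PATTERN = re.compile(r"medias/(\w+)\.jsonp")


def extract_wistia_id(url: str) -> Optional[str]:
    """Return the media ID embedded in a JSONP URL, or None if absent."""
    match = VIDEO_PROXY_JSONP_ID_PATTERN.search(url)
    return match.group(1) if match else None


def pick_quality(available: List[str], requested: str = "720p") -> Optional[str]:
    """Prefer the requested quality; otherwise fall back to the first one listed."""
    if requested in available:
        return requested
    return available[0] if available else None


# Illustrative URL, not taken from a real course.
print(extract_wistia_id("https://fast.wistia.com/embed/medias/abc123xyz.jsonp"))  # abc123xyz
print(pick_quality(["360p", "1080p"]))  # 360p
```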
84+
85+
### Rate Limiting & Concurrency
86+
Token bucket rate limiter prevents server overload:
87+
- Configurable concurrent downloads (default: 3 threads)
88+
- Exponential backoff with jitter for failed requests
89+
- Optional bandwidth limiting via `RATE_LIMIT_MB_S`
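For readers unfamiliar with the two techniques named above, here is a minimal sketch of a token bucket and of exponential backoff with jitter. It does not reproduce `download_manager.py`: the class, the function, and their parameters are assumptions made for illustration.

```python
import random
import time
from typing import Callable


class TokenBucket:
    """Allow `rate` operations per second, with bursts of up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def try_consume(self, amount: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= amount:
            self._tokens -= amount
            return True
        return False


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return rng() * min(cap, base * 2 ** attempt)
```

Passing the clock and the random source as parameters keeps both pieces deterministic under test; a caller would sleep for `backoff_delay(attempt)` seconds before retrying a failed request.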
90+
91+
### Resume Functionality
92+
Atomic progress tracking across application restarts:
93+
- Download state persisted in JSON files
94+
- Cross-platform safe backup system
95+
- Partial download detection and continuation
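One common way to make JSON state survive an interrupted run is to write a temporary file and rename it over the target. The sketch below assumes that approach; `save_progress_atomic` and `load_progress` are hypothetical names, and the default keys mirror the `analyzed_chapters` and `download_tasks` fields documented in `SELECTIVE_DOWNLOAD.md`.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict


def save_progress_atomic(path: Path, state: Dict[str, Any]) -> None:
    """Write state to a temp file in the same directory, then rename it into place."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace swaps the file in one step on both POSIX and Windows.
        os.replace(tmp_name, path)
    except Exception:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_progress(path: Path) -> Dict[str, Any]:
    """Return saved state, or an empty state if the file is missing or corrupt."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except (FileNotFoundError, json.JSONDecodeError):
        return {"analyzed_chapters": [], "download_tasks": []}
```

Because the rename either happens or does not, a crash mid-write leaves the previous progress file untouched rather than half-written.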
96+
97+
## Docker Considerations
98+
99+
- Multi-stage build (builder + runtime) for minimal image size
100+
- Non-root user (`thinkific`) for security
101+
- FFmpeg included for presentation merging when `FFMPEG_PRESENTATION_MERGE=true`
102+
- Volume mounting for persistent downloads: `./downloads:/app/downloads`
103+
104+
## Key Configuration Options
105+
106+
Essential `.env` variables (see `.env.example`):
107+
- `COURSE_LINK`: Target course URL
108+
- `COOKIE_DATA`: Browser session cookies (required)
109+
- `CLIENT_DATE`: API timestamp (required)
110+
- `CONCURRENT_DOWNLOADS`: Parallel download threads (1-10)
111+
- `VIDEO_DOWNLOAD_QUALITY`: Preferred video quality
112+
- `RESUME_PARTIAL`: Enable download resumption (default: true)
113+
114+
## Testing & Debugging
115+
116+
- Set `DEBUG=true` for verbose logging
117+
- Progress validation via Rich UI real-time monitoring
118+
- File integrity checking with size/checksum validation
119+
- Network retry logic automatically handles transient failures

.gitignore

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,7 @@ wheels/
2323
.installed.cfg
2424
*.egg
2525
MANIFEST
26+
php sample/
2627

2728
# PyInstaller
2829
*.manifest

ENV_SETUP.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -225,6 +225,7 @@ If authentication works, you should see course information being fetched. If not
225225
- 🐛 **Issues**: [Report Problems](https://github.com/ByteTrix/Thinkific-Downloader/issues)
226226
- 💬 **Questions**: [Community Discussions](https://github.com/ByteTrix/Thinkific-Downloader/discussions)
227227
- 📚 **Documentation**: [Main README](README.md)
228+
- 🎯 **Selective Downloads**: [Download Specific Lessons Only](SELECTIVE_DOWNLOAD.md)
228229

229230
---
230231

README.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -167,6 +167,8 @@ python thinkificdownloader.py
167167
168168
> 👨‍💻 **Developer?** Visit [**DEVELOPMENT.md**](DEVELOPMENT.md) for architecture overview, API reference, and contribution guidelines.
169169
170+
> 🎯 **Want to download specific lessons only?** See our [**SELECTIVE_DOWNLOAD.md**](SELECTIVE_DOWNLOAD.md) guide for downloading individual chapters or lessons instead of the entire course.
171+
170172
## ⚙️ **Enhanced Configuration**
171173

172174
**🚨 BEFORE YOU START:** Follow our **[Complete Environment Setup Guide](ENV_SETUP.md)** for step-by-step instructions on extracting authentication data from your browser.

SELECTIVE_DOWNLOAD.md

Lines changed: 166 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,166 @@
1+
# 🎯 Selective Download Guide
2+
3+
This guide explains how to download only specific lessons from a Thinkific course instead of downloading the entire course.
4+
5+
## Overview
6+
7+
The Thinkific Downloader supports selective downloads using a JSON configuration file that specifies exactly which lessons you want to download. This is useful when you only need certain chapters or want to avoid downloading large files you don't need.
8+
9+
## Methods Available
10+
11+
### Method 1: Using Environment Variable (Recommended)
12+
13+
1. **Create or obtain a course data JSON file** (see options below)
14+
2. **Set the file path in your `.env` file:**
15+
```bash
16+
COURSE_DATA_FILE="my-selective-lessons.json"
17+
```
18+
3. **Run the downloader:**
19+
```bash
20+
python -m thinkific_downloader
21+
```
22+
23+
### Method 2: Using Command Line Flag
24+
25+
You can specify the JSON file directly via command line:
26+
```bash
27+
python -m thinkific_downloader --json my-selective-lessons.json
28+
```
29+
30+
### Method 3: Docker with Selective Downloads
31+
32+
```bash
33+
# Set in your .env file
34+
COURSE_DATA_FILE="selective-lessons.json"
35+
36+
# Run with Docker
37+
docker-compose up
38+
```
39+
40+
## Creating Your Selective JSON File
41+
42+
### Option A: From Existing Progress File
43+
44+
If you've already run the downloader once:
45+
46+
1. **Copy the generated progress file:**
47+
```bash
48+
# Windows
49+
copy "downloads\\your-course-name\\.thinkific_progress.json" "selective-lessons.json"
50+
51+
# Linux/Mac
52+
cp "downloads/your-course-name/.thinkific_progress.json" "selective-lessons.json"
53+
```
54+
55+
2. **Edit the file** to remove unwanted lessons from the `download_tasks` array
56+
57+
3. **Keep the structure intact** - only modify the contents of the arrays
58+
59+
### Option B: Create from Scratch
60+
61+
Create a JSON file with this structure:
62+
63+
```json
64+
{
65+
"analyzed_chapters": ["chapter_1", "chapter_3", "chapter_5"],
66+
"download_tasks": [
67+
{
68+
"url": "https://embed-ssl.wistia.com/deliveries/video-id-here.bin",
69+
"dest_path": "1. chapter-name\\1.lesson-name\\lesson-file.mp4",
70+
"content_type": "video"
71+
},
72+
{
73+
"url": "https://course-files.thinkific.com/document.pdf",
74+
"dest_path": "1. chapter-name\\2.document-lesson\\document.pdf",
75+
"content_type": "document"
76+
}
77+
]
78+
}
79+
```
80+
81+
## Understanding the JSON Structure
82+
83+
### `analyzed_chapters`
84+
- Array of chapter IDs that have been processed
85+
- Format: `["chapter_1", "chapter_2", "chapter_N"]`
86+
- Used to track which chapters have been analyzed
87+
88+
### `download_tasks`
89+
Each task has three required fields:
90+
91+
- **`url`**: Direct download URL for the content
92+
- **`dest_path`**: Local file path where content will be saved
93+
- **`content_type`**: Type of content (`video`, `document`, `html`, `audio`)
94+
95+
## Content Types Supported
96+
97+
| Content Type | Description | File Extensions |
98+
|--------------|-------------|-----------------|
99+
| `video` | Video lessons (Wistia, MP4) | `.mp4`, `.mov`, `.avi` |
100+
| `document` | PDF documents, slides | `.pdf`, `.ppt`, `.pptx` |
101+
| `html` | Text lessons, notes | `.html` |
102+
| `audio` | Audio files | `.mp3`, `.m4a`, `.wav` |
103+
104+
## Example Workflows
105+
106+
### Download Only Videos from Specific Chapters
107+
108+
1. Run the full download once to get the complete JSON
109+
2. Copy the progress file to `videos-only.json`
110+
3. Edit to keep only entries where `content_type` is `"video"`
111+
4. Remove chapters you don't want from `analyzed_chapters`
112+
5. Set `COURSE_DATA_FILE="videos-only.json"` in `.env`
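Step 3 of this workflow can be scripted instead of edited by hand. The sketch below assumes the progress file has the `download_tasks` structure described earlier in this guide; `filter_tasks` is a hypothetical helper, not part of the downloader, and it leaves `analyzed_chapters` untouched, so step 4 remains a manual edit.

```python
import json
from pathlib import Path
from typing import Iterable


def filter_tasks(source: Path, target: Path,
                 content_types: Iterable[str] = ("video",)) -> int:
    """Copy source to target, keeping only tasks whose content_type is wanted."""
    wanted = set(content_types)
    data = json.loads(Path(source).read_text(encoding="utf-8"))
    kept = [task for task in data.get("download_tasks", [])
            if task.get("content_type") in wanted]
    data["download_tasks"] = kept
    Path(target).write_text(json.dumps(data, indent=2), encoding="utf-8")
    return len(kept)  # number of tasks written to the target file
```

For example, `filter_tasks(Path("selective-lessons.json"), Path("videos-only.json"))` would write a videos-only copy next to the original.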
113+
114+
### Download Only First 5 Lessons
115+
116+
1. Get the complete JSON file
117+
2. Keep only the first 5 entries in `download_tasks`
118+
3. Update `analyzed_chapters` to match
119+
4. Use the modified file
120+
121+
### Skip Large Video Files
122+
123+
1. Edit the JSON to remove large video entries
124+
2. Keep documents, text, and smaller videos
125+
3. Use the filtered JSON for download
126+
127+
## Tips and Best Practices
128+
129+
### File Path Management
130+
- Use forward slashes (`/`) or escaped backslashes (`\\`) in JSON paths
131+
- Ensure destination directories exist or will be created
132+
- Keep the original folder structure for organization
133+
134+
### Performance Optimization
135+
- Fewer tasks = faster completion
136+
- Remove unnecessary content types to save bandwidth
137+
- Use `CONCURRENT_DOWNLOADS=1` for selective downloads to avoid rate limiting
138+
139+
### Validation
140+
- Ensure all URLs are accessible
141+
- Verify file paths are valid for your operating system
142+
- Test with a small subset first
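A structural check can catch most mistakes before a download starts. This sketch only verifies the fields and content types documented in this guide; `find_problems` is a hypothetical helper, and it does not test whether URLs are reachable.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = ("url", "dest_path", "content_type")
KNOWN_CONTENT_TYPES = {"video", "document", "html", "audio"}


def find_problems(data: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the structure looks valid."""
    problems: List[str] = []
    if not isinstance(data.get("analyzed_chapters"), list):
        problems.append("analyzed_chapters must be a list")
    tasks = data.get("download_tasks")
    if not isinstance(tasks, list):
        problems.append("download_tasks must be a list")
        return problems
    for index, task in enumerate(tasks):
        if not isinstance(task, dict):
            problems.append(f"download_tasks[{index}] must be an object")
            continue
        # Every task needs the three documented fields as non-empty strings.
        for field in REQUIRED_FIELDS:
            value = task.get(field)
            if not isinstance(value, str) or not value:
                problems.append(f"download_tasks[{index}]: missing or empty '{field}'")
        content_type = task.get("content_type")
        if isinstance(content_type, str) and content_type and content_type not in KNOWN_CONTENT_TYPES:
            problems.append(f"download_tasks[{index}]: unknown content_type '{content_type}'")
    return problems
```

Load the file with `json.load` first; a syntax error there corresponds to the "Invalid JSON" case under Troubleshooting.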
143+
144+
## Troubleshooting
145+
146+
### "File not found" Error
147+
```
148+
COURSE_DATA_FILE env var not set.
149+
```
150+
**Solution:** This message means `COURSE_DATA_FILE` is empty or unset. Set it in `.env` (or pass the file with `--json`), and make sure the JSON file exists in the project root directory.
151+
152+
### "Invalid JSON" Error
153+
**Solution:** Validate your JSON syntax using an online JSON validator or text editor with JSON support.
154+
155+
### Missing Downloads
156+
**Solution:** Check that the `url` fields in your JSON are still valid and accessible.
157+
158+
### Permission Errors
159+
**Solution:** Ensure the destination directories are writable and you have sufficient disk space.
160+
161+
## Getting Help
162+
163+
- See [ENV_SETUP.md](ENV_SETUP.md) for authentication setup
164+
- See [README.md](README.md) for general usage
165+
- Enable `DEBUG=true` for detailed logging
166+
- Check the generated progress files for examples of proper JSON structure

thinkific_downloader/config.py

Lines changed: 16 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -68,17 +68,30 @@ def from_env(cls):
6868
resume_partial = os.getenv('RESUME_PARTIAL', 'true').lower() in ('1', 'true', 'yes', 'on')
6969
debug = os.getenv('DEBUG', 'false').lower() in ('1', 'true', 'yes', 'on')
7070

71+
# Clean cookie data to remove Unicode characters that cause encoding issues
72+
if cookie_data:
73+
# Replace problematic Unicode characters with safe alternatives
74+
cookie_data = cookie_data.replace('\u2026', '...') # Replace horizontal ellipsis
75+
cookie_data = cookie_data.replace('\u2018', "'") # Replace left single quotation mark
76+
cookie_data = cookie_data.replace('\u2019', "'") # Replace right single quotation mark
77+
cookie_data = cookie_data.replace('\u201c', '"') # Replace left double quotation mark
78+
cookie_data = cookie_data.replace('\u201d', '"') # Replace right double quotation mark
79+
cookie_data = cookie_data.replace('\u2013', '-') # Replace en dash
80+
cookie_data = cookie_data.replace('\u2014', '-') # Replace em dash
81+
# Remove any remaining non-ASCII characters
82+
cookie_data = ''.join(c for c in cookie_data if ord(c) < 128)
83+
7184
# Validation
7285
if not client_date or not cookie_data:
7386
raise SystemExit('Cookie data and Client Date not set. Use the ReadMe file first before using this script.')
74-
87+
7588
# Basic directory permissions check
7689
cwd = Path.cwd()
7790
if not os.access(cwd, os.W_OK):
7891
raise SystemExit('Current directory is not writable.')
7992
return cls(
80-
client_date=client_date,
81-
cookie_data=cookie_data,
93+
client_date=client_date,
94+
cookie_data=cookie_data,
8295
video_download_quality=video_download_quality,
8396
output_dir=output_dir,
8497
ffmpeg_presentation_merge=ffmpeg_merge,
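The cookie clean-up added in this hunk can also be read as a small pure function. The sketch below is a hypothetical refactoring, not code from the commit: `clean_cookie_data` and `_REPLACEMENTS` are illustrative names. The likely motivation is an assumption on my part: HTTP header values are typically encoded as Latin-1, so a pasted character such as U+2026 cannot be sent as-is.

```python
# Same substitutions as the chained .replace() calls in the hunk above.
_REPLACEMENTS = {
    "\u2026": "...",  # horizontal ellipsis
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
}
_TRANSLATION_TABLE = str.maketrans(_REPLACEMENTS)


def clean_cookie_data(cookie_data: str) -> str:
    """Map common typographic characters to ASCII, then drop any other non-ASCII."""
    cleaned = cookie_data.translate(_TRANSLATION_TABLE)
    return "".join(ch for ch in cleaned if ord(ch) < 128)


raw = "session=abc\u2026; note=\u201chi\u201d"
print(clean_cookie_data(raw))  # session=abc...; note="hi"
```

Note that dropping characters changes the cookie value, so a cookie that legitimately contained non-ASCII data would no longer match what the browser sent; re-copying the cookie from DevTools is the safer fix when authentication still fails.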
