Seattle-Source-Ranker supports multi-token rotation to raise the effective GitHub API rate limit. With multiple tokens, you can significantly increase collection throughput and run large-scale data collection efficiently.
- Automatically loads multiple tokens from `.env.tokens`
- Thread-safe round-robin rotation (see the sketch below)
- Supports environment variables and configuration files
- Dynamic token selection based on rate limit status
- Uses REST API with token rotation
- Automatic fallback when tokens are rate-limited
- Individual repo API calls for topics collection
- 8 workers × 2 concurrency = 16 parallel tasks
- Batch processing with Celery + Redis
- Automatic worker management
- Progress monitoring and retry logic
- Support for large-scale collections (~430K+ projects)
- Automatically loads all tokens from `.env.tokens`
- Passes tokens to all workers
- Creates logs in project root directory
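
A minimal sketch of how the `.env.tokens` loading and round-robin rotation could look. The class and method names here are hypothetical illustrations, not the actual `seattle_source_ranker.tokens` API:

```python
# Minimal sketch of .env.tokens loading plus thread-safe round-robin rotation.
# Names are hypothetical; the real logic lives in seattle_source_ranker.tokens.
import itertools
import os
import re
import threading


class RoundRobinTokens:
    def __init__(self, path: str = ".env.tokens"):
        tokens = []
        if os.path.exists(path):
            with open(path) as fh:
                for line in fh:
                    match = re.match(r"GITHUB_TOKEN_\d+=(\S+)", line.strip())
                    if match:
                        tokens.append(match.group(1))
        if not tokens and os.getenv("GITHUB_TOKEN"):
            tokens.append(os.environ["GITHUB_TOKEN"])  # single-token fallback
        if not tokens:
            raise RuntimeError("No GitHub tokens configured")
        self.count = len(tokens)
        self._cycle = itertools.cycle(tokens)
        self._lock = threading.Lock()  # rotation stays safe across worker threads

    def next_token(self) -> str:
        with self._lock:
            return next(self._cycle)


# Usage: attach the next token to each outgoing GitHub API request.
# manager = RoundRobinTokens()
# headers = {"Authorization": f"token {manager.next_token()}"}
```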
With a single token:
- API Limit: 5,000 requests/hour
- Collection Speed: ~50 users/minute
- Large collection (30,000 users): ~10 hours

With three tokens:
- API Limit: 15,000 requests/hour (3× improvement)
- Collection Speed: ~150 users/minute
- Large collection (30,000 users): ~3.5 hours
- Suitable for GitHub Actions (6-hour limit)

A typical full collection run:
- Workers: ~8 workers × ~2 concurrency
- Users Processed: ~28K+
- Total Time: ~60-90 minutes for full collection
- Success Rate: ~99%+
- Data Generated: ~8K+ paginated JSON files
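
These figures follow from simple arithmetic; as a back-of-the-envelope check (a sketch, not part of the project code):

```python
# Back-of-the-envelope check of the throughput figures above.
tokens = 3
requests_per_token_per_hour = 5_000                             # GitHub per-token limit
total_requests_per_hour = tokens * requests_per_token_per_hour  # 15,000

users_per_minute = 150                                          # multi-token collection speed
users_to_collect = 30_000
hours = users_to_collect / users_per_minute / 60                # ~3.3 hours
print(f"{total_requests_per_hour} requests/hour, ~{hours:.1f} h for {users_to_collect:,} users")
```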
Create `.env.tokens` in the project root:
```
# .env.tokens
GITHUB_TOKEN_1=ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
GITHUB_TOKEN_2=ghp_yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
GITHUB_TOKEN_3=ghp_zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
```

Important: This file is already in `.gitignore` and won't be committed to Git.
- Go to: https://github.com/settings/tokens
- Click "Generate new token (classic)"
- Set scopes: `public_repo`, `read:user`
- Copy the token and add it to `.env.tokens`
- Repeat for each additional token (use different GitHub accounts if needed)
```bash
# Count configured tokens
grep -c "^GITHUB_TOKEN_[0-9]=" .env.tokens
# Should show: 3 (or your configured number)

# Test tokens via collection script
python3 -c "from seattle_source_ranker.tokens import get_token_manager; tm = get_token_manager(); print(f'{len(tm.get_all_tokens())} tokens loaded')"
```

To run a quick end-to-end collection:

```bash
# 1. Start Redis
redis-server --daemonize yes
# 2. Start workers (auto-loads tokens)
bash scripts/start_workers.sh
# 3. Collect 100 projects
python3 -m seattle_source_ranker.collector.distributed_collector --target 100 --max-users 50 --batch-size 10
```

For a medium collection:

```bash
python3 -m seattle_source_ranker.collector.distributed_collector \
--target 1000 \
--max-users 5000 \
--batch-size 50
```

For a large collection (30,000 users):

```bash
python3 -m seattle_source_ranker.collector.distributed_collector \
--target 1000000 \
--max-users 30000 \
--batch-size 10
```

After collection completes:
```bash
python3 scripts/generate_frontend_data.py
```

This creates ~8K+ paginated JSON files in `frontend/public/pages/`.
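
The exact page layout is determined by `generate_frontend_data.py`; purely as an illustration of the pagination idea (the file names, page size, and record shape below are hypothetical):

```python
# Illustrative only: split project records into fixed-size JSON pages.
# The real page size, file names, and record shape are defined by
# scripts/generate_frontend_data.py, not here.
import json
from pathlib import Path


def write_pages(projects, out_dir="frontend/public/pages", page_size=50):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    page_count = 0
    for start in range(0, len(projects), page_size):
        page_count += 1
        page = projects[start:start + page_size]
        (out / f"page_{page_count}.json").write_text(json.dumps(page, indent=2))
    return page_count
```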
To monitor the collection:

```bash
# View active workers
ps aux | grep celery

# Check worker logs
tail -f logs/worker1.log
tail -f logs/worker*.log  # All workers
```

The collector shows real-time progress:
- [OK] Completed batches
- [STATS] Total projects collected
- ⏱️ Elapsed time and ETA
- [RETRY] Success/failure rates
To stop the workers:

```bash
bash scripts/stop_workers.sh
# or
pkill -f 'celery.*collection_worker'
```

If tokens are not loading:

```bash
# Check file permissions
ls -la .env.tokens
# Manually test
export $(grep -v '^#' .env.tokens | xargs)
echo $GITHUB_TOKEN_1
```

If workers are not starting:

```bash
# Ensure you're in the project root directory
cd <project-root>
bash scripts/start_workers.sh
```

If you keep hitting rate limits:

- Each token has an independent rate limit (5,000 requests/hour)
- 3 tokens = 15,000 requests/hour total
- System automatically waits when all tokens are rate-limited
- Consider adding more tokens or increasing delays
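
To see which tokens still have quota, you can query GitHub's `/rate_limit` endpoint for each configured token. A standard-library sketch (the helper below is hypothetical, not part of the project):

```python
# Query GitHub's /rate_limit endpoint for every token in .env.tokens.
# Calls to /rate_limit do not count against the rate limit itself.
import json
import re
import urllib.request


def check_tokens(path=".env.tokens"):
    with open(path) as fh:
        tokens = re.findall(r"^GITHUB_TOKEN_\d+=(\S+)", fh.read(), re.MULTILINE)
    for i, token in enumerate(tokens, start=1):
        request = urllib.request.Request(
            "https://api.github.com/rate_limit",
            headers={"Authorization": f"token {token}",
                     "Accept": "application/vnd.github+json"},
        )
        with urllib.request.urlopen(request) as response:
            core = json.load(response)["rate"]
        print(f"token {i}: {core['remaining']}/{core['limit']} requests remaining")


if __name__ == "__main__":
    check_tokens()
```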
If workers report errors:

```bash
# View detailed logs
tail -f logs/worker1.log
# Restart workers
bash scripts/stop_workers.sh
bash scripts/start_workers.sh
```

If Redis is not responding:

```bash
# Check Redis status
redis-cli ping
# Should return: PONG
# Restart Redis
sudo systemctl restart redis
```

To change the number of workers (default: 8), edit `scripts/start_workers.sh`.
To change per-worker concurrency, edit the worker startup commands:

```bash
--concurrency=2  # Change to 3, 4, etc.
```

To adjust the delay between requests, edit `src/seattle_source_ranker/collector/collection_worker.py`:

```python
time.sleep(0.05)  # Adjust delay between requests (default: 50ms)
```

- Start Small: Test with ~100-1000 projects before large collections
- Monitor Logs: Watch for errors or rate limit warnings
- Backup Data: Collection data is saved incrementally
- Use Multiple Tokens: 3+ tokens recommended for large collections
- Be Patient: Large collections (~430K+ projects) take ~60-90 minutes
- Check Disk Space: ~430K projects ≈ ~200-300MB raw data + frontend files
- Python 3.11+
- Redis 6.0+
- 8GB+ RAM (for large collections)
- 5GB+ disk space
- Stable internet connection
- Multiple GitHub accounts (for multiple tokens)
- Troubleshooting Guide - Common issues and solutions
- Changelog - Version history and release notes
- Contributing - How to contribute to this project
- GitHub API Documentation - REST API reference