We need to scrape user data from an API endpoint where:
- User IDs are sequential integers starting from 1
- Not all IDs exist (we get 404 errors for missing IDs)
- We don't know the maximum ID in advance
- We want to avoid unnecessary API calls
For ID = 1 to 300:
    Fetch API data for ID
    Handle success or 404 error
Issues:
- Fixed range requires manual updates
- May stop too early (missing valid IDs beyond 300)
- May make unnecessary calls (if max ID is less than 300)
Stop scraping when we encounter a configurable number of consecutive 404 errors, assuming there are no more valid IDs beyond that point.
1. Set starting ID = 1
2. Set consecutive_404_count = 0
3. Set MAX_CONSECUTIVE_404 = 20 // Configurable threshold
4. While True:
   a. Fetch data for current ID
   b. If successful (200):
      - Store data in database
      - Reset consecutive_404_count = 0
      - Increment ID
   c. If failed (404):
      - Increment consecutive_404_count
      - Increment ID
   d. If consecutive_404_count >= MAX_CONSECUTIVE_404:
      - Stop scraping (we've likely reached the end)
      - Report statistics
5. Done
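The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: `fetch_user` is a hypothetical callable (not part of the original design) that returns the record for an ID or `None` on a 404, and an in-memory dict stands in for the database so the loop can be exercised without a real endpoint.

```python
import time
from typing import Callable, Optional


def scrape_until_gap(
    fetch_user: Callable[[int], Optional[dict]],  # hypothetical: None means 404
    start_id: int = 1,
    max_consecutive_404: int = 20,
    rate_limit_delay: float = 0.0,
) -> dict:
    """Walk IDs upward until max_consecutive_404 misses occur in a row."""
    found = {}              # stand-in for the database
    consecutive_404 = 0
    calls = 0
    current_id = start_id

    while consecutive_404 < max_consecutive_404:
        record = fetch_user(current_id)
        calls += 1
        if record is not None:
            found[current_id] = record
            consecutive_404 = 0      # a hit resets the counter
        else:
            consecutive_404 += 1     # a miss extends the current gap
        current_id += 1
        if rate_limit_delay:
            time.sleep(rate_limit_delay)

    return {"found": found, "calls": calls, "last_id_checked": current_id - 1}


# Fake endpoint for illustration: IDs 1-190 exist, except 150.
def fake_fetch(user_id: int) -> Optional[dict]:
    if 1 <= user_id <= 190 and user_id != 150:
        return {"id": user_id}
    return None


stats = scrape_until_gap(fake_fetch)
print(stats["calls"], stats["last_id_checked"], len(stats["found"]))
# prints: 210 210 189
```

With the fake endpoint, the single miss at ID 150 is absorbed by the counter reset, and the run ends 20 IDs after the last valid one.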
    ID:  1  2  3  4  5  6  ...  148  149  150  151  ...  189  190  191  192  ...  210
         ✓  ✓  ✓  ✓  ✓  ✓  ...   ✓    ✓    ✗    ✓   ...   ✓    ✓    ✗    ✗   ...   ✗
                                                ^                        ^         ^
                                                |                        |         |
                                          Reset counter      404 count = 2  404 count = 20 → STOP
- Automatic Discovery: No need to manually set max ID
- Handles Gaps: Resets counter when finding valid IDs after 404s
- Efficient: Stops automatically when reaching the end
- Configurable: Adjust `MAX_CONSECUTIVE_404` based on ID sparsity
| Parameter | Default | Description |
|---|---|---|
| `START_ID` | 1 | First ID to check |
| `MAX_CONSECUTIVE_404` | 20 | Number of consecutive 404s before stopping |
| `RATE_LIMIT_DELAY` | 0.5 | Seconds to wait between requests |
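One way to carry these parameters in code is a small settings object. The sketch below is illustrative only: the `ScraperConfig` class and the idea of reading overrides from environment variables named after the parameters are assumptions, not something the original design specifies.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ScraperConfig:
    """Hypothetical container for the three parameters, same defaults as the table."""
    start_id: int = 1
    max_consecutive_404: int = 20
    rate_limit_delay: float = 0.5

    def __post_init__(self) -> None:
        # Reject values that would make the stopping rule meaningless.
        if self.max_consecutive_404 < 1:
            raise ValueError("max_consecutive_404 must be at least 1")
        if self.rate_limit_delay < 0:
            raise ValueError("rate_limit_delay must not be negative")

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "ScraperConfig":
        """Read overrides from environment-style variables (assumed naming)."""
        env = os.environ if env is None else env
        return cls(
            start_id=int(env.get("START_ID", cls.start_id)),
            max_consecutive_404=int(
                env.get("MAX_CONSECUTIVE_404", cls.max_consecutive_404)
            ),
            rate_limit_delay=float(
                env.get("RATE_LIMIT_DELAY", cls.rate_limit_delay)
            ),
        )


# Override only the threshold; the other values keep their defaults.
config = ScraperConfig.from_env({"MAX_CONSECUTIVE_404": "40"})
print(config)
```

Passing a plain mapping instead of the real environment keeps the example deterministic; calling `ScraperConfig.from_env()` with no argument would read `os.environ`.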
- Sparse IDs (many gaps): Use higher value (30-50)
- Dense IDs (few gaps): Use lower value (10-20)
- Current data (from 1-200 test):
  - Longest gap observed: ~10 consecutive 404s (IDs 191-200)
  - Recommended: 20 (provides 2x safety margin)
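The threshold can also be derived from a test run rather than picked by eye. The sketch below measures the longest run of consecutive 404s and applies the 2x safety margin mentioned above. The helper names and the floor of 10 (taken from the low end of the dense-ID range) are assumptions for illustration.

```python
import math
from typing import Iterable


def longest_404_run(exists: Iterable[bool]) -> int:
    """Length of the longest run of missing IDs; exists[i] is True for a hit."""
    longest = current = 0
    for ok in exists:
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest


def recommend_threshold(
    longest_gap: int, safety_factor: float = 2.0, minimum: int = 10
) -> int:
    """Scale the observed gap by a safety factor, never going below `minimum`."""
    return max(minimum, math.ceil(longest_gap * safety_factor))


# Illustrative test run over IDs 1-200: ID 150 and IDs 191-200 returned 404.
observed = [(i != 150 and i <= 190) for i in range(1, 201)]
gap = longest_404_run(observed)
print(gap, recommend_threshold(gap))
# prints: 10 20
```

A gap measured at the end of a test range may be the true end of the data rather than a hole in it, so the result is best treated as a starting value to revisit.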
- All IDs fail: Stops after MAX_CONSECUTIVE_404 attempts
- Intermittent 404s: Counter resets on each success
- Network errors: Treated as failures but logged separately
- API rate limiting: Configurable delay between requests
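The first three cases can be illustrated with a loop that sorts each outcome into its own bucket. This is a sketch under assumptions: `fetch_status` is a hypothetical callable that returns an HTTP status code or raises an `OSError` subclass (such as `ConnectionError` or `TimeoutError`) on a network problem, and network errors are assumed to count toward the consecutive-failure counter, which is one reading of "treated as failures".

```python
import logging
from typing import Callable

log = logging.getLogger("scraper")


def scrape_with_error_tracking(
    fetch_status: Callable[[int], int],  # hypothetical: returns a status code
    max_consecutive_404: int = 20,
    start_id: int = 1,
) -> dict:
    """Classify each ID as ok, not_found, or error; stop after a run of failures."""
    stats = {"ok": [], "not_found": [], "errors": []}
    consecutive_failures = 0
    current_id = start_id

    while consecutive_failures < max_consecutive_404:
        try:
            status = fetch_status(current_id)
        except OSError as exc:  # ConnectionError and TimeoutError are subclasses
            log.warning("network error for ID %d: %s", current_id, exc)
            stats["errors"].append(current_id)   # logged separately from 404s
            consecutive_failures += 1
        else:
            if status == 200:
                stats["ok"].append(current_id)
                consecutive_failures = 0         # counter resets on each success
            elif status == 404:
                stats["not_found"].append(current_id)
                consecutive_failures += 1
            else:                                # any other status: assumed failure
                stats["errors"].append(current_id)
                consecutive_failures += 1
        current_id += 1

    return stats


# All IDs fail: the loop gives up after exactly max_consecutive_404 attempts.
all_missing = scrape_with_error_tracking(lambda _id: 404)


# IDs 1-5 exist, but ID 3 hits a simulated network error.
def flaky(user_id: int) -> int:
    if user_id == 3:
        raise ConnectionError("simulated reset")
    return 200 if user_id <= 5 else 404


mixed = scrape_with_error_tracking(flaky)
print(len(all_missing["not_found"]), mixed["ok"], mixed["errors"])
# prints: 20 [1, 2, 4, 5] [3]
```

Keeping errors in their own bucket makes it possible to retry those IDs later instead of assuming they do not exist.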
Scenario: Max ID is 250, with 30 failed IDs scattered throughout
| Approach | API Calls | Explanation |
|---|---|---|
| Fixed range (1-300) | 300 | Continues past actual max |
| Fixed range (1-200) | 200 | Stops too early, misses every valid ID in 201-250 |
| Adaptive (MAX_404=20) | 270 | Stops at ID 270 (250 + 20 consecutive fails) |
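The call counts in the table can be reproduced with a small simulation. The layout of the 30 missing IDs below (every 8th ID from 8 to 240) is an assumption chosen only to match the scenario; any layout without a gap of 20 or more gives the same adaptive count.

```python
def adaptive_calls(valid_ids: set, max_consecutive_404: int = 20, start_id: int = 1) -> int:
    """Count the API calls the adaptive loop makes before it stops."""
    misses = 0
    calls = 0
    current = start_id
    while misses < max_consecutive_404:
        calls += 1
        misses = 0 if current in valid_ids else misses + 1
        current += 1
    return calls


def fixed_range_calls(last_id: int, start_id: int = 1) -> int:
    """A fixed range always makes one call per ID, valid or not."""
    return last_id - start_id + 1


# Scenario: max ID is 250, with 30 missing IDs scattered throughout (assumed layout).
missing = set(range(8, 241, 8))             # 8, 16, ..., 240 -> 30 IDs
valid_ids = set(range(1, 251)) - missing    # 220 valid IDs

print(fixed_range_calls(300))               # 300
print(fixed_range_calls(200))               # 200
print(adaptive_calls(valid_ids))            # 270

# Valid IDs a 1-200 range never reaches under this layout.
missed_by_200 = len([i for i in valid_ids if i > 200])
print(missed_by_200)                        # 45
```

Under this layout the 1-200 range skips 45 valid IDs, since 5 of the 50 IDs in 201-250 are themselves missing.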
- ✅ No manual range updates needed
- ✅ Discovers new IDs automatically as they're added
- ✅ Handles sparse ID distributions gracefully
- ✅ Stopping point adapts to the data actually observed
- ✅ Easy to tune via configuration