API Scraping Algorithm - Auto-Detection of Maximum ID

Problem Statement

We need to scrape user data from an API endpoint where:

  • User IDs are sequential integers starting from 1
  • Not all IDs exist (we get 404 errors for missing IDs)
  • We don't know the maximum ID in advance
  • We want to avoid unnecessary API calls

Current Approach (Inefficient)

For ID = 1 to 300:
    Fetch API data for ID
    Handle success or 404 error

Issues:

  • Fixed range requires manual updates
  • May stop too early (missing valid IDs beyond 300)
  • May make unnecessary calls (if max ID is less than 300)

Proposed Algorithm: Adaptive Range with Consecutive Failure Detection

Core Concept

Stop scraping when we encounter a configurable number of consecutive 404 errors, assuming there are no more valid IDs beyond that point.

Algorithm Flow

1. Set starting ID = 1
2. Set consecutive_404_count = 0
3. Set MAX_CONSECUTIVE_404 = 20  // Configurable threshold

4. While True:
    a. Fetch data for current ID

    b. If successful (200):
        - Store data in database
        - Reset consecutive_404_count = 0
        - Increment ID

    c. If failed (404):
        - Increment consecutive_404_count
        - Increment ID

    d. If consecutive_404_count >= MAX_CONSECUTIVE_404:
        - Stop scraping (we've likely reached the end)
        - Report statistics

5. Done
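
The flow above can be sketched in Python as follows. This is a minimal illustration, not the project's actual scraper: `scrape_until_gap` and the `fetch` callable (assumed to return a record for a 200 and `None` for a 404) are hypothetical names, and the "database" is just a dictionary.

```python
from typing import Callable, Dict, Optional


def scrape_until_gap(
    fetch: Callable[[int], Optional[dict]],
    start_id: int = 1,
    max_consecutive_404: int = 20,
) -> Dict[int, dict]:
    """Scan IDs upward until max_consecutive_404 misses occur in a row."""
    found: Dict[int, dict] = {}
    consecutive_404 = 0
    current_id = start_id
    while consecutive_404 < max_consecutive_404:
        record = fetch(current_id)  # assumption: None means "404 Not Found"
        if record is None:
            consecutive_404 += 1
        else:
            found[current_id] = record  # stand-in for "store data in database"
            consecutive_404 = 0  # a hit resets the counter
        current_id += 1
    return found


# Fake API for demonstration: IDs 1-190 exist, except for a gap at 150.
existing = set(range(1, 191)) - {150}
calls = []


def fake_fetch(user_id: int) -> Optional[dict]:
    calls.append(user_id)
    return {"id": user_id} if user_id in existing else None


users = scrape_until_gap(fake_fetch)
print(len(users), "records found; last ID requested:", calls[-1])
```

With this fake data the single gap at 150 is skipped, and the scan stops after the 20th consecutive miss, which is ID 210.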

Visual Example

ID:  1  2  3  4  5  6 ... 148 149 150 151 ... 189 190 191 192 ... 210
     ✓  ✓  ✓  ✓  ✓  ✓ ... ✓   ✓   ✗   ✓   ... ✓   ✓   ✗   ✗   ... ✗
                                  ^   ^                   ^       ^
                                  |   |                   |       |
                          count = 1   reset to 0  count = 2       count = 20 → STOP

Advantages

  1. Automatic Discovery: No need to manually set max ID
  2. Handles Gaps: Resets counter when finding valid IDs after 404s
  3. Efficient: Stops automatically when reaching the end
  4. Configurable: Adjust MAX_CONSECUTIVE_404 based on ID sparsity

Configuration Parameters

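| Parameter           | Default | Description                                 |
|---------------------|---------|---------------------------------------------|
| START_ID            | 1       | First ID to check                           |
| MAX_CONSECUTIVE_404 | 20      | Number of consecutive 404s before stopping  |
| RATE_LIMIT_DELAY    | 0.5     | Seconds to wait between requests            |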
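
One way to carry these parameters in code is a small validated settings object. The sketch below assumes a hypothetical `ScraperConfig` dataclass; only the three parameter names and defaults come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScraperConfig:
    """Hypothetical container for the three parameters listed above."""

    start_id: int = 1               # START_ID
    max_consecutive_404: int = 20   # MAX_CONSECUTIVE_404
    rate_limit_delay: float = 0.5   # RATE_LIMIT_DELAY, in seconds

    def __post_init__(self) -> None:
        # Reject values that would make the scan loop meaningless.
        if self.start_id < 1:
            raise ValueError("start_id must be >= 1")
        if self.max_consecutive_404 < 1:
            raise ValueError("max_consecutive_404 must be >= 1")
        if self.rate_limit_delay < 0:
            raise ValueError("rate_limit_delay must be >= 0")


default_config = ScraperConfig()
sparse_config = ScraperConfig(max_consecutive_404=40)  # for sparse ID ranges
print(default_config)
print(sparse_config)
```

Freezing the dataclass keeps the threshold from changing mid-scan, which makes the reported statistics easier to interpret.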

Choosing MAX_CONSECUTIVE_404

  • Sparse IDs (many gaps): Use higher value (30-50)
  • Dense IDs (few gaps): Use lower value (10-20)
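  • Current data (from the test scan of IDs 1-200):
    • Longest run observed: ~10 consecutive 404s (IDs 191-200). This run ends at the test boundary, so it may mark the end of the data rather than an interior gap.
    • Recommended: 20 (a 2x safety margin over the longest observed run)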
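
The threshold can also be derived from a test scan instead of being picked by eye. The sketch below assumes the scan results are available as a list of HTTP status codes in ID order; `longest_404_run` and `suggest_threshold` are hypothetical helper names, and the 2x factor mirrors the safety margin mentioned above.

```python
import math
from typing import Iterable


def longest_404_run(statuses: Iterable[int]) -> int:
    """Return the length of the longest run of consecutive 404 statuses."""
    longest = current = 0
    for status in statuses:
        if status == 404:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def suggest_threshold(statuses: Iterable[int], safety_factor: float = 2.0) -> int:
    """Scale the longest observed run by a safety factor (at least 1)."""
    return max(1, math.ceil(longest_404_run(statuses) * safety_factor))


# Shape of the 1-200 test described above: hits up to 190, then ten 404s.
test_scan = [200] * 190 + [404] * 10
print(longest_404_run(test_scan))    # 10
print(suggest_threshold(test_scan))  # 20
```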

Edge Cases Handled

  1. All IDs fail: Stops after MAX_CONSECUTIVE_404 attempts
  2. Intermittent 404s: Counter resets on each success
  3. Network errors: Treated as failures but logged separately
  4. API rate limiting: Configurable delay between requests
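
A rough sketch of how these cases might fit into the scan loop is shown below. It rests on several assumptions that the text above does not settle: network errors surface as `OSError` subclasses (as `ConnectionError` and `TimeoutError` do), they count toward the same consecutive-failure counter as 404s, and the names `scrape_with_error_handling`, `fetch`, and `sleep` are hypothetical.

```python
import logging
import time
from typing import Callable, Dict, Optional

logger = logging.getLogger("scraper")


def scrape_with_error_handling(
    fetch: Callable[[int], Optional[dict]],
    start_id: int = 1,
    max_consecutive_404: int = 20,
    rate_limit_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, int]:
    """Scan IDs, tallying 404s and network errors separately."""
    stats = {"found": 0, "not_found": 0, "network_errors": 0}
    consecutive_failures = 0
    current_id = start_id
    while consecutive_failures < max_consecutive_404:
        try:
            record = fetch(current_id)  # assumption: None means 404
        except OSError as exc:
            # Assumption: a network error counts as a failure, logged apart.
            stats["network_errors"] += 1
            consecutive_failures += 1
            logger.warning("network error for ID %d: %s", current_id, exc)
        else:
            if record is None:
                stats["not_found"] += 1
                consecutive_failures += 1
            else:
                stats["found"] += 1
                consecutive_failures = 0
        current_id += 1
        sleep(rate_limit_delay)  # fixed delay between requests
    return stats


def flaky_fetch(user_id: int) -> Optional[dict]:
    """Fake API: IDs 1-5 exist, but ID 3 hits a connection error."""
    if user_id == 3:
        raise ConnectionError("connection reset")
    return {"id": user_id} if user_id <= 5 else None


# A no-op sleep keeps the demonstration instant.
stats = scrape_with_error_handling(
    flaky_fetch, max_consecutive_404=4, sleep=lambda seconds: None
)
print(stats)  # {'found': 4, 'not_found': 4, 'network_errors': 1}
```

Counting network errors toward the stop condition keeps the loop from running forever during an outage, at the cost of possibly stopping early. Retrying the failed ID before counting it is a reasonable alternative.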

Performance Comparison

Scenario: Max ID is 250, with 30 failed IDs scattered throughout

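| Approach              | API Calls | Explanation                                                  |
|-----------------------|-----------|--------------------------------------------------------------|
| Fixed range (1-300)   | 300       | Continues past the actual max (50 wasted calls)              |
| Fixed range (1-200)   | 200       | Stops too early, misses every valid ID in 201-250            |
| Adaptive (MAX_404=20) | 270       | Stops after ID 270 (250 + 20 consecutive 404s)               |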
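
The call counts in this comparison can be reproduced with a small simulation. The layout of the 30 missing IDs is an assumption made for illustration (every 8th ID from 8 to 240), and the helper names are hypothetical. No network access is involved.

```python
from typing import Set, Tuple

# Assumed scenario: max ID 250, with 30 missing IDs spread evenly below it.
missing = set(range(8, 241, 8))  # 8, 16, ..., 240 -> 30 IDs
existing = set(range(1, 251)) - missing  # 220 valid IDs


def fixed_range(existing_ids: Set[int], last_id: int) -> Tuple[int, int]:
    """Return (api_calls, records_found) for a fixed scan of 1..last_id."""
    found = sum(1 for i in range(1, last_id + 1) if i in existing_ids)
    return last_id, found


def adaptive(existing_ids: Set[int], threshold: int = 20) -> Tuple[int, int]:
    """Return (api_calls, records_found) for the consecutive-404 scan."""
    calls = found = consecutive = 0
    current_id = 1
    while consecutive < threshold:
        calls += 1
        if current_id in existing_ids:
            found += 1
            consecutive = 0
        else:
            consecutive += 1
        current_id += 1
    return calls, found


print("fixed 1-300:", fixed_range(existing, 300))  # (300, 220)
print("fixed 1-200:", fixed_range(existing, 200))  # (200, 175)
print("adaptive   :", adaptive(existing))          # (270, 220)
```

Under this layout the 1-200 scan misses 45 existing records rather than a full 50, because five of the IDs in 201-250 are themselves gaps.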

Implementation Benefits

  • ✅ No manual range updates needed
  • ✅ Discovers new IDs automatically as they're added
  • ✅ Handles sparse ID distributions gracefully
  • ✅ Scan range adapts to the IDs that actually exist
  • ✅ Easy to tune via configuration