
CodeChef HTML scraper breaks silently on any frontend layout change #216

@amaydixit11


Bug Description

The CodeChef leaderboard relies entirely on HTML scraping (BeautifulSoup), which breaks whenever CodeChef changes its page layout. There is no fallback when parsing fails.

Affected Files

  • api/leaderboard/views.py:332-386 (CodechefLeaderboard.get_codechef_data)
  • api/leaderboard/management/commands/update_db.py:64-122 (codechef_user_update)

Problem

# views.py:340-348
rating_div = soup.find("div", class_="rating-number")
if not rating_div:
    if "user not found" in response.text.lower():
        return "NOT_FOUND"
    logger.error(f"CodeChef parsing error for {username}: rating_div not found")
    return "TRANSIENT_ERROR"

# views.py:350-363
instance["highest_rating"] = (
    container_highest_rating.find_next("small")
    .text.split()[-1]
    .rstrip(")")
)
container_ranks = soup.find("div", class_="rating-ranks")
ranks = container_ranks.find_all("a")
instance["global_rank"] = ranks[0].strong.text

This scraper depends on:

  • div.rating-number — CSS class can change anytime
  • div.rating-header + next small tag — fragile DOM structure
  • div.rating-ranks + ranks[0].strong + ranks[1].strong — assumes exact structure
  • img.profileImage[-1]["src"] — assumes profile images exist in DOM

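Any CodeChef frontend update that renames or moves these elements makes the scraper return "TRANSIENT_ERROR" for ALL users. The only signal is a per-user log line, so the leaderboard stops refreshing CodeChef data without any visible explanation. In the excerpt above, the rank and highest-rating lookups also have no None checks, so a missing element there raises AttributeError instead of returning a status string.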
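One way to limit the blast radius is to parse each field independently, so a single broken selector does not discard the fields that still parse. The following is a minimal sketch, not the project's code: `extract_fields` and its `extractors` mapping are hypothetical names, and the selector wiring shown in the trailing comment simply reuses the selectors quoted above.

```python
import logging
from typing import Any, Callable, Dict, List, Tuple

logger = logging.getLogger(__name__)


def extract_fields(
    page: Any,
    extractors: Dict[str, Callable[[Any], Any]],
    username: str,
) -> Tuple[Dict[str, Any], List[str]]:
    """Hypothetical helper: run every extractor on its own and collect
    the values that parsed plus the names of the fields that failed."""
    data: Dict[str, Any] = {}
    failed: List[str] = []
    for name, extract in extractors.items():
        try:
            data[name] = extract(page)
        except (AttributeError, IndexError, KeyError, TypeError, ValueError) as exc:
            # soup.find() returning None surfaces here as AttributeError
            logger.warning(
                "CodeChef field %r failed for %s: %r", name, username, exc
            )
            failed.append(name)
    return data, failed


# Possible wiring with a BeautifulSoup object (selectors taken from the issue):
# extractors = {
#     "rating": lambda s: int(s.find("div", class_="rating-number").text.strip()),
#     "global_rank": lambda s: (
#         s.find("div", class_="rating-ranks").find_all("a")[0].strong.text
#     ),
# }
# data, failed = extract_fields(soup, extractors, username)
```

The caller can then decide what to do with a partial result, for example keep the previously stored value for every name in `failed` and return "TRANSIENT_ERROR" only when nothing parsed at all.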

Steps to Reproduce

  1. CodeChef ships a frontend update that changes any of the class names above
  2. All CodeChef data fails to refresh
  3. Users see stale or missing data with no indication why

Proposed Fix

  1. Short term: Add more robust error handling with partial data returns:
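try:
    # Primary selector, matching the current CodeChef markup
    rating_div = soup.find("div", class_="rating-number")
    if rating_div:
        instance["rating"] = int(rating_div.text.strip())
    else:
        # Try alternative selectors (guesses; verify against the live page)
        rating_span = soup.select_one("span.rating, .rating-value")
        if rating_span:
            instance["rating"] = int(rating_span.text.strip())
        else:
            logger.error(f"CodeChef parsing error for {username}: no rating selector matched")
            return "TRANSIENT_ERROR"
except Exception:
    # logger.exception records the traceback, so the failing line is visible in the logs
    logger.exception(f"Failed to parse rating for {username}")
    return "TRANSIENT_ERROR"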
  2. Long term: Use CodeChef's official API if available: https://www.codechef.com/api/rankings/rating or check whether they have released a proper API since this scraper was written.

  3. Monitoring: Add a health check that alerts when multiple consecutive users fail to parse, so maintainers know scraping is broken.
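The monitoring idea could be sketched roughly as below. This is an assumption-laden illustration, not existing project code: `ScrapeHealthMonitor`, its `threshold` default, and the choice to treat "NOT_FOUND" as a healthy response are all hypothetical, and the alert here is just a critical log line that a real deployment would route to whatever alerting it already has.

```python
import logging

logger = logging.getLogger(__name__)


class ScrapeHealthMonitor:
    """Hypothetical tracker that flags the scraper as broken after
    `threshold` consecutive per-user parse failures."""

    def __init__(self, threshold: int = 5) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0
        self.alerted = False

    def record(self, result) -> bool:
        """Record one per-user result; return True if an alert fired now."""
        if result != "TRANSIENT_ERROR":
            # Parsed data or a genuine NOT_FOUND: the page is still readable
            self.consecutive_failures = 0
            self.alerted = False
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.alerted:
            # Alert once per outage instead of once per failing user
            self.alerted = True
            logger.critical(
                "CodeChef scraping looks broken: %d consecutive parse failures",
                self.consecutive_failures,
            )
            return True
        return False
```

In the update command, one monitor instance would be created per run and `record()` called with the return value of each user's scrape. Counting consecutive failures rather than total failures keeps a single flaky request from triggering an alert.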

Severity

MEDIUM — Silent failures with no user notification. Affects all CodeChef leaderboard users.


Labels: advanced, bug, security
