fix(glassdoor): fix CSRF token URL (404) and non-fatal GraphQL error handling #347

Open
EnxhiT wants to merge 1 commit into speedyapply:main from EnxhiT:main

@EnxhiT EnxhiT commented Mar 19, 2026

Problem

Two bugs in jobspy/glassdoor/__init__.py cause Glassdoor to return 0 results regardless of location or search term.

Bug 1 — CSRF token URL returns 404

_get_csrf_token() fetches /Job/computer-science-jobs.htm to extract the CSRF token. Since Glassdoor's migration to Next.js, this URL returns a 404. The code then falls back to a default token, but the subsequent _get_location() call fails with a 403 because the session was never properly initialized.

Fix: Fetch the homepage (/) instead, which reliably returns the token.
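A minimal sketch of the extraction step, assuming the homepage HTML embeds the token as a `"token":"..."` fragment. The helper name `extract_csrf_token`, the regex, and the fallback argument are illustrative and not taken from the jobspy source.

```python
import re

# Hypothetical pattern: assumes the page embeds the token as "token":"<value>".
TOKEN_PATTERN = re.compile(r'"token":\s*"([^"]+)"')


def extract_csrf_token(html: str, fallback: str) -> str:
    """Return the first embedded token found in html, or fallback if none is present."""
    match = TOKEN_PATTERN.search(html)
    return match.group(1) if match else fallback
```

With a 404 page there is nothing to match, so the fallback is returned, which is the situation described above.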

Bug 2 — Non-fatal GraphQL errors abort all results

_fetch_jobs_page() raises ValueError("Error encountered in API response") whenever the GraphQL response contains an errors key. In practice, Glassdoor commonly returns non-critical 503 sub-errors on peripheral fields like jobsPageSeoData (SEO metadata) while the actual jobListings data is fully intact.

This causes the scraper to discard all 30 job results on every page.

Fix: Only treat errors on the jobListings path (excluding jobsPageSeoData) as fatal.
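One way to express this check, as a sketch rather than the code in this PR: it relies on the `path` list that GraphQL attaches to execution errors. The function name `has_fatal_errors` and the `NON_FATAL_FIELDS` set are hypothetical, and errors that carry no `path` are treated as non-fatal here, which is a simplification.

```python
from typing import Any

# Assumption: peripheral fields whose errors can safely be ignored.
NON_FATAL_FIELDS = {"jobsPageSeoData"}


def has_fatal_errors(response: dict[str, Any]) -> bool:
    """True if any error's path touches jobListings and no ignorable field."""
    for error in response.get("errors", []):
        path = error.get("path") or []
        if "jobListings" in path and not NON_FATAL_FIELDS.intersection(path):
            return True
    return False
```

A caller would raise only when `has_fatal_errors(response)` is true and otherwise continue parsing `data.jobListings`.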

Verification

Tested locally against glassdoor.es with location="Spain", search_term="engineer":

  • Before fix: 0 results, ERROR - Glassdoor: Error encountered in API response
  • After fix: 30 jobs returned, totalJobsCount: 7576

Related issues: #279, #270, #273

Two bugs prevented Glassdoor from returning any results:

1. _get_csrf_token() was fetching /Job/computer-science-jobs.htm which
   now returns 404 after Glassdoor's Next.js migration. Changed to fetch
   the homepage (/) which reliably returns the token.

2. _fetch_jobs_page() treated any "errors" key in the GraphQL response
   as fatal, dropping all job results. Glassdoor commonly returns
   non-critical 503s on peripheral fields (e.g. jobsPageSeoData) while
   the actual jobListings data is intact. Now only errors on the
   jobListings path itself are treated as fatal.

Verified: 30 jobs returned for Spain/engineer with both fixes applied.
@EnxhiT EnxhiT requested a review from cullenwatson as a code owner March 19, 2026 00:58

Astidor commented Mar 23, 2026

I also encountered the same issue and found the same Bug 2. There is also another bug: an HTTP 400 error, which apparently has to do with Glassdoor changing how the graph behaves.

ERROR - JobSpy:Glassdoor - Glassdoor response status code 400
ERROR - JobSpy:Glassdoor - Glassdoor: location not parsed

@joshhaaronn

The CSRF token URL fix here is solid and worth keeping. However, PR #350 covers the other two bugs (HTTP 400 from unencoded location params, and the GraphQL errors false-positive) more completely and with simpler logic. Would be great if the CSRF fix from this PR got folded into #350 so there's one clean merge rather than two overlapping fixes.

feiyangliu2023 pushed a commit to feiyangliu2023/JobSpyFeiyang that referenced this pull request May 9, 2026
The previous commit only improved diagnostics — sources still returned
nothing. This commit makes them actually work.

glassdoor (upstream PRs speedyapply#347 + speedyapply#350):
- CSRF URL `/Job/computer-science-jobs.htm` 404s after the Next.js
  migration. Switch to homepage `/`, which reliably returns the
  embedded `"token":"..."` payload.
- `findPopularLocationAjax.htm?term=...` raw-interpolated locations
  with commas/spaces, causing HTTP 400 ("location not parsed"). Wrap
  the term in `urllib.parse.quote`. This was the actual root cause of
  every 0-row Glassdoor call — the city-only stripping in monitor's
  `run_search` only narrowed the pain, didn't fix it (Glassdoor still
  needs encoding even on bare city names with no comma).
- The `/graph` endpoint regularly returns 30 valid jobs alongside
  non-fatal `errors` on peripheral fields like `jobsPageSeoData`. Old
  code raised on ANY `errors` key and discarded everything. Only fail
  when `data.jobListings` is missing.
- `_get_location` now raises GlassdoorException with the underlying
  HTTP status code instead of returning `(None, None)` and losing
  context.
- Per-instance header dict so concurrent scrapes don't trample each
  other's CSRF token / UA, and so `_fetch_job_description` reads this
  scrape's token rather than whatever the most recent Glassdoor
  instance left in the module-global.
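
The location-encoding fix in the list above could look roughly like this. The helper name `build_location_url` and the `base_url` parameter are assumptions for illustration; only the `findPopularLocationAjax.htm?term=` path and the use of `urllib.parse.quote` come from the commit message.

```python
from urllib.parse import quote


def build_location_url(base_url: str, term: str) -> str:
    """Build the location lookup URL with the search term percent-encoded."""
    # safe="" also encodes "/" so the whole term stays inside one query value.
    return f"{base_url}/findPopularLocationAjax.htm?term={quote(term, safe='')}"
```

For example, `"Madrid, Spain"` becomes `Madrid%2C%20Spain` in the query string instead of being interpolated raw.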

bayt:
- Slugify now folds accented Latin chars to ASCII before stripping
  non-ASCII (`ingénieur logiciel` → `ingenieur-logiciel` rather than
  `ingnieur-logiciel`).
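
A sketch of the accent folding described above, using Unicode NFKD normalization from the standard library. The function name `slugify` and the exact hyphenation rule are assumptions, not the code from the commit.

```python
import re
import unicodedata


def slugify(text: str) -> str:
    """Fold accented Latin characters to ASCII, then hyphenate."""
    # NFKD splits "é" into "e" plus a combining accent; the ASCII encode drops the accent.
    folded = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", folded.lower()).strip("-")
```

Stripping non-ASCII without normalizing first would delete the `é` entirely, which is how `ingnieur-logiciel` arises.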

google:
- The job-listing wrapper key (`520084652`) is a Google-internal
  function ID that gets rotated on redeploy; old parser failed
  immediately when it changed. Make the parser tolerate rotation:
  try the known keys first, then fall back to scanning every
  9-digit-key payload whose first record structurally matches a job
  shape (title/company/location strings + nested URL list). Log the
  fallback key when it fires so we can bake it back into the known
  list.
- `find_job_info` (used by paginated responses) gets the same
  treatment — if no known key matches, walk the JSON looking for
  arrays whose first item shape-matches a job record.
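
The key-rotation fallback could be sketched as below. The record layout tested by `looks_like_job` (three leading strings followed by a nested URL list) is a guess based on the description above, and both function names are hypothetical; the real Google payload is not shown in this thread.

```python
import re
from typing import Any, Optional

KNOWN_KEYS = ["520084652"]  # the key quoted in the commit message


def looks_like_job(record: Any) -> bool:
    """Hypothetical shape test: title/company/location strings, then a nested URL list."""
    if not isinstance(record, list) or len(record) < 4:
        return False
    if not all(isinstance(field, str) for field in record[:3]):
        return False
    urls = record[3]
    return (
        isinstance(urls, list) and bool(urls)
        and isinstance(urls[0], list) and bool(urls[0])
        and isinstance(urls[0][0], str) and urls[0][0].startswith("http")
    )


def find_job_payload(data: dict[str, Any]) -> Optional[tuple[str, list]]:
    """Try known keys first, then any 9-digit key whose first record looks like a job."""
    for key in KNOWN_KEYS:
        if key in data:
            return key, data[key]
    for key, value in data.items():
        if re.fullmatch(r"\d{9}", key) and isinstance(value, list) and value and looks_like_job(value[0]):
            return key, value  # caller can log this key and add it to KNOWN_KEYS
    return None
```

Returning the matched key alongside the payload lets the caller log a rotated key, in line with the intent of baking it back into the known list.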