Florida Scraper Class#1961
Conversation
Might be abstract nonsense, but we'll see
Adds 16 async unittest cases (httpx.MockTransport) covering RequestManager lifecycle, handler hook ordering, RateLimit (including concurrent serialization), and Retry. Fixes two bugs surfaced by the new tests: - ScheduledRequest.reset() cancelled response_future unconditionally, killing listen handlers that were already awaiting it on the first attempt (and silently disabling Retry). Now only swaps the future when it's already done. - make_default_logger(__name__) was using the module name as a log file path, dropping a file in the repo root each run. Matches every other call site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds FloridaScraper built on RequestManager, covering courts and case metadata enumeration, case-list pagination with binary date-range subdivision under the 10k-result API cap, and per-case fetch. backfill() yields CaseRefs for downstream ingestion. Design choices baked in: - ScrapedCourtExternalID IntEnum names the 7 in-scope appellate courts. The API silently adding ID 8 should not start scraping a court we haven't audited; adding one here is an explicit decision. - Single-day >10k overflow raises InsanityException (matches Texas TAMES). Soft-warning would silently truncate by filedDate asc, which is a data-integrity bug in a historical backfill. - _write_json uses compact separators, no indent, no sort_keys. Optimizes for archive storage; files can be reformatted locally if inspection is needed. Tests: 13 cases covering URL/path helpers, datetime formatting, the filename-safe sanitizer, and the scraper's core flows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The scraper no longer persists anything. Case-list and per-case payloads are yielded to the caller (CourtListener) which is responsible for storage. Drops _write_json, _fs_safe, _case_dir, the output_root field, and the Path argument on run_backfill. The /courts and per-court reference metadata endpoints (casepartysubtypes, casecategories, docketentrysubtypes) are hit at most once each per scraper instance via lazy in-memory caches. New dataclasses: - CourtMetadata bundles the three reference payloads for one court. - CaseData bundles a CaseRef with every parsed payload (case, hearings, parties, docket entries, and docket-entry documents keyed by docketEntryUUID). backfill() now yields CaseData instead of CaseRef. CaseRef survives as the discovery handle from enumerate_cases. run_backfill keeps its int count return and discards payloads; callers wanting the data should drive FloridaScraper.backfill() directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five bugs surfaced while updating tests after the 04cdc9f cleanup: - fetch_case_data looked up case.court_id (the validator-rewritten juriscraper id like "flca04") in FLORIDA_COURT_EXTERNAL_ID_MAP, which is keyed by the API's external id ("5"). The lookup never hit, so backfill(full_scrape=True) raised ValueError on every case. Switched to FloridaCourtID(case.court_id) and added a distinct error for known-but-uncached courts. - _fetch_paginated initialised `page = 0` and never reassigned it; only the unused `next_page` sentinel incremented. Multi-page responses looped forever fetching page 0. Dropped the dead variable. - enumerate_cases broke on page_number >= total_pages, firing one page late. Now breaks on page_number + 1 >= total_pages. - typing.override only exists in stdlib from 3.12+, but requires-python is ">=3.10". Imported from typing_extensions instead in common.py, cases.py, docket_entries.py, documents.py, and parties.py. tox -e py310 could not collect the florida tests before this change. - __init__.py still re-exported CaseData, CaseRef, and run_backfill — symbols removed in b93e970 and 04cdc9f. Cleaned up. Test updates remove a sys.modules/importlib workaround that the test file used to dodge the broken __init__.py, and simplify the _parse_case helper that was mutating court_id to work around the fetch_case_data bug. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
I'm always forgetting the changelog :( |
albertisfu
left a comment
There was a problem hiding this comment.
Thanks @MorganBennetDev, this looks good. I have a few comments and suggestions.
Additionally, one architectural concern is the introduction of a new RequestManager class. My concern is that we'll end up maintaining similar code and different approaches across our state scrapers. For TAMES, we already have ScraperRequestManager in BaseStateScraper.py and RateLimitedRequestManager in cl/scrapers/management/commands/back_scrape_dockets.py.
Is this new RequestManager very different from the one used for TAMES? How hard would it be, and what downsides would we have, if we instead extended and reused the TAMES ScraperRequestManager / RateLimitedRequestManager for Florida? We could even move RateLimitedRequestManager to Juriscraper if that makes things easier.
| totals = {r.page.total_elements for r in results} | ||
| if len(totals) > 1: | ||
| logger.error( | ||
| f"Paginated fetch returned different totalElements across fetches ({totals})." |
There was a problem hiding this comment.
Can we log more params here that would help us identify where this issue happened? For example, the endpoint and other useful params, so we can try to replicate the request manually and check if there is any inconsistency that we should fix.
Also, is it ok to still return results here, or would it be better to just raise the error and return?
There was a problem hiding this comment.
Also, is it ok to still return results here, or would it be better to just raise the error and return?
I'm working under the assumption that if the number of results changes while we're iterating through the pages, it means that something was uploaded that matches our search. This means that if we encounter this error, we should have partial results but will need to redo for the date range in question to get the complete set of results (will add that to log).
There's also the chance that a docket was removed while we're iterating, but I'm assuming that case is too rare to care about.
| raise ValueError( | ||
| "Unknown court id %r. Call fetch_courts() first." % court_id | ||
| ) | ||
| court_external_id = str(court_metadata.court.external_identifier) |
There was a problem hiding this comment.
This court_external_id is a string here, but it's an integer on this line:
It might not be an issue, but I just wanted to confirm which one is correct so we can keep it consistent.
There was a problem hiding this comment.
Should be an int in both places to stay consistent with courts.py
| case_uuid = output_case.case_uuid | ||
|
|
||
| FloridaCaseInfoParser.populate_transfers(output_case) | ||
|
|
There was a problem hiding this comment.
Is this endpoint supposed to fetch the case detail data?
I'm seeing that we have the /courts/{court}/cms/cases/{case} endpoint defined in FloridaCaseInfoParser, but I don't see it being used here or anywhere else to fetch the case detail data.
There was a problem hiding this comment.
Yes, I forgot that there's more data when we fetch the case directly.
|
|
||
| de_parser = FloridaDocketEntryListParser(court_id=court_id.value) | ||
| parties_parser = FloridaPartyListParser(court_id=court_id.value) | ||
|
|
There was a problem hiding this comment.
It seems that we're also missing fetching the /courts/{court}/cms/cases/{case}/hearings endpoint?
There was a problem hiding this comment.
This will necessitate adding a field to FloridaCase to nicely capture hearings. Adding.
| def __del__(self) -> None: | ||
| """Clean up allocated resources. Cancels the request loop if it's running | ||
| and closes the client if it's open.""" | ||
| if hasattr(self, "loop_future") and self._loop_future: |
There was a problem hiding this comment.
Shouldn't be _loop_future here?
| if hasattr(self, "loop_future") and self._loop_future: | |
| if hasattr(self, "_loop_future") and self._loop_future: |
| # Best-effort async cleanup; callers should prefer `await close()`. | ||
| try: | ||
| loop = asyncio.get_running_loop() | ||
| except RuntimeError: | ||
| try: | ||
| asyncio.run(client.aclose()) | ||
| except Exception: | ||
| pass | ||
| else: | ||
| _ = loop.create_task(client.aclose()) | ||
|
|
There was a problem hiding this comment.
Is this really necessary? I think this could lead to masking a different bug. Could we drop this best-effort path and instead only log a warning/error if the client is not closed?
| self, manager: "RequestManager", request: ScheduledRequest | ||
| ) -> None: | ||
| """Ensure that requests aren't sent faster than the rate limit.""" | ||
| _ = await self._lock.acquire() |
There was a problem hiding this comment.
Is this lock required?
From RequestManager._loop:
It's awaiting every request, so it seems that before_send is only executed once at a time?
There was a problem hiding this comment.
I got a bit too excited about async on this PR, which is why this is here. I was thinking of a situation with many request managers on different threads sharing one rate-limit instance, but realistically that's out of scope. Will simplify this.
| while remaining_tries > 0: | ||
| try: | ||
| _ = await request.response() | ||
| except Exception: |
There was a problem hiding this comment.
Is this too broad?
I'm seeing that within send() you're doing _ = response.raise_for_status(), so that means Retry.listen() will catch any HTTP exception, including 4xx errors that might not need to be retried. What if we only retry on transient errors like 5xx, network errors, timeouts, etc.
There was a problem hiding this comment.
This should be narrower, yes. Was an oversight on my end.
|
@albertisfu Answers to some of your questions. I may have been confused about some stuff we had discussed about scrapers.
I was under the impression that we were neglecting the duplication concern for Florida and New York as part of a scraper experimentation phase and we were going to go back after the experimentation phase and clean everything up. I may have gotten the wrong impression from our discussion though.
The main difference is that where the TAMES manager uses an
We would need to reimplement rate-limiting and retries for Florida somehow (probably a wrapper on As a sidenote, now that I'm looking at this code after spending some time away from it, it is a bit complex for what it's trying to achieve. If you're okay with keeping the multiple |
The Florida case-list endpoint now has a trailing slash (sourced from FloridaCaseListParser.endpoint), and fetch_case_data hits the case-detail endpoint before parties/entries. Unrouted requests previously fell through to the recorder's 404, which the Retry handler used to swallow; it now propagates 4xx errors so the URL mismatches surface as test failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Yeah, this is correct. For New York and Florida, we decided to use different approaches because Brennan is introducing a new scraper architecture based on the Kent driver. This differs from our traditional approach, like the one used in TAMES and the current Florida work. So yes, we'll have two different approaches, and after the New York scrapers are implemented, we'll analyze what works best by comparing them against the approach used in Texas and Florida. From my understanding, and based on the code in Texas and Florida, both scrapers follow similar approaches. So, if possible, we should try to unify the foundations of those scrapers so they share at least the same basic principles. That way, we'll only have two approaches to maintain in the future: the traditional one (this) and the new one introduced for New York (using the Kent driver).
Ok I see the benefit of having multiple
I think it's worth trying to unify the approaches for this and Texas. Are you suggesting that we can improve the manager used in Texas to support multiple What I would like to see is for the base classes for the Texas and Florida scrapers to be the same, since this will be the "traditional" approach that we use to compare against the new scraper approach introduced in New York. That way, when we need to decide which architecture is better for future states, we can compare against the best unified "traditional" approach that incorporates the best ideas we've developed so far from both Texas and Florida. This would also simplify code maintainability as we introduce new features or fix potential bugs. Could you describe a bit what the plan would be if we decide to extend and reuse the classes in |
Okay, thank you for the clarification. I thought it was: traditional approach, Kent, whatever Morgan comes up with. This makes more sense.
To extend If we want a request manager that accepts multiple callbacks, that means that |
Considering we'll need to keep Florida requests controlled by a global rate limit. Will we have any benefit of making Florida scraper async? Do you think we can lose any advantage vs a sync Florida scraper? I'm thinking that if one day we need to make scraper async Kent, will be a best fit for that work as Brennan mentioned there is a native async driver on it.
If we drop async support and keep the multiple handlers, would it be possible to extend
So after this change, I imagine we would be able to use
And we would probably also be able to remove What do you think? |
We lose the ability to execute pre- and post-request logic concurrently with requests. So any requests, parsing, and archiving would all block each other. In my thinking, the big disadvantage of having Florida be sync is that to archive raw responses, we would either have to block the next request until our IO is done, or archive everything at the end (which would be extremely brittle). I don't know how fast saving something to S3 is, but it would be nice to not have to worry about that regardless of our RPS ratelimit.
Keeping the request manager sync and moving Florida to be fully sync would allow Texas to stay the same, yes.
I think this all sounds good and plausible to implement. I do think we can still have retries work using the same handler pattern even if everything's sync. This would let us keep the
Looking at |
|
Thanks. As per our conversation, since we’re considering storing the content in S3, and in Florida this could become a frequent and intensive operation due to the multiple endpoints we need to scrape, this process could become a real bottleneck. Because of that, we decided to keep the async approach so we can run pre and post-request logic concurrently. We’ll just simplify the code and prune functionalities that might not be needed right now for Florida. |
… Future ScheduledRequest constructed httpx.Request directly, bypassing the AsyncClient base_url / default header / cookie / param merge that build_request() performs. httpx.send() sends pre-built requests as-is, so requests to relative paths like "/courts" failed with "unknown url type". Route through self.build_request() and wrap the result via a new ScheduledRequest.from_httpx_request classmethod. FloridaScraper.courts was a @cached_property over an async def, which cached the coroutine — not the resolved dict. The second await raised "cannot reuse already awaited coroutine". Replace with a property that lazily creates and caches an asyncio.Future, which is re-awaitable and naturally collapses concurrent first-callers onto a single fetch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a scraper for the Florida courts site and a configurable request management class.
Summary
scraper.pyRequestManager.pyUnsure how much this one overlaps with Kent, but since we're still in the experimentation phase with scrapers, I figured it would be okay.
cases.py,common.py,docket_entries.py,parties.py, anddocuments.pycommon.py.cases.pytocourts.pybecause it makes more sense there.metadata.pyNotes
When we integrate this into CourtListener, we can achieve response archiving (as discussed in the planning issue) by passing a custom
RequestHandlerto the scraper which will save responses to S3. We can also set it up to handle pulling responses from S3 instead of the network if we do a two-phase backfill (case list and then case metadata).