Florida Scraper Class #1961

Open
MorganBennetDev wants to merge 27 commits into main from morgan/florida-scraper

@MorganBennetDev
Contributor

Adds a scraper for the Florida courts site and a configurable request management class.

Summary

scraper.py

  • Add the new async scraper class.

RequestManager.py

  • Add a configurable request manager class with rate limits and retries.

Unsure how much this one overlaps with Kent, but since we're still in the experimentation phase with scrapers, I figured it would be okay.

cases.py, common.py, docket_entries.py, parties.py, and documents.py

  • Add a supplementary method to parsers for paginated endpoints that keeps the pagination data when returning results. This is done by extending an ABC in common.py.
  • Move court ID map from cases.py to courts.py because it makes more sense there.

metadata.py

  • Miscellaneous Pydantic validation classes for metadata endpoints.

Notes

When we integrate this into CourtListener, we can achieve response archiving (as discussed in the planning issue) by passing a custom RequestHandler to the scraper which will save responses to S3. We can also set it up to handle pulling responses from S3 instead of the network if we do a two-phase backfill (case list and then case metadata).

MorganBennetDev and others added 12 commits May 11, 2026 18:18
Might be abstract nonsense, but we'll see
Adds 16 async unittest cases (httpx.MockTransport) covering RequestManager
lifecycle, handler hook ordering, RateLimit (including concurrent
serialization), and Retry.

Fixes two bugs surfaced by the new tests:
- ScheduledRequest.reset() cancelled response_future unconditionally, killing
  listen handlers that were already awaiting it on the first attempt (and
  silently disabling Retry). Now only swaps the future when it's already done.
- make_default_logger(__name__) was using the module name as a log file path,
  dropping a file in the repo root each run. Matches every other call site.
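
A minimal sketch of the reset() fix described above, assuming the request holds a single asyncio future. ScheduledRequestSketch and listen are hypothetical stand-ins, not the classes in this PR; only the "swap the future when it's already done" rule is modeled.

```python
import asyncio


class ScheduledRequestSketch:
    """Hypothetical stand-in; only the response future is modeled."""

    def __init__(self) -> None:
        self.response_future = asyncio.get_running_loop().create_future()

    def reset(self) -> None:
        # A pending future may already have listeners awaiting it, so leave
        # it alone; only a finished future is swapped for a fresh one.
        if self.response_future.done():
            self.response_future = asyncio.get_running_loop().create_future()


async def listen(request: ScheduledRequestSketch) -> str:
    return await request.response_future


async def demo() -> tuple:
    request = ScheduledRequestSketch()
    listener = asyncio.create_task(listen(request))
    await asyncio.sleep(0)  # let the listener start awaiting
    first_future = request.response_future
    request.reset()  # pending: the listener's future survives
    request.response_future.set_result("ok")
    result = await listener
    request.reset()  # done: a fresh future is swapped in
    return result, request.response_future is not first_future
```

Cancelling the pending future instead would raise CancelledError inside listen, which matches the symptom the commit describes.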

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds FloridaScraper built on RequestManager, covering courts and case
metadata enumeration, case-list pagination with binary date-range
subdivision under the 10k-result API cap, and per-case fetch. backfill()
yields CaseRefs for downstream ingestion.

Design choices baked in:
- ScrapedCourtExternalID IntEnum names the 7 in-scope appellate courts.
  The API silently adding ID 8 should not start scraping a court we
  haven't audited; adding one here is an explicit decision.
- Single-day >10k overflow raises InsanityException (matches Texas
  TAMES). Soft-warning would silently truncate by filedDate asc, which
  is a data-integrity bug in a historical backfill.
- _write_json uses compact separators, no indent, no sort_keys.
  Optimizes for archive storage; files can be reformatted locally if
  inspection is needed.
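
The binary date-range subdivision could look roughly like the sketch below. It assumes a count_results callable that reports how many cases a date range matches, and it treats the cap as inclusive; both are assumptions. CapOverflowError is a hypothetical stand-in for InsanityException, and subdivide is not the scraper's actual method.

```python
from datetime import date, timedelta
from typing import Callable, Iterator, Tuple

RESULT_CAP = 10_000  # per-query result cap described in the commit message


class CapOverflowError(Exception):
    """Hypothetical stand-in for the InsanityException mentioned above."""


def subdivide(
    start: date,
    end: date,
    count_results: Callable[[date, date], int],
    cap: int = RESULT_CAP,
) -> Iterator[Tuple[date, date]]:
    """Yield inclusive date ranges whose result counts fit under the cap."""
    if count_results(start, end) <= cap:
        yield (start, end)
        return
    if start == end:
        # A single day cannot be split further; fail loudly, do not truncate.
        raise CapOverflowError(f"{start} alone exceeds {cap} results")
    middle = start + timedelta(days=(end - start).days // 2)
    yield from subdivide(start, middle, count_results, cap)
    yield from subdivide(middle + timedelta(days=1), end, count_results, cap)
```

With 4,000 results per day, an eight-day range splits into four two-day ranges, each of which fits under the cap.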

Tests: 13 cases covering URL/path helpers, datetime formatting, the
filename-safe sanitizer, and the scraper's core flows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The scraper no longer persists anything. Case-list and per-case payloads
are yielded to the caller (CourtListener) which is responsible for
storage. Drops _write_json, _fs_safe, _case_dir, the output_root field,
and the Path argument on run_backfill.

The /courts and per-court reference metadata endpoints
(casepartysubtypes, casecategories, docketentrysubtypes) are hit at most
once each per scraper instance via lazy in-memory caches.

New dataclasses:
- CourtMetadata bundles the three reference payloads for one court.
- CaseData bundles a CaseRef with every parsed payload (case, hearings,
  parties, docket entries, and docket-entry documents keyed by
  docketEntryUUID).

backfill() now yields CaseData instead of CaseRef. CaseRef survives as
the discovery handle from enumerate_cases. run_backfill keeps its int
count return and discards payloads; callers wanting the data should
drive FloridaScraper.backfill() directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five bugs surfaced while updating tests after the 04cdc9f cleanup:

- fetch_case_data looked up case.court_id (the validator-rewritten
  juriscraper id like "flca04") in FLORIDA_COURT_EXTERNAL_ID_MAP, which
  is keyed by the API's external id ("5"). The lookup never hit, so
  backfill(full_scrape=True) raised ValueError on every case. Switched
  to FloridaCourtID(case.court_id) and added a distinct error for
  known-but-uncached courts.
- _fetch_paginated initialised `page = 0` and never reassigned it; only
  the unused `next_page` sentinel incremented. Multi-page responses
  looped forever fetching page 0. Dropped the dead variable.
- enumerate_cases broke on page_number >= total_pages, firing one page
  late. Now breaks on page_number + 1 >= total_pages.
- typing.override only exists in stdlib from 3.12+, but
  requires-python is ">=3.10". Imported from typing_extensions instead
  in common.py, cases.py, docket_entries.py, documents.py, and
  parties.py. tox -e py310 could not collect the florida tests before
  this change.
- __init__.py still re-exported CaseData, CaseRef, and run_backfill —
  symbols removed in b93e970 and 04cdc9f. Cleaned up.
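
For reference, the two pagination fixes above amount to a loop shaped like this sketch. iter_pages and fetch_page are hypothetical names, and the payload layout (a "page" object with "totalPages") is an assumption about the API response, not something confirmed in this thread.

```python
from typing import Callable, Iterator


def iter_pages(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every page of a zero-indexed paginated endpoint exactly once."""
    page_number = 0
    while True:
        payload = fetch_page(page_number)
        yield payload
        # Pages are zero-indexed, so the last page is total_pages - 1.
        if page_number + 1 >= payload["page"]["totalPages"]:
            break
        # Advance the same variable that is passed to fetch_page.
        page_number += 1
```

Using a single counter for both the request and the stop condition avoids the "fetch page 0 forever" and "one page late" failure modes at the same time.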

Test updates remove a sys.modules/importlib workaround that the test
file used to dodge the broken __init__.py, and simplify the
_parse_case helper that was mutating court_id to work around the
fetch_case_data bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MorganBennetDev
Contributor Author

I'm always forgetting the changelog :(

Contributor

@albertisfu left a comment

Thanks @MorganBennetDev, this looks good. I have a few comments and suggestions.

Additionally, one architectural concern is the introduction of a new RequestManager class. My concern is that we'll end up maintaining similar code and different approaches across our state scrapers. For TAMES, we already have ScraperRequestManager in BaseStateScraper.py and RateLimitedRequestManager in cl/scrapers/management/commands/back_scrape_dockets.py.

Is this new RequestManager very different from the one used for TAMES? How hard would it be, and what downsides would we have, if we instead extended and reused the TAMES ScraperRequestManager / RateLimitedRequestManager for Florida? We could even move RateLimitedRequestManager to Juriscraper if that makes things easier.

totals = {r.page.total_elements for r in results}
if len(totals) > 1:
    logger.error(
        f"Paginated fetch returned different totalElements across fetches ({totals})."
Contributor

Can we log more params here that would help us identify where this issue happened? For example, the endpoint and other useful params, so we can try to replicate the request manually and check if there is any inconsistency that we should fix.

Also, is it ok to still return results here, or would it be better to just raise the error and return?

Contributor Author

Also, is it ok to still return results here, or would it be better to just raise the error and return?

I'm working under the assumption that if the number of results changes while we're iterating through the pages, it means that something matching our search was uploaded in the meantime. So if we encounter this error, we should have partial results, but we will need to redo the fetch for the date range in question to get the complete set of results (I will add that to the log).

There's also the chance that a docket was removed while we're iterating, but I'm assuming that case is too rare to care about.
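
One possible shape for the richer log message, as a sketch only. totals_are_consistent is a hypothetical helper, and the endpoint and params arguments are assumed to be available at the call site.

```python
import logging
from typing import Any, Mapping, Sequence

logger = logging.getLogger(__name__)


def totals_are_consistent(
    endpoint: str, params: Mapping[str, Any], totals_seen: Sequence[int]
) -> bool:
    """Return False and log the request details if totalElements changed."""
    totals = set(totals_seen)
    if len(totals) <= 1:
        return True
    # Include enough context to replay the request by hand.
    logger.error(
        "Paginated fetch of %s with params %r returned different "
        "totalElements across pages (%s); results may be partial, so this "
        "query should be re-run.",
        endpoint,
        dict(params),
        sorted(totals),
    )
    return False
```

Returning a flag leaves the choice between keeping partial results and raising to the caller.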

Comment thread juriscraper/state/florida/scraper.py Outdated
raise ValueError(
    "Unknown court id %r. Call fetch_courts() first." % court_id
)
court_external_id = str(court_metadata.court.external_identifier)
Contributor

This court_external_id is a string here, but it's an integer on this line:

"caseHeader.courtID": court_data.court.external_identifier,

It might not be an issue, but I just wanted to confirm which one is correct so we can keep it consistent.

Contributor Author

It should be an int in both places to stay consistent with courts.py.

Comment thread juriscraper/state/florida/scraper.py Outdated
case_uuid = output_case.case_uuid

FloridaCaseInfoParser.populate_transfers(output_case)

Contributor

Is this endpoint supposed to fetch the case detail data?

I'm seeing that we have the /courts/{court}/cms/cases/{case} endpoint defined in FloridaCaseInfoParser, but I don't see it being used here or anywhere else to fetch the case detail data.

Contributor Author

Yes, I forgot that there's more data when we fetch the case directly.


de_parser = FloridaDocketEntryListParser(court_id=court_id.value)
parties_parser = FloridaPartyListParser(court_id=court_id.value)

Contributor

It seems that we're also missing fetching the /courts/{court}/cms/cases/{case}/hearings endpoint?

Contributor Author

This will necessitate adding a field to FloridaCase to nicely capture hearings. Adding.

Comment thread juriscraper/state/RequestManager.py Outdated
def __del__(self) -> None:
    """Clean up allocated resources. Cancels the request loop if it's running
    and closes the client if it's open."""
    if hasattr(self, "loop_future") and self._loop_future:
Contributor

Shouldn't this be _loop_future here?

Suggested change
- if hasattr(self, "loop_future") and self._loop_future:
+ if hasattr(self, "_loop_future") and self._loop_future:

Comment thread juriscraper/state/RequestManager.py Outdated
Comment on lines +441 to +451
# Best-effort async cleanup; callers should prefer `await close()`.
try:
    loop = asyncio.get_running_loop()
except RuntimeError:
    try:
        asyncio.run(client.aclose())
    except Exception:
        pass
else:
    _ = loop.create_task(client.aclose())

Contributor

Is this really necessary? I think this could lead to masking a different bug. Could we drop this best-effort path and instead only log a warning/error if the client is not closed?
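
A sketch of the warn-only alternative, assuming the client exposes an is_closed flag. FakeClient and ManagerSketch are hypothetical stand-ins so the example stays self-contained; they are not the classes in RequestManager.py.

```python
import asyncio
import warnings


class FakeClient:
    """Hypothetical stand-in for an async HTTP client with an is_closed flag."""

    def __init__(self) -> None:
        self.is_closed = False

    async def aclose(self) -> None:
        self.is_closed = True


class ManagerSketch:
    def __init__(self, client: FakeClient) -> None:
        self._client = client

    async def close(self) -> None:
        await self._client.aclose()

    def __del__(self) -> None:
        # getattr with a default also covers a partially constructed
        # instance, and avoids the hasattr name mismatch flagged above.
        client = getattr(self, "_client", None)
        if client is not None and not client.is_closed:
            warnings.warn(
                "ManagerSketch dropped with an open client; "
                "call `await close()` first.",
                ResourceWarning,
            )
```

Nothing is closed implicitly here, so a missing `await close()` shows up as a warning instead of being papered over.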

Comment thread juriscraper/state/RequestManager.py Outdated
    self, manager: "RequestManager", request: ScheduledRequest
) -> None:
    """Ensure that requests aren't sent faster than the rate limit."""
    _ = await self._lock.acquire()
Contributor

Is this lock required?

From RequestManager._loop:

handler.before_send(self, request)

The loop awaits each request in turn, so it seems that before_send only ever runs for one request at a time?

Contributor Author

I got a bit too excited about async on this PR, which is why this is here. I was thinking of a situation with many request managers on different threads sharing one rate-limit instance, but realistically that's out of scope. Will simplify this.
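
A lock-free version might look like the sketch below. It relies on the assumption discussed above that before_send is awaited for one request at a time; RateLimitSketch and its simplified before_send signature are hypothetical.

```python
import asyncio
import time


class RateLimitSketch:
    """Spaces out sends; assumes before_send is never called concurrently."""

    def __init__(self, requests_per_second: float) -> None:
        self._interval = 1.0 / requests_per_second
        self._next_allowed = 0.0  # monotonic time of the next permitted send

    async def before_send(self) -> None:
        now = time.monotonic()
        delay = self._next_allowed - now
        if delay > 0:
            await asyncio.sleep(delay)
        # Schedule the following slot one interval after this one.
        self._next_allowed = max(now, self._next_allowed) + self._interval
```

If handlers were ever shared across concurrent callers, the read-then-write of _next_allowed around the sleep would need protection again.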

Comment thread juriscraper/state/RequestManager.py Outdated
while remaining_tries > 0:
    try:
        _ = await request.response()
    except Exception:
Contributor

Is this too broad?

I see that send() calls _ = response.raise_for_status(), which means Retry.listen() will catch any HTTP exception, including 4xx errors that might not need to be retried. What if we only retried on transient errors such as 5xx responses, network errors, and timeouts?

Contributor Author

This should be narrower, yes. Was an oversight on my end.
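
A narrower retry could be sketched as follows. StatusError is a hypothetical stand-in for the HTTP client's status exception, and send_with_retry is not the Retry handler from this PR; the point is only the transient-versus-permanent split.

```python
import asyncio
from typing import Awaitable, Callable


class StatusError(Exception):
    """Hypothetical stand-in for an HTTP status error carrying its code."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def is_transient(exc: Exception) -> bool:
    if isinstance(exc, StatusError):
        return 500 <= exc.status_code < 600  # retry server errors only
    # Network failures and timeouts are treated as transient.
    return isinstance(exc, (ConnectionError, TimeoutError, asyncio.TimeoutError))


async def send_with_retry(
    send: Callable[[], Awaitable[str]], max_tries: int = 3
) -> str:
    for attempt in range(1, max_tries + 1):
        try:
            return await send()
        except Exception as exc:
            # Permanent errors (for example 4xx) propagate immediately.
            if attempt == max_tries or not is_transient(exc):
                raise
    raise RuntimeError("max_tries must be at least 1")
```

With this split, a 404 surfaces on the first attempt while a 503 or a dropped connection is retried up to max_tries times.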

@MorganBennetDev
Contributor Author

@albertisfu Answers to some of your questions. I may have been confused about some stuff we had discussed about scrapers.

Additionally, one architectural concern is the introduction of a new RequestManager class.

I was under the impression that we were neglecting the duplication concern for Florida and New York as part of a scraper experimentation phase and we were going to go back after the experimentation phase and clean everything up. I may have gotten the wrong impression from our discussion though.

Is this new RequestManager very different from the one used for TAMES?

The main difference is that where the TAMES manager uses an all_response_fn parameter passed at initialization, this one uses a list of RequestHandler instances instead. A single RequestHandler is functionally identical to the all_response_fn parameter. The benefit I see to having multiple handlers is that we can separate out and parameterize repeated logic like rate limits, retries, and archiving to S3, hopefully minimizing code we have to duplicate for each state.
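
To make the multiple-handler idea concrete, here is a minimal sketch. The hook names, their signatures, and the request helper are assumptions for illustration, and a dict stands in for S3; this is not the RequestManager API in the PR.

```python
import asyncio
from typing import Awaitable, Callable, List


class RequestHandlerSketch:
    """Hypothetical handler interface: hooks that run around each request."""

    async def before_send(self, url: str) -> None:
        pass

    async def after_response(self, url: str, body: str) -> None:
        pass


class ArchiveHandler(RequestHandlerSketch):
    """Stores raw responses; a dict stands in for S3 here."""

    def __init__(self) -> None:
        self.archive = {}

    async def after_response(self, url: str, body: str) -> None:
        self.archive[url] = body


class CountingHandler(RequestHandlerSketch):
    def __init__(self) -> None:
        self.sent = 0

    async def before_send(self, url: str) -> None:
        self.sent += 1


async def request(
    url: str,
    send: Callable[[str], Awaitable[str]],
    handlers: List[RequestHandlerSketch],
) -> str:
    # Each concern lives in its own handler and runs in list order.
    for handler in handlers:
        await handler.before_send(url)
    body = await send(url)
    for handler in handlers:
        await handler.after_response(url, body)
    return body
```

A list containing a single handler behaves like one all_response_fn callback, which is the equivalence described above.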

How hard would it be, and what downsides would we have, if we instead extended and reused the TAMES ScraperRequestManager / RateLimitedRequestManager for Florida?

We would need to reimplement rate-limiting and retries for Florida somehow (probably a wrapper on FloridaScraper around ScraperRequestManager.get for retries and a RateLimitedRequestManager for throttling) and swap out any httpx-specific stuff for its Requests equivalent. Other than that, I don't think any core logic would need to change.

As a side note, now that I'm looking at this code after spending some time away from it, it is a bit complex for what it's trying to achieve. If you're okay with keeping the multiple RequestHandler approach, I will be simplifying this a lot.

@MorganBennetDev moved this from To Do to In progress in Sprint (Web Team) May 26, 2026
MorganBennetDev and others added 3 commits May 26, 2026 14:49
The Florida case-list endpoint now has a trailing slash (sourced from
FloridaCaseListParser.endpoint), and fetch_case_data hits the case-detail
endpoint before parties/entries. Unrouted requests previously fell through
to the recorder's 404, which the Retry handler used to swallow; it now
propagates 4xx errors so the URL mismatches surface as test failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@albertisfu
Contributor

I was under the impression that we were neglecting the duplication concern for Florida and New York as part of a scraper experimentation phase and we were going to go back after the experimentation phase and clean everything up. I may have gotten the wrong impression from our discussion though.

Yeah, this is correct. For New York and Florida, we decided to use different approaches because Brennan is introducing a new scraper architecture based on the Kent driver. This differs from our traditional approach, like the one used in TAMES and the current Florida work.

So yes, we'll have two different approaches, and after the New York scrapers are implemented, we'll analyze what works best by comparing them against the approach used in Texas and Florida.

From my understanding, and based on the code in Texas and Florida, both scrapers follow similar approaches. So, if possible, we should try to unify the foundations of those scrapers so they share at least the same basic principles. That way, we'll only have two approaches to maintain in the future: the traditional one (this) and the new one introduced for New York (using the Kent driver).

The main difference is that where the TAMES manager uses an all_response_fn parameter passed at initialization, this one uses a list of RequestHandler instances instead. A single RequestHandler is functionally identical to the all_response_fn parameter. The benefit I see to having multiple handlers is that we can separate out and parameterize repeated logic like rate limits, retries, and archiving to S3, hopefully minimizing code we have to duplicate for each state.

Ok I see the benefit of having multiple RequestHandler instances.

We would need to reimplement rate-limiting and retries for Florida somehow (probably a wrapper on FloridaScraper around ScraperRequestManager.get for retries and a RateLimitedRequestManager for throttling) and swap out any httpx-specific stuff for its Requests equivalent. Other than that, I don't think any core logic would need to change.

As a side note, now that I'm looking at this code after spending some time away from it, it is a bit complex for what it's trying to achieve. If you're okay with keeping the multiple RequestHandler approach, I will be simplifying this a lot.

I think it's worth trying to unify the approaches for this and Texas. Are you suggesting that we can improve the manager used in Texas to support multiple RequestHandlers?

What I would like to see is for the base classes for the Texas and Florida scrapers to be the same, since this will be the "traditional" approach that we use to compare against the new scraper approach introduced in New York. That way, when we need to decide which architecture is better for future states, we can compare against the best unified "traditional" approach that incorporates the best ideas we've developed so far from both Texas and Florida.

This would also simplify code maintainability as we introduce new features or fix potential bugs.

Could you describe a bit what the plan would be if we decide to extend and reuse the classes in juriscraper/state/BaseStateScraper.py instead of creating new ones for Florida? Just to make sure we wouldn't need to change many things that are currently working in the Texas scraper, and that they can work effortlessly with the Florida core logic.

@MorganBennetDev
Contributor Author

So yes, we'll have two different approaches, and after the New York scrapers are implemented, we'll analyze what works best by comparing them against the approach used in Texas and Florida.

Okay, thank you for the clarification. I thought it was: traditional approach, Kent, whatever Morgan comes up with. This makes more sense.

Could you describe a bit what the plan would be if we decide to extend and reuse the classes in juriscraper/state/BaseStateScraper.py instead of creating new ones for Florida?

To extend BaseStateScraper: Change FloridaScraper.backfill from an AsyncGenerator to a Generator. This would probably necessitate either changing most other async methods on FloridaScraper to sync, or changing BaseStateScraper.backfill to an AsyncGenerator.

If we want a request manager that accepts multiple callbacks, that means that state.RequestManager needs to be merged into BaseStateScraper.ScraperRequestManager somehow. The TAMES scraper will need to be updated to handle the multiple callback version of RequestManager, which will necessitate some changes on the CourtListener side to deal with the initialization being different. I think we should keep the async aspect of state.RequestManager since it lends itself well to multiple callbacks, but that means more changes to Texas.

@albertisfu
Contributor

To extend BaseStateScraper: Change FloridaScraper.backfill from an AsyncGenerator to a Generator. This would probably necessitate either changing most other async methods on FloridaScraper to sync, or changing BaseStateScraper.backfill to an AsyncGenerator.

Considering that we'll need to keep Florida requests under a global rate limit, is there any benefit to making the Florida scraper async? Do you think we would lose any advantage by going with a sync Florida scraper?
If we go sync, I think it'd be easier to reuse BaseStateScraper without having to refactor the Texas scraper.

I'm thinking that if one day we need to make the scrapers async, Kent will be the best fit for that work, since Brennan mentioned that it has a native async driver.

If we want a request manager that accepts multiple callbacks, that means that state.RequestManager needs to be merged into BaseStateScraper.ScraperRequestManager somehow. The TAMES scraper will need to be updated to handle the multiple callback version of RequestManager, which will necessitate some changes on the CourtListener side to deal with the initialization being different. I think we should keep the async aspect of state.RequestManager since it lends itself well to multiple callbacks, but that means more changes to Texas.

If we drop async support and keep the multiple handlers, would it be possible to extend ScraperRequestManager to support multiple handlers while remaining backward compatible with a single all_response_fn? That way, we would not need to modify anything in the Texas scraper. I imagine something like:

  • Refactor ScraperRequestManager.__init__ to accept handlers: list[RequestHandler] | None, max_retries: int = 0, and retryable_status (the HTTP status codes to retry).
  • All parameters would be optional, and the default behavior would remain unchanged. all_response_fn would be preserved for backward compatibility and would provide an interface for registering a single post-response handler (used in Texas).
  • Update ScraperRequestManager.request to execute handlers before and after sending the request, handle retries in a loop, etc.
  • Add a RateLimit(RequestHandler) here, while removing Retry(RequestHandler) since retries would instead be handled directly in request.
  • Make FloridaScraper extend BaseStateScraper and adopt the backfill(courts, date_range) signature, which is the same one used in Texas.

So after this change, I imagine we would be able to use ScraperRequestManager like this:

ScraperRequestManager(handlers=[RateLimitHandler(2.5)], max_retries=3)

And we would probably also be able to remove RateLimitedRequestManager from cl/scrapers/management/commands/back_scrape_dockets.py and use ScraperRequestManager instead.
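
A rough sketch of how the backward-compatible constructor could work. Everything here is hypothetical: the real ScraperRequestManager signature is not shown in this thread, the injected send callable and Response class exist only to keep the example self-contained, and all_response_fn is assumed to take just the response.

```python
from typing import Callable, Iterable, Optional


class Response:
    """Hypothetical minimal response; only the status code matters here."""

    def __init__(self, status_code: int) -> None:
        self.status_code = status_code


class CallbackHandler:
    """Adapts a legacy all_response_fn to the handler interface."""

    def __init__(self, fn: Callable[[Response], None]) -> None:
        self._fn = fn

    def before_send(self, url: str) -> None:
        pass

    def after_response(self, response: Response) -> None:
        self._fn(response)


class ScraperRequestManagerSketch:
    def __init__(
        self,
        send: Callable[[str], Response],
        all_response_fn: Optional[Callable[[Response], None]] = None,
        handlers: Optional[list] = None,
        max_retries: int = 0,
        retryable_status: Iterable[int] = (500, 502, 503, 504),
    ) -> None:
        self._send = send
        self.handlers = list(handlers or [])
        if all_response_fn is not None:
            # Existing callers keep working: the callback becomes a handler.
            self.handlers.append(CallbackHandler(all_response_fn))
        self.max_retries = max_retries
        self.retryable_status = frozenset(retryable_status)

    def request(self, url: str) -> Response:
        for attempt in range(self.max_retries + 1):
            for handler in self.handlers:
                handler.before_send(url)
            response = self._send(url)
            if (
                response.status_code not in self.retryable_status
                or attempt == self.max_retries
            ):
                break
        for handler in self.handlers:
            handler.after_response(response)
        return response
```

With the defaults (no handlers, max_retries=0) the manager sends once and calls nothing else, which is the "default behavior unchanged" property the list above asks for.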

What do you think?

@albertisfu moved this from In progress to To Do in Sprint (Web Team) May 27, 2026
@MorganBennetDev
Contributor Author

Considering that we'll need to keep Florida requests under a global rate limit, is there any benefit to making the Florida scraper async? Do you think we would lose any advantage by going with a sync Florida scraper?

We lose the ability to execute pre- and post-request logic concurrently with requests, so requests, parsing, and archiving would all block each other. In my thinking, the big disadvantage of having Florida be sync is that to archive raw responses, we would either have to block the next request until our IO is done, or archive everything at the end (which would be extremely brittle). I don't know how fast saving something to S3 is, but it would be nice to not have to worry about that regardless of our requests-per-second rate limit.
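
The overlap in question can be sketched with plain asyncio. Here fetch and archive are hypothetical placeholders that just sleep, and a dict stands in for S3; the point is only that the next fetch starts while the previous archive write is still in flight.

```python
import asyncio


async def fetch(index: int) -> str:
    await asyncio.sleep(0.01)  # stands in for the network round trip
    return f"payload-{index}"


async def archive(store: dict, index: int, payload: str) -> None:
    await asyncio.sleep(0.01)  # stands in for the upload
    store[index] = payload


async def scrape(count: int) -> dict:
    store = {}
    pending = set()
    for index in range(count):
        payload = await fetch(index)
        # Start archiving in the background and move on to the next fetch.
        task = asyncio.create_task(archive(store, index, payload))
        pending.add(task)
        task.add_done_callback(pending.discard)
    # Wait for any uploads that are still running before returning.
    await asyncio.gather(*pending)
    return store
```

A sync version would have to finish each archive call before issuing the next request, so upload latency would add directly to the time per request.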

If we go sync, I think it'd be easier to reuse BaseStateScraper without having to refactor the Texas scraper.

Keeping the request manager sync and moving Florida to be fully sync would allow Texas to stay the same, yes.

If we drop async support and keep the multiple handlers, would it be possible to extend ScraperRequestManager to support multiple handlers while remaining backward compatible with a single all_response_fn?

. . .

What do you think?

I think this all sounds good and plausible to implement.

I do think we can still have retries work using the same handler pattern even if everything's sync. This would let us keep the ScraperRequestManager focused and not lock all consumers into a single retry implementation if they need something different (Texas scrapers in CL use backoff, but I haven't implemented that here for instance).

And we would probably also be able to remove RateLimitedRequestManager from cl/scrapers/management/commands/back_scrape_dockets.py and use ScraperRequestManager instead.

Looking at back_scrape_dockets.py in CL, it won't be a drop-in replacement, but it won't be difficult.

@albertisfu
Contributor

Thanks. As per our conversation: we’re considering storing the content in S3, and in Florida this could be a frequent and intensive operation because of the multiple endpoints we need to scrape, so it could become a real bottleneck. Because of that, we decided to keep the async approach so we can run pre- and post-request logic concurrently.

We’ll just simplify the code and prune functionality that might not be needed right now for Florida.

MorganBennetDev and others added 6 commits May 27, 2026 16:23
… Future

ScheduledRequest constructed httpx.Request directly, bypassing the
AsyncClient base_url / default header / cookie / param merge that
build_request() performs. AsyncClient.send() sends pre-built requests as-is,
so requests to relative paths like "/courts" failed with
"unknown url type". Route through self.build_request() and wrap the
result via a new ScheduledRequest.from_httpx_request classmethod.

FloridaScraper.courts was a @cached_property over an async def, which
cached the coroutine — not the resolved dict. The second await raised
"cannot reuse already awaited coroutine". Replace with a property that
lazily creates and caches an asyncio.Future, which is re-awaitable and
naturally collapses concurrent first-callers onto a single fetch.
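
A minimal sketch of the Future-caching pattern described above. ScraperSketch, _fetch_courts, and the returned dict are hypothetical; asyncio.ensure_future wraps the coroutine in a Task, which is a Future subclass and can be awaited any number of times.

```python
import asyncio


class ScraperSketch:
    def __init__(self) -> None:
        self._courts_future = None
        self.fetch_count = 0

    async def _fetch_courts(self) -> dict:
        self.fetch_count += 1
        await asyncio.sleep(0)  # stands in for the network call
        return {"1": "example-court"}

    @property
    def courts(self) -> "asyncio.Future":
        # Cache the Future, not the coroutine: a coroutine can be awaited
        # only once, while a Future returns its stored result every time.
        if self._courts_future is None:
            self._courts_future = asyncio.ensure_future(self._fetch_courts())
        return self._courts_future


async def demo() -> tuple:
    scraper = ScraperSketch()

    async def read() -> dict:
        return await scraper.courts

    # Two concurrent first callers share a single fetch.
    first, second = await asyncio.gather(read(), read())
    third = await scraper.courts
    return scraper.fetch_count, first is second is third
```

One caveat of this pattern is that a failed fetch is cached too, so a real implementation would probably want to clear the future when the fetch raises.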

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
