Florida Scraper Class by MorganBennetDev · Pull Request #1961 · freelawproject/juriscraper

MorganBennetDev · 2026-05-19T00:24:33Z

Adds a scraper for the Florida courts site and a configurable request management class.

Summary

`scraper.py`

Add the new async scraper class.

`RequestManager.py`

Add a configurable request manager class with rate limits and retries

Unsure how much this one overlaps with Kent, but since we're still in the experimentation phase with scrapers, I figured it would be okay.

`cases.py`, `common.py`, `docket_entries.py`, `parties.py`, and `documents.py`

Adds a supplementary method to parsers for paginated endpoints to keep the pagination data when returning results. Achieved by extending an ABC in common.py.
Move court ID map from cases.py to courts.py because it makes more sense there.

`metadata.py`

Miscellaneous Pydantic validation classes for metadata endpoints.

Notes

When we integrate this into CourtListener, we can achieve response archiving (as discussed in the planning issue) by passing a custom RequestHandler to the scraper which will save responses to S3. We can also set it up to handle pulling responses from S3 instead of the network if we do a two-phase backfill (case list and then case metadata).

Might be abstract nonsense, but we'll see

…ging

Adds 16 async unittest cases (httpx.MockTransport) covering RequestManager lifecycle, handler hook ordering, RateLimit (including concurrent serialization), and Retry. Fixes two bugs surfaced by the new tests: - ScheduledRequest.reset() cancelled response_future unconditionally, killing listen handlers that were already awaiting it on the first attempt (and silently disabling Retry). Now only swaps the future when it's already done. - make_default_logger(__name__) was using the module name as a log file path, dropping a file in the repo root each run. Matches every other call site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds FloridaScraper built on RequestManager, covering courts and case metadata enumeration, case-list pagination with binary date-range subdivision under the 10k-result API cap, and per-case fetch. backfill() yields CaseRefs for downstream ingestion. Design choices baked in: - ScrapedCourtExternalID IntEnum names the 7 in-scope appellate courts. The API silently adding ID 8 should not start scraping a court we haven't audited; adding one here is an explicit decision. - Single-day >10k overflow raises InsanityException (matches Texas TAMES). Soft-warning would silently truncate by filedDate asc, which is a data-integrity bug in a historical backfill. - _write_json uses compact separators, no indent, no sort_keys. Optimizes for archive storage; files can be reformatted locally if inspection is needed. Tests: 13 cases covering URL/path helpers, datetime formatting, the filename-safe sanitizer, and the scraper's core flows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The scraper no longer persists anything. Case-list and per-case payloads are yielded to the caller (CourtListener) which is responsible for storage. Drops _write_json, _fs_safe, _case_dir, the output_root field, and the Path argument on run_backfill. The /courts and per-court reference metadata endpoints (casepartysubtypes, casecategories, docketentrysubtypes) are hit at most once each per scraper instance via lazy in-memory caches. New dataclasses: - CourtMetadata bundles the three reference payloads for one court. - CaseData bundles a CaseRef with every parsed payload (case, hearings, parties, docket entries, and docket-entry documents keyed by docketEntryUUID). backfill() now yields CaseData instead of CaseRef. CaseRef survives as the discovery handle from enumerate_cases. run_backfill keeps its int count return and discards payloads; callers wanting the data should drive FloridaScraper.backfill() directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Five bugs surfaced while updating tests after the 04cdc9f cleanup: - fetch_case_data looked up case.court_id (the validator-rewritten juriscraper id like "flca04") in FLORIDA_COURT_EXTERNAL_ID_MAP, which is keyed by the API's external id ("5"). The lookup never hit, so backfill(full_scrape=True) raised ValueError on every case. Switched to FloridaCourtID(case.court_id) and added a distinct error for known-but-uncached courts. - _fetch_paginated initialised `page = 0` and never reassigned it; only the unused `next_page` sentinel incremented. Multi-page responses looped forever fetching page 0. Dropped the dead variable. - enumerate_cases broke on page_number >= total_pages, firing one page late. Now breaks on page_number + 1 >= total_pages. - typing.override only exists in stdlib from 3.12+, but requires-python is ">=3.10". Imported from typing_extensions instead in common.py, cases.py, docket_entries.py, documents.py, and parties.py. tox -e py310 could not collect the florida tests before this change. - __init__.py still re-exported CaseData, CaseRef, and run_backfill — symbols removed in b93e970 and 04cdc9f. Cleaned up. Test updates remove a sys.modules/importlib workaround that the test file used to dodge the broken __init__.py, and simplify the _parse_case helper that was mutating court_id to work around the fetch_case_data bug. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

for more information, see https://pre-commit.ci

MorganBennetDev · 2026-05-19T00:28:33Z

I'm always forgetting the changelog :(

albertisfu

Thanks @MorganBennetDev, this looks good. I have a few comments and suggestions.

Additionally, one architectural concern is the introduction of a new RequestManager class. My concern is that we'll end up maintaining similar code and different approaches across our state scrapers. For TAMES, we already have ScraperRequestManager in BaseStateScraper.py and RateLimitedRequestManager in cl/scrapers/management/commands/back_scrape_dockets.py.

Is this new RequestManager very different from the one used for TAMES? How hard would it be, and what downsides would we have, if we instead extended and reused the TAMES ScraperRequestManager / RateLimitedRequestManager for Florida? We could even move RateLimitedRequestManager to Juriscraper if that makes things easier.

albertisfu · 2026-05-26T14:45:46Z

+        totals = {r.page.total_elements for r in results}
+        if len(totals) > 1:
+            logger.error(
+                f"Paginated fetch returned different totalElements across fetches ({totals})."


Can we log more params here that would help us identify where this issue happened? For example, the endpoint and other useful params, so we can try to replicate the request manually and check if there is any inconsistency that we should fix.

Also, is it ok to still return results here, or would it be better to just raise the error and return?

Also, is it ok to still return results here, or would it be better to just raise the error and return?

I'm working under the assumption that if the number of results changes while we're iterating through the pages, it means that something was uploaded that matches our search. This means that if we encounter this error, we should have partial results but will need to redo for the date range in question to get the complete set of results (will add that to log).

There's also the chance that a docket was removed while we're iterating, but I'm assuming that case is too rare to care about.

albertisfu · 2026-05-26T14:46:24Z

+            raise ValueError(
+                "Unknown court id %r. Call fetch_courts() first." % court_id
+            )
+        court_external_id = str(court_metadata.court.external_identifier)


This court_external_id is a string here, but it's an integer on this line:

juriscraper/juriscraper/state/florida/scraper.py

Line 438 in 33f17f8

"caseHeader.courtID": court_data.court.external_identifier,

It might not be an issue, but I just wanted to confirm which one is correct so we can keep it consistent.

Should be an int in both places to stay consistent with courts.py

albertisfu · 2026-05-26T14:46:54Z

+        case_uuid = output_case.case_uuid
+
+        FloridaCaseInfoParser.populate_transfers(output_case)
+


Is this endpoint supposed to fetch the case detail data?

I'm seeing that we have the /courts/{court}/cms/cases/{case} endpoint defined in FloridaCaseInfoParser, but I don't see it being used here or anywhere else to fetch the case detail data.

Yes, I forgot that there's more data when we fetch the case directly.

albertisfu · 2026-05-26T14:47:11Z

+
+        de_parser = FloridaDocketEntryListParser(court_id=court_id.value)
+        parties_parser = FloridaPartyListParser(court_id=court_id.value)
+


It seems that we're also missing fetching the /courts/{court}/cms/cases/{case}/hearings endpoint?

This will necessitate adding a field to FloridaCase to nicely capture hearings. Adding.

albertisfu · 2026-05-26T14:47:45Z

+    def __del__(self) -> None:
+        """Clean up allocated resources. Cancels the request loop if it's running
+        and closes the client if it's open."""
+        if hasattr(self, "loop_future") and self._loop_future:


Shouldn't be _loop_future here?

Suggested change

if hasattr(self, "loop_future") and self._loop_future:

if hasattr(self, "_loop_future") and self._loop_future:

albertisfu · 2026-05-26T14:48:16Z

+        # Best-effort async cleanup; callers should prefer `await close()`.
+        try:
+            loop = asyncio.get_running_loop()
+        except RuntimeError:
+            try:
+                asyncio.run(client.aclose())
+            except Exception:
+                pass
+        else:
+            _ = loop.create_task(client.aclose())
+


Is this really necessary? I think this could lead to masking a different bug. Could we drop this best-effort path and instead only log a warning/error if the client is not closed?

albertisfu · 2026-05-26T14:49:02Z

+        self, manager: "RequestManager", request: ScheduledRequest
+    ) -> None:
+        """Ensure that requests aren't sent faster than the rate limit."""
+        _ = await self._lock.acquire()


Is this lock required?

From RequestManager._loop:

juriscraper/juriscraper/state/RequestManager.py

Line 174 in 33f17f8

handler.before_send(self, request)

It's awaiting every request, so it seems that before_send is only executed once at a time?

I got a bit too excited about async on this PR, which is why this is here. I was thinking of a situation with many request managers on different threads sharing one rate-limit instance, but realistically that's out of scope. Will simplify this.

albertisfu · 2026-05-26T14:49:16Z

+        while remaining_tries > 0:
+            try:
+                _ = await request.response()
+            except Exception:


Is this too broad?

I'm seeing that within send() you're doing _ = response.raise_for_status(), so that means Retry.listen() will catch any HTTP exception, including 4xx errors that might not need to be retried. What if we only retry on transient errors like 5xx, network errors, timeouts, etc.

This should be narrower, yes. Was an oversight on my end.

MorganBennetDev · 2026-05-26T17:45:52Z

@albertisfu Answers to some of your questions. I may have been confused about some stuff we had discussed about scrapers.

Additionally, one architectural concern is the introduction of a new RequestManager class.

I was under the impression that we were neglecting the duplication concern for Florida and New York as part of a scraper experimentation phase and we were going to go back after the experimentation phase and clean everything up. I may have gotten the wrong impression from our discussion though.

Is this new RequestManager very different from the one used for TAMES?

The main difference is that where the TAMES manager uses an all_response_fn parameter passed at initialization, this one uses a list of RequestHandler instances instead. A single RequestHandler is functionally identical to the all_response_fn parameter. The benefit I see to having multiple handlers is that we can separate out and parameterize repeated logic like rate limits, retries, and archiving to S3, hopefully minimizing code we have to duplicate for each state.

How hard would it be, and what downsides would we have, if we instead extended and reused the TAMES ScraperRequestManager / RateLimitedRequestManager for Florida?

We would need to reimplement rate-limiting and retries for Florida somehow (probably a wrapper on FloridaScraper around ScraperRequestManager.get for retries and a RateLimitedRequestManger for throttling) and swap out any httpx-specific stuff for its Requests equivalent. Other than that, I don't think any core logic would need to change.

As a sidenote, now that I'm looking at this code after spending some time away from it, it is a bit complex for what it's trying to achieve. If you're okay with keeping the multiple RequestHandler approach, I will be simplifying this a lot.

The Florida case-list endpoint now has a trailing slash (sourced from FloridaCaseListParser.endpoint), and fetch_case_data hits the case-detail endpoint before parties/entries. Unrouted requests previously fell through to the recorder's 404, which the Retry handler used to swallow; it now propagates 4xx errors so the URL mismatches surface as test failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

albertisfu · 2026-05-26T20:28:26Z

I was under the impression that we were neglecting the duplication concern for Florida and New York as part of a scraper experimentation phase and we were going to go back after the experimentation phase and clean everything up. I may have gotten the wrong impression from our discussion though.

Yeah, this is correct. For New York and Florida, we decided to use different approaches because Brennan is introducing a new scraper architecture based on the Kent driver. This differs from our traditional approach, like the one used in TAMES and the current Florida work.

So yes, we'll have two different approaches, and after the New York scrapers are implemented, we'll analyze what works best by comparing them against the approach used in Texas and Florida.

From my understanding, and based on the code in Texas and Florida, both scrapers follow similar approaches. So, if possible, we should try to unify the foundations of those scrapers so they share at least the same basic principles. That way, we'll only have two approaches to maintain in the future: the traditional one (this) and the new one introduced for New York (using the Kent driver).

The main difference is that where the TAMES manager uses an all_response_fn parameter passed at initialization, this one uses a list of RequestHandler instances instead. A single RequestHandler is functionally identical to the all_response_fn parameter. The benefit I see to having multiple handlers is that we can separate out and parameterize repeated logic like rate limits, retries, and archiving to S3, hopefully minimizing code we have to duplicate for each state.

Ok I see the benefit of having multiple RequestHandler instances.

We would need to reimplement rate-limiting and retries for Florida somehow (probably a wrapper on FloridaScraper around ScraperRequestManager.get for retries and a RateLimitedRequestManger for throttling) and swap out any httpx-specific stuff for its Requests equivalent. Other than that, I don't think any core logic would need to change.

As a sidenote, now that I'm looking at this code after spending some time away from it, it is a bit complex for what it's trying to achieve. If you're okay with keeping the multiple RequestHandler approach, I will be simplifying this a lot.

I think it's worth trying to unify the approaches for this and Texas. Are you suggesting that we can improve the manager used in Texas to support multiple RequestHandlers?

What I would like to see is for the base classes for the Texas and Florida scrapers to be the same, since this will be the "traditional" approach that we use to compare against the new scraper approach introduced in New York. That way, when we need to decide which architecture is better for future states, we can compare against the best unified "traditional" approach that incorporates the best ideas we've developed so far from both Texas and Florida.

This would also simplify code maintainability as we introduce new features or fix potential bugs.

Could you describe a bit what the plan would be if we decide to extend and reuse the classes in juriscraper/state/BaseStateScraper.py instead of creating new ones for Florida? Just to make sure we wouldn't need to change many things that are currently working in the Texas scraper, and that they can work effortlessly with the Florida core logic.

MorganBennetDev · 2026-05-26T22:19:13Z

So yes, we'll have two different approaches, and after the New York scrapers are implemented, we'll analyze what works best by comparing them against the approach used in Texas and Florida.

Okay, thank you for the clarification. I thought it was: traditional approach, Kent, whatever Morgan comes up with. This makes more sense.

Could you describe a bit what the plan would be if we decide to extend and reuse the classes in juriscraper/state/BaseStateScraper.py instead of creating new ones for Florida?

To extend BaseStateScraper: Change FloridaScraper.backfill from an AsyncGenerator to a Generator. This would probably necessitate changing most other async methods on FloridaScraper to sync or change BaseStateScraper.backfill to an AsyncGenerator.

If we want a request manager that accepts multiple callbacks, that means that state.RequestManager needs to be merged into BaseStateScraper.ScraperRequestManager somehow. The TAMES scraper will need to be updated to handle the multiple callback version of RequestManager, which will necessitate some changes on the CourtListener side to deal with the initialization being different. I think we should keep the async aspect of state.RequestManager since it lends itself well to multiple callbacks, but that means more changes to Texas.

albertisfu · 2026-05-27T15:17:36Z

To extend BaseStateScraper: Change FloridaScraper.backfill from an AsyncGenerator to a Generator. This would probably necessitate changing most other async methods on FloridaScraper to sync or change BaseStateScraper.backfill to an AsyncGenerator.

Considering we'll need to keep Florida requests controlled by a global rate limit. Will we have any benefit of making Florida scraper async? Do you think we can lose any advantage vs a sync Florida scraper?
If we go sync I think it'd be easier to reuse BaseStateScraper without having to refactor Texas scraper?

I'm thinking that if one day we need to make scraper async Kent, will be a best fit for that work as Brennan mentioned there is a native async driver on it.

If we want a request manager that accepts multiple callbacks, that means that state.RequestManager needs to be merged into BaseStateScraper.ScraperRequestManager somehow. The TAMES scraper will need to be updated to handle the multiple callback version of RequestManager, which will necessitate some changes on the CourtListener side to deal with the initialization being different. I think we should keep the async aspect of state.RequestManager since it lends itself well to multiple callbacks, but that means more changes to Texas.

If we drop async support and keep the multiple handlers, would it be possible to extend ScraperRequestManager to support multiple handlers while remaining backward compatible with a single all_response_fn? That way, we would not need to modify anything in the Texas scraper. I imagine something like:

Refactor ScraperRequestManager.__init__ to accept handlers: list[RequestHandler] | None, max_retries: int = 0, and retryable_status(HTTP status codes to retry).
All parameters would be optional, and the default behavior would remain unchanged. all_response_fn would be preserved for backward compatibility and would provide an interface for registering a single post-response handler (used in Texas).
Update ScraperRequestManager.request to execute handlers before and after sending the request, handle retries in a loop, etc.
Add a RateLimit(RequestHandler) here, while removing Retry(RequestHandler) since retries would instead be handled directly in request.
Make FloridaScraper extend BaseStateScraper and adopt the backfill(courts, date_range) signature, which is the same one used in Texas.

So after this change, I imagine we would be able to use ScraperRequestManager like this:

ScraperRequestManager(handlers=[RateLimitHandler(2.5)], max_retries=3)

And we would probably also be able to remove RateLimitedRequestManager from cl/scrapers/management/commands/back_scrape_dockets.py and use ScraperRequestManager instead.

What do you think?

MorganBennetDev · 2026-05-27T16:52:27Z

Considering we'll need to keep Florida requests controlled by a global rate limit. Will we have any benefit of making Florida scraper async? Do you think we can lose any advantage vs a sync Florida scraper?

We lose the ability to execute pre- and post-request logic concurrently with requests. So any requests, parsing, and archiving would all block each other. In my thinking, the big disadvantage of having Florida be sync is that to archive raw responses, we would either have to block the next request until our IO is done, or archive everything at the end (which would be extremely brittle). I don't know how fast saving something to S3 is, but it would be nice to not have to worry about that regardless of our RPS ratelimit.

If we go sync I think it'd be easier to reuse BaseStateScraper without having to refactor Texas scraper?

Keeping the request manager sync and moving Florida to be fully sync would allow Texas to stay the same, yes.

If we drop async support and keep the multiple handlers, would it be possible to extend ScraperRequestManager to support multiple handlers while remaining backward compatible with a single all_response_fn?

. . .

What do you think?

I think this all sounds good and plausible to implement.

I do think we can still have retries work using the same handler pattern even if everything's sync. This would let us keep the ScraperRequestManager focused and not lock all consumers into a single retry implementation if they need something different (Texas scrapers in CL use backoff, but I haven't implemented that here for instance).

And we would probably also be able to remove RateLimitedRequestManager from cl/scrapers/management/commands/back_scrape_dockets.py and use ScraperRequestManager instead.

Looking at back_scrape_dockets.py in CL, it won't be a drop-in replacement, but it won't be difficult.

albertisfu · 2026-05-27T17:25:00Z

Thanks. As per our conversation, since we’re considering storing the content in S3, and in Florida this could become a frequent and intensive operation due to the multiple endpoints we need to scrape, this process could become a real bottleneck. Because of that, we decided to keep the async approach so we can run pre and post-request logic concurrently.

We’ll just simplify the code and prune functionalities that might not be needed right now for Florida.

… Future ScheduledRequest constructed httpx.Request directly, bypassing the AsyncClient base_url / default header / cookie / param merge that build_request() performs. httpx.send() sends pre-built requests as-is, so requests to relative paths like "/courts" failed with "unknown url type". Route through self.build_request() and wrap the result via a new ScheduledRequest.from_httpx_request classmethod. FloridaScraper.courts was a @cached_property over an async def, which cached the coroutine — not the resolved dict. The second await raised "cannot reuse already awaited coroutine". Replace with a property that lazily creates and caches an asyncio.Future, which is re-awaitable and naturally collapses concurrent first-callers onto a single fetch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

MorganBennetDev and others added 12 commits May 11, 2026 18:18

feat(florida): Scraper scaffolding

2482783

Might be abstract nonsense, but we'll see

feat(florida): Complete request manager

ed5cdf7

feat(florida): Fix RequestManager bugs and structural issues; add log…

756398e

…ging

chore(florida): Docs and annotations for RequestManager

256356c

feat(florida): Rate limiting

af86ea2

feat(florida): Better rate limiting

d7c8956

chore(florida): Move RequestManager to state module

275299c

fix(florida): Clean up scraper code

04cdc9f

MorganBennetDev requested review from Brennan-Chesley-FLP and albertisfu May 19, 2026 00:24

MorganBennetDev assigned albertisfu May 19, 2026

MorganBennetDev added this to Sprint (Web Team) May 19, 2026

pre-commit-ci Bot and others added 3 commits May 19, 2026 00:25

[pre-commit.ci] auto fixes from pre-commit.com hooks

e43203f

for more information, see https://pre-commit.ci

chore(florida): Update changelog

ca2ed0d

Merge branch 'main' into morgan/florida-scraper

33f17f8

MorganBennetDev moved this to To Do in Sprint (Web Team) May 19, 2026

albertisfu requested changes May 26, 2026

View reviewed changes

albertisfu assigned MorganBennetDev and unassigned albertisfu May 26, 2026

fix(florida): Review comments

8f2b5d1

MorganBennetDev moved this from To Do to In progress in Sprint (Web Team) May 26, 2026

MorganBennetDev and others added 3 commits May 26, 2026 14:49

fix(florida): Messed up and added typing.override again

3d2e207

fix(florida): Update RequestManager tests

7270968

Merge branch 'main' into morgan/florida-scraper

c5120ab

albertisfu moved this from In progress to To Do in Sprint (Web Team) May 27, 2026

MorganBennetDev and others added 6 commits May 27, 2026 16:23

feat(florida): More scraper clean up

d0a0e20

chore(florida): Small cleanup in scraper.py and tests

5609301

style(florida): Make Claude's fix nicer

f86f8fe

feat(florida): Parse and scrape hearings endpoint

80d2bce

chore(florida): Update __init__

533af94

MorganBennetDev assigned albertisfu and unassigned MorganBennetDev May 28, 2026

fix(florida): Scraper tests

b700d37

MorganBennetDev mentioned this pull request May 29, 2026

Create Docket merger for Florida freelawproject/courtlistener#7361

Open

		case_uuid = output_case.case_uuid

		FloridaCaseInfoParser.populate_transfers(output_case)


		de_parser = FloridaDocketEntryListParser(court_id=court_id.value)
		parties_parser = FloridaPartyListParser(court_id=court_id.value)

	if hasattr(self, "loop_future") and self._loop_future:
	if hasattr(self, "_loop_future") and self._loop_future:

Uh oh!

Conversation

MorganBennetDev commented May 19, 2026

Summary

scraper.py

RequestManager.py

cases.py, common.py, docket_entries.py, parties.py, and documents.py

metadata.py

Notes

Uh oh!

MorganBennetDev commented May 19, 2026

Uh oh!

albertisfu left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MorganBennetDev commented May 26, 2026

Uh oh!

albertisfu commented May 26, 2026

Uh oh!

MorganBennetDev commented May 26, 2026

Uh oh!

albertisfu commented May 27, 2026

Uh oh!

MorganBennetDev commented May 27, 2026

Uh oh!

albertisfu commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`scraper.py`

`RequestManager.py`

`cases.py`, `common.py`, `docket_entries.py`, `parties.py`, and `documents.py`

`metadata.py`