Commit cfdb926

Smoss and Mephistic authored
Add PDF text fallback for bill documents (#2121)
- Extract embedded text from bill PDFs when DocumentText is missing
- Add backfill tooling and documentation for repairing existing bills

Co-authored-by: Mephistic <deathbyfiresermon@gmail.com>
1 parent 0afcaef commit cfdb926

15 files changed

Lines changed: 849 additions & 100 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -88,6 +88,7 @@ cert.txt
 # lets each user define their own vscode settings
 .vscode/settings.json
 
+.serena/
 # local MCP server config (contains auth tokens)
 .mcp.json
 mcp-server/create-agent-key.ts

docs/bill-pdf-text-extraction.md

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
+# Bill PDF Text Extraction
+
+Some Massachusetts Legislature bill records have `content.DocumentText` set to
+null in the Document API even though the bill PDF contains embedded text. Maple
+now falls back to the official PDF at `/Bills/{court}/{billId}.pdf` when the API
+text is missing.
+
+## Extraction Scope
+
+The current extractor handles PDFs with embedded text. It does not perform OCR,
+so scanned or image-only PDFs are reported but not repaired.
+
+Known 194th General Court examples:
+
+- `H1`: large embedded-text PDF.
+- `H4787`: short embedded-text PDF.
+- `H5008`: ballot initiative embedded-text PDF.
+- `S2539`: regulatory/report-style embedded-text PDF.
+- `H18`: image-only/scanned PDF; no OCR support in this implementation.
+
+## Runtime Scraper Behavior
+
+The bill scraper first calls the MA Legislature Document API. If
+`DocumentText` is present, it stores the API response as before. If
+`DocumentText` is null or absent, the scraper downloads the PDF and tries to
+extract text with `pdf-parse`.
+
+Successful PDF extraction stores the result in the existing
+`content.DocumentText` field. Failed extraction leaves `DocumentText` absent and
+logs the extraction status.
+
+## Backfill Existing Bills
+
+Run the PDF text backfill in dry-run mode first:
+
+```sh
+yarn firebase-admin run-script backfillBillPdfText --env dev -- --court 194 --bills "H1 H18 H4787 H5008 S2539" --output ./bill-pdf-text-dry-run.csv
+```
+
+After reviewing the CSV, commit writes:
+
+```sh
+yarn firebase-admin run-script backfillBillPdfText --env dev -- --court 194 --commit true --output ./bill-pdf-text-dev.csv
+```
+
+The script only writes `content.DocumentText` and `fetchedAt` for bills that are
+missing `content.DocumentText`. Bills that already have text are skipped.
+
+## Summary And Topic Backfill
+
+Updating existing bill documents does not trigger the Python LLM function,
+because that function currently runs on document creation only. After committing
+PDF text, run the LLM backfill for the repaired bills:
+
+```sh
+python llm/backfill_summaries_runner.py --court 194 --bill-ids "H1 H4787 H5008 S2539" --output ./summaries-and-topics.csv
+```
+
+Use `--dry-run` to verify which rows would be processed without updating
+Firestore.
+
+`backfill_summaries.py` is the legacy immediate-run wrapper.
+`backfill_summaries_runner.py` is the import-safe CLI and test target.
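
The backfill rules in `docs/bill-pdf-text-extraction.md` (skip bills that already have text, write only `content.DocumentText` and `fetchedAt`, and write nothing unless `--commit true` is passed) can be sketched as a pure planning step. This is a minimal sketch, not the `backfillBillPdfText` script itself, which this excerpt does not show. `BillSnapshot`, `ExtractionOutcome`, `planBackfillAction`, and `writesToApply` are hypothetical names, and `fetchedAt` is modeled as a plain `Date` purely for illustration.

```typescript
// Hypothetical shapes for illustration; the real script's types are not shown here.
type BillSnapshot = {
  id: string
  content: { DocumentText?: string | null }
}

type ExtractionOutcome =
  | { status: "extracted"; text: string }
  | { status: "no-text" | "too-short" | "parse-error" | "fetch-error" }

type BackfillAction =
  | { id: string; action: "skip-has-text" }
  | { id: string; action: "report-only"; status: string }
  | {
      id: string
      action: "write"
      update: { "content.DocumentText": string; fetchedAt: Date }
    }

type WriteAction = Extract<BackfillAction, { action: "write" }>

// Decide what to do with one bill. `outcome` is the PDF extraction result, or
// undefined when extraction was not attempted.
function planBackfillAction(
  bill: BillSnapshot,
  outcome: ExtractionOutcome | undefined,
  now: Date
): BackfillAction {
  // Bills that already have text are skipped.
  if (bill.content.DocumentText != null) {
    return { id: bill.id, action: "skip-has-text" }
  }
  // Only a successful extraction produces a write, limited to two fields.
  if (outcome !== undefined && outcome.status === "extracted") {
    return {
      id: bill.id,
      action: "write",
      update: { "content.DocumentText": outcome.text, fetchedAt: now }
    }
  }
  // Everything else (for example an image-only PDF) is reported, not repaired.
  return {
    id: bill.id,
    action: "report-only",
    status: outcome !== undefined ? outcome.status : "not-attempted"
  }
}

// In dry-run mode (commit === false) no writes are applied.
function writesToApply(
  actions: BackfillAction[],
  commit: boolean
): WriteAction[] {
  if (!commit) return []
  return actions.filter((a): a is WriteAction => a.action === "write")
}
```

Keeping the decision separate from the Firestore write would let a dry run and a committed run share the same plan, so the reviewed CSV and the eventual writes cannot drift apart.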

functions/package.json

Lines changed: 2 additions & 0 deletions
@@ -30,6 +30,7 @@
     "luxon": "^2.3.1",
     "nanoid": "^3.3.2",
     "object-hash": "^3.0.0",
+    "pdf-parse": "1.1.1",
     "runtypes": "6.6.0",
     "ssl-root-cas": "^1.3.1",
     "typesense": "^1.2.2",
@@ -41,6 +42,7 @@
     "@types/jsdom": "^21.1.7",
     "@types/luxon": "^2.0.9",
     "@types/object-hash": "^2.2.1",
+    "@types/pdf-parse": "1.1.5",
     "copyfiles": "^2.4.1",
     "firebase-functions-test": "^0.3.3",
     "firebase-tools": "^13.18.0",

functions/src/bills/bills.test.ts

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
+jest.mock("../malegislature", () => ({
+  getDocument: jest.fn(),
+  getDocumentPdf: jest.fn()
+}))
+jest.mock("./pdfText", () => ({
+  extractBillTextFromPdf: jest.fn()
+}))
+
+import { getDocumentWithPdfTextFallback } from "./documentTextFallback"
+import { extractBillTextFromPdf } from "./pdfText"
+
+const mockedApi = jest.requireMock("../malegislature") as {
+  getDocument: jest.Mock
+  getDocumentPdf: jest.Mock
+}
+const mockedExtractBillTextFromPdf =
+  extractBillTextFromPdf as jest.MockedFunction<typeof extractBillTextFromPdf>
+
+describe("getDocumentWithPdfTextFallback", () => {
+  beforeEach(() => {
+    jest.resetAllMocks()
+  })
+
+  it("does not fetch a PDF when API text is present", async () => {
+    mockedApi.getDocument.mockResolvedValue({ DocumentText: "API text" })
+
+    await expect(
+      getDocumentWithPdfTextFallback(194, "H1")
+    ).resolves.toMatchObject({
+      content: { DocumentText: "API text" },
+      documentTextSource: "api"
+    })
+    expect(mockedApi.getDocumentPdf).not.toHaveBeenCalled()
+  })
+
+  it("sets DocumentText when PDF extraction succeeds", async () => {
+    mockedApi.getDocument.mockResolvedValue({ DocumentText: null })
+    mockedApi.getDocumentPdf.mockResolvedValue(Buffer.from("pdf"))
+    mockedExtractBillTextFromPdf.mockResolvedValue({
+      status: "extracted",
+      text: "PDF text",
+      pageCount: 1,
+      charCount: 7
+    })
+
+    await expect(
+      getDocumentWithPdfTextFallback(194, "H1")
+    ).resolves.toMatchObject({
+      content: { DocumentText: "PDF text" },
+      documentTextSource: "pdf",
+      pdfTextExtraction: { status: "extracted" }
+    })
+  })
+
+  it("leaves DocumentText absent when PDF has no text", async () => {
+    mockedApi.getDocument.mockResolvedValue({ DocumentText: null })
+    mockedApi.getDocumentPdf.mockResolvedValue(Buffer.from("pdf"))
+    mockedExtractBillTextFromPdf.mockResolvedValue({
+      status: "no-text",
+      pageCount: 1,
+      charCount: 0
+    })
+
+    const result = await getDocumentWithPdfTextFallback(194, "H18")
+
+    expect(result.content).not.toHaveProperty("DocumentText")
+    expect(result.pdfTextExtraction).toMatchObject({ status: "no-text" })
+  })
+
+  it("leaves DocumentText absent when PDF fetch fails", async () => {
+    mockedApi.getDocument.mockResolvedValue({ DocumentText: null })
+    mockedApi.getDocumentPdf.mockRejectedValue(new Error("not found"))
+
+    const result = await getDocumentWithPdfTextFallback(194, "H18")
+
+    expect(result.content).not.toHaveProperty("DocumentText")
+    expect(result.pdfTextExtraction).toMatchObject({
+      status: "fetch-error",
+      error: "not found"
+    })
+  })
+})

functions/src/bills/bills.ts

Lines changed: 11 additions & 3 deletions
@@ -1,9 +1,13 @@
 import { isString } from "lodash"
+import { logger } from "firebase-functions"
 import { logFetchError } from "../common"
 import * as api from "../malegislature"
 import { createScraper } from "../scraper"
+import { getDocumentWithPdfTextFallback } from "./documentTextFallback"
 import { Bill, MISSING_TIMESTAMP } from "./types"
 
+export { getDocumentWithPdfTextFallback } from "./documentTextFallback"
+
 /**
  * There are around 8000 documents. With 8 batches per day, 20 parallel
  * scrapers, and 50 documents per batch, we will process all documents once per
@@ -18,7 +22,8 @@ export const { fetchBatch: fetchBillBatch, startBatches: startBillBatches } =
     fetchBatchTimeout: 240,
     startBatchTimeout: 240,
     fetchResource: async (court: number, id: string, current) => {
-      const content = await api.getDocument({ id, court })
+      const { content, pdfTextExtraction } =
+        await getDocumentWithPdfTextFallback(court, id)
       const history = await api
         .getBillHistory(court, id)
         .catch(logFetchError("bill history", id))
@@ -28,8 +33,11 @@ export const { fetchBatch: fetchBillBatch, startBatches: startBillBatches } =
         .getSimilarBills(court, id)
         .catch(logFetchError("similar bills", id))
         .then(bills => bills?.map(b => b.BillNumber).filter(isString) ?? [])
-      if (content.DocumentText == null) {
-        delete content.DocumentText
+
+      if (content.DocumentText == null && pdfTextExtraction) {
+        logger.info(
+          `No bill text extracted from PDF for ${court}/${id}: ${pdfTextExtraction.status}`
+        )
       }
 
       const resource: Partial<Bill> = {
Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+import * as api from "../malegislature"
+import { extractBillTextFromPdf, PdfTextExtractionResult } from "./pdfText"
+
+export type DocumentTextFallbackResult = {
+  content: any
+  documentTextSource?: "api" | "pdf"
+  pdfTextExtraction?: PdfTextExtractionResult | PdfFetchFailure
+}
+
+type PdfFetchFailure = {
+  status: "fetch-error"
+  charCount: 0
+  pageCount?: undefined
+  error: string
+}
+
+export async function getDocumentWithPdfTextFallback(
+  court: number,
+  id: string
+): Promise<DocumentTextFallbackResult> {
+  const content = await api.getDocument({ id, court })
+
+  if (content.DocumentText != null) {
+    return {
+      content,
+      documentTextSource: "api"
+    }
+  }
+
+  delete content.DocumentText
+
+  let pdf: Buffer
+  try {
+    pdf = await api.getDocumentPdf({ id, court })
+  } catch (e) {
+    return {
+      content,
+      pdfTextExtraction: {
+        status: "fetch-error",
+        charCount: 0,
+        error: e instanceof Error ? e.message : String(e)
+      }
+    }
+  }
+
+  const pdfTextExtraction = await extractBillTextFromPdf(pdf)
+  if (pdfTextExtraction.status === "extracted") {
+    content.DocumentText = pdfTextExtraction.text
+    return {
+      content,
+      documentTextSource: "pdf",
+      pdfTextExtraction
+    }
+  }
+
+  return {
+    content,
+    pdfTextExtraction
+  }
+}
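
`PdfTextExtractionResult` and `extractBillTextFromPdf` come from `./pdfText`, which this excerpt does not include. Judging only from the statuses exercised in the tests (`extracted`, `no-text`, `too-short`, `parse-error`), the classification of already-extracted text might look like the sketch below. The type name `ExtractionSketch`, the helpers `classifyExtractedText` and `toParseError`, and the 50-character threshold are all assumptions for illustration; the real threshold and type definition are not shown.

```typescript
// Hypothetical result type inferred from the test expectations, not the real one.
type ExtractionSketch =
  | { status: "extracted"; text: string; pageCount: number; charCount: number }
  | { status: "too-short"; text: string; pageCount: number; charCount: number }
  | { status: "no-text"; pageCount: number; charCount: 0 }
  | { status: "parse-error"; charCount: 0; error: string }

// Hypothetical threshold; the value used by the real extractor is not shown.
const MIN_CHARS_SKETCH = 50

// Classify text that a PDF parser has already produced.
function classifyExtractedText(
  text: string,
  pageCount: number,
  minChars: number = MIN_CHARS_SKETCH
): ExtractionSketch {
  const trimmed = text.trim()
  if (trimmed.length === 0) {
    // Typical of scanned or image-only PDFs: pages exist but carry no text.
    return { status: "no-text", pageCount, charCount: 0 }
  }
  if (trimmed.length < minChars) {
    // Some text was found, but too little to be trusted as the bill body.
    return {
      status: "too-short",
      text: trimmed,
      pageCount,
      charCount: trimmed.length
    }
  }
  return {
    status: "extracted",
    text: trimmed,
    pageCount,
    charCount: trimmed.length
  }
}

// Convert a thrown parser failure into a result instead of rethrowing.
function toParseError(e: unknown): ExtractionSketch {
  return {
    status: "parse-error",
    charCount: 0,
    error: e instanceof Error ? e.message : String(e)
  }
}
```

Returning a tagged result for every outcome, rather than throwing, is what lets `getDocumentWithPdfTextFallback` branch on `status === "extracted"` and pass every other status through for logging.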
Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
+const mockedPdfParse = jest.fn()
+
+jest.mock("pdf-parse/lib/pdf-parse", () => mockedPdfParse)
+
+import { extractBillTextFromPdf, normalizeExtractedBillText } from "./pdfText"
+
+describe("normalizeExtractedBillText", () => {
+  it("trims and collapses noisy whitespace", () => {
+    expect(
+      normalizeExtractedBillText(" \r\n Section 1. Text\t\t here. \n\n\n")
+    ).toBe("Section 1. Text here.")
+  })
+
+  it("removes standalone page counters", () => {
+    expect(
+      normalizeExtractedBillText("1 of 3\nHOUSE No. 1\n-- 2 of 3 --\nBill text")
+    ).toBe("HOUSE No. 1\nBill text")
+  })
+
+  it("preserves substantive bill text", () => {
+    const text =
+      "The General Laws are hereby amended.\nSection 2. This act shall take effect."
+
+    expect(normalizeExtractedBillText(text)).toBe(text)
+  })
+})
+
+describe("extractBillTextFromPdf", () => {
+  beforeEach(() => {
+    mockedPdfParse.mockReset()
+  })
+
+  it("returns extracted when text is long enough", async () => {
+    mockedPdfParse.mockResolvedValue({
+      text: "An Act " + "with enough extracted text. ".repeat(10),
+      numpages: 2,
+      numrender: 2,
+      info: {},
+      metadata: {},
+      version: "default"
+    })
+
+    const result = await extractBillTextFromPdf(Buffer.from("pdf"))
+
+    expect(result.status).toBe("extracted")
+    expect(result.pageCount).toBe(2)
+    expect(result.text).toContain("An Act")
+  })
+
+  it("returns no-text for empty extraction", async () => {
+    mockedPdfParse.mockResolvedValue({
+      text: " \n\t ",
+      numpages: 1,
+      numrender: 1,
+      info: {},
+      metadata: {},
+      version: "default"
+    })
+
+    await expect(
+      extractBillTextFromPdf(Buffer.from("pdf"))
+    ).resolves.toMatchObject({
+      status: "no-text",
+      charCount: 0,
+      pageCount: 1
+    })
+  })
+
+  it("returns too-short for tiny extraction", async () => {
+    mockedPdfParse.mockResolvedValue({
+      text: "short text",
+      numpages: 1,
+      numrender: 1,
+      info: {},
+      metadata: {},
+      version: "default"
+    })
+
+    await expect(
+      extractBillTextFromPdf(Buffer.from("pdf"))
+    ).resolves.toMatchObject({
+      status: "too-short",
+      text: "short text",
+      pageCount: 1
+    })
+  })
+
+  it("returns parse-error when parser throws", async () => {
+    mockedPdfParse.mockRejectedValue(new Error("bad pdf"))
+
+    await expect(
+      extractBillTextFromPdf(Buffer.from("pdf"))
+    ).resolves.toMatchObject({
+      status: "parse-error",
+      error: "bad pdf"
+    })
+  })
+})
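
The `normalizeExtractedBillText` tests above pin down three behaviors: whitespace is trimmed and collapsed, standalone page counters such as `1 of 3` or `-- 2 of 3 --` are dropped, and ordinary line breaks between sentences survive. The real function lives in `./pdfText`, which is not part of this excerpt. The following is one possible implementation that satisfies those three cases, with `normalizeBillTextSketch` and `PAGE_COUNTER` as hypothetical names; the actual code may differ.

```typescript
// Matches a line that is only a page counter, e.g. "1 of 3" or "-- 2 of 3 --".
// Assumed pattern, derived from the two counter shapes used in the tests.
const PAGE_COUNTER = /^(?:--\s*)?\d+\s+of\s+\d+(?:\s*--)?$/

function normalizeBillTextSketch(raw: string): string {
  return (
    raw
      // Normalize Windows and old Mac line endings to "\n".
      .replace(/\r\n?/g, "\n")
      .split("\n")
      // Collapse runs of spaces and tabs inside each line, then trim it.
      .map(line => line.replace(/[ \t]+/g, " ").trim())
      // Drop lines that consist solely of a page counter.
      .filter(line => !PAGE_COUNTER.test(line))
      .join("\n")
      // Keep paragraph breaks but cap consecutive blank lines at one.
      .replace(/\n{3,}/g, "\n\n")
      .trim()
  )
}
```

Working line by line keeps the counter check anchored to whole lines, so a phrase such as "section 2 of 3 parts" inside a sentence is left alone.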
