
Commit 453762b: merge pull request #2044 from Hack23/copilot/auto-fetch-full-text-documents
(2 parents: a63366f + 862da67)

feat: Auto-fetch full text for top-N documents per analysis run (`--auto-full-text-top-n`)

8 files changed: 691 additions, 22 deletions

.github/prompts/05-analysis-gate.md

Lines changed: 27 additions & 3 deletions
````diff
@@ -31,10 +31,13 @@ This is the **only** gate separating analysis from article generation. If it fai
 - `forward-indicators.md` declares **≥ 10 dated indicators** (bullet or table rows matching a date pattern across the four horizon sections).
 - `coalition-mathematics.md` contains a seat-count table (≥ 1 table row with `Ja`/`Nej`/`Avstår` or a party-to-seats mapping).
 - `implementation-feasibility.md` — when it names a recognised agency (Kriminalvården, Polismyndigheten, Försäkringskassan, Skatteverket, Migrationsverket, Arbetsförmedlingen, Socialstyrelsen, Transportstyrelsen, Trafikverket, Naturvårdsverket, Energimyndigheten) — contains a `statskontoret.se` URL citation **or** the literal phrase `none found` in the `Statskontoret relevance` row.
+9. **PIR status sidecar** — `pir-status.json` is present and valid so open PIRs can roll forward to the next cycle.
+10. **Top-2 full-text availability** — when `data-download-manifest.md` contains a `## Full-Text Fetch Outcomes` table (written by `download-parliamentary-data.ts --auto-full-text-top-n`), at least 2 top documents must have `full_text_available=true`. Add `<!-- full-text-fallback: <reason> -->` to the manifest to bypass (e.g. when full text is genuinely unavailable from the MCP server or the flag was not used).
+11. **Supplementary artifacts** — see §Supplementary checks below (blocking for aggregation/Tier-C/multi-run).
 
 ## Implementation
 
-No dedicated validator script exists yet — implement the checks as an inline bash gate. Full implementation (covers checks 1–9, plus conditional check 9b where applicable):
+No dedicated validator script exists yet — implement the checks as an inline bash gate. Full implementation (covers checks 1–11, plus conditional check 9b where applicable):
 
 ```bash
 set -Eeuo pipefail
````
```diff
@@ -238,9 +241,10 @@ fi
 # populate the `| **Statskontoret relevance** | ... |` row with either a
 # statskontoret.se URL or the literal `none found` when no relevant coverage exists.
 AGENCY_RE='Kriminalvård(en)?|Polismyndigheten|Försäkringskassan|Skatteverket|Migrationsverket|Arbetsförmedlingen|Socialstyrelsen|Transportstyrelsen|Trafikverket|Naturvårdsverket|Energimyndigheten'
+STATSKONTORET_RELEVANCE_RE='^\|[[:space:]]*\*\*Statskontoret relevance\*\*[[:space:]]*\|[[:space:]]*([^|]*statskontoret\.se[^|]*|[^|]*none found[^|]*)\|'
 if [ -s "$ANALYSIS_DIR/implementation-feasibility.md" ]; then
   if grep -qE "$AGENCY_RE" "$ANALYSIS_DIR/implementation-feasibility.md"; then
-    grep -qiE '^\|[[:space:]]*\*\*Statskontoret relevance\*\*[[:space:]]*\|[[:space:]]*([^|]*statskontoret\.se[^|]*|[^|]*none found[^|]*)\|' "$ANALYSIS_DIR/implementation-feasibility.md" \
+    grep -qiE "$STATSKONTORET_RELEVANCE_RE" "$ANALYSIS_DIR/implementation-feasibility.md" \
       || { echo "❌ implementation-feasibility.md: names a recognised agency but the Statskontoret relevance row lacks a statskontoret.se URL or 'none found'"; FAIL=1; }
   fi
 fi
```
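As a quick sanity check, the extracted pattern can be exercised in a JavaScript-flavoured form (an assumption on my part: `[[:space:]]` is POSIX ERE syntax for `grep -E` and becomes `\s` in JS; the sample row values below are invented):

```typescript
// JS-flavoured equivalent of STATSKONTORET_RELEVANCE_RE (assumption: \s
// replaces POSIX [[:space:]]; the /i flag mirrors grep -qiE).
const relevanceRow =
  /^\|\s*\*\*Statskontoret relevance\*\*\s*\|\s*([^|]*statskontoret\.se[^|]*|[^|]*none found[^|]*)\|/i;

// Invented sample rows:
console.log(relevanceRow.test('| **Statskontoret relevance** | statskontoret.se/placeholder |')); // → true
console.log(relevanceRow.test('| **Statskontoret relevance** | none found |'));                   // → true
console.log(relevanceRow.test('| **Statskontoret relevance** | pending |'));                      // → false
```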
````diff
@@ -319,6 +323,26 @@ except Exception as e:
 " 2>&1 || FAIL=1
 fi
 
+# Check 10 — top-2 full-text availability (auto-full-text-top-n gate)
+# When the manifest contains a "Full-Text Fetch Outcomes" table (written by
+# download-parliamentary-data.ts --auto-full-text-top-n), verify that at least
+# 2 top documents have full_text_available=true. A fallback annotation
+# <!-- full-text-fallback: <reason> --> anywhere in the manifest bypasses
+# this check so that runs without the flag, or runs where full text is
+# genuinely unavailable from the MCP server, are not blocked.
+if [ -s "$ANALYSIS_DIR/data-download-manifest.md" ]; then
+  if grep -q "## Full-Text Fetch Outcomes" "$ANALYSIS_DIR/data-download-manifest.md"; then
+    if grep -q "full-text-fallback:" "$ANALYSIS_DIR/data-download-manifest.md"; then
+      : # Fallback annotation present — bypass check
+    else
+      FT_SUCCESS=$(grep -cE '^\|[[:space:]]*[A-Za-z0-9_-]+[[:space:]]*\|[[:space:]]*true' \
+        "$ANALYSIS_DIR/data-download-manifest.md" || true)
+      [ "${FT_SUCCESS:-0}" -ge 2 ] \
+        || { echo "❌ data-download-manifest.md: Full-Text Fetch Outcomes table present but fewer than 2 top documents have full_text_available=true (found ${FT_SUCCESS:-0}). Add <!-- full-text-fallback: <reason> --> to the manifest to bypass."; FAIL=1; }
+    fi
+  fi
+fi
+
 [ "$FAIL" -eq 0 ] || exit 1
 ```
````
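For illustration, the same check-10 decision can be sketched in TypeScript (a sketch under assumptions, not repo code: `fullTextGatePasses` is a hypothetical name and the `dok_id` values are invented):

```typescript
// Mirrors the bash gate: table absent → not applicable (pass); fallback
// annotation → bypass (pass); otherwise require ≥ 2 rows with
// full_text_available=true.
function fullTextGatePasses(manifest: string): boolean {
  if (!manifest.includes('## Full-Text Fetch Outcomes')) return true;
  if (manifest.includes('full-text-fallback:')) return true;
  const successRows = manifest
    .split('\n')
    .filter(line => /^\|\s*[A-Za-z0-9_-]+\s*\|\s*true/.test(line));
  return successRows.length >= 2;
}

// Invented sample manifest fragment:
const sample = [
  '## Full-Text Fetch Outcomes',
  '',
  '| dok_id | full_text_available | chars | notes |',
  '|--------|--------------------:|------:|-------|',
  '| HB01JuU1 | true | 41230 | persisted: full-text/HB01JuU1.md |',
  '| HB01JuU2 | true | 38815 | persisted: full-text/HB01JuU2.md |',
  '| HB01JuU3 | false | 0 | no content returned |',
].join('\n');

console.log(fullTextGatePasses(sample)); // → true (two rows are true)
```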

````diff
@@ -351,7 +375,7 @@ Non-blocking for `standard` / `deep` runs; **blocking for `comprehensive` / Tier
 Inline bash probe — append to the main block after `FAIL=0` bookkeeping completes. Supplementary artifacts have **three independent blocking triggers**, not a single tier-only rule: **aggregation article types** (`weekly-review`, `monthly-review`) require the aggregation artifacts; any run whose **tier** is `comprehensive` (the Tier-C run mode) requires the Tier-C supplementary set; and `cross-run-diff.md` is blocking whenever the workflow has **≥ 2 production runs** of the same article type, including `standard` and `deep` runs. `ARTICLE_TYPE` encodes the workflow family; `ANALYSIS_TIER` (when set) encodes the depth tier (`standard` | `deep` | `comprehensive`); `ANALYSIS_RUN_COUNT` (when set) is the numeric count of runs for the same article-generation cycle (if unset or non-numeric, treated as `1`).
 
 ```bash
-# Check 10 — supplementary artifacts (blocking for aggregation types, any Tier-C run, and S5 when run-count >= 2)
+# Check 11 — supplementary artifacts (blocking for aggregation types, any Tier-C run, and S5 when run-count >= 2)
 IS_AGGREGATION=0
 IS_TIER_C=0
 IS_MULTI_RUN=0
````

analysis/methodologies/ai-driven-analysis-guide.md

Lines changed: 6 additions & 4 deletions
```diff
@@ -16,7 +16,7 @@
   <a href="#"><img src="https://img.shields.io/badge/Classification-Public-green?style=for-the-badge" alt="Classification"/></a>
 </p>
 
-**📋 Document Owner:** CEO | **📄 Version:** 6.6 | **📅 Last Updated:** 2026-04-25 (UTC)
+**📋 Document Owner:** CEO | **📄 Version:** 6.7 | **📅 Last Updated:** 2026-04-27 (UTC)
 **🔄 Review Cycle:** Quarterly | **⏰ Next Review:** 2026-07-21
 **🏢 Owner:** Hack23 AB (Org.nr 5595347807) | **🏷️ Classification:** Public
```

````diff
@@ -87,11 +87,13 @@ Scripts run the download. Example:
 
 ```bash
 npx tsx scripts/download-parliamentary-data.ts \
   --date ${ARTICLE_DATE} \
-  --scope ${DOC_TYPE} \
-  --out analysis/daily/${ARTICLE_DATE}/${DOC_TYPE}/data/
+  --doc-type ${DOC_TYPE} \
+  --auto-full-text-top-n 2
 ```
 
-**Write `data-download-manifest.md`** using the [manifest template](../templates/data-download-manifest.md). It records what arrived, from which MCP tools, with what data-depth distribution (FULL-TEXT / SUMMARY / METADATA-ONLY).
+**`--auto-full-text-top-n 2`** (recommended for L2/L3 runs): after the bulk download, the script calls `get_dokument_innehall` with `include_full_text=true` for the top-2 documents (by order in the downloaded batch) and persists the retrieved content to `analysis/daily/${ARTICLE_DATE}/${DOC_TYPE}/full-text/{dok_id}.md`. Accept the extra 30–60 s as a documented quality investment. The manifest's `## Full-Text Fetch Outcomes` table records `full_text_available` per `dok_id`; the analysis gate (check 10) enforces that ≥ 2 succeed or a `<!-- full-text-fallback: <reason> -->` annotation is present.
+
+**Write `data-download-manifest.md`** using the [manifest template](../templates/data-download-manifest.md). It records what arrived, from which MCP tools, with what data-depth distribution (FULL-TEXT / SUMMARY / METADATA-ONLY) and — when `--auto-full-text-top-n` is used — the `## Full-Text Fetch Outcomes` table.
 
 After `download-parliamentary-data.ts` completes for `committeeReports`, also run the voting-records script to capture party-level vote counts and defector detection for each betänkande:
````
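The persistence layout described in this section can be captured as a tiny pure helper (hypothetical, for illustration only; the repo derives this path inside `download-parliamentary-data.ts`, and the `dok_id` value is invented):

```typescript
// Hypothetical helper: where --auto-full-text-top-n persists retrieved content.
function fullTextPath(articleDate: string, docType: string, dokId: string): string {
  return `analysis/daily/${articleDate}/${docType}/full-text/${dokId}.md`;
}

console.log(fullTextPath('2026-04-27', 'committeeReports', 'HB01JuU1'));
// → analysis/daily/2026-04-27/committeeReports/full-text/HB01JuU1.md
```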
scripts/download-parliamentary-data.ts

Lines changed: 46 additions & 6 deletions
```diff
@@ -34,8 +34,9 @@ import {
   flattenDocuments,
   subtractBusinessDays,
   MAX_LOOKBACK_BUSINESS_DAYS,
+  fetchFullTextForTopN,
 } from './parliamentary-data/data-downloader.js';
-import type { DocumentTypeKey } from './parliamentary-data/data-downloader.js';
+import type { DocumentTypeKey, FullTextFetchOutcome } from './parliamentary-data/data-downloader.js';
 
 import { persistDownloadedData, sanitizeDokId } from './parliamentary-data/data-persistence.js';
```

```diff
@@ -148,10 +149,11 @@ export function parseArgs(argv: string[]): {
       })
     : [];
 
-  // --auto-full-text-top-n: Override the per-type full-text enrichment limit.
-  // When set, only the top N documents per type receive fetchDocumentDetails
-  // (full-text) enrichment, enabling more targeted significance-scoring input.
-  // Defaults to MAX_ENRICHMENT_PER_TYPE when omitted (null → caller uses default).
+  // --auto-full-text-top-n: Override the per-type full-text enrichment limit and
+  // persist full text outcomes for the first N documents in the current filtered
+  // array order. Defaults to null when omitted so downloadAllDocuments uses
+  // MAX_ENRICHMENT_PER_TYPE; explicit 0 disables per-type enrichment and
+  // persisted full-text fetching. No DIW significance ranking is applied here.
  const autoFullTextTopNArg = get('--auto-full-text-top-n');
  let autoFullTextTopN: number | null = null;
  if (autoFullTextTopNArg !== null) {
```
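The parsing semantics the revised comment describes can be sketched as follows (illustrative: `parseAutoFullTextTopN` is a hypothetical name, not the repo's function, and the rejection behaviour for bad input is my assumption):

```typescript
// Omitted flag → null (caller falls back to MAX_ENRICHMENT_PER_TYPE);
// explicit 0 → disables per-type enrichment and persisted full-text fetching;
// N > 0 → limit to the first N documents in the filtered array order.
function parseAutoFullTextTopN(raw: string | null): number | null {
  if (raw === null) return null; // flag omitted → use default limit
  const n = Number(raw);
  if (!Number.isInteger(n) || n < 0) {
    throw new Error(`--auto-full-text-top-n expects a non-negative integer, got: ${raw}`);
  }
  return n;
}
```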
```diff
@@ -235,6 +237,7 @@ function serializeDataManifest(
   docCounts: Record<string, number>,
   dateFilteredTotal: number,
   dataFreshness: string | null,
+  fullTextOutcomes?: FullTextFetchOutcome[],
 ): string {
   const totalDocs = Object.values(docCounts).reduce((a, b) => a + b, 0);
   const lines: string[] = [
```
```diff
@@ -267,6 +270,21 @@ function serializeDataManifest(
     lines.push(`Data sourced from ${dataFreshness} via lookback fallback — check freshness indicators.`);
   }
 
+  // Append full-text fetch outcomes when --auto-full-text-top-n was used.
+  if (fullTextOutcomes && fullTextOutcomes.length > 0) {
+    lines.push('', '## Full-Text Fetch Outcomes', '');
+    lines.push('| dok_id | full_text_available | chars | notes |');
+    lines.push('|--------|--------------------:|------:|-------|');
+    for (const o of fullTextOutcomes) {
+      const available = o.success ? 'true' : 'false';
+      const chars = o.chars > 0 ? String(o.chars) : '0';
+      const notes = o.reason ?? (o.filePath ? `persisted: ${o.filePath}` : '');
+      lines.push(`| ${o.dokId} | ${available} | ${chars} | ${notes} |`);
+    }
+    const successCount = fullTextOutcomes.filter(o => o.success).length;
+    lines.push('', `**Full-text retrieved**: ${successCount}/${fullTextOutcomes.length} top documents`);
+  }
+
   return lines.join('\n');
 }
```
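Condensed into a runnable standalone form, the appended table logic behaves like this (shape taken from the diff above; the `Outcome` interface name and the sample values are invented):

```typescript
// Minimal re-rendering of the manifest fragment the diff appends.
interface Outcome { dokId: string; success: boolean; chars: number; reason?: string; filePath?: string }

function renderOutcomes(outcomes: Outcome[]): string {
  const lines: string[] = ['## Full-Text Fetch Outcomes', ''];
  lines.push('| dok_id | full_text_available | chars | notes |');
  lines.push('|--------|--------------------:|------:|-------|');
  for (const o of outcomes) {
    const notes = o.reason ?? (o.filePath ? `persisted: ${o.filePath}` : '');
    lines.push(`| ${o.dokId} | ${o.success ? 'true' : 'false'} | ${o.chars > 0 ? o.chars : 0} | ${notes} |`);
  }
  const ok = outcomes.filter(o => o.success).length;
  lines.push('', `**Full-text retrieved**: ${ok}/${outcomes.length} top documents`);
  return lines.join('\n');
}

console.log(renderOutcomes([
  { dokId: 'HB01JuU1', success: true, chars: 41230, filePath: 'full-text/HB01JuU1.md' },
  { dokId: 'HB01JuU2', success: false, chars: 0, reason: 'no content returned' },
]));
```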

```diff
@@ -514,10 +532,27 @@ async function runPreArticleAnalysis(opts: {
   const persistResult = persistDownloadedData(data, resolvedRm);
   console.log(` 🗄️ Persisted data for ${persistResult.written} documents to ${path.relative(REPO_ROOT, persistResult.dataRoot)}/ (${persistResult.skipped} skipped)`);
 
+  // ── Step 2b: Auto-fetch full text for top-N documents ────────────────────
+  let fullTextOutcomes: FullTextFetchOutcome[] | undefined;
+  if (autoFullTextTopN !== null && autoFullTextTopN > 0 && allDocs.length > 0) {
+    console.log(`\n📄 Step 2b: Auto-fetching full text for top-${autoFullTextTopN} documents (--auto-full-text-top-n=${autoFullTextTopN})...`);
+    console.log(' ⏱️ This may take 30–60 s — documented quality investment for deep-analysis tiers.');
+    fullTextOutcomes = await fetchFullTextForTopN(client, allDocs, autoFullTextTopN, outputDir);
+    const successCount = fullTextOutcomes.filter(o => o.success).length;
+    console.log(` ✅ Full text retrieved for ${successCount}/${fullTextOutcomes.length} document(s)`);
+    for (const o of fullTextOutcomes) {
+      if (o.success) {
+        console.log(`   ✅ ${o.dokId}: ${o.chars} chars → ${o.filePath}`);
+      } else {
+        console.warn(`   ⚠️ ${o.dokId}: ${o.reason}`);
+      }
+    }
+  }
+
   // Write data-download-manifest.md (factual download summary — NOT analysis)
   const manifestContent = serializeDataManifest(
     date, generatedAt, manifest.dataSources, manifest.docCounts,
-    allDocs.length, dataFreshness,
+    allDocs.length, dataFreshness, fullTextOutcomes,
   );
   const manifestPath = path.join(outputDir, 'data-download-manifest.md');
   fs.writeFileSync(manifestPath, manifestContent, 'utf8');
```
```diff
@@ -553,6 +588,11 @@ async function runPreArticleAnalysis(opts: {
   console.log(`\n✅ Data download complete! Results in: ${path.relative(REPO_ROOT, outputDir)}/`);
   console.log(` 📄 ${totalFiles} total files written (1 manifest + ${storedCount} documents)`);
   console.log(` 📊 ${allDocs.length} documents available for AI analysis`);
+  if (autoFullTextTopN !== null && autoFullTextTopN > 0) {
+    const successCount = fullTextOutcomes?.filter(o => o.success).length ?? 0;
+    const attempted = fullTextOutcomes?.length ?? 0;
+    console.log(` 📄 Full text: ${successCount}/${attempted} top-${autoFullTextTopN} documents (see full-text/ sub-folder)`);
+  }
   if (docType) {
     console.log(` 📋 Scoped to: ${docType}`);
   }
```

scripts/fetch-statskontoret.ts

Lines changed: 2 additions & 2 deletions
```diff
@@ -180,8 +180,8 @@ export async function fetchStatskontoretCached(
 
   try {
     links = await client.discoverDownloads(sourceKey);
-    // Stamp provenance after the fetch completes so `fetchedAt` reflects when
-    // the data was actually retrieved, not when the request was issued.
+    // Stamp provenance after discovery completes so `fetchedAt` reflects the
+    // cache completion time, not when the request was issued.
     fetchedAt = new Date().toISOString();
     writeCacheEntry(filePath, { fetchedAt, sourceKey, links });
   } catch (error) {
```
