Commit 8cdbdba

Merge pull request #1998 from Hack23/copilot/improve-article-generation-templates
Self-host Mermaid + aggregator HTML quality fixes + scripted article minimums
2 parents 3106e54 + c5d7d6d commit 8cdbdba

94 files changed

Lines changed: 20862 additions & 17661 deletions

.gitignore

Lines changed: 9 additions & 0 deletions
@@ -84,9 +84,18 @@ builds/
8484
# Pass-1 snapshots used by the analysis gate (see .github/prompts/05-analysis-gate.md)
8585
analysis/daily/*/*/pass1/
8686

87+
# SEO metadata backfill diff reports (see analysis/metadata-backfill/README.md and
88+
# .github/prompts/seo-metadata-contract.md §6). Intentionally NOT ignored —
89+
# the dry-run CSV is committed so PRs 3 / 4 / 5 can consume it deterministically
90+
# and reviewers can inspect tier classification + violation codes in-place.
8791
# SEO metadata backfill diff reports (see analysis/metadata-backfill/README.md and
8892
# .github/prompts/seo-metadata-contract.md §6). Intentionally NOT ignored —
8993
# the dry-run CSV is committed so PRs 3 / 4 / 5 can consume it deterministically
9094
# and reviewers can inspect tier classification + violation codes in-place.
9195
!analysis/metadata-backfill/*.csv
9296
!analysis/metadata-backfill/README.md
97+
98+
# Vendored Mermaid distribution copied from node_modules by
99+
# scripts/copy-vendor-mermaid.ts during `prebuild`. Reproducible from the
100+
# pinned `mermaid` devDependency in package.json — do not commit.
101+
js/lib/mermaid/

Article-Generation.md

Lines changed: 70 additions & 4 deletions
@@ -454,18 +454,38 @@ npx tsx scripts/aggregate-analysis.ts --all
454454

455455
### Cleaning and transformation rules
456456

457-
The aggregator:
457+
The aggregator (see [`scripts/render-lib/aggregator.ts`](scripts/render-lib/aggregator.ts) `cleanArtifactBody`):
458458

459459
- Requires `executive-brief.md`.
460460
- Inserts a `Reader Intelligence Guide` before artifact sections so public readers can find high-value analysis such as media framing and forward indicators without scanning every audit artifact.
461461
- Strips YAML front matter from each artifact.
462462
- Removes the first H1 from each artifact and injects its own consistent `## Section Title` heading.
463+
- **Demotes every internal heading by one level** (`##` → `###`, `###` → `####`, …, capped at H6) before concatenation. Without this, every artifact's own H2s become siblings of the wrapper-injected `## Section Title` and the rendered article ends up with ~170 H2s and a flat outline that violates WCAG 2.4.6 ("Headings and Labels"). Headings inside fenced code blocks are not affected. **Tested by** [`tests/render-lib.test.ts > demoteHeadings`](tests/render-lib.test.ts).
464+
- **Strips legacy `_Source: file.md_` italic preamble lines** that some artifact templates author at the top of their body. Source attribution now lives in the auto-generated [Reader Intelligence Guide](#-reader-intelligence-guide-deterministic-navigation-layer) and the [`## Article Sources` appendix](#-article-sources-appendix-canonical-source-list) — repeating it under every heading reads like a folder listing, not journalism. Inline prose mentions like *"primary source: data.riksdagen.se/…"* are preserved.
465+
- **Normalises heading slugs** to drop the leading hyphens that `github-slugger` emits when a heading starts with a stripped character. For example, the emoji in `## 🎯 BLUF` is stripped, so the heading slugs to `-bluf` and would otherwise become `id="rm--bluf"` once the `rm-` prefix is applied. Both [`markdown.ts#rehypeSlugWithPrefix`](scripts/render-lib/markdown.ts) and [`aggregator.ts#anchorForTitle`](scripts/render-lib/aggregator.ts) collapse leading/trailing hyphens to keep heading IDs and Reader Intelligence Guide anchors in lock-step.
463466
- Removes leading admin bylines such as `Author`, `Run ID`, `Classification`, `Confidence`, `Prepared by`, `Methodology` and similar metadata fields.
464467
- Removes trailing `Document control`, `Audit trail`, `Generated by`, template footer and `Pass 2` self-audit sections.
465468
- Rewrites relative Markdown links to absolute GitHub blob URLs.
466469
- Keeps Mermaid fences untouched so the renderer can preserve them.
470+
- Annotates each section heading with an HTML comment of shape `<!-- source: <file> :: <github-blob-url> -->` for offline auditors. The comment is dropped by `rehype-sanitize` so it never reaches rendered HTML.
467471
- Builds front matter with `title`, `description`, `date`, `subfolder`, `slug`, `source_folder`, `generated_at`, `language` and `layout`.
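The heading-demotion rule in the list above can be sketched as follows. This is a minimal illustration, assuming ATX-style headings only; the function body is hypothetical and the real `demoteHeadings` in `scripts/render-lib/aggregator.ts` may be implemented differently.

```typescript
// Hypothetical sketch of heading demotion; not the repository's implementation.
// Assumption: only ATX headings (`#` to `######`) need handling.
function demoteHeadings(markdown: string): string {
  let fence: string | null = null; // marker of the currently open fenced block, if any
  return markdown
    .split("\n")
    .map((line) => {
      const fenceMatch = /^\s{0,3}(`{3,}|~{3,})/.exec(line);
      if (fenceMatch) {
        const marker = fenceMatch[1];
        if (fence === null) {
          fence = marker; // opening fence
        } else if (marker[0] === fence[0] && marker.length >= fence.length) {
          fence = null; // matching closing fence
        }
        return line;
      }
      if (fence !== null) return line; // inside a code block: leave untouched
      const heading = /^(#{1,6})(\s+.*)$/.exec(line);
      if (!heading) return line;
      const level = Math.min(heading[1].length + 1, 6); // demote by one, capped at H6
      return "#".repeat(level) + heading[2];
    })
    .join("\n");
}
```

Tracking the open fence marker line by line is what keeps `## ...` lines inside code samples intact, which matches the "headings inside fenced code blocks are not affected" behaviour described above.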
468472

473+
### 📚 Article Sources appendix (canonical source list)
474+
475+
After the last artifact section, the aggregator emits a single `## Article Sources` H2 at the very end of the article. Each entry is a Markdown list item linking to the artifact on GitHub:
476+
477+
```markdown
478+
## Article Sources
479+
480+
Each section above projects one analysis artifact. The full audited markdown is available on GitHub:
481+
482+
- [`executive-brief.md`](https://github.com/Hack23/riksdagsmonitor/blob/main/analysis/daily/.../executive-brief.md)
483+
- [`synthesis-summary.md`](https://github.com/.../synthesis-summary.md)
484+
-
485+
```
486+
487+
This replaces the legacy per-section `_Source: file.md_` italics. Auditors get one canonical list; readers see clean prose; SEO crawlers see one trustworthy `<ul>` of primary-source links instead of 25+ duplicated italics.
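As a rough illustration of how such an appendix could be assembled, the sketch below builds the block from a list of artifact file names. The function name `renderSourcesAppendix` and its parameters are hypothetical; the aggregator's actual code is not shown in this diff.

```typescript
// Hypothetical helper: name and parameters are illustrative assumptions.
function renderSourcesAppendix(files: string[], blobBaseUrl: string): string {
  const base = blobBaseUrl.replace(/\/+$/, ""); // avoid a double slash when joining
  const entries = files.map((file) => `- [\`${file}\`](${base}/${file})`);
  return [
    "## Article Sources",
    "",
    "Each section above projects one analysis artifact. The full audited markdown is available on GitHub:",
    "",
    ...entries,
    "",
  ].join("\n");
}
```

Emitting one list at the end keeps the source links in a single place, which is the property the paragraph above relies on.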
488+
469489
### Title and description extraction
470490

471491
`article.md` metadata comes from `executive-brief.md`:
@@ -495,14 +515,47 @@ layout: article
495515
---
496516
```
497517

498-
It then emits deterministic sections such as `## Executive Brief`, `## Synthesis Summary`, `## Intelligence Assessment — Key Judgments`, `## Significance Scoring`, and so on. Each section includes a source attribution line like:
518+
It then emits deterministic sections such as `## Executive Brief`, `## Synthesis Summary`, `## Intelligence Assessment — Key Judgments`, `## Significance Scoring`, and so on. Source attribution is provided by the auto-generated `## Reader Intelligence Guide` (top of article) and `## Article Sources` appendix (bottom of article); the per-section heading carries an HTML comment for offline auditors:
499519

500520
```markdown
501-
_Source: [`executive-brief.md`](https://github.com/Hack23/riksdagsmonitor/blob/main/analysis/daily/2026-04-24/interpellations/executive-brief.md)_
521+
## Executive Brief
522+
<!-- source: executive-brief.md :: https://github.com/Hack23/riksdagsmonitor/blob/main/analysis/daily/2026-04-24/interpellations/executive-brief.md -->
523+
524+
### 🎯 BLUF
525+
526+
…artifact body content, with all internal headings demoted by one level so the outline stays semantically nested…
502527
```
503528

504529
The generated first body section is `## Reader Intelligence Guide`, which is intentionally not sourced to a single artifact because it is a deterministic navigation projection of the artifact set.
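Because the Guide links to sections by anchor, the anchor it computes has to match the heading ID the renderer produces. The sketch below shows the leading-hyphen normalisation in isolation. It is a simplified, ASCII-only stand-in and not the `github-slugger` algorithm; the function name is hypothetical, and only the `rm-` prefix is taken from the document.

```typescript
// Simplified stand-in for slugging: NOT github-slugger, ASCII-only by assumption.
function anchorForTitleSketch(title: string, prefix = "rm-"): string {
  const slug = title
    .toLowerCase()
    .replace(/[^a-z0-9\s-]/g, "") // drop emoji, punctuation and other non-ASCII characters
    .replace(/\s/g, "-") // each whitespace character becomes a hyphen
    .replace(/^-+|-+$/g, ""); // remove leading and trailing hyphens
  return slug === "" ? "" : prefix + slug; // empty slug stays empty so callers can flag it
}
```

With this sketch, a heading title of `🎯 BLUF` yields `rm-bluf` rather than `rm--bluf`, and an emoji-only title yields an empty string.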
505530

531+
### ✅ Article minimum-content validator (`scripts/validate-article.ts`)
532+
533+
Every aggregated `analysis/daily/$DATE/$SUBFOLDER/article.md` is checked by [`scripts/validate-article.ts`](scripts/validate-article.ts) — a hard, scripted CI gate that fails the build on any of the following violations:
534+
535+
| Rule code | What it blocks | Why it matters |
536+
|---|---|---|
537+
| `unresolved-placeholder` | `[REQUIRED:…]`, `AI_MUST_REPLACE`, `<insert …>`, `TBD:`, `FILL IN` strings surviving Pass-2 | Templates carry these markers on disk; if they reach `article.md` the AI agent skipped a substitution. Article is not publishable. |
538+
| `missing-reader-guide` | Article missing `## Reader Intelligence Guide` | Aggregator-generated; if missing, the aggregator broke. |
539+
| `missing-executive-brief` | Article missing `## Executive Brief` H2 | Required artifact malformed. |
540+
| `missing-bluf` | No `BLUF` heading anywhere | Editorial product cannot ship without a Bottom-Line-Up-Front. |
541+
| `missing-sources-appendix` | Article missing `## Article Sources` | Aggregator-generated; if missing, re-aggregate. |
542+
| `bluf-too-short` | BLUF prose < 80 chars | Stub BLUFs (e.g. `TODO`, `pending`) escape Pass-2. A publishable BLUF needs actor + active verb + object + when + so-what. |
543+
| `bluf-too-long` | BLUF prose > 1200 chars | Runaway dumps belong in Synthesis Summary or Intelligence Assessment, not the 60-second read. |
544+
| `empty-heading-slug` | Any heading whose permissive slug is empty (e.g. emoji-only) | Empty `#anchor` would break the Reader Intelligence Guide and SERP deep-links. |
545+
| `per-doc-missing-dok_id` | Any `### HD…`/`### FiU…` per-document subsection lacking at least one dok_id-style code in its body | Every per-document subsection must trace to a primary-source identifier; orphan sections are blocked. |
546+
547+
**Run locally:**
548+
549+
```bash
550+
# Validate every aggregated article in the repo:
551+
npm run validate-article
552+
553+
# Validate a single article:
554+
npx tsx scripts/validate-article.ts analysis/daily/2026-04-24/interpellations/article.md
555+
```
556+
557+
The validator is wired into `npm run validate-all` and runs as a hard CI gate after aggregation. It is **content-only** — structural projections (heading demotion, source-preamble stripping, slug normalisation) are unit-tested in [`tests/render-lib.test.ts`](tests/render-lib.test.ts); this script guards the AI-authored contribution: the artifact contents that the aggregator concatenates.
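To make the rule table more concrete, here is a minimal sketch of how a few of the checks could be expressed. The regular expressions, type names and function name are assumptions derived from the table; they are not taken from `scripts/validate-article.ts`, and the BLUF length and dok_id rules are left out.

```typescript
// Hypothetical sketch of a subset of the rules; patterns are assumptions.
type Violation = { rule: string; detail: string };

const PLACEHOLDER_PATTERNS: RegExp[] = [
  /\[REQUIRED:[^\]]*\]/,
  /AI_MUST_REPLACE/,
  /<insert [^>]*>/i,
  /\bTBD:/,
  /\bFILL IN\b/,
];

const REQUIRED_H2: Array<[string, string]> = [
  ["missing-reader-guide", "Reader Intelligence Guide"],
  ["missing-executive-brief", "Executive Brief"],
  ["missing-sources-appendix", "Article Sources"],
];

function validateArticleSketch(markdown: string): Violation[] {
  const violations: Violation[] = [];
  const lines = markdown.split("\n");

  // unresolved-placeholder: report the first hit of each pattern
  for (const pattern of PLACEHOLDER_PATTERNS) {
    const hit = pattern.exec(markdown);
    if (hit) violations.push({ rule: "unresolved-placeholder", detail: hit[0] });
  }

  // required H2 sections, compared by exact title
  const h2Titles = lines
    .filter((l) => /^##\s+/.test(l))
    .map((l) => l.replace(/^##\s+/, "").trim());
  for (const [rule, heading] of REQUIRED_H2) {
    if (!h2Titles.includes(heading)) violations.push({ rule, detail: `## ${heading}` });
  }

  // missing-bluf: any heading level containing the word BLUF counts
  const hasBluf = lines.some((l) => /^#{1,6}\s+.*\bBLUF\b/.test(l));
  if (!hasBluf) violations.push({ rule: "missing-bluf", detail: "no BLUF heading" });

  return violations;
}
```

Note that this sketch does not skip fenced code blocks, so a placeholder quoted inside a code sample would be reported as well.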
558+
506559
---
507560

508561
## 🌐 How `article.md` Becomes HTML
@@ -659,7 +712,20 @@ The rendering path is:
659712
2. [`scripts/render-lib/markdown.ts`](scripts/render-lib/markdown.ts) rewrites them to `<pre class="mermaid">` before Markdown parsing.
660713
3. `rehype-sanitize` allows the `pre.mermaid` class.
661714
4. [`scripts/render-lib/chrome.ts`](scripts/render-lib/chrome.ts) includes `js/lib/mermaid-init.mjs`.
662-
5. [`js/lib/mermaid-init.mjs`](js/lib/mermaid-init.mjs) dynamically imports Mermaid `11.4.1` from jsDelivr, initializes a dark theme and renders all Mermaid blocks after page load.
715+
5. [`js/lib/mermaid-init.mjs`](js/lib/mermaid-init.mjs) dynamically imports Mermaid `11.4.1` from the **same-origin vendored copy under `js/lib/mermaid/`**, initializes a dark theme and renders all Mermaid blocks after page load.
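The fence-rewriting step in the list above can be illustrated with a small sketch. The function names are hypothetical, and escaping the diagram source is an assumption of this sketch; the actual rewrite in `scripts/render-lib/markdown.ts` may work differently.

```typescript
// Hypothetical sketch; escaping the diagram body is an assumption made here.
function escapeHtml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function rewriteMermaidFences(markdown: string): string {
  // Match a ```mermaid fence up to the next line consisting only of ```.
  return markdown.replace(
    /^```mermaid[ \t]*\n([\s\S]*?)\n```[ \t]*$/gm,
    (_match, body: string) => `<pre class="mermaid">${escapeHtml(body)}</pre>`,
  );
}
```

Doing the rewrite before Markdown parsing means the diagram source reaches the page as a `pre.mermaid` element instead of a highlighted code block, which is what the later sanitizer and init steps expect.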
716+
717+
The Mermaid distribution is vendored at build time:
718+
719+
| Step | Location | What it does |
720+
|---|---|---|
721+
| **Pin** | [`package.json`](package.json) `devDependencies` | `mermaid` is pinned (currently `11.4.1`), so it is supply-chain audited like every other dependency and appears in the npm SBOM. |
722+
| **Copy** | [`scripts/copy-vendor-mermaid.ts`](scripts/copy-vendor-mermaid.ts) | Run as the first step of `prebuild` (and `predev`). Copies `node_modules/mermaid/dist/mermaid.esm.min.mjs` and its required `chunks/mermaid.esm.min/*.mjs` into `js/lib/mermaid/` (≈2.6 MB, 64 files). Sourcemaps, type declarations, mocks and other ESM variants are excluded. |
723+
| **Gitignore** | [`.gitignore`](.gitignore) | `js/lib/mermaid/` is intentionally ignored — the directory is reproducible from the pinned dependency, so we don't commit duplicates of `node_modules` content. |
724+
| **Bundle** | [`.github/workflows/deploy-s3.yml`](.github/workflows/deploy-s3.yml) | The "Copy JS libraries to build output" step merges the full `js/` tree (including `js/lib/mermaid/`) into `dist/js/` after the Vite build, alongside `chart.umd.4.4.1.js`, `d3.7.9.0.min.js`, etc. |
725+
| **Deploy** | [`scripts/deploy-s3.sh`](scripts/deploy-s3.sh) | `*.mjs` files are uploaded with `Content-Type: application/javascript` and `Cache-Control: public, max-age=31536000, immutable` — same long-cache treatment as every other vendored asset. |
726+
| **Guard** | [`tests/no-external-cdn.test.ts`](tests/no-external-cdn.test.ts) | Vitest test that fails CI if any runtime file under `js/` or any rendered article under `news/` references `cdn.jsdelivr.net`, `cdnjs.cloudflare.com`, `unpkg.com`, `esm.sh`, `cdn.skypack.dev`, or `ajax.googleapis.com`. Riksdagsmonitor serves all JavaScript from its own S3/CloudFront origin — no external CDN allowed. |
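The core check behind the Guard row could look roughly like the sketch below. The host list is copied from that row; the function name and the plain substring matching are assumptions, and the real Vitest test additionally walks the files under `js/` and `news/`.

```typescript
// Hypothetical sketch of the scan; host list copied from the Guard row.
const FORBIDDEN_CDN_HOSTS: string[] = [
  "cdn.jsdelivr.net",
  "cdnjs.cloudflare.com",
  "unpkg.com",
  "esm.sh",
  "cdn.skypack.dev",
  "ajax.googleapis.com",
];

// Returns every forbidden host that appears anywhere in the given file contents.
function findExternalCdnReferences(source: string): string[] {
  return FORBIDDEN_CDN_HOSTS.filter((host) => source.includes(host));
}
```

Substring matching is deliberately blunt: it can flag a host name that only appears in a comment, which is an acceptable trade-off for a guard whose goal is zero external references.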
727+
728+
CSP impact: scripts can be allowed with `script-src 'self'` only — no third-party host needs to be added to the policy. SRI hashes for every Mermaid `.mjs` chunk are produced by `vite-plugin-sri-gen` because the files now live under the build output.
663729

664730
The analysis gate requires color-coded Mermaid through `style` directives or Mermaid `themeVariables` / `%%{init}` blocks.
665731

SECURITY_ARCHITECTURE.md

Lines changed: 1 addition & 1 deletion
@@ -385,7 +385,7 @@ Referrer-Policy: strict-origin-when-cross-origin
385385
Permissions-Policy: geolocation=(), microphone=(), camera=()
386386
```
387387

388-
**Note:** CSP includes `'unsafe-inline'` for Chart.js/D3.js inline styles and large inline dashboard script (946 lines). The `connect-src` directive includes `https://raw.githubusercontent.com` to allow fetching CIA CSV data from the cia repository. Security headers are configured via AWS CloudFront Response Headers Policy for the primary deployment. GitHub Pages disaster recovery inherits default GitHub Pages security headers. Future enhancement: nonce-based CSP for stricter inline script control (roadmap: 2027). Chart.js, D3.js, and chartjs-plugin-annotation are hosted locally on CloudFront (js/lib/) rather than via external CDN, eliminating external script dependencies.
388+
**Note:** CSP includes `'unsafe-inline'` for Chart.js/D3.js inline styles and large inline dashboard script (946 lines). The `connect-src` directive includes `https://raw.githubusercontent.com` to allow fetching CIA CSV data from the cia repository. Security headers are configured via AWS CloudFront Response Headers Policy for the primary deployment. GitHub Pages disaster recovery inherits default GitHub Pages security headers. Future enhancement: nonce-based CSP for stricter inline script control (roadmap: 2027). Chart.js, D3.js, chartjs-plugin-annotation **and Mermaid** are hosted locally on CloudFront (`js/lib/`) rather than via external CDN, eliminating external script dependencies (CI-enforced by [`tests/no-external-cdn.test.ts`](tests/no-external-cdn.test.ts)).
389389

390390
**Control Mapping:**
391391
- ISO 27001: A.13.1 Network Security Management
