
feat: initial implementation — pipeline, four tracks, web viewer, CI#1

Merged: kanywst merged 18 commits into main from feat/initial-scaffold on May 17, 2026

kanywst (Member) commented May 17, 2026

Summary

  • Pipeline (scripts/awsdd/): collect (RSS + GitHub Releases) → normalize → score → report as a small shared Python package. Scoring is (freshness × 2) + (keyword × source_weight) + severity with word-bounded keyword matching, so generic fresh items only get the freshness baseline while topic-relevant keyword hits get amplified by source trust.
  • Tracks: iam, security, whats-new, releases. Each is self-contained under tracks/<name>/ with config/sources.yaml, an identical Makefile (derives track name from CURDIR), and a track-local README. New tracks are created via make new-track NAME=<name> and are picked up automatically by the root Makefile.
  • Retention: normalize drops items with published_at older than 180 days from normalized.json; prune.sh keeps 30 days of raw data and 60 days of daily reports. Weekly reports are kept indefinitely.
  • Web (web/): Astro 6 + Tailwind v4 (postcss) + recharts island. Static output. Routes: /, /<track>/, /<track>/<mode>/<slug>/. Reads tracks/*/data/scored.json and tracks/*/reports/**/*.md at build time. Markdown is sanitized via isomorphic-dompurify.
  • CI: ci.yml (lint / type-check / pytest / build), codeql.yml, audit.yml (pip-audit + npm audit + license report), daily-update.yml (06:00 UTC), weekly-digest.yml (Mon 08:00 UTC), deploy-pages.yml (push to main). Dependabot covers pip, npm, github-actions.
  • MIT license.
  • Seed data committed so the first Pages deploy has content before the first scheduled run lands.
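
The scoring formula in the Pipeline bullet can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/awsdd/score.py: the exponential freshness decay, its 7-day half-life, and all function and parameter names are assumptions, since the summary only fixes the overall shape `(freshness × 2) + (keyword × source_weight) + severity`.

```python
from datetime import datetime, timedelta, timezone


def freshness(published_at: datetime, now: datetime, half_life_days: float = 7.0) -> float:
    """Hypothetical decay: 1.0 for a brand-new item, halving every half_life_days."""
    age_days = max((now - published_at).total_seconds() / 86_400, 0.0)
    return 0.5 ** (age_days / half_life_days)


def score(fresh: float, keyword_hits: int, source_weight: float, severity: float) -> float:
    """Combine the three terms in the shape the summary describes."""
    return (fresh * 2) + (keyword_hits * source_weight) + severity


now = datetime(2026, 5, 17, tzinfo=timezone.utc)
fresh = freshness(now - timedelta(days=7), now)  # exactly one half-life old

# A generic fresh item only earns the freshness baseline...
generic = score(fresh, keyword_hits=0, source_weight=3.0, severity=0.0)
# ...while keyword hits are amplified by the (assumed) source weight.
relevant = score(fresh, keyword_hits=2, source_weight=3.0, severity=1.0)
```

Because the keyword term is multiplied by the source weight while freshness is only added, an item with no keyword hits cannot outrank a relevant item from a trusted source on recency alone.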

What I verified locally

  • All four tracks run make update end-to-end on a clean venv.
  • make -C tracks/iam weekly produces reports/weekly/2026-W20.md.
  • npm run build produces 10 static pages including dynamic /<track>/<mode>/<slug>/ routes; data loads, links render, no SSR warnings.
  • All committed Markdown lints clean against the repo's .markdownlint-cli2.jsonc.
  • pytest is green (40 tests, 79% coverage on scripts/awsdd/).
  • ruff check + ruff format --check clean.

What I did NOT do

  • Enable GitHub Pages on the repo. Settings → Pages → Source must be set to GitHub Actions for deploy-pages.yml to publish.
  • Configure branch protection. PRs into main should be gated on ci.yml jobs (Settings → Branches → main).

Test plan

  • Settings → Pages → Source = GitHub Actions, then re-run deploy-pages manually.
  • Open the deployed home page and confirm: bento layout renders, top items list populated, ISO-week trend chart shows.
  • Open /iam/, click into the latest daily/<date>, confirm the rendered Markdown matches tracks/iam/reports/daily/<date>.md.
  • Wait for tomorrow's daily-update run; confirm a chore(<track>): daily update YYYY-MM-DD commit lands on main.
  • Wait for Monday's weekly-digest run; confirm a chore(<track>): weekly digest YYYY-Www commit lands and a new weekly Markdown appears.

kanywst added 2 commits May 17, 2026 20:57
Adds the full implementation:

- scripts/awsdd/: shared Python package (collect_rss, collect_github,
  normalize, score, report) reading per-track config/sources.yaml.
  Scoring is freshness*2 + keyword*source_weight + severity, so generic
  fresh items only get the freshness baseline and topic-specific
  keyword hits get amplified by source trust.
- tracks/{iam,security,whats-new,releases}/: each self-contained with
  Makefile, config/sources.yaml, and README. Track Makefile derives the
  track name from CURDIR so the body is identical everywhere.
- templates/new-track/ + scripts/new-track.sh / new-deep-dive.sh for
  scaffolding additional tracks and deep-dive notes.
- web/: Astro 6 + Tailwind v4 (postcss) + recharts island. Static
  output, reads tracks/*/data/scored.json and tracks/*/reports/**/*.md
  at build time. Routes: /, /[track]/, /[track]/[mode]/[slug]/.
- .github/workflows/: daily-update (06:00 UTC, matrix.track),
  weekly-digest (Mon 08:00 UTC), deploy-pages (push to main),
  pr-checks. .github/dependabot.yml covers pip / npm / github-actions.
- MIT license.

Output of `make update` for all four tracks plus `make -C tracks/iam
weekly`, captured locally on 2026-05-17 so the Pages site has content
before the first scheduled CI run lands.

  iam:        261 RSS +  60 GH = 293 scored
  security:  1907 RSS +   0 GH = 1907 scored
  whats-new:  170 RSS +   0 GH = 170 scored
  releases:    0 RSS + 135 GH = 135 scored

These files are regenerated daily by .github/workflows/daily-update.yml
and pruned by scripts/prune.sh (raw kept 30d, daily reports kept 60d).
coderabbitai Bot commented May 17, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 862aba7d-4316-42ef-822c-b8338fcfbc2e

📥 Commits

Reviewing files that changed from the base of the PR and between 439ffe4 and 226ea9c.

📒 Files selected for processing (27)
  • .github/workflows/daily-update.yml
  • .github/workflows/deploy-pages.yml
  • .github/workflows/weekly-digest.yml
  • Makefile
  • scripts/awsdd/_dates.py
  • scripts/awsdd/collect_github.py
  • scripts/awsdd/collect_rss.py
  • scripts/awsdd/config.py
  • scripts/awsdd/normalize.py
  • scripts/awsdd/report.py
  • scripts/awsdd/score.py
  • scripts/new-deep-dive.sh
  • scripts/new-track.sh
  • tests/test_collect_github.py
  • tests/test_collect_rss.py
  • tests/test_normalize.py
  • tests/test_report.py
  • tests/test_score.py
  • tracks/iam/data/scored.json
  • tracks/iam/reports/daily/2026-05-17.md
  • tracks/iam/reports/weekly/2026-W20.md
  • tracks/releases/data/scored.json
  • tracks/releases/reports/daily/2026-05-17.md
  • tracks/security/data/scored.json
  • tracks/whats-new/data/scored.json
  • tracks/whats-new/reports/daily/2026-05-17.md
  • web/src/lib/data.ts
Walkthrough

This PR introduces a complete AWS content aggregation and reporting platform. It implements a Python backend pipeline (RSS/GitHub collection, normalization, scoring) with four pre-configured tracks, multi-step GitHub Actions automation, and a static Astro web frontend. The system collects identity/auth, security, cloud news, and release information from configured sources, ranks by custom scoring logic, and publishes daily/weekly digests alongside an interactive dashboard.

Changes

Complete AWS Deepdive Platform

| Layer | File(s) | Summary |
|---|---|---|
| Project Foundation | `README.md`, `LICENSE`, `.gitignore`, `pyproject.toml`, `requirements*.txt`, `.pre-commit-config.yaml`, `.markdownlint-cli2.jsonc` | Project documentation, MIT License, configuration for linting/testing/dependencies, and git/editor ignore patterns. |
| Data Schema and Utilities | `scripts/awsdd/__init__.py`, `scripts/awsdd/schema.py`, `scripts/awsdd/_dates.py`, `scripts/awsdd/config.py` | Item dataclass with source/content/scoring fields; SourceKind type; UTC-aware datetime parsing with epoch fallback; repository path and YAML config loading helpers. |
| Data Collection Pipeline | `scripts/awsdd/collect_rss.py`, `scripts/awsdd/collect_github.py` | RSS feed ingestion with feedparser, HTML-to-text conversion (entity unescaping, script stripping), severity inference; GitHub API pagination with release filtering and link header following. |
| Data Processing Pipeline | `scripts/awsdd/normalize.py`, `scripts/awsdd/score.py`, `scripts/awsdd/report.py` | Normalization deduplicates by ID with retention pruning; scoring computes freshness decay + keyword matching + source weighting + severity; report generation filters to windows, ranks, and formats Markdown. |
| Track Scaffolding and Orchestration | `Makefile`, `scripts/new-track.sh`, `scripts/new-deep-dive.sh`, `scripts/prune.sh`, `templates/deep-dive.md`, `templates/lab-report.md`, `templates/new-track/` | Root Makefile orchestrating multi-track workflows; track template with Makefile and config stub; Bash scripts for creating new tracks/deep-dives with placeholder substitution; artifact pruning script. |
| Four Concrete Tracks | `tracks/iam/`, `tracks/security/`, `tracks/whats-new/`, `tracks/releases/` | IAM, security, news, and release tracks with per-track Makefiles, README docs, YAML source configs (keywords, weights, RSS feeds, GitHub repos), sample raw/normalized/scored data, and generated daily/weekly reports. |
| GitHub Automation Workflows | `.github/dependabot.yml`, `.github/workflows/audit.yml`, `.github/workflows/ci.yml`, `.github/workflows/codeql.yml`, `.github/workflows/daily-update.yml`, `.github/workflows/deploy-pages.yml`, `.github/workflows/weekly-digest.yml` | Dependabot for dependencies; audit for vulnerability/license scanning; CI for linting/testing/builds; CodeQL for security analysis; daily/weekly track automation with retry logic; GitHub Pages deployment. |
| Testing Infrastructure | `tests/conftest.py`, `tests/fixtures/`, `tests/test_collect_*.py`, `tests/test_normalize.py`, `tests/test_score.py`, `tests/test_report.py`, `tests/test_pipeline.py` | pytest fixtures for isolated filesystem testing; RSS/GitHub release fixtures; unit tests for collection, normalization, scoring, reporting; end-to-end pipeline smoke test. |
| Web Frontend Configuration | `web/astro.config.mjs`, `web/package.json`, `web/postcss.config.mjs`, `web/tsconfig.json`, `web/.gitignore` | Astro static site builder with React integration and Shiki highlighting; npm dependencies (Astro, React, Tailwind, recharts, marked, DOMPurify); PostCSS with Tailwind; TypeScript strict mode. |
| Web Frontend Styling and Layout | `web/src/styles/global.css`, `web/src/layouts/Base.astro` | Dark theme global CSS with Tailwind directives, typography, and Markdown prose customization; Base layout providing header navigation, slot content area, and footer attribution. |
| Web Frontend Components | `web/src/components/ItemRow.astro`, `web/src/components/TrackCard.astro`, `web/src/components/TrendChart.tsx` | Reusable ItemRow component for item lists; TrackCard for track summaries with latest report; TrendChart React component for weekly volume area chart. |
| Web Frontend Data and Utilities | `web/src/lib/data.ts`, `web/src/lib/markdown.ts` | TypeScript loaders for scored items, track filtering, markdown report reading, and weekly volume aggregation; Markdown-to-HTML conversion with GFM and sanitization. |
| Web Frontend Pages | `web/src/pages/index.astro`, `web/src/pages/[track]/index.astro`, `web/src/pages/[track]/[mode]/[slug].astro` | Homepage with top-12 items and trend chart; per-track pages with reports and top items; dynamic report detail pages with breadcrumb navigation. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

This PR introduces a substantial new system with multiple interconnected layers: a Python data pipeline (collection, normalization, scoring), Bash/Make automation, four fully configured tracks with real data, six GitHub Actions workflows, a pytest suite, and a complete Astro web frontend. Many individual files are straightforward configuration or data files, but the integration points (script entry points, data contracts, API pagination, scoring logic, route generation) demand careful review. The changes span Python, Bash, YAML, JavaScript, TypeScript, and Markdown, and following the end-to-end flow from collection through rendering requires understanding the entire pipeline. Overall complexity is medium to high because of the logic density in the collectors and scorers and the breadth of moving parts.

🐰 A deepdive into data flows so grand,
RSS feeds and GitHub at hand,
Scored and sorted with care,
Reports that are rare,
A dashboard to learn AWS's land! 🌟


gemini-code-assist[bot]

This comment was marked as resolved.

kanywst added 2 commits May 17, 2026 21:03
What changed:

- Replace pr-checks.yml with ci.yml — 6 parallel jobs: lint-python
  (ruff check + format), lint-web (astro check), lint-meta (markdownlint
  + actionlint), test-python (pytest + coverage XML artefact),
  build-python (compileall), build-web (astro build). Runs on PR and
  on push to main (paths-ignore for data-only commits).
- Add codeql.yml for python + javascript-typescript with the
  security-and-quality query suite. Weekly cron + on PR/push to main.
- Add audit.yml: pip-audit, npm audit --audit-level=high, license
  reports for both, and a copyleft gate (fails on GPL / AGPL in the
  Python tree, allows LGPL). Weekly cron + on deps-file PRs.
- Add pyproject.toml with ruff and pytest config, pinning py312 +
  enabling rules E/W/F/I/B/UP/SIM. Coverage source = awsdd.
- Add requirements-dev.txt (pytest, pytest-cov, ruff, pip-audit,
  pip-licenses).
- Add .markdownlint-cli2.jsonc mirroring kt's global config so CI and
  local agree on the same disabled rules.
- Add .pre-commit-config.yaml (ruff, markdownlint, actionlint, basic
  hygiene hooks) for opt-in local enforcement.
- Add pytest suite at tests/ (24 tests, 71% line coverage): score,
  normalize, report, collect_rss, collect_github, plus a full
  normalize → score → report pipeline smoke test against fixtures
  (no network).
- Refactor scripts/awsdd/collect_rss.py and collect_github.py to
  expose pure entry_to_item / release_to_item conversion functions so
  the collectors can be unit-tested without hitting RSS or the GitHub
  API. The collect() entry points are unchanged.
- Apply ruff format + auto-fixes across the awsdd package
  (datetime.UTC alias, import ordering, no behavior change).
- Makefile: add dev-install / test / lint / format / audit targets.
- README: add a CI table and a Local commands block listing the new
  make targets.

PR gating relies on branch protection (Settings → Branches) requiring
ci.yml's job statuses to pass before merge into main. That's a repo
config step, not in this commit.
@github-advanced-security

You are seeing this message because GitHub Code Scanning has recently been set up for this repository, or this pull request contains the workflow file for the Code Scanning tool.

What Enabling Code Scanning Means:

  • The 'Security' tab will display more code scanning analysis results (e.g., for the default branch).
  • Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results.
  • You will be able to see the analysis results for the pull request's branch on this overview once the scans have completed and the checks have passed.

For more information about GitHub Code Scanning, check out the documentation.

- collect_github._get now guards with isinstance(res, list) so an
  unexpected dict response (rate-limit envelope, 404, etc.) does not
  poison downstream iteration with AttributeError. [HIGH]
- report._parse_iso and score._parse_iso drop the manual
  `.replace("Z", "+00:00")`; datetime.fromisoformat handles the Z
  suffix natively from Python 3.11 onward. Also narrow the bare
  `except` to (ValueError, TypeError). [MEDIUM]
- report.render writes links as `[title](<url>)` so AWS URLs that
  contain parentheses (common in whats-new and release-notes paths)
  do not break vanilla Markdown link parsing. [MEDIUM]
- new-deep-dive.sh and new-track.sh switch sed delimiter from `/` to
  `|` so a track / topic that contains a slash does not abort sed.
  [MEDIUM]
- Regenerate today's seed reports with the new angle-bracket link
  format so the committed sample matches the live formatter.
@kanywst

This comment was marked as resolved.

gemini-code-assist[bot]

This comment was marked as resolved.

Date-handling robustness:
- collect_rss._iso, collect_github.release_to_item, report._parse_iso,
  score._parse_iso: fall back to the Unix epoch (1970-01-01) instead of
  datetime.now(UTC) when a date is missing or corrupted. The old
  behaviour gave malformed items a maximum freshness score and pushed
  them to the top of every report; the new behaviour leaves them at
  ~0 freshness so they sink. [MEDIUM x4]

Network robustness:
- collect_rss.collect now fetches via urlopen with a 30s timeout and
  passes the body to feedparser.parse, instead of feedparser.parse(url)
  which has no built-in timeout. A single stuck origin no longer hangs
  the whole pipeline. [MEDIUM]

XSS hardening:
- web/src/lib/markdown.ts pipes marked's output through
  isomorphic-dompurify before returning. Report content is rendered via
  `set:html`, and titles / summaries originate in untrusted RSS and
  GitHub feeds, so unsanitized HTML could land in the page (verified by
  smoke: a fixture with <script>, onerror, and javascript: URLs is
  fully stripped). [SECURITY-MEDIUM]

Tests:
- tests/test_score.py: regression test that a garbage published_at
  scores near-zero freshness. Locks in the epoch fallback.
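
The epoch fallback described under "Date-handling robustness" could look roughly like the sketch below. It is an illustration rather than the repository's code: the function name `parse_iso` and the `EPOCH` constant mirror names that appear later in this PR, but the body is an assumption, as is the choice to treat offset-less timestamps as UTC.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_iso(value: object) -> datetime:
    """Parse an ISO 8601 timestamp; return EPOCH when it is missing or malformed."""
    if not isinstance(value, str):
        return EPOCH
    try:
        parsed = datetime.fromisoformat(value)
    except ValueError:
        return EPOCH
    # Assumption: treat offset-less timestamps as UTC so comparisons stay tz-aware.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


now = datetime(2026, 5, 17, tzinfo=timezone.utc)
good = parse_iso("2026-05-16T00:00:00+00:00")
bad = parse_iso("not-a-date")

age_good = (now - good).days  # a well-formed item is one day old
age_bad = (now - bad).days    # a malformed item looks decades old, so it sinks
```

Falling back to "now" would have given `bad` an age of zero and therefore maximum freshness, which is the behaviour the commit removes.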
@kanywst

This comment was marked as resolved.

gemini-code-assist[bot]

This comment was marked as resolved.

Type precision:
- schema.Item.source_kind is now Literal["rss", "github"] instead of
  bare str, matching the comment and actual usage.

Robustness:
- report.render now escapes both `[` and `]` in titles (previously only
  `]`), so titles like "[CVE-2026-xxx]" cannot collide with markdown
  link parsing.
- normalize.normalize: narrow the two bare `except Exception` blocks
  to `(OSError, json.JSONDecodeError)` so genuine bugs surface instead
  of being swallowed.
- collect_github._get: narrow the fallback `except Exception` to
  `(URLError, TimeoutError, json.JSONDecodeError)` (HTTPError is still
  caught separately above).

Reproducibility:
- requirements.txt and requirements-dev.txt now pin exact versions
  with `==`. Dependabot continues to PR upgrades, but a fresh checkout
  + `pip install -r ...` now resolves to a known-good set.

Config / docs:
- tracks/iam/config/sources.yaml: drop `spiffe/spiffe` (verified via
  GitHub API: repo exists but has zero releases — it's the specs
  tree, not a release surface).
- README.md: remove the duplicate 3-bullet "CI" section that the new
  CI table superseded.

Regenerated today's seed reports under tracks/*/reports/ so they
match the new formatter output (extra `\[` escape would only matter
on titles containing `[`, but regen keeps the committed sample byte-
identical to what CI would produce now).
@kanywst

This comment was marked as resolved.

gemini-code-assist[bot]

This comment was marked as resolved.

Accepted (5 / 9):

- .pre-commit-config.yaml: bump astral-sh/ruff-pre-commit rev v0.8.4
  → v0.15.13. The 0.8.4 tag really does not exist (gemini was right);
  v0.15.13 is current and matches the pinned `ruff` runtime version.
  [HIGH]
- .github/dependabot.yml: Asia/Tokyo → Etc/UTC for all three update
  blocks. Neutral default for a public repo. [MEDIUM]
- README.md: rewrite the scoring formula description to match the
  actual `(freshness × 2) + (keyword × source-weight) + severity`
  implementation (it was incorrectly listed as fully multiplicative).
  Also update the Layout block: `pr-checks.yml` was replaced by
  `ci.yml` plus `codeql.yml` and `audit.yml` two commits ago but the
  doc still pointed at the old name. [MEDIUM]
- scripts/awsdd/_dates.py (new): house the shared `parse_iso` and an
  EPOCH constant. score.py and report.py now import from it instead
  of carrying duplicate `_parse_iso` definitions. [MEDIUM - DRY]
- web/src/lib/data.ts: hoist the `86_400_000` magic number into a
  `MS_IN_DAY` constant. [MEDIUM]

Rejected (4 / 9), reasoning:

- requirements{,-dev}.txt "version does not exist" [CRITICAL ×2]:
  verified via `curl https://pypi.org/pypi/<pkg>/<ver>/json` — all
  seven pinned versions return HTTP 200 and `pip install` succeeds
  both locally and in every prior CI run. Gemini's knowledge cutoff
  predates these releases; the pins stay.
- pyproject.toml `SIM108` (ternary nag) [MEDIUM]: deliberately
  disabled because `if/else` blocks often read clearer than nested
  ternaries in this code base. No change.
- collect_rss `BeautifulSoup` swap [MEDIUM]: regex strip + DOMPurify
  sanitization on the rendering side already cover the layered
  defense; the dep cost outweighs the marginal robustness gain for
  the well-formed AWS feeds we consume. No change.

Regenerated seed reports so the committed sample matches the (byte-
identical) output the formatter now produces.
@kanywst

This comment was marked as resolved.

gemini-code-assist[bot]

This comment was marked as resolved.

Both findings legitimate, both applied:

- collect_rss._fetch caps the response at 10 MiB (MAX_FEED_BYTES) via
  r.read(MAX_FEED_BYTES). Without a bound, a hostile or runaway origin
  could exhaust the worker's memory. AWS feeds are sub-megabyte; the
  cap is just a safety net. [MEDIUM]
- collect_rss._strip_html now unescapes BEFORE stripping tags. The
  old order let entity-encoded markup (`&lt;script&gt;...`) bypass
  the strip pass and resurface as raw HTML after html.unescape ran on
  the regex output. New regression test
  `test_summary_strips_entity_encoded_tags` locks in the corrected
  order. DOMPurify on the web rendering side is still the last line of
  defense, but cleaning at the source is the right layer. [MEDIUM]
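
To make the ordering bug in `_strip_html` concrete, here is a small sketch that contrasts the two orders. The regexes and function names are hypothetical stand-ins for the regex-based strip this commit describes, not the repository's implementation.

```python
import html
import re

# Illustrative patterns: drop whole <script>/<style> blocks, then any remaining tags.
_SCRIPT_RE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
_TAG_RE = re.compile(r"<[^>]+>")


def strip_then_unescape(raw: str) -> str:
    """Old order: encoded markup survives the strip and is decoded afterwards."""
    return html.unescape(_TAG_RE.sub("", _SCRIPT_RE.sub("", raw)))


def unescape_then_strip(raw: str) -> str:
    """Corrected order: decode entities first so the strip sees real tags."""
    text = html.unescape(raw)
    text = _SCRIPT_RE.sub("", text)
    return _TAG_RE.sub("", text).strip()


payload = "&lt;script&gt;alert(1)&lt;/script&gt;Patch released"
leaky = strip_then_unescape(payload)   # raw <script> markup resurfaces
clean = unescape_then_strip(payload)   # only the visible text remains
```

In the old order the strip pass sees no `<` characters at all, so it has nothing to remove before `html.unescape` turns the entities back into markup.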
@kanywst

This comment was marked as resolved.

gemini-code-assist[bot]

This comment was marked as resolved.

All 6 findings legitimate, all applied (with regression tests):

- collect_github._get_page caps r.read at 10 MiB (MAX_RESPONSE_BYTES),
  same DoS guard as collect_rss. [MEDIUM]
- collect_github now follows Link.rel="next" up to MAX_PAGES=5 pages
  via _get_all + _next_url, and default per_page bumped 20→50. A
  busy repo (aws-cli, aws-cdk) could previously lose releases beyond
  page 1; with daily CI the new bound covers all realistic catch-up
  windows. Three tests pin the Link-header parser. [MEDIUM]
- collect_rss._strip_html now uses stdlib html.parser instead of a
  regex. A real parser handles `<`/`>` inside text content (the old
  regex chomped through "1 < 2 and 4 > 3"), and a _TextExtractor that
  tracks <script>/<style> depth drops their contents entirely. Two
  regression tests cover both cases. [MEDIUM]
- collect_rss._severity now matches with
  `\b(critical|high|medium|low)\b`. The old substring check turned
  "Slow performance" into a low-severity item and "Highlighted
  feature" into a high one. [MEDIUM]
- score._keyword_hits uses `\b{re.escape(keyword)}\b`. The old
  substring check let `iam` hit "diagram" and `sts` hit "tests". Two
  regression tests lock in that "updated diagram and test results"
  scores zero and "iam supports sts session tags" scores correctly.
  [MEDIUM]
- web/src/lib/data.ts keeps the process.cwd()-based ROOT (we
  validated that import.meta.url breaks under Astro's bundler in a
  prior attempt), but now asserts at module load that ROOT/tracks
  exists and throws a clear error if invoked from the wrong cwd, so
  the previous silent "no items yet" failure mode becomes loud. [MEDIUM]

Side effects of the word-bounded scoring change: all four tracks were
re-scored against the existing normalized.json snapshots and seed
reports regenerated. Top items shift slightly (substring-only hits no
longer count) but the change is purely a precision improvement.
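
The difference between substring and word-bounded keyword matching can be shown with a short sketch. The helper names and the first sample string are illustrative and not taken from the repository's tests; case-insensitive matching is also an assumption.

```python
import re


def substring_hits(text: str, keywords: list[str]) -> int:
    """Old behaviour: a keyword counts if it appears anywhere in the text."""
    lowered = text.lower()
    return sum(1 for keyword in keywords if keyword.lower() in lowered)


def keyword_hits(text: str, keywords: list[str]) -> int:
    """New behaviour: a keyword counts only when it appears as a whole word."""
    hits = 0
    for keyword in keywords:
        pattern = rf"\b{re.escape(keyword)}\b"
        if re.search(pattern, text, re.IGNORECASE):
            hits += 1
    return hits


keywords = ["iam", "sts"]
noisy = "integration tests for the diamond release"

print(substring_hits(noisy, keywords))  # 2: "diamond" and "tests" match by accident
print(keyword_hits(noisy, keywords))    # 0: neither keyword stands alone
print(keyword_hits("iam supports sts session tags", keywords))  # 2
```

`re.escape` matters here because configured keywords may contain characters such as `.` or `+` that would otherwise be read as regex syntax.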
@kanywst

This comment was marked as resolved.

gemini-code-assist[bot]

This comment was marked as resolved.

All 4 findings legitimate, all applied:

- collect_github.collect now uses `entry.get("repo")` + skip-with-log
  instead of `entry["repo"]`. A malformed sources.yaml entry no longer
  crashes the whole track. New test exercises a mixed-validity config.
  [MEDIUM]
- collect_rss.collect: same defensive `.get()` treatment for the feed
  `id` and `url` keys, with the same skip-and-log pattern. New test
  verifies two malformed entries are skipped and the well-formed one
  still fires. [MEDIUM]
- collect_rss._strip_html joins extracted text with " " instead of ""
  so block-element neighbours like `<div>A</div><div>B</div>` do not
  merge into `AB`. The trailing \s+ collapse keeps the output tidy.
  Regression test pins it. [MEDIUM]
- normalize.normalize gains a 180-day retention pass: items with
  `published_at` older than the cutoff are dropped from
  normalized.json. This is the file that's committed and loaded whole
  by the web build, so unbounded growth was a real concern as the
  project ages. Test asserts old + epoch-fallback items are pruned,
  recent ones kept. [MEDIUM]

Side effect: regenerated seed data + reports. Retention bites hard
this round because the initial collection swept years of Security
Bulletins history:

  iam:        293 → 178 (pruned 115)
  security:  1907 → 56  (pruned 1851)
  releases:   135 → 117 (pruned 18)
  whats-new:  170 → 170 (already inside window)

That's a ~5 MB reduction in the committed JSON tree.
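
A retention pass of this kind might be sketched as follows. The helper names and the item shape (a dict with a `published_at` string) are assumptions for illustration; only the 180-day window and the fact that epoch-fallback items get pruned come from the commit message.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 180
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def _published(item: dict) -> datetime:
    """Parse published_at, falling back to the epoch for missing or bad values."""
    try:
        parsed = datetime.fromisoformat(item.get("published_at") or "")
    except (TypeError, ValueError):
        return EPOCH
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def prune_old(items: list[dict], now: datetime) -> list[dict]:
    """Keep only items published within the retention window ending at now."""
    cutoff = now - timedelta(days=RETENTION_DAYS)
    return [item for item in items if _published(item) >= cutoff]


now = datetime(2026, 5, 17, tzinfo=timezone.utc)
items = [
    {"id": "recent", "published_at": "2026-05-01T00:00:00+00:00"},
    {"id": "old", "published_at": "2025-01-01T00:00:00+00:00"},
    {"id": "garbage", "published_at": "garbage"},
]
kept = [item["id"] for item in prune_old(items, now)]
```

Passing `now` in explicitly keeps the function deterministic, which is what makes a test like the one described above straightforward to write.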
@kanywst

This comment was marked as resolved.

gemini-code-assist[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

12 of 13 findings applied; LICENSE 0-draft is the real GitHub org name
(not a placeholder) so left as-is.

Defensive parsing:
- config.load_sources: type-guards `yaml.safe_load` output. A non-dict
  YAML root used to slip through `or {}` and crash downstream .get()
  calls. [coderabbit, outside-diff]
- _dates.parse_iso: when the input has no timezone offset, fall back
  to UTC instead of returning a naive datetime that would TypeError
  against the tz-aware `now` in score / report. [coderabbit, major]
- score.score_item: wrap the `float()` cast on source_weights in
  try/except — a YAML typo no longer aborts the whole pipeline.
  [coderabbit, major]
- collect_rss.entry_to_item: `entry.get("tags") or []` (was `, []`)
  so an explicit `None` in feedparser output doesn't TypeError the
  comprehension. [gemini]

Defensive reporting:
- report.render: in the empty-window fallback, explicitly sort items
  by score before slicing. scored.json is usually pre-sorted but the
  defense costs nothing. [coderabbit, major]
- web/src/lib/data.ts loadScored: throw on JSON parse failure
  instead of returning [] and shipping an apparently-successful
  empty site. [coderabbit, major]

Automation hardening:
- Makefile: TRACKS now auto-discovered via `$(wildcard tracks/*/)`.
  `make new-track NAME=foo` is immediately picked up by install /
  update / weekly instead of needing a manual edit. [coderabbit, major]
- new-track.sh, new-deep-dive.sh: escape sed metacharacters (`&`,
  `\`, `|`) in the substitution values so a name containing those
  doesn't trigger sed's whole-match replacement. [coderabbit, minor]

Workflow safety:
- daily-update.yml, weekly-digest.yml: pin checkout `ref: main` so a
  workflow_dispatch from a feature branch can't commit non-main data
  to main. [coderabbit, major]
- deploy-pages.yml: pin checkout to main + add an `if: github.ref ==
  'refs/heads/main'` guard on the build job to make manual dispatches
  from other branches a no-op. [coderabbit, major]

Tests:
- test_score.test_malformed_source_weight_falls_back_to_one
- test_normalize.test_load_sources_returns_dict_for_non_mapping_root
- test_normalize.test_parse_iso_assumes_utc_for_naive_input

Side effect: seed regenerated (no diff in content beyond what the
re-score produces against the same normalized snapshot).
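
Two of the defensive-parsing fixes above can be sketched in a few lines. The helper names and the feed identifiers are hypothetical; the 1.0 default is inferred from the test name `test_malformed_source_weight_falls_back_to_one`.

```python
from typing import Any


def as_mapping(loaded: Any) -> dict:
    """Return the parsed YAML root only when it is a mapping, else an empty dict."""
    return loaded if isinstance(loaded, dict) else {}


def safe_weight(raw: Any, default: float = 1.0) -> float:
    """Cast a configured source weight to float, falling back to default on bad input."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return default


# A YAML file whose root is a list is truthy, so `loaded or {}` would pass it through.
config = as_mapping(["not", "a", "mapping"])

# Hypothetical weights as they might arrive from a hand-edited sources.yaml.
weights = {"aws-security-blog": "1.5", "typo-feed": "high", "empty-feed": None}
resolved = {name: safe_weight(value) for name, value in weights.items()}
```

The `isinstance` check is the point of the first helper: `or {}` only replaces falsy roots, so a non-empty list or string would still reach the downstream `.get()` calls.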
@kanywst

This comment was marked as resolved.

gemini-code-assist[bot]

This comment was marked as resolved.

All 3 findings legitimate, all applied:

- collect_rss._fetch now returns raw bytes instead of decoded str.
  feedparser does its own encoding detection (XML prolog charset, BOM,
  Content-Type) and gzip handling when fed bytes; pre-decoding to
  UTF-8 with errors='replace' silently defeats that and would also
  mojibake non-UTF-8 feeds. collect() already passes the result
  directly to feedparser.parse, so no caller change. [MEDIUM]

- collect_github._get_page reads MAX_RESPONSE_BYTES + 1 and refuses
  any response that hits the cap, instead of decoding a truncated
  body with errors='replace' that could feed corrupt JSON into
  json.loads. The strict utf-8 decode now surfaces a real encoding
  bug as UnicodeDecodeError (also caught) instead of being papered
  over. [MEDIUM]

- report.render escapes `>` in URLs to %3E before wrapping in the
  Markdown `<...>` angle pair. A query string like `?q=a>b` would
  otherwise close the URL pair early and break link parsing. New
  test_gt_in_url_is_escaped pins it. [MEDIUM]
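
The "read one byte past the cap, then refuse" pattern from the `_get_page` change can be illustrated with a generic stream. `read_capped` and `ResponseTooLarge` are hypothetical names, and `io.BytesIO` stands in for the HTTP response object so the sketch runs without network access.

```python
import io
from typing import BinaryIO

MAX_RESPONSE_BYTES = 10 * 1024 * 1024  # 10 MiB, as in the commit message


class ResponseTooLarge(Exception):
    """Hypothetical error type for bodies that exceed the cap."""


def read_capped(stream: BinaryIO, limit: int = MAX_RESPONSE_BYTES) -> bytes:
    """Read at most limit bytes; refuse the body if more data was available."""
    body = stream.read(limit + 1)
    if len(body) > limit:
        raise ResponseTooLarge(f"response exceeds {limit} bytes")
    return body


small = read_capped(io.BytesIO(b"{}"), limit=8)

try:
    read_capped(io.BytesIO(b"x" * 9), limit=8)
    refused = False
except ResponseTooLarge:
    refused = True
```

Reading `limit + 1` bytes is what distinguishes a body of exactly `limit` bytes from a truncated one; a plain `read(limit)` cannot tell the two apart.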
@kanywst

This comment was marked as resolved.

gemini-code-assist[bot]

This comment was marked as resolved.

All 8 findings legitimate, all applied. Three logical changes:

1. Explicit UTF-8 on every read_text / write_text in scripts/awsdd/
   (config.py, normalize.py, score.py, report.py, collect_rss.py,
   collect_github.py). Implicit locale-dependent encoding is a
   portability hazard — non-UTF-8 default locales (some Windows
   environments, exotic CI runners) would corrupt non-ASCII content
   silently when we already use ensure_ascii=False. [MEDIUM ×6]

2. collect_rss._fetch now refuses oversized responses the same way
   collect_github does — read MAX_FEED_BYTES + 1, error if the cap
   was hit. The old code returned a truncated byte string and let
   feedparser silently bozo-error on the malformed tail. [MEDIUM]

3. report.render collapses `\r` and `\n` in titles to spaces before
   the bracket-escape pass. A `\n# heading` in a feed title used to
   spawn a stray Markdown heading inside the list. Regression test
   test_newline_in_title_is_collapsed pins it. [MEDIUM]

Seed reports regenerated; web rebuilds 10 pages.
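
Taken together, the title and URL escapes applied to report.render across this PR amount to something like the sketch below. `render_link` is a hypothetical helper, not the repository's function, and the sample title and URL are made up to trigger every escape at once.

```python
import re


def render_link(title: str, url: str) -> str:
    """Build a Markdown link that survives awkward titles and URLs."""
    # Collapse CR/LF runs so a title cannot start a new Markdown block.
    safe_title = re.sub(r"[\r\n]+", " ", title).strip()
    # Escape both brackets so the title cannot close the link text early.
    safe_title = safe_title.replace("[", r"\[").replace("]", r"\]")
    # Percent-encode ">" so it cannot terminate the <...> destination.
    safe_url = url.replace(">", "%3E")
    return f"[{safe_title}](<{safe_url}>)"


line = render_link(
    "[CVE-2026-xxx] fix\n# heading",
    "https://example.com/notes_(v2)?q=a>b",
)
print(line)
```

The angle-bracket destination is what lets parentheses in the URL pass through untouched, which is why only `>` needs encoding.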
@kanywst

This comment was marked as resolved.

gemini-code-assist[bot]

This comment was marked as resolved.

All 3 findings legitimate, all applied:

- collect_github.collect: add `isinstance(entry, dict)` guard before
  `entry.get("repo")`. A null or scalar YAML entry (`- foo`) used to
  AttributeError out of the whole track; now it's logged and skipped.
  Existing malformed-entry test extended with null and string cases.
  [MEDIUM]

- collect_rss.collect: same isinstance guard for `feed`. Existing
  malformed-feed test extended with a `null` entry. [MEDIUM]

- score._compiled_patterns: wrap the per-keyword regex compile in
  `functools.lru_cache` keyed on the keyword tuple. score_item is
  called once per item (hundreds per track per run); without caching
  we'd recompile the same primary/secondary patterns every call.
  Switched score_item's keyword extraction to tuples so they're
  hashable for the cache. No behaviour change — same word-bounded
  semantics, just compiled once per distinct keyword set. [MEDIUM]
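
The caching change can be sketched as below. The function names and the case-insensitive flag are assumptions; what the commit message does specify is `functools.lru_cache` keyed on a tuple of keywords.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def compiled_patterns(keywords: tuple[str, ...]) -> tuple[re.Pattern[str], ...]:
    """Compile one word-bounded pattern per keyword, once per distinct tuple."""
    return tuple(
        re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE) for keyword in keywords
    )


def keyword_hits(text: str, keywords: tuple[str, ...]) -> int:
    """Count how many keywords occur as whole words in text."""
    return sum(1 for pattern in compiled_patterns(keywords) if pattern.search(text))


primary = ("iam", "sts")  # a tuple, because lru_cache needs hashable arguments
titles = ("iam supports sts session tags", "new sts endpoint", "unrelated post")
counts = [keyword_hits(title, primary) for title in titles]

info = compiled_patterns.cache_info()  # first call compiles, later calls reuse
```

A list of keywords would raise `TypeError` when passed to the cached function, which is why the commit switches the keyword extraction to tuples.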
kanywst (Member, Author) commented May 17, 2026

/gemini review

gemini-code-assist[bot]

This comment was marked as resolved.

All 5 findings legitimate, all applied:

Defensive exception handling:
- collect_github._get_page: add ValueError to except. urlopen raises
  it for malformed URLs ("unknown url type" etc.), e.g. a sources.yaml
  entry with a missing scheme. [MEDIUM]
- collect_rss._fetch: same ValueError addition. [MEDIUM]
- score.score: wrap json.loads(normalized.json) in try/except. A
  corrupt or truncated file (disk-full mid-write etc.) used to crash
  the whole scoring step instead of leaving the previous scored.json
  in place. [MEDIUM]
- report.render: same try/except around json.loads(scored.json).
  [MEDIUM]

Stale guidance:
- new-track.sh: the "next steps" message still told users to edit
  TRACKS in the root Makefile, but Makefile now auto-discovers tracks
  via `wildcard tracks/*/`. Updated to point at the two places that
  do still need a manual update for a new track:
    .github/workflows/{daily-update,weekly-digest}.yml matrix.track
    web/src/lib/data.ts TRACKS const
  with a footnote that the root Makefile is auto. [MEDIUM]
kanywst (Member, Author) commented May 17, 2026

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request initializes the aws-deepdive project, an automated system for collecting, scoring, and reporting AWS identity and security updates. The implementation features a Python-based data pipeline for RSS and GitHub release ingestion, a customizable scoring engine, and an Astro-powered web frontend for displaying daily and weekly digests. Additionally, the PR includes comprehensive test coverage, GitHub Actions workflows for automated updates, and scaffolding scripts for extending the project with new tracks. I have no feedback to provide.

@kanywst kanywst merged commit 7776e88 into main May 17, 2026
@kanywst kanywst deleted the feat/initial-scaffold branch May 17, 2026 13:57