feat: initial implementation — pipeline, four tracks, web viewer, CI by kanywst · Pull Request #1 · 0-draft/aws-deepdive

kanywst · 2026-05-17T11:57:39Z

Summary

Pipeline (scripts/awsdd/): collect (RSS + GitHub Releases) → normalize → score → report as a small shared Python package. Scoring is (freshness × 2) + (keyword × source_weight) + severity with word-bounded keyword matching, so generic fresh items only get the freshness baseline while topic-relevant keyword hits get amplified by source trust.
Tracks: iam, security, whats-new, releases. Each is self-contained under tracks/<name>/ with config/sources.yaml, an identical Makefile (derives track name from CURDIR), and a track-local README. New tracks are scaffolded with make new-track NAME=<name> and are picked up automatically by the root Makefile.
Retention: normalize drops items with published_at older than 180 days from normalized.json; prune.sh keeps 30 days of raw data and 60 days of daily reports. Weekly reports are kept indefinitely.
Web (web/): Astro 6 + Tailwind v4 (postcss) + recharts island. Static output. Routes: /, /<track>/, /<track>/<mode>/<slug>/. Reads tracks/*/data/scored.json and tracks/*/reports/**/*.md at build time. Markdown is sanitized via isomorphic-dompurify.
CI: ci.yml (lint / type-check / pytest / build), codeql.yml, audit.yml (pip-audit + npm audit + license report), daily-update.yml (06:00 UTC), weekly-digest.yml (Mon 08:00 UTC), deploy-pages.yml (push to main). Dependabot covers pip, npm, github-actions.
MIT license.
Seed data committed so the first Pages deploy has content before the first scheduled run lands.
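
As a rough illustration of the scoring formula in the summary above, the sketch below combines the three terms. The exponential freshness decay, the 7-day half-life, and the function names are assumptions made for this example; the summary only states the overall shape `(freshness × 2) + (keyword × source_weight) + severity` and that keyword matching is word-bounded.

```python
import re
from datetime import datetime, timezone


def freshness(published_at, now, half_life_days=7.0):
    """Hypothetical decay curve: 1.0 for a brand-new item, halving every half_life_days."""
    age_days = max((now - published_at).total_seconds() / 86400, 0.0)
    return 0.5 ** (age_days / half_life_days)


def keyword_hits(text, keywords):
    """Count keywords that appear as whole words, case-insensitively."""
    return sum(
        1
        for kw in keywords
        if re.search(rf"\b{re.escape(kw)}\b", text, re.IGNORECASE)
    )


def score(text, published_at, now, keywords, source_weight=1.0, severity=0.0):
    """(freshness * 2) + (keyword hits * source_weight) + severity."""
    return (
        freshness(published_at, now) * 2
        + keyword_hits(text, keywords) * source_weight
        + severity
    )


now = datetime(2026, 5, 17, tzinfo=timezone.utc)
generic = score("New console theme", now, now, ("iam", "sts"))
relevant = score("IAM supports STS session tags", now, now, ("iam", "sts"), source_weight=1.5)
```

Under these assumptions a fresh item with no keyword hits stays at the freshness baseline of 2.0, while the same-age item with two keyword hits from a source weighted 1.5 reaches 5.0.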

What I verified locally

All four tracks run make update end-to-end on a clean venv.
make -C tracks/iam weekly produces reports/weekly/2026-W20.md.
npm run build produces 10 static pages including dynamic /<track>/<mode>/<slug>/ routes; data loads, links render, no SSR warnings.
All committed Markdown lints clean against the repo's .markdownlint-cli2.jsonc.
pytest is green (40 tests, 79% coverage on scripts/awsdd/).
ruff check + ruff format --check clean.

What I did NOT do

Enable GitHub Pages on the repo. Settings → Pages → Source must be set to GitHub Actions for deploy-pages.yml to publish.
Configure branch protection. PRs into main should be gated on ci.yml jobs (Settings → Branches → main).

Test plan

Settings → Pages → Source = GitHub Actions, then re-run deploy-pages manually.
Open the deployed home page and confirm: bento layout renders, top items list populated, ISO-week trend chart shows.
Open /iam/, click into the latest daily/<date>, confirm the rendered Markdown matches tracks/iam/reports/daily/<date>.md.
Wait for tomorrow's daily-update run; confirm a chore(<track>): daily update YYYY-MM-DD commit lands on main.
Wait for Monday's weekly-digest run; confirm a chore(<track>): weekly digest YYYY-Www commit lands and a new weekly Markdown appears.

Adds the full implementation:

- scripts/awsdd/: shared Python package (collect_rss, collect_github, normalize, score, report) reading per-track config/sources.yaml. Scoring is `freshness*2 + keyword*source_weight + severity`, so generic fresh items only get the freshness baseline and topic-specific keyword hits get amplified by source trust.
- tracks/{iam,security,whats-new,releases}/: each self-contained with Makefile, config/sources.yaml, and README. Track Makefile derives the track name from CURDIR so the body is identical everywhere.
- templates/new-track/ + scripts/new-track.sh / new-deep-dive.sh for scaffolding additional tracks and deep-dive notes.
- web/: Astro 6 + Tailwind v4 (postcss) + recharts island. Static output, reads tracks/*/data/scored.json and tracks/*/reports/**/*.md at build time. Routes: /, /[track]/, /[track]/[mode]/[slug]/.
- .github/workflows/: daily-update (06:00 UTC, matrix.track), weekly-digest (Mon 08:00 UTC), deploy-pages (push to main), pr-checks. .github/dependabot.yml covers pip / npm / github-actions.
- MIT license.

Output of `make update` for all four tracks plus `make -C tracks/iam weekly`, captured locally on 2026-05-17 so the Pages site has content before the first scheduled CI run lands.

- iam: 261 RSS + 60 GH = 293 scored
- security: 1907 RSS + 0 GH = 1907 scored
- whats-new: 170 RSS + 0 GH = 170 scored
- releases: 0 RSS + 135 GH = 135 scored

These files are regenerated daily by .github/workflows/daily-update.yml and pruned by scripts/prune.sh (raw kept 30d, daily reports kept 60d).

coderabbitai · 2026-05-17T11:57:50Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 862aba7d-4316-42ef-822c-b8338fcfbc2e

📥 Commits

Reviewing files that changed from the base of the PR and between 439ffe4 and 226ea9c.

📒 Files selected for processing (27)

.github/workflows/daily-update.yml
.github/workflows/deploy-pages.yml
.github/workflows/weekly-digest.yml
Makefile
scripts/awsdd/_dates.py
scripts/awsdd/collect_github.py
scripts/awsdd/collect_rss.py
scripts/awsdd/config.py
scripts/awsdd/normalize.py
scripts/awsdd/report.py
scripts/awsdd/score.py
scripts/new-deep-dive.sh
scripts/new-track.sh
tests/test_collect_github.py
tests/test_collect_rss.py
tests/test_normalize.py
tests/test_report.py
tests/test_score.py
tracks/iam/data/scored.json
tracks/iam/reports/daily/2026-05-17.md
tracks/iam/reports/weekly/2026-W20.md
tracks/releases/data/scored.json
tracks/releases/reports/daily/2026-05-17.md
tracks/security/data/scored.json
tracks/whats-new/data/scored.json
tracks/whats-new/reports/daily/2026-05-17.md
web/src/lib/data.ts

📝 Walkthrough

This PR introduces a complete AWS content aggregation and reporting platform. It implements a Python backend pipeline (RSS/GitHub collection, normalization, scoring) with four pre-configured tracks, multi-step GitHub Actions automation, and a static Astro web frontend. The system collects identity/auth, security, cloud news, and release information from configured sources, ranks by custom scoring logic, and publishes daily/weekly digests alongside an interactive dashboard.

Changes

Complete AWS Deepdive Platform

| Layer / File(s) | Summary |
| --- | --- |
| **Project Foundation** `README.md`, `LICENSE`, `.gitignore`, `pyproject.toml`, `requirements*.txt`, `.pre-commit-config.yaml`, `.markdownlint-cli2.jsonc` | Project documentation, MIT License, configuration for linting/testing/dependencies, and git/editor ignore patterns. |
| **Data Schema and Utilities** `scripts/awsdd/__init__.py`, `scripts/awsdd/schema.py`, `scripts/awsdd/_dates.py`, `scripts/awsdd/config.py` | `Item` dataclass with source/content/scoring fields; `SourceKind` type; UTC-aware datetime parsing with epoch fallback; repository path and YAML config loading helpers. |
| **Data Collection Pipeline** `scripts/awsdd/collect_rss.py`, `scripts/awsdd/collect_github.py` | RSS feed ingestion with feedparser, HTML-to-text conversion (entity unescaping, script stripping), severity inference; GitHub API pagination with release filtering and link header following. |
| **Data Processing Pipeline** `scripts/awsdd/normalize.py`, `scripts/awsdd/score.py`, `scripts/awsdd/report.py` | Normalization deduplicates by ID with retention pruning; scoring computes freshness decay + keyword matching + source weighting + severity; report generation filters to windows, ranks, and formats Markdown. |
| **Track Scaffolding and Orchestration** `Makefile`, `scripts/new-track.sh`, `scripts/new-deep-dive.sh`, `scripts/prune.sh`, `templates/deep-dive.md`, `templates/lab-report.md`, `templates/new-track/` | Root Makefile orchestrating multi-track workflows; track template with Makefile and config stub; Bash scripts for creating new tracks/deep-dives with placeholder substitution; artifact pruning script. |
| **Four Concrete Tracks** `tracks/iam/`, `tracks/security/`, `tracks/whats-new/`, `tracks/releases/` | IAM, security, news, and release tracks with per-track Makefiles, README docs, YAML source configs (keywords, weights, RSS feeds, GitHub repos), sample raw/normalized/scored data, and generated daily/weekly reports. |
| **GitHub Automation Workflows** `.github/dependabot.yml`, `.github/workflows/audit.yml`, `.github/workflows/ci.yml`, `.github/workflows/codeql.yml`, `.github/workflows/daily-update.yml`, `.github/workflows/deploy-pages.yml`, `.github/workflows/weekly-digest.yml` | Dependabot for dependencies; audit for vulnerability/license scanning; CI for linting/testing/builds; CodeQL for security analysis; daily/weekly track automation with retry logic; GitHub Pages deployment. |
| **Testing Infrastructure** `tests/conftest.py`, `tests/fixtures/`, `tests/test_collect_*.py`, `tests/test_normalize.py`, `tests/test_score.py`, `tests/test_report.py`, `tests/test_pipeline.py` | pytest fixtures for isolated filesystem testing; RSS/GitHub release fixtures; unit tests for collection, normalization, scoring, reporting; end-to-end pipeline smoke test. |
| **Web Frontend Configuration** `web/astro.config.mjs`, `web/package.json`, `web/postcss.config.mjs`, `web/tsconfig.json`, `web/.gitignore` | Astro static site builder with React integration and Shiki highlighting; npm dependencies (Astro, React, Tailwind, recharts, marked, DOMPurify); PostCSS with Tailwind; TypeScript strict mode. |
| **Web Frontend Styling and Layout** `web/src/styles/global.css`, `web/src/layouts/Base.astro` | Dark theme global CSS with Tailwind directives, typography, and Markdown prose customization; Base layout providing header navigation, slot content area, and footer attribution. |
| **Web Frontend Components** `web/src/components/ItemRow.astro`, `web/src/components/TrackCard.astro`, `web/src/components/TrendChart.tsx` | Reusable ItemRow component for item lists; TrackCard for track summaries with latest report; TrendChart React component for weekly volume area chart. |
| **Web Frontend Data and Utilities** `web/src/lib/data.ts`, `web/src/lib/markdown.ts` | TypeScript loaders for scored items, track filtering, markdown report reading, and weekly volume aggregation; Markdown-to-HTML conversion with GFM and sanitization. |
| **Web Frontend Pages** `web/src/pages/index.astro`, `web/src/pages/[track]/index.astro`, `web/src/pages/[track]/[mode]/[slug].astro` | Homepage with top-12 items and trend chart; per-track pages with reports and top items; dynamic report detail pages with breadcrumb navigation. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

This PR introduces a substantial new system with multiple interconnected layers: a Python data pipeline (collection, normalization, scoring), Bash/Make automation, four fully configured tracks with real data, six GitHub Actions workflows, comprehensive test coverage, and a complete Astro web frontend. While many individual files are straightforward configuration or data files, the integration points (script entry points, data contracts, API pagination, scoring logic, route generation) demand careful review. The changes span Python, Bash, YAML, JavaScript, TypeScript, and Markdown, and following the end-to-end flow from collection through rendering requires understanding the entire pipeline. Overall complexity is medium to high because of the logic density in the collectors and scorers and the breadth of moving parts.

🐰 A deepdive into data flows so grand,
RSS feeds and GitHub at hand,
Scored and sorted with care,
Reports that are rare,
A dashboard to learn AWS's land! 🌟

What changed:

- Replace pr-checks.yml with ci.yml: 6 parallel jobs: lint-python (ruff check + format), lint-web (astro check), lint-meta (markdownlint + actionlint), test-python (pytest + coverage XML artefact), build-python (compileall), build-web (astro build). Runs on PR and on push to main (paths-ignore for data-only commits).
- Add codeql.yml for python + javascript-typescript with the security-and-quality query suite. Weekly cron + on PR/push to main.
- Add audit.yml: pip-audit, npm audit --audit-level=high, license reports for both, and a copyleft gate (fails on GPL / AGPL in the Python tree, allows LGPL). Weekly cron + on deps-file PRs.
- Add pyproject.toml with ruff and pytest config, pinning py312 + enabling rules E/W/F/I/B/UP/SIM. Coverage source = awsdd.
- Add requirements-dev.txt (pytest, pytest-cov, ruff, pip-audit, pip-licenses).
- Add .markdownlint-cli2.jsonc mirroring kt's global config so CI and local agree on the same disabled rules.
- Add .pre-commit-config.yaml (ruff, markdownlint, actionlint, basic hygiene hooks) for opt-in local enforcement.
- Add pytest suite at tests/ (24 tests, 71% line coverage): score, normalize, report, collect_rss, collect_github, plus a full normalize → score → report pipeline smoke test against fixtures (no network).
- Refactor scripts/awsdd/collect_rss.py and collect_github.py to expose pure entry_to_item / release_to_item conversion functions so the collectors can be unit-tested without hitting RSS or the GitHub API. The collect() entry points are unchanged.
- Apply ruff format + auto-fixes across the awsdd package (datetime.UTC alias, import ordering, no behavior change).
- Makefile: add dev-install / test / lint / format / audit targets.
- README: add a CI table and a Local commands block listing the new make targets.

PR gating relies on branch protection (Settings → Branches) requiring ci.yml's job statuses to pass before merge into main. That's a repo config step, not in this commit.

github-advanced-security · 2026-05-17T12:04:02Z

You are seeing this message because GitHub Code Scanning has recently been set up for this repository, or this pull request contains the workflow file for the Code Scanning tool.

What Enabling Code Scanning Means:

The 'Security' tab will display more code scanning analysis results (e.g., for the default branch).
Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results.
You will be able to see the analysis results for the pull request's branch on this overview once the scans have completed and the checks have passed.

For more information about GitHub Code Scanning, check out the documentation.

- collect_github._get now guards with isinstance(res, list) so an unexpected dict response (rate-limit envelope, 404, etc.) does not poison downstream iteration with AttributeError. [HIGH]
- report._parse_iso and score._parse_iso drop the manual `.replace("Z", "+00:00")`; datetime.fromisoformat handles the Z suffix natively from Python 3.11 onward. Also narrow the bare `except` to (ValueError, TypeError). [MEDIUM]
- report.render writes links as `[title](<url>)` so AWS URLs that contain parentheses (common in whats-new and release-notes paths) do not break vanilla Markdown link parsing. [MEDIUM]
- new-deep-dive.sh and new-track.sh switch sed delimiter from `/` to `|` so a track / topic that contains a slash does not abort sed. [MEDIUM]
- Regenerate today's seed reports with the new angle-bracket link format so the committed sample matches the live formatter.

Date-handling robustness:

- collect_rss._iso, collect_github.release_to_item, report._parse_iso, score._parse_iso: fall back to the Unix epoch (1970-01-01) instead of datetime.now(UTC) when a date is missing or corrupted. The old behaviour gave malformed items a maximum freshness score and pushed them to the top of every report; the new behaviour leaves them at ~0 freshness so they sink. [MEDIUM x4]

Network robustness:

- collect_rss.collect now fetches via urlopen with a 30s timeout and passes the body to feedparser.parse, instead of feedparser.parse(url) which has no built-in timeout. A single stuck origin no longer hangs the whole pipeline. [MEDIUM]

XSS hardening:

- web/src/lib/markdown.ts pipes marked's output through isomorphic-dompurify before returning. Report content is rendered via `set:html`, and titles / summaries originate in untrusted RSS and GitHub feeds, so unsanitized HTML could land in the page (verified by smoke: a fixture with `<script>`, `onerror`, and `javascript:` URLs is fully stripped). [SECURITY-MEDIUM]

Tests:

- tests/test_score.py: regression test that a garbage published_at scores near-zero freshness. Locks in the epoch fallback.
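
The epoch fallback described above can be sketched roughly as follows. The names `parse_iso` and `EPOCH` match the shared helper that a later commit in this PR moves into `scripts/awsdd/_dates.py`, but the body is an illustration under that assumption, not the repository's code. The naive-datetime branch mirrors the UTC assumption a later commit adds.

```python
from datetime import datetime, timezone

# Sentinel for "date unknown": old enough that freshness decays to ~0.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_iso(value):
    """Parse an ISO-8601 string; return EPOCH when it is missing or malformed."""
    if not value:
        return EPOCH
    try:
        parsed = datetime.fromisoformat(value)
    except (ValueError, TypeError):
        return EPOCH
    if parsed.tzinfo is None:
        # Assume UTC so comparisons against a tz-aware "now" do not raise.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed
```

Falling back to the epoch rather than to the current time is the key design choice: a malformed date then ranks an item as very old instead of brand new.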

Type precision:

- schema.Item.source_kind is now Literal["rss", "github"] instead of bare str, matching the comment and actual usage.

Robustness:

- report.render now escapes both `[` and `]` in titles (previously only `]`), so titles like "[CVE-2026-xxx]" cannot collide with markdown link parsing.
- normalize.normalize: narrow the two bare `except Exception` blocks to `(OSError, json.JSONDecodeError)` so genuine bugs surface instead of being swallowed.
- collect_github._get: narrow the fallback `except Exception` to `(URLError, TimeoutError, json.JSONDecodeError)` (HTTPError is still caught separately above).

Reproducibility:

- requirements.txt and requirements-dev.txt now pin exact versions with `==`. Dependabot continues to PR upgrades, but a fresh checkout + `pip install -r ...` now resolves to a known-good set.

Config / docs:

- tracks/iam/config/sources.yaml: drop `spiffe/spiffe` (verified via GitHub API: repo exists but has zero releases; it's the specs tree, not a release surface).
- README.md: remove the duplicate 3-bullet "CI" section that the new CI table superseded.

Regenerated today's seed reports under tracks/*/reports/ so they match the new formatter output (extra `\[` escape would only matter on titles containing `[`, but regen keeps the committed sample byte-identical to what CI would produce now).

Accepted (5 / 9):

- .pre-commit-config.yaml: bump astral-sh/ruff-pre-commit rev v0.8.4 → v0.15.13. The 0.8.4 tag really does not exist (gemini was right); v0.15.13 is current and matches the pinned `ruff` runtime version. [HIGH]
- .github/dependabot.yml: Asia/Tokyo → Etc/UTC for all three update blocks. Neutral default for a public repo. [MEDIUM]
- README.md: rewrite the scoring formula description to match the actual `(freshness × 2) + (keyword × source-weight) + severity` implementation (it was incorrectly listed as fully multiplicative). Also update the Layout block: `pr-checks.yml` was replaced by `ci.yml` plus `codeql.yml` and `audit.yml` two commits ago but the doc still pointed at the old name. [MEDIUM]
- scripts/awsdd/_dates.py (new): house the shared `parse_iso` and an EPOCH constant. score.py and report.py now import from it instead of carrying duplicate `_parse_iso` definitions. [MEDIUM - DRY]
- web/src/lib/data.ts: hoist the `86_400_000` magic number into a `MS_IN_DAY` constant. [MEDIUM]

Rejected (4 / 9), reasoning:

- requirements{,-dev}.txt "version does not exist" [CRITICAL ×2]: verified via `curl https://pypi.org/pypi/<pkg>/<ver>/json`; all seven pinned versions return HTTP 200 and `pip install` succeeds both locally and in every prior CI run. Gemini's knowledge cutoff predates these releases; the pins stay.
- pyproject.toml `SIM108` (ternary nag) [MEDIUM]: deliberately disabled because `if/else` blocks often read clearer than nested ternaries in this code base. No change.
- collect_rss `BeautifulSoup` swap [MEDIUM]: regex strip + DOMPurify sanitization on the rendering side already cover the layered defense; the dep cost outweighs the marginal robustness gain for the well-formed AWS feeds we consume. No change.

Regenerated seed reports so the committed sample matches the (byte-identical) output the formatter now produces.

Both findings legitimate, both applied:

- collect_rss._fetch caps the response at 10 MiB (MAX_FEED_BYTES) via r.read(MAX_FEED_BYTES). Without a bound, a hostile or runaway origin could exhaust the worker's memory. AWS feeds are sub-megabyte; the cap is just a safety net. [MEDIUM]
- collect_rss._strip_html now unescapes BEFORE stripping tags. The old order let entity-encoded markup (`&lt;script&gt;...`) bypass the strip pass and resurface as raw HTML after html.unescape ran on the regex output. New regression test `test_summary_strips_entity_encoded_tags` locks in the corrected order. DOMPurify on the web rendering side is still the last line of defense, but cleaning at the source is the right layer. [MEDIUM]
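
To make the ordering issue concrete, here is a small comparison of the two orders using a simple tag-stripping regex. The regex and both function names are assumptions for this illustration; the commit does not show the actual pattern used in `_strip_html`.

```python
import html
import re

# Illustrative tag pattern; the project's real regex is not shown in the PR.
TAG_RE = re.compile(r"<[^>]*>")


def strip_then_unescape(raw):
    """Old order: entity-encoded tags are invisible to the regex, then get decoded."""
    return html.unescape(TAG_RE.sub("", raw))


def unescape_then_strip(raw):
    """New order: decode entities first so the regex sees the tags."""
    return TAG_RE.sub("", html.unescape(raw))


raw = "&lt;script&gt;alert(1)&lt;/script&gt; hello"
leaky = strip_then_unescape(raw)
clean = unescape_then_strip(raw)
```

With the old order the decoded `<script>` tags survive in `leaky`; with the new order `clean` contains no angle brackets. The script body text still remains in this regex version, which is one reason a parser-based extractor is a stronger choice.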

All 6 findings legitimate, all applied (with regression tests):

- collect_github._get_page caps r.read at 10 MiB (MAX_RESPONSE_BYTES), same DoS guard as collect_rss. [MEDIUM]
- collect_github now follows Link.rel="next" up to MAX_PAGES=5 pages via _get_all + _next_url, and default per_page bumped 20→50. A busy repo (aws-cli, aws-cdk) could previously lose releases beyond page 1; with daily CI the new bound covers all realistic catch-up windows. Three tests pin the Link-header parser. [MEDIUM]
- collect_rss._strip_html now uses stdlib html.parser instead of a regex. A real parser handles `<`/`>` inside text content (the old regex chomped through "1 < 2 and 4 > 3"), and a _TextExtractor that tracks `<script>`/`<style>` depth drops their contents entirely. Two regression tests cover both cases. [MEDIUM]
- collect_rss._severity now matches with `\b(critical|high|medium|low)\b`. The old substring check turned "Slow performance" into a low-severity item and "Highlighted feature" into a high one. [MEDIUM]
- score._keyword_hits uses `\b{re.escape(keyword)}\b`. The old substring check let `iam` hit "diagram" and `sts` hit "tests". Two regression tests lock in that "updated diagram and test results" scores zero and "iam supports sts session tags" scores correctly. [MEDIUM]
- web/src/lib/data.ts keeps the process.cwd()-based ROOT (we validated that import.meta.url breaks under Astro's bundler in a prior attempt), but now asserts at module load that ROOT/tracks exists and throws a clear error if invoked from the wrong cwd, so the previous silent "no items yet" failure mode becomes loud. [MEDIUM]

Side effects of the word-bounded scoring change: all four tracks were re-scored against the existing normalized.json snapshots and seed reports regenerated. Top items shift slightly (substring-only hits no longer count) but the change is purely a precision improvement.
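
A parser-based text extractor along the lines described above might look like the sketch below. The class name `_TextExtractor` comes from the commit message; the method bodies, the `strip_html` wrapper, and the choice to join fragments with a space are assumptions for illustration.

```python
import html
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(raw):
    """Unescape first, extract visible text, then collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html.unescape(raw))
    parser.close()
    # Joining with a space keeps neighbouring block elements from merging.
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()
```

Unlike the regex approach, the parser treats a bare `<` followed by a space as text, so comparisons in prose survive, and script contents are dropped rather than left behind as plain text.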

All 4 findings legitimate, all applied:

- collect_github.collect now uses `entry.get("repo")` + skip-with-log instead of `entry["repo"]`. A malformed sources.yaml entry no longer crashes the whole track. New test exercises a mixed-validity config. [MEDIUM]
- collect_rss.collect: same defensive `.get()` treatment for the feed `id` and `url` keys, with the same skip-and-log pattern. New test verifies two malformed entries are skipped and the well-formed one still fires. [MEDIUM]
- collect_rss._strip_html joins extracted text with " " instead of "" so block-element neighbours like `<div>A</div><div>B</div>` do not merge into `AB`. The trailing \s+ collapse keeps the output tidy. Regression test pins it. [MEDIUM]
- normalize.normalize gains a 180-day retention pass: items with `published_at` older than the cutoff are dropped from normalized.json. This is the file that's committed and loaded whole by the web build, so unbounded growth was a real concern as the project ages. Test asserts old + epoch-fallback items are pruned, recent ones kept. [MEDIUM]

Side effect: regenerated seed data + reports. Retention bites hard this round because the initial collection swept years of Security Bulletins history:

- iam: 293 → 178 (pruned 115)
- security: 1907 → 56 (pruned 1851)
- releases: 135 → 117 (pruned 18)
- whats-new: 170 → 170 (already inside window)

That's a ~5 MB reduction in the committed JSON tree.
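
The retention pass can be pictured with a minimal sketch like the one below. The function name `prune_old`, the dict-shaped items, and the decision to drop items with unparseable dates are assumptions; the commit only states that items older than 180 days, including epoch-fallback items, are removed.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 180  # window stated in the commit message


def prune_old(items, now, retention_days=RETENTION_DAYS):
    """Keep only items whose published_at falls inside the retention window."""
    cutoff = now - timedelta(days=retention_days)
    kept = []
    for item in items:
        try:
            published = datetime.fromisoformat(item.get("published_at") or "")
        except (ValueError, TypeError):
            continue  # unparseable date: treated as too old in this sketch
        if published.tzinfo is None:
            published = published.replace(tzinfo=timezone.utc)
        if published >= cutoff:
            kept.append(item)
    return kept


now = datetime(2026, 5, 17, tzinfo=timezone.utc)
items = [
    {"id": "recent", "published_at": "2026-05-01T00:00:00+00:00"},
    {"id": "old", "published_at": "2025-01-01T00:00:00+00:00"},
    {"id": "epoch", "published_at": "1970-01-01T00:00:00+00:00"},
]
kept_ids = [item["id"] for item in prune_old(items, now)]
```

Passing `now` in explicitly keeps the function deterministic, which makes a regression test like the one mentioned above straightforward to write.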

12 of 13 findings applied; LICENSE 0-draft is the real GitHub org name (not a placeholder) so left as-is.

Defensive parsing:

- config.load_sources: type-guards `yaml.safe_load` output. A non-dict YAML root used to slip through `or {}` and crash downstream .get() calls. [coderabbit, outside-diff]
- _dates.parse_iso: when the input has no timezone offset, fall back to UTC instead of returning a naive datetime that would TypeError against the tz-aware `now` in score / report. [coderabbit, major]
- score.score_item: wrap the `float()` cast on source_weights in try/except; a YAML typo no longer aborts the whole pipeline. [coderabbit, major]
- collect_rss.entry_to_item: `entry.get("tags") or []` (was `, []`) so an explicit `None` in feedparser output doesn't TypeError the comprehension. [gemini]

Defensive reporting:

- report.render: in the empty-window fallback, explicitly sort items by score before slicing. scored.json is usually pre-sorted but the defense costs nothing. [coderabbit, major]
- web/src/lib/data.ts loadScored: throw on JSON parse failure instead of returning [] and shipping an apparently-successful empty site. [coderabbit, major]

Automation hardening:

- Makefile: TRACKS now auto-discovered via `$(wildcard tracks/*/)`. `make new-track NAME=foo` is immediately picked up by install / update / weekly instead of needing a manual edit. [coderabbit, major]
- new-track.sh, new-deep-dive.sh: escape sed metacharacters (`&`, `\`, `|`) in the substitution values so a name containing those doesn't trigger sed's whole-match replacement. [coderabbit, minor]

Workflow safety:

- daily-update.yml, weekly-digest.yml: pin checkout `ref: main` so a workflow_dispatch from a feature branch can't commit non-main data to main. [coderabbit, major]
- deploy-pages.yml: pin checkout to main + add an `if: github.ref == 'refs/heads/main'` guard on the build job to make manual dispatches from other branches a no-op. [coderabbit, major]

Tests:

- test_score.test_malformed_source_weight_falls_back_to_one
- test_normalize.test_load_sources_returns_dict_for_non_mapping_root
- test_normalize.test_parse_iso_assumes_utc_for_naive_input

Side effect: seed regenerated (no diff in content beyond what the re-score produces against the same normalized snapshot).
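
As one example of the defensive parsing above, the guarded `float()` cast on source weights could be sketched like this. The helper name `source_weight` and its signature are hypothetical; the default of 1.0 is inferred from the test name `test_malformed_source_weight_falls_back_to_one`.

```python
def source_weight(weights, source_id, default=1.0):
    """Return the configured weight as a float, or the default when it is unusable."""
    raw = weights.get(source_id, default) if isinstance(weights, dict) else default
    try:
        return float(raw)
    except (TypeError, ValueError):
        # A YAML typo such as "1,5" or an explicit null ends up here.
        return default
```

A quick usage check: `source_weight({"aws-security": "1.5"}, "aws-security")` gives 1.5, while a value of `"1,5"` or `None` yields the default 1.0 instead of raising.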

All 3 findings legitimate, all applied:

- collect_rss._fetch now returns raw bytes instead of decoded str. feedparser does its own encoding detection (XML prolog charset, BOM, Content-Type) and gzip handling when fed bytes; pre-decoding to UTF-8 with errors='replace' silently defeats that and would also mojibake non-UTF-8 feeds. collect() already passes the result directly to feedparser.parse, so no caller change. [MEDIUM]
- collect_github._get_page reads MAX_RESPONSE_BYTES + 1 and refuses any response that hits the cap, instead of decoding a truncated body with errors='replace' that could feed corrupt JSON into json.loads. The strict utf-8 decode now surfaces a real encoding bug as UnicodeDecodeError (also caught) instead of being papered over. [MEDIUM]
- report.render escapes `>` in URLs to %3E before wrapping in the Markdown `<...>` angle pair. A query string like `?q=a>b` would otherwise close the URL pair early and break link parsing. New test_gt_in_url_is_escaped pins it. [MEDIUM]
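
The "read one byte past the cap, then refuse" pattern from the second item can be sketched as follows. `read_capped` and `ResponseTooLarge` are hypothetical names, and the sketch assumes the stream's `read(n)` returns up to `n` bytes in a single call, as `io.BytesIO` does.

```python
import io

MAX_RESPONSE_BYTES = 10 * 1024 * 1024  # 10 MiB, the cap named in the PR


class ResponseTooLarge(Exception):
    """Raised when a body exceeds the configured cap (hypothetical name)."""


def read_capped(stream, limit=MAX_RESPONSE_BYTES):
    """Read at most limit bytes; raise instead of returning a truncated body."""
    body = stream.read(limit + 1)
    if len(body) > limit:
        raise ResponseTooLarge(f"response exceeds {limit} bytes")
    return body


ok = read_capped(io.BytesIO(b"x" * 10), limit=10)
try:
    read_capped(io.BytesIO(b"x" * 11), limit=10)
    refused = False
except ResponseTooLarge:
    refused = True
```

Reading `limit + 1` bytes is what distinguishes a body of exactly `limit` bytes from an oversized one, so truncated JSON never reaches the parser.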

All 8 findings legitimate, all applied. Three logical changes:

1. Explicit UTF-8 on every read_text / write_text in scripts/awsdd/ (config.py, normalize.py, score.py, report.py, collect_rss.py, collect_github.py). Implicit locale-dependent encoding is a portability hazard: non-UTF-8 default locales (some Windows environments, exotic CI runners) would corrupt non-ASCII content silently when we already use ensure_ascii=False. [MEDIUM ×6]
2. collect_rss._fetch now refuses oversized responses the same way collect_github does: read MAX_FEED_BYTES + 1, error if the cap was hit. The old code returned a truncated byte string and let feedparser silently bozo-error on the malformed tail. [MEDIUM]
3. report.render collapses `\r` and `\n` in titles to spaces before the bracket-escape pass. A `\n# heading` in a feed title used to spawn a stray Markdown heading inside the list. Regression test test_newline_in_title_is_collapsed pins it. [MEDIUM]

Seed reports regenerated; web rebuilds 10 pages.
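
Putting the link-hardening steps from this and the earlier commits together, a formatter might look like the sketch below. `md_link` is a hypothetical helper; report.render's real structure is not shown in the PR, so only the four transformations themselves are taken from the commit messages.

```python
def md_link(title, url):
    """Render a Markdown link that survives hostile titles and awkward URLs."""
    # 1. Collapse line breaks so a title cannot start a new Markdown block.
    clean = title.replace("\r", " ").replace("\n", " ")
    # 2. Escape both brackets so the title cannot end the link text early.
    clean = clean.replace("[", "\\[").replace("]", "\\]")
    # 3. Percent-encode ">" so it cannot close the angle-bracket URL pair.
    safe_url = url.replace(">", "%3E")
    # 4. Angle brackets let URLs containing parentheses parse correctly.
    return f"[{clean}](<{safe_url}>)"


link = md_link("[CVE-2026-xxx] fix\n# heading", "https://example.com/a_(b)?q=a>b")
```

The order matters: newlines are collapsed before the bracket escape, matching the sequence described in item 3.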

All 3 findings legitimate, all applied:

- collect_github.collect: add `isinstance(entry, dict)` guard before `entry.get("repo")`. A null or scalar YAML entry (`- foo`) used to AttributeError out of the whole track; now it's logged and skipped. Existing malformed-entry test extended with null and string cases. [MEDIUM]
- collect_rss.collect: same isinstance guard for `feed`. Existing malformed-feed test extended with a `null` entry. [MEDIUM]
- score._compiled_patterns: wrap the per-keyword regex compile in `functools.lru_cache` keyed on the keyword tuple. score_item is called once per item (hundreds per track per run); without caching we'd recompile the same primary/secondary patterns every call. Switched score_item's keyword extraction to tuples so they're hashable for the cache. No behaviour change: same word-bounded semantics, just compiled once per distinct keyword set. [MEDIUM]
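
The caching change in the last item can be illustrated with a short sketch. The function names here drop the leading underscore and are stand-ins; the case-insensitive flag is an assumption, while the `\b...\b` pattern and the tuple cache key come from the commit messages.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def compiled_patterns(keywords):
    """Compile one word-bounded pattern per keyword; keywords must be a tuple."""
    return tuple(
        re.compile(rf"\b{re.escape(kw)}\b", re.IGNORECASE) for kw in keywords
    )


def keyword_hits(text, keywords):
    """Count how many keywords match as whole words."""
    # tuple() makes a list from YAML hashable so it can serve as a cache key.
    return sum(1 for p in compiled_patterns(tuple(keywords)) if p.search(text))
```

Calling `keyword_hits` repeatedly with the same keyword list reuses the compiled tuple, so only the first item in a run pays the compilation cost. Python's `re` module also keeps an internal cache of recent patterns, so the gain is mostly in skipping the per-call escape and lookup work.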

kanywst · 2026-05-17T13:42:35Z

/gemini review

All 5 findings legitimate, all applied:

Defensive exception handling:

- collect_github._get_page: add ValueError to except. urlopen raises it for malformed URLs ("unknown url type" etc.), e.g. a sources.yaml entry with a missing scheme. [MEDIUM]
- collect_rss._fetch: same ValueError addition. [MEDIUM]
- score.score: wrap json.loads(normalized.json) in try/except. A corrupt or truncated file (disk-full mid-write etc.) used to crash the whole scoring step instead of leaving the previous scored.json in place. [MEDIUM]
- report.render: same try/except around json.loads(scored.json). [MEDIUM]

Stale guidance:

- new-track.sh: the "next steps" message still told users to edit TRACKS in the root Makefile, but Makefile now auto-discovers tracks via `wildcard tracks/*/`. Updated to point at the two places that do still need a manual update for a new track, with a footnote that the root Makefile is auto: [MEDIUM]
  - .github/workflows/{daily-update,weekly-digest}.yml matrix.track
  - web/src/lib/data.ts TRACKS const

kanywst · 2026-05-17T13:51:43Z

/gemini review

gemini-code-assist

Code Review

This pull request initializes the aws-deepdive project, an automated system for collecting, scoring, and reporting AWS identity and security updates. The implementation features a Python-based data pipeline for RSS and GitHub release ingestion, a customizable scoring engine, and an Astro-powered web frontend for displaying daily and weekly digests. Additionally, the PR includes comprehensive test coverage, GitHub Actions workflows for automated updates, and scaffolding scripts for extending the project with new tracks. I have no feedback to provide.

kanywst added 2 commits May 17, 2026 20:57

fix(ci): bump Node to 22 — Astro 6 dropped Node 20 support

ff2a324

kanywst added 2 commits May 17, 2026 21:03

chore: ignore pytest/coverage/ruff caches

37414ea

style: ruff format tests/test_report.py (CI fix)

f85a865

gemini-code-assist Bot reviewed May 17, 2026

kanywst merged commit 7776e88 into main May 17, 2026

kanywst deleted the feat/initial-scaffold branch May 17, 2026 13:57
