feat: initial implementation — pipeline, four tracks, web viewer, CI#1
Adds the full implementation:
- scripts/awsdd/: shared Python package (collect_rss, collect_github,
normalize, score, report) reading per-track config/sources.yaml.
Scoring is freshness*2 + keyword*source_weight + severity, so generic
fresh items only get the freshness baseline and topic-specific
keyword hits get amplified by source trust.
- tracks/{iam,security,whats-new,releases}/: each self-contained with
Makefile, config/sources.yaml, and README. Track Makefile derives the
track name from CURDIR so the body is identical everywhere.
- templates/new-track/ + scripts/new-track.sh / new-deep-dive.sh for
scaffolding additional tracks and deep-dive notes.
- web/: Astro 6 + Tailwind v4 (postcss) + recharts island. Static
output, reads tracks/*/data/scored.json and tracks/*/reports/**/*.md
at build time. Routes: /, /[track]/, /[track]/[mode]/[slug]/.
- .github/workflows/: daily-update (06:00 UTC, matrix.track),
weekly-digest (Mon 08:00 UTC), deploy-pages (push to main),
pr-checks. .github/dependabot.yml covers pip / npm / github-actions.
- MIT license.
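The scoring rule in the first bullet above can be sketched as follows. This is a minimal illustration, not the package's code: the function name, the exponential freshness decay, and the 7-day half-life are assumptions, since the commit message only names the three terms.

```python
from datetime import datetime, timedelta, timezone


def score_item(published_at, keyword_hits, source_weight, severity, now,
               half_life_days=7.0):
    """Return freshness*2 + keyword*source_weight + severity.

    The decay curve and half-life are assumptions of this sketch.
    """
    age_days = max((now - published_at).total_seconds() / 86_400, 0.0)
    freshness = 0.5 ** (age_days / half_life_days)  # 1.0 for a brand-new item
    return freshness * 2 + keyword_hits * source_weight + severity


now = datetime(2026, 5, 17, tzinfo=timezone.utc)
# A fresh item with no keyword hits only gets the freshness baseline.
generic = score_item(now, keyword_hits=0, source_weight=1.5, severity=0, now=now)
# Keyword hits are multiplied by the trust weight of the source.
on_topic = score_item(now, keyword_hits=2, source_weight=1.5, severity=1, now=now)
# One half-life later the freshness term has halved.
week_old = score_item(now - timedelta(days=7), 0, 1.5, 0, now=now)
```

Because the keyword term is the only one scaled by `source_weight`, a trusted source cannot lift an off-topic item above the freshness baseline.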
Output of `make update` for all four tracks plus `make -C tracks/iam weekly`,
captured locally on 2026-05-17 so the Pages site has content before the
first scheduled CI run lands.
iam: 261 RSS + 60 GH = 293 scored
security: 1907 RSS + 0 GH = 1907 scored
whats-new: 170 RSS + 0 GH = 170 scored
releases: 0 RSS + 135 GH = 135 scored
These files are regenerated daily by .github/workflows/daily-update.yml and
pruned by scripts/prune.sh (raw kept 30d, daily reports kept 60d).
📝 Walkthrough
This PR introduces a complete AWS content aggregation and reporting platform. It implements a Python backend pipeline (RSS/GitHub collection, normalization, scoring) with four pre-configured tracks, multi-step GitHub Actions automation, and a static Astro web frontend. The system collects identity/auth, security, cloud news, and release information from configured sources, ranks by custom scoring logic, and publishes daily/weekly digests alongside an interactive dashboard.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
This PR introduces a substantial new system with multiple interconnected layers: Python data pipeline (collection, normalization, scoring), Bash/Make automation infrastructure, four fully-configured tracks with real data, eight GitHub Actions workflows, comprehensive test coverage, and a complete Astro web frontend. While many individual files are straightforward configuration or data files, the integration points (script entry points, data contracts, API pagination, scoring logic, route generation) demand careful review. The changes span Python, Bash, YAML, JavaScript, TypeScript, and Markdown across a broad scope, and the end-to-end flow from collection through rendering requires understanding the entire pipeline. Medium-to-high complexity overall due to logic density in collectors/scorers and the breadth of moving parts.
What changed:
- Replace pr-checks.yml with ci.yml: 6 parallel jobs, namely lint-python
(ruff check + format), lint-web (astro check), lint-meta (markdownlint +
actionlint), test-python (pytest + coverage XML artefact), build-python
(compileall), build-web (astro build). Runs on PR and on push to main
(paths-ignore for data-only commits).
- Add codeql.yml for python + javascript-typescript with the
security-and-quality query suite. Weekly cron + on PR/push to main.
- Add audit.yml: pip-audit, npm audit --audit-level=high, license reports
for both, and a copyleft gate (fails on GPL / AGPL in the Python tree,
allows LGPL). Weekly cron + on deps-file PRs.
- Add pyproject.toml with ruff and pytest config, pinning py312 + enabling
rules E/W/F/I/B/UP/SIM. Coverage source = awsdd.
- Add requirements-dev.txt (pytest, pytest-cov, ruff, pip-audit,
pip-licenses).
- Add .markdownlint-cli2.jsonc mirroring kt's global config so CI and local
agree on the same disabled rules.
- Add .pre-commit-config.yaml (ruff, markdownlint, actionlint, basic
hygiene hooks) for opt-in local enforcement.
- Add pytest suite at tests/ (24 tests, 71% line coverage): score,
normalize, report, collect_rss, collect_github, plus a full normalize →
score → report pipeline smoke test against fixtures (no network).
- Refactor scripts/awsdd/collect_rss.py and collect_github.py to expose
pure entry_to_item / release_to_item conversion functions so the
collectors can be unit-tested without hitting RSS or the GitHub API. The
collect() entry points are unchanged.
- Apply ruff format + auto-fixes across the awsdd package (datetime.UTC
alias, import ordering, no behavior change).
- Makefile: add dev-install / test / lint / format / audit targets.
- README: add a CI table and a Local commands block listing the new make
targets.
PR gating relies on branch protection (Settings → Branches) requiring
ci.yml's job statuses to pass before merge into main. That's a repo config
step, not in this commit.
- collect_github._get now guards with isinstance(res, list) so an
unexpected dict response (rate-limit envelope, 404, etc.) does not
poison downstream iteration with AttributeError. [HIGH]
- report._parse_iso and score._parse_iso drop the manual `.replace
("Z", "+00:00")`; datetime.fromisoformat handles the Z suffix
natively from Python 3.11 onward. Also narrow the bare `except` to
(ValueError, TypeError). [MEDIUM]
- report.render writes links as `[title](<url>)` so AWS URLs that
contain parentheses (common in whats-new and release-notes paths)
do not break vanilla Markdown link parsing. [MEDIUM]
- new-deep-dive.sh and new-track.sh switch sed delimiter from `/` to
`|` so a track / topic that contains a slash does not abort sed.
[MEDIUM]
- Regenerate today's seed reports with the new angle-bracket link
format so the committed sample matches the live formatter.
Date-handling robustness:
- collect_rss._iso, collect_github.release_to_item, report._parse_iso,
score._parse_iso: fall back to the Unix epoch (1970-01-01) instead of
datetime.now(UTC) when a date is missing or corrupted. The old behaviour
gave malformed items a maximum freshness score and pushed them to the top
of every report; the new behaviour leaves them at ~0 freshness so they
sink. [MEDIUM x4]
Network robustness:
- collect_rss.collect now fetches via urlopen with a 30s timeout and
passes the body to feedparser.parse, instead of feedparser.parse(url)
which has no built-in timeout. A single stuck origin no longer hangs the
whole pipeline. [MEDIUM]
XSS hardening:
- web/src/lib/markdown.ts pipes marked's output through
isomorphic-dompurify before returning. Report content is rendered via
`set:html`, and titles / summaries originate in untrusted RSS and GitHub
feeds, so unsanitized HTML could land in the page (verified by smoke: a
fixture with <script>, onerror, and javascript: URLs is fully stripped).
[SECURITY-MEDIUM]
Tests:
- tests/test_score.py: regression test that a garbage published_at scores
near-zero freshness. Locks in the epoch fallback.
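The epoch fallback described above can be sketched like this. The names `parse_iso` and `EPOCH` mirror ones that appear elsewhere in this PR, but the body is an assumed minimal version, not the repository's code.

```python
from datetime import datetime, timezone

# Sentinel for missing or corrupt dates: old enough that freshness is ~0.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_iso(value):
    """Parse an ISO 8601 timestamp, falling back to EPOCH on bad input.

    On Python 3.11+ datetime.fromisoformat also accepts a trailing "Z",
    so no manual replace("Z", "+00:00") is needed there.
    """
    try:
        return datetime.fromisoformat(value)
    except (ValueError, TypeError):
        # ValueError: malformed string; TypeError: None or a non-string.
        return EPOCH


good = parse_iso("2026-05-17T06:00:00+00:00")
garbage = parse_iso("not a date")
missing = parse_iso(None)
```

Returning a fixed old date rather than "now" means a broken feed entry sinks to the bottom of a report instead of floating to the top.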
Type precision:
- schema.Item.source_kind is now Literal["rss", "github"] instead of bare
str, matching the comment and actual usage.
Robustness:
- report.render now escapes both `[` and `]` in titles (previously only
`]`), so titles like "[CVE-2026-xxx]" cannot collide with markdown link
parsing.
- normalize.normalize: narrow the two bare `except Exception` blocks to
`(OSError, json.JSONDecodeError)` so genuine bugs surface instead of being
swallowed.
- collect_github._get: narrow the fallback `except Exception` to
`(URLError, TimeoutError, json.JSONDecodeError)` (HTTPError is still
caught separately above).
Reproducibility:
- requirements.txt and requirements-dev.txt now pin exact versions with
`==`. Dependabot continues to PR upgrades, but a fresh checkout +
`pip install -r ...` now resolves to a known-good set.
Config / docs:
- tracks/iam/config/sources.yaml: drop `spiffe/spiffe` (verified via
GitHub API: repo exists but has zero releases; it's the specs tree, not a
release surface).
- README.md: remove the duplicate 3-bullet "CI" section that the new CI
table superseded.
Regenerated today's seed reports under tracks/*/reports/ so they match the
new formatter output (extra `\[` escape would only matter on titles
containing `[`, but regen keeps the committed sample byte-identical to
what CI would produce now).
Accepted (5 / 9):
- .pre-commit-config.yaml: bump astral-sh/ruff-pre-commit rev v0.8.4
→ v0.15.13. The 0.8.4 tag really does not exist (gemini was right);
v0.15.13 is current and matches the pinned `ruff` runtime version.
[HIGH]
- .github/dependabot.yml: Asia/Tokyo → Etc/UTC for all three update
blocks. Neutral default for a public repo. [MEDIUM]
- README.md: rewrite the scoring formula description to match the
actual `(freshness × 2) + (keyword × source-weight) + severity`
implementation (it was incorrectly listed as fully multiplicative).
Also update the Layout block: `pr-checks.yml` was replaced by
`ci.yml` plus `codeql.yml` and `audit.yml` two commits ago but the
doc still pointed at the old name. [MEDIUM]
- scripts/awsdd/_dates.py (new): house the shared `parse_iso` and an
EPOCH constant. score.py and report.py now import from it instead
of carrying duplicate `_parse_iso` definitions. [MEDIUM - DRY]
- web/src/lib/data.ts: hoist the `86_400_000` magic number into a
`MS_IN_DAY` constant. [MEDIUM]
Rejected (4 / 9), reasoning:
- requirements{,-dev}.txt "version does not exist" [CRITICAL ×2]:
verified via `curl https://pypi.org/pypi/<pkg>/<ver>/json` — all
seven pinned versions return HTTP 200 and `pip install` succeeds
both locally and in every prior CI run. Gemini's knowledge cutoff
predates these releases; the pins stay.
- pyproject.toml `SIM108` (ternary nag) [MEDIUM]: deliberately
disabled because `if/else` blocks often read clearer than nested
ternaries in this code base. No change.
- collect_rss `BeautifulSoup` swap [MEDIUM]: regex strip + DOMPurify
sanitization on the rendering side already cover the layered
defense; the dep cost outweighs the marginal robustness gain for
the well-formed AWS feeds we consume. No change.
Regenerated seed reports so the committed sample matches the (byte-
identical) output the formatter now produces.
Both findings legitimate, both applied:
- collect_rss._fetch caps the response at 10 MiB (MAX_FEED_BYTES) via
r.read(MAX_FEED_BYTES). Without a bound, a hostile or runaway origin could
exhaust the worker's memory. AWS feeds are sub-megabyte; the cap is just a
safety net. [MEDIUM]
- collect_rss._strip_html now unescapes BEFORE stripping tags. The old
order let entity-encoded markup (`&lt;script&gt;...`) bypass the strip
pass and resurface as raw HTML after html.unescape ran on the regex
output. New regression test `test_summary_strips_entity_encoded_tags`
locks in the corrected order. DOMPurify on the web rendering side is still
the last line of defense, but cleaning at the source is the right layer.
[MEDIUM]
All 6 findings legitimate, all applied (with regression tests):
- collect_github._get_page caps r.read at 10 MiB (MAX_RESPONSE_BYTES),
same DoS guard as collect_rss. [MEDIUM]
- collect_github now follows Link.rel="next" up to MAX_PAGES=5 pages
via _get_all + _next_url, and default per_page bumped 20→50. A
busy repo (aws-cli, aws-cdk) could previously lose releases beyond
page 1; with daily CI the new bound covers all realistic catch-up
windows. Three tests pin the Link-header parser. [MEDIUM]
- collect_rss._strip_html now uses stdlib html.parser instead of a
regex. A real parser handles `<`/`>` inside text content (the old
regex chomped through "1 < 2 and 4 > 3"), and a _TextExtractor that
tracks <script>/<style> depth drops their contents entirely. Two
regression tests cover both cases. [MEDIUM]
- collect_rss._severity now matches with `\b(critical|high|medium|
low)\b`. The old substring check turned "Slow performance" into a
low-severity item and "Highlighted feature" into a high one. [MEDIUM]
- score._keyword_hits uses `\b{re.escape(keyword)}\b`. The old
substring check let `iam` hit "diagram" and `sts` hit "tests". Two
regression tests lock in that "updated diagram and test results"
scores zero and "iam supports sts session tags" scores correctly.
[MEDIUM]
- web/src/lib/data.ts keeps the process.cwd()-based ROOT (we
validated that import.meta.url breaks under Astro's bundler in a
prior attempt), but now asserts at module load that ROOT/tracks
exists and throws a clear error if invoked from the wrong cwd, so
the previous silent "no items yet" failure mode becomes loud. [MEDIUM]
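The bounded pagination in the second bullet above can be sketched as follows. `next_url`, `walk_pages`, and the injected `link_header_for` callback are hypothetical names for illustration (the commit mentions `_get_all` and `_next_url` without showing them), and the example.com URLs are placeholders.

```python
import re

MAX_PAGES = 5  # page cap named in the commit message

# Matches one `<url>; rel="name"` element of an HTTP Link header.
_LINK_RE = re.compile(r'<([^>]*)>\s*;\s*rel="([^"]*)"')


def next_url(link_header):
    """Return the rel="next" target of a Link header, or None."""
    if not link_header:
        return None
    for url, rel in _LINK_RE.findall(link_header):
        if rel == "next":
            return url
    return None


def walk_pages(first_url, link_header_for, max_pages=MAX_PAGES):
    """Follow rel="next" links, visiting at most max_pages URLs.

    link_header_for is a stand-in for the real HTTP call: it maps a URL
    to the Link header of its response.
    """
    visited, url = [], first_url
    while url and len(visited) < max_pages:
        visited.append(url)
        url = next_url(link_header_for(url))
    return visited


# Fake seven-page listing so the sketch runs without network access.
pages = {
    f"https://example.com/releases?page={n}":
        f'<https://example.com/releases?page={n + 1}>; rel="next", '
        f'<https://example.com/releases?page=7>; rel="last"'
    for n in range(1, 7)
}
visited = walk_pages("https://example.com/releases?page=1", pages.get)
```

Passing the header lookup in as a callback keeps the parser and the page cap testable without touching the GitHub API.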
Side effects of the word-bounded scoring change: all four tracks were
re-scored against the existing normalized.json snapshots and seed
reports regenerated. Top items shift slightly (substring-only hits no
longer count) but the change is purely a precision improvement.
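A minimal sketch of the word-bounded matching behind this change, using the two example sentences from the regression tests above. The function name and the choice to count each keyword at most once are assumptions.

```python
import re


def keyword_hits(text, keywords):
    """Count keywords that occur in text as whole words (case-insensitive)."""
    lowered = text.lower()
    return sum(
        1
        for kw in keywords
        # re.escape keeps keywords such as "s3:getobject" literal.
        if re.search(rf"\b{re.escape(kw.lower())}\b", lowered)
    )


keywords = ["iam", "sts"]
# "diagram" contains "iam" and "test" is close to "sts", but neither is a word match.
false_positive_text = keyword_hits("updated diagram and test results", keywords)
real_match_text = keyword_hits("iam supports sts session tags", keywords)
```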
All 4 findings legitimate, all applied:
- collect_github.collect now uses `entry.get("repo")` + skip-with-log
instead of `entry["repo"]`. A malformed sources.yaml entry no longer
crashes the whole track. New test exercises a mixed-validity config.
[MEDIUM]
- collect_rss.collect: same defensive `.get()` treatment for the feed
`id` and `url` keys, with the same skip-and-log pattern. New test
verifies two malformed entries are skipped and the well-formed one
still fires. [MEDIUM]
- collect_rss._strip_html joins extracted text with " " instead of ""
so block-element neighbours like `<div>A</div><div>B</div>` do not
merge into `AB`. The trailing \s+ collapse keeps the output tidy.
Regression test pins it. [MEDIUM]
- normalize.normalize gains a 180-day retention pass: items with
`published_at` older than the cutoff are dropped from normalized
.json. This is the file that's committed and loaded whole by the
web build, so unbounded growth was a real concern as the project
ages. Test asserts old + epoch-fallback items are pruned, recent
ones kept. [MEDIUM]
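The `_strip_html` behaviour built up over this and the preceding commits (unescape first, parse with the stdlib `html.parser`, drop `<script>`/`<style>` contents, join text with a space) can be sketched like this. It is an assumed reconstruction from the commit messages, not the repository's implementation.

```python
import re
from html import unescape
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(text):
    parser = _TextExtractor()
    # Unescape first so entity-encoded tags are parsed as tags, not kept as text.
    parser.feed(unescape(text))
    parser.close()
    # Join on a space so neighbouring blocks do not merge, then tidy whitespace.
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


blocks = strip_html("<div>A</div><div>B</div>")
comparison = strip_html("1 < 2 and 4 > 3")
encoded = strip_html("&lt;script&gt;alert(1)&lt;/script&gt; safe")
```

A real parser treats a `<` that is not followed by a tag name as plain text, which is why the comparison sentence survives intact.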
Side effect: regenerated seed data + reports. Retention bites hard
this round because the initial collection swept years of Security
Bulletins history:
iam: 293 → 178 (pruned 115)
security: 1907 → 56 (pruned 1851)
releases: 135 → 117 (pruned 18)
whats-new: 170 → 170 (already inside window)
That's a ~5 MB reduction in the committed JSON tree.
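The retention pass can be sketched as a simple filter. `prune_old` and the item shape (a dict with a `published_at` ISO string) are assumptions for illustration; only the 180-day window comes from the commit message.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 180  # window named in the commit message


def prune_old(items, now):
    """Keep only items published within the last RETENTION_DAYS."""
    cutoff = now - timedelta(days=RETENTION_DAYS)
    kept = []
    for item in items:
        try:
            published = datetime.fromisoformat(item["published_at"])
        except (KeyError, ValueError, TypeError):
            continue  # unparseable dates are dropped, like epoch-fallback items
        if published.tzinfo is None:
            published = published.replace(tzinfo=timezone.utc)
        if published >= cutoff:
            kept.append(item)
    return kept


now = datetime(2026, 5, 17, tzinfo=timezone.utc)
items = [
    {"id": "recent", "published_at": "2026-05-01T00:00:00+00:00"},
    {"id": "old", "published_at": "2025-01-01T00:00:00+00:00"},
    {"id": "epoch", "published_at": "1970-01-01T00:00:00+00:00"},
]
kept_ids = [item["id"] for item in prune_old(items, now)]
```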
12 of 13 findings applied; LICENSE 0-draft is the real GitHub org name
(not a placeholder) so left as-is.
Defensive parsing:
- config.load_sources: type-guards `yaml.safe_load` output. A non-dict
YAML root used to slip through `or {}` and crash downstream .get()
calls. [coderabbit, outside-diff]
- _dates.parse_iso: when the input has no timezone offset, fall back
to UTC instead of returning a naive datetime that would TypeError
against the tz-aware `now` in score / report. [coderabbit, major]
- score.score_item: wrap the `float()` cast on source_weights in
try/except — a YAML typo no longer aborts the whole pipeline.
[coderabbit, major]
- collect_rss.entry_to_item: `entry.get("tags") or []` (was `, []`)
so an explicit `None` in feedparser output doesn't TypeError the
comprehension. [gemini]
Defensive reporting:
- report.render: in the empty-window fallback, explicitly sort items
by score before slicing. scored.json is usually pre-sorted but the
defense costs nothing. [coderabbit, major]
- web/src/lib/data.ts loadScored: throw on JSON parse failure
instead of returning [] and shipping an apparently-successful
empty site. [coderabbit, major]
Automation hardening:
- Makefile: TRACKS now auto-discovered via `$(wildcard tracks/*/)`.
`make new-track NAME=foo` is immediately picked up by install /
update / weekly instead of needing a manual edit. [coderabbit, major]
- new-track.sh, new-deep-dive.sh: escape sed metacharacters (`&`,
`\`, `|`) in the substitution values so a name containing those
doesn't trigger sed's whole-match replacement. [coderabbit, minor]
Workflow safety:
- daily-update.yml, weekly-digest.yml: pin checkout `ref: main` so a
workflow_dispatch from a feature branch can't commit non-main data
to main. [coderabbit, major]
- deploy-pages.yml: pin checkout to main + add an `if: github.ref ==
'refs/heads/main'` guard on the build job to make manual dispatches
from other branches a no-op. [coderabbit, major]
Tests:
- test_score.test_malformed_source_weight_falls_back_to_one
- test_normalize.test_load_sources_returns_dict_for_non_mapping_root
- test_normalize.test_parse_iso_assumes_utc_for_naive_input
Side effect: seed regenerated (no diff in content beyond what the
re-score produces against the same normalized snapshot).
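The three defensive-parsing fixes above share one pattern: coerce bad input to a safe default instead of letting it raise deep in the pipeline. A minimal sketch, with hypothetical helper names (`as_mapping`, `safe_weight`, `assume_utc`) that stand in for logic the commit places in `config.load_sources`, `score.score_item`, and `_dates.parse_iso`:

```python
from datetime import datetime, timezone


def as_mapping(loaded):
    """Return the parsed YAML root if it is a dict, else an empty dict.

    `loaded or {}` is not enough: a list or string root is truthy and
    would crash later .get() calls.
    """
    return loaded if isinstance(loaded, dict) else {}


def safe_weight(raw, default=1.0):
    """Cast a configured source weight to float, falling back to default."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return default


def assume_utc(value):
    """Attach UTC to a naive datetime so it compares with tz-aware values."""
    return value if value.tzinfo is not None else value.replace(tzinfo=timezone.utc)


root = as_mapping(["not", "a", "mapping"])
weight = safe_weight("hgih")  # a typo in sources.yaml
stamp = assume_utc(datetime(2026, 5, 17, 6, 0))
```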
All 3 findings legitimate, all applied:
- collect_rss._fetch now returns raw bytes instead of decoded str.
feedparser does its own encoding detection (XML prolog charset, BOM,
Content-Type) and gzip handling when fed bytes; pre-decoding to UTF-8 with
errors='replace' silently defeats that and would also mojibake non-UTF-8
feeds. collect() already passes the result directly to feedparser.parse,
so no caller change. [MEDIUM]
- collect_github._get_page reads MAX_RESPONSE_BYTES + 1 and refuses any
response that hits the cap, instead of decoding a truncated body with
errors='replace' that could feed corrupt JSON into json.loads. The strict
utf-8 decode now surfaces a real encoding bug as UnicodeDecodeError (also
caught) instead of being papered over. [MEDIUM]
- report.render escapes `>` in URLs to %3E before wrapping in the Markdown
`<...>` angle pair. A query string like `?q=a>b` would otherwise close the
URL pair early and break link parsing. New test_gt_in_url_is_escaped pins
it. [MEDIUM]
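The "read one byte past the cap, then refuse" guard can be sketched as below. `read_capped` and `ResponseTooLarge` are hypothetical names; in real use the stream would be the response object returned by `urllib.request.urlopen(url, timeout=30)`, replaced here by `io.BytesIO` so the example runs offline.

```python
import io

MAX_RESPONSE_BYTES = 10 * 1024 * 1024  # 10 MiB, as in the commit message


class ResponseTooLarge(Exception):
    """Raised when a response body exceeds the configured cap."""


def read_capped(stream, max_bytes=MAX_RESPONSE_BYTES):
    """Read at most max_bytes from stream, refusing anything larger.

    Reading max_bytes + 1 distinguishes "exactly at the cap" from
    "truncated", so a cut-off body is never passed on as if complete.
    """
    body = stream.read(max_bytes + 1)
    if len(body) > max_bytes:
        raise ResponseTooLarge(f"response exceeds {max_bytes} bytes")
    return body


within_cap = read_capped(io.BytesIO(b"x" * 10), max_bytes=10)
try:
    read_capped(io.BytesIO(b"x" * 11), max_bytes=10)
    oversized_refused = False
except ResponseTooLarge:
    oversized_refused = True
```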
All 8 findings legitimate, all applied. Three logical changes:
1. Explicit UTF-8 on every read_text / write_text in scripts/awsdd/
(config.py, normalize.py, score.py, report.py, collect_rss.py,
collect_github.py). Implicit locale-dependent encoding is a portability
hazard: non-UTF-8 default locales (some Windows environments, exotic CI
runners) would corrupt non-ASCII content silently when we already use
ensure_ascii=False. [MEDIUM ×6]
2. collect_rss._fetch now refuses oversized responses the same way
collect_github does: read MAX_FEED_BYTES + 1, error if the cap was hit.
The old code returned a truncated byte string and let feedparser silently
bozo-error on the malformed tail. [MEDIUM]
3. report.render collapses `\r` and `\n` in titles to spaces before the
bracket-escape pass. A `\n# heading` in a feed title used to spawn a stray
Markdown heading inside the list. Regression test
test_newline_in_title_is_collapsed pins it. [MEDIUM]
Seed reports regenerated; web rebuilds 10 pages.
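Taken together, the link-formatting fixes in this PR (angle-bracket URLs, bracket escaping, `>` encoding, newline collapsing) amount to something like the following. `render_link` is a hypothetical helper; the real logic lives in `report.render`, which is not shown here.

```python
import re


def render_link(title, url):
    """Format one Markdown list link that survives awkward feed content."""
    # 1. Collapse CR/LF so a title cannot start a new Markdown block.
    clean = re.sub(r"[\r\n]+", " ", title)
    # 2. Escape brackets so "[CVE-...]" cannot end the link text early.
    clean = clean.replace("[", r"\[").replace("]", r"\]")
    # 3. Encode ">" so it cannot close the <...> destination early.
    safe_url = url.replace(">", "%3E")
    # 4. The angle-bracket form tolerates parentheses inside the URL.
    return f"[{clean}](<{safe_url}>)"


bracketed = render_link("[CVE-2026-xxx] fix", "https://example.com/a(b)")
multiline = render_link("a\n# heading", "https://example.com/?q=a>b")
```

The newline collapse runs before the bracket escape, matching the order stated in the commit message.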
All 3 findings legitimate, all applied:
- collect_github.collect: add `isinstance(entry, dict)` guard before
`entry.get("repo")`. A null or scalar YAML entry (`- foo`) used to
AttributeError out of the whole track; now it's logged and skipped.
Existing malformed-entry test extended with null and string cases.
[MEDIUM]
- collect_rss.collect: same isinstance guard for `feed`. Existing
malformed-feed test extended with a `null` entry. [MEDIUM]
- score._compiled_patterns: wrap the per-keyword regex compile in
`functools.lru_cache` keyed on the keyword tuple. score_item is
called once per item (hundreds per track per run); without caching
we'd recompile the same primary/secondary patterns every call.
Switched score_item's keyword extraction to tuples so they're
hashable for the cache. No behaviour change — same word-bounded
semantics, just compiled once per distinct keyword set. [MEDIUM]
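The caching change in the last bullet can be sketched as follows. `compiled_patterns` and `count_hits` are illustrative names, and case-insensitive matching is an assumption of this sketch.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def compiled_patterns(keywords):
    """Compile one word-bounded pattern per keyword, once per keyword tuple.

    The argument must be hashable (a tuple, not a list) for lru_cache.
    """
    return tuple(
        re.compile(rf"\b{re.escape(kw)}\b", re.IGNORECASE) for kw in keywords
    )


def count_hits(text, keywords):
    # Convert to a tuple so the same keyword set always hits the cache.
    return sum(1 for p in compiled_patterns(tuple(keywords)) if p.search(text))


first = compiled_patterns(("iam", "sts"))
second = compiled_patterns(("iam", "sts"))  # served from the cache
hits = count_hits("IAM supports STS session tags", ["iam", "sts"])
```

Python's `re` module keeps its own small internal cache of compiled patterns, so the gain here is mainly in skipping the per-call pattern construction and lookup for every item.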
/gemini review
All 5 findings legitimate, all applied:
Defensive exception handling:
- collect_github._get_page: add ValueError to except. urlopen raises
it for malformed URLs ("unknown url type" etc.), e.g. a sources.yaml
entry with a missing scheme. [MEDIUM]
- collect_rss._fetch: same ValueError addition. [MEDIUM]
- score.score: wrap json.loads(normalized.json) in try/except. A
corrupt or truncated file (disk-full mid-write etc.) used to crash
the whole scoring step instead of leaving the previous scored.json
in place. [MEDIUM]
- report.render: same try/except around json.loads(scored.json).
[MEDIUM]
Stale guidance:
- new-track.sh: the "next steps" message still told users to edit
TRACKS in the root Makefile, but Makefile now auto-discovers tracks
via `wildcard tracks/*/`. Updated to point at the two places that
do still need a manual update for a new track:
.github/workflows/{daily-update,weekly-digest}.yml matrix.track
web/src/lib/data.ts TRACKS const
with a footnote that the root Makefile is auto. [MEDIUM]
/gemini review
Code Review
This pull request initializes the aws-deepdive project, an automated system for collecting, scoring, and reporting AWS identity and security updates. The implementation features a Python-based data pipeline for RSS and GitHub release ingestion, a customizable scoring engine, and an Astro-powered web frontend for displaying daily and weekly digests. Additionally, the PR includes comprehensive test coverage, GitHub Actions workflows for automated updates, and scaffolding scripts for extending the project with new tracks. I have no feedback to provide.
Summary
- `scripts/awsdd/`: `collect` (RSS + GitHub Releases) → `normalize` → `score` → `report` as a small shared Python package. Scoring is `(freshness × 2) + (keyword × source_weight) + severity` with word-bounded keyword matching, so generic fresh items only get the freshness baseline while topic-relevant keyword hits get amplified by source trust.
- `iam`, `security`, `whats-new`, `releases`. Each is self-contained under `tracks/<name>/` with `config/sources.yaml`, an identical `Makefile` (derives track name from `CURDIR`), and a track-local README. New tracks are added via `make new-track NAME=<name>` and are auto-picked-up by the root Makefile.
- `normalize` drops items with `published_at` older than 180 days from `normalized.json`; `prune.sh` keeps 30 days of raw and 60 days of daily reports. Weekly reports are kept indefinitely.
- `web/`: Astro 6 + Tailwind v4 (postcss) + recharts island. Static output. Routes: `/`, `/<track>/`, `/<track>/<mode>/<slug>/`. Reads `tracks/*/data/scored.json` and `tracks/*/reports/**/*.md` at build time. Markdown is sanitized via `isomorphic-dompurify`.
- `ci.yml` (lint / type-check / pytest / build), `codeql.yml`, `audit.yml` (pip-audit + npm audit + license report), `daily-update.yml` (06:00 UTC), `weekly-digest.yml` (Mon 08:00 UTC), `deploy-pages.yml` (push to main). Dependabot covers pip, npm, github-actions.
What I verified locally
- `make update` end-to-end on a clean venv.
- `make -C tracks/iam weekly` produces `reports/weekly/2026-W20.md`.
- `npm run build` produces 10 static pages including dynamic `/<track>/<mode>/<slug>/` routes; data loads, links render, no SSR warnings.
- `.markdownlint-cli2.jsonc`.
- `pytest` is green (40 tests, 79% coverage on `scripts/awsdd/`).
- `ruff check` + `ruff format --check` clean.
What I did NOT do
- `deploy-pages.yml` to publish.
- `main` should be gated on `ci.yml` jobs (Settings → Branches → main).
Test plan
- `deploy-pages` manually.
- `/iam/`, click into the latest `daily/<date>`, confirm the rendered Markdown matches `tracks/iam/reports/daily/<date>.md`.
- `daily-update` run; confirm a `chore(<track>): daily update YYYY-MM-DD` commit lands on `main`.
- `weekly-digest` run; confirm a `chore(<track>): weekly digest YYYY-Www` commit lands and a new weekly Markdown appears.