This repository was archived by the owner on Jun 23, 2026. It is now read-only.

Commit 52a8e41

fix: address gemini-code-assist 5th review on PR #1
Both findings are legitimate, and both are applied:

- collect_rss._fetch caps the response at 10 MiB (MAX_FEED_BYTES) via r.read(MAX_FEED_BYTES). Without a bound, a hostile or runaway origin could exhaust the worker's memory. AWS feeds are sub-megabyte; the cap is just a safety net. [MEDIUM]
- collect_rss._strip_html now unescapes BEFORE stripping tags. The old order let entity-encoded markup (`&lt;script&gt;...`) bypass the strip pass and resurface as raw HTML, because html.unescape ran on the regex output. New regression test `test_summary_strips_entity_encoded_tags` locks in the corrected order. DOMPurify on the web rendering side is still the last line of defense, but cleaning at the source is the right layer. [MEDIUM]
1 parent 1145de7 commit 52a8e41

2 files changed

Lines changed: 17 additions & 2 deletions

scripts/awsdd/collect_rss.py

Lines changed: 6 additions & 2 deletions
@@ -16,6 +16,7 @@
 
 USER_AGENT = "aws-deepdive/0.1 (+https://github.com/0-draft/aws-deepdive)"
 FETCH_TIMEOUT = 30  # seconds
+MAX_FEED_BYTES = 10 * 1024 * 1024  # 10 MiB cap to bound memory on hostile / runaway feeds
 EPOCH_ISO = "1970-01-01T00:00:00+00:00"
 
 
@@ -36,15 +37,18 @@ def _fetch(url: str, timeout: int = FETCH_TIMEOUT) -> str | None:
     req = Request(url, headers={"User-Agent": USER_AGENT})
     try:
         with urlopen(req, timeout=timeout) as r:
-            return r.read().decode("utf-8", errors="replace")
+            return r.read(MAX_FEED_BYTES).decode("utf-8", errors="replace")
     except (URLError, TimeoutError) as e:
         print(f"[collect_rss] fetch {url}: {e}")
         return None
 
 
 def _strip_html(text: str) -> str:
-    text = re.sub(r"<[^>]+>", " ", text)
+    # Unescape FIRST so entity-encoded tags like `&lt;script&gt;` are turned
+    # into their bracketed form before the strip pass; otherwise they bypass
+    # the regex and leak through into the report as raw HTML.
     text = html.unescape(text)
+    text = re.sub(r"<[^>]*>", " ", text)
     text = re.sub(r"\s+", " ", text)
     return text.strip()

tests/test_collect_rss.py

Lines changed: 11 additions & 0 deletions
@@ -52,3 +52,14 @@ def test_summary_helper_strips_html():
         "E", (), {"get": lambda self, k, d=None: "<b>hi</b> there" if k == "summary" else d}
     )()
     assert _summary(fake) == "hi there"
+
+
+def test_summary_strips_entity_encoded_tags():
+    # Regression: some feeds double-encode tags as `&lt;script&gt;...`.
+    # The old strip-then-unescape order let those leak through as raw HTML.
+    payload = "&lt;script&gt;alert(1)&lt;/script&gt;hi"
+    fake = type("E", (), {"get": lambda self, k, d=None: payload if k == "summary" else d})()
+    out = _summary(fake)
+    assert "<script>" not in out
+    assert "&lt;" not in out
+    assert "hi" in out
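
To see why the order matters, the two orderings can be compared side by side on the payload from the regression test. This is a standalone sketch, not the module's code: the function names `strip_then_unescape` and `unescape_then_strip` are made up for illustration, and both use the new `<[^>]*>` pattern so that only the ordering differs.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]*>")  # pattern from the new _strip_html
WS_RE = re.compile(r"\s+")


def strip_then_unescape(text):
    # Old order: the regex sees no "<", so nothing is stripped, and
    # html.unescape then turns the entities back into real tags.
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    return WS_RE.sub(" ", text).strip()


def unescape_then_strip(text):
    # New order: entities become "<...>" first, so the regex removes them.
    text = html.unescape(text)
    text = TAG_RE.sub(" ", text)
    return WS_RE.sub(" ", text).strip()


payload = "&lt;script&gt;alert(1)&lt;/script&gt;hi"
print(strip_then_unescape(payload))  # <script>alert(1)</script>hi
print(unescape_then_strip(payload))  # alert(1) hi
```

The text between the tags (`alert(1)`) survives in both cases, which is consistent with the test asserting only that `<script>` and `&lt;` are absent from the output.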
