Conversation
```python
            ), "Failed to emit FINDING for interesting archived file"
            for e in events:
                if e.type == "FINDING" and "site.zip" in e.data.get("description", ""):
                    assert "web.archive.org" in e.data["url"]
```
Check failure
Code scanning / CodeQL
Incomplete URL substring sanitization
Copilot Autofix
AI 2 months ago
In general, to fix incomplete URL substring sanitization, you should parse the URL and inspect its hostname (and possibly scheme) rather than checking for an allowed domain as a raw substring. For example, use urllib.parse.urlparse to extract hostname and compare it exactly to web.archive.org or check for allowed subdomains.
For this specific test, we should avoid checking "web.archive.org" in e.data["url"] and instead parse e.data["url"] with urlparse, get the hostname, and assert that it equals "web.archive.org" (or, if subdomains are expected, use .endswith(".web.archive.org")). This preserves the intent—ensuring the URL actually points to web.archive.org—without relying on substring position. Concretely:
- Add an import for `urlparse` from `urllib.parse` at the top of `bbot/test/test_step_2/module_tests/test_module_wayback.py` (alongside the existing `unquote` import).
- Replace the assertion at line 86 with logic that parses `e.data["url"]` and asserts that the parsed hostname is `web.archive.org`.
- Keep the rest of the test logic unchanged so existing behavior is preserved.
```diff
@@ -1,5 +1,5 @@
 import re
-from urllib.parse import unquote
+from urllib.parse import unquote, urlparse

 from werkzeug.wrappers import Response

@@ -83,7 +83,8 @@
         ), "Failed to emit FINDING for interesting archived file"
         for e in events:
             if e.type == "FINDING" and "site.zip" in e.data.get("description", ""):
-                assert "web.archive.org" in e.data["url"]
+                parsed_url = urlparse(e.data["url"])
+                assert parsed_url.hostname == "web.archive.org"


 class TestWaybackArchive(ModuleTestBase):
```
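The pattern behind this autofix (and the two identical findings below) is the same: compare the parsed hostname instead of searching the raw URL string. A minimal standalone sketch, with an illustrative helper name:

```python
from urllib.parse import urlparse

def is_archive_host(url: str) -> bool:
    # Parse the URL and compare the hostname exactly, rather than
    # searching for the domain as a substring anywhere in the URL.
    hostname = urlparse(url).hostname
    return hostname == "web.archive.org" or (
        hostname is not None and hostname.endswith(".web.archive.org")
    )

# A raw substring check would pass for all three of these;
# the hostname check rejects the last two.
print(is_archive_host("https://web.archive.org/web/2020/http://example.com/site.zip"))  # True
print(is_archive_host("https://web.archive.org.evil.com/site.zip"))  # False
print(is_archive_host("https://evil.com/?q=web.archive.org"))  # False
```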
```python
                f"HTTP_RESPONSE url should contain original host, got: {e.data['url']}"
            )
            # archive_url should contain the archive.org provenance URL
            assert "web.archive.org" in e.data.get("archive_url", ""), (
```
Check failure
Code scanning / CodeQL
Incomplete URL substring sanitization
Copilot Autofix
AI 2 months ago
In general, the problem is that the code checks for "web.archive.org" as a substring of the whole URL string. To fix this class of issues, the URL should be parsed with urllib.parse.urlparse, and the check should be applied to the hostname component (or path, if that is what matters), rather than the entire string. That ensures we are actually verifying the domain and not an arbitrary substring.
The best minimal fix here is to parse archive_url and assert on its hostname. Specifically, in TestWaybackArchiveHostField.check, instead of:
`assert "web.archive.org" in e.data.get("archive_url", ""), ...` we should:
- Retrieve the `archive_url` string.
- Parse it with `urllib.parse.urlparse`.
- Assert that `parsed.hostname == "web.archive.org"` (or at least that it ends with `.web.archive.org` if subdomains should be allowed).
This change keeps the existing semantics—verifying that the provenance URL is indeed pointing at archive.org—while avoiding substring-based URL checks. To implement this, we need to import urlparse (or urlsplit) from urllib.parse at the top of bbot/test/test_step_2/module_tests/test_module_wayback.py and update the assertion accordingly. No other functional changes are required.
```diff
@@ -1,5 +1,5 @@
 import re
-from urllib.parse import unquote
+from urllib.parse import unquote, urlparse

 from werkzeug.wrappers import Response

@@ -251,8 +251,10 @@
                 f"HTTP_RESPONSE url should contain original host, got: {e.data['url']}"
             )
             # archive_url should contain the archive.org provenance URL
-            assert "web.archive.org" in e.data.get("archive_url", ""), (
-                f"HTTP_RESPONSE archive_url should be the archive.org URL, got: {e.data.get('archive_url')}"
+            archive_url = e.data.get("archive_url", "")
+            archive_host = urlparse(archive_url).hostname if archive_url else None
+            assert archive_host == "web.archive.org", (
+                f"HTTP_RESPONSE archive_url should be the archive.org URL, got: {archive_url}"
             )
             # event.host should be the original host
             assert str(e.host) != "web.archive.org", f"event.host should be original host, got: {e.host}"
```
```python
            assert "archive_url" in finding.data, (
                f"Hunt FINDING should have archive_url for provenance, got: {finding.data}"
            )
            assert "web.archive.org" in finding.data["archive_url"], (
```
Check failure
Code scanning / CodeQL
Incomplete URL substring sanitization
Copilot Autofix
AI 2 months ago
In general, the way to fix incomplete URL substring sanitization is to parse the URL using a standard library, extract the hostname, and then compare that hostname (or a suffix of it) to the expected allowed host, instead of checking for a substring in the raw URL string.
In this specific case, we should change the assertion that currently does assert "web.archive.org" in finding.data["archive_url"] so that it parses archive_url with urllib.parse.urlparse, extracts .hostname, and asserts that the hostname is exactly web.archive.org. This preserves the intended functionality (“archive_url should be archive.org URL”) while avoiding arbitrary substring matches. Concretely, within TestWaybackParameters.check, around lines 309–315, we will introduce a local variable such as archive_url_host = urlparse(finding.data["archive_url"]).hostname and assert archive_url_host == "web.archive.org". To do this, we must import urlparse from urllib.parse at the top of the test file, alongside the existing unquote import. No other behavior in the tests needs to change.
```diff
@@ -1,5 +1,5 @@
 import re
-from urllib.parse import unquote
+from urllib.parse import unquote, urlparse

 from werkzeug.wrappers import Response

@@ -310,8 +310,10 @@
             assert "archive_url" in finding.data, (
                 f"Hunt FINDING should have archive_url for provenance, got: {finding.data}"
             )
-            assert "web.archive.org" in finding.data["archive_url"], (
-                f"Hunt FINDING archive_url should be archive.org URL, got: {finding.data['archive_url']}"
+            archive_url_host = urlparse(finding.data["archive_url"]).hostname
+            assert archive_url_host == "web.archive.org", (
+                f"Hunt FINDING archive_url should be archive.org URL, got host: {archive_url_host}, "
+                f"full URL: {finding.data['archive_url']}"
             )

             # WEB_PARAMETERs from archived content should also have archive_url
```
bro its a draft step off
📊 Performance Benchmark Report
📈 Detailed Results (All Benchmarks)
🎯 Performance Summary: 1 regression ⚠️, 23 unchanged ✅
🔍 Significant Changes (>10%)
🐍 Python Version: 3.11.15
Codecov Report
❌ Patch coverage is
Additional details and impacted files

```
@@              Coverage Diff               @@
##  blasthttp-integration-clean  #2909  +/- ##
============================================================
- Coverage      91%     91%    -0%
============================================================
  Files         443     443
  Lines       38305   39030   +725
============================================================
+ Hits        34621   35243   +622
- Misses       3684    3787   +103
```

☔ View full report in Codecov by Sentry.
```diff
     async def run_in_executor_mp(self, callback, *args, **kwargs):
         """
-        Same as run_in_executor() except with a process pool executor
-        Use only in cases where callback is CPU-bound
+        Same as run_in_executor() except with a process pool executor.
+        Use only in cases where callback is CPU-bound.
+
+        Includes a timeout (default 300s) to prevent indefinite hangs if a
+        child process dies or the pool enters a broken state.
+
+        Pass ``_timeout=seconds`` to override the default timeout.

         Examples:
             Execute callback:
             >>> result = await self.helpers.run_in_executor_mp(callback_fn, arg1, arg2)
         """
+        timeout = kwargs.pop("_timeout", 300)
         callback = partial(callback, **kwargs)
-        return self.loop.run_in_executor(self.process_pool, callback, *args)
+        future = self.loop.run_in_executor(self.process_pool, callback, *args)
```
This timeout needs to be applied one level deeper: it cancels only the awaiting coroutine, but leaves the stuck task executing in the process pool, thereby occupying a child process indefinitely.
- We should be executing network/api requests in the main thread
- We should only be submitting cpu-bound tasks (i.e. bulk URL parsing) to the process pool
It was already only CPU-bound tasks; there were no network requests here.
Include the actual failure reason (timeout, connection error, HTTP status code) in retry and warning messages so it's clear why archive.org requests failed. Increase CDX timeout from +30s to +60s.
- Add max_records option (default 100000) for CDX API limit
- Only retry archive fetches on connection errors/429, not on definitive HTTP status codes
- Change "Loading archived URLs" message from hugeinfo to verbose
- Update retry test to use ReadError instead of 503
- wayback: override `_incoming_dedup_hash` for URL events to prevent subdomain_enum's domain-based dedup from collapsing distinct URLs
- wayback: fix FINDING confidence "MODERATE" -> "MEDIUM" (valid level)
- wayback: use individual requests instead of request_batch for interesting file HEAD checks
- subdomain_enum: revert is_target exemption from wildcard rejection
The _hang_forever worker process outlives the test and blocks Python's threading._shutdown via the ProcessPoolExecutor management thread. Terminate stuck workers after the test and add a safety net in pytest_sessionfinish.
Previously, asyncio.wait_for() only cancelled the awaiting coroutine but left the child process running indefinitely. On a 4-core machine, just 4 stuck workers would permanently stall the scan. Now on timeout we terminate all workers, replace the pool, and continue cleanly.
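The failure mode described here — `asyncio.wait_for()` cancelling only the awaiting coroutine while the worker keeps running — can be reproduced in miniature with a thread pool, which shares the same `run_in_executor` semantics (a sketch, not BBOT's code):

```python
import asyncio
import threading
import time
from concurrent.futures import ThreadPoolExecutor

finished = threading.Event()  # set by the worker when it finally completes

def slow_task():
    # Stand-in for a CPU-bound callback that outlives the timeout
    time.sleep(0.5)
    finished.set()

async def main():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = loop.run_in_executor(pool, slow_task)
        try:
            await asyncio.wait_for(future, timeout=0.1)
            return False
        except asyncio.TimeoutError:
            # Only the awaiting side was cancelled; the worker is still
            # running, and the pool shutdown below waits for it anyway.
            return True

timed_out = asyncio.run(main())
print(timed_out, finished.is_set())  # → True True
```

The fix the commit describes goes one step further for process pools: on timeout, terminate the stuck workers and replace the pool, since a running child process (unlike a thread) can be killed.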
…tense to wayback-heavy URL_UNVERIFIED/URL events changed from string to dict data in 3.0 merge. Fix event.data -> event.url for hash, clean_url, and test assertions. Add filter_event override to skip subdomain_enum filtering for URL events.
Stop auto-copying archive_url from parent to child event data dicts, which could infect live HTTP responses downstream. Instead, use the from-wayback tag as a signal and traverse upward to find the nearest archive_url when needed (via new event.archive_url property).
Force-pushed 93fee25 to 9aefafc (compare)
…st→targets, VULNERABILITY→FINDING, httpx→http module refs
Force-pushed 9aefafc to bce4b86 (compare)
Filter ftp:// and other non-HTTP URLs in _pre_process_urls() before they enter the archive cache. Truncate event data in ValidationError messages to 200 chars to prevent terminal flooding.
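The scheme filter described in this commit can be sketched as follows; `_pre_process_urls` internals are not shown in the thread, so the structure and names here are illustrative:

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}

def pre_process_urls(urls):
    # Hypothetical sketch: drop ftp:// and other non-HTTP URLs
    # before they enter the archive cache.
    for url in urls:
        if urlparse(url).scheme.lower() in ALLOWED_SCHEMES:
            yield url

print(list(pre_process_urls(["http://a.com/x", "ftp://a.com/y", "https://b.com/z"])))
# → ['http://a.com/x', 'https://b.com/z']
```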
Summary
Major upgrade to the wayback module with new capabilities for URL discovery, parameter extraction, and archived content retrieval.
Wayback Module Overhaul
- URL discovery (capped by `max_records`) with server-side filtering, bloom filter dedup, and garbage/crawler-trap detection
- Parameter extraction (`parameters=true`): emits `WEB_PARAMETER` events for query params found in archived URLs, paired with live URL events for downstream modules like lightfuzz
- Archive retrieval (`archive=true`): fetches archived snapshots of dead URLs from the Wayback Machine and emits them as `HTTP_RESPONSE` events for excavate to process (extracting params, secrets, etc.)
- Checks archived files with interesting extensions (`.zip`, `.sql`, `.bak`, `.env`, etc.) and emits FINDINGs for files that still exist
- `wayback` (URL discovery) and `wayback-intense` (full archive + parameter extraction)

Excavate Enhancements
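The discovery query can be sketched against the public Wayback CDX API. The parameter names below (`url`, `matchType`, `fl`, `collapse`, `limit`) are documented CDX API fields; the module's exact query and filters may differ:

```python
from urllib.parse import urlencode

def build_cdx_query(domain: str, max_records: int = 100000) -> str:
    # Illustrative server-side-filtered CDX query honoring max_records
    params = {
        "url": domain,
        "matchType": "domain",  # include subdomains
        "output": "json",
        "fl": "original",       # return only the original-URL column
        "collapse": "urlkey",   # server-side dedup of identical URLs
        "limit": max_records,
    }
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)

print(build_cdx_query("example.com"))
```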
- `_event_host()` and `_event_base_url()` methods so excavate correctly attributes findings from archived content to the original target (not archive.org)
- `DirectoryListingExtractor`: YARA-based submodule detecting Apache, Nginx, IIS, and Python directory listings
- `from-wayback` tag propagates through the event chain

Core / Infrastructure
- `run_in_executor_mp` now has a 300s timeout; on timeout, stuck workers are killed and the pool is replaced
- `max_tasks_per_child=25` on Python 3.11+ prevents memory leaks
- `__getstate__`/`__setstate__` for proper serialization across process boundaries
- "You don't have permission to access "
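The serialization hook mentioned above can be sketched generically; the class and attribute names here are illustrative, not BBOT's:

```python
import pickle

class Helper:
    """Object with an unpicklable member that must cross a process
    boundary (e.g. to a ProcessPoolExecutor worker)."""

    def __init__(self):
        self.config = {"max_records": 100000}
        self._loop = object()  # stand-in for an unpicklable resource

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("_loop", None)  # drop what can't be pickled
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._loop = None  # re-initialize lazily in the child process

h = pickle.loads(pickle.dumps(Helper()))
print(h.config["max_records"], h._loop)  # → 100000 None
```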