
Wayback upgrade#2909

Open
liquidsec wants to merge 33 commits into blasthttp-integration-clean from wayback-upgrade

Conversation

@liquidsec
Contributor

@liquidsec liquidsec commented Feb 19, 2026

Summary

Major upgrade to the wayback module with new capabilities for URL discovery, parameter extraction, and archived content retrieval.

Wayback Module Overhaul

  • URL discovery: CDX API now fetches up to 100k URLs (configurable via max_records) with server-side filtering, bloom filter dedup, and garbage/crawler-trap detection
  • Parameter extraction (parameters=true): Emits WEB_PARAMETER events for query params found in archived URLs, paired with live URL events for downstream modules like lightfuzz
  • Archive retrieval (archive=true): Fetches archived snapshots of dead URLs from the Wayback Machine, emits them as HTTP_RESPONSE events for excavate to process (extracting params, secrets, etc.)
  • Interesting file detection: HEAD-checks archived URLs with notable extensions (.zip, .sql, .bak, .env, etc.) and emits FINDINGs for files that still exist
  • Rate limiting & resilience: Reactive rate limiting with exponential backoff on 429s, retry logic, HEAD pre-checks before full archive fetches
  • New presets: wayback (URL discovery) and wayback-intense (full archive + parameter extraction)
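
The CDX side of this flow can be sketched as follows (a minimal illustration; the function names and garbage heuristics are hypothetical, not BBOT's actual implementation, though the CDX query parameters — `url`, `matchType`, `output`, `fl`, `collapse`, `filter`, `limit` — are real Wayback CDX API options):

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def build_cdx_query(domain: str, max_records: int = 100_000) -> str:
    """Build a CDX query that pushes as much filtering server-side as possible."""
    params = {
        "url": domain,
        "matchType": "domain",        # include subdomains
        "output": "json",
        "fl": "original",             # return only the original URL field
        "collapse": "urlkey",         # server-side dedup of identical URL keys
        "filter": "!statuscode:404",  # skip archived 404s
        "limit": max_records,
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

def looks_like_crawler_trap(url: str, max_len: int = 500, max_repeats: int = 10) -> bool:
    """Cheap garbage heuristic: absurdly long URLs, or the same path
    segment repeated many times (calendar/session-style traps)."""
    if len(url) > max_len:
        return True
    segments = [s for s in url.split("/") if s]
    return any(segments.count(s) > max_repeats for s in set(segments))
```

Server-side `collapse`/`filter` keeps the response small before client-side bloom-filter dedup even runs.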

Excavate Enhancements

  • Archive-aware host/URL resolution: new _event_host() and _event_base_url() methods so excavate correctly attributes findings from archived content to the original target (not archive.org)
  • New DirectoryListingExtractor YARA-based submodule detecting Apache, Nginx, IIS, and Python directory listings
  • from-wayback tag propagates through the event chain
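
The host-attribution fix can be illustrated with a standalone helper (hypothetical names; the real module implements this via `_event_host()`/`_event_base_url()`). Wayback snapshot URLs embed the original URL after the timestamp, so it can be recovered by parsing:

```python
import re
from urllib.parse import urlparse

# e.g. https://web.archive.org/web/20200101000000/https://example.com/page
ARCHIVE_RE = re.compile(
    r"^https?://web\.archive\.org/web/\d+(?:id_)?/(?P<orig>https?://.+)$", re.I
)

def original_url(url: str) -> str:
    """Return the original URL embedded in a Wayback snapshot URL,
    or the URL unchanged if it is not an archive URL."""
    m = ARCHIVE_RE.match(url)
    return m.group("orig") if m else url

def original_host(url: str):
    return urlparse(original_url(url)).hostname
```

This is why findings from archived content get attributed to the original target rather than to archive.org.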

Core / Infrastructure

  • Process pool hardening: run_in_executor_mp now has a 300s timeout; on timeout, stuck workers are killed and the pool is replaced. max_tasks_per_child=25 on Python 3.11+ prevents memory leaks
  • Scan shutdown: Process pool workers are now terminated/killed on shutdown instead of waiting indefinitely
  • BBOTTarget pickle support: Added __getstate__/__setstate__ for proper serialization across process boundaries
  • httpx module: Discard 4xx responses containing WAF strings
  • New WAF string: "You don't have permission to access "
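
The pickle change can be sketched with a simplified stand-in class (not the actual BBOTTarget): `__getstate__` drops unpicklable runtime members so the object can cross a process boundary, and `__setstate__` recreates them on the far side:

```python
import pickle
import threading

class Target:
    """Simplified stand-in for an object that must cross process boundaries."""

    def __init__(self, *hosts: str):
        self.hosts = set(hosts)
        self._lock = threading.Lock()  # unpicklable runtime state

    def __getstate__(self):
        # Serialize only plain data; threading.Lock cannot be pickled.
        return {"hosts": self.hosts}

    def __setstate__(self, state):
        self.hosts = state["hosts"]
        self._lock = threading.Lock()  # recreate runtime state after unpickling

target = Target("example.com", "evilcorp.com")
clone = pickle.loads(pickle.dumps(target))
```

Without the two hooks, `pickle.dumps(target)` would raise `TypeError` on the lock.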

@liquidsec liquidsec marked this pull request as draft February 19, 2026 03:01
@liquidsec liquidsec mentioned this pull request Feb 19, 2026
), "Failed to emit FINDING for interesting archived file"
for e in events:
if e.type == "FINDING" and "site.zip" in e.data.get("description", ""):
assert "web.archive.org" in e.data["url"]

Check failure

Code scanning / CodeQL

Incomplete URL substring sanitization

The string "web.archive.org" may be at an arbitrary position in the sanitized URL.

Copilot Autofix

AI 2 months ago

In general, to fix incomplete URL substring sanitization, you should parse the URL and inspect its hostname (and possibly scheme) rather than checking for an allowed domain as a raw substring. For example, use urllib.parse.urlparse to extract hostname and compare it exactly to web.archive.org or check for allowed subdomains.

For this specific test, we should avoid checking "web.archive.org" in e.data["url"] and instead parse e.data["url"] with urlparse, get the hostname, and assert that it equals "web.archive.org" (or, if subdomains are expected, use .endswith(".web.archive.org")). This preserves the intent—ensuring the URL actually points to web.archive.org—without relying on substring position. Concretely:

  • Add an import for urlparse from urllib.parse at the top of bbot/test/test_step_2/module_tests/test_module_wayback.py (alongside the existing unquote import).
  • Replace the assertion at line 86 with logic that parses e.data["url"] and asserts that the parsed hostname is web.archive.org.
  • Keep the rest of the test logic unchanged so existing behavior is preserved.
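
The difference between the two checks is easy to demonstrate (a self-contained illustration of the CodeQL finding, not project code): a substring match accepts URLs that merely mention the domain anywhere, while a parsed-hostname comparison does not:

```python
from urllib.parse import urlparse

def is_archive_url(url: str) -> bool:
    # Compare the parsed hostname exactly, not a raw substring.
    return urlparse(url).hostname == "web.archive.org"

good = "https://web.archive.org/web/20200101000000/https://example.com/"
evil = "https://evil.example/?next=web.archive.org"

assert is_archive_url(good)
assert not is_archive_url(evil)
assert "web.archive.org" in evil  # the flawed substring check would pass here
```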
Suggested changeset 1
bbot/test/test_step_2/module_tests/test_module_wayback.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/bbot/test/test_step_2/module_tests/test_module_wayback.py b/bbot/test/test_step_2/module_tests/test_module_wayback.py
--- a/bbot/test/test_step_2/module_tests/test_module_wayback.py
+++ b/bbot/test/test_step_2/module_tests/test_module_wayback.py
@@ -1,5 +1,5 @@
 import re
-from urllib.parse import unquote
+from urllib.parse import unquote, urlparse
 
 from werkzeug.wrappers import Response
 
@@ -83,7 +83,8 @@
         ), "Failed to emit FINDING for interesting archived file"
         for e in events:
             if e.type == "FINDING" and "site.zip" in e.data.get("description", ""):
-                assert "web.archive.org" in e.data["url"]
+                parsed_url = urlparse(e.data["url"])
+                assert parsed_url.hostname == "web.archive.org"
 
 
 class TestWaybackArchive(ModuleTestBase):
EOF
Copilot is powered by AI and may make mistakes. Always verify output.
Unable to commit as this autofix suggestion is now outdated
f"HTTP_RESPONSE url should contain original host, got: {e.data['url']}"
)
# archive_url should contain the archive.org provenance URL
assert "web.archive.org" in e.data.get("archive_url", ""), (

Check failure

Code scanning / CodeQL

Incomplete URL substring sanitization

The string "web.archive.org" may be at an arbitrary position in the sanitized URL.

Copilot Autofix

AI 2 months ago

In general, the problem is that the code checks for "web.archive.org" as a substring of the whole URL string. To fix this class of issues, the URL should be parsed with urllib.parse.urlparse, and the check should be applied to the hostname component (or path, if that is what matters), rather than the entire string. That ensures we are actually verifying the domain and not an arbitrary substring.

The best minimal fix here is to parse archive_url and assert on its hostname. Specifically, in TestWaybackArchiveHostField.check, instead of:

assert "web.archive.org" in e.data.get("archive_url", ""), ...

we should:

  1. Retrieve the archive_url string.
  2. Parse it with urllib.parse.urlparse.
  3. Assert that parsed.hostname == "web.archive.org" (or at least that it ends with .web.archive.org, if subdomains should be allowed).

This change keeps the existing semantics—verifying that the provenance URL is indeed pointing at archive.org—while avoiding substring-based URL checks. To implement this, we need to import urlparse (or urlsplit) from urllib.parse at the top of bbot/test/test_step_2/module_tests/test_module_wayback.py and update the assertion accordingly. No other functional changes are required.

Suggested changeset 1
bbot/test/test_step_2/module_tests/test_module_wayback.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/bbot/test/test_step_2/module_tests/test_module_wayback.py b/bbot/test/test_step_2/module_tests/test_module_wayback.py
--- a/bbot/test/test_step_2/module_tests/test_module_wayback.py
+++ b/bbot/test/test_step_2/module_tests/test_module_wayback.py
@@ -1,5 +1,5 @@
 import re
-from urllib.parse import unquote
+from urllib.parse import unquote, urlparse
 
 from werkzeug.wrappers import Response
 
@@ -251,8 +251,10 @@
                 f"HTTP_RESPONSE url should contain original host, got: {e.data['url']}"
             )
             # archive_url should contain the archive.org provenance URL
-            assert "web.archive.org" in e.data.get("archive_url", ""), (
-                f"HTTP_RESPONSE archive_url should be the archive.org URL, got: {e.data.get('archive_url')}"
+            archive_url = e.data.get("archive_url", "")
+            archive_host = urlparse(archive_url).hostname if archive_url else None
+            assert archive_host == "web.archive.org", (
+                f"HTTP_RESPONSE archive_url should be the archive.org URL, got: {archive_url}"
             )
             # event.host should be the original host
             assert str(e.host) != "web.archive.org", f"event.host should be original host, got: {e.host}"
EOF
Unable to commit as this autofix suggestion is now outdated
assert "archive_url" in finding.data, (
f"Hunt FINDING should have archive_url for provenance, got: {finding.data}"
)
assert "web.archive.org" in finding.data["archive_url"], (

Check failure

Code scanning / CodeQL

Incomplete URL substring sanitization

The string "web.archive.org" may be at an arbitrary position in the sanitized URL.

Copilot Autofix

AI 2 months ago

In general, the way to fix incomplete URL substring sanitization is to parse the URL using a standard library, extract the hostname, and then compare that hostname (or a suffix of it) to the expected allowed host, instead of checking for a substring in the raw URL string.

In this specific case, we should change the assertion that currently does assert "web.archive.org" in finding.data["archive_url"] so that it parses archive_url with urllib.parse.urlparse, extracts .hostname, and asserts that the hostname is exactly web.archive.org. This preserves the intended functionality (“archive_url should be archive.org URL”) while avoiding arbitrary substring matches. Concretely, within TestWaybackParameters.check, around lines 309–315, we will introduce a local variable such as archive_url_host = urlparse(finding.data["archive_url"]).hostname and assert archive_url_host == "web.archive.org". To do this, we must import urlparse from urllib.parse at the top of the test file, alongside the existing unquote import. No other behavior in the tests needs to change.

Suggested changeset 1
bbot/test/test_step_2/module_tests/test_module_wayback.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/bbot/test/test_step_2/module_tests/test_module_wayback.py b/bbot/test/test_step_2/module_tests/test_module_wayback.py
--- a/bbot/test/test_step_2/module_tests/test_module_wayback.py
+++ b/bbot/test/test_step_2/module_tests/test_module_wayback.py
@@ -1,5 +1,5 @@
 import re
-from urllib.parse import unquote
+from urllib.parse import unquote, urlparse
 
 from werkzeug.wrappers import Response
 
@@ -310,8 +310,10 @@
             assert "archive_url" in finding.data, (
                 f"Hunt FINDING should have archive_url for provenance, got: {finding.data}"
             )
-            assert "web.archive.org" in finding.data["archive_url"], (
-                f"Hunt FINDING archive_url should be archive.org URL, got: {finding.data['archive_url']}"
+            archive_url_host = urlparse(finding.data["archive_url"]).hostname
+            assert archive_url_host == "web.archive.org", (
+                f"Hunt FINDING archive_url should be archive.org URL, got host: {archive_url_host}, "
+                f"full URL: {finding.data['archive_url']}"
             )
 
         # WEB_PARAMETERs from archived content should also have archive_url
EOF
Unable to commit as this autofix suggestion is now outdated
Contributor Author

bro its a draft step off

@github-actions
Contributor

github-actions Bot commented Feb 19, 2026

📊 Performance Benchmark Report

Comparing blasthttp-integration-clean (baseline) vs wayback-upgrade (current)

📈 Detailed Results (All Benchmarks)

📋 Complete results for all benchmarks - includes both significant and insignificant changes

| 🧪 Test Name | 📏 Base | 📏 Current | 📈 Change | 🎯 Status |
|---|---|---|---|---|
| Bloom Filter Dns Mutation Tracking Performance | 4.16ms | 4.21ms | +1.0% | |
| Bloom Filter Large Scale Dns Brute Force | 17.26ms | 19.17ms | +11.1% | 🟡🟡 ⚠️ |
| Large Closest Match Lookup | 349.94ms | 355.80ms | +1.7% | |
| Realistic Closest Match Workload | 191.90ms | 190.59ms | -0.7% | |
| Event Memory Medium Scan | 1782 B/event | 1781 B/event | -0.1% | |
| Event Memory Large Scan | 1768 B/event | 1768 B/event | +0.0% | |
| Event Validation Full Scan Startup Small Batch | 411.20ms | 409.05ms | -0.5% | |
| Event Validation Full Scan Startup Large Batch | 575.78ms | 582.40ms | +1.2% | |
| Make Event Autodetection Small | 30.68ms | 30.55ms | -0.4% | |
| Make Event Autodetection Large | 312.38ms | 312.88ms | +0.2% | |
| Make Event Explicit Types | 13.65ms | 13.76ms | +0.8% | |
| Excavate Single Thread Small | 3.884s | 3.957s | +1.9% | |
| Excavate Single Thread Large | 9.398s | 9.565s | +1.8% | |
| Excavate Parallel Tasks Small | 4.092s | 4.265s | +4.2% | |
| Excavate Parallel Tasks Large | 6.481s | 6.657s | +2.7% | |
| Is Ip Performance | 3.15ms | 3.25ms | +3.3% | |
| Make Ip Type Performance | 11.46ms | 11.81ms | +3.1% | |
| Mixed Ip Operations | 4.51ms | 4.58ms | +1.6% | |
| Memory Use Web Crawl | 160.7 MB | 150.8 MB | -6.1% | |
| Memory Use Subdomain Enum | 19.4 MB | 19.4 MB | +0.1% | |
| Scan Throughput 100 | 4.743s | 4.633s | -2.3% | |
| Scan Throughput 1000 | 36.429s | 37.073s | +1.8% | |
| Typical Queue Shuffle | 62.89µs | 63.77µs | +1.4% | |
| Priority Queue Shuffle | 710.77µs | 735.64µs | +3.5% | |

🎯 Performance Summary

  • 1 regression ⚠️
  • 23 unchanged ✅

🔍 Significant Changes (>10%)

  • Bloom Filter Large Scale Dns Brute Force: 11.1% 🐌 slower

🐍 Python Version 3.11.15

@liquidsec liquidsec changed the base branch from dev to 3.0 February 28, 2026 18:31
@codecov

codecov Bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 86.11825% with 108 lines in your changes missing coverage. Please review.
✅ Project coverage is 91%. Comparing base (6670cc1) to head (246f74e).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| bbot/modules/wayback.py | 78% | 90 Missing ⚠️ |
| bbot/test/conftest.py | 0% | 7 Missing ⚠️ |
| bbot/core/event/base.py | 74% | 5 Missing ⚠️ |
| bbot/test/test_step_1/test_helpers.py | 77% | 3 Missing ⚠️ |
| bbot/modules/http.py | 67% | 2 Missing ⚠️ |
| ...st/test_step_2/module_tests/test_module_wayback.py | 100% | 1 Missing ⚠️ |
Additional details and impacted files
@@                     Coverage Diff                      @@
##           blasthttp-integration-clean   #2909    +/-   ##
============================================================
- Coverage                           91%     91%    -0%     
============================================================
  Files                              443     443            
  Lines                            38305   39030   +725     
============================================================
+ Hits                             34621   35243   +622     
- Misses                            3684    3787   +103     

☔ View full report in Codecov by Sentry.

@liquidsec liquidsec marked this pull request as ready for review March 23, 2026 18:32
Comment on lines +218 to +234
    async def run_in_executor_mp(self, callback, *args, **kwargs):
        """
-       Same as run_in_executor() except with a process pool executor
-       Use only in cases where callback is CPU-bound
+       Same as run_in_executor() except with a process pool executor.
+       Use only in cases where callback is CPU-bound.
+
+       Includes a timeout (default 300s) to prevent indefinite hangs if a
+       child process dies or the pool enters a broken state.
+
+       Pass ``_timeout=seconds`` to override the default timeout.

        Examples:
            Execute callback:
            >>> result = await self.helpers.run_in_executor_mp(callback_fn, arg1, arg2)
        """
+       timeout = kwargs.pop("_timeout", 300)
        callback = partial(callback, **kwargs)
-       return self.loop.run_in_executor(self.process_pool, callback, *args)
+       future = self.loop.run_in_executor(self.process_pool, callback, *args)
Collaborator

This timeout needs to be applied one level deeper, because it cancels the awaiting of the coroutine, but leaves the stuck task executing in the process pool, thereby taking up a child process indefinitely.

  1. We should be executing network/api requests in the main thread
  2. We should only be submitting cpu-bound tasks (i.e. bulk URL parsing) to the process pool

Contributor Author

it was already only cpu-bound tasks, there was no network requests here

Collaborator

@TheTechromancer TheTechromancer left a comment


.

@liquidsec liquidsec changed the title Wayback upgrade Major wayback module upgrade + gowitness fix + baddns 2.0 / badsecrets 1.0 Mar 25, 2026
@liquidsec liquidsec changed the title Major wayback module upgrade + gowitness fix + baddns 2.0 / badsecrets 1.0 Wayback upgrade Mar 25, 2026
- wayback: override _incoming_dedup_hash for URL events to prevent
  subdomain_enum's domain-based dedup from collapsing distinct URLs
- wayback: fix FINDING confidence "MODERATE" -> "MEDIUM" (valid level)
- wayback: use individual requests instead of request_batch for
  interesting file HEAD checks
- subdomain_enum: revert is_target exemption from wildcard rejection
The _hang_forever worker process outlives the test and blocks
Python's threading._shutdown via the ProcessPoolExecutor
management thread. Terminate stuck workers after the test and
add a safety net in pytest_sessionfinish.
Previously, asyncio.wait_for() only cancelled the awaiting coroutine
but left the child process running indefinitely. On a 4-core machine,
just 4 stuck workers would permanently stall the scan. Now on timeout
we terminate all workers, replace the pool, and continue cleanly.
…tense to wayback-heavy

URL_UNVERIFIED/URL events changed from string to dict data in 3.0 merge.
Fix event.data -> event.url for hash, clean_url, and test assertions.
Add filter_event override to skip subdomain_enum filtering for URL events.
Stop auto-copying archive_url from parent to child event data dicts,
which could infect live HTTP responses downstream. Instead, use the
from-wayback tag as a signal and traverse upward to find the nearest
archive_url when needed (via new event.archive_url property).
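
The traversal described above can be sketched with a minimal event class (a hypothetical shape; real BBOT events carry far more state):

```python
class Event:
    """Minimal event with a parent link, enough to show archive_url lookup."""

    def __init__(self, data=None, parent=None, tags=()):
        self.data = data or {}
        self.parent = parent
        self.tags = set(tags)

    @property
    def archive_url(self):
        # Walk up the parent chain to the nearest ancestor that recorded
        # an archive_url, instead of copying it onto every child event.
        event = self
        while event is not None:
            url = event.data.get("archive_url")
            if url:
                return url
            event = event.parent
        return None

root = Event({"archive_url": "https://web.archive.org/web/20200101/https://example.com/"})
child = Event({"url": "https://example.com/login"}, parent=root, tags=["from-wayback"])
```

Because nothing is copied onto children, a live HTTP response spawned downstream never inherits a stale `archive_url`.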
@liquidsec liquidsec changed the base branch from 3.0 to blasthttp-integration-clean April 3, 2026 19:19
…st→targets, VULNERABILITY→FINDING, httpx→http module refs
