Conversation
```python
            ), "Failed to emit FINDING for interesting archived file"
            for e in events:
                if e.type == "FINDING" and "site.zip" in e.data.get("description", ""):
                    assert "web.archive.org" in e.data["url"]
```
Check failure
Code scanning / CodeQL
Incomplete URL substring sanitization
Copilot Autofix
AI 2 months ago
In general, to fix incomplete URL substring sanitization, you should parse the URL and inspect its hostname (and possibly scheme) rather than checking for an allowed domain as a raw substring. For example, use urllib.parse.urlparse to extract hostname and compare it exactly to web.archive.org or check for allowed subdomains.
For this specific test, we should avoid checking "web.archive.org" in e.data["url"] and instead parse e.data["url"] with urlparse, get the hostname, and assert that it equals "web.archive.org" (or, if subdomains are expected, use .endswith(".web.archive.org")). This preserves the intent—ensuring the URL actually points to web.archive.org—without relying on substring position. Concretely:
- Add an import for `urlparse` from `urllib.parse` at the top of `bbot/test/test_step_2/module_tests/test_module_wayback.py` (alongside the existing `unquote` import).
- Replace the assertion at line 86 with logic that parses `e.data["url"]` and asserts that the parsed hostname is `web.archive.org`.
- Keep the rest of the test logic unchanged so existing behavior is preserved.
```diff
@@ -1,5 +1,5 @@
 import re
-from urllib.parse import unquote
+from urllib.parse import unquote, urlparse

 from werkzeug.wrappers import Response

@@ -83,7 +83,8 @@
         ), "Failed to emit FINDING for interesting archived file"
         for e in events:
             if e.type == "FINDING" and "site.zip" in e.data.get("description", ""):
-                assert "web.archive.org" in e.data["url"]
+                parsed_url = urlparse(e.data["url"])
+                assert parsed_url.hostname == "web.archive.org"


 class TestWaybackArchive(ModuleTestBase):
```
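The pattern behind this autofix (and the two identical findings below) is the same: compare the parsed hostname instead of searching the raw URL string. A minimal standalone sketch, with an illustrative helper name:

```python
from urllib.parse import urlparse

def is_archive_host(url: str) -> bool:
    # Parse the URL and compare the hostname exactly, rather than
    # searching for the domain as a substring anywhere in the URL.
    hostname = urlparse(url).hostname
    return hostname == "web.archive.org" or (
        hostname is not None and hostname.endswith(".web.archive.org")
    )

# A raw substring check would pass for all three of these;
# the hostname check rejects the last two.
print(is_archive_host("https://web.archive.org/web/2020/http://example.com/site.zip"))  # True
print(is_archive_host("https://web.archive.org.evil.com/site.zip"))  # False
print(is_archive_host("https://evil.com/?q=web.archive.org"))  # False
```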
```python
                f"HTTP_RESPONSE url should contain original host, got: {e.data['url']}"
            )
            # archive_url should contain the archive.org provenance URL
            assert "web.archive.org" in e.data.get("archive_url", ""), (
```
Check failure
Code scanning / CodeQL
Incomplete URL substring sanitization
Copilot Autofix
AI 2 months ago
In general, the problem is that the code checks for "web.archive.org" as a substring of the whole URL string. To fix this class of issues, the URL should be parsed with urllib.parse.urlparse, and the check should be applied to the hostname component (or path, if that is what matters), rather than the entire string. That ensures we are actually verifying the domain and not an arbitrary substring.
The best minimal fix here is to parse archive_url and assert on its hostname. Specifically, in TestWaybackArchiveHostField.check, instead of:
`assert "web.archive.org" in e.data.get("archive_url", ""), ...` we should:
- Retrieve the `archive_url` string.
- Parse it with `urllib.parse.urlparse`.
- Assert that `parsed.hostname == "web.archive.org"` (or at least that it ends with `.web.archive.org` if subdomains should be allowed).
This change keeps the existing semantics—verifying that the provenance URL is indeed pointing at archive.org—while avoiding substring-based URL checks. To implement this, we need to import urlparse (or urlsplit) from urllib.parse at the top of bbot/test/test_step_2/module_tests/test_module_wayback.py and update the assertion accordingly. No other functional changes are required.
```diff
@@ -1,5 +1,5 @@
 import re
-from urllib.parse import unquote
+from urllib.parse import unquote, urlparse

 from werkzeug.wrappers import Response

@@ -251,8 +251,10 @@
                 f"HTTP_RESPONSE url should contain original host, got: {e.data['url']}"
             )
             # archive_url should contain the archive.org provenance URL
-            assert "web.archive.org" in e.data.get("archive_url", ""), (
-                f"HTTP_RESPONSE archive_url should be the archive.org URL, got: {e.data.get('archive_url')}"
+            archive_url = e.data.get("archive_url", "")
+            archive_host = urlparse(archive_url).hostname if archive_url else None
+            assert archive_host == "web.archive.org", (
+                f"HTTP_RESPONSE archive_url should be the archive.org URL, got: {archive_url}"
             )
             # event.host should be the original host
             assert str(e.host) != "web.archive.org", f"event.host should be original host, got: {e.host}"
```
```python
            assert "archive_url" in finding.data, (
                f"Hunt FINDING should have archive_url for provenance, got: {finding.data}"
            )
            assert "web.archive.org" in finding.data["archive_url"], (
```
Check failure
Code scanning / CodeQL
Incomplete URL substring sanitization
Copilot Autofix
AI 2 months ago
In general, the way to fix incomplete URL substring sanitization is to parse the URL using a standard library, extract the hostname, and then compare that hostname (or a suffix of it) to the expected allowed host, instead of checking for a substring in the raw URL string.
In this specific case, we should change the assertion that currently does assert "web.archive.org" in finding.data["archive_url"] so that it parses archive_url with urllib.parse.urlparse, extracts .hostname, and asserts that the hostname is exactly web.archive.org. This preserves the intended functionality (“archive_url should be archive.org URL”) while avoiding arbitrary substring matches. Concretely, within TestWaybackParameters.check, around lines 309–315, we will introduce a local variable such as archive_url_host = urlparse(finding.data["archive_url"]).hostname and assert archive_url_host == "web.archive.org". To do this, we must import urlparse from urllib.parse at the top of the test file, alongside the existing unquote import. No other behavior in the tests needs to change.
```diff
@@ -1,5 +1,5 @@
 import re
-from urllib.parse import unquote
+from urllib.parse import unquote, urlparse

 from werkzeug.wrappers import Response

@@ -310,8 +310,10 @@
             assert "archive_url" in finding.data, (
                 f"Hunt FINDING should have archive_url for provenance, got: {finding.data}"
             )
-            assert "web.archive.org" in finding.data["archive_url"], (
-                f"Hunt FINDING archive_url should be archive.org URL, got: {finding.data['archive_url']}"
+            archive_url_host = urlparse(finding.data["archive_url"]).hostname
+            assert archive_url_host == "web.archive.org", (
+                f"Hunt FINDING archive_url should be archive.org URL, got host: {archive_url_host}, "
+                f"full URL: {finding.data['archive_url']}"
             )

             # WEB_PARAMETERs from archived content should also have archive_url
```
bro its a draft step off
📊 Performance Benchmark Report
📈 Detailed Results (All Benchmarks)
🎯 Performance Summary: 1 regression ⚠️, 23 unchanged ✅
🔍 Significant Changes (>10%)
🐍 Python Version: 3.11.15
Codecov Report
❌ Patch coverage is
Additional details and impacted files

```
@@              Coverage Diff               @@
##  blasthttp-integration-clean  #2909  +/- ##
============================================================
- Coverage      91%     91%    -0%
============================================================
  Files         443     443
  Lines       38305   39030   +725
============================================================
+ Hits        34621   35243   +622
- Misses       3684    3787   +103
```

☔ View full report in Codecov by Sentry.
```diff
     async def run_in_executor_mp(self, callback, *args, **kwargs):
         """
-        Same as run_in_executor() except with a process pool executor
-        Use only in cases where callback is CPU-bound
+        Same as run_in_executor() except with a process pool executor.
+        Use only in cases where callback is CPU-bound.
+
+        Includes a timeout (default 300s) to prevent indefinite hangs if a
+        child process dies or the pool enters a broken state.
+
+        Pass ``_timeout=seconds`` to override the default timeout.

         Examples:
             Execute callback:
             >>> result = await self.helpers.run_in_executor_mp(callback_fn, arg1, arg2)
         """
+        timeout = kwargs.pop("_timeout", 300)
         callback = partial(callback, **kwargs)
-        return self.loop.run_in_executor(self.process_pool, callback, *args)
+        future = self.loop.run_in_executor(self.process_pool, callback, *args)
```
This timeout needs to be applied one level deeper: it cancels only the awaiting coroutine, but leaves the stuck task executing in the process pool, thereby occupying a child process indefinitely.
- We should be executing network/api requests in the main thread
- We should only be submitting cpu-bound tasks (i.e. bulk URL parsing) to the process pool
It was already only CPU-bound tasks; there were no network requests here.
Include the actual failure reason (timeout, connection error, HTTP status code) in retry and warning messages so it's clear why archive.org requests failed. Increase CDX timeout from +30s to +60s.
- Add max_records option (default 100000) for CDX API limit
- Only retry archive fetches on connection errors/429, not on definitive HTTP status codes
- Change "Loading archived URLs" message from hugeinfo to verbose
- Update retry test to use ReadError instead of 503
- wayback: override `_incoming_dedup_hash` for URL events to prevent subdomain_enum's domain-based dedup from collapsing distinct URLs
- wayback: fix FINDING confidence "MODERATE" -> "MEDIUM" (valid level)
- wayback: use individual requests instead of request_batch for interesting file HEAD checks
- subdomain_enum: revert is_target exemption from wildcard rejection
The _hang_forever worker process outlives the test and blocks Python's threading._shutdown via the ProcessPoolExecutor management thread. Terminate stuck workers after the test and add a safety net in pytest_sessionfinish.
Previously, asyncio.wait_for() only cancelled the awaiting coroutine but left the child process running indefinitely. On a 4-core machine, just 4 stuck workers would permanently stall the scan. Now on timeout we terminate all workers, replace the pool, and continue cleanly.
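The failure mode described here — `asyncio.wait_for()` cancelling only the awaiting coroutine while the worker keeps running — can be reproduced in miniature with a thread pool, which shares the same `run_in_executor` semantics (a sketch, not BBOT's code):

```python
import asyncio
import threading
import time
from concurrent.futures import ThreadPoolExecutor

finished = threading.Event()  # set by the worker when it finally completes

def slow_task():
    # Stand-in for a CPU-bound callback that outlives the timeout
    time.sleep(0.5)
    finished.set()

async def main():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = loop.run_in_executor(pool, slow_task)
        try:
            await asyncio.wait_for(future, timeout=0.1)
            return False
        except asyncio.TimeoutError:
            # Only the awaiting side was cancelled; the worker is still
            # running, and the pool shutdown below waits for it anyway.
            return True

timed_out = asyncio.run(main())
print(timed_out, finished.is_set())  # → True True
```

The fix the commit describes goes one step further for process pools: on timeout, terminate the stuck workers and replace the pool, since a running child process (unlike a thread) can be killed.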
…tense to wayback-heavy URL_UNVERIFIED/URL events changed from string to dict data in 3.0 merge. Fix event.data -> event.url for hash, clean_url, and test assertions. Add filter_event override to skip subdomain_enum filtering for URL events.
Stop auto-copying archive_url from parent to child event data dicts, which could infect live HTTP responses downstream. Instead, use the from-wayback tag as a signal and traverse upward to find the nearest archive_url when needed (via new event.archive_url property).
Force-pushed 93fee25 to 9aefafc (compare)
…st→targets, VULNERABILITY→FINDING, httpx→http module refs
Force-pushed 9aefafc to bce4b86 (compare)
Filter ftp:// and other non-HTTP URLs in _pre_process_urls() before they enter the archive cache. Truncate event data in ValidationError messages to 200 chars to prevent terminal flooding.
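The scheme filter described in this commit can be sketched as follows; `_pre_process_urls` internals are not shown in the thread, so the structure and names here are illustrative:

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}

def pre_process_urls(urls):
    # Hypothetical sketch: drop ftp:// and other non-HTTP URLs
    # before they enter the archive cache.
    for url in urls:
        if urlparse(url).scheme.lower() in ALLOWED_SCHEMES:
            yield url

print(list(pre_process_urls(["http://a.com/x", "ftp://a.com/y", "https://b.com/z"])))
# → ['http://a.com/x', 'https://b.com/z']
```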
Summary
Major upgrade to the wayback module with new capabilities for URL discovery, parameter extraction, and archived content retrieval.
Wayback Module Overhaul
- URL discovery (capped by `max_records`) with server-side filtering, bloom filter dedup, and garbage/crawler-trap detection
- Parameter extraction (`parameters=true`): emits `WEB_PARAMETER` events for query params found in archived URLs, paired with live URL events for downstream modules like lightfuzz
- Archive retrieval (`archive=true`): fetches archived snapshots of dead URLs from the Wayback Machine and emits them as `HTTP_RESPONSE` events for excavate to process (extracting params, secrets, etc.)
- Checks archived files with interesting extensions (`.zip`, `.sql`, `.bak`, `.env`, etc.) and emits FINDINGs for files that still exist
- `wayback` (URL discovery) and `wayback-intense` (full archive + parameter extraction)

Excavate Enhancements
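The discovery query can be sketched against the public Wayback CDX API. The parameter names below (`url`, `matchType`, `fl`, `collapse`, `limit`) are documented CDX API fields; the module's exact query and filters may differ:

```python
from urllib.parse import urlencode

def build_cdx_query(domain: str, max_records: int = 100000) -> str:
    # Illustrative server-side-filtered CDX query honoring max_records
    params = {
        "url": domain,
        "matchType": "domain",  # include subdomains
        "output": "json",
        "fl": "original",       # return only the original-URL column
        "collapse": "urlkey",   # server-side dedup of identical URLs
        "limit": max_records,
    }
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)

print(build_cdx_query("example.com"))
```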
- `_event_host()` and `_event_base_url()` methods so excavate correctly attributes findings from archived content to the original target (not archive.org)
- `DirectoryListingExtractor`: YARA-based submodule detecting Apache, Nginx, IIS, and Python directory listings
- `from-wayback` tag propagates through the event chain

Core / Infrastructure
- `run_in_executor_mp` now has a 300s timeout; on timeout, stuck workers are killed and the pool is replaced
- `max_tasks_per_child=25` on Python 3.11+ prevents memory leaks
- `__getstate__`/`__setstate__` for proper serialization across process boundaries
- "You don't have permission to access "
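The serialization hook mentioned above can be sketched generically; the class and attribute names here are illustrative, not BBOT's:

```python
import pickle

class Helper:
    """Object with an unpicklable member that must cross a process
    boundary (e.g. to a ProcessPoolExecutor worker)."""

    def __init__(self):
        self.config = {"max_records": 100000}
        self._loop = object()  # stand-in for an unpicklable resource

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("_loop", None)  # drop what can't be pickled
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._loop = None  # re-initialize lazily in the child process

h = pickle.loads(pickle.dumps(Helper()))
print(h.config["max_records"], h._loop)  # → 100000 None
```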