Guac session auth, EVTX periodic snapshots, and web fixes #2964
Conversation
- Replace client-side VNC port/IP passing with server-side libvirt lookup and database-backed session tokens
- Add GuacSession model for tying browser sessions to task lifecycle
- WebSocket consumer validates session cookie, checks task is running, and auto-disconnects when task ends
- Remove guest_ip/vncport from JS client (no longer sent to browser)
- Scale VNC display to fit browser viewport with correct mouse mapping
- Collect EVTX snapshots every 30s to protect against log clearing
- Add regex search filtering for event log viewer
- Group snapshot files by channel with proper EVTX header validation
- Filter analyzer process noise from displayed events
- Add missing crispy_forms_tags load to account templates
- Recover original filename from task DB on resubmit
- Handle multiple evtx zip files (snapshots + final) instead of single evtx.zip
- Extract each snapshot into separate subdirectory for zircolite processing
- Add evtx_dump-based pre-filtering to strip analyzer process noise
- Load exclude patterns from filters.json pre_filters config
- Formatting fixes (black/ruff compliance)

Companion to kevoreilly/CAPEv2#2964 which adds the periodic EVTX snapshot collection in the guest analyzer.
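For context, a minimal sketch of what per-snapshot extraction could look like on the processing side, assuming a hypothetical `evtx*.zip` naming scheme and helper name (neither is the actual CAPE code):

```python
import glob
import os
import re
import zipfile

def extract_evtx_archives(analysis_dir, work_dir):
    """Extract the final evtx.zip plus any snapshot zips, each into its own
    subdirectory, so each snapshot can be processed independently.
    The 'evtx*.zip' naming is illustrative, not the actual CAPE layout."""
    def numeric_key(path):
        # order numbered snapshots chronologically (2 before 10)
        m = re.search(r"(\d+)", os.path.basename(path))
        return (0, int(m.group(1))) if m else (1, 0)

    extracted = []
    archives = sorted(glob.glob(os.path.join(analysis_dir, "evtx*.zip")), key=numeric_key)
    for idx, zip_path in enumerate(archives):
        dest = os.path.join(work_dir, f"snapshot_{idx}")
        os.makedirs(dest, exist_ok=True)
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(dest)  # each snapshot lands in its own subdirectory
        extracted.append(dest)
    return extracted
```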
Code Review
This pull request introduces periodic EVTX snapshots in the Windows analyzer to protect against log wiping by malware and adds regex search capabilities for event logs in the web UI. It also implements a more secure Guacamole session management system using database-backed tokens tied to task lifecycles. Key feedback includes addressing lexicographical sorting issues for snapshots that would break chronological order after nine snapshots, refactoring duplicated noise filter logic to improve performance, and using more appropriate database API calls for retrieving task metadata.
```python
with zipfile.ZipFile(self.evtx_dump, "w", zipfile.ZIP_DEFLATED) as zip_obj:
    if os.path.isdir(snapshot_base):
        for snap_name in sorted(os.listdir(snapshot_base)):
```
Sorting directory names lexicographically will result in an incorrect chronological order when the snapshot count exceeds 9 (e.g., '10' will come before '2'). This affects the order in which events are merged into the final archive and subsequently displayed in the web UI.
```diff
- for snap_name in sorted(os.listdir(snapshot_base)):
+ for snap_name in sorted(os.listdir(snapshot_base), key=lambda x: int(x) if x.isdigit() else x):
```
Fixed in 4a93260 — using numeric sort key.
```python
        members_to_extract.append(m)
if not members_to_extract:
    raise ValueError(f"No EVTX members found for channel: {channel}")
members_to_extract.sort()
```
Lexicographical sorting of snapshot-prefixed filenames will cause events to be chained in the wrong order once the snapshot count reaches double digits (e.g., 10_Security.evtx will be processed before 2_Security.evtx). A numeric sort based on the snapshot prefix should be used to maintain chronological integrity.
```diff
- members_to_extract.sort()
+ members_to_extract.sort(key=lambda x: int(x.split("_")[0]) if "_" in x and x.split("_")[0].isdigit() else x)
```
Fixed in 4a93260 — numeric sort on snapshot prefix.
```python
evtx_paths = sorted(
    os.path.join(tmpdir, m) for m in members_to_extract
    if os.path.exists(os.path.join(tmpdir, m))
)
```
This sorted() call is redundant if members_to_extract is already sorted correctly. Furthermore, it re-introduces the lexicographical sorting issue on the full file paths. It should be replaced with a simple list comprehension that preserves the existing order.
```diff
- evtx_paths = sorted(
-     os.path.join(tmpdir, m) for m in members_to_extract
-     if os.path.exists(os.path.join(tmpdir, m))
- )
+ evtx_paths = [
+     os.path.join(tmpdir, m) for m in members_to_extract
+     if os.path.exists(os.path.join(tmpdir, m))
+ ]
```
Fixed in 4a93260 — replaced with list comprehension preserving pre-sorted order.
```python
# Load analyzer noise filter from shared config
_ANALYZER_PARENTS = set()
_ANALYZER_IMAGES = set()
try:
    import os as _os
    _cape_root = _os.path.dirname(_os.path.dirname(_os.path.dirname(_os.path.abspath(__file__))))
    for _fp in ["data/sigma/filters_local.json", "data/sigma/filters.json"]:
        _full = _os.path.join(_cape_root, _fp)
        if _os.path.exists(_full):
            with open(_full) as _f:
                _data = json.load(_f)
            _pf = _data.get("pre_filters", {})
            for _p in _pf.get("exclude_parent_processes", []):
                _ANALYZER_PARENTS.add(_p.lower())
            for _p in _pf.get("exclude_image_processes", []):
                _ANALYZER_IMAGES.add(_p.lower())
    _ANALYZER_PATHS = set()
    for _fp2 in ["data/sigma/filters_local.json", "data/sigma/filters.json"]:
        _full2 = _os.path.join(_cape_root, _fp2)
        if _os.path.exists(_full2):
            with open(_full2) as _f2:
                _data2 = json.load(_f2)
            for _p in _data2.get("pre_filters", {}).get("exclude_target_paths", []):
                _ANALYZER_PATHS.add(_p.lower())
except Exception:
    pass
if not _ANALYZER_PARENTS:
    _ANALYZER_PARENTS = {"icacls.exe", "python.exe", "wevtutil.exe"}
if not _ANALYZER_IMAGES:
    _ANALYZER_IMAGES = {"wevtutil.exe", "conhost.exe"}
```
The logic for loading analyzer noise filters from JSON files is duplicated in _load_evtx_channel_page_cached and _count_evtx_channel_events_cached. It also performs redundant file I/O and manual path calculations instead of using the existing CUCKOO_ROOT constant. This should be refactored into a helper function or a module-level cached property to improve maintainability and performance.
Fixed in 4a93260 — extracted into _load_evtx_noise_filters() helper using CUCKOO_ROOT.
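For reference, a rough sketch of the shape such a helper could take; the name `_load_evtx_noise_filters` comes from the reply above, while the `lru_cache` use and the `CUCKOO_ROOT` import path are assumptions, not the actual fix:

```python
import json
import os
from functools import lru_cache

from lib.cuckoo.common.constants import CUCKOO_ROOT  # assumed import path

@lru_cache(maxsize=1)
def _load_evtx_noise_filters():
    """Load analyzer-noise filter sets once and share them between the page
    loader and the event counter, falling back to built-in defaults."""
    parents, images, paths = set(), set(), set()
    for rel in ("data/sigma/filters_local.json", "data/sigma/filters.json"):
        full = os.path.join(CUCKOO_ROOT, rel)
        if not os.path.exists(full):
            continue
        with open(full) as fh:
            pf = json.load(fh).get("pre_filters", {})
        parents.update(p.lower() for p in pf.get("exclude_parent_processes", []))
        images.update(p.lower() for p in pf.get("exclude_image_processes", []))
        paths.update(p.lower() for p in pf.get("exclude_target_paths", []))
    return (
        parents or {"icacls.exe", "python.exe", "wevtutil.exe"},
        images or {"wevtutil.exe", "conhost.exe"},
        paths,
    )
```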
```python
if task_id:
    resubmit_tasks = db.find_sample(task_id=task_id)
    if resubmit_tasks:
        original_target = resubmit_tasks[0].to_dict().get("target", "")
        if original_target:
            original_filename = sanitize_filename(os.path.basename(original_target))
filename = base_dir + "/" + (original_filename or sanitize_filename(hash))
```
Using db.find_sample(task_id=task_id) to retrieve the original filename of a task is suboptimal. In the standard database API, db.view_task(task_id) should be used to get the Task object, which contains the target field. Additionally, find_sample typically returns Sample objects which do not have a target attribute in the standard schema.
```python
if task_id:
    task = db.view_task(task_id)
    if task and task.target:
        original_filename = sanitize_filename(os.path.basename(task.target))
filename = base_dir + "/" + (original_filename or sanitize_filename(hash))
```
Fixed in 4a93260 — switched to db.view_task().
Pull request overview
This PR enhances CAPEv2’s interactive Guacamole desktop access and Windows EVTX log retention/viewing by moving VNC connection details server-side with DB-backed session tokens, adding periodic EVTX snapshotting during analysis, and addressing a few web UI regressions.
Changes:
- Add DB-backed Guacamole session tokens validated by the WebSocket consumer, with server-side libvirt VNC port lookup.
- Implement periodic (30s) EVTX snapshots on the Windows analyzer and update the web EVTX viewer to group snapshot files by channel and support regex filtering + analyzer-noise removal.
- Fix account templates missing `crispy_forms_tags` and preserve original filenames on task resubmit.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| web/web/urls.py | Routes /guac/ into the main Django URL configuration. |
| web/web/settings.py | Enables Channels/Guac app in the main Django settings. |
| web/web/guac_settings.py | Initializes CAPE DB for guac ASGI settings and attempts table creation. |
| web/templates/analysis/eventlogs/index.html | Adds client-side regex search UI and passes search to the EVTX loader endpoint. |
| web/templates/analysis/eventlogs/_evtx_channel.html | Displays active regex filter in the EVTX channel header. |
| web/templates/account/password_reset.html | Loads crispy_forms_tags to prevent template errors. |
| web/templates/account/password_reset_from_key.html | Loads crispy_forms_tags to prevent template errors. |
| web/templates/account/password_change.html | Loads crispy_forms_tags to prevent template errors. |
| web/templates/account/email.html | Loads crispy_forms_tags to prevent template errors. |
| web/submission/views.py | Attempts to recover original filename during task resubmission. |
| web/static/js/guac-main.js | Removes client-side guest IP/VNC port usage; connects using session-only params. |
| web/guac/views.py | Creates DB token + cookie for guac session and renders guac page without exposing VNC details. |
| web/guac/templates/guac/index.html | Updates JS invocation to match new GuacMe() signature. |
| web/guac/consumers.py | Validates cookie token, checks task running, looks up VNC port via libvirt, and auto-disconnects. |
| web/analysis/views.py | Groups snapshot-prefixed EVTX files by channel; adds regex filtering and analyzer-noise filtering. |
| lib/cuckoo/core/database.py | Adds DB helper methods for creating/reading/deleting guac session tokens. |
| lib/cuckoo/core/data/guac_session.py | Introduces GuacSession SQLAlchemy model. |
| analyzer/windows/modules/auxiliary/evtx.py | Implements periodic EVTX snapshot export/wipe and merges snapshots into evtx.zip on stop. |
```python
def create_guac_session(self, token, task_id, vm_label, guest_ip):
    """Create a new guac session for a task."""
    from lib.cuckoo.core.data.guac_session import GuacSession
    session = self.session()
    try:
        guac = GuacSession(token=str(token), task_id=task_id, vm_label=vm_label, guest_ip=guest_ip)
        session.add(guac)
        session.commit()
        return guac
    except Exception:
        session.rollback()
        raise
```
The new Guac session helpers use a new scoped session instance and call commit()/rollback() directly. This can break when these helpers are called from Django views wrapped by DBTransactionMiddleware (which already manages a transaction with db.session.begin()), potentially causing invalid transaction state/errors. Prefer using the existing thread-local self.session within a nested transaction (e.g., begin_nested()) and avoid explicit commits here, letting the outer transaction middleware commit; also ensure the session is properly removed/closed when used outside request middleware (e.g., from async consumers).
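A minimal sketch of the nested-transaction shape suggested here, assuming SQLAlchemy 1.4+ savepoint semantics and the thread-local `self.session` scoped session (not the actual fix):

```python
def create_guac_session(self, token, task_id, vm_label, guest_ip):
    """Create a guac session without committing the outer transaction.

    begin_nested() opens a SAVEPOINT, so a failure here rolls back only
    this insert while any middleware-managed transaction stays usable and
    is committed (or rolled back) by the middleware itself.
    """
    from lib.cuckoo.core.data.guac_session import GuacSession

    with self.session.begin_nested():
        guac = GuacSession(token=str(token), task_id=task_id,
                           vm_label=vm_label, guest_ip=guest_ip)
        self.session.add(guac)
    return guac
```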
```python
_db = init_database(exists_ok=True)

# Create guac_sessions table if guacamole is enabled
if _CapeConfig('web').guacamole.get('enabled', False):
    from lib.cuckoo.core.data.guac_session import GuacSession  # noqa: F401
    from lib.cuckoo.core.data.db_common import Base
    Base.metadata.create_all(_db.engine)
```
This settings module performs CAPE DB initialization and then calls Base.metadata.create_all() again. _Database.__init__() already calls Base.metadata.create_all(self.engine), so this is redundant and adds extra side effects at settings-import time (and can race in multi-worker startup). Consider removing the extra create_all() here and rely on init_database() (with the GuacSession model imported) to create the table.
```diff
- _db = init_database(exists_ok=True)
- # Create guac_sessions table if guacamole is enabled
- if _CapeConfig('web').guacamole.get('enabled', False):
-     from lib.cuckoo.core.data.guac_session import GuacSession  # noqa: F401
-     from lib.cuckoo.core.data.db_common import Base
-     Base.metadata.create_all(_db.engine)
+ # Import GuacSession before database initialization so init_database()
+ # creates the table as part of the normal metadata setup.
+ if _CapeConfig('web').guacamole.get('enabled', False):
+     from lib.cuckoo.core.data.guac_session import GuacSession  # noqa: F401
+ _db = init_database(exists_ok=True)
```
```diff
@@ -40,55 +130,41 @@ async def connect(self):
      guest_username = web_cfg.guacamole.username or ""
      guest_password = web_cfg.guacamole.password or ""
```
Several config values are coerced with int(...) and then or-defaulted (e.g., guacd_port = int(...) or 4822). If the config value is missing/None/empty, int() will raise and the connection will fail. Use a safer defaulting pattern (check for truthy value before casting, or catch (TypeError, ValueError)) for guacd_port/guest_width/guest_height/guest_rdp_port.
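A small defensive helper along the lines suggested; `_cfg_int` is a hypothetical name, and the RDP default of 3389 is an assumption (4822 comes from the code under review):

```python
def _cfg_int(section, key, default):
    """Return a config value as int, falling back to the default when the
    value is missing, empty, or not a valid integer."""
    try:
        return int(getattr(section, key, None))
    except (TypeError, ValueError):
        return default

# usage sketch, mirroring the settings named in the comment above
guacd_port = _cfg_int(web_cfg.guacamole, "guacd_port", 4822)
guest_rdp_port = _cfg_int(web_cfg.guacamole, "guest_rdp_port", 3389)
```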
```python
channel_members[channel].append(normalized)
if not channel_has_records[channel]:
    # Read just the first 32 bytes to check the header
    header = zf.read(member)[:32]
```
zf.read(member)[:32] reads the entire EVTX file into memory and then slices it, which is expensive for large logs and repeated across channels. Use zf.open(member) and read only the needed header bytes (32) to determine whether records exist.
```diff
- header = zf.read(member)[:32]
+ with zf.open(member) as member_fp:
+     header = member_fp.read(32)
```
```python
for channel, member_list in sorted(channel_members.items()):
    if not channel_has_records.get(channel, False):
        continue
    member_list.sort()
    members.append({
        "member": member_list[0],
        "members": member_list,
        "channel": channel,
    })
```
Snapshot member ordering is currently lexicographic (member_list.sort()), so after 9 snapshots you'll get 10_... ordered before 2_.... This can misorder snapshots in the UI and later parsing. Consider sorting by the numeric snapshot prefix when present (and then by filename) so snapshots remain chronological.
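One way to express the numeric-prefix ordering as a reusable sort key (a sketch; whether unprefixed names such as the final dump should sort before or after numbered snapshots is an assumption here):

```python
import re

def snapshot_sort_key(name):
    """Order '<n>_<Channel>.evtx' names by numeric snapshot prefix, then by
    filename, so '10_Security.evtx' sorts after '2_Security.evtx'."""
    m = re.match(r"(\d+)_(.+)$", name)
    if m:
        return (0, int(m.group(1)), m.group(2))
    return (1, 0, name)  # unprefixed names (assumed final dump) sort last

# e.g. member_list.sort(key=snapshot_sort_key) in the loop above
```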
```python
channel = _evtx_member_display_name(member)
members_to_extract = []
for m in zf.namelist():
    if _evtx_member_display_name(m) == channel and m.lower().endswith(".evtx"):
        members_to_extract.append(m)
if not members_to_extract:
    raise ValueError(f"No EVTX members found for channel: {channel}")
members_to_extract.sort()
```
members_to_extract.sort() sorts snapshot-prefixed filenames lexicographically, which can put 10_*.evtx before 2_*.evtx and cause records to be chained out of chronological order. Sort using the numeric snapshot prefix (when present) to preserve snapshot ordering during parsing.
```python
def _compile_evtx_search_pattern(search_query):
    search_query = (search_query or "").strip()
    if not search_query:
        return None, ""

    try:
        return re.compile(search_query, re.IGNORECASE), ""
    except re.error as e:
        return None, str(e)
```
The regex search feature compiles and applies user-provided patterns against raw event JSON. When re2 isn't installed, Python's re is used, which is vulnerable to catastrophic backtracking (ReDoS) with crafted patterns and large inputs. Consider requiring re2 for this feature, or enforcing strict limits (pattern length, rejecting nested quantifiers) to prevent CPU exhaustion.
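A sketch of the defensive pattern described, preferring the linear-time re2 engine when installed and capping pattern length as a fallback; the 256-character limit and the function name are illustrative:

```python
import re

try:
    import re2  # re-compatible binding; linear-time, no catastrophic backtracking
except ImportError:
    re2 = None

_MAX_PATTERN_LEN = 256  # illustrative cap, not from the PR

def compile_search_pattern(search_query):
    search_query = (search_query or "").strip()
    if not search_query:
        return None, ""
    if len(search_query) > _MAX_PATTERN_LEN:
        return None, "search pattern too long"
    engine = re2 or re
    try:
        return engine.compile(search_query, re.IGNORECASE), ""
    except Exception as e:  # re.error, or re2's own error type
        return None, str(e)
```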
```python
try:
    for channel in self.windows_logs:
        safe_name = channel.replace("/", "%4") + ".evtx"
        export_path = os.path.join(snapshot_dir, safe_name)
        try:
            cmd = f'wevtutil epl "{channel}" "{export_path}" /ow:true'
            subprocess.call(cmd, startupinfo=self.startupinfo, timeout=30)
        except Exception:
            pass

    self.wipe_windows_logs()
    return True
return False
log.debug("Took evtx snapshot %d", self.snapshot_count)
```
take_snapshot() wipes all event logs unconditionally after attempting exports. If wevtutil epl fails for any/all channels, this will still clear the logs and permanently lose events (opposite of the goal of snapshots). Only wipe logs for channels that were successfully exported (or skip wiping entirely if the snapshot export failed).
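A sketch of the safer flow, tracking which channels exported successfully and wiping only those; the free-function form and the per-channel `wipe_channel` callback are hypothetical (the real code is a method, and `wevtutil cl "<channel>"` would be one way to implement the callback):

```python
import os
import subprocess

def take_snapshot(windows_logs, snapshot_dir, startupinfo, wipe_channel):
    """Export each channel, then wipe only the channels that exported cleanly,
    so a failed export never loses live event log data."""
    os.makedirs(snapshot_dir, exist_ok=True)
    exported = []
    for channel in windows_logs:
        safe_name = channel.replace("/", "%4") + ".evtx"
        export_path = os.path.join(snapshot_dir, safe_name)
        try:
            subprocess.call(
                f'wevtutil epl "{channel}" "{export_path}" /ow:true',
                startupinfo=startupinfo,
                timeout=30,
            )
            if os.path.getsize(export_path) > 0:
                exported.append(channel)
        except Exception:
            continue  # export failed: leave this channel's live log untouched
    for channel in exported:
        wipe_channel(channel)
    return bool(exported)
```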
```python
with zipfile.ZipFile(self.evtx_dump, "w", zipfile.ZIP_DEFLATED) as zip_obj:
    if os.path.isdir(snapshot_base):
        for snap_name in sorted(os.listdir(snapshot_base)):
            snap_dir = os.path.join(snapshot_base, snap_name)
            if not os.path.isdir(snap_dir):
                continue
            for evtx_file in os.listdir(snap_dir):
                if not evtx_file.lower().endswith(".evtx"):
                    continue
                full_path = os.path.join(snap_dir, evtx_file)
                if os.path.getsize(full_path) == 0:
                    continue
                # Add with snapshot number prefix to avoid name collisions
                arc_name = f"{snap_name}_{evtx_file}"
                try:
                    zip_obj.write(full_path, arc_name)
```
Snapshot directories are merged using sorted(os.listdir(snapshot_base)), which is lexicographic and will order 10 before 2. This affects archive naming (arc_name) and can later lead to misordered snapshot processing. Sort snapshot directory names numerically when they are digits.
```python
# Load analyzer noise filter from shared config
_ANALYZER_PARENTS = set()
_ANALYZER_IMAGES = set()
try:
    import os as _os
    _cape_root = _os.path.dirname(_os.path.dirname(_os.path.dirname(_os.path.abspath(__file__))))
    # ... identical filter-loading block as quoted in the earlier comment ...
except Exception:
```
The analyzer-noise filter loading/parsing logic is duplicated in both _load_evtx_channel_page_cached() and _count_evtx_channel_events_cached(). This increases the risk of the two paths drifting (e.g., different defaults / keys) and makes the hot path harder to maintain. Consider extracting a shared helper (and possibly caching the parsed filter sets) so both code paths use exactly the same filter data without repeated file IO/JSON parsing.
- Use numeric sort for snapshot directories and filenames (fixes ordering at 10+)
- Extract analyzer noise filter loading into shared helper function
- Use db.view_task() instead of db.find_sample() for resubmit filename recovery
- Remove redundant sorted() on evtx_paths (preserve pre-sorted order)
Removed unused import statement for itertools.
Was hoping to merge but hit some more ruff errors:
Addresses ruff F401 (evtx.py) and F811 (database.py) flagged in PR review.
Summary
- Add `crispy_forms_tags` to account templates; recover original filename on task resubmit.

New files
- `lib/cuckoo/core/data/guac_session.py` — GuacSession SQLAlchemy model (table auto-created on startup via `create_all`)

Test plan
Note
The companion `modules/processing/sigma.py` changes for multi-snapshot EVTX support will be submitted as a separate PR against CAPEsandbox/community.