loader: fix recycled TID overwriting matched calls in dynamic layout #2882
devs6186 wants to merge 2 commits into mandiant:master
Conversation
Summary of Changes: Hello @devs6186, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses a critical bug in the dynamic analysis loader where the recycling of thread IDs by the operating system could lead to data loss and runtime errors. Specifically, it prevents the overwriting of matched calls associated with a recycled TID and ensures that such TIDs are not registered multiple times in the layout. These changes enhance the robustness and accuracy of dynamic analysis results, particularly in environments where TIDs are frequently reused. Additionally, the pull request includes a new test to cover this specific scenario and updates the documentation to provide a clearer overview of capa's output consumption methods.
Code Review
This pull request addresses a bug where recycled thread IDs in dynamic analysis could lead to overwritten data and subsequent errors. The fix correctly uses setdefault to accumulate calls from all instances of a recycled TID and adds a check to prevent duplicate thread registration. A comprehensive test case has been added to validate the fix. The overall implementation is correct and of high quality. The PR also includes significant code formatting changes, which improve readability. For future PRs, it would be beneficial to separate such large-scale formatting changes from functional bug fixes to make reviewing easier. The documentation updates are also a welcome addition.
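The `setdefault` behavior the review credits can be demonstrated in isolation. This is a generic Python sketch of the dictionary pattern, not capa's actual loader code:

```python
# Simulate a layout mapping keyed by thread address, where the same
# key is encountered twice (as with a recycled TID).

# Overwriting assignment: the second encounter erases the first list.
layout = {}
layout["thread-1"] = []
layout["thread-1"].append("call-a")
layout["thread-1"] = []          # second encounter: "call-a" is lost
print(layout["thread-1"])        # []

# setdefault: the existing list is kept, so calls accumulate.
layout = {}
layout.setdefault("thread-1", []).append("call-a")
layout.setdefault("thread-1", []).append("call-b")  # second encounter
print(layout["thread-1"])        # ['call-a', 'call-b']
```

`dict.setdefault(key, default)` returns the existing value when the key is present and only inserts `default` otherwise, which is exactly the accumulate-don't-overwrite semantics the fix needs.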
force-pushed 72dbf97 to 29bb083
mike-hunhoff left a comment
Hi @devs6186, thanks for looking into the data overwriting issue. While this fix prevents matched calls from being lost when a TID is recycled, it doesn't quite address the requirement for unique tracking of distinct thread and process lifecycles. By accumulating all calls under the same ThreadAddress, the layout still fuses separate executions into a single entry, which doesn't solve the core problem described in #2361.
Because we want to avoid merging partial solutions for this component, we'd like to see a more comprehensive fix that addresses the uniqueness and de-duplication problem entirely for both thread and process lifecycles (for example, by incorporating sequence IDs or timestamps into the Address classes).
If you're interested in fully addressing these requirements, we'd love to see those updates in this PR. Alternatively, if you don't have the time or interest to take on the broader scope right now, let us know and we can close this for now.
This file contains many format-based changes. Please revert these if they are not necessary to pass linting checks.
I appreciate your review, I will look into it more broadly. I want to solve the problem completely instead of just submitting a PR; give me a day or two maximum and I'll come back with a better solution. Thank you!
force-pushed 944f89c to 90ba1b0
…D lifecycles

Adds an optional `id` field to `ProcessAddress` and `ThreadAddress` that sandbox backends can populate with a sandbox-specific unique identifier (e.g. VMRay monitor_id, or a sequential counter for CAPE). When set, this field becomes part of equality/hashing so that two process or thread instances that share the same OS-assigned PID/TID are treated as distinct addresses throughout capa's pipeline.

This comprehensively fixes the ValueError crash in render (mandiant#2619) by solving the root uniqueness problem described in mandiant#2361: rather than merging recycled lifecycles into a single entry, each instance now gets its own identity.

Changes:
- address.py: add optional `id` to ProcessAddress and ThreadAddress; update `__eq__`, `__hash__`, `__lt__`, `__repr__` accordingly; backward-compatible (id=None by default)
- freeze/__init__.py: extend from_capa/to_capa to encode/decode the new id fields using extended tuple lengths; old 2/3/4-element tuples still decoded correctly for backward compatibility
- vmray/extractor.py: pass monitor_id as id to both ProcessAddress and ThreadAddress so each VMRay monitor instance is uniquely tracked
- cape/file.py: detect PID reuse via two-pass counting and assign sequential ids; processes with unique PIDs keep id=None (no behavior change)
- render/verbose.py: add _format_process_fields / _format_thread_fields helpers that include the id in rendered output when present
- tests/test_address_uniqueness.py: 35 unit tests covering identity, hashing, sorting, freeze roundtrip (incl. backward compat), and compute_dynamic_layout behavior for both recycled TIDs and recycled PIDs
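The identity change described in the commit message can be sketched with a simplified dataclass. Field names and shapes here are illustrative, not capa's actual `ThreadAddress`:

```python
from dataclasses import dataclass
from typing import Optional

# A frozen dataclass generates __eq__ and __hash__ over all fields,
# so the optional `id` participates in identity: two thread instances
# that share an OS-assigned pid/tid but carry different ids are
# distinct addresses, while id=None behaves like the old two-field form.
@dataclass(frozen=True, order=True)
class ThreadAddress:
    pid: int
    tid: int
    id: Optional[int] = None  # e.g. VMRay monitor_id, or a sequential counter

first = ThreadAddress(pid=100, tid=200, id=0)
second = ThreadAddress(pid=100, tid=200, id=1)  # recycled TID, new lifecycle

assert first != second            # distinct despite same pid/tid
assert len({first, second}) == 2  # both survive as dict/set keys
assert ThreadAddress(100, 200) == ThreadAddress(100, 200)  # id omitted: unchanged behavior
```

Because the class is hashable, both lifecycles get their own entries in any layout dictionary keyed by address, which is the root-cause fix the commit describes.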
force-pushed 90ba1b0 to e4aa79a
mike-hunhoff left a comment
thanks @devs6186. I started a pass and quickly realized that there appears to be a lot of code added to maintain backwards compatibility. This change is going to be considered a breaking change requiring a major release, so let's not overcomplicate the code to maintain backwards compatibility. Also, please ensure that process and thread uniqueness is addressed for all supported sandboxes. Presently, it looks like you only have process and thread covered for VMRay and process covered for CAPE.
```python
# only assign ids when reuse is detected; otherwise keep id=None
# for backward compatibility with existing addresses and freeze files
```
Please explain why the value must be None to maintain backwards compatibility. Isn't the id field being optional enough to maintain backwards compatibility? Otherwise, we should be assigning IDs moving forward.
```python
for process in report.behavior.processes:
    addr = ProcessAddress(pid=process.process_id, ppid=process.parent_id)
    yield ProcessHandle(address=addr, inner=process)
    key = (process.parent_id, process.process_id)
    counts[key] = counts.get(key, 0) + 1

    # check for pid and ppid reuse
    if addr not in seen_processes:
        seen_processes[addr] = [process]
    else:
        logger.warning(
            "pid and ppid reuse detected between process %s and process%s: %s",
            process,
            "es" if len(seen_processes[addr]) > 1 else "",
            seen_processes[addr],
        )

# second pass: yield handles with sequential ids for reused pairs
seq: dict[tuple[int, int], int] = {}
for process in report.behavior.processes:
    key = (process.parent_id, process.process_id)
    seq[key] = seq.get(key, 0) + 1
```
Please explain why two passes are required, vs. assigning unique IDs from the start.
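For comparison, a single-pass scheme like the question suggests could assign the next sequence number as each record is seen, with no preliminary counting pass. This is a hypothetical sketch over plain `(ppid, pid)` tuples, not CAPE's actual report model:

```python
from typing import Iterator

def assign_ids(records: list[tuple[int, int]]) -> Iterator[tuple[int, int, int]]:
    # Single pass: the running count per (ppid, pid) pair *is* the id,
    # so reused pairs get 0, 1, 2, ... in order of appearance.
    seen: dict[tuple[int, int], int] = {}
    for ppid, pid in records:
        seq = seen.get((ppid, pid), 0)
        seen[(ppid, pid)] = seq + 1
        yield ppid, pid, seq

# hypothetical process records: (ppid, pid) pairs as they appear in a report;
# (1, 100) is reused
processes = [(1, 100), (1, 100), (2, 200)]
print(list(assign_ids(processes)))  # [(1, 100, 0), (1, 100, 1), (2, 200, 0)]
```

The trade-off is that with a single pass, processes with unique PIDs also receive an id (here 0) rather than being left unmarked, which is only a problem if unmarked addresses must stay byte-identical to the old format.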
What about CAPE threads?
capa/features/freeze/__init__.py
```python
if a.id is not None:
    return cls(type=AddressType.PROCESS, value=(a.ppid, a.pid, a.id))
else:
    return cls(type=AddressType.PROCESS, value=(a.ppid, a.pid))
```
Is this to maintain backwards compatibility? Generally, this change will be considered a breaking change regardless, so let's not overcomplicate the code to maintain backwards compatibility.
- CAPE file.py: single pass with sequential IDs (no two-pass)
- CAPE process.py: add thread uniqueness with sequential IDs
- Drakvuf helpers.py: assign id=0 to process/thread addresses
- freeze: always include id in tuples, remove backwards-compat branching
- freeze: revert format-only changes in loads_static/loads_dynamic/main
- verbose.py: remove over-engineered pid-id display logic
- tests: update for simplified API, remove backwards-compat tests
thanks @mike-hunhoff, really appreciate the detailed review. Pushed the changes in 2585abf addressing everything:

- dropped all the backwards-compat complexity
- CAPE processes: single pass with sequential IDs
- CAPE threads: added uniqueness
- Drakvuf: covered
- reverted the format-only changes

Sandbox coverage now spans processes and threads for VMRay, CAPE, and Drakvuf.
I hope you're doing well, I know this is a busy period and I genuinely appreciate the time you've already put into the project. I wanted to send a brief follow-up on Discussion #99. The GSoC application deadline is approaching and I want to make sure I'm heading in the right direction before I finalize my proposal. I'm not asking for an exhaustive review; even a quick "this looks promising / you're off-track on X" would be enormously helpful at this stage. Thank you
@devs6186 please keep all GSoC-related correspondence in the flare-gsoc discussions so we can better track them.
closes #2619
Root cause
When an OS recycles a thread ID (TID) within the same process, `compute_dynamic_layout()` iterated the same `ThreadAddress` twice. On the second encounter it executed an overwriting assignment: the matched calls from the first thread instance were still present in `matched_calls` (added by the rule engine), but had been erased from the layout, causing `_get_call_name()` to raise.

Additionally, if both the first and the second instance had matched calls, `t.address` was appended to `threads_by_process[p.address]` twice, creating a duplicate thread entry in the layout.

Fix
Two targeted changes in `compute_dynamic_layout()` (`capa/loader.py`):

1. Replace the overwriting assignment with `setdefault()` so calls from all instances of a recycled TID accumulate under the same key.
2. Guard the thread-registration block with a membership check so the thread address is only added to `matched_threads`/`threads_by_process` once, regardless of how many times the TID is recycled.

This is the source-level fix requested by the maintainer: the data is now consistent before it reaches the rendering stage, so no exception handling in the renderer is needed.
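The two changes can be sketched together. This is a minimal illustration using the names from the description above (`matched_calls`, `matched_threads`, `threads_by_process`), not the actual body of `compute_dynamic_layout()`:

```python
# Sketch: registering thread t of process p, where the extractor may
# yield the same thread address more than once (recycled TID).

matched_calls: dict = {}
matched_threads: set = set()
threads_by_process: dict = {}

def register_thread(p_address, t_address, calls):
    # fix 1: setdefault() instead of assignment, so calls from every
    # instance of a recycled TID accumulate under the same key
    matched_calls.setdefault(t_address, []).extend(calls)

    # fix 2: membership check, so the thread address is registered
    # at most once regardless of how often the TID is recycled
    if t_address not in matched_threads:
        matched_threads.add(t_address)
        threads_by_process.setdefault(p_address, []).append(t_address)

register_thread("proc-1", "thread-1", ["CreateFile"])
register_thread("proc-1", "thread-1", ["WriteFile"])  # recycled TID

assert matched_calls["thread-1"] == ["CreateFile", "WriteFile"]
assert threads_by_process["proc-1"] == ["thread-1"]   # registered once
```

With both guards in place, the layout and `matched_calls` stay consistent no matter how many times the OS reuses a TID, which is why the renderer no longer needs defensive exception handling.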
Test
`tests/test_loader.py::test_compute_dynamic_layout_recycled_tid` constructs a minimal mock `DynamicFeatureExtractor` that yields the same `ThreadAddress` twice (simulating TID recycling), with a matched call only in the first instance. It asserts that `compute_dynamic_layout()` returns a layout containing exactly one thread entry with the matched call intact.

Checklist