feat: implement session-based tracking for malicious behavior#430
feat: implement session-based tracking for malicious behavior#430Jazzcort wants to merge 1 commit into
Conversation
|
For team members: test commit |
Codecov Report❌ Patch coverage is
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 4 files with indirect coverage changes 🚀 New features to boost your workflow:
|
owtaylor
left a comment
There was a problem hiding this comment.
Hmm, not sure this is going to be effective:
- A malicious model can easily evade a "consecutive malicious scripts" checks - also a malicious script for readonly that modifies the system or installs things from the internet is currently often tagged as MODIFIES_SYSTEM or POLICY, so a malicious
- 10 tries is a lot, by then the game might be up.
Maybe:
-
Any MALICIOUS script triggers confirmation of the next run_script (but not the whole session)
-
For the mcp-apps case, we display a boxed warning in that confirmation:
Suspicious activity detected - your chat client may be under attack. Please examine previous tool calls in detail, and if you have any doubts, do not approve this command and terminate this chat session.
There's not much we can do about the non-mcp-apps case.
| A session is flagged as malicious if it accumulates 10+ total malicious actions | ||
| or MAX_CONSECUTIVE_MALICIOUS_ACTIONS consecutive malicious actions. |
There was a problem hiding this comment.
A little odd to have one a constant and the other not
There was a problem hiding this comment.
Got it! I'll have a MAX_TOTAL_MALICIOUS_ACTIONS for that. What would be the suitable count for it? 4 maybe? 🤔
| """ | ||
|
|
||
| def __init__(self): | ||
| self._recent_record: deque[GatekeeperStatus] = deque(maxlen=MAX_CONSECUTIVE_MALICIOUS_ACTIONS) |
There was a problem hiding this comment.
deque seems unnecessary - can you just have _consecutive_malicious_action_acounts and reset it to zero if you get some other status.
There was a problem hiding this comment.
You're right! It can be simplified 😁
5439742 to
0951ca0
Compare
|
For team members: test commit |
|
|
||
| @property | ||
| def is_previous_action_malicious(self) -> bool: | ||
| return self._previous_action_status == GatekeeperStatus.MALICIOUS |
There was a problem hiding this comment.
I think this should not be cleared except by an approval - that is:
validate_script => MALICIOUS - sets flag
validate_script => OK => flag stays set
execute_script => flag is cleared
There was a problem hiding this comment.
Make sense! Will change!
|
|
||
| @property | ||
| def is_security_compromised(self) -> bool: | ||
| return self._is_security_compromised |
There was a problem hiding this comment.
If the model is security compromised, then we have no business allowing the user to execute anything.
What if we unified the two flags into "malicious_activity_warning" . which gets set transiently on a single MALICIOUS and permanently passed the thresholds.
This differs from your setup in that we'd display the scary warning for a single malicious activity detection, but I think that's probably appropriate. An attacker only has to get through once.
There was a problem hiding this comment.
Sounds great! I'm always bad at naming these. Will change!
0951ca0 to
cf62e35
Compare
|
For team members: test commit |
Add BehaviorRecord and BehaviorRecordManager to track gatekeeper verdicts per session. A single MALICIOUS verdict sets a temporary warning that forces human confirmation on the next script; the warning clears once the human approves. If a session accumulates 4 total or 3 consecutive malicious verdicts, it is permanently flagged and all subsequent scripts require confirmation. Integrate behavior tracking into validate_script, run_script, run_script_interactive, run_script_with_confirmation, and execute_script. Emit malicious_activity_warning through the RunScriptInteractiveResult for the mcp-app UI. Add documentation for session-based behavior tracking in guarded-command-execution.md. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
cf62e35 to
bf4a203
Compare
|
For team members: test commit |
Adds
BehaviorRecordandBehaviorRecordManagerto monitor gatekeeper verdicts per session. If a session accumulates 10 total or 3 consecutive malicious actions, it is permanently flagged as malicious. Once flagged, all subsequent scripts executed in that session will be forced to require user confirmation, regardless of the script's default requirements.