@@ -9,12 +9,12 @@ As AI agents grow in capability, ensuring they operate safely, securely, and ali
9
9
1. **Identity and Authorization**: Control who the agent **acts as** by defining agent and user auth.
10
10
2. **Guardrails to screen inputs and outputs:** Control your model and tool calls precisely.
11
11
12
-
* *In-Tool Guardrails:* Design tools defensively, using developer-set tool context to enforce policies (e.g., allowing queries only on specific tables).
13
-
* *Built-in Gemini Safety Features:* If using Gemini models, benefit from content filters that block harmful outputs and from system instructions that guide the model's behavior and safety guidelines.
12
+
* *In-Tool Guardrails:* Design tools defensively, using developer-set tool context to enforce policies (e.g., allowing queries only on specific tables).
13
+
* *Built-in Gemini Safety Features:* If using Gemini models, benefit from content filters that block harmful outputs and from system instructions that guide the model's behavior and safety guidelines.
14
14
* *Callbacks and Plugins:* Validate model and tool calls before or after execution, checking parameters against agent state or external policies.
15
15
* *Using Gemini as a safety guardrail:* Implement an additional safety layer using a cheap and fast model (like Gemini Flash Lite) configured via callbacks to screen inputs and outputs.
16
16
17
-
3. **Sandboxed code execution:** Prevent model-generated code from causing security issues by sandboxing the environment.
17
+
3. **Sandboxed code execution:** Prevent model-generated code from causing security issues by sandboxing the environment.
18
18
4. **Evaluation and tracing**: Use evaluation tools to assess the quality, relevance, and correctness of the agent's final output. Use tracing to gain visibility into agent actions to analyze the steps an agent takes to reach a solution, including its choice of tools, strategies, and the efficiency of its approach.
19
19
5. **Network Controls and VPC-SC:** Confine agent activity within secure perimeters (like VPC Service Controls) to prevent data exfiltration and limit the potential impact radius.
20
20
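The in-tool guardrail idea from the list above (allowing queries only on specific tables) can be sketched as follows. This is a minimal illustration, not the ADK API: `PolicyContext`, `query_table`, and the table names are hypothetical stand-ins for a developer-set tool context and a real tool.

```python
from typing import Any, Dict, FrozenSet


class PolicyContext:
    """Hypothetical stand-in for developer-set tool context (not the ADK ToolContext)."""

    def __init__(self, allowed_tables: FrozenSet[str]) -> None:
        self.allowed_tables = allowed_tables


def query_table(table: str, limit: int, ctx: PolicyContext) -> Dict[str, Any]:
    """Hypothetical tool: builds a query only for tables the developer has allowed."""
    # The policy comes from developer-set context, never from model-supplied arguments.
    if table not in ctx.allowed_tables:
        return {"status": "error", "message": f"Table '{table}' is not permitted."}
    if not isinstance(limit, int) or not 1 <= limit <= 100:
        return {"status": "error", "message": "limit must be an integer between 1 and 100."}
    # Interpolating the table name is acceptable only because it matched the allowlist exactly.
    return {"status": "ok", "sql": f"SELECT * FROM {table} LIMIT {limit}"}


ctx = PolicyContext(frozenset({"orders", "products"}))
allowed = query_table("orders", 10, ctx)
denied = query_table("users", 10, ctx)
```

The tool returns a structured error instead of raising, so the model can see why the call was refused and the policy decision stays inside the tool.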
@@ -25,20 +25,20 @@ Before implementing safety measures, perform a thorough risk assessment specific
25
25
***Sources*** **of risk** include:
26
26
27
27
* Ambiguous agent instructions
28
-
* Prompt injection and jailbreak attempts from adversarial users
28
+
* Prompt injection and jailbreak attempts from adversarial users
29
29
* Indirect prompt injections via tool use
30
30
31
31
**Risk categories** include:
32
32
33
-
* **Misalignment & goal corruption**
34
-
* Pursuing unintended or proxy goals that lead to harmful outcomes ("reward hacking")
35
-
* Misinterpreting complex or ambiguous instructions
33
+
* **Misalignment & goal corruption**
34
+
* Pursuing unintended or proxy goals that lead to harmful outcomes ("reward hacking")
35
+
* Misinterpreting complex or ambiguous instructions
36
36
* **Harmful content generation, including brand safety**
Gemini models come with in-built safety mechanisms that can be leveraged to improve content and brand safety.
190
268
191
-
* **Content safety filters**: [Content filters](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/configure-safety-attributes) can help block the output of harmful content. They function independently from Gemini models as part of a layered defense against threat actors who attempt to jailbreak the model. Gemini models on Vertex AI use two types of content filters:
192
-
* **Non-configurable safety filters** automatically block outputs containing prohibited content, such as child sexual abuse material (CSAM) and personally identifiable information (PII).
193
-
* **Configurable content filters** allow you to define blocking thresholds in four harm categories (hate speech, harassment, sexually explicit, and dangerous content) based on probability and severity scores. These filters are off by default, but you can configure them according to your needs.
269
+
* **Content safety filters**: [Content filters](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/configure-safety-attributes) can help block the output of harmful content. They function independently from Gemini models as part of a layered defense against threat actors who attempt to jailbreak the model. Gemini models on Vertex AI use two types of content filters:
270
+
* **Non-configurable safety filters** automatically block outputs containing prohibited content, such as child sexual abuse material (CSAM) and personally identifiable information (PII).
271
+
* **Configurable content filters** allow you to define blocking thresholds in four harm categories (hate speech, harassment, sexually explicit, and dangerous content) based on probability and severity scores. These filters are off by default, but you can configure them according to your needs.
194
272
* **System instructions for safety**: [System instructions](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/safety-system-instructions) for Gemini models in Vertex AI provide direct guidance to the model on how to behave and what type of content to generate. By providing specific instructions, you can proactively steer the model away from generating undesirable content to meet your organization’s unique needs. You can craft system instructions to define content safety guidelines, such as prohibited and sensitive topics, and disclaimer language, as well as brand safety guidelines to ensure the model's outputs align with your brand's voice, tone, values, and target audience.
195
273
196
274
While these measures are robust against content safety risks, you need additional checks to reduce agent misalignment, unsafe actions, and brand safety risks.
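As a rough illustration of the configurable content filters described above, the sketch below builds one threshold entry per harm category as plain dictionaries. The category and threshold strings are the enum names used by the Vertex AI Gemini API as far as we know; `build_safety_settings` is a hypothetical helper, and the exact way the list is passed to a model should be checked against the linked content filter documentation.

```python
from typing import Dict, List

# The four configurable harm categories named in the text above.
HARM_CATEGORIES = (
    "HARM_CATEGORY_HATE_SPEECH",
    "HARM_CATEGORY_HARASSMENT",
    "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    "HARM_CATEGORY_DANGEROUS_CONTENT",
)


def build_safety_settings(threshold: str = "BLOCK_MEDIUM_AND_ABOVE") -> List[Dict[str, str]]:
    """Hypothetical helper: apply one blocking threshold to every harm category."""
    return [{"category": category, "threshold": threshold} for category in HARM_CATEGORIES]


# Because the filters are off by default, a threshold has to be set explicitly.
safety_settings = build_safety_settings()
```

Using one helper keeps the thresholds consistent across agents; a stricter value such as `BLOCK_LOW_AND_ABOVE` can be passed where the use case calls for it.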
@@ -211,22 +289,22 @@ When modifications to the tools to add guardrails aren't possible, the [**`Befor
211
289
args: Dict[str, Any],
212
290
tool_context: ToolContext
213
291
) -> Optional[Dict]: # Correct return type for before_tool_callback
214
-
292
+
215
293
print(f"Callback triggered for tool: {tool.name}, args: {args}")
216
-
294
+
217
295
# Example validation: Check if a required user ID from state matches an arg
However, when adding security guardrails to your agent applications, plugins are the recommended approach for implementing policies that are not specific to a single agent. Plugins are designed to be self-contained and modular, allowing you to create individual plugins for specific security policies, and apply them globally at the runner level. This means that a security plugin can be configured once and applied to every agent that uses the runner, ensuring consistent security guardrails across your entire application without repetitive code.
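The "configure once, apply everywhere" pattern can be sketched in plain Python. This is not the ADK plugin API: `SecurityPolicyPlugin`, `MiniRunner`, `get_profile`, and the `session_user_id` state key are hypothetical names used only to show how a runner-level policy checks every tool call, here using the user ID check mentioned in the callback example.

```python
from typing import Any, Callable, Dict, List, Optional


class SecurityPolicyPlugin:
    """Hypothetical plugin: blocks tool calls whose user_id differs from the session's."""

    def before_tool(
        self, tool_name: str, args: Dict[str, Any], state: Dict[str, Any]
    ) -> Optional[Dict[str, Any]]:
        expected = state.get("session_user_id")
        if "user_id" in args and args["user_id"] != expected:
            return {"error": f"Tool '{tool_name}' blocked: user ID mismatch."}
        return None  # None means the tool call may proceed.


class MiniRunner:
    """Hypothetical runner: consults every registered plugin before running any tool."""

    def __init__(self, plugins: List[SecurityPolicyPlugin]) -> None:
        self.plugins = list(plugins)

    def call_tool(
        self, tool: Callable[..., Dict[str, Any]], args: Dict[str, Any], state: Dict[str, Any]
    ) -> Dict[str, Any]:
        for plugin in self.plugins:
            verdict = plugin.before_tool(tool.__name__, args, state)
            if verdict is not None:
                return verdict  # The first plugin that objects short-circuits the call.
        return tool(**args)


def get_profile(user_id: str) -> Dict[str, Any]:
    """Hypothetical tool shared by several agents."""
    return {"profile": f"profile for {user_id}"}


runner = MiniRunner([SecurityPolicyPlugin()])
state = {"session_user_id": "alice"}
ok = runner.call_tool(get_profile, {"user_id": "alice"}, state)
blocked = runner.call_tool(get_profile, {"user_id": "mallory"}, state)
```

Because the check lives in the runner rather than in any single agent or tool, adding a new agent does not require repeating the policy code.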