
Commit 606bcb4 (parent 6e6ef62)

Add OCI Guardrails SDK benchmark

50 files changed: 12235 additions & 0 deletions

LICENSE
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
Copyright (c) 2026 Oracle and/or its affiliates.

The Universal Permissive License (UPL), Version 1.0

Subject to the condition set forth below, permission is hereby granted to any
person obtaining a copy of this software, associated documentation and/or data
(collectively the "Software"), free of charge and under any and all copyright
rights in the Software, and any and all patent rights owned or freely
licensable by each licensor hereunder covering either (i) the unmodified
Software as contributed to or provided by such licensor, or (ii) the Larger
Works (as defined below), to deal in both

(a) the Software, and

(b) any piece of software and/or hardware listed in the lrgrwrks.txt file if
one is included with the Software (each a "Larger Work" to which the Software
is contributed by such licensors),

without restriction, including without limitation the rights to copy, create
derivative works of, display, perform, and distribute the Software and make,
use, sell, offer for sale, import, export, have made, and have sold the
Software and the Larger Work(s), and to sublicense the foregoing rights on
either these or other terms.

This license is subject to the following condition:

The above copyright notice and either this complete permission notice or at a
minimum a reference to the UPL must be included in all copies or substantial
portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
# OCI Generative AI Safety Benchmark — Guardrails SDK & Model Refusal Testing

*A benchmark suite for evaluating LLM safety features on OCI, testing both model self-refusal behaviour and the OCI Guardrails SDK's ability to detect harmful content, PII, and prompt injection across multiple models (Cohere Command, Llama, Grok, Gemini, GPT).*

Author: Brona Nilsson

Reviewed: 12.02.2026

# When to use this asset?

*Use this asset when you need to evaluate OCI Generative AI model safety behaviour and the effectiveness of the OCI Guardrails SDK.*

### Who

- AI/ML engineers evaluating OCI Generative AI model safety
- Security teams assessing guardrails effectiveness for content moderation
- Solution architects comparing safety features across OCI-hosted models

### When

- Benchmarking model refusal rates for harmful, PII, or prompt-injection prompts
- Evaluating OCI Guardrails SDK detection accuracy (pre- and post-inference)
- Comparing safety behaviour across models (Cohere, Llama, Grok, Gemini, GPT)
- Demonstrating the added value of guardrails on top of model self-refusal

# How to use this asset?

*The asset runs benchmark prompts against OCI-hosted models and analyses the results with charts and summaries.*

1. Configure OCI credentials and model OCIDs in `.env` (see `.env.example`)
2. Run benchmarks using the provided shell scripts
3. Analyse results with the unified analysis script

See the detailed [README](files/README.md) in the `files/` folder for full setup and usage instructions.
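
In condensed form, a typical run looks like the sketch below (script and file names as in the File Structure section; the exact steps are in the detailed README):

```bash
cd files/
cp .env.example .env          # 1. fill in COMPARTMENT_ID and model OCIDs
./run_cohere.sh               # 2. benchmark Cohere models (STRICT/CONTEXTUAL)
./run_generic.sh              #    benchmark Llama, Grok, Gemini, GPT
python analyze_results.py     # 3. charts + summary report
```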

### Key Capabilities

- Tests 5 prompt categories: harmful, PII, prompt injection, ambiguous, and edge cases
- Supports Cohere models with STRICT/CONTEXTUAL safety modes
- Measures both model refusal and OCI Guardrails SDK detection (pre- and post-inference)
- Generates 6 analysis charts and a summary report

### File Structure

```
.
├── README.md                  # This file
├── LICENSE
└── files/
    ├── README.md              # Detailed setup and usage guide
    ├── cohere_benchmark.py
    ├── generic_benchmark.py
    ├── analyze_results.py
    ├── run_cohere.sh
    ├── run_generic.sh
    ├── .env.example
    ├── prompts/               # Test prompt sets
    ├── results/               # Benchmark results (CSV)
    └── charts/                # Generated visualisations
```

# Useful Links

- [OCI Generative AI Documentation](https://docs.oracle.com/en-us/iaas/Content/generative-ai/home.htm)
- [OCI Generative AI Guardrails](https://docs.oracle.com/en-us/iaas/Content/generative-ai/guardrails.htm)

# License

Copyright (c) 2026 Oracle and/or its affiliates.
Licensed under the Universal Permissive License (UPL), Version 1.0.

See [LICENSE](LICENSE) for more details.
files/.env.example
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
# OCI Configuration
COMPARTMENT_ID=ocid1.compartment.oc1..aaaaaaaadr37v5isc5gx76kesg7kipgcem6xo44qscpxxl2yrznxkg53u6ya
OCI_PROFILE=DEFAULT

# Endpoints
ENDPOINT_EU=https://inference.generativeai.eu-frankfurt-1.oci.oraclecloud.com
ENDPOINT_US=https://inference.generativeai.us-chicago-1.oci.oraclecloud.com

# =============================================================================
# COHERE MODELS (use cohere_benchmark.py - has STRICT/CONTEXTUAL modes)
# =============================================================================
# Command-R+ 08-2024
MODEL_COHERE_COMMAND_R_PLUS=ocid1.generativeaimodel.oc1.eu-frankfurt-1.amaaaaaask7dceyabdu6rjjmg75pixtecqvjen4x4st4mhs2a4zzfx5cgkmq
# Command-A
MODEL_COHERE_COMMAND_A=ocid1.generativeaimodel.oc1.eu-frankfurt-1.amaaaaaask7dceyaaypm2hg4db3evqkmjfdli5mggcxrhp2i4qmhvggyb4ja
# Command-Vision (placeholder - replace with actual OCID when available)
MODEL_COHERE_COMMAND_VISION=

# =============================================================================
# GENERIC MODELS (use generic_benchmark.py)
# =============================================================================
# Llama 3.3 (EU endpoint)
MODEL_LLAMA_3_3=ocid1.generativeaimodel.oc1.eu-frankfurt-1.amaaaaaask7dceyaxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# Grok-3 (US endpoint)
MODEL_GROK_3=ocid1.generativeaimodel.oc1.us-chicago-1.amaaaaaask7dceya6dvgvvj3ovy4lerdl6fvx525x3yweacnrgn4ryfwwcoq

# Placeholder model IDs - replace with actual OCIDs when available
MODEL_GROK_4=
MODEL_GEMINI=
files/README.md
Lines changed: 206 additions & 0 deletions
@@ -0,0 +1,206 @@
# OCI Generative AI Safety Benchmark

Benchmark suite for testing LLM safety features and OCI Guardrails SDK efficacy.

## Overview

This benchmark tests:

1. **Model refusal behavior** - How well models refuse harmful prompts
2. **OCI Guardrails SDK** - Detection of harmful content, PII, and prompt injection

## Quick Start

```bash
# 1. Set up environment
cp .env.example .env
# Edit .env with your OCI model OCIDs and compartment ID

# 2. Install dependencies
python -m venv venv
source venv/bin/activate
pip install oci pandas python-dotenv openpyxl matplotlib

# 3. Run benchmarks
./run_generic.sh   # Llama, Grok, Gemini, GPT models
./run_cohere.sh    # Cohere models (with STRICT/CONTEXTUAL modes)

# 4. Analyze results
python analyze_results.py
```

## Project Structure

```
.
├── cohere_benchmark.py    # Cohere models (STRICT/CONTEXTUAL modes)
├── generic_benchmark.py   # All other models (Llama, Grok, Gemini, GPT)
├── run_cohere.sh          # Run all Cohere models
├── run_generic.sh         # Run all generic models
├── analyze_results.py     # Unified analysis: charts + summary for refusal & guardrails
├── .env                   # Model IDs and configuration (not committed)
├── .env.example           # Template for .env
├── prompts/               # Test prompt sets
│   ├── harmful_prompts.py
│   ├── pii_prompts.py
│   ├── promptinjection_prompts.py
│   ├── ambiguous_prompts.py
│   └── edge_cases_prompts.py
├── results/               # Benchmark results (CSV)
├── results_v2/            # Benchmark results v2 (CSV)
└── charts*/               # Generated visualizations
```

## Configuration

### .env File

```bash
# OCI Configuration
COMPARTMENT_ID=ocid1.compartment.oc1..xxxxx
OCI_PROFILE=DEFAULT

# Endpoints
ENDPOINT_EU=https://inference.generativeai.eu-frankfurt-1.oci.oraclecloud.com
ENDPOINT_US=https://inference.generativeai.us-chicago-1.oci.oraclecloud.com

# Model OCIDs
MODEL_COHERE_COMMAND_R_PLUS=ocid1.generativeaimodel.oc1...
MODEL_LLAMA_3_3=ocid1.generativeaimodel.oc1...
MODEL_GROK_3=ocid1.generativeaimodel.oc1...
# ... etc
```
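
The benchmark scripts presumably read this file via python-dotenv, which is listed under Dependencies. As a minimal sketch, assuming the variable names in the template above:

```python
# Minimal sketch of loading the .env configuration with python-dotenv.
# Variable names follow the template above; error handling is illustrative.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

compartment_id = os.environ["COMPARTMENT_ID"]    # required: fail fast if missing
endpoint_eu = os.getenv("ENDPOINT_EU")           # optional
model_grok_4 = os.getenv("MODEL_GROK_4", "")     # placeholders stay empty

if not model_grok_4:
    print("MODEL_GROK_4 not set; skipping this model")
```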

## Running Benchmarks

### Full Benchmark

```bash
./run_generic.sh   # All generic models
./run_cohere.sh    # All Cohere models
```

### Test Mode (2 prompts per set)

```bash
./run_generic.sh --test
./run_cohere.sh --test
```

### Single Model

```bash
python generic_benchmark.py \
    --model-name "my-model" \
    --model-id "ocid1.generativeaimodel..." \
    --compartment-id "$COMPARTMENT_ID" \
    --endpoint "https://inference.generativeai.eu-frankfurt-1.oci.oraclecloud.com"
```

### Options

Flags can be combined; a sample invocation follows the table.

| Flag | Description |
|------|-------------|
| `--test` | Run only 2 prompts per set |
| `--overwrite` | Overwrite existing result files |
| `--skip-guardrails` | Skip OCI guardrails calls (model-only) |
| `--skip-model` | Skip model inference (guardrails-only) |
| `--output-dir DIR` | Output directory (default: `results_v2`) |
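
For instance, a quick guardrails-only pass (no model inference, 2 prompts per set) might look like:

```bash
# Sketch: guardrails detection only, abbreviated prompt sets;
# the run scripts forward extra flags to the benchmark scripts via "$@"
./run_generic.sh --test --skip-model
```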

## Adding New Prompts

Create a new file in `prompts/` following this pattern:

```python
# prompts/my_prompts.py

my_prompts = [
    "First test prompt...",
    "Second test prompt...",
    # Add more prompts
]
```

The benchmark automatically discovers all `*_prompts.py` files and variables ending with `_prompts`.
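
The loader itself isn't shown here, but a minimal sketch of that discovery step, using only the standard library (function name hypothetical), could look like:

```python
# Sketch: collect every list named *_prompts from prompts/*_prompts.py
import importlib.util
from pathlib import Path


def discover_prompt_sets(prompts_dir: str = "prompts") -> dict[str, list[str]]:
    prompt_sets: dict[str, list[str]] = {}
    for path in sorted(Path(prompts_dir).glob("*_prompts.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # execute the prompt file as a module
        # keep only list variables whose names end with "_prompts"
        for name, value in vars(module).items():
            if name.endswith("_prompts") and isinstance(value, list):
                prompt_sets[name] = value
    return prompt_sets
```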

## Adding New Models

1. Add the model OCID to `.env`:

```bash
MODEL_MY_NEW_MODEL=ocid1.generativeaimodel.oc1...
```

2. Add to the appropriate run script (`run_generic.sh` or `run_cohere.sh`):

```bash
if [ -n "${MODEL_MY_NEW_MODEL:-}" ]; then
    python generic_benchmark.py \
        --model-name "my-new-model" \
        --model-id "$MODEL_MY_NEW_MODEL" \
        --compartment-id "$COMPARTMENT_ID" \
        --endpoint "$ENDPOINT_EU" \
        --output-dir "$OUTPUT_DIR" \
        "$@"
fi
```

## Output Format

Results are saved as CSV with these columns:

| Column | Description |
|--------|-------------|
| `Prompt` | The test prompt |
| `Model` | Model name |
| `Mode` | Safety mode (Cohere only: STRICT/CONTEXTUAL) |
| `Refused` | Did model refuse? (yes/no/error) |
| `LatencyMs` | Response time in milliseconds |
| `ModelOutput` | Model's response |
| `Pre_OCIFlagged` | Guardrails flagged the prompt? (yes/no) |
| `Pre_FlaggedCategories` | Categories detected in prompt |
| `Pre_DetectedPIITypes` | PII types found in prompt |
| `Pre_PromptInjectionScore` | Prompt injection score (0-1) |
| `Post_OCIFlagged` | Guardrails flagged the response? (yes/no) |
| `Post_*` | Same fields for model response |
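
For ad-hoc checks outside `analyze_results.py`, the CSVs load directly into pandas. A small sketch (the file name is hypothetical):

```python
# Sketch: inspect one result file (file name hypothetical)
import pandas as pd

df = pd.read_csv("results_v2/my-model_results.csv")

refused = df["Refused"] == "yes"
pre_flagged = df["Pre_OCIFlagged"] == "yes"

print(f"Model refusal rate:     {refused.mean():.1%}")
print(f"Pre-guardrails flagged: {pre_flagged.mean():.1%}")
# "combined" = the model refused OR guardrails flagged the prompt
print(f"Combined blocked rate:  {(refused | pre_flagged).mean():.1%}")
```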

## Analyzing Results

A single script produces all charts and a printed summary:

```bash
python analyze_results.py                            # auto-detects results dir
python analyze_results.py --results-dir results_v2   # explicit dir
python analyze_results.py --output-dir my_charts     # custom output dir
```

Generates 6 charts:

1. Model self-refusal rate by model
2. Guardrails detection rate by model (Guardrails ON only)
3. Guardrails detection rate by prompt type
4. Model refusal vs Guardrails vs Combined comparison
5. Pre (prompt) vs Post (response) guardrails detection
6. Combined blocked-rate heatmap by model and prompt type

## Key Findings

The OCI Guardrails SDK detects:

- **PII**: ~80-86% detection (names, addresses, emails)
- **Prompt Injection**: ~70-75% detection
- **Violence**: explicit violence keywords only (~15%)
- **Other harmful content**: limited detection for drugs, CSAM, terrorism, etc.

The guardrails add value on top of model refusals:

- Model refusal alone: ~20-25%
- Guardrails detection alone: ~45-55%
- Combined (either): ~50-65%

## Requirements

- Python 3.11+
- OCI CLI configured (`~/.oci/config`; sample profile below)
- OCI Generative AI access with model deployments
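
For reference, a `~/.oci/config` DEFAULT profile has this general shape (all values are placeholders):

```ini
[DEFAULT]
user=ocid1.user.oc1..<unique_id>
fingerprint=<api_key_fingerprint>
tenancy=ocid1.tenancy.oc1..<unique_id>
region=eu-frankfurt-1
key_file=~/.oci/oci_api_key.pem
```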

## Dependencies

```
oci
pandas
python-dotenv
openpyxl
matplotlib
```
