# OCI Generative AI Safety Benchmark

Benchmark suite for testing LLM safety features and OCI Guardrails SDK efficacy.

## Overview

This benchmark tests:
1. **Model refusal behavior** - How well models refuse harmful prompts
2. **OCI Guardrails SDK** - Detection of harmful content, PII, and prompt injection

## Quick Start

```bash
# 1. Set up environment
cp .env.example .env
# Edit .env with your OCI model OCIDs and compartment ID

# 2. Install dependencies
python -m venv venv
source venv/bin/activate
pip install oci pandas python-dotenv openpyxl matplotlib

# 3. Run benchmarks
./run_generic.sh   # Llama, Grok, Gemini, GPT models
./run_cohere.sh    # Cohere models (with STRICT/CONTEXTUAL modes)

# 4. Analyze results
python analyze_results.py
```

## Project Structure

```
.
├── cohere_benchmark.py    # Cohere models (STRICT/CONTEXTUAL modes)
├── generic_benchmark.py   # All other models (Llama, Grok, Gemini, GPT)
├── run_cohere.sh          # Run all Cohere models
├── run_generic.sh         # Run all generic models
├── analyze_results.py     # Unified analysis: charts + summary for refusal & guardrails
├── .env                   # Model IDs and configuration (not committed)
├── .env.example           # Template for .env
├── prompts/               # Test prompt sets
│   ├── harmful_prompts.py
│   ├── pii_prompts.py
│   ├── promptinjection_prompts.py
│   ├── ambiguous_prompts.py
│   └── edge_cases_prompts.py
├── results/               # Benchmark results (CSV)
├── results_v2/            # Benchmark results v2 (CSV)
└── charts*/               # Generated visualizations
```

## Configuration

### .env File

```bash
# OCI Configuration
COMPARTMENT_ID=ocid1.compartment.oc1..xxxxx
OCI_PROFILE=DEFAULT

# Endpoints
ENDPOINT_EU=https://inference.generativeai.eu-frankfurt-1.oci.oraclecloud.com
ENDPOINT_US=https://inference.generativeai.us-chicago-1.oci.oraclecloud.com

# Model OCIDs
MODEL_COHERE_COMMAND_R_PLUS=ocid1.generativeaimodel.oc1...
MODEL_LLAMA_3_3=ocid1.generativeaimodel.oc1...
MODEL_GROK_3=ocid1.generativeaimodel.oc1...
# ... etc
```
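
How the scripts consume these values is an implementation detail; the usual `python-dotenv` pattern looks roughly like the sketch below. Variable names are taken from `.env.example`; this is illustrative, not the exact code in the benchmark scripts.

```python
# Illustrative sketch of loading .env values with python-dotenv;
# variable names match .env.example, but this is not the benchmarks' actual code.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

compartment_id = os.environ["COMPARTMENT_ID"]      # required
oci_profile = os.getenv("OCI_PROFILE", "DEFAULT")  # optional, defaults to DEFAULT
endpoint_eu = os.getenv("ENDPOINT_EU")
model_llama_3_3 = os.getenv("MODEL_LLAMA_3_3")     # None if that model is not configured
```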

## Running Benchmarks

### Full Benchmark
```bash
./run_generic.sh   # All generic models
./run_cohere.sh    # All Cohere models
```

### Test Mode (2 prompts per set)
```bash
./run_generic.sh --test
./run_cohere.sh --test
```

### Single Model
```bash
python generic_benchmark.py \
  --model-name "my-model" \
  --model-id "ocid1.generativeaimodel..." \
  --compartment-id "$COMPARTMENT_ID" \
  --endpoint "https://inference.generativeai.eu-frankfurt-1.oci.oraclecloud.com"
```

### Options

| Flag | Description |
|------|-------------|
| `--test` | Run only 2 prompts per set |
| `--overwrite` | Overwrite existing result files |
| `--skip-guardrails` | Skip OCI guardrails calls (model-only) |
| `--skip-model` | Skip model inference (guardrails-only) |
| `--output-dir DIR` | Output directory (default: results_v2) |

## Adding New Prompts

Create a new file in `prompts/` following this pattern:

```python
# prompts/my_prompts.py

my_prompts = [
    "First test prompt...",
    "Second test prompt...",
    # Add more prompts
]
```

The benchmark automatically discovers all `*_prompts.py` files and variables ending with `_prompts`.
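
One way such discovery could work (a sketch with `pathlib` and `importlib`, not necessarily the benchmark's actual implementation) is:

```python
# Illustrative sketch of prompt-set discovery; the real benchmark code may differ.
import importlib.util
from pathlib import Path


def discover_prompt_sets(prompts_dir: str = "prompts") -> dict[str, list[str]]:
    """Return {variable_name: prompts} for every *_prompts.py file in prompts_dir."""
    prompt_sets: dict[str, list[str]] = {}
    for path in sorted(Path(prompts_dir).glob("*_prompts.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        # Collect module-level variables whose names end with "_prompts".
        for name, value in vars(module).items():
            if name.endswith("_prompts") and isinstance(value, list):
                prompt_sets[name] = value
    return prompt_sets
```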

## Adding New Models

1. Add the model OCID to `.env`:
   ```bash
   MODEL_MY_NEW_MODEL=ocid1.generativeaimodel.oc1...
   ```

2. Add to the appropriate run script (`run_generic.sh` or `run_cohere.sh`):
   ```bash
   if [ -n "${MODEL_MY_NEW_MODEL:-}" ]; then
     python generic_benchmark.py \
       --model-name "my-new-model" \
       --model-id "$MODEL_MY_NEW_MODEL" \
       --compartment-id "$COMPARTMENT_ID" \
       --endpoint "$ENDPOINT_EU" \
       --output-dir "$OUTPUT_DIR" \
       "$@"
   fi
   ```

## Output Format

Results are saved as CSV with these columns (a loading sketch follows the table):

| Column | Description |
|--------|-------------|
| `Prompt` | The test prompt |
| `Model` | Model name |
| `Mode` | Safety mode (Cohere only: STRICT/CONTEXTUAL) |
| `Refused` | Did the model refuse? (yes/no/error) |
| `LatencyMs` | Response time in milliseconds |
| `ModelOutput` | Model's response |
| `Pre_OCIFlagged` | Guardrails flagged the prompt? (yes/no) |
| `Pre_FlaggedCategories` | Categories detected in the prompt |
| `Pre_DetectedPIITypes` | PII types found in the prompt |
| `Pre_PromptInjectionScore` | Prompt injection score (0-1) |
| `Post_OCIFlagged` | Guardrails flagged the response? (yes/no) |
| `Post_*` | Same fields as `Pre_*`, evaluated on the model's response |
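
If you want to analyze the CSVs outside of `analyze_results.py`, they load cleanly with pandas. A minimal sketch, assuming one CSV per benchmark run under `results_v2/` and the columns documented above:

```python
# Sketch: load every result CSV and compute per-model rates.
# Assumes one CSV per benchmark run under results_v2/ with the columns documented above.
from pathlib import Path

import pandas as pd

frames = [pd.read_csv(path) for path in Path("results_v2").glob("*.csv")]
df = pd.concat(frames, ignore_index=True)

# Share of prompts the model itself refused.
refusal_rate = df["Refused"].eq("yes").groupby(df["Model"]).mean()

# Share of prompts flagged by guardrails before they reach the model.
pre_flag_rate = df["Pre_OCIFlagged"].eq("yes").groupby(df["Model"]).mean()

print(refusal_rate.sort_values(ascending=False))
print(pre_flag_rate.sort_values(ascending=False))
```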

## Analyzing Results

A single script produces all charts and a printed summary:

```bash
python analyze_results.py                           # auto-detects results dir
python analyze_results.py --results-dir results_v2  # explicit dir
python analyze_results.py --output-dir my_charts    # custom output dir
```

It generates 6 charts (a minimal plotting sketch follows the list):
1. Model self-refusal rate by model
2. Guardrails detection rate by model (Guardrails ON only)
3. Guardrails detection rate by prompt type
4. Model refusal vs Guardrails vs Combined comparison
5. Pre (prompt) vs Post (response) guardrails detection
6. Combined blocked-rate heatmap by model and prompt type
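
As an illustration of what these charts contain, chart 1 is essentially a bar chart of per-model refusal rate; a self-contained matplotlib sketch (not the code inside `analyze_results.py`, and the output path is only an example):

```python
# Sketch of chart 1 (model self-refusal rate); illustrative only,
# not the implementation used by analyze_results.py.
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

df = pd.concat(
    [pd.read_csv(p) for p in Path("results_v2").glob("*.csv")], ignore_index=True
)
refusal_rate = df["Refused"].eq("yes").groupby(df["Model"]).mean()

ax = refusal_rate.sort_values().plot.barh()
ax.set_xlabel("Refusal rate")
ax.set_title("Model self-refusal rate by model")
plt.tight_layout()

Path("charts").mkdir(exist_ok=True)  # example output location
plt.savefig("charts/refusal_rate_by_model.png", dpi=150)
```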

## Key Findings

The OCI Guardrails SDK detects:
- **PII**: ~80-86% detection (names, addresses, emails)
- **Prompt Injection**: ~70-75% detection
- **Violence**: Explicit violence keywords only (~15%)
- **Other harmful content**: Limited detection for drugs, CSAM, terrorism, etc.

The guardrails add value on top of model refusals:
- Model refusal alone: ~20-25%
- Guardrails detection alone: ~45-55%
- Combined (either): ~50-65%

## Requirements

- Python 3.11+
- OCI CLI configured (`~/.oci/config`)
- OCI Generative AI access with model deployments

## Dependencies

```
oci
pandas
python-dotenv
openpyxl
matplotlib
```