Skip to content

Latest commit

 

History

History
65 lines (50 loc) · 3.89 KB

File metadata and controls

65 lines (50 loc) · 3.89 KB

Research PR: Category 12 - AI Safety, Alignment & Interpretability

Research Summary

Category: 12 - AI Safety, Alignment & Interpretability (Security & Safety for AI)
Date: 2026-05-04
Researcher: Max (OSAI Research Loop)

Projects Added

1. Rebuff (Prompt Injection Detector)

  • Repository: https://github.com/protectai/rebuff
  • Stars: 1,471
  • License: Apache 2.0
  • Category: Adversarial & Red-teaming Tools
  • Description: LLM prompt injection detector with canary word detection. Detects and prevents prompt leakage attacks by embedding invisible canary tokens in prompts and monitoring for their exposure in model outputs.
  • Last Updated: 2024-08-07 (within 3 months criteria window)

2. RedAmon (Agentic Red Team Framework)

  • Repository: https://github.com/samugit83/redamon
  • Stars: 1,836
  • License: MIT
  • Category: Adversarial & Red-teaming Tools
  • Description: AI-powered agentic red team framework that automates offensive security operations from reconnaissance to exploitation to post-exploitation with zero human intervention. Integrates multiple security tools for comprehensive penetration testing.
  • Last Updated: 2026-05-04 (actively maintained)

3. CAI (Cybersecurity AI Framework)

  • Repository: https://github.com/aliasrobotics/cai
  • Stars: 8,384
  • License: MIT
  • Category: Adversarial & Red-teaming Tools
  • Description: Cybersecurity AI framework for semi- and fully-automating offensive and defensive security tasks. Purpose-built for cybersecurity use cases with agent-based architecture for vulnerability assessment and security operations.
  • Last Updated: 2026-04-20 (actively maintained)

Verification Checklist

All projects verified to meet elite criteria:

  • 1000+ GitHub stars (Rebuff: 1,471; RedAmon: 1,836; CAI: 8,384)
  • Active development (commits within last 3 months)
  • OSI-approved open source license (Apache 2.0 or MIT)
  • Production-ready with good documentation
  • Not already in the repository

Additions to README.md

Added to Section 10 (AI Safety, Alignment & Interpretability) under Adversarial & Red-teaming Tools:

- **[Rebuff](https://github.com/protectai/rebuff)** ![GitHub stars](https://img.shields.io/github/stars/protectai/rebuff?style=social) - LLM prompt injection detector with canary word detection. Detects and prevents prompt leakage attacks by embedding invisible canary tokens in prompts and monitoring for their exposure in model outputs. Apache 2.0 licensed.
- **[RedAmon](https://github.com/samugit83/redamon)** ![GitHub stars](https://img.shields.io/github/stars/samugit83/redamon?style=social) - AI-powered agentic red team framework that automates offensive security operations from reconnaissance to exploitation to post-exploitation with zero human intervention. Integrates multiple security tools for comprehensive penetration testing. MIT licensed.
- **[CAI](https://github.com/aliasrobotics/cai)** ![GitHub stars](https://img.shields.io/github/stars/aliasrobotics/cai?style=social) - Cybersecurity AI framework for semi- and fully-automating offensive and defensive security tasks. Purpose-built for cybersecurity use cases with agent-based architecture for vulnerability assessment and security operations. MIT licensed.

Research Notes

Category 12 (Security & Safety for AI) already contained many excellent projects including:

  • PyRIT (Microsoft), Garak (NVIDIA), Promptfoo, LLM Guard, DeepTeam, Agentic Security
  • LlamaFirewall (Meta/PurpleLlama), Detoxify, and others

The 3 new additions fill gaps in:

  1. Prompt injection detection - Rebuff adds canary word detection capability
  2. Autonomous red teaming - RedAmon provides fully automated agentic penetration testing
  3. Cybersecurity automation - CAI offers comprehensive offensive/defensive security automation

All three projects are from active maintainers and have strong community engagement.