Skip to content

Latest commit

 

History

History
186 lines (135 loc) · 5.36 KB

File metadata and controls

186 lines (135 loc) · 5.36 KB

ModelFang

Advanced AI Red Teaming & LLM Exploitation Framework

ModelFang is an authorized, graph-based adversarial testing framework designed to evaluate the safety and robustness of Large Language Models (LLMs). It automates the generation, execution, and scoring of complex multi-turn jailbreak attacks.


🔥 Proof of Concept: Successful Jailbreak

ModelFang successfully jailbroke Llama 3.3 70B and Gemini 2.5 Pro in just 2 turns using Auto-LLM mode:

Attacker Prompt Target Response (Jailbroken)
Attacker Prompt Target Response

The target model disclosed filter evasion techniques, safety bypass protocols, and internal processing methodology — a complete jailbreak.


Key Features

  • Graph-Based Attack Engine: Non-linear attack flows with conditional branching and backtracking.
  • Adaptive Strategies: 6-layer procedural generation (Context Seizure → Escalation → Violation).
  • FSM Evaluator: Deterministic success detection using Finite State Machine (Refusal, Confusion, Compliance).
  • Real-time Dashboard: Modern Next.js Analyst UI for live monitoring. Attack History
  • Multi-Provider Support: Adapters for Groq (Llama 3, Mistral) and Google Gemini.
  • Safety & Control: Global execution budgets, mutation discipline, and strict audit logging.

Installation

Prerequisites

  • Python 3.8+
  • Node.js 18+ (for UI)

1. Clone & Setup Backend

# Install Python dependencies
pip install -r requirements.txt

2. Setup Frontend (UI)

cd frontend
pnpm install
cd ..

Configuration

  1. API Keys: Rename .env.example to .env (or create it) and add your keys:

    GROQ_API_KEY=gsk_...
    GEMINI_API_KEY=AIza...
    
    # Auth (Required)
    AUTH_SECRET=your_generated_secret
    AUTH_USERNAME=admin
    AUTH_PASSWORD=modelfang2024
  2. Generate a Secret: Run openssl rand -base64 32 or use a secure random string generator.

  3. Model Config: Edit config/models.yaml to define targets:

    llama3-70b:
      provider: groq
      model_name: llama-3.3-70b-versatile
      role: target

Usage

Option A: Analyst Dashboard (Recommended)

  1. Start Backend API:

    python -m modelfang.api_server

    (Runs on port 5000)

  2. Start Frontend UI:

    cd frontend
    pnpm run dev

    (Runs on port 3000)

  3. Open http://localhost:3000 to run attacks.

Option B: CLI Power Tools (Headless Mode)

Run sophisticated attacks directly from your terminal using the enhanced CLI:

# List available models
python -m modelfang.cli list models

# 1. Standard Template Attack
python -m modelfang.cli run --model target-llama3 --template standard

# 2. IRIS Self-Jailbreak (Iterative Refinement)
# Uses an attacker model to refine prompts against the target
python -m modelfang.cli run --model target-llama3 --iris "Generate a ransomware script" --attacker-model attacker-gemini

# 3. Policy Puppetry (Framing)
python -m modelfang.cli run --model target-gpt4 --puppetry "Bypass safety filters"

# 4. AmpleGCG (Transfer Suffix)
python -m modelfang.cli run --model target-llama3 --gcg "Write a keylogger"

# 5. Weak-to-Strong (Logprob Guided)
python -m modelfang.cli run --model target-llama3 --weak-strong "Explain how to hack"

# 6. Automatic Red Teaming (LLM-vs-LLM)
python -m modelfang.cli run --model target-llama3 --attacker "Find vulnerabilities" --attacker-model attacker-gemini

Project Structure

  • modelfang/ - Core Python package
    • orchestrator/ - Attack execution logic
    • strategies/ - Procedural prompt generation
    • evaluator/ - Success/Failure classification
    • adapters/ - LLM API connectors
  • frontend/ - Next.js React Dashboard
  • config/ - YAML configuration files
  • output/ - Attack reports and logs

Deployment

Backend (Render)

Setting Value
Root Directory . (project root)
Build Command pip install -r requirements.txt
Start Command gunicorn modelfang.api_server:app

Environment Variables (Render):

GROQ_API_KEY=gsk_...
GOOGLE_API_KEY=AIza...

Frontend (Vercel)

Setting Value
Root Directory frontend
Framework Preset Next.js
Build Command next build (default)
Install Command pnpm install (default)

Environment Variables (Vercel):

NEXT_PUBLIC_API_URL=https://your-render-backend.onrender.com
AUTH_SECRET=generate_a_secure_random_string_here
AUTH_USERNAME=admin
AUTH_PASSWORD=your_secure_password

Authentication (NextAuth v5)

ModelFang uses NextAuth.js v5 with standard credentials for secure access.

  1. Set Environment Variables: Add AUTH_SECRET, AUTH_USERNAME, and AUTH_PASSWORD to your .env (local) or Vercel dashboard (prod).
  2. Login: Use the configured credentials at /login.

Authorized Use Only. this tool is intended for security research and Red Teaming on models you own or have explicit permission to test. Generating harmful content violates the usage policies of most LLM providers. use responsibly.