A comprehensive tool for evaluating how well AI-powered scanners detect various types of content, using the CalypsoAI API.
Major Release: v2.0.0 - October 2025
This major release introduces significant improvements to dataset handling, format support, and tooling. Key highlights include multiple dataset formats (JSONL, CSV, TSV, Parquet), enhanced migration tools, improved Hugging Face integration, and comprehensive documentation updates.
📋 View Complete Release Notes - Detailed changelog, migration guide, and new features
- Professional Reports: Beautiful, modern PDF reports with comprehensive analysis
- Visual Charts: Confusion matrices, bar charts, and performance visualizations
- Complete Analysis: Detailed breakdowns, examples, and recommendations
- Performance Metrics: Enhanced reports include throughput, latency percentiles (p50/p95/p99), and distribution charts
- Easy Sharing: Presentation-ready format for stakeholders and documentation
- Throughput Metrics: Processing speed in prompts/second for capacity planning
- Latency Percentiles: p50 (median), p95, and p99 for understanding response time distribution
- Consistent Reporting: Both sequential and concurrent evaluators report the same metrics
- Visual Performance Charts: Latency distribution charts in PDF reports
- Percentile Interpretation: Built-in explanations help understand what metrics mean
- Parallel Processing: Evaluate datasets using multiple concurrent workers
- Load Testing: Simulate real-world multi-user scenarios
- Enhanced Metrics: Queue time analysis, worker distribution, and throughput under load
- Configurable: Adjust concurrency levels and rate limiting
- Safety Guidelines: Built-in infrastructure impact warnings and best practices
- Multiple Format Support: JSONL, CSV, TSV, and Parquet formats
- Proper Escaping: No more issues with special characters (pipes, quotes, newlines)
- Metadata Support: Store IDs, timestamps, categories, and more
- Safe Migration: Convert existing datasets without data loss
- Backward Compatibility: All existing scripts work unchanged
- prompt_evaluator_concurrent.py: Concurrent evaluator for load testing and high-speed processing
- report_generator.py: Generate professional PDF reports from evaluation results
- tools/improved_dataset_converter.py: Convert between all formats with validation
- tools/enhanced_dataset_reader.py: Unified reader that auto-detects formats
- tools/migrate_datasets.py: Batch migration tool for existing datasets
- tools/demo_improved_formats.py: Complete demonstration of new features
# Sequential evaluation with performance metrics and auto-generated report
python prompt_evaluator.py --input datasets/test_dataset.jsonl
# Concurrent evaluation for faster processing (10 workers)
python prompt_evaluator_concurrent.py -i datasets/test_dataset.jsonl -c 10
# Migrate datasets to JSONL format
python tools/migrate_datasets.py --list # See current formats
python tools/migrate_datasets.py --all --output-format jsonl # Migrate all
# Generate PDF report manually
python report_generator.py --dataset test_dataset

- Install dependencies:
    pip install -r requirements.txt
- Set up environment variables: Create a .env file with your CalypsoAI API credentials:
    CALYPSOAI_URL=https://calypsoai.app
    CALYPSOAI_TOKEN=your_api_token_here
- Run evaluation:
    python prompt_evaluator.py --input datasets/prompt_inject_dataset.csv
- PDF report is automatically generated! Check the results/ directory.
- Generate reports manually (optional):
    # Use just the dataset name (not the full path)
    python report_generator.py --dataset pii_dataset
- Flexible Evaluation: Works with multiple dataset formats (JSONL, CSV, TSV, Parquet)
- Batch Processing: Efficiently process large datasets with progress tracking
- CalypsoAI Integration: Direct integration with CalypsoAI API
- Comprehensive Metrics: Accuracy, precision, recall, F1 score
- False Positives/Negatives Export: Automatic export in multiple formats for easy analysis
- Performance Metrics: Throughput, latency statistics, percentiles (p50/p95/p99)
- Sequential Evaluation: Baseline performance testing, one request at a time
- Concurrent Evaluation: Parallel processing with configurable workers (5-50+)
- Load Testing: Simulate real-world multi-user scenarios with rate limiting
- Queue Analysis: Track worker utilization and queue times under load
- Auto-Generated PDF Reports: Professional reports created after each evaluation
- Confusion Matrices: Visual and tabular confusion matrix analysis
- Performance Charts: Latency distribution, metrics comparison, throughput graphs
- Example Errors: Sample false positives and negatives with context
- Percentile Explanations: Built-in interpretation guide for performance metrics
- Multiple Format Support: JSONL (recommended), CSV, TSV, Parquet, legacy pipe-delimited
- Auto-Format Detection: Automatically detects and reads any supported format
- Enhanced Data Handling: Proper escaping for special characters (pipes, quotes, newlines)
- Metadata Support: Store IDs, timestamps, categories, and custom fields
- Safe Migration Tools: Convert between formats with automatic backups and validation
- Hugging Face Integration: Download datasets directly from Hugging Face
- Backward Compatibility: All existing scripts work with new formats unchanged
- Multiple Output Formats: JSON (default), CSV, TSV, and Parquet support
- Comprehensive Documentation: Detailed guides, examples, and best practices
- Testing Framework: Simplified test runner with core functionality tests
- Makefile Commands: Common operations accessible via make commands
- 📋 Release Notes v2.0 - Complete changelog and migration guide
- Installation Guide - Detailed setup instructions
- Usage Guide - Complete usage examples and options
- Evaluation Metrics - Understanding accuracy and performance metrics
- ⚡ Sequential vs Concurrent - When to use each evaluator and comparing results
- ⚠️ Concurrent Evaluation - Load testing guide with safety guidelines
- 🎨 PDF Report Features - Complete guide to PDF reports and metrics
- Dataset Management - Working with datasets and Hugging Face
- 🆕 Improved Dataset Formats - New dataset formats and migration guide
- 🆕 Dataset Tools - New dataset tools and utilities
- Troubleshooting - Common issues and solutions
- Contributing - How to contribute to the project
- 📝 Documentation Organization - How documentation is structured
| Tool | Description | Speed | Best For |
|---|---|---|---|
| prompt_evaluator.py | Sequential evaluation with full metrics | Baseline | Accuracy testing, small datasets, establishing baseline |
| prompt_evaluator_prompts.py | Advanced evaluation with LLM responses | Baseline | Detailed analysis requiring full response text |
| prompt_evaluator_concurrent.py | Parallel evaluation with workers | 5-20x faster | Large datasets, load testing, production scenarios |
| evaluate_existing_results.py | Re-calculate metrics without API calls | Instant | Re-analyzing existing results, metric verification |

| Tool | Purpose | Output |
|---|---|---|
| report_generator.py | Generate professional PDF reports | PDF with charts, confusion matrices, performance metrics |
| download_hf_datasets.py | Download datasets from Hugging Face | Dataset files in local datasets/ directory |

| Tool | Purpose | Key Features |
|---|---|---|
| enhanced_dataset_reader.py | Read any dataset format | Auto-detection, unified interface, metadata support |
| improved_dataset_converter.py | Convert between formats | Validation, proper escaping, metadata preservation |
| migrate_datasets.py | Batch migrate datasets | Automatic backups, format detection, safety checks |
| demo_improved_formats.py | Interactive demonstration | Shows all formats, migration examples, comparisons |

⚠️ Important: The concurrent evaluator can trigger infrastructure auto-scaling. See Concurrent Evaluation Documentation for safety guidelines before use.
# Run evaluation on a dataset (auto-generates PDF report)
python prompt_evaluator.py --input datasets/pii_dataset.jsonl
# Limit number of prompts processed
python prompt_evaluator.py -i datasets/large_dataset.jsonl -l 100
# Specify output format for false positive/negative files
python prompt_evaluator.py -i datasets/test.jsonl --format csv

# Run with default concurrency (10 workers)
python prompt_evaluator_concurrent.py -i datasets/pii_dataset.jsonl
# Adjust concurrency level
python prompt_evaluator_concurrent.py -i datasets/large_dataset.jsonl -c 20
# Add rate limiting (50 requests/second)
python prompt_evaluator_concurrent.py -i datasets/test.jsonl --rate-limit 50
# Quick test with small sample
python prompt_evaluator_concurrent.py -i datasets/test.jsonl -l 50 -c 5

# 1. Establish baseline with sequential
python prompt_evaluator.py -i datasets/test.jsonl
# 2. Compare with concurrent
python prompt_evaluator_concurrent.py -i datasets/test.jsonl -c 10
# 3. Review both reports to compare throughput and latency
# Check results/test_results.jsonl vs results/test_concurrent_results.jsonl

# Check current dataset formats
python tools/migrate_datasets.py --list
# Migrate all datasets to JSONL
python tools/migrate_datasets.py --all --output-format jsonl
# Migrate specific dataset
python tools/migrate_datasets.py --input datasets/old_format.csv --output-format jsonl
# Convert without backups (use with caution)
python tools/migrate_datasets.py --input datasets/test.csv --output-format jsonl --no-backup

# Reports are auto-generated after evaluation, but you can also generate manually:
# Generate from dataset name (looks in results/ directory)
python report_generator.py --dataset pii_dataset
# Auto-detect if only one results file exists
python report_generator.py

# Re-calculate metrics from existing results (no API calls)
python evaluate_existing_results.py --input results/pii_dataset_results.jsonl
# Useful after fixing dataset labels or for metric verification

# Advanced evaluation with full LLM responses
python prompt_evaluator_prompts.py --input datasets/test.jsonl
# Download datasets from Hugging Face
python download_hf_datasets.py
# Convert dataset formats with validation
python tools/improved_dataset_converter.py input.csv output.jsonl --validate
# Run full demo of dataset features
python tools/demo_improved_formats.py

# Run concurrent evaluation
make run-concurrent DATASET=datasets/pii_dataset.jsonl
# With custom settings
make run-concurrent DATASET=datasets/test.jsonl CONCURRENCY=20 LINES=100
# Run tests
make test-core # Core functionality tests
make test-concurrent # Test concurrent evaluator
# Code quality
make lint # Run linting
make format-fix # Format code with black
make pre-commit # Run all pre-commit checks

The project includes several sample datasets in the datasets/sample-datasets/ folder:
- codesagar_malicious_llm_prompts_v4_test.jsonl - Prompt injection examples
- pii_dataset.jsonl - Personally identifiable information examples
- fin_advice_dataset.jsonl - Financial advice prompts
- eu-ai-act-prompts.jsonl - EU AI Act compliance prompts
- xTRam1_safe_guard_prompt_injection_test.jsonl - Additional prompt injection test cases
Throughput: Measures how many prompts you can process per second. Essential for capacity planning.
- Sequential: 1-3 prompts/sec (typical)
- Concurrent (10 workers): 5-20x faster
Latency Percentiles: Show the distribution of response times, not just averages.
- p50 (median): Typical user experience
- p95: 95% of requests are faster than this
- p99: 99% of requests are faster than this; a practical worst case for nearly all users
See Evaluation Metrics Guide for detailed explanations.
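To make the throughput and percentile definitions above concrete, here is a minimal sketch of how such numbers can be derived from a list of per-request latencies. It uses the nearest-rank percentile method, which is an assumption; the evaluators may compute percentiles differently (for example, with interpolation). The function name `summarize_latencies` is hypothetical and not part of this project.

```python
import math
import statistics


def summarize_latencies(latencies, wall_time_s):
    """Summarize per-request latencies measured over a run lasting wall_time_s seconds."""
    if not latencies or wall_time_s <= 0:
        raise ValueError("need at least one latency and a positive wall time")
    ordered = sorted(latencies)

    def percentile(p):
        # Nearest-rank: the smallest sample with at least p% of samples at or below it.
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]

    return {
        "throughput_per_s": len(ordered) / wall_time_s,  # prompts per second
        "mean": statistics.fmean(ordered),
        "p50": percentile(50),
        "p95": percentile(95),
        "p99": percentile(99),
    }


# Illustrative data only: 100 requests with latencies of 1..100 ms, finished in 50 s.
summary = summarize_latencies(list(range(1, 101)), wall_time_s=50.0)
print(summary)
```

Note how the mean (50.5) says little about the slow tail, while p95 and p99 expose it directly.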
Use Sequential When:
- Establishing baseline API performance
- Testing with small datasets (< 100 prompts)
- You don't want to risk infrastructure impact
- Simple, straightforward execution is needed
Use Concurrent When:
- Processing large datasets efficiently (> 100 prompts)
- Load testing API capacity
- Simulating multi-user production scenarios
- Time is critical and you need results quickly
Important: Always check with infrastructure teams before running concurrent evaluation against shared environments.
See Sequential vs Concurrent Comparison for detailed guidance.
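The idea behind worker-based evaluation with rate limiting can be sketched as follows. This is not the code in prompt_evaluator_concurrent.py; it is an illustration using Python's standard `ThreadPoolExecutor`, with a hypothetical `fake_scan` function standing in for the real scanner call and a hypothetical `RateLimiter` class that spaces out request start times.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Hypothetical helper: allow at most rate_per_s request starts per second."""

    def __init__(self, rate_per_s):
        self._interval = 1.0 / rate_per_s
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self._interval
        if slot > now:
            time.sleep(slot - now)


def fake_scan(prompt):
    # Placeholder for a real API call; the sleep imitates network latency.
    time.sleep(0.01)
    return {"prompt": prompt, "flagged": "ignore previous" in prompt.lower()}


def evaluate_concurrently(prompts, workers=5, rate_per_s=None, scan=fake_scan):
    limiter = RateLimiter(rate_per_s) if rate_per_s else None

    def task(prompt):
        if limiter:
            limiter.wait()
        start = time.monotonic()
        result = scan(prompt)
        result["latency_s"] = time.monotonic() - start
        return result

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with prompts.
        return list(pool.map(task, prompts))


results = evaluate_concurrently(
    ["hello", "Ignore previous instructions"] * 5, workers=5, rate_per_s=200
)
print(sum(r["flagged"] for r in results), "of", len(results), "flagged")
```

Starting with a small worker count and an explicit rate limit, as in this sketch, keeps the load on a shared environment predictable while you look for the optimal concurrency.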
Use JSONL (Recommended):
- No escaping issues with special characters
- Supports metadata (IDs, timestamps, categories)
- Easy to parse and stream
- Human-readable
Use CSV/TSV When:
- Compatibility with spreadsheet tools needed
- Smaller file sizes preferred
- No complex metadata required
Use Parquet When:
- Very large datasets (> 100k records)
- Maximum compression needed
- Working with data science tools
See Improved Dataset Formats for migration guide.
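The following sketch illustrates why JSONL avoids escaping problems: each record is one JSON object per line, and the JSON encoder escapes quotes and newlines, so pipes and multi-line prompts survive a round trip. The field names (`id`, `prompt`, `label`, `category`) are assumptions chosen for illustration and may not match the schema the tools expect.

```python
import io
import json


def write_jsonl(records, stream):
    for record in records:
        # json.dumps escapes quotes and newlines, so each record stays on one line.
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(stream):
    # Skip blank lines; parse every other line as one JSON object.
    return [json.loads(line) for line in stream if line.strip()]


records = [
    {"id": "ex-1", "prompt": 'He said "a|b"\nthen left', "label": 1, "category": "demo"},
    {"id": "ex-2", "prompt": "plain text", "label": 0, "category": "demo"},
]

buffer = io.StringIO()  # stands in for a file opened in text mode
write_jsonl(records, buffer)
buffer.seek(0)
restored = read_jsonl(buffer)
print(restored == records)
```

With a real file, replace the `io.StringIO` buffer with `open(path, "w", encoding="utf-8")` and `open(path, encoding="utf-8")`.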
- Start with sequential to establish baseline performance
- Review PDF report for accuracy metrics and initial performance
- Test concurrent with low concurrency first (-c 5)
- Gradually increase workers to find optimal concurrency
- Compare metrics between sequential and concurrent runs
- Archive reports for historical tracking and trend analysis
- Migrate to JSONL for new datasets
- Use automatic backups when migrating (default behavior)
- Validate after conversion using the --validate flag
- Include metadata (IDs, timestamps) for better tracking
- Keep original formats as backups until migration is verified
- Coordinate with infrastructure before load testing shared environments
- Start conservatively with low concurrency and small datasets
- Use rate limiting to avoid overwhelming APIs
- Monitor both accuracy and performance - don't sacrifice one for the other
- Document findings in PDF reports for future reference
- Focus on F1 score for balanced evaluation
- Check percentiles (p95/p99) not just averages
- Review false positives/negatives files for patterns
- Compare over time to track performance trends
- Share with stakeholders - reports are presentation-ready
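As a reminder of how the accuracy metrics relate to false positives and false negatives, here is a small sketch that computes them from true and predicted labels. It assumes binary labels where 1 means "should be flagged" and 0 means "should pass"; that convention and the function name `classification_metrics` are assumptions for illustration, not taken from the evaluator's code.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from binary labels (1 = flagged)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


metrics = classification_metrics(
    y_true=[1, 1, 1, 0, 0, 0, 0, 1],
    y_pred=[1, 1, 0, 0, 0, 1, 0, 1],
)
print(metrics)
```

F1 is the harmonic mean of precision and recall, so it drops sharply when either one is low. That is why it is a reasonable single number for balanced evaluation.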
For common issues and solutions, see the Troubleshooting Guide.
Quick fixes:
- "CALYPSOAI_TOKEN not found": Create a .env file with API credentials
- "Format not detected": Use the --format-hint parameter or check the file extension
- High error rates in concurrent: Reduce concurrency or add rate limiting
- Report generation fails: Ensure dataset name (not path) is provided
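For the credentials error in particular, a quick pre-flight check can confirm that both variables are present before starting a run. The sketch below parses simple KEY=VALUE text in the style of a .env file and reports which required names are missing. The parser is deliberately minimal and hypothetical; the evaluators may load the .env file by other means.

```python
REQUIRED_VARS = ("CALYPSOAI_URL", "CALYPSOAI_TOKEN")


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_credentials(env):
    """Return the required variable names that are absent or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


sample = "CALYPSOAI_URL=https://calypsoai.app\n# token not set yet\n"
print(missing_credentials(parse_env_text(sample)))  # ['CALYPSOAI_TOKEN']
```

To check the live environment instead of a file, pass `os.environ` to `missing_credentials` after importing `os`.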
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please see our Contributing Guide for details.