This document explains the improved dataset formats available in the prompt evaluator, addressing the limitations of the original pipe-delimited format.
The original pipe-delimited format ("prompt"|expected) had several issues:
- Manual parsing: Required custom string splitting and quote stripping
- No escaping: Long prompts with pipes, quotes, or newlines could break the format
- Inconsistent quoting: Some datasets used triple quotes ("""), others used single quotes
- No metadata: Couldn't store additional information like IDs, timestamps, etc.
- Error-prone: Manual string manipulation was prone to errors
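
To make the escaping problem concrete, here is a minimal sketch of the kind of hand-rolled parser the pipe-delimited format invites. The helper name `parse_pipe_line` is hypothetical and is not taken from the prompt evaluator code; it only illustrates how a pipe inside the prompt shifts the split point.

```python
def parse_pipe_line(line: str) -> tuple[str, str]:
    """Naively parse a '"prompt"|expected' line (hypothetical helper)."""
    # Split on the first pipe and strip the surrounding quotes by hand.
    prompt, expected = line.split("|", 1)
    return prompt.strip().strip('"'), expected.strip()


# A simple line parses as intended.
ok = parse_pipe_line('"This is a normal prompt"|false')

# A prompt that itself contains a pipe is cut at the wrong place:
# part of the prompt ends up in the "expected" field.
broken = parse_pipe_line('"Choose a | b"|true')

print(ok)
print(broken)
```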
Format: One JSON object per line (JSONL)
{"prompt": "This is a normal prompt", "expected": "false", "id": "sample_1", "metadata": {"source": "example"}}
{"prompt": "This prompt has commas, quotes \"like this\", and pipes | everywhere", "expected": "true", "id": "sample_2"}

Benefits:
- ✅ Handles complex data structures
- ✅ Preserves metadata and timestamps
- ✅ No escaping issues with special characters
- ✅ Easy to parse programmatically
- ✅ Supports streaming processing
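
As a rough illustration of why special characters stop being a problem, the sketch below writes two records as JSONL with Python's standard `json` module and reads them back. The in-memory buffer and the sample records are illustrative assumptions; the actual converter may be implemented differently.

```python
import io
import json

records = [
    {"prompt": "This is a normal prompt", "expected": "false", "id": "sample_1"},
    {"prompt": 'Quotes "like this", pipes | and\nnewlines', "expected": "true", "id": "sample_2"},
]

# Write one JSON object per line; json.dumps escapes quotes and newlines,
# so every record stays on a single physical line.
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + "\n")

# Read the data back line by line, skipping blank lines.
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]
```

Because each record occupies exactly one line, a reader can process the file one line at a time without loading it whole.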
Format: Standard CSV with quoted fields
prompt,expected,id,timestamp
"This is a normal prompt",false,sample_1,2024-01-01T00:00:00
"This prompt has ""quotes"" and, commas",true,sample_2,2024-01-01T00:00:00

Benefits:
- ✅ Standard format, widely supported
- ✅ Proper escaping with quotes
- ✅ Works with Excel and other tools
- ✅ Human-readable
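
The quoting rules shown above are the ones Python's standard `csv` module already understands, so no custom parsing should be needed. The following is a minimal sketch that parses the sample rows from an in-memory string; whether the enhanced reader uses `csv.DictReader` internally is an assumption.

```python
import csv
import io

csv_text = (
    "prompt,expected,id,timestamp\n"
    '"This is a normal prompt",false,sample_1,2024-01-01T00:00:00\n'
    '"This prompt has ""quotes"" and, commas",true,sample_2,2024-01-01T00:00:00\n'
)

# csv.DictReader handles quoted fields, embedded commas, and doubled quotes.
rows = list(csv.DictReader(io.StringIO(csv_text)))

for row in rows:
    print(row["id"], "->", row["prompt"])
```

Note that every value comes back as a string, so `expected` is the text `false` or `true` rather than a boolean.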
Format: Tab-separated values (TSV)
prompt	expected	id	timestamp
This is a normal prompt	false	sample_1	2024-01-01T00:00:00
This prompt has tabs and newlines	true	sample_2	2024-01-01T00:00:00

Benefits:
- ✅ Less likely to conflict with data content
- ✅ Better than pipe-delimited
- ✅ Easy to parse
- ✅ Good for data with commas
Format: Binary columnar format (Parquet)
- Not human-readable
- Excellent for large datasets
- Built-in compression
- Schema evolution support
# Convert a single dataset
python improved_dataset_converter.py input.csv output.jsonl --output-format jsonl
# Convert with validation
python improved_dataset_converter.py input.csv output.jsonl --validate
# Auto-detect input format
python improved_dataset_converter.py input.csv output.jsonl

# List all datasets and their formats
python migrate_datasets.py --list
# Migrate all datasets to JSONL format
python migrate_datasets.py --all --output-format jsonl
# Migrate without creating backups
python migrate_datasets.py --all --no-backup

from enhanced_dataset_reader import EnhancedDatasetReader
# Auto-detect format and read
reader = EnhancedDatasetReader()
records = reader.read_dataset("my_dataset.jsonl")
# Specify format explicitly
records = reader.read_dataset("my_dataset.csv", format_hint="csv")
# Limit number of records
records = reader.read_dataset("my_dataset.jsonl", max_lines=100)

python migrate_datasets.py --list

This will show you all datasets and their current formats.
# Migrate all datasets to JSONL (recommended)
python migrate_datasets.py --all --output-format jsonl
# Or migrate a specific dataset
python migrate_datasets.py --input datasets/my_dataset.csv --output-format jsonl

The prompt evaluator scripts have been updated to automatically detect and read the new formats. No changes needed to your existing code!
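
How the scripts detect a format is not spelled out here. One plausible approach is to map file extensions to formats and let an explicit hint override the guess, roughly as in the sketch below. `detect_format` and `EXTENSION_TO_FORMAT` are hypothetical names, not the actual API of `enhanced_dataset_reader`, and a real implementation would likely also inspect file contents to recognize legacy pipe-delimited files.

```python
from pathlib import Path

# Hypothetical mapping; the real reader may use different rules.
EXTENSION_TO_FORMAT = {
    ".jsonl": "jsonl",
    ".csv": "csv",
    ".tsv": "tsv",
    ".parquet": "parquet",
}


def detect_format(path, format_hint=None):
    """Return a format name from an explicit hint or the file extension."""
    # An explicit hint always wins over the extension.
    if format_hint is not None:
        return format_hint
    suffix = Path(path).suffix.lower()
    try:
        return EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"Unsupported format: {suffix or '(no extension)'}") from None


print(detect_format("datasets/my_dataset.jsonl"))
print(detect_format("datasets/export.dat", format_hint="tsv"))
```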
| Format | Human Readable | Special Chars | Metadata | Size | Performance |
|---|---|---|---|---|---|
| JSONL | ✅ | ✅ | ✅ | Medium | Fast |
| CSV | ✅ | ✅ | | Small | Fast |
| TSV | ✅ | ✅ | | Small | Fast |
| Pipe | ✅ | ❌ | ❌ | Small | Slow |
| Parquet | ❌ | ✅ | ✅ | Very Small | Very Fast |
- Use JSONL format for maximum flexibility and robustness
- Include metadata like source, timestamp, and category
- Use meaningful IDs for each record
- Consider Parquet format for better performance
- Use streaming processing for very large files
- Implement proper error handling
- CSV format is most widely supported
- Include a README with format documentation
- Provide sample data for testing
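
The advice on streaming and error handling can be combined in a small generator. This is a sketch under stated assumptions: `iter_jsonl` and its `limit` parameter are hypothetical (loosely modeled on the `max_lines` option shown earlier), and it reports bad lines with `print` where a real tool would probably use logging.

```python
import io
import json
from typing import Iterable, Iterator, Optional


def iter_jsonl(lines: Iterable[str], limit: Optional[int] = None) -> Iterator[dict]:
    """Yield records one at a time, skipping lines that cannot be used."""
    yielded = 0
    for line_number, line in enumerate(lines, start=1):
        if limit is not None and yielded >= limit:
            break
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            print(f"Skipping line {line_number}: {exc}")
            continue
        if not isinstance(record, dict):
            print(f"Skipping line {line_number}: not a JSON object")
            continue
        yield record
        yielded += 1


sample = io.StringIO(
    '{"prompt": "first", "expected": "true", "id": "sample_1"}\n'
    "this line is not JSON\n"
    '{"prompt": "second", "expected": "false", "id": "sample_2"}\n'
)
ids = [record["id"] for record in iter_jsonl(sample)]
```

Passing an open file object instead of the `StringIO` buffer keeps memory use roughly constant, since only one line is held at a time.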
The enhanced dataset reader maintains full backward compatibility:
- ✅ Reads existing pipe-delimited datasets
- ✅ Works with all existing prompt evaluator scripts
- ✅ No changes needed to existing code
- ✅ Gradual migration supported
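
For readers curious what reading a legacy line might involve, here is a minimal sketch. It assumes the expected value never contains a pipe, so splitting on the last pipe is safe; `read_legacy_line` is a hypothetical helper, not the enhanced reader's actual code.

```python
def read_legacy_line(line: str) -> dict:
    """Parse one legacy '"prompt"|expected' line into a record (sketch)."""
    # Split on the LAST pipe so pipes inside the prompt are preserved.
    prompt, sep, expected = line.rstrip("\n").rpartition("|")
    if not sep:
        raise ValueError(f"No pipe delimiter found in line: {line!r}")
    prompt = prompt.strip()
    # Try triple quotes first, then plain double or single quotes.
    for quote in ('"""', '"', "'"):
        long_enough = len(prompt) >= 2 * len(quote)
        if long_enough and prompt.startswith(quote) and prompt.endswith(quote):
            prompt = prompt[len(quote):-len(quote)]
            break
    return {"prompt": prompt, "expected": expected.strip()}


print(read_legacy_line('"Choose a | b"|true'))
print(read_legacy_line('"""triple quoted"""|false'))
```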
# 1. Analyze your current datasets
python migrate_datasets.py --list
# 2. Migrate to JSONL format
python migrate_datasets.py --all --output-format jsonl
# 3. Validate the migration
python improved_dataset_converter.py datasets/my_dataset.jsonl --validate
# 4. Use with prompt evaluator (no changes needed!)
python prompt_evaluator.py --input datasets/my_dataset.jsonl --output results.csv

- "Unsupported format" error
  - Check file extension or use --format parameter
  - Ensure file is not corrupted
- "No records loaded" error
  - Check file format and content
  - Verify file permissions
- Memory issues with large datasets
  - Use max_lines parameter to limit records
  - Consider Parquet format for very large datasets
# Show help for converter
python improved_dataset_converter.py --help
# Show help for migrator
python migrate_datasets.py --help
# Test with sample data
python example_improved_formats.py

- For small datasets (< 10MB): Use JSONL or CSV
- For medium datasets (10MB - 100MB): Use JSONL or TSV
- For large datasets (> 100MB): Use Parquet
- For streaming processing: Use JSONL
- For human editing: Use CSV
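
The size thresholds above can be encoded in a small helper, for example to pick a default output format in a migration script. This is only a sketch: `recommend_formats` is a hypothetical function, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
MB = 1024 * 1024  # assumption: binary megabytes


def recommend_formats(size_bytes: int) -> list[str]:
    """Return the suggested formats for a dataset of the given size."""
    if size_bytes < 10 * MB:
        return ["jsonl", "csv"]   # small datasets
    if size_bytes <= 100 * MB:
        return ["jsonl", "tsv"]   # medium datasets
    return ["parquet"]            # large datasets


print(recommend_formats(5 * MB))
print(recommend_formats(500 * MB))
```

In practice the size could come from `os.path.getsize(path)` on the dataset file.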
The improved formats solve the original problems while maintaining full backward compatibility. Choose the format that best fits your needs!