All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Batch verification script (`run_batch_sequential.sh`): New script for running pass@10 verification tests
  - Supports 10 iterations by default for statistical significance
  - Auto-generates metrics reports and baseline comparisons
  - Parameters: `--module`, `--url`, `--model`, `--api-key`, `--max-workers`, `--mm-model`, etc.
- Added `expected_tool_call_total_count` to summary statistics in `validator/tool_calls.py`
- Added batch testing documentation to README.md and README_CN.md
- New script dependencies:
  - `scripts/batch_verify.py`: Batch verification executor
  - `scripts/calculate_batch_metrics.py`: Metrics aggregation and calculation
  - `scripts/compare_with_baseline.py`: Baseline comparison tool
  - `scripts/calculate_toolcall_similarity.py`: Tool call similarity calculator
- ToolCalls-Match-Rate denominator: Now uses the `expected_tool_call` label count from `sample.jsonl` as the denominator for a more accurate match rate calculation
- All Chinese comments and outputs converted to English for better internationalization
- `calculate_batch_metrics.py` now reads `expected_tool_call` statistics directly from `sample.jsonl`
- Fixed path consistency issues in shell script calling Python scripts
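
The denominator change above can be illustrated with a minimal Python sketch. It is not the code in `calculate_batch_metrics.py`. It assumes that `expected_tool_call` is a top-level boolean field in each `sample.jsonl` record and that the "label count" means the number of records labeled `true`. The helper names `count_expected_tool_calls` and `match_rate` are hypothetical.

```python
import json


def count_expected_tool_calls(jsonl_lines):
    """Count records whose expected_tool_call label is true."""
    count = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: the label is a top-level boolean field.
        if record.get("expected_tool_call") is True:
            count += 1
    return count


def match_rate(matched, expected_total):
    """Return matched / expected_total, or 0.0 when nothing is expected."""
    return matched / expected_total if expected_total else 0.0


# Illustrative records only; real sample.jsonl entries carry more fields.
sample_lines = [
    '{"id": 1, "expected_tool_call": true}',
    '{"id": 2, "expected_tool_call": false}',
    '{"id": 3, "expected_tool_call": true}',
    '{"id": 4}',
]
expected_total = count_expected_tool_calls(sample_lines)
print(expected_total, match_rate(1, expected_total))
```

Records without the label (such as `id` 4) are simply not counted here, which is one possible reading of how unlabeled data could be handled.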
- ToolCalls-Match-Rate redefined: Changed from simple "proportion of triggered tool calls" to a match rate based on expected labels
  - New formula: `tool_calls_accuracy = (tool_calls_finish_tool_calls + stop_finish_stop) / success_count`
    - i.e., the proportion of cases where the actual result matches the expected result
- Added `expected_tool_call` label field to test set `sample.jsonl`, indicating whether each case is expected to trigger a tool call
- Added confusion matrix statistics:
  - `tool_calls_finish_tool_calls`: expected tool_call, actual tool_call (True Positive)
  - `tool_calls_finish_stop`: expected tool_call, actual stop (False Negative)
  - `stop_finish_tool_calls`: expected stop, actual tool_call (False Positive)
  - `stop_finish_stop`: expected stop, actual stop (True Negative)
- Added `expected_tool_call` field to results, recording the expected label from the original request
- Backward compatible with historical data without the `expected_tool_call` label (incremental mode)
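
The confusion matrix counters and the `tool_calls_accuracy` formula above can be sketched in Python as follows. This is an illustration under stated assumptions, not the validator's actual code: each result is assumed to be a dict with a boolean `expected_tool_call` and a `finish_reason` of either `"tool_calls"` or `"stop"`, unlabeled historical results are assumed to be skipped, and `success_count` is taken to be the number of labeled results that were counted. The function name `summarize` is hypothetical.

```python
from collections import Counter


def summarize(results):
    """Build confusion matrix counters and the match-rate accuracy."""
    counts = Counter()
    for result in results:
        expected = result.get("expected_tool_call")
        if expected is None:
            continue  # assumption: skip historical results without a label
        expected_name = "tool_calls" if expected else "stop"
        actual_name = result["finish_reason"]  # assumed "tool_calls" or "stop"
        counts[f"{expected_name}_finish_{actual_name}"] += 1

    # Assumption: success_count is the number of labeled results counted above.
    success_count = sum(counts.values())
    matched = counts["tool_calls_finish_tool_calls"] + counts["stop_finish_stop"]
    accuracy = matched / success_count if success_count else 0.0
    return counts, accuracy


# Illustrative data only.
results = [
    {"expected_tool_call": True, "finish_reason": "tool_calls"},   # True Positive
    {"expected_tool_call": True, "finish_reason": "tool_calls"},   # True Positive
    {"expected_tool_call": True, "finish_reason": "stop"},         # False Negative
    {"expected_tool_call": False, "finish_reason": "tool_calls"},  # False Positive
    {"expected_tool_call": False, "finish_reason": "stop"},        # True Negative
    {"finish_reason": "stop"},                                     # unlabeled, skipped
]
counts, tool_calls_accuracy = summarize(results)
print(dict(counts), tool_calls_accuracy)
```

With this data, three of the five labeled results match their expected label, so the sketch reports an accuracy of 0.6.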
- Stable release
- Initial release of MiniMax Provider Verifier
- Support for multiple validators (ToolCalls, Russian Characters, Repeat N-Gram)
- Concurrent request processing
- Batch provider testing
- Incremental mode for rerunning failed requests
- Dynamic validator selection based on check_type
- Detailed test reports and statistical summaries
- Custom validator support