FEAT implement batching for memory interface#1325
Conversation
There was a problem hiding this comment.
Pull request overview
This PR aims to address scaling limits in the PyRIT MemoryInterface by introducing “batched” filtering logic for large IN (...) queries, along with unit tests intended to validate behavior when many IDs/values are passed.
Changes:
- Added
_SQLITE_MAX_BIND_VARSand_batched_in_conditionhelper inMemoryInterfaceand applied it to several query filters. - Updated multiple retrieval methods (e.g., scores, message pieces, attack results, scenario results) to use the new helper for
INfiltering. - Added a new unit test module to exercise “batching” behavior and large-input queries against the SQLite-backed memory instance.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
pyrit/memory/memory_interface.py |
Introduces a helper intended to avoid bind-variable limits and applies it across several query paths. |
tests/unit/memory/memory_interface/test_batching_scale.py |
Adds tests intended to validate that large ID/value inputs don’t break memory queries. |
|
@romanlutz eagerly waiting for your feedback! |
I apologize for taking ages. There were A LOT of PRs to work through. |
Applies PR microsoft#1325 changes onto latest main with conflict resolution: - Add _SQLITE_MAX_BIND_VARS constant and _execute_batched_query helper - Add batching to get_scores, get_message_pieces, get_attack_results, get_scenario_results - Preserve main's dedup-by-conversation-id in get_attack_results (all paths) - Preserve main's identifier_filters, attack_class, converter_classes params - Fix test helper: use constructor role param instead of mutating _role post-creation - Extract _dedup_attack_entries helper to avoid mypy no-redef errors Co-authored-by: maifeeulasad <maifeeulasad@gmail.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
683acd7 to
3420468
Compare
- Move _SQLITE_MAX_BIND_VARS to class attribute _MAX_BIND_VARS on MemoryInterface - Override _MAX_BIND_VARS to 2000 in AzureSQLMemory (Azure SQL supports 2100) - Add early return for score_ids=[] in get_scores (consistency with other methods) - Add tests for get_attack_results batching (many IDs, many sha256, empty list) - Add test for get_scores empty list handling Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix dedup bug in _execute_batched_query: str(None) produced truthy 'None' string, making the no-id fallback dead code. Use explicit None check instead. - Simplify get_scenario_results: remove duplicated elif/else branches, let _execute_batched_query own threshold logic for both scenario_result_ids and all_conversation_ids. - Fix Optional -> pipe syntax in _execute_batched_query signature. - Document bind-variable headroom limitation in docstring. - Fix conditional test assertion that could pass vacuously. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Extract shared list_params batching logic from get_message_pieces and get_attack_results into _query_with_list_params helper (~40 duplicated lines removed from each caller). - Add batch_size parameter to _execute_batched_query so callers can reduce the batch size when other conditions consume bind variables. - _query_with_list_params computes effective_batch_size by subtracting small_param bind count from _MAX_BIND_VARS, preventing cumulative overflow of the per-statement bind variable limit. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Test that batch size is reduced when small IN-clause params consume bind variables (effective_batch_size = _MAX_BIND_VARS - small_binds). Verifies the correct number of batched queries are issued. - Test that _execute_batched_query respects the batch_size override parameter, using batch_size=30 to verify 4 queries for 100 items. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ing-scale-maifee' into pr-1325-updated
behnam-o
left a comment
There was a problem hiding this comment.
Looks Great, Thank you! just a few minor non-blocking comments
rlundeen2
left a comment
There was a problem hiding this comment.
Approved with comments
# Conflicts: # pyrit/memory/memory_interface.py
69b1985 to
107c50f
Compare
|
Love to see this in action! I am quite thankful to all of you. You all are so friendly and helpful. |
Description
Closes #845
When
MemoryInterfacequery methods (get_scores,get_message_pieces,get_attack_results,get_scenario_results) receive large IN-clause parameter lists (e.g., 1000+ prompt IDs), the generated SQL exceeds SQLite's per-statement bind variable limit of 999, causing the query to fail. This PR adds transparent batching: large IN-clause lists are split into chunks, executed as separate queries, and the results are merged and deduplicated.What changed
Core batching infrastructure (
memory_interface.py):_MAX_BIND_VARSclass attribute onMemoryInterface(default: 500). Subclasses can override —AzureSQLMemorysets it to 2000 since Azure SQL supports 2100 params._execute_batched_query— Low-level helper that splits a single IN-clause into batches, runs separate queries, and deduplicates results by primary key. Accepts an optionalbatch_sizeoverride for callers that need a smaller batch._query_with_list_params— Higher-level helper used byget_message_piecesandget_attack_results. Classifies list parameters as "small" (fit in one query) or "large" (need batching), adds small params to SQL conditions, batches on the first large param, and filters remaining large params in Python. Computes an effective batch size that accounts for bind variables consumed by small params, preventing cumulative overflow._dedup_attack_entries— Static helper extracted fromget_attack_resultsto deduplicateAttackResultEntryrows byconversation_id(legacy data integrity workaround). Applied consistently across all code paths (batched and non-batched).Method changes:
get_scores— Batchesscore_idsvia_execute_batched_query. Adds early return forscore_ids=[]for consistency with other methods.get_message_pieces— Routes list params (prompt_ids,original_values,converted_values,converted_value_sha256) through_query_with_list_params.get_attack_results— Routesattack_result_idsandobjective_sha256through_query_with_list_params. Preserves dedup-by-conversation-id on all paths.get_scenario_results— Batchesscenario_result_idsand innerall_conversation_idsvia_execute_batched_query.azure_sql_memory.py:_MAX_BIND_VARS = 2000so Azure SQL uses larger batches (its limit is 2100).Why this approach
OR(col.IN(batch1), col.IN(batch2), ...)in a single statement, but this still binds all parameters in one statement and doesn't solve the limit.Tests and Documentation
New test file:
tests/unit/memory/memory_interface/test_batching_scale.py(586 lines, 25 tests)Covers:
get_message_pieceswith many prompt_ids, original_values, sha256 hashes (single and multiple large params)get_scoreswith many score_ids and empty list edge caseget_attack_resultswith many IDs, many objective_sha256, and empty list edge case_execute_batched_querywith small lists (single query), large lists (multiple queries), exact batch boundary, custombatch_sizeoverride_query_entriesspyAll 220
tests/unit/memory/memory_interface/tests pass. Pre-commit hooks (ruff, mypy strict) clean.