Performance: prompt building duplicates full few-shot preamble per prompt, causing O(batch_length × preamble_size) peak memory #446

@dan504512

Description

QAPromptGenerator.render() concatenates description + examples + chunk_text into a single string per prompt. When few-shot examples are large, this duplicates the shared preamble (description + formatted examples) into every prompt string independently:

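# prompting.py:115-138
def render(self, question, additional_context=None):
    # Shared preamble: identical for every prompt built from this template.
    prompt_lines = [f"{self.template.description}\n"]
    if self.template.examples:
        prompt_lines.append(self.examples_heading)
        for ex in self.template.examples:
            prompt_lines.append(self.format_example_as_text(ex))
    # Unique part: the only lines that differ from prompt to prompt.
    prompt_lines.append(f"{self.question_prefix}{question}")
    prompt_lines.append(self.answer_prefix)
    # The join copies the whole preamble into a new string on every call.
    return "\n".join(prompt_lines)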

Then in _annotate_documents_single_pass (annotation.py:370-375), all prompts for a batch are materialized simultaneously:

prompts = [
    prompt_builder.build_prompt(chunk.chunk_text, chunk.document_id, chunk.additional_context)
    for chunk in batch
]

For large example sets, this creates extreme memory pressure:

  • Example preamble: ~300,000 characters formatted (text + extraction outputs)
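  • Python string encoding: CPython stores every character of a string at the width of its widest character. Any character above U+00FF forces UCS-2 (2 bytes/char) and any character above U+FFFF forces UCS-4 (4 bytes/char), multiplying memory 2-4× relative to the 1 byte/char ASCII/Latin-1 form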
  • Observed per-prompt size: ~640 KB
  • At batch_length=10000: 10,000 × 640 KB = 6.4 GB per batch just for prompt strings
  • Because cyclic GC can lag between batches, two batches' prompts can be alive at the same time, so the peak reaches 12.8 GB
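
The string-width effect above can be checked directly. This is a minimal sketch that assumes CPython (the flexible string representation is an implementation detail); `bytes_per_char` and the synthetic `preamble` are hypothetical names used only for illustration, not part of the codebase.

```python
import sys


def bytes_per_char(sample: str, n: int = 100_000) -> float:
    """Estimate CPython's per-character storage cost for strings made of `sample`."""
    small = sample * n
    large = sample * (2 * n)
    # Differencing two sizes cancels the fixed per-object header.
    return (sys.getsizeof(large) - sys.getsizeof(small)) / (len(large) - len(small))


ascii_width = bytes_per_char("a")            # plain ASCII
latin1_width = bytes_per_char("\u00e9")      # U+00E9, still fits in one byte
bmp_width = bytes_per_char("\u2192")         # U+2192, forces 2 bytes/char
astral_width = bytes_per_char("\U0001F600")  # U+1F600, forces 4 bytes/char

# A single wide character widens the whole string: 300,000 ASCII characters
# plus one arrow are stored at 2 bytes per character.
preamble = "a" * 300_000 + "\u2192"
per_prompt_bytes = sys.getsizeof(preamble)
per_batch_bytes = per_prompt_bytes * 10_000  # batch_length=10000, preamble only

print(ascii_width, latin1_width, bmp_width, astral_width)
print(f"{per_prompt_bytes / 1e3:.0f} KB per prompt, {per_batch_bytes / 1e9:.1f} GB per batch")
```

The estimate covers the preamble alone, so it lands slightly under the ~640 KB observed per prompt, which also includes the chunk text and prefixes.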

Profiling data

memray flamegraph from a 1000-document extraction run (batch_length=10000, ~170 KB raw example text):

| Call site | Peak memory | Allocations |
|---|---|---|
| render (prompt strings) | 12,820 MB | 20,000 |
| Tokenization | 509 MB | 915 |
| Batch/chunk infrastructure | 512 MB | 922 |
| Total peak | 14,400 MB | |

Prompt string allocation accounts for 89% of peak memory usage.

Proposed optimization: use multi-part contents in batch requests

The Gemini API's contents[0].parts field accepts multiple parts. Instead of concatenating the preamble into each prompt string, the shared preamble can be stored once and referenced by all requests:

# Instead of:
"contents": [{"role": "user", "parts": [{"text": "preamble...chunk_text"}]}]

# Use:
"contents": [{"role": "user", "parts": [
    {"text": preamble_shared},   # one instance, shared by reference across all requests
    {"text": chunk_text},         # unique per prompt
]}]

This is semantically equivalent from Gemini's perspective (parts within a single turn are concatenated). Python's reference semantics mean all 10,000 requests share the same preamble_shared string object in memory.
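
A minimal sketch of the sharing behaviour, assuming requests are held in memory as plain dicts before serialization. `build_requests` and the synthetic sizes are hypothetical and only illustrate the reference semantics; they are not the project's `_build_request`.

```python
import sys


def build_requests(preamble: str, chunks: list[str]) -> list[dict]:
    """Build one request dict per chunk; every dict references the same preamble object."""
    return [
        {"contents": [{"role": "user", "parts": [{"text": preamble}, {"text": chunk}]}]}
        for chunk in chunks
    ]


preamble_shared = "x" * 300_000
chunks = [f"chunk {i}" for i in range(1_000)]
batch_requests = build_requests(preamble_shared, chunks)

# Every request points at the same str object; nothing was copied.
all_shared = all(
    r["contents"][0]["parts"][0]["text"] is preamble_shared for r in batch_requests
)

# String bytes held by the multi-part form: one preamble plus the chunks.
multipart_bytes = sys.getsizeof(preamble_shared) + sum(sys.getsizeof(c) for c in chunks)

# String bytes the concatenated form would hold: one full copy per prompt.
concatenated_bytes = sum(sys.getsizeof(preamble_shared + c) for c in chunks)

print(all_shared, multipart_bytes, concatenated_bytes)
```

The sharing applies only to the in-memory objects. Serializing a request (for example with `json.dumps`) writes the preamble text out again, so the saving holds only if requests are serialized one at a time rather than all at once.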

Expected impact:

  • Per-batch prompt memory: 10,000 × 640 KB = 6.4 GB → 1 × 640 KB + 10,000 × ~1 KB ≈ 10.6 MB (roughly 600× reduction)
  • No change to model behavior (same content presented to the model)
  • No change to Vertex billing (same token count per request)

Implementation scope

  1. PromptBuilder.build_prompt returns a structured value (e.g., PromptParts(preamble, chunk_text)) instead of a concatenated string
  2. _build_request in gemini_batch.py constructs multi-part contents from the structured value
  3. Non-batch providers (_process_single_prompt) can still concatenate for the existing generate_content call, or also use multi-part contents
  4. Cache key computation needs to hash the parts equivalently to the old concatenated form (or include a migration/versioning scheme)
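
Items 1 and 4 above could be sketched as follows. This assumes, purely for illustration, that the cache key is a SHA-256 digest of the UTF-8 prompt text; the project's actual key scheme may differ. `PromptParts`, `cache_key_from_text`, and `make_parts_key_fn` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PromptParts:
    """Structured prompt: the shared preamble plus the per-chunk remainder."""
    preamble: str
    chunk_text: str

    def as_text(self) -> str:
        # Fallback for providers that still need one concatenated string.
        return self.preamble + self.chunk_text


def cache_key_from_text(prompt: str) -> str:
    """Assumed legacy key: digest of the full concatenated prompt."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def make_parts_key_fn(preamble: str) -> Callable[[PromptParts], str]:
    """Hash the preamble once, then extend a copy of that state per prompt."""
    base = hashlib.sha256(preamble.encode("utf-8"))

    def key(parts: PromptParts) -> str:
        if parts.preamble is not preamble and parts.preamble != preamble:
            raise ValueError("PromptParts was built from a different preamble")
        h = base.copy()  # reuse the already-hashed preamble state
        h.update(parts.chunk_text.encode("utf-8"))
        return h.hexdigest()

    return key


preamble = "Extract entities.\n\nExamples\n\u2192 ...\n"
parts = PromptParts(preamble, "Q: some chunk\nA: ")
key_fn = make_parts_key_fn(preamble)
print(key_fn(parts) == cache_key_from_text(parts.as_text()))
```

Feeding the parts to the hash incrementally yields the same digest as hashing the concatenation, so existing cache entries would stay valid under this assumption. The split must reproduce the old string exactly, including the "\n" separators that render() inserts between lines.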

Alternatives considered

  • Context caching (Vertex AI cached content): also eliminates per-prompt duplication AND reduces token billing. Larger implementation surface (cache lifecycle management, TTL, minimum size thresholds). Could be a follow-up.
  • systemInstruction for examples: moves examples to a different API field. Changes prompt semantics — model may respond differently to examples in system vs user turn.
  • Streamed prompt construction: build and serialize prompts one-at-a-time into the JSONL file without materializing the full list. Reduces peak memory but doesn't reduce total allocation volume. More invasive refactor across annotation.py → model.infer → gemini_batch.
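
For comparison, the streamed alternative could look roughly like this: a generator serializes one request at a time, so only one full line is alive at any moment. The request envelope and the names `iter_request_lines` and `write_jsonl` are assumptions for illustration; the real batch JSONL schema in gemini_batch.py may wrap requests differently.

```python
import io
import json
from typing import Iterable, Iterator, TextIO


def iter_request_lines(preamble: str, chunks: Iterable[str]) -> Iterator[str]:
    """Yield one serialized request per chunk without building a list of prompts."""
    for chunk in chunks:
        request = {
            "contents": [
                {"role": "user", "parts": [{"text": preamble}, {"text": chunk}]}
            ]
        }
        # Each serialized line holds its own copy of the preamble text, but it
        # becomes garbage as soon as the consumer has written it.
        yield json.dumps(request, ensure_ascii=False)


def write_jsonl(lines: Iterable[str], out: TextIO) -> int:
    """Write lines to a JSONL stream and return how many were written."""
    count = 0
    for line in lines:
        out.write(line + "\n")
        count += 1
    return count


buffer = io.StringIO()  # stands in for the batch input file
written = write_jsonl(iter_request_lines("PREAMBLE\n", ["chunk a", "chunk b"]), buffer)
print(written)
```

This bounds peak memory to roughly one serialized request, but the preamble is still copied once per line, which is the total-allocation cost noted above.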

Environment
