Performance: prompt building duplicates full few-shot preamble per prompt, causing O(batch_length × preamble_size) peak memory #446

### Description

`QAPromptGenerator.render()` joins the description, the formatted examples, and the chunk text (passed in as `question`) into a single string per prompt. When the few-shot examples are large, the shared preamble (description + formatted examples) is copied into every prompt string independently:

```python
# prompting.py:115-138
def render(self, question, additional_context=None):
    prompt_lines = [f"{self.template.description}\n"]
    if self.template.examples:
        prompt_lines.append(self.examples_heading)
        for ex in self.template.examples:
            prompt_lines.append(self.format_example_as_text(ex))
    prompt_lines.append(f"{self.question_prefix}{question}")
    prompt_lines.append(self.answer_prefix)
    return "\n".join(prompt_lines)
```

Then in `_annotate_documents_single_pass` (annotation.py:370-375), all prompts for a batch are materialized simultaneously:

```python
prompts = [
    prompt_builder.build_prompt(chunk.chunk_text, chunk.document_id, chunk.additional_context)
    for chunk in batch
]
```

For large example sets, this creates extreme memory pressure:

- **Example preamble**: ~300,000 characters formatted (text + extraction outputs)
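- **Python string encoding**: CPython (PEP 393) stores a string at 1 byte/char only while every character fits in Latin-1; a single character above U+00FF forces UCS-2 (2 bytes/char) for the whole string, and one above U+FFFF forces UCS-4 (4 bytes/char), multiplying memory 2-4×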
- **Observed per-prompt size**: ~640 KB
- **At `batch_length=10000`**: 10,000 × 640 KB = **6.4 GB per batch** just for prompt strings
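- **Overlap between batches**: because cyclic GC can lag behind batch turnover, the previous batch's prompts may still be alive when the next batch is built, so peak usage can reach two batches' worth, **12.8 GB**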
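
The string-width effect above is easy to reproduce in isolation. The following is a minimal sketch, assuming CPython and using synthetic stand-in strings rather than real few-shot examples; the variable names are illustrative only.

```python
import sys

# Hypothetical stand-in for a formatted preamble of ~300,000 characters.
ascii_preamble = "x" * 300_000
# A single character above U+00FF (here U+2192) widens the whole string to 2 bytes/char.
ucs2_preamble = ascii_preamble[:-1] + "\u2192"
# A single character above U+FFFF (here U+1F600) widens it to 4 bytes/char.
ucs4_preamble = ascii_preamble[:-1] + "\U0001F600"

for label, text in [
    ("1 byte/char", ascii_preamble),
    ("2 bytes/char", ucs2_preamble),
    ("4 bytes/char", ucs4_preamble),
]:
    print(f"{label}: {len(text):,} chars, {sys.getsizeof(text):,} bytes")

# Rough per-batch estimate if every prompt carries its own copy of the preamble.
batch_length = 10_000
batch_bytes = sys.getsizeof(ucs2_preamble) * batch_length
print(f"estimated preamble copies per batch: {batch_bytes / 1e9:.1f} GB")
```

All three strings have the same length, so the difference in reported size comes entirely from the internal character width.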

### Profiling data

memray flamegraph from a 1000-document extraction run (`batch_length=10000`, ~170 KB raw example text):

| Call site | Peak memory | Allocations |
|---|---|---|
| `render` (prompt strings) | 12,820 MB | 20,000 |
| Tokenization | 509 MB | 915 |
| Batch/chunk infrastructure | 512 MB | 922 |
| **Total peak** | **14,400 MB** | |

Prompt string allocation accounts for **89%** of peak memory usage.

### Proposed optimization: use multi-part `contents` in batch requests

The Gemini API's `contents[0].parts` field accepts multiple parts. Instead of concatenating the preamble into each prompt string, the shared preamble can be stored once and referenced by all requests:

```python
# Instead of:
"contents": [{"role": "user", "parts": [{"text": "preamble...chunk_text"}]}]

# Use:
"contents": [{"role": "user", "parts": [
    {"text": preamble_shared},   # one instance, shared by reference across all requests
    {"text": chunk_text},         # unique per prompt
]}]
```

This is semantically equivalent from Gemini's perspective (parts within a single turn are concatenated). Python's reference semantics mean all 10,000 requests share the same `preamble_shared` string object in memory.
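
To illustrate the sharing, here is a minimal sketch with a hypothetical `build_request` helper (not the library's `_build_request`) and synthetic data. It only checks object identity in memory; it makes no claim about how the API treats the parts.

```python
# Hypothetical preamble and chunk texts, for illustration only.
preamble_shared = "description\n" + "example\n" * 1000
chunks = [f"chunk {i}" for i in range(5)]


def build_request(preamble: str, chunk_text: str) -> dict:
    """Build one request dict with the preamble and the chunk as separate parts."""
    return {
        "contents": [
            {"role": "user", "parts": [{"text": preamble}, {"text": chunk_text}]}
        ]
    }


requests = [build_request(preamble_shared, chunk) for chunk in chunks]

# Each request holds a reference to the same str object; nothing was copied.
all_shared = all(
    request["contents"][0]["parts"][0]["text"] is preamble_shared
    for request in requests
)
print(all_shared)
```

The sharing applies to the in-memory request objects only. `json.dumps` writes the preamble text into every serialized line, so serializing one request at a time keeps that cost to a single line rather than a whole batch.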

**Expected impact**:
- Per-batch prompt memory: 10,000 × 640 KB → 1 × 640 KB + 10,000 × ~1 KB ≈ **10.6 MB** (roughly 600× reduction)
- No change to model behavior (same content presented to the model)
- No change to Vertex billing (same token count per request)
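
The memory estimate can be sanity-checked at reduced scale with the standard library's `tracemalloc`. This is a rough sketch with synthetic data (a 64,000-character preamble and 200 chunks, both hypothetical); absolute numbers will differ from the memray figures above.

```python
import tracemalloc


def peak_bytes(build) -> int:
    """Return the peak traced allocation while build() runs and its result is alive."""
    tracemalloc.start()
    try:
        result = build()  # keep the result alive until the peak is read
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


# Synthetic inputs, created before tracing starts so they are not counted.
PREAMBLE = "p" * 64_000
CHUNKS = [f"{i:04d}" + "c" * 996 for i in range(200)]


def concatenated():
    # Current behaviour: every prompt carries its own copy of the preamble.
    return [PREAMBLE + chunk for chunk in CHUNKS]


def multi_part():
    # Proposed behaviour: every prompt references the same preamble object.
    return [[{"text": PREAMBLE}, {"text": chunk}] for chunk in CHUNKS]


concat_peak = peak_bytes(concatenated)
parts_peak = peak_bytes(multi_part)
print(f"concatenated: {concat_peak / 1e6:.2f} MB, multi-part: {parts_peak / 1e6:.2f} MB")
```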

### Implementation scope

1. `PromptBuilder.build_prompt` returns a structured value (e.g., `PromptParts(preamble, chunk_text)`) instead of a concatenated string
2. `_build_request` in `gemini_batch.py` constructs multi-part `contents` from the structured value
3. Non-batch providers (`_process_single_prompt`) can still concatenate for the existing `generate_content` call, or also use multi-part `contents`
4. Cache key computation needs to hash the parts equivalently to the old concatenated form (or include a migration/versioning scheme)
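
Steps 1, 2, and 4 could look roughly like the sketch below. `PromptParts` and its method names are hypothetical, and the cache key scheme (SHA-256 over the UTF-8 bytes of the concatenated prompt) is an assumption, since the existing cache key computation is not shown here.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptParts:
    """Hypothetical structured prompt: shared preamble plus per-chunk text."""

    preamble: str
    chunk_text: str

    def as_text(self) -> str:
        # For providers that still expect one string; this allocates a full copy.
        return self.preamble + self.chunk_text

    def as_contents(self) -> list[dict]:
        # Multi-part form; the preamble str is referenced, not copied.
        return [
            {
                "role": "user",
                "parts": [{"text": self.preamble}, {"text": self.chunk_text}],
            }
        ]

    def cache_key(self) -> str:
        # Feeding the parts in order gives the same digest as hashing the
        # concatenated string, without building that string.
        digest = hashlib.sha256()
        digest.update(self.preamble.encode("utf-8"))
        digest.update(self.chunk_text.encode("utf-8"))
        return digest.hexdigest()


parts = PromptParts(preamble="description\nexamples\n", chunk_text="chunk text")
legacy_key = hashlib.sha256(parts.as_text().encode("utf-8")).hexdigest()
print(parts.cache_key() == legacy_key)
```

If the existing key is derived differently (for example, with a separator between parts or extra fields), the incremental hash would need to feed the same bytes in the same order to stay compatible.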

### Alternatives considered

- **Context caching** (Vertex AI cached content): also eliminates per-prompt duplication AND reduces token billing. Larger implementation surface (cache lifecycle management, TTL, minimum size thresholds). Could be a follow-up.
- **`systemInstruction` for examples**: moves examples to a different API field. Changes prompt semantics — model may respond differently to examples in system vs user turn.
- **Streamed prompt construction**: build and serialize prompts one-at-a-time into the JSONL file without materializing the full list. Reduces peak memory but doesn't reduce total allocation volume. More invasive refactor across annotation.py → model.infer → gemini_batch.
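
For comparison, the streamed alternative could be sketched as a generator feeding a JSONL writer. The helper names are hypothetical and the line schema is simplified; the exact shape the batch endpoint expects is not reproduced here.

```python
import io
import json
from typing import Iterable, Iterator, TextIO


def iter_requests(preamble: str, chunk_texts: Iterable[str]) -> Iterator[dict]:
    """Yield one request at a time instead of building a list of prompts."""
    for chunk_text in chunk_texts:
        yield {
            "contents": [
                {"role": "user", "parts": [{"text": preamble + chunk_text}]}
            ]
        }


def write_jsonl(requests: Iterable[dict], out: TextIO) -> int:
    """Serialize each request to its own line and return the number written."""
    count = 0
    for request in requests:
        out.write(json.dumps(request, ensure_ascii=False))
        out.write("\n")
        count += 1
    return count


buffer = io.StringIO()  # stands in for the batch input file
written = write_jsonl(iter_requests("preamble\n", ["chunk a", "chunk b"]), buffer)
print(written)
```

Only one concatenated prompt is alive at a time, which bounds peak memory, but each prompt is still allocated in full, matching the trade-off described in the bullet above.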

### Environment

- langextract built from `main` at a281692 (post-1.2.1, includes #442 LCS alignment)
- Python 3.11
- Vertex AI batch mode, `batch_length=10000`
- ~170 KB raw few-shot example text (~640 KB per prompt after formatting + Python string encoding)
