[Bug]: File and image cache keys collide because BufferedReader.peek() only reads the buffered prefix #684

### Current Behavior

GPTCache uses `BufferedReader.peek()` when deriving cache inputs for file/image requests in `gptcache/processor/pre.py`.

Affected functions:

- `get_file_bytes()`
- `get_input_str()`
- `get_image_question()`

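`peek()` does not read the whole file. It returns bytes from the reader's internal buffer without advancing the position and performs at most one read on the underlying raw stream, so the result is typically capped at the buffer size (`io.DEFAULT_BUFFER_SIZE`, commonly 8192 bytes). As a result, two different files that share the same initial buffered prefix can produce the same cache key or pre-embedding input.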

This can cause GPTCache to return a cached response for a different file or image. In shared cache deployments, this may enable cache poisoning or disclosure of cached answers across users.

Example impact:

1. A request with `image_A` and `question_Q` is processed and cached.
2. Another request uses `image_B` with the same first buffered bytes as `image_A`, but different remaining content, plus the same `question_Q`.
3. GPTCache treats the requests as equivalent and returns the cached answer for `image_A`, without evaluating the full contents of `image_B`.
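
The three steps above can be reproduced with a toy cache. This is a minimal sketch, not GPTCache code: `peek_key`, `toy_cache`, and the byte payloads are hypothetical stand-ins for the `peek()`-based preprocessing and the cache store.

```python
import io


def peek_key(image, question):
    # Hypothetical stand-in for a peek()-based cache key: only the buffered
    # prefix of the image contributes, not the full content.
    return (image.peek(), question)


# Two "images" that agree on the first io.DEFAULT_BUFFER_SIZE bytes only.
prefix = b"\x89PNG" + b"\x00" * (io.DEFAULT_BUFFER_SIZE - 4)
image_a = io.BufferedReader(io.BytesIO(prefix + b"pixels of image A"))
image_b = io.BufferedReader(io.BytesIO(prefix + b"pixels of image B"))

toy_cache = {}
question_q = "what is in this image?"

# Step 1: the answer for image_A is cached.
toy_cache[peek_key(image_a, question_q)] = "answer computed for image_A"

# Steps 2 and 3: image_B hits the same entry although its content differs.
served = toy_cache.get(peek_key(image_b, question_q))
contents_differ = image_a.read() != image_b.read()
print(served, contents_differ)
```

In this sketch the lookup for `image_B` returns the entry stored for `image_A`, even though the two streams differ once read in full.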

### Expected Behavior


Cache keys for file/image inputs should represent the full file content, not only a buffered prefix.

Different files with the same header/prefix but different remaining bytes should not map to the same cache entry. The file pointer should also be restored after preprocessing so downstream model calls can still read the full file content.
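
One way to meet both requirements is to hash the stream in chunks and then restore its position. The following is a sketch under stated assumptions, not the code from the PR: `hash_stream` and `chunk_size` are hypothetical names, and the stream is assumed to be seekable.

```python
import hashlib
import io


def hash_stream(stream, chunk_size=64 * 1024):
    """Return the SHA-256 hex digest of a seekable binary stream's full content."""
    start = stream.tell()  # remember where the caller left the stream
    stream.seek(0)
    digest = hashlib.sha256()
    # Read fixed-size chunks until read() returns b"" at end of stream.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    stream.seek(start)  # restore the position for downstream readers
    return digest.hexdigest()


prefix = b"A" * io.DEFAULT_BUFFER_SIZE
file_a = io.BufferedReader(io.BytesIO(prefix + b"first file content"))
file_b = io.BufferedReader(io.BytesIO(prefix + b"second file content"))

digest_a = hash_stream(file_a)
digest_b = hash_stream(file_b)
remaining_a = file_a.read()  # still the full content after hashing

print(digest_a != digest_b)
print(remaining_a == prefix + b"first file content")
```

The sketch restores the position the caller had before hashing rather than always rewinding to the start; for a freshly opened stream the two are equivalent.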

### Steps To Reproduce

```python
import io

# Both streams share a prefix that fills the default read buffer exactly.
prefix = b"A" * io.DEFAULT_BUFFER_SIZE  # 8192 bytes on typical builds
file_a = io.BufferedReader(io.BytesIO(prefix + b"first file content"))
file_b = io.BufferedReader(io.BytesIO(prefix + b"second file content"))

# peek() returns only the buffered prefix and does not advance the position.
peek_a = file_a.peek()
peek_b = file_b.peek()

print(peek_a == peek_b)  # True on typical BufferedReader behavior
print(file_a.read() == file_b.read())  # False
```

Then use these streams through a GPTCache preprocessing path that relies on `peek()`, such as:

- `get_file_bytes({"file": stream})`
- `get_input_str({"input": {"image": stream, "question": "what is in this image?"}})`
- `get_image_question({"image": stream, "question": "what is in this image?"})`

Because the cache input is derived from `peek()`, two different files can be treated as identical if their buffered prefixes match.

### Environment

Observed in GPTCache before the proposed fix in PR #678.

Relevant component:

- `gptcache/processor/pre.py`

Relevant adapters/usages:

- `gptcache/adapter/openai.py` -> `get_file_bytes()`
- `gptcache/adapter/replicate.py` -> `get_input_str()`
- `gptcache/adapter/minigpt4.py` -> `get_image_question()`

### Anything else?

Related PR: https://github.com/zilliztech/GPTCache/pull/678

The PR fixes this by replacing `peek()`-based cache inputs with a streaming SHA-256 hash of the full file content in the affected functions, and resetting the file pointer with `seek(0)` after hashing.

It also fixes a resource leak in `get_image_question()` where a file path was opened without being closed.
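
The leak pattern and its usual remedy can be illustrated as follows. This is a hypothetical sketch, not the PR's implementation: `image_digest` is an invented helper that accepts either a filesystem path or an open binary stream, and it reads the whole file at once for brevity.

```python
import hashlib
import os
import tempfile


def image_digest(image):
    """Hash an image given as a filesystem path or an open binary stream."""
    if isinstance(image, (str, os.PathLike)):
        # The context manager closes the handle even if reading raises.
        with open(image, "rb") as handle:
            return hashlib.sha256(handle.read()).hexdigest()
    # Caller-owned stream: hash from the start, then rewind for later readers.
    image.seek(0)
    data = image.read()
    image.seek(0)
    return hashlib.sha256(data).hexdigest()


# Usage with a temporary file standing in for an image on disk.
with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
    tmp.write(b"fake image bytes")
    tmp_path = tmp.name

try:
    path_digest = image_digest(tmp_path)
    with open(tmp_path, "rb") as stream:
        stream_digest = image_digest(stream)
        stream_still_readable = stream.read() == b"fake image bytes"
finally:
    os.remove(tmp_path)

print(path_digest == stream_digest, stream_still_readable)
```

Handles the helper opens itself are closed by the `with` block, while streams passed in by the caller are left open and rewound, since the caller owns them.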

