[Bug]: File and image cache keys collide because BufferedReader.peek() only reads the buffered prefix #684

@3em0

Current Behavior

GPTCache uses BufferedReader.peek() when deriving cache inputs for file/image requests in gptcache/processor/pre.py.

Affected functions:

  • get_file_bytes()
  • get_input_str()
  • get_image_question()

peek() does not read the whole file. It returns bytes from the internal buffer without advancing the position and performs at most one read on the underlying raw stream, so with an 8192-byte buffer it typically returns only the first 8192 bytes. As a result, two different files that share the same initial buffered prefix can produce the same cache key/pre-embedding input.

This can cause GPTCache to return a cached response for a different file or image. In shared cache deployments, this may enable cache poisoning or disclosure of cached answers across users.

Example impact:

  1. A request with image_A and question_Q is processed and cached.
  2. Another request uses image_B with the same first buffered bytes as image_A, but different remaining content, plus the same question_Q.
  3. GPTCache treats the requests as equivalent and returns the cached answer for image_A, without evaluating the full contents of image_B.
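The scenario above can be sketched with a toy dictionary cache. This is not GPTCache code: `peek_key` and the `cache` dict are hypothetical stand-ins that only mimic the pattern of deriving a key from `peek()`, and the buffer size is pinned to 8192 so the result does not depend on the interpreter default.

```python
import io

BUF = 8192  # pinned so the demo does not depend on io.DEFAULT_BUFFER_SIZE


def peek_key(stream, question):
    # Hypothetical key function mirroring the problematic pattern:
    # only the buffered prefix of the stream contributes to the key.
    return (stream.peek(), question)


# Two "images" with an identical first 8192 bytes and different tails.
prefix = b"\x89PNG" + b"\x00" * (BUF - 4)
image_a = io.BufferedReader(io.BytesIO(prefix + b"pixels of image A"), buffer_size=BUF)
image_b = io.BufferedReader(io.BytesIO(prefix + b"pixels of image B"), buffer_size=BUF)

cache = {}
question = "what is in this image?"

# Step 1: the request for image A is processed and cached.
cache[peek_key(image_a, question)] = "answer computed for image A"

# Steps 2 and 3: the request for image B hits the entry stored for image A.
hit = cache.get(peek_key(image_b, question))
print(hit)
```

The lookup for `image_b` returns the entry stored for `image_a`, even though the two streams differ once read in full.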

Expected Behavior

Cache keys for file/image inputs should represent the full file content, not only a buffered prefix.

Different files with the same header/prefix but different remaining bytes should not map to the same cache entry. The file pointer should also be restored after preprocessing so downstream model calls can still read the full file content.

Steps To Reproduce

import io


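BUF = 8192  # pin the buffer size so the result does not depend on io.DEFAULT_BUFFER_SIZE
prefix = b"A" * BUF
file_a = io.BufferedReader(io.BytesIO(prefix + b"first file content"), buffer_size=BUF)
file_b = io.BufferedReader(io.BytesIO(prefix + b"second file content"), buffer_size=BUF)

# peek() returns only what is currently buffered, which here is the shared prefix.
peek_a = file_a.peek()
peek_b = file_b.peek()

print(peek_a == peek_b)  # True: both peeks return only the shared 8192-byte prefix
print(file_a.read() == file_b.read())  # False: the full contents differ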


Then use these streams through a GPTCache preprocessing path that relies on `peek()`, such as:

- `get_file_bytes({"file": stream})`
- `get_input_str({"input": {"image": stream, "question": "what is in this image?"}})`
- `get_image_question({"image": stream, "question": "what is in this image?"})`

Because the cache input is derived from `peek()`, two different files can be treated as identical if their buffered prefixes match.

Environment

Observed in GPTCache before the proposed fix in PR #678.

Relevant component:


gptcache/processor/pre.py


Relevant adapters/usages:


gptcache/adapter/openai.py      -> get_file_bytes()
gptcache/adapter/replicate.py   -> get_input_str()
gptcache/adapter/minigpt4.py    -> get_image_question()

Anything else?

Related PR: #678

The PR fixes this by replacing peek()-based cache inputs with a streaming SHA-256 hash of the full file content in the affected functions, and resetting the file pointer with seek(0) after hashing.
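A minimal sketch of that approach, for illustration only. `hash_stream` and its `chunk_size` parameter are hypothetical names, not the PR's actual code, and the sketch assumes a seekable binary stream; it also seeks to the start before hashing, which is an added assumption so that the digest covers the whole file.

```python
import hashlib
import io


def hash_stream(stream, chunk_size=64 * 1024):
    """Hypothetical helper: SHA-256 of a seekable binary stream, read in chunks."""
    stream.seek(0)  # assumption: hash from the beginning, wherever the caller left the pointer
    digest = hashlib.sha256()
    # read() returns b"" at end of stream, which terminates the iterator.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    stream.seek(0)  # restore the pointer so downstream code can read the full file
    return digest.hexdigest()


prefix = b"A" * 8192
file_a = io.BufferedReader(io.BytesIO(prefix + b"first file content"))
file_b = io.BufferedReader(io.BytesIO(prefix + b"second file content"))

key_a = hash_stream(file_a)
key_b = hash_stream(file_b)

print(key_a == key_b)  # the two digests are compared over the full content
print(file_a.tell())   # position after hashing
```

Reading in fixed-size chunks keeps memory use bounded for large files, and the final `seek(0)` leaves the stream readable from the start for the downstream model call.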

It also fixes a resource leak in get_image_question() where a file path was opened without being closed.
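One way to avoid that kind of leak is to open paths inside a `with` block so the handle is closed as soon as hashing finishes. The sketch below is an assumption about how this could look, not the code in the PR; `hash_image_source` and `_sha256_of` are hypothetical names.

```python
import hashlib
import os
import tempfile


def _sha256_of(handle):
    # Hash from the handle's current position to the end, in 64 KiB chunks.
    digest = hashlib.sha256()
    for chunk in iter(lambda: handle.read(65536), b""):
        digest.update(chunk)
    return digest.hexdigest()


def hash_image_source(image):
    """Hypothetical helper: accepts a file path or a seekable binary file object."""
    if isinstance(image, (str, os.PathLike)):
        # The context manager closes the file even if hashing raises.
        with open(image, "rb") as handle:
            return _sha256_of(handle)
    result = _sha256_of(image)
    image.seek(0)  # caller-owned stream: rewind it, but do not close it
    return result


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "image.bin")
    with open(path, "wb") as f:
        f.write(b"example image bytes")

    from_path = hash_image_source(path)
    with open(path, "rb") as f:
        from_handle = hash_image_source(f)
        position_after = f.tell()

print(from_path == from_handle)
```

Streams passed in by the caller are rewound but left open, since the caller owns their lifetime; only handles opened by the helper itself are closed.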
