Current Behavior
GPTCache uses `BufferedReader.peek()` when deriving cache inputs for file/image requests in `gptcache/processor/pre.py`.
Affected functions:
- `get_file_bytes()`
- `get_input_str()`
- `get_image_question()`
`peek()` does not read the whole file. It only returns bytes currently available in the internal buffer, commonly around the first 8192 bytes. As a result, two different files that share the same initial buffered prefix can produce the same cache key/pre-embedding input.
This can cause GPTCache to return a cached response for a different file or image. In shared cache deployments, this may enable cache poisoning or disclosure of cached answers across users.
Example impact:
- A request with `image_A` and `question_Q` is processed and cached.
- Another request uses `image_B` with the same first buffered bytes as `image_A`, but different remaining content, plus the same `question_Q`.
- GPTCache treats the requests as equivalent and returns the cached answer for `image_A`, without evaluating the full contents of `image_B`.
Expected Behavior
Cache keys for file/image inputs should represent the full file content, not only a buffered prefix.
Different files with the same header/prefix but different remaining bytes should not map to the same cache entry. The file pointer should also be restored after preprocessing so downstream model calls can still read the full file content.
Steps To Reproduce
```python
import io

prefix = b"A" * 8192
file_a = io.BufferedReader(io.BytesIO(prefix + b"first file content"))
file_b = io.BufferedReader(io.BytesIO(prefix + b"second file content"))

peek_a = file_a.peek()
peek_b = file_b.peek()

print(peek_a == peek_b)  # True on typical BufferedReader behavior
print(file_a.read() == file_b.read())  # False
```
Then use these streams through a GPTCache preprocessing path that relies on `peek()`, such as:
- `get_file_bytes({"file": stream})`
- `get_input_str({"input": {"image": stream, "question": "what is in this image?"}})`
- `get_image_question({"image": stream, "question": "what is in this image?"})`
Because the cache input is derived from `peek()`, two different files can be treated as identical if their buffered prefixes match.
Environment
Observed in GPTCache before the proposed fix in PR #678.
Relevant component:
- `gptcache/processor/pre.py`
Relevant adapters/usages:
- `gptcache/adapter/openai.py` -> `get_file_bytes()`
- `gptcache/adapter/replicate.py` -> `get_input_str()`
- `gptcache/adapter/minigpt4.py` -> `get_image_question()`
Anything else?
Related PR: #678
The PR fixes this by replacing peek()-based cache inputs with a streaming SHA-256 hash of the full file content in the affected functions, and resetting the file pointer with seek(0) after hashing.
It also fixes a resource leak in get_image_question() where a file path was opened without being closed.
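
For illustration, the approach described above could look roughly like the sketch below. This is not the code from the PR: `hash_file_content`, `hash_file_path`, and the `chunk_size` parameter are hypothetical names chosen for this example, and rewinding to offset 0 before hashing is an assumption about how callers pass the stream in.

```python
import hashlib
import io


def hash_file_content(stream, chunk_size=8192):
    """Return the SHA-256 hex digest of the full stream, then rewind it.

    Hypothetical helper for illustration; assumes a seekable binary stream.
    """
    digest = hashlib.sha256()
    stream.seek(0)  # assumption: the cache key should cover the file from its start
    # Read in fixed-size chunks so large files are never held in memory at once.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    stream.seek(0)  # restore the pointer so downstream calls can read the file
    return digest.hexdigest()


def hash_file_path(path):
    """Hash a file given by path, closing the handle when done (hypothetical)."""
    with open(path, "rb") as stream:
        return hash_file_content(stream)


# Same streams as in the reproduction: identical 8192-byte prefix, different tails.
prefix = b"A" * 8192
file_a = io.BufferedReader(io.BytesIO(prefix + b"first file content"))
file_b = io.BufferedReader(io.BytesIO(prefix + b"second file content"))

key_a = hash_file_content(file_a)
key_b = hash_file_content(file_b)

print(key_a == key_b)  # False: the digests cover the differing tails
print(file_a.read() == prefix + b"first file content")  # True: pointer was reset
```

Hashing in chunks keeps memory use bounded regardless of file size, and the `with` block in `hash_file_path` addresses the kind of unclosed-handle leak mentioned for `get_image_question()`.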