Is your feature request related to a problem? Please describe.
Yes. The current caption_image function hardcodes the initialization, API payload mapping, and error-handling logic specifically for OpenAI. If we want to expand our RAG pipeline to support other Vision-Language Models (VLMs) like Anthropic Claude, Google Gemini, or local models via Ollama/LLaVA, this function will grow into a long, brittle chain of if/elif branches. Every new provider would require modifying caption_image itself, which violates the Open-Closed Principle and makes the function harder to maintain and extend.
Describe the solution you'd like
We need to decouple the core image captioning pipeline from individual vendor SDK implementations by applying the Strategy pattern:
Create a dedicated directory structure (e.g., app/vision/providers/).
Define a standard abstract base class or interface (e.g., BaseVisionProvider) requiring a .caption(image_bytes: bytes) -> str method.
Refactor the existing OpenAI code into its own class (OpenAIVisionProvider) conforming to this interface.
Implement a simple factory or registry lookup inside caption_image that instantiates the correct provider dynamically based on the VISION_PROVIDER string setting.
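A minimal sketch of how these pieces could fit together, not a final design. It assumes VISION_PROVIDER is read from an environment variable and defaults to "openai"; the real settings mechanism may differ. StubVisionProvider, PROVIDERS, and get_provider are hypothetical names introduced only for illustration, and the OpenAI class body is left as a placeholder for the code that already exists in caption_image.

```python
from __future__ import annotations

import os
from abc import ABC, abstractmethod


class BaseVisionProvider(ABC):
    """Interface every vision provider implements."""

    @abstractmethod
    def caption(self, image_bytes: bytes) -> str:
        """Return a text caption for the given image."""


class OpenAIVisionProvider(BaseVisionProvider):
    def caption(self, image_bytes: bytes) -> str:
        # Placeholder: the existing OpenAI client setup, payload mapping,
        # and error handling from caption_image would move here.
        raise NotImplementedError("port the current OpenAI logic here")


class StubVisionProvider(BaseVisionProvider):
    """Hypothetical offline provider, included only so the sketch runs."""

    def caption(self, image_bytes: bytes) -> str:
        return f"stub caption ({len(image_bytes)} bytes)"


# Registry: supporting a new vendor means adding one entry here,
# without touching caption_image.
PROVIDERS: dict[str, type[BaseVisionProvider]] = {
    "openai": OpenAIVisionProvider,
    "stub": StubVisionProvider,
}


def get_provider(name: str) -> BaseVisionProvider:
    """Instantiate the provider registered under the given name."""
    try:
        return PROVIDERS[name.lower()]()
    except KeyError:
        raise ValueError(
            f"Unknown VISION_PROVIDER {name!r}; expected one of {sorted(PROVIDERS)}"
        ) from None


def caption_image(image_bytes: bytes, provider_name: str | None = None) -> str:
    # Assumption: the setting comes from the environment when not passed in.
    name = provider_name or os.environ.get("VISION_PROVIDER", "openai")
    return get_provider(name).caption(image_bytes)
```

With this shape, caption_image only knows about the interface, so vendor-specific imports and error handling stay inside each provider module.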
Describe alternatives you've considered
Keeping it inline: Continuing to append elif provider == "anthropic": blocks directly inside caption_image. This was rejected because vendor-specific error handling and dependencies will clutter the core pipeline file.
Function-mapping dict: Mapping string keys to simple standalone helper functions within the same file. While cleaner than nested conditions, it still leaves the file bloated with vendor-specific configuration code.
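For comparison, the function-mapping alternative might look roughly like the sketch below. The helper names and the CAPTIONERS dict are hypothetical, and the function bodies are placeholders standing in for real vendor calls. The dispatch itself is tidy, but every vendor's setup and error handling would still sit in the same file.

```python
from __future__ import annotations

from typing import Callable


def _caption_openai(image_bytes: bytes) -> str:
    # Placeholder: OpenAI client setup, payload mapping, error handling.
    return "openai caption placeholder"


def _caption_anthropic(image_bytes: bytes) -> str:
    # Placeholder: Anthropic-specific setup would also live in this file.
    return "anthropic caption placeholder"


# String keys mapped to standalone helper functions.
CAPTIONERS: dict[str, Callable[[bytes], str]] = {
    "openai": _caption_openai,
    "anthropic": _caption_anthropic,
}


def caption_image(image_bytes: bytes, provider: str = "openai") -> str:
    return CAPTIONERS[provider](image_bytes)
```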
Additional Context
GSSoC '26