Bug Description
When HINDSIGHT_API_EMBEDDINGS_PROVIDER is set to an external provider (e.g., openai), the hindsight-api Python process still loads PyTorch and local model files at startup, resulting in ~1.15 GB RSS baseline with no local model configured — only ~950 MB less than a BGE-local instance.
Steps to Reproduce
- Configure hindsight-api with HINDSIGHT_API_EMBEDDINGS_PROVIDER=openai and a valid HINDSIGHT_API_EMBEDDINGS_OPENAI_API_KEY
- Start the daemon (no local embedding model needed or configured)
- Check mapped files: grep -cE "torch|bge" /proc/<PID>/maps
- Observe PyTorch and BGE file mappings present despite no local model being used
Environment:
- Hindsight version: v0.6.5
- Operating system: Ubuntu 24.04.4 LTS, Linux 6.8.0-111-generic x86_64
- Install method: uvx hindsight-api@latest via uv, running as background daemon process
- Model: N/A (embedding-provider configuration issue, not LLM)
- Provider / routing chain: HINDSIGHT_API_EMBEDDINGS_PROVIDER=openai, HINDSIGHT_API_EMBEDDINGS_OPENAI_MODEL=text-embedding-3-small. No local sentence-transformers model configured.
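The mapped-file check from the steps above can also be scripted, which makes it easier to repeat across daemons. This is a minimal sketch that mirrors the grep count; the function names are illustrative and not part of hindsight-api, and it assumes a Linux host where /proc is available.

```python
import re
from pathlib import Path


def count_matching_mappings(maps_text: str, pattern: str = r"torch|bge") -> int:
    """Count lines of a /proc/<PID>/maps dump that match `pattern`.

    Equivalent in spirit to `grep -cE "torch|bge"`: case-sensitive,
    one count per matching line.
    """
    regex = re.compile(pattern)
    return sum(1 for line in maps_text.splitlines() if regex.search(line))


def read_maps(pid: int) -> str:
    """Read the memory-map listing of a running process (Linux only)."""
    return Path(f"/proc/{pid}/maps").read_text()
```

For a running daemon, `count_matching_mappings(read_maps(pid))` should return 0 when no PyTorch or BGE files are mapped.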
Additional context
Discovered while migrating from local BGE-small embeddings to OpenAI external embeddings to reduce CPU load and RSS. The migration successfully eliminates embedding inference CPU saturation (the primary goal), but the memory savings were significantly smaller than expected due to PyTorch loading unconditionally.
Likely cause: import torch or from sentence_transformers import ... at module top level in embeddings.py or a dependency, rather than inside the LocalEmbeddings class initializer. A lazy import pattern (importing only when HINDSIGHT_API_EMBEDDINGS_PROVIDER=local) would fix this.
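A minimal sketch of the lazy import pattern suggested above. The class and function names here (OpenAIEmbeddings, create_embeddings, loaded_heavy_modules) are assumptions for illustration; the actual structure of embeddings.py has not been inspected. The only real library call is sentence_transformers.SentenceTransformer, and it is reached only on the local path.

```python
import sys

# Modules that should stay out of the process unless a local model is used.
HEAVY_MODULES = ("torch", "sentence_transformers")


class OpenAIEmbeddings:
    """Hypothetical stand-in for an external provider client (no ML deps)."""

    def __init__(self, model: str) -> None:
        self.model = model


class LocalEmbeddings:
    """Local provider: the heavy import happens here, not at module load."""

    def __init__(self, model_name: str) -> None:
        # Deferred import: PyTorch is pulled in only when a local model
        # is actually constructed.
        from sentence_transformers import SentenceTransformer

        self._model = SentenceTransformer(model_name)


def create_embeddings(provider: str, model: str):
    """Pick the provider class without importing unused backends."""
    if provider == "openai":
        return OpenAIEmbeddings(model)
    if provider == "local":
        return LocalEmbeddings(model)
    raise ValueError(f"unknown embeddings provider: {provider!r}")


def loaded_heavy_modules() -> list[str]:
    """Return which heavy modules are currently imported in this process."""
    return [name for name in HEAVY_MODULES if name in sys.modules]
```

With this shape, `create_embeddings("openai", "text-embedding-3-small")` never touches sentence_transformers, so `loaded_heavy_modules()` should stay empty unless some other dependency imports torch on its own.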
Expected Behavior
When an external embedding provider is configured, PyTorch and local model weights should not be imported or loaded. RSS baseline should reflect only the API server, database client, and LLM client — expected ~200–400 MB, not ~1.15 GB.
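The expected behavior could be guarded with a small import check that runs in a fresh interpreter, so modules already loaded by the test runner do not mask the result. This is a sketch only: the helper names are illustrative, and the importable package name for hindsight-api (for example `hindsight_api`) is an assumption that would need to be confirmed.

```python
import json
import subprocess
import sys

HEAVY_MODULES = ("torch", "sentence_transformers")


def modules_loaded_by(statement: str) -> set[str]:
    """Run `statement` in a fresh interpreter and return its sys.modules keys.

    The child also imports json and sys to report back, so those names
    always appear in the result.
    """
    code = (
        f"{statement}\n"
        "import sys, json\n"
        "print(json.dumps(sorted(sys.modules)))"
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    # Use the last stdout line in case the statement itself prints something.
    return set(json.loads(result.stdout.splitlines()[-1]))


def heavy_imports(statement: str) -> list[str]:
    """List the heavy modules that `statement` causes to be imported."""
    loaded = modules_loaded_by(statement)
    return [name for name in HEAVY_MODULES if name in loaded]
```

Calling `heavy_imports("import hindsight_api")` (package name assumed) with HINDSIGHT_API_EMBEDDINGS_PROVIDER=openai set in the environment would ideally return an empty list. If the import happens later during app startup rather than at package import, the statement would need to exercise that startup path instead.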
Actual Behavior
64 PyTorch/BGE file mappings present in /proc/<PID>/maps on a daemon configured exclusively for OpenAI embeddings. RSS sits at ~1,151 MB at idle, consistent with the PyTorch runtime being fully loaded despite serving zero inference requests.
'''
# On daemon with HINDSIGHT_API_EMBEDDINGS_PROVIDER=openai
grep -cE "torch|bge" /proc/249656/maps
# Output: 64
'''
'''
# RSS
grep VmRSS /proc/249656/status
# VmRSS: 1178492 kB (~1.15 GB)
'''
For comparison, a BGE-local instance on the same host:
'''
grep VmRSS /proc/142163/status
# VmRSS: 2166572 kB (~2.1 GB)
'''
Switching to external embeddings saves ~950 MB — but the expected saving was ~1.5–1.8 GB (PyTorch runtime + model weights). PyTorch appears to be imported unconditionally at module load time rather than lazily when a local provider is actually selected.
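The saving can be recomputed directly from the two VmRSS readings above. The sketch below parses the VmRSS line and takes the difference; the helper name is illustrative. Note that the "kB" reported in /proc status files is in units of 1024 bytes.

```python
import re


def parse_vmrss_kb(status_text: str) -> int:
    """Extract the VmRSS value (in kB) from /proc/<PID>/status content."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1))


# Readings quoted in this report.
openai_kb = parse_vmrss_kb("VmRSS: 1178492 kB")  # external-provider daemon
local_kb = parse_vmrss_kb("VmRSS: 2166572 kB")   # BGE-local instance

saved_kb = local_kb - openai_kb
print(f"saved: {saved_kb} kB ({saved_kb / 1024:.0f} MiB)")
# prints: saved: 988080 kB (965 MiB)
```

The exact difference is about 965 MiB, in line with the rounded ~950 MB figure and well short of the 1.5–1.8 GB that was expected.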
Version
Hindsight version: v0.6.5, Operating system: Ubuntu 24.04.4 LTS, Linux 6.8.0-111-generic x86_64
LLM Provider
OpenAI