HMLR has ONLY been tested with gpt-4.1-mini.
HMLR's architecture was designed, optimized, and benchmarked exclusively with gpt-4.1-mini. The system makes specific assumptions about:
- Instruction following consistency
- JSON formatting reliability
- Prompt response patterns
- Token usage characteristics
- Reasoning capabilities at the mini-tier
These assumptions have been validated through extensive testing with gpt-4.1-mini ONLY.
All RAGAS benchmarks achieving 1.00 Faithfulness and 1.00 Context Recall were run using:
```python
model="gpt-4.1-mini"
```

| Test | Model | Faithfulness | Context Recall | Result |
|---|---|---|---|---|
| Multi-hop reasoning | gpt-4.1-mini | 1.00 | 1.00 | ✅ PASS |
| Temporal conflicts | gpt-4.1-mini | 1.00 | 1.00 | ✅ PASS |
| Cross-topic invariants | gpt-4.1-mini | 1.00 | 1.00 | ✅ PASS |
| Zero-keyword recall | gpt-4.1-mini | 1.00 | 1.00 | ✅ PASS |
| 50-turn long conversation | gpt-4.1-mini | 1.00 | 1.00 | ✅ PASS |
| Hydra (9-policy lethal RAG) | gpt-4.1-mini | 1.00 | 1.00 | ✅ PASS |
No other models have been tested.
If you try to use a different model, you'll see this warning:
```python
client = HMLRClient(
    api_key="...",
    model="gpt-4o"  # Different model!
)
# ⚠️ Output:
# UserWarning: Model 'gpt-4o' has NOT been tested!
# HMLR is only validated with 'gpt-4.1-mini'.
# You may experience:
# - Incorrect memory retrieval
# - Malformed internal JSON
# - Failed fact extraction
# - Bridge block corruption
# Proceed at your own risk.
```

HMLR components generate structured JSON for internal operations. Other models may:
- Use different formatting
- Include extra commentary
- Fail to follow JSON schema exactly
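These failure modes are why strict parsing matters. As a minimal sketch (not part of HMLR's API; `parse_strict_json` is a hypothetical helper), a defensive parser can reject exactly the behaviors listed above, such as extra commentary wrapped around the JSON object:

```python
import json

def parse_strict_json(raw: str) -> dict:
    """Parse LLM output as JSON, rejecting the failure modes above:
    extra commentary around the object, or non-object output."""
    raw = raw.strip()
    # A model that prepends commentary fails this check outright.
    if not raw.startswith("{"):
        raise ValueError(f"Expected a JSON object, got: {raw[:40]!r}")
    obj = json.loads(raw)
    if not isinstance(obj, dict):
        raise ValueError("Top-level JSON value must be an object")
    return obj

print(parse_strict_json('{"facts": ["user prefers dark mode"]}'))
```

A model that follows the schema passes cleanly; one that adds "Sure! Here's the JSON:" raises immediately instead of corrupting downstream state.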
The FactScrubber uses specific prompts optimized for GPT-4.1-mini. Other models may:
- Extract incorrect facts
- Miss important facts
- Hallucinate non-existent facts
The LatticeCrawler and Governor rely on specific reasoning patterns. Other models may:
- Retrieve irrelevant context
- Miss critical memories
- Fail to resolve temporal conflicts
The bridge block architecture depends on consistent LLM behavior. Other models may:
- Create malformed bridge blocks
- Fail to detect topic shifts
- Break conversation continuity
We may add support for other models in future versions, but this will require:
- Extensive re-testing of all RAGAS benchmarks
- Prompt engineering specific to each model
- Validation that perfect scores are maintained
- Documentation of any behavioral differences
Until then, stick to gpt-4.1-mini.
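If you want to enforce this in your own code, a guard of the following shape reproduces the warning shown earlier. This is a sketch, not HMLR's internal implementation; `check_model` is a hypothetical helper:

```python
import warnings

TESTED_MODEL = "gpt-4.1-mini"  # the only validated model

def check_model(model: str) -> None:
    """Warn if the configured model is not the validated one."""
    if model != TESTED_MODEL:
        warnings.warn(
            f"Model '{model}' has NOT been tested! "
            f"HMLR is only validated with '{TESTED_MODEL}'.",
            UserWarning,
        )

check_model("gpt-4.1-mini")  # silent
check_model("gpt-4o")        # emits UserWarning
```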
The default model is set to gpt-4.1-mini everywhere:
```python
# In HMLRClient
client = HMLRClient(
    api_key="...",
    # model="gpt-4.1-mini"  # This is the default
)

# In LangChain integration
from langchain_openai import ChatOpenAI
memory = HMLRMemory(api_key="...")
llm = ChatOpenAI(model="gpt-4.1-mini")  # Use matching model!
```

GPT-4.1-mini is one of OpenAI's most cost-effective models while maintaining strong reasoning capabilities:
- Input: ~$0.15 per 1M tokens
- Output: ~$0.60 per 1M tokens
HMLR's architecture is designed to work efficiently within these economics:
- Average queries use < 4k tokens
- Bridge blocks minimize redundant context
- Fact extraction happens once per message
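Putting the listed prices and the sub-4k-token budget together, the per-query cost can be estimated as follows. The token split below is an illustrative assumption, not a measured value:

```python
# Rough per-query cost under the pricing listed above.
INPUT_PRICE_PER_M = 0.15   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 0.60  # USD per 1M output tokens

def query_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a single query."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# A query staying under the ~4k-token budget (assumed 3500 in / 500 out):
print(f"~${query_cost(3500, 500):.6f} per query")  # ~$0.000825
```

At these rates, even thousands of queries cost well under a dollar, which is the economic envelope the architecture targets.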
To verify what model you're using:
```python
client = HMLRClient(
    api_key="...",
    db_path="memory.db",
    model="gpt-4.1-mini"
)

# The model is stored in the client
print(f"Using model: {client.conversation_engine.llama_client.model}")
```

| Aspect | Status |
|---|---|
| Supported Models | gpt-4.1-mini ONLY |
| Tested Models | gpt-4.1-mini ONLY |
| Benchmarked Models | gpt-4.1-mini ONLY |
| Other Models | ❌ NOT SUPPORTED |
| Future Support | Maybe (requires extensive testing) |
When in doubt, use gpt-4.1-mini.