This report evaluates the performance of the Mistral-7B inference server deployed on CPU-only hardware using llama-cpp-python and FastAPI.
- Model: Mistral-7B-v0.1-GGUF (Q4_K_M quantization)
- Engine: llama-cpp-python with GBNF Grammar support
- CPU: 4 Cores
- RAM: 15.6 GB
| Request Type | Format | Latency (s) | TPS (Tokens/Sec) |
|---|---|---|---|
| Interactive | Raw | 21.35 | 6.04 |
| Interactive | JSON | 22.33 | 5.78 |
| Batch | JSON | 5.54 | 4.15 |
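
The cost of JSON-constrained generation can be recomputed from the two interactive rows of the table. The sketch below uses only the rounded values published above; the names `relative_change`, `raw`, and `json_mode` are illustrative and not taken from `benchmark.py`. With rounded inputs the latency overhead comes out at roughly 4.6%, marginally different from the reported figure, which was presumably computed from unrounded measurements (an assumption).

```python
# Interactive rows from the benchmark table (rounded values as published).
raw = {"latency_s": 21.35, "tps": 6.04}
json_mode = {"latency_s": 22.33, "tps": 5.78}


def relative_change(baseline: float, value: float) -> float:
    """Return the percent change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


# Positive number: JSON mode takes longer than raw mode.
latency_overhead = relative_change(raw["latency_s"], json_mode["latency_s"])
# Negative number: JSON mode generates fewer tokens per second.
throughput_change = relative_change(raw["tps"], json_mode["tps"])

# Latency multiplied by throughput approximates the number of generated tokens,
# which shows whether both runs produced outputs of comparable length.
raw_tokens = raw["latency_s"] * raw["tps"]
json_tokens = json_mode["latency_s"] * json_mode["tps"]

print(f"latency overhead:  {latency_overhead:.2f}%")
print(f"throughput change: {throughput_change:.2f}%")
print(f"implied tokens:    raw={raw_tokens:.0f}, json={json_tokens:.0f}")
```

Both interactive runs imply about the same number of generated tokens, so the latency difference reflects per-token cost rather than a difference in output length.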
- Latency Overhead: 4.61% (JSON vs. raw output on interactive requests)
- Analysis: Enforcing a JSON schema with GBNF grammars adds about 1 s of latency (22.33 s vs. 21.35 s) and lowers throughput from 6.04 to 5.78 tokens/s on interactive requests, a small cost for reliable structured data extraction.
- Observation: Interactive requests successfully interrupt background batch processing.
- Interactive Latency: During concurrent load, interactive requests maintained a responsive generation start time.
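
The planner summary in this log describes the preemption logic as an `asyncio.PriorityQueue` with per-token preemption checks. The following is a minimal sketch of that idea under stated assumptions, not the code in `api.py`: the `Job` class, `run_scheduler`, and `demo` are hypothetical names, and `asyncio.sleep(0)` stands in for generating one token with the model.

```python
import asyncio
import itertools
from dataclasses import dataclass, field

INTERACTIVE, BATCH = 0, 1  # lower value means higher priority


@dataclass(order=True)
class Job:
    priority: int
    seq: int  # tie-breaker so equal-priority jobs stay in arrival order
    name: str = field(compare=False)
    tokens_left: int = field(compare=False)


async def run_scheduler(queue: asyncio.PriorityQueue, log: list) -> None:
    # A real server would block on `await queue.get()` indefinitely;
    # this sketch stops once the queue is drained.
    while not queue.empty():
        job = queue.get_nowait()
        while job.tokens_left > 0:
            await asyncio.sleep(0)  # placeholder for one token of generation
            job.tokens_left -= 1
            log.append(job.name)
            # Per-token preemption check: look at the best waiting job.
            if job.tokens_left > 0 and not queue.empty():
                waiting = queue.get_nowait()
                if waiting.priority < job.priority:
                    queue.put_nowait(job)  # park the unfinished job
                    job = waiting          # switch to the urgent one
                else:
                    queue.put_nowait(waiting)


async def demo() -> list:
    queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
    log: list = []
    seq = itertools.count()
    queue.put_nowait(Job(BATCH, next(seq), "batch", 5))

    async def submit_interactive() -> None:
        # Arrive while the batch job is already generating.
        await asyncio.sleep(0)
        await asyncio.sleep(0)
        queue.put_nowait(Job(INTERACTIVE, next(seq), "interactive", 2))

    await asyncio.gather(run_scheduler(queue, log), submit_interactive())
    return log


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the batch job is parked after the token at which the interactive request is noticed, the interactive request runs to completion, and the batch job then resumes with its remaining tokens. Checking once per token keeps the worst-case delay before an interactive request starts to roughly one token's generation time.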
The server meets the technical requirements for continuous batching, structured output, and priority-based scheduling on CPU hardware. It remained stable under concurrent load, with measured throughput between 4.15 and 6.04 tokens/s across the request types tested.
sync_with_planner(
    action_summary="Completed all subtasks. Modified api.py to support structured output and format parameters. Verified with benchmark.py. Generated performance report.",
    files_changed=[
        "/root/claude_tests/NEODEMO4/api.py",
        "/root/claude_tests/NEODEMO4/benchmark.py",
        "/root/claude_tests/NEODEMO4/README.md",
        "/root/claude_tests/NEODEMO4/report.md"
    ],
    technical_learnings={
        "structured_overhead": "4.61%",
        "average_tps": "~5.5 on 4-core CPU",
        "preemption_logic": "asyncio.PriorityQueue with per-token preemption checks"
    },
    metrics={
        "interactive_tps": 6.04,
        "json_tps": 5.78,
        "overhead_percent": 4.61
    }
)

I have completed the subtasks:
- Modified `api.py` for structured output and JSON-schema constraints.
- Implemented the `format` parameter.
- Verified via `benchmark.py` and updated `README.md`.
- Generated `report.md`.
TERMINATE_CHAT