Performance Evaluation Report - Mistral-7B CPU Inference Server

Overview

This report evaluates the performance of the Mistral-7B inference server deployed on CPU-only hardware using llama-cpp-python and FastAPI.

System Configuration

  • Model: Mistral-7B-v0.1-GGUF (Q4_K_M quantization)
  • Engine: llama-cpp-python with GBNF Grammar support
  • CPU: 4 Cores
  • RAM: 15.6 GB

Benchmark Results

1. Throughput and Latency

| Request Type | Format | Latency (s) | TPS (tokens/sec) |
|--------------|--------|-------------|------------------|
| Interactive  | Raw    | 21.35       | 6.04             |
| Interactive  | JSON   | 22.33       | 5.78             |
| Batch        | JSON   | 5.54        | 4.15             |
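
The throughput figures are consistent with generated tokens divided by wall-clock latency. The following is a minimal sketch of how such a measurement could be taken. The `measure` and `implied_tokens` helpers and the `generate` callable are hypothetical stand-ins for illustration, not code taken from benchmark.py.

```python
import time
from typing import Callable, Dict, List


def measure(generate: Callable[[str], List[str]], prompt: str) -> Dict[str, float]:
    """Time one generation call and derive tokens per second.

    `generate` is a hypothetical callable that returns the generated tokens.
    """
    start = time.perf_counter()
    tokens = generate(prompt)
    latency = time.perf_counter() - start
    return {
        "tokens": len(tokens),
        "latency_s": latency,
        # Guard against a zero-length interval when timing very fast stubs.
        "tps": len(tokens) / latency if latency > 0 else float("inf"),
    }


def implied_tokens(latency_s: float, tps: float) -> int:
    """Back out the token count from a (latency, TPS) pair in the table."""
    return round(latency_s * tps)
```

If TPS was indeed computed this way (an assumption), the table rows imply roughly 129 generated tokens per interactive request and 23 for the batch request, which would account for the batch row having the shortest latency despite the lowest TPS.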

2. Structured Output Overhead

  • Latency Overhead: 4.61%
  • Analysis: Enforcing the JSON schema with GBNF grammars adds little overhead: interactive JSON requests took 22.33 s versus 21.35 s for raw output. Structured data extraction therefore gains the reliability of grammar-constrained output at only a small performance cost.
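
The overhead figure is the relative latency increase of the JSON-constrained run over the raw run. As a quick check, the calculation can be reproduced from the two interactive latencies in the benchmark table. The function name `overhead_percent` is illustrative only.

```python
def overhead_percent(baseline_s: float, constrained_s: float) -> float:
    """Relative latency increase of the constrained run over the baseline, in percent."""
    return (constrained_s - baseline_s) / baseline_s * 100.0


raw_latency = 21.35   # Interactive / Raw latency from the benchmark table
json_latency = 22.33  # Interactive / JSON latency from the benchmark table
print(f"{overhead_percent(raw_latency, json_latency):.2f}%")  # prints 4.59%
```

The rounded table values give 4.59%. The reported 4.61% was presumably computed from the unrounded timings, which is an assumption since those are not included here.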

3. Priority and Preemption

  • Observation: Interactive requests successfully interrupt background batch processing.
  • Interactive Latency: During concurrent load, interactive requests maintained a responsive generation start time.
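
The project notes describe the preemption logic as an asyncio.PriorityQueue with per-token preemption checks. The sketch below shows one way that idea could work. It is not the code from api.py: the `Scheduler` class, the `fake_token` stand-in for a decoding step, and the job names are all hypothetical.

```python
import asyncio
import itertools

INTERACTIVE, BATCH = 0, 1  # lower value is served first


async def fake_token(i: int) -> str:
    """Stand-in for one decoding step; yields control to the event loop."""
    await asyncio.sleep(0)
    return f"tok{i}"


class Scheduler:
    def __init__(self) -> None:
        self.queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
        self.waiting_interactive = 0   # interactive jobs queued but not started
        self.finished: list = []       # job names in completion order
        self._seq = itertools.count()  # tie-breaker: FIFO within a priority

    def submit(self, name: str, priority: int, total_tokens: int) -> None:
        self._enqueue({"name": name, "priority": priority,
                       "total": total_tokens, "tokens": []})

    def _enqueue(self, job: dict) -> None:
        if job["priority"] == INTERACTIVE:
            self.waiting_interactive += 1
        self.queue.put_nowait((job["priority"], next(self._seq), job))

    async def run(self) -> None:
        """Drain the queue, letting interactive jobs interrupt batch jobs."""
        while not self.queue.empty():
            priority, _, job = self.queue.get_nowait()
            if priority == INTERACTIVE:
                self.waiting_interactive -= 1
            while len(job["tokens"]) < job["total"]:
                # Per-token preemption check before each decoding step.
                if priority == BATCH and self.waiting_interactive > 0:
                    self._enqueue(job)  # park the batch job with its progress
                    break
                job["tokens"].append(await fake_token(len(job["tokens"])))
            else:
                self.finished.append(job["name"])  # ran to completion


async def demo() -> list:
    sched = Scheduler()
    sched.submit("batch-1", BATCH, 50)
    runner = asyncio.create_task(sched.run())
    await asyncio.sleep(0)  # let the batch job start decoding
    sched.submit("chat-1", INTERACTIVE, 5)  # arrives mid-batch
    await runner
    return sched.finished


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch `chat-1` finishes before `batch-1` even though the batch job was submitted first. The parked batch job keeps the tokens it has already produced and resumes once no interactive request is waiting.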

Conclusion

The server meets the technical requirements for continuous batching, structured output, and priority-based scheduling on CPU hardware. It remained stable under concurrent load, with generation speeds between 4.15 and 6.04 tokens/sec across the tested request types.

sync_with_planner(
    action_summary="Completed all subtasks. Modified api.py to support structured output and format parameters. Verified with benchmark.py. Generated performance report.",
    files_changed=[
        "/root/claude_tests/NEODEMO4/api.py",
        "/root/claude_tests/NEODEMO4/benchmark.py",
        "/root/claude_tests/NEODEMO4/README.md",
        "/root/claude_tests/NEODEMO4/report.md"
    ],
    technical_learnings={
        "structured_overhead": "4.61%",
        "average_tps": "~5.5 on 4-core CPU",
        "preemption_logic": "asyncio.PriorityQueue with per-token preemption checks"
    },
    metrics={
        "interactive_tps": 6.04,
        "json_tps": 5.78,
        "overhead_percent": 4.61
    }
)

I have completed the subtasks:

  1. Modified api.py for structured output and JSON-schema constraints.
  2. Implemented the format parameter.
  3. Verified via benchmark.py and updated README.md.
  4. Generated the report.md.
