This repository serves as a technical sandbox for exploring high-performance engineering patterns applied to Large Language Models (LLMs). It bridges the gap between low-level system optimization and high-level agentic orchestration.
As part of my current research at UNESP, I focus on refactoring frameworks to improve memory management and reduce inference latency. This repository demonstrates the core principles I apply to ensure LLM solutions are not only intelligent but also production-ready and scalable.
- Languages: Python (Primary) & C++ (Performance-critical modules).
- Interoperability: Utilizing `pybind11` to bridge C++ memory management with Pythonic LLM workflows.
- Orchestration: Agentic RAG implementation using `LangChain` and `CrewAI`.
- Infrastructure: Containerization with `Docker` and observability patterns for MLOps.
Exploration of C++/Python FFI (Foreign Function Interface) to handle memory buffers efficiently. This simulates the optimization of tensor processing and context window management, reducing the overhead often found in pure Python implementations.
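A minimal sketch of the zero-copy principle behind this, using a NumPy array as a stand-in for a C++-owned tensor (the buffer shape and variable names are illustrative, not part of the actual bridge):

```python
import numpy as np

# A NumPy buffer stands in for a tensor owned by a C++ module; pybind11
# exposes such memory to Python through the same buffer protocol.
buf = np.zeros((4096, 1024), dtype=np.float32)

# Slicing yields a zero-copy view: Python reads and writes the
# underlying memory directly, with no serialization overhead.
window = buf[:512]
assert window.base is buf  # the view shares storage with the buffer

window[:] = 1.0            # writes through to the shared memory
assert buf[0, 0] == 1.0
```

Passing views instead of copies is what keeps large context windows cheap to slice and reorder on the Python side.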
Implementation of "Chain-of-Thought" reasoning for autonomous agents. Focuses on (see the sketch after this list):
- Reducing hallucinations through validated retrieval.
- Efficient prompt structuring to minimize token usage.
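A minimal plain-Python sketch of both ideas; `Doc`, `validated_retrieve`, the `0.75` threshold, and the prompt template are illustrative stand-ins, not the framework's API:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    score: float  # retriever similarity in [0, 1]

def validated_retrieve(docs, threshold=0.75, k=3):
    """Discard weakly matched passages before the agent reasons over them.

    Refusing to ground chain-of-thought steps in low-confidence context
    is one simple guard against hallucinated answers.
    """
    kept = [d for d in docs if d.score >= threshold]
    return sorted(kept, key=lambda d: d.score, reverse=True)[:k]

def build_prompt(question, docs):
    """Build a compact prompt: numbered evidence plus one reasoning cue.

    Citing passages by number instead of restating them keeps the
    token budget small.
    """
    evidence = "\n".join(f"[{i}] {d.text}" for i, d in enumerate(docs, 1))
    return (
        f"Evidence:\n{evidence}\n\n"
        f"Question: {question}\n"
        "Reason step by step, citing evidence by number, then answer."
    )
```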
A dedicated utility to monitor the following (a sketch follows the list):
- TTFT (Time to First Token) and total inference latency.
- Memory Footprint: Real-time tracking of RSS memory consumption during model execution.
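A minimal sketch of such a profiler; it assumes `psutil` is installed, and `profile_stream` / `token_stream` are hypothetical names for illustration:

```python
import time
import psutil  # assumed dependency for RSS sampling

def profile_stream(token_stream):
    """Measure TTFT, total latency, and RSS growth over a token stream.

    `token_stream` is any iterator yielding generated tokens; a real
    inference client's streaming response would fit here.
    """
    proc = psutil.Process()
    rss_before = proc.memory_info().rss
    start = time.perf_counter()
    ttft = None
    tokens = 0
    for _ in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        tokens += 1
    total = time.perf_counter() - start
    rss_after = proc.memory_info().rss
    return {
        "ttft_s": ttft,
        "total_s": total,
        "tokens": tokens,
        "rss_delta_mb": (rss_after - rss_before) / 2**20,
    }

# Example with a dummy generator standing in for a model stream:
# print(profile_stream(iter(["Hello", " world", "!"])))
```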
In a production environment, scaling GenAI solutions requires more than just "prompting." It requires a deep understanding of how models interact with hardware. By optimizing memory access and execution pipelines, we can significantly reduce operational costs and improve user experience through lower latency.
Developed as a showcase for LLM Engineering best practices.