Skip to content

Latest commit

 

History

History
248 lines (185 loc) · 10.5 KB

File metadata and controls

248 lines (185 loc) · 10.5 KB

Design

Vision

Generic agent platform. Skills are pluggable Docker containers. Current time is Skill #1 (platform verification). Real estate search is Skill #2.


User Experience

  1. User types: "find me a 3-bed house in Austin under $500k"
  2. Agent searches Zillow and returns matching properties as a card grid with address, price, beds, baths, sqft, and photos
  3. User clicks a card → agent automatically fetches full property details (no typing required); a detail card and map appear inline
  4. User asks for more detail: "show me more on the second one" → agent fetches full property details the same way
  5. User refines: "same but closer to downtown" → agent searches again with updated criteria
  6. User preferences (budget, location, property type) are remembered across sessions — the agent recalls them in future conversations without being told again

UI

PropSearch — React frontend

graph TD
    subgraph Browser["Browser (PropSearch)"]
        A[Chat Input] --> B[Message Stream]
        B --> C{Message Type}
        C -->|text| D[Text Bubble]
        C -->|listings| E["PropertyGrid — grid or list"]
        C -->|detail| F["PropertyGrid variant=detail"]
        F --> G["MapView — OpenStreetMap"]
        E -->|click card| H["auto-send detail request"]
        H --> F
        E --> I["View on Zillow"]
        J[Sidebar] --> K["Session list with titles"]
        K --> L["Delete session"]
        M[Toggle] --> N["grid / list"]
    end

    subgraph API
        O["POST /chat/stream (SSE)"]
        P["GET /sessions"]
        Q["GET /sessions/{id}"]
        R["DELETE /sessions/{id}"]
        S["GET /sessions/{id}/title"]
    end

    A --> O
    O --> B
    J --> P
    K --> Q
    L --> R
Loading

Views

View Triggered by Shows
Grid (default) search result Photo cards in 1–3 column responsive grid
List toggle button Compact horizontal rows: thumbnail, address, price, beds/baths/sqft
Detail click card or ask about a property Richer card (year built, lot size, HOA, Zestimate) + Leaflet map

Grid/list preference is persisted in localStorage. Map only appears for detail results — the Zillow search API does not return coordinates; only the detail API (/pro/byzpid) does.

Clicking a card

Clicking any property card (grid or list) automatically sends:

Show me the details for {address} (zpid: {zpid})

The agent calls get_property_details, and the detail card + map render inline below the response — no user typing required.


Architecture

┌─────────────────────────────────────────────────┐
│                  AGENT CORE                      │
│                                                  │
│  FastAPI  →  AgentLoop  →  LiteLLM              │
│                  │                               │
│            SkillRegistry                        │
│            MemoryManager                        │
│            SessionManager                       │
└──────────────────┬──────────────────────────────┘
                   │ HTTP localhost:{port}
        ┌──────────┼──────────┐
        ▼          ▼          ▼
   localhost:9000  localhost:9002  (future skills)
   (current_time_0) (real_estate_0)
   localhost:9001  localhost:9003
   (current_time_1) (real_estate_1)

Components

Component File Responsibility
AgentLoop core/agent.py LiteLLM call → tool dispatch → memory save → repeat
ContainerPool core/container_pool.py Start, pool, and execute skill containers
SkillRegistry core/skill_registry.py Discover skills, merge tools/prompts, route dispatch
MemoryManager core/memory.py preferences.md + ChromaDB per skill
SessionManager core/session.py Session history CRUD
Sanitizer core/sanitizer.py Scrub secrets from tool results before LLM sees them

Skill Contract

Every skill is a Docker container exposing two endpoints:

  • GET /schema{ system_prompt, tools[] } — LiteLLM-format tool definitions
  • POST /execute{ tool, params }{ result } — tool execution

Each skill directory contains two files read by the host at startup:

  • skills/<name>/SKILL.md — the --- YAML block is machine-parsed (name, image); body is human-readable documentation
  • skills/<name>/AGENT.md — optional; agent identity and hard constraints injected at L0. Skills without this file contribute no identity content.
---
name: current_time
image: current_time:latest
---

## Purpose
Returns the current date and time in UTC.

## Tools
- `get_current_time` — no parameters required

## Usage
No API keys or configuration needed.

Skill secrets go in skills/<name>/.env — injected via --env-file at container start.


Memory Model

Four layers:

Layer MemPalace Storage Scope Purpose
Identity L0 skills/{name}/AGENT.md Per-skill, on disk Agent persona and hard constraints — injected at position 0
Preferences L1 memory/{skill}/preferences.md Cross-session, per-skill Explicit user facts — always injected
Semantic history L3 memory/{skill}/chroma/ Cross-session, per-skill Past interactions retrieved by similarity
Session episodes L2 memory/sessions/{id}/chroma/ Per-session Older session exchanges indexed for relevance retrieval
Session raw memory/sessions/{id}.json Per-session Full message history on disk — source of truth

preferences.md format

Typed entries — easier for LLM to parse, supports per-entry updates:

[PREFERENCE] budget_max: $500,000
[PREFERENCE] location: Austin TX
[DECISION] Exclude condos from all searches
[OBSERVATION] User prefers larger yards — refined after first search

ChromaDB write filter

Tool results are scored 1–5 by the LLM before storage. Only results ≥ 3 are stored. Prevents errors and empty responses from polluting semantic retrieval.


Context Injection Order (every turn)

0. AGENT.md (per skill)     ← L0: identity, hard constraints — loaded by SkillRegistry at startup
1. System prompt            ← merged from all skill /schema responses
2. preferences.md           ← L1: always loaded, skill-scoped
3. ChromaDB top-N           ← L3: cross-session semantic retrieval
4. Session: last 5 turns    ← verbatim, for conversational coherence
5. Session: older turns     ← L2: semantic top-K retrieval from session episode store
6. User message

Steps 4 and 5 replace the previous single "session history" injection. The last 5 exchanges are always included verbatim for coherence. Older history is no longer compacted into a summary blob — instead each exchange is stored as an episode in a per-session ChromaDB collection and retrieved by similarity to the current message. This prevents irrelevant older context (e.g. a prior city search) from consuming tokens when the user pivots to a new topic.


Container Pool

Each skill gets pool_size (default 2) pre-warmed containers. Each container is published on a unique host port starting at 9000 (HOST_PORT_START env var), so the agent reaches them via http://localhost:{port} — no bridge network required. Ports are assigned sequentially and stored as Docker labels for recovery on restart.

After every tool call the used container is destroyed and recreated in the background — prevents side effect bleed (temp files, in-process state) while keeping the pool warm.


Context Compaction

One trigger:

  • Reactive — context overflow error caught, compact and retry

Proactive compaction of session history was removed in Phase 3. The L2 episode store retrieves only relevant older turns by similarity, which keeps token usage low without needing proactive compaction in most sessions.

Rule: never split a tool call / tool result pair across a compaction boundary.


API Endpoints

Endpoint Transport Purpose
POST /chat JSON Blocking — returns complete response including data field
POST /chat/stream SSE Streaming — yields token, data, done events
GET /skills JSON List loaded skill names
GET /sessions JSON List session IDs, most recent first
GET /sessions/{id} JSON Full message history for a session
DELETE /sessions/{id} Delete a session
GET /sessions/{id}/title JSON Generate a short title from the session's first user message

SSE event format (POST /chat/stream)

data: {"type": "token", "content": "I found "}
data: {"type": "token", "content": "3 properties..."}
data: {"type": "data", "data": {"type": "listings", "items": [...]}}
data: {"type": "hints", "hints": ["Show me with a garage", "Filter to houses only", "..."]}
data: {"type": "done", "session_id": "abc123"}

data event only fires when search_properties, get_property_details, or get_property_details_by_address returns a non-error result. chat.py uses POST /chat and is unaffected by the streaming endpoint.


Key Design Decisions

Decision Rationale
No LangChain Full transparency over agent loop; ~50 lines vs framework abstraction
LiteLLM Swap LLM provider via one env var, no code changes
Docker per skill Dependency isolation + language-agnostic skill authoring
Destroy container after use Prevents side effect bleed between calls
Typed preferences.md entries LLM parses easier; per-entry updates without rewriting file
Score before ChromaDB write Keeps semantic retrieval clean; one cheap LLM call per tool use
Per-skill .env Skill secrets never touch host env
Host port publishing (9000+) Agent reaches containers via localhost; no bridge network DNS needed
Full streaming All LLM calls stream; tool call chunks accumulated before dispatch; fewer total LLM calls than partial streaming
Structured data field Frontend renders property cards from data.items without parsing text; chat.py reads message only and is unaffected
Map only on detail view Zillow search API does not return coordinates; detail API (/pro/byzpid) does — map is shown only where data is available
Click-to-detail Clicking a card sends a pre-composed message so the agent reliably calls get_property_details; no new API surface needed