[IMPROVE] Decouple agent logic from the endpoint to support AI evals

## Background


We need a cleaner way to evaluate assistant behavior without routing every run through the HTTP endpoint.

Right now, the assistant flow is tightly coupled to the DRF view, which makes it harder to:
- run repeatable evals from the terminal,
- save outputs for later review,
- compare responses across branches or prompt changes,
- and isolate LLM behavior from endpoint overhead when needed.

The refactor enables two evaluation modes:

1. **Full-stack eval fidelity without endpoint overhead**
   - Reuse the real assistant orchestration, tool loop, and retrieval flow directly from Python.
   - This keeps eval runs close to production behavior without needing to exercise the HTTP layer.

2. **More isolated LLM evaluation**
   - By separating orchestration and tool logic, we create a path toward swapping in a mock search function or other test doubles when we only want to study model behavior.

`pytest` is useful for pass/fail assertions, but it is awkward for exploratory eval workflows where the main goal is to generate, save, and manually review model outputs. It also does not naturally support a “generate CSV, then compare results interactively” loop without extra harnessing.

That leads to a deliberate split:
- `evaluation/eval_assistant.py` — automation for generating evaluation results to CSV
- `evaluation/review.ipynb` — interactive review and comparison of those CSV outputs

The generation step is automated; the review step is interactive. They warrant different tools.

## Current State


This work:
- extracts assistant orchestration into `assistant_services.py`,
- extracts tool-loop and retrieval logic into `tool_services.py`,
- updates the DRF view to call `run_assistant(...)`,
- adds an eval script that writes results to CSV,
- adds a small notebook for side-by-side response review,
- and adds focused unit tests around the extracted service logic.

Representative evaluation questions and prior discussion:
- https://github.com/CodeForPhilly/balancer-main/issues/345#issuecomment-3433329904
- https://github.com/CodeForPhilly/balancer-main/issues/411#issuecomment-3712677508

Existing evaluation-related directory:
- https://github.com/CodeForPhilly/balancer-main/tree/develop/evaluation

## Acceptance Criteria
- [ ] Assistant orchestration can be invoked outside the HTTP endpoint through a reusable service function.
- [ ] Tool-loop and retrieval wiring are extracted into service-level modules with clear responsibilities.
- [ ] The DRF assistant endpoint continues returning the same response contract after the refactor.
- [ ] A terminal-run evaluation script can execute a representative question set and save results to CSV.
- [ ] A notebook exists for side-by-side review of evaluation outputs across runs or branches.
- [ ] Focused unit tests cover the new assistant service and tool service behavior.
- [ ] Existing assistant-related tests continue to pass after the refactor.

## Approach


Refactor first around **evaluation usability**, not just code organization.

The assistant flow is being split so the production request path can keep using the same underlying logic while eval tooling can call that logic directly. The goal is not to create a separate eval-only implementation, but to reuse the same orchestration in both contexts.

Planned / current approach:
1. Extract assistant orchestration into a reusable service entry point.
2. Extract tool schema, tool mapping, retrieval dispatch, and reasoning-loop behavior into a separate service module.
3. Keep the DRF view thin by delegating to the service entry point.
4. Add an eval script that runs representative questions and saves outputs to CSV.
5. Add a lightweight notebook for manual side-by-side comparison of generated outputs.
6. Add focused unit tests around orchestration and tool-loop behavior.

Evaluation practice should start with **error analysis**, not infrastructure. After meaningful prompt or retrieval changes, manually review a batch of roughly 20–50 outputs before investing further in eval automation. Use one domain-informed reviewer to make final quality calls when consistency matters.

Useful guidance:
- Error analysis: https://hamel.dev/blog/posts/evals-faq/#q-why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed
- Reviewer strategy / “benevolent dictator”: https://hamel.dev/blog/posts/evals-faq/#q-how-many-people-should-annotate-my-llm-outputs

Potential follow-up metrics:
- total duration
- token usage
- tool calls made
- total cost

The OpenAI API Dashboard can help validate duration and cost during early iterations.

## References


- https://github.com/CodeForPhilly/balancer-main/issues/345#issuecomment-3433329904
- https://github.com/CodeForPhilly/balancer-main/issues/411#issuecomment-3712677508
- https://github.com/CodeForPhilly/balancer-main/tree/develop/evaluation
- https://hamel.dev/blog/posts/evals-faq/

## Risks and Rollback


### Risks

| Risk | Severity | Mitigation |
|------|----------|------------|
| Assistant endpoint behavior changes from the refactor because the request path now flows through `run_assistant`. | Medium | Keep the response contract unchanged and cover orchestration behavior with service-level unit tests. Spot-check a real assistant query after deploy because OpenAI and DB interactions are mocked in unit tests. |
| Missing input validation: an omitted or blank `message` may still reach the assistant layer. | Low | Treat as a follow-up if not fixed in this change. Add explicit request validation and a 400 response contract if needed. |
| Eval tooling has first-run setup friction (environment, `sys.path`, optional notebook dependencies). | Low | Keep eval tooling offline-only and out of the production request path. Document setup assumptions inline. |
| The new tests focus on logic seams rather than full request/DB integration. | Low | This is intentional for speed and clarity; add a future integration test if needed. |
| Full-suite failures may expose unrelated existing test-environment issues. | Low | Scope success criteria to the assistant-related test suite for this change, and separately document unrelated test failures if they appear. |

### Rollback

- No database migrations are expected in this change.
- Rollback is code-only: revert the merge commit or restore the pre-refactor assistant view implementation.
- The eval script and notebook are independent of the request path and can be removed separately if needed.
- Because the refactor preserves the endpoint contract, rollback risk is primarily implementation-level, not schema-level.

## Screenshots / Recordings


Optional:
- example CSV output from `eval_assistant.py`
- screenshot of side-by-side notebook comparison
- scoped test output for `api/views/assistant/`

## Related PR


https://github.com/CodeForPhilly/balancer-main/pull/499

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[IMPROVE] Decouple agent logic from the endpoint to support AI evals #490

Background

Current State

Acceptance Criteria

Approach

References

Risks and Rollback

Risks

Rollback

Screenshots / Recordings

Related PR

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Risk	Severity	Mitigation
Assistant endpoint behavior changes from the refactor because the request path now flows through `run_assistant`.	Medium	Keep the response contract unchanged and cover orchestration behavior with service-level unit tests. Spot-check a real assistant query after deploy because OpenAI and DB interactions are mocked in unit tests.
Missing input validation: an omitted or blank `message` may still reach the assistant layer.	Low	Treat as a follow-up if not fixed in this change. Add explicit request validation and a 400 response contract if needed.
Eval tooling has first-run setup friction (environment, `sys.path`, optional notebook dependencies).	Low	Keep eval tooling offline-only and out of the production request path. Document setup assumptions inline.
The new tests focus on logic seams rather than full request/DB integration.	Low	This is intentional for speed and clarity; add a future integration test if needed.
Full-suite failures may expose unrelated existing test-environment issues.	Low	Scope success criteria to the assistant-related test suite for this change, and separately document unrelated test failures if they appear.

Uh oh!

Uh oh!

[IMPROVE] Decouple agent logic from the endpoint to support AI evals #490

Description

Background

Current State

Acceptance Criteria

Approach

References

Risks and Rollback

Risks

Rollback

Screenshots / Recordings

Related PR

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions