refactor: modernize engine with Pydantic models, modular architecture, and session recovery#166
refactor: modernize engine with Pydantic models, modular architecture, and session recovery#166anticomputer merged 21 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Refactors the SecLab Taskflow Agent from a monolithic entrypoint into a modular engine with Pydantic v2-validated YAML grammar models, a Typer CLI, per-model API configuration (Chat Completions vs Responses), and task-level checkpoint/resume support.
Changes:
- Introduces typed Pydantic document models and updates YAML loading to validate at parse time.
- Replaces the old argparse-based entrypoint with a Typer CLI and moves execution logic into a dedicated runner with MCP lifecycle modules.
- Adds session checkpointing/resume plus new examples/docs/tests for the modernized architecture and per-model API settings.
Reviewed changes
Copilot reviewed 35 out of 35 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_yaml_parser.py | Updates YAML parsing assertions to use Pydantic document attributes. |
| tests/test_template_utils.py | Switches reusable taskflow prompt access to typed models. |
| tests/test_session.py | Adds unit tests for session checkpoint/load/list behavior. |
| tests/test_models.py | Adds validation tests for Pydantic grammar models and real YAML fixtures. |
| tests/test_cli_parser.py | Updates legacy prompt parsing tests for renamed unpack variables and typed taskflow access. |
| src/seclab_taskflow_agent/template_utils.py | Adds typing/exports and adapts prompt loading to Pydantic PromptDocument. |
| src/seclab_taskflow_agent/shell_utils.py | Adds typing/docs and improves shell error messaging. |
| src/seclab_taskflow_agent/session.py | Implements taskflow session persistence for checkpoint/resume. |
| src/seclab_taskflow_agent/runner.py | New modular execution engine with model resolution, reusable tasks, prompt expansion, retries, and session handling. |
| src/seclab_taskflow_agent/render_utils.py | Adds typing/exports and docs for async output buffering. |
| src/seclab_taskflow_agent/prompt_parser.py | Extracts legacy argparse parsing used by runtime/test compatibility. |
| src/seclab_taskflow_agent/path_utils.py | Centralizes platformdirs data-dir creation and reuses it for MCP data storage. |
| src/seclab_taskflow_agent/models.py | Defines the YAML grammar via Pydantic v2 models (taskflow/personality/toolbox/model_config/prompt). |
| src/seclab_taskflow_agent/mcp_utils.py | Slims MCP utilities and re-exports transport/prompt; updates toolbox param resolution for typed models. |
| src/seclab_taskflow_agent/mcp_transport.py | Extracts MCP transport implementations (stdio wrappers + streamable process thread). |
| src/seclab_taskflow_agent/mcp_servers/memcache/memcache_backend/dictionary_file.py | Tightens type comparisons for memcache “add_state”. |
| src/seclab_taskflow_agent/mcp_servers/codeql/jsonrpyc/init.py | Replaces bare except with except Exception. |
| src/seclab_taskflow_agent/mcp_servers/codeql/client.py | Adds logging import (used for client logging). |
| src/seclab_taskflow_agent/mcp_prompt.py | Extracts system prompt construction helper. |
| src/seclab_taskflow_agent/mcp_lifecycle.py | Extracts MCP server connect/cleanup lifecycle management. |
| src/seclab_taskflow_agent/env_utils.py | Adds typing/docs and exports for env swap + temporary env context manager. |
| src/seclab_taskflow_agent/cli.py | Adds Typer CLI with --debug/--resume, concise error printing, and logging setup. |
| src/seclab_taskflow_agent/capi.py | Adds module docs/exports and typing improvements for token/endpoint/model listing. |
| src/seclab_taskflow_agent/banner.py | Adds typing/exports and docstring for startup banner. |
| src/seclab_taskflow_agent/available_tools.py | Rewrites YAML loader to cache + validate documents via Pydantic models. |
| src/seclab_taskflow_agent/agent.py | Extends TaskAgent to support Responses API, per-model endpoint/token overrides, and adds exports/docs. |
| src/seclab_taskflow_agent/main.py | Replaces monolithic implementation with a small entrypoint delegating to CLI/runner. |
| src/seclab_taskflow_agent/init.py | Adds package-level docs and re-exports key public types. |
| pyproject.toml | Updates metadata, adds console script entrypoint, and tightens Ruff configuration. |
| examples/taskflows/example_responses_api.yaml | Adds an example taskflow configured to use the Responses API. |
| examples/taskflows/echo_responses_api.yaml | Adds a Responses API echo taskflow example. |
| examples/taskflows/comprehensive_test.yaml | Adds a comprehensive taskflow exercising the full grammar and engine features. |
| examples/model_configs/responses_api.yaml | Adds model_config demonstrating per-model api_type/endpoint/token overrides. |
| doc/GRAMMAR.md | Documents per-model settings and engine-handled model_settings keys. |
| README.md | Documents the new architecture, API types, session recovery, and error output behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 35 out of 35 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 36 out of 36 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
…r CLI - Extract __main__.py into cli.py, runner.py, mcp_lifecycle.py, models.py - Add Pydantic v2 grammar models for all YAML document types - Replace argparse with Typer CLI, add project.scripts entry point - Reduce ruff ignore list from 59 to 22 rules, fix lint issues - Add 29 model tests against real YAML files (68/68 pass) - Full grammar backwards compatibility preserved
Replace all raw dict access with typed model attributes throughout the engine. AvailableTools now returns Pydantic model instances, runner.py uses typed fields, and tests validate via model attributes.
Split mcp_utils into mcp_transport and mcp_prompt. Extract prompt_parser to break cli/runner circular import. Decompose run_main into focused helpers. Add type hints and docstrings across all modules. Add __all__ exports, pyproject keywords and classifiers.
Add api_type field to ModelConfigDocument (chat_completions|responses). Thread api_type through runner -> deploy_task_agents -> TaskAgent. TaskAgent switches between OpenAIChatCompletionsModel and OpenAIResponsesModel based on api_type. Default: chat_completions.
model_settings in model_config can now set api_type, endpoint, and token per model. token is an env var name to resolve. Allows mixing chat_completions and responses API across models and routing to different endpoints within a single taskflow.
Document api_type, endpoint, and token per-model overrides. Update architecture module listing in README. Add per-model settings reference table to GRAMMAR.md.
- Fix mutable default args in TaskAgent.__init__ (list -> None) - Replace bare except Exception with specific exception types - Add __all__ exports to all 14 submodules - Add docstrings to all hook methods in agent.py - Add return type to banner.py and prompt_parser.py - Add module docstrings to capi.py, env_utils.py, shell_utils.py - Remove B006 ruff suppression (no longer needed)
Add --debug/-d flag and TASK_AGENT_DEBUG env var. Default: one-line error chain. Debug: full traceback.
Add TaskflowSession model for task-level checkpointing. Save progress after each task, auto-retry failed tasks 3x. New --resume flag to continue from last checkpoint. Includes 9 tests for session persistence.
c22b38f to
69fc1e3
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 43 out of 43 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- capi: unknown endpoint fallback now type-checks response before iterating, avoids AttributeError on dict responses without data key - render_utils: flush_async_output is a no-op when no buffer exists, prevents masking the real error when an async agent fails early - session: list_sessions sorts by file mtime instead of filename so most-recent-first ordering is actually correct - runner: replace confusing issubset(set()) idiom with straightforward set difference check for unknown model settings keys - runner: catch JSONDecodeError on outer tool result parse separately from the inner text parse so malformed results get a clear error - tests: suppress S105 false positive on test token assertion
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 43 out of 43 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| data = async_output[task_id] | ||
| del async_output[task_id] | ||
| # No buffered output (agent may have failed before producing any). | ||
| return |
There was a problem hiding this comment.
Any reason for changing from raising exception to just silently return?
There was a problem hiding this comment.
The old code raised ValueError if the task_id wasn't in the buffer. But an agent can fail before producing any output (e.g. connection error during setup), and flush_async_output is called unconditionally in the cleanup path. Raising there would mask the original failure. The early return with the comment on line 27 makes this a graceful no-op instead.
| if models_params and not isinstance(models_params, dict): | ||
| raise ValueError(f"Settings section of model_config file {model_config_ref} must be a dictionary") | ||
| unknown = set(models_params) - set(model_keys) | ||
| if unknown: |
There was a problem hiding this comment.
This probably isn't validated via pydantic
There was a problem hiding this comment.
It actually is ... model_settings: dict[str, dict[str, Any]] in Pydantic v2 validates both the outer dict and that each value is a dict. Removed the check along with the others; updated the test to verify Pydantic validation directly.
| self.on_output: Callable[[str], None] | None = on_output | ||
| self.on_error: Callable[[str], None] | None = on_error | ||
| self.poll_interval: float = poll_interval | ||
| self.env: dict[str, str] = _filtered_env() |
There was a problem hiding this comment.
This is the same as previous behaviour, but do we need to pass all the environment variables to the mcp process? (It now takes a deny list but not set by default). It is also not consistent with other mcp behaviour where environment variables are passed via the toolbox yaml file (This class also supports that, so I think it is unnecessary to copy the environment variables)
There was a problem hiding this comment.
MCP subprocesses need a working base environment (PATH, HOME, SHELL, LANG, etc.) to function ... without these, even basic commands and process startup fail. The toolbox YAML env: section is for adding server-specific variables on top of this inherited base. This matches the previous behavior (the old code had self.env = os.environ.copy() with a # XXX: potential for environment leak comment) but improves on it with the TASKFLOW_ENV_DENYLIST mechanism for filtering sensitive variables. Switching to an allowlist-only approach would break existing toolbox configurations that rely on a working shell environment.
There was a problem hiding this comment.
Then should we have a default list to at least exclude GH_TOKEN etc.?
src/seclab_taskflow_agent/runner.py
Outdated
| # Model settings | ||
| parallel_tool_calls = bool(os.getenv("MODEL_PARALLEL_TOOL_CALLS")) | ||
| model_params: dict[str, Any] = { | ||
| "temperature": os.getenv("MODEL_TEMP", default=0.0), |
There was a problem hiding this comment.
Note: as quick fix we can just set temperature: 1.0 in the responses_api.yaml for now.
It looks like we should not set a default temperature at least not when using models via the responses API.
e.g. using this config:
seclab-taskflow-agent:
version: "1.0"
filetype: model_config
models:
gpt_responses: gpt-5.1
model_settings:
gpt_responses:
api_type: responses
gives the error:
openai.BadRequestError: Error code: 400 - {'error': {'message': "Unsupported parameter: 'temperature' is not supported with this model.", 'code': 'invalid_request_body'}}
(removing that line will make it work)
As the PR does not break existing functionality (via chat completion API), we can also handle this in a subsequent PR.
If we want to remove the default temperature in general we need to add it to the model configs everywhere where it was in use before -> this might be a bigger change, but is more transparent than having two different behaviours depending on which API is used.
There was a problem hiding this comment.
Fixed ... temperature is no longer injected by default when api_type == "responses". For chat_completions, the 0.0 default is preserved for backward compatibility. MODEL_TEMP env var and YAML model_settings still override for both API types.
- Remove redundant isinstance checks in _resolve_model_config; Pydantic ModelConfigDocument validates dict[str, dict[str, Any]] at parse time, making runtime type checks dead code. Update test to verify Pydantic validation instead of testing the removed checks. - Remove shadowing local imports of os and typer in cli.py that duplicate top-level imports. - Don't inject default temperature (0.0) when api_type is 'responses'; the responses API rejects unsupported parameters. MODEL_TEMP env var and YAML model_settings still override for both API types. Preserves backward-compatible 0.0 default for chat_completions. - Narrow task-level retry to transient network exceptions only (APIConnectionError, APITimeoutError, ConnectionError, TimeoutError). Non-retriable errors (ValueError, RuntimeError, etc.) now break immediately instead of burning through retry attempts. Prevents blind retries of deterministic failures and side-effectful repeat_prompt re-execution on non-transient errors.
|
Pushed 8b48991 addressing the review feedback — see individual thread replies for details. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 43 out of 43 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
src/seclab_taskflow_agent/env_utils.py:63
TmpEnv.__exit__()unconditionally doesdel os.environ[k]. If code inside the context removes one of these variables (e.g.,os.environ.pop(k)),__exit__will raiseKeyErrorand can mask the original exception. Useos.environ.pop(k, None)(and then restore fromself.restore_envif present) to make cleanup idempotent and robust.
def __exit__(self, exc_type: type | None, exc_val: BaseException | None, exc_tb: Any | None) -> None:
for k, v in self.env.items():
del os.environ[k]
if k in self.restore_env:
os.environ[k] = self.restore_env[k]
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The engine code was a single 671-line
__main__.pywith raw dict access everywhere, no type safety, and no error recovery. This rewrites the internals while keeping the YAML grammar fully backwards-compatible. All existing taskflows (both example and production seclab-taskflows) work without changes.What changed
Architecture.
__main__.pyis now 25 lines. The logic lives in focused modules:models.py-- Pydantic v2 models for all five YAML document types (taskflow, personality, toolbox, model_config, prompt). Validation happens at parse time instead of scattering.get()calls everywhere.runner.py-- execution engine, extracted from__main__and decomposed into_resolve_model_config(),_merge_reusable_task(),_resolve_task_model(),_build_prompts_to_run().cli.py-- Typer-based CLI replacing argparse. Adds--debugand--resumeflags.agent.py-- TaskAgent wrapper, now supports per-model API type, endpoint, and token overrides.mcp_lifecycle.py,mcp_transport.py,mcp_prompt.py-- decomposed from the old 462-linemcp_utils.py.session.py-- task-level checkpoint/resume with auto-retry.prompt_parser.py-- extracted to break a circular import between cli and runner.Pydantic grammar models. Every YAML file is validated against a typed model at load time.
extra="allow"on all models so new fields do not break old parsers. Reserved word conflicts handled via aliases (async->async_task,model_config->model_config_ref). Version normalization (1->"1.0") with validation.Per-model API configuration.
model_settingsin a model_config can now includeapi_type(chat_completions or responses),endpoint, andtoken(env var name). These are extracted by the engine before passing remaining settings to the SDK. This lets you mix endpoints and API types across models in a single taskflow.Arbitrary API endpoints. Unknown API endpoints no longer crash the agent at import time or when listing models. The model catalog and tool-call detection fall back gracefully for non-standard endpoints, so users can point at any compatible API.
Session recovery. Taskflow runs are checkpointed after each task. On unrecoverable failure, the session ID is printed and
--resumepicks up from the last successful task. Failedmust_completetasks are not advanced past on resume. Failed tasks auto-retry 3 times with backoff before giving up.Error output. Exceptions now print a concise one-line chain by default instead of dumping 80-line tracebacks through httpx/httpcore/openai internals.
--debugorTASK_AGENT_DEBUG=1gets you the full trace. SettingTASK_AGENT_DEBUG=0orTASK_AGENT_DEBUG=falsecorrectly disables debug mode.Code quality. Type hints and docstrings on all public functions.
__all__exports on every module. Fixed mutable default args inTaskAgent.__init__. Replaced bareexcept Exceptionwith specific types. Zero ruff errors across the entire codebase. Ruff ignores reduced from 59 to 21.What did not change
The YAML grammar is 100% backwards-compatible. No taskflow files need modification. All field names, nesting, template syntax, and behavior are preserved.
Testing
Unit tests for Pydantic models, runner internals (model resolution, task merging, prompt building, pop-after-render correctness), Typer CLI arg parsing, debug env handling, prompt parser edge cases, capi endpoint handling with arbitrary endpoint fallback, and session persistence including corrupt file recovery. 155 tests total, all passing, ruff clean.
Integration tested against api.githubcopilot.com: