Commit 03e2714

madhu-mohan-jaishankar, cafalchio, and jonpspri authored
Feature/plugin multi tenancy with per tool plugin config (#4068)
* add per-context plugin manager with TenantPluginManagerFactory

  Introduce context-scoped plugin isolation via a new TenantPluginManagerFactory singleton. Each virtual server/tenant receives its own TenantPluginManager with an independently merged plugin config, while the __global__ manager continues to serve non-context-aware call sites.

  - Add TenantPluginManager (disables Borg state sharing) and TenantPluginManagerFactory (async-safe, deduplicating cache/factory)
  - Update framework __init__ accessors: init_plugin_manager_factory(), get_plugin_manager(server_id), shutdown/reset helpers
  - Wire services (tool, prompt, resource, classification) to pass server_id when resolving plugin managers
  - Update main.py lifespan to initialize factory eagerly at startup
  - Add PluginConfigOverride model and shallow config merge logic
  - Add architecture doc and 756-line test suite for tenant-scoped plugin execution

  Signed-off-by: cafalchio <mcafalchio@gmail.com>

* ran pre-commit to update secrets
  Signed-off-by: cafalchio <mcafalchio@gmail.com>
* Update architecture, linted and secrets
  Signed-off-by: cafalchio <mcafalchio@gmail.com>
* updated secrets
  Signed-off-by: cafalchio <mcafalchio@gmail.com>
* audited and scan secrets
  Signed-off-by: cafalchio <mcafalchio@gmail.com>
* feat: add per-tool plugin binding API with validation and tests
  Signed-off-by: Madhu Mohan Jaishankar <madhu.mohan.jaishankar@ibm.com>
* test: improve router and service test quality for tool plugin bindings
  Signed-off-by: Madhu Mohan Jaishankar <madhu.mohan.jaishankar@ibm.com>
* fix: rebase migration onto cbedf4e580e0 to resolve multiple Alembic heads
  Signed-off-by: Madhu Mohan Jaishankar <madhu.mohan.jaishankar@ibm.com>
* chore: update secrets baseline with migration revision IDs
  Signed-off-by: Madhu Mohan Jaishankar <madhu.mohan.jaishankar@ibm.com>
* chore: mark migration revision IDs as non-secrets in baseline
  Signed-off-by: Madhu Mohan Jaishankar <madhu.mohan.jaishankar@ibm.com>
* feat: grant tools.manage_plugins to team_admin; fix service UUID and team guard
  Signed-off-by: Madhu Mohan Jaishankar <madhu.mohan.jaishankar@ibm.com>
* chore: add migration revision IDs to secrets baseline
  Signed-off-by: Madhu Mohan Jaishankar <madhu.mohan.jaishankar@ibm.com>
* feat: wire per-tool plugin bindings into tool invocation pipeline
  Signed-off-by: Madhu Mohan Jaishankar <madhu.mohan.jaishankar@ibm.com>
* Added dynamic plugin for rust endpoints
  Signed-off-by: cafalchio <mcafalchio@gmail.com>
* Updated secrets and lint
  Signed-off-by: cafalchio <mcafalchio@gmail.com>
* fixed linter
  Signed-off-by: cafalchio <mcafalchio@gmail.com>
* test: Fix mocking problem in test_main_sighup
  Signed-off-by: Jonathan Springer <jps@s390x.com>

---------

Signed-off-by: cafalchio <mcafalchio@gmail.com>
Signed-off-by: Madhu Mohan Jaishankar <madhu.mohan.jaishankar@ibm.com>
Signed-off-by: Jonathan Springer <jps@s390x.com>
Co-authored-by: cafalchio <mcafalchio@gmail.com>
Co-authored-by: Jonathan Springer <jps@s390x.com>
1 parent ef9e49b commit 03e2714

51 files changed

Lines changed: 5382 additions & 769 deletions

.secrets.baseline

Lines changed: 76 additions & 40 deletions
Large diffs are not rendered by default.
Lines changed: 309 additions & 0 deletions
@@ -0,0 +1,309 @@
1+
# Plugin Manager Multi-Tenancy Architecture
2+
3+
## Overview
4+
5+
The plugin subsystem supports **context-scoped isolation** through a shared `TenantPluginManagerFactory`. Each resolved context gets its own `TenantPluginManager` instance with an independently merged plugin configuration, while the default `__global__` context continues to serve non-context-aware call sites.
6+
7+
The factory is intentionally **context-agnostic**. The identifier passed to `get_manager()` can represent a virtual server, tenant, tool, user, or another scoping key. The factory does not interpret the value; it only uses it to:
8+
- look up an existing cached manager,
9+
- fetch optional configuration overrides via `get_config_from_db(context_id)`,
10+
- build and initialize a `TenantPluginManager`,
11+
- cache the result for reuse.
12+
13+
In the current gateway wiring, the primary runtime usage is:
14+
- `get_plugin_manager()` or `get_plugin_manager("__global__")` for shared/global plugin execution
15+
- `get_plugin_manager(server_id)` for server-scoped execution in services such as tools, prompts, and resources
16+
17+
---
18+
19+
## Plugin Configuration Lifecycle
20+
21+
### Startup: YAML is the source of truth
22+
23+
Plugins must be **declared in the YAML configuration file at startup**. The YAML defines which plugins exist, their default `mode`, `priority`, and plugin-specific `config` keys. This base configuration is loaded once by `TenantPluginManagerFactory` and is **immutable at runtime** — it cannot be changed without a restart.
24+
25+
```text
26+
plugins/config.yaml
27+
└─ plugin A (mode: enforce, priority: 10, config: {...})
28+
└─ plugin B (mode: permissive, priority: 20, config: {...})
29+
```
30+
31+
Only plugins listed in the YAML participate in any context. There is no mechanism to introduce entirely new plugins at runtime.
32+
33+
### Runtime: per-context overrides via `PluginConfigOverride`
34+
35+
For each context (e.g. a virtual server), the factory may apply a list of `PluginConfigOverride` objects on top of the base YAML config. An override can:
36+
37+
- **change a plugin's `mode`** (e.g. promote from `permissive` to `enforce` for a specific server)
38+
- **change a plugin's `priority`** (re-order execution within the chain)
39+
- **add or replace keys in the plugin's `config` dict** (plugin-specific settings; the merge is shallow, so a nested value is replaced as a whole)
40+
41+
Overrides are **additive and selective**: only the fields explicitly set in an override are applied; everything else inherits the YAML base. A plugin not mentioned in the override list is used as-is.
42+
43+
```text
44+
PluginConfigOverride
45+
└─ name: "plugin A"
46+
└─ mode: permissive # overrides YAML value for this context
47+
└─ config: {threshold: 5} # merged on top of base config
48+
```
49+
50+
### Fetch hook: `get_config_from_db`
51+
52+
`get_config_from_db(context_id)` is the extension point that translates a context identifier into a list of `PluginConfigOverride` objects fetched from persistent storage.
53+
54+
The base implementation always returns `None` (no overrides). Subclasses override this method to query the database — or any other store — for the per-context plugin settings associated with `context_id`.
55+
56+
The returned overrides are passed directly to `_merge_tenant_config`, which walks the base config's plugin list and applies each override: per-plugin `config` dicts are shallow-merged (override keys win), and optional `mode` and `priority` fields replace the base values when present. The result is a new `Config` object used to construct an isolated `TenantPluginManager` for that context.
57+
58+
In summary: `get_config_from_db` is the seam between the factory and your persistence layer — override it to make per-context plugin configuration dynamic.
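
As a rough illustration of that seam, the sketch below subclasses a stand-in factory and serves overrides from an in-memory dict. The classes here are simplified stand-ins, not the real `TenantPluginManagerFactory` or `PluginConfigOverride`; the hook being `async`, the `DictBackedFactory` name, and the row layout are all assumptions made for the example.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class PluginConfigOverride:
    """Stand-in model; fields follow the description in this document."""
    name: str
    mode: Optional[str] = None
    priority: Optional[int] = None
    config: Dict[str, Any] = field(default_factory=dict)


class BaseFactory:
    """Stand-in for the factory; only the fetch hook is modelled."""

    async def get_config_from_db(self, context_id: str) -> Optional[List[PluginConfigOverride]]:
        # Base behaviour described above: no overrides.
        return None


class DictBackedFactory(BaseFactory):
    """Hypothetical subclass that reads overrides from a dict instead of a DB."""

    def __init__(self, store: Dict[str, List[Dict[str, Any]]]) -> None:
        self._store = store

    async def get_config_from_db(self, context_id: str) -> Optional[List[PluginConfigOverride]]:
        rows = self._store.get(context_id)
        if not rows:
            return None  # fall back to the base YAML config
        return [PluginConfigOverride(**row) for row in rows]
```

A real subclass would replace the dict lookup with a query against its own persistence layer; the return contract (a list of overrides, or `None`) is the part taken from the text above.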
59+
60+
```mermaid
61+
flowchart TD
62+
Y["YAML config file\n(loaded at startup, immutable)"]
63+
DB["Persistence layer\n(DB, config store, etc.)"]
64+
H["get_config_from_db(context_id)\n[override this in subclass]"]
65+
O["list[PluginConfigOverride]\n(mode / priority / config keys)"]
66+
M["_merge_tenant_config()"]
67+
C["Merged Config\n(per-context)"]
68+
T["TenantPluginManager\n(per-context instance)"]
69+
70+
Y --> M
71+
DB --> H
72+
H --> O
73+
O --> M
74+
M --> C
75+
C --> T
76+
```
77+
78+
---
79+
80+
## Architecture Summary
81+
82+
```mermaid
83+
flowchart TD
84+
APP["Application lifespan\n(mcpgateway.main)"]
85+
F["TenantPluginManagerFactory\n(singleton, holds base YAML config)"]
86+
G["TenantPluginManager\ncontext = '__global__'\n(backward-compat global manager)"]
87+
S1["TenantPluginManager\ncontext = 'server-id-1'"]
88+
S2["TenantPluginManager\ncontext = 'server-id-2'"]
89+
DB["get_config_from_db(context_id)\n(fetch per-context overrides)"]
90+
YAML["plugins/config.yaml\n(base plugin config)"]
91+
92+
APP -->|"init_plugin_manager_factory()"| F
93+
YAML -->|"loaded once at startup"| F
94+
F -->|"eager: get_manager()"| G
95+
F -->|"lazy: get_manager(server_id)"| S1
96+
F -->|"lazy: get_manager(server_id)"| S2
97+
F <-->|"override fetch"| DB
98+
```
99+
100+
### Main components
101+
102+
- **`PluginManager`**: legacy Borg-style manager with shared state across instances.
103+
- **`TenantPluginManager`**: `PluginManager` subclass that disables Borg behavior and keeps fully independent state per instance.
104+
- **`TenantPluginManagerFactory`**: async-safe cache/factory for per-context managers.
105+
- **`get_plugin_manager()`**: global accessor in `mcpgateway.plugins.framework.__init__` that returns the manager for the requested context from the singleton factory when plugins are enabled.
106+
107+
---
108+
109+
## Current Runtime Behavior
110+
111+
### Startup
112+
113+
At startup, `mcpgateway.main.lifespan()`:
114+
115+
1. enables the plugin subsystem when configured,
116+
2. initializes the global `TenantPluginManagerFactory` with:
117+
- YAML config path,
118+
- plugin timeout,
119+
- hook payload policies,
120+
- optional observability provider,
121+
3. calls `await get_plugin_manager()` to resolve the default `__global__` manager,
122+
4. leaves additional context-specific managers to be created lazily on first use.
123+
124+
This means the factory and the `__global__` manager are initialized eagerly, while every other tenant/server manager is created on demand.
125+
126+
### Request-time resolution
127+
128+
Services that support context scoping call `get_plugin_manager(server_id)` and receive:
129+
- a cached `TenantPluginManager`, or
130+
- a newly built and initialized one for that context.
131+
132+
Call sites that do not provide a context ID continue to use the default global manager.
133+
134+
---
135+
136+
## Core Types
137+
138+
### `PluginManager`
139+
140+
`PluginManager` remains the base implementation and still uses the Borg pattern.
141+
142+
| Property | Current behavior |
143+
| --- | --- |
144+
| State model | Shared `__dict__` across instances |
145+
| Primary role | Legacy/global compatibility |
146+
| Initialization | Loads YAML config and shares registry/executor state |
147+
| Reset path | `PluginManager.reset()` clears shared Borg state |
148+
149+
### `TenantPluginManager`
150+
151+
`TenantPluginManager` inherits the public API from `PluginManager` but bypasses the Borg initialization path.
152+
153+
| Property | Current behavior |
154+
| --- | --- |
155+
| State model | Independent per instance |
156+
| Config source | Either a `Config` object or YAML path |
157+
| Registry | Dedicated `PluginInstanceRegistry` per manager |
158+
| Executor | Dedicated `PluginExecutor` per manager |
159+
| Locking | Own async init/shutdown lock per manager |
160+
161+
`enable_borg()` is overridden as a no-op, so tenant managers do not share state.
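
The difference between the two state models can be shown with a minimal sketch. The classes below are illustrative stand-ins, not the gateway's `PluginManager` and `TenantPluginManager`; in particular, calling `enable_borg()` from `__init__` is an assumption made to keep the example short.

```python
class BorgManager:
    """Borg pattern: every instance shares one attribute dictionary."""

    _shared_state: dict = {}

    def __init__(self) -> None:
        self.enable_borg()

    def enable_borg(self) -> None:
        # Point this instance's __dict__ at the class-level shared dict.
        self.__dict__ = self._shared_state


class IsolatedManager(BorgManager):
    """Same API, but the Borg hook is a no-op, so state stays per instance."""

    def enable_borg(self) -> None:
        pass


a, b = BorgManager(), BorgManager()
a.plugins = ["rate_limiter"]       # hypothetical plugin name
shared_seen = b.plugins            # visible through the other instance

c, d = IsolatedManager(), IsolatedManager()
c.plugins = ["rate_limiter"]
isolated_seen = hasattr(d, "plugins")  # the second instance is unaffected
```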
162+
163+
### `TenantPluginManagerFactory`
164+
165+
Defined in `mcpgateway/plugins/framework/manager.py`.
166+
167+
| Method | Current behavior |
168+
| --- | --- |
169+
| `get_manager(context_id=None)` | Returns cached manager or creates one; defaults to `__global__` |
170+
| `_build_manager(context_id)` | Fetches overrides, merges config, initializes manager, swaps cache entry |
171+
| `_merge_tenant_config(overrides)` | Applies per-plugin override values on top of base YAML config |
172+
| `reload_tenant(context_id)` | Evicts cached manager, rebuilds it, and shuts down the old one |
173+
| `shutdown()` | Cancels in-flight builds and shuts down all cached managers |
174+
| `get_config_from_db(context_id)` | Extension hook; returns `None` in the base implementation — **subclass to enable DB-backed overrides** |
175+
176+
---
177+
178+
## Accessor Layer
179+
180+
The public accessor lives in `mcpgateway/plugins/framework/__init__.py`.
181+
182+
| Function | Current behavior |
183+
| --- | --- |
184+
| `enable_plugins(toggle)` | Enables or disables the plugin subsystem globally |
185+
| `init_plugin_manager_factory(...)` | Creates the singleton factory explicitly during startup |
186+
| `get_plugin_manager(server_id="__global__")` | Returns the manager for the given context when plugins are enabled and the factory exists; otherwise `None` |
187+
| `shutdown_plugin_manager_factory()` | Shuts down the factory and clears the singleton reference |
188+
| `reset_plugin_manager_factory()` | Clears the singleton reference for tests |
189+
190+
### Important clarification
191+
192+
The accessor **does not lazy-initialize the factory**. If the factory was not initialized during startup, `get_plugin_manager()` returns `None`.
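
Because of this, call sites need to tolerate a `None` result. The sketch below models the accessor with stubs to show the guard; `_StubFactory`, `_state`, and `invoke_tool` are hypothetical names, and the real accessor takes more startup parameters than shown here.

```python
import asyncio
from typing import Optional

# Module-level state standing in for the framework's enable flag and singleton.
_state = {"enabled": False, "factory": None}


class _StubFactory:
    """Hypothetical factory stub; returns a string instead of a real manager."""

    async def get_manager(self, context_id: str) -> str:
        return f"manager:{context_id}"


def enable_plugins(toggle: bool) -> None:
    _state["enabled"] = toggle


def init_plugin_manager_factory() -> None:
    _state["factory"] = _StubFactory()


async def get_plugin_manager(server_id: str = "__global__") -> Optional[str]:
    # No lazy initialization: a missing factory yields None.
    if not _state["enabled"] or _state["factory"] is None:
        return None
    return await _state["factory"].get_manager(server_id)


async def invoke_tool(server_id: str) -> str:
    """Hypothetical call site that degrades gracefully without plugins."""
    manager = await get_plugin_manager(server_id)
    if manager is None:
        return "ran without plugin hooks"
    return f"ran with hooks from {manager}"
```

Enabling plugins without initializing the factory still yields `None` in this model, which mirrors the clarification above.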
193+
194+
---
195+
196+
## Configuration Merge Model
197+
198+
Each context starts from the base YAML plugin config and optionally applies a list of `PluginConfigOverride` objects returned by `get_config_from_db(context_id)`.
199+
200+
Only plugins already present in the base config participate in the merge. There is no mechanism to introduce new plugins at runtime; the YAML is the canonical plugin registry.
201+
202+
For each matching plugin:
203+
204+
- `config` is shallow-merged: `{**base.config, **override.config}` — override keys win
205+
- `mode` is replaced only if provided in the override
206+
- `priority` is replaced only if provided in the override
207+
208+
Plugins not mentioned in the override list remain unchanged. Passing `None` overrides means: use the base config as-is.
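
The merge rules above can be sketched in a few lines. This is an illustration under assumptions, not the body of `_merge_tenant_config`: `PluginSpec`, `Override`, and `merge_plugins` are hypothetical names, and the real `Config` model carries more fields.

```python
from dataclasses import dataclass, field, replace
from typing import List, Optional


@dataclass(frozen=True)
class PluginSpec:
    """Stand-in for one plugin entry in the base YAML config."""
    name: str
    mode: str
    priority: int
    config: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Override:
    """Stand-in for PluginConfigOverride; unset fields mean 'inherit'."""
    name: str
    mode: Optional[str] = None
    priority: Optional[int] = None
    config: Optional[dict] = None


def merge_plugins(base: List[PluginSpec], overrides: Optional[List[Override]]) -> List[PluginSpec]:
    if not overrides:
        return list(base)  # None (or empty) means: use the base config as-is
    by_name = {o.name: o for o in overrides}
    merged = []
    for spec in base:  # only plugins present in the base participate
        o = by_name.get(spec.name)
        if o is None:
            merged.append(spec)
            continue
        merged.append(replace(
            spec,
            mode=o.mode if o.mode is not None else spec.mode,
            priority=o.priority if o.priority is not None else spec.priority,
            config={**spec.config, **(o.config or {})},  # shallow: override keys win
        ))
    return merged
```

Note the consequence of the shallow merge: if both sides carry a nested dict under the same key, the override's nested dict replaces the base one entirely rather than being combined with it.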
209+
210+
```mermaid
211+
flowchart LR
212+
B["Base YAML config\n(plugin A, plugin B, plugin C)"]
213+
O["PluginConfigOverride list\nfrom get_config_from_db(context_id)"]
214+
M["_merge_tenant_config()"]
215+
R["Merged context Config\n(used to build TenantPluginManager)"]
216+
217+
B -->|"all plugins"| M
218+
O -->|"selective overrides\nmode / priority / config keys"| M
219+
M --> R
220+
```
221+
222+
---
223+
224+
## Concurrency Model
225+
226+
Manager creation is deduplicated per context through `_inflight`.
227+
228+
When multiple coroutines ask for the same context manager concurrently:
229+
230+
1. the first caller acquires the lock and creates `_build_manager(context_id)` as an `asyncio.Task`,
231+
2. the task is stored in `_inflight[context_id]` and the lock is released,
232+
3. later callers acquiring the lock find the existing task and await it,
233+
4. as its final step, the task (`_build_manager`) stores the initialized manager in `_managers` under the lock,
234+
5. `get_manager` re-checks `_managers` after the await to pick up any replacement triggered by a concurrent `reload_tenant`,
235+
6. the task is removed from `_inflight` in a `finally` block.
236+
237+
This ensures only one initialization path runs per context at a time, and concurrent callers share the result rather than racing to build duplicate managers.
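
The steps above can be sketched with plain `asyncio`. This is a simplified model assuming a single `asyncio.Lock` and dict-based caches; `DedupFactory` and its `builds` counter are hypothetical, and the real `_build_manager` fetches overrides and initializes plugins rather than sleeping.

```python
import asyncio


class DedupFactory:
    """Illustrative model of per-context build deduplication."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._managers: dict = {}
        self._inflight: dict = {}
        self.builds = 0  # counts how many builds actually ran

    async def _build_manager(self, context_id: str) -> object:
        self.builds += 1
        await asyncio.sleep(0.01)  # stands in for fetch + merge + initialize
        manager = object()
        async with self._lock:
            self._managers[context_id] = manager
        return manager

    async def get_manager(self, context_id: str = "__global__") -> object:
        async with self._lock:
            cached = self._managers.get(context_id)
            if cached is not None:
                return cached
            task = self._inflight.get(context_id)
            if task is None:  # first caller creates the build task
                task = asyncio.create_task(self._build_manager(context_id))
                self._inflight[context_id] = task
        try:
            built = await task  # awaited outside the lock
        finally:
            async with self._lock:
                if self._inflight.get(context_id) is task:
                    del self._inflight[context_id]
        async with self._lock:
            # Re-check the cache in case a reload replaced the entry meanwhile.
            return self._managers.get(context_id, built)
```

Running two `get_manager("server-1")` calls through `asyncio.gather` on one factory instance should trigger a single build and hand both callers the same object.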
238+
239+
```mermaid
240+
sequenceDiagram
241+
participant C1 as Caller 1
242+
participant C2 as Caller 2
243+
participant F as Factory (lock)
244+
participant T as _build_manager task
245+
246+
C1->>F: get_manager("server-1")
247+
F->>F: cache miss → create task
248+
F->>T: asyncio.create_task(_build_manager)
249+
F-->>C1: release lock, await task
250+
C2->>F: get_manager("server-1")
251+
F->>F: cache miss → inflight task found
252+
F-->>C2: release lock, await same task
253+
T-->>F: initialize manager, store in _managers
254+
T-->>C1: return manager
255+
T-->>C2: return same manager
256+
```
257+
258+
---
259+
260+
## Reload and Shutdown Semantics
261+
262+
### Reload
263+
264+
`reload_tenant(context_id)`:
265+
266+
1. acquires the lock and removes the cached manager for the context,
267+
2. cancels any existing in-flight build task for the same context,
268+
3. creates a fresh `_build_manager` task and stores it in `_inflight`,
269+
4. releases the lock and shuts down the old manager outside it,
270+
5. awaits the new task and returns the rebuilt manager.
271+
272+
### Shutdown
273+
274+
`shutdown()`:
275+
276+
1. snapshots cached managers and in-flight tasks under the lock,
277+
2. clears both caches atomically,
278+
3. cancels all in-flight tasks,
279+
4. awaits their completion (collecting exceptions),
280+
5. shuts down each cached manager.
281+
282+
This keeps teardown orderly without leaving active manager instances behind.
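
A compact model of both sequences is sketched below. It follows the numbered steps in this section, but `FakeManager` and `Factory` are stand-ins, and details such as error handling in the real implementation are not shown in this document and are not modelled here.

```python
import asyncio


class FakeManager:
    """Stand-in for TenantPluginManager; records whether shutdown ran."""

    def __init__(self, context_id: str) -> None:
        self.context_id = context_id
        self.closed = False

    async def shutdown(self) -> None:
        self.closed = True


class Factory:
    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._managers: dict = {}
        self._inflight: dict = {}

    async def _build_manager(self, context_id: str) -> FakeManager:
        manager = FakeManager(context_id)
        await asyncio.sleep(0)  # stands in for initialization
        async with self._lock:
            self._managers[context_id] = manager
        return manager

    async def reload_tenant(self, context_id: str) -> FakeManager:
        async with self._lock:
            old = self._managers.pop(context_id, None)    # evict cached manager
            stale = self._inflight.pop(context_id, None)
            if stale is not None:
                stale.cancel()                            # cancel in-flight build
            task = asyncio.create_task(self._build_manager(context_id))
            self._inflight[context_id] = task
        if old is not None:
            await old.shutdown()                          # outside the lock
        try:
            return await task
        finally:
            async with self._lock:
                if self._inflight.get(context_id) is task:
                    del self._inflight[context_id]

    async def shutdown(self) -> None:
        async with self._lock:                            # snapshot, then clear
            managers = list(self._managers.values())
            tasks = list(self._inflight.values())
            self._managers.clear()
            self._inflight.clear()
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
        for manager in managers:
            await manager.shutdown()
```

Shutting down the old manager outside the lock keeps a slow plugin teardown from blocking lookups for other contexts.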
283+
284+
---
285+
286+
## Backward Compatibility
287+
288+
The current design preserves compatibility in a few important ways:
289+
290+
- `PluginManager` still exists for Borg-based shared-state behavior.
291+
- `TenantPluginManager` keeps the same public lifecycle and hook invocation API as `PluginManager`.
292+
- `get_plugin_manager()` without arguments still resolves the global `__global__` manager.
293+
- Call sites that are not context-aware continue to function against the global manager.
294+
295+
What changed is the wiring: the system now routes plugin access through the factory instead of a single shared manager instance.
296+
297+
---
298+
299+
## Recommended Mental Model
300+
301+
Use the following model when reasoning about the architecture:
302+
303+
- **one factory per process** — holds the base YAML config and the manager cache
304+
- **one cached manager per context ID** — each with an independent registry and executor
305+
- **plugins declared once in YAML at startup** — the YAML is the canonical plugin registry
306+
- **per-context overrides fetched at manager-build time** — via `get_config_from_db`; subclass to wire to your DB
307+
- **one shared base config, optionally merged with per-context overrides** — override keys win; unknown plugins are ignored
308+
309+
That is the current architecture implemented by the code, without requiring every request path to understand how plugin configuration is stored internally.
Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
1+
# -*- coding: utf-8 -*-
2+
"""Add tool_plugin_bindings table for per-tool per-tenant plugin policies
3+
4+
Revision ID: b1c2d3e4f5a6
5+
Revises: cbedf4e580e0
6+
Create Date: 2026-04-03 00:00:00.000000
7+
8+
"""
9+
10+
# Third-Party
11+
from alembic import op
12+
import sqlalchemy as sa
13+
14+
# revision identifiers, used by Alembic.
15+
revision: str = "b1c2d3e4f5a6"
16+
down_revision = "cbedf4e580e0"
17+
branch_labels = None
18+
depends_on = None
19+
20+
21+
def upgrade() -> None:
22+
"""Create tool_plugin_bindings table if it does not already exist."""
23+
inspector = sa.inspect(op.get_bind())
24+
25+
if "tool_plugin_bindings" in inspector.get_table_names():
26+
return
27+
28+
op.create_table(
29+
"tool_plugin_bindings",
30+
sa.Column("id", sa.String(36), primary_key=True),
31+
sa.Column("team_id", sa.String(36), sa.ForeignKey("email_teams.id", ondelete="CASCADE"), nullable=False),
32+
sa.Column("tool_name", sa.String(255), nullable=False),
33+
sa.Column("plugin_id", sa.String(64), nullable=False),
34+
sa.Column("mode", sa.String(20), nullable=False, server_default="enforce"),
35+
sa.Column("priority", sa.Integer(), nullable=False, server_default="50"),
36+
sa.Column("config", sa.JSON(), nullable=False, server_default="{}"),
37+
sa.Column("created_at", sa.DateTime(timezone=True), nullable=False, server_default=sa.func.now()),
38+
sa.Column("created_by", sa.String(255), nullable=False),
39+
sa.Column("updated_at", sa.DateTime(timezone=True), nullable=False, server_default=sa.func.now()),
40+
sa.Column("updated_by", sa.String(255), nullable=False),
41+
sa.UniqueConstraint("team_id", "tool_name", "plugin_id", name="uq_tool_plugin_binding"),
42+
)
43+
44+
op.create_index("ix_tool_plugin_bindings_team_id", "tool_plugin_bindings", ["team_id"])
45+
op.create_index("ix_tool_plugin_bindings_tool_name", "tool_plugin_bindings", ["tool_name"])
46+
47+
48+
def downgrade() -> None:
49+
"""Drop tool_plugin_bindings table."""
50+
inspector = sa.inspect(op.get_bind())
51+
52+
if "tool_plugin_bindings" not in inspector.get_table_names():
53+
return
54+
55+
op.drop_index("ix_tool_plugin_bindings_tool_name", table_name="tool_plugin_bindings")
56+
op.drop_index("ix_tool_plugin_bindings_team_id", table_name="tool_plugin_bindings")
57+
op.drop_table("tool_plugin_bindings")
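
To see how the defaults and the unique constraint in this table behave, the sketch below recreates a reduced version of the schema in an in-memory SQLite database. It is an approximation for illustration only: the foreign key and audit columns are omitted, SQLite types replace the SQLAlchemy ones, and the team, tool, and plugin values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tool_plugin_bindings (
        id TEXT PRIMARY KEY,
        team_id TEXT NOT NULL,
        tool_name TEXT NOT NULL,
        plugin_id TEXT NOT NULL,
        mode TEXT NOT NULL DEFAULT 'enforce',
        priority INTEGER NOT NULL DEFAULT 50,
        config TEXT NOT NULL DEFAULT '{}',
        CONSTRAINT uq_tool_plugin_binding UNIQUE (team_id, tool_name, plugin_id)
    )
    """
)

insert = (
    "INSERT INTO tool_plugin_bindings (id, team_id, tool_name, plugin_id) "
    "VALUES (?, ?, ?, ?)"
)
# Hypothetical binding: one plugin attached to one tool for one team.
conn.execute(insert, ("b1", "team-1", "search", "pii_filter"))

# Columns left out of the INSERT fall back to the server-side defaults.
defaults = conn.execute(
    "SELECT mode, priority, config FROM tool_plugin_bindings WHERE id = 'b1'"
).fetchone()

# A second binding for the same (team, tool, plugin) triple is rejected.
try:
    conn.execute(insert, ("b2", "team-1", "search", "pii_filter"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```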
