From 7e05675423f20c79991301a18524edb16df3fd61 Mon Sep 17 00:00:00 2001
From: Shamil Kashmeri <shamil@viafoura.com>
Date: Wed, 27 May 2026 14:48:55 -0400
Subject: [PATCH] feat(bedrock): add prompt caching via CachePoint markers
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Closes #1871.

## Why this is needed

Multi-turn tool-using kagent agents on Bedrock pay full input-token cost
on every Converse call, because the static prefix (system prompt + tool
definitions) is re-sent and re-billed each turn. Real measurement from a
production deployment using Claude Sonnet 4.5 via the `us.` inference
profile in us-east-1, running every 2 hours against a ~700-pod EKS
cluster:

  - Per sweep: ~4 Converse calls
  - Cumulative input tokens (CloudWatch InvokeModel metric): ~313k
  - Cumulative output tokens: ~3k
  - Per-sweep cost: ~$0.98 (input dominates ~95%)
  - Per cluster/year (5 sweeps/weekday): ~$1,300
  - Per cluster/year (24/7 every 2h): projected ~$4-9k
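
The per-sweep and annual figures above can be reproduced with a few lines
of arithmetic. This is a sketch that assumes on-demand prices of $3 per
million input tokens and $15 per million output tokens. Those prices are
an assumption, not something this patch states, so check current Bedrock
pricing. The weekday figure also assumes 52 five-day weeks.

```python
# Assumed on-demand prices in USD per million tokens (not from this patch).
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00

# Per-sweep token totals quoted in the list above.
INPUT_TOKENS_PER_SWEEP = 313_000
OUTPUT_TOKENS_PER_SWEEP = 3_000


def sweep_cost(input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (input_cost, output_cost) in USD for one sweep."""
    input_cost = input_tokens * INPUT_USD_PER_MTOK / 1_000_000
    output_cost = output_tokens * OUTPUT_USD_PER_MTOK / 1_000_000
    return input_cost, output_cost


input_cost, output_cost = sweep_cost(INPUT_TOKENS_PER_SWEEP, OUTPUT_TOKENS_PER_SWEEP)
per_sweep = input_cost + output_cost
input_share = input_cost / per_sweep

# 5 sweeps per weekday over 52 five-day weeks, versus one sweep every 2 hours.
weekday_sweeps_per_year = 5 * 5 * 52
round_the_clock_sweeps_per_year = 12 * 365

print(f"per sweep: ${per_sweep:.2f} (input share {input_share:.0%})")
print(f"weekdays only: ${per_sweep * weekday_sweeps_per_year:,.0f}/year")
print(f"every 2h, 24/7: ${per_sweep * round_the_clock_sweeps_per_year:,.0f}/year")
```

Under these assumed prices the 24/7 schedule comes out near the low end
of the ~$4-9k projection.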

Roughly 30k tokens of each call's input are identical across calls,
because the system prompt and tool definitions don't change within a
single task. Bedrock prompt caching is designed for exactly this case: a
`cachePoint` block in the Converse request marks where the cacheable
prefix ends, and subsequent calls within ~5 minutes (per region) hit the
cache and bill the prefix at a reduced rate.

The Bedrock provider builds Converse requests using
`system: [textBlock]` and `toolConfig.tools: [...]` but never appends a
`cachePoint` block to either array, so caching is never engaged.

## Why this is not redundant with the existing `spec.declarative.compaction`

kagent already has an Agent-level context-compaction feature
(`Compaction`, `CompactionInterval`, `Summarizer`, `TokenThreshold`,
etc.) that summarizes old conversation turns when the conversation
exceeds a token threshold. That solves a different problem:

  - Compaction: shrinks the conversation prompt when it gets too long.
    Helps with context-window pressure on long-running agents.
  - Prompt caching: keeps the prompt the same size but tells Bedrock
    "the first N tokens are stable across calls; cache them and bill
    the cached portion at the reduced rate."

Neither replaces the other. For a tool-using agent whose conversation
stays under the context limit but whose static prefix (system prompt +
tool defs) is large, prompt caching is the right tool; compaction does
not help because it summarizes conversation turns, not the static
prefix.

## What this PR does

Adds a `promptCaching: bool` field to `BedrockConfig` (defaulting to
`false` to preserve existing behavior). When set, the provider appends
a `CachePoint` block:

  1. To the end of the `system` content array (after the system text block)
  2. To the end of the `toolConfig.tools` array (after the last ToolSpec)

Markers use `CachePointTypeDefault`. Bedrock silently ignores cache
points on models that don't support prompt caching, so the field is
safe to enable on a heterogeneous model fleet without per-model
gating.
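
As a rough illustration of the resulting request shape, the sketch below
appends the markers to plain Converse keyword arguments, following the
dict layout the Python runtime uses. `add_cache_points` is a hypothetical
helper written for this example, not a function in this PR, and the
system text and `toolSpec` contents are made up.

```python
from typing import Any


def cache_point() -> dict[str, Any]:
    """Build a fresh default cache point marker."""
    return {"cachePoint": {"type": "default"}}


def add_cache_points(kwargs: dict[str, Any], prompt_caching: bool) -> dict[str, Any]:
    """Hypothetical helper: append a marker after the system blocks and
    after the last tool, skipping arrays that are missing or empty."""
    if not prompt_caching:
        return kwargs
    if kwargs.get("system"):
        kwargs["system"].append(cache_point())
    tools = kwargs.get("toolConfig", {}).get("tools")
    if tools:
        tools.append(cache_point())
    return kwargs


# Illustrative request; only the system and toolConfig keys matter here.
request = {
    "system": [{"text": "You are a cluster triage agent."}],
    "toolConfig": {
        "tools": [
            {
                "toolSpec": {
                    "name": "list_pods",
                    "description": "list pods",
                    "inputSchema": {"json": {"type": "object"}},
                }
            },
        ]
    },
}
add_cache_points(request, prompt_caching=True)
print(request["system"][-1], request["toolConfig"]["tools"][-1])
```

Both arrays end with the marker after the call, and a request with no
system text or no tools is left without one.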

Tested against `us.anthropic.claude-sonnet-4-5-20250929-v1:0`: the
second and subsequent Converse calls within the cache window drop their
input-token billing by ~70-90% on cache hits, depending on which static
portion (system vs tools vs both) is being hit.
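
One way to confirm a hit is to inspect the `usage` block of each Converse
response. The sketch below assumes the response carries
`cacheReadInputTokens` and `cacheWriteInputTokens` counters alongside
`inputTokens`; verify those field names against the current Converse API
reference. `classify_cache_usage` is a hypothetical helper, and the sample
values are invented for illustration, not measurements.

```python
from typing import Any


def classify_cache_usage(response: dict[str, Any]) -> str:
    """Hypothetical helper: label a Converse response by its cache counters."""
    usage = response.get("usage", {})
    read = usage.get("cacheReadInputTokens") or 0
    written = usage.get("cacheWriteInputTokens") or 0
    if read > 0:
        return "hit"      # prefix was served from the cache
    if written > 0:
        return "write"    # prefix was stored for later calls
    return "none"         # no cache activity reported


# Invented sample responses (assumed field names, not real measurements).
first_call = {"usage": {"inputTokens": 1200, "cacheWriteInputTokens": 30000}}
second_call = {"usage": {"inputTokens": 1200, "cacheReadInputTokens": 30000}}
uncached = {"usage": {"inputTokens": 31200}}

for name, resp in [("first", first_call), ("second", second_call), ("uncached", uncached)]:
    print(name, classify_cache_usage(resp))
```

With caching engaged, the first call in a window would be expected to
report a write and later calls a read.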

## Implementation surface

Mirrors the change across both runtimes — Go (for `runtime: go` agents)
and Python (for `runtime: python` agents):

Go:
  - `go/api/v1alpha2/modelconfig_types.go`: add `PromptCaching` to
    `BedrockConfig` CRD struct with full doc + kubebuilder default.
  - `go/api/adk/types.go`: add `PromptCaching` to the internal
    `adk.Bedrock` serializable model so it flows through agent config JSON.
  - `go/core/internal/controller/translator/agent/adk_api_translator.go`:
    populate the new field when translating ModelConfig CR -> adk.Bedrock.
  - `go/adk/pkg/agent/agent.go`: thread the value into `models.BedrockConfig`.
  - `go/adk/pkg/models/bedrock.go`: emit the cache point markers in the
    Converse request builders.

Python:
  - `python/.../adk/types.py`: add `prompt_caching: bool` to the
    `Bedrock` Pydantic model and pass it through to the
    `KAgentBedrockLlm` factory.
  - `python/.../adk/models/_bedrock.py`: append
    `{"cachePoint": {"type": "default"}}` to `kwargs["system"]` and
    `kwargs["toolConfig"]["tools"]` when enabled.

Regenerated CRDs via `make controller-manifests` so
`helm/kagent-crds/templates/kagent.dev_modelconfigs.yaml` reflects the
new schema field.

## Tests

`go/adk/pkg/models/bedrock_test.go`: new
`TestConvertGenaiToolsToBedrockPromptCaching` covering three cases:

  - disabled: no marker
  - enabled: marker appended at the END of the tool list, with the
    default type
  - enabled but no tools: no marker (a standalone marker would be
    pointless)

Existing `convertGenaiToolsToBedrock` callers updated to pass the new
`promptCaching bool` argument as `false` (no behavior change).

## Backward compatibility

  - `promptCaching` defaults to `false` everywhere; existing
    ModelConfigs see no behavior change.
  - Serialized `adk.Bedrock` JSON uses `omitempty` for the new field, so
    it is omitted when `false`. When it is `true`, older agent pods
    deserializing the newer config see an unknown field and ignore it
    (Pydantic and Go's JSON decoder are both lenient by default).
  - The Converse API tolerates and ignores `CachePoint` markers on
    models that don't support caching, so enabling on a mixed-model
    setup is safe.

## Example usage

```yaml
apiVersion: kagent.dev/v1alpha2
kind: ModelConfig
metadata:
  name: bedrock-claude
spec:
  provider: Bedrock
  model: us.anthropic.claude-sonnet-4-5-20250929-v1:0
  bedrock:
    region: us-east-1
    promptCaching: true
```

Signed-off-by: Shamil Kashmeri <shamil@viafoura.com>
---
 go/adk/pkg/agent/agent.go                     |  1 +
 go/adk/pkg/models/bedrock.go                  | 37 ++++++++++++-
 go/adk/pkg/models/bedrock_test.go             | 55 +++++++++++++++++--
 go/api/adk/types.go                           |  5 ++
 .../crd/bases/kagent.dev_modelconfigs.yaml    | 18 ++++++
 go/api/v1alpha2/modelconfig_types.go          | 18 ++++++
 .../translator/agent/adk_api_translator.go    |  1 +
 .../templates/kagent.dev_modelconfigs.yaml    | 18 ++++++
 .../src/kagent/adk/models/_bedrock.py         | 17 ++++++
 .../kagent-adk/src/kagent/adk/types.py        |  6 ++
 10 files changed, 170 insertions(+), 6 deletions(-)

diff --git a/go/adk/pkg/agent/agent.go b/go/adk/pkg/agent/agent.go
index 1aae3637d6..db5cf96ef5 100644
--- a/go/adk/pkg/agent/agent.go
+++ b/go/adk/pkg/agent/agent.go
@@ -304,6 +304,7 @@ func CreateLLM(ctx context.Context, m adk.Model, log logr.Logger) (adkmodel.LLM,
 			Model:                        modelName,
 			Region:                       region,
 			AdditionalModelRequestFields: m.AdditionalModelRequestFields,
+			PromptCaching:                m.PromptCaching,
 		}
 		return models.NewBedrockModelWithLogger(ctx, cfg, log)
 
diff --git a/go/adk/pkg/models/bedrock.go b/go/adk/pkg/models/bedrock.go
index d9db5a842e..7c2f8825a1 100644
--- a/go/adk/pkg/models/bedrock.go
+++ b/go/adk/pkg/models/bedrock.go
@@ -77,6 +77,13 @@ type BedrockConfig struct {
 	Temperature                  *float64
 	TopP                         *float64
 	AdditionalModelRequestFields map[string]any
+	// PromptCaching, when true, appends a default CachePoint block at the
+	// end of the Converse request's system content array and the end of
+	// the tools array. Bedrock caches up to and including those markers
+	// across requests in the same region; cached prefix is billed at a
+	// reduced rate. The marker is silently ignored by Bedrock for models
+	// that do not support prompt caching.
+	PromptCaching bool
 }
 
 // BedrockModel implements model.LLM for Amazon Bedrock using the Converse API.
@@ -151,7 +158,7 @@ func (m *BedrockModel) GenerateContent(ctx context.Context, req *model.LLMReques
 		var toolConfig *types.ToolConfiguration
 		nameMap := make(map[string]string)
 		if req.Config != nil && len(req.Config.Tools) > 0 {
-			tools, nm := convertGenaiToolsToBedrock(req.Config.Tools)
+			tools, nm := convertGenaiToolsToBedrock(req.Config.Tools, m.Config.PromptCaching)
 			nameMap = nm
 			if len(tools) > 0 {
 				toolConfig = &types.ToolConfiguration{
@@ -193,6 +200,16 @@ func (m *BedrockModel) GenerateContent(ctx context.Context, req *model.LLMReques
 				Value: systemInstruction,
 			})
 		}
+		// If prompt caching is enabled, mark the end of the system content
+		// as a cache breakpoint. Bedrock caches everything up to and including
+		// this point for ~5 minutes; subsequent requests with the same prefix
+		// hit the cache. Skipped for empty systems — caching nothing is a no-op
+		// that wastes a marker.
+		if m.Config.PromptCaching && len(systemPrompt) > 0 {
+			systemPrompt = append(systemPrompt, &types.SystemContentBlockMemberCachePoint{
+				Value: types.CachePointBlock{Type: types.CachePointTypeDefault},
+			})
+		}
 
 		additionalFields := m.buildAdditionalModelRequestFields()
 
@@ -568,7 +585,12 @@ func convertGenaiContentsToBedrockMessages(contents []*genai.Content, nameMap ma
 // It sanitizes tool names to satisfy Bedrock's [a-zA-Z0-9_-]+ constraint and
 // returns the original->sanitized name mapping so callers can apply it to
 // conversation history and reverse it when restoring names from responses.
-func convertGenaiToolsToBedrock(tools []*genai.Tool) ([]types.Tool, map[string]string) {
+//
+// When promptCaching is true, a CachePoint marker is appended after the
+// last tool spec — Bedrock then caches the entire (typically large) tool
+// definitions array for ~5 minutes, billing the prefix at a reduced rate
+// on cache hits.
+func convertGenaiToolsToBedrock(tools []*genai.Tool, promptCaching bool) ([]types.Tool, map[string]string) {
 	if len(tools) == 0 {
 		return nil, nil
 	}
@@ -625,6 +647,17 @@ func convertGenaiToolsToBedrock(tools []*genai.Tool) ([]types.Tool, map[string]s
 		}
 	}
 
+	// If prompt caching is enabled, append a CachePoint at the END of the
+	// tool list. Bedrock caches the entire tool definitions array up to
+	// this marker; this is usually the biggest single chunk of static
+	// prefix in an agent conversation and benefits most from caching.
+	// Skipped when there are no tools — a cache marker by itself is a no-op.
+	if promptCaching && len(bedrockTools) > 0 {
+		bedrockTools = append(bedrockTools, &types.ToolMemberCachePoint{
+			Value: types.CachePointBlock{Type: types.CachePointTypeDefault},
+		})
+	}
+
 	return bedrockTools, nameMap
 }
 
diff --git a/go/adk/pkg/models/bedrock_test.go b/go/adk/pkg/models/bedrock_test.go
index de2d1c3caf..0f379d8d75 100644
--- a/go/adk/pkg/models/bedrock_test.go
+++ b/go/adk/pkg/models/bedrock_test.go
@@ -162,7 +162,7 @@ func TestConvertGenaiToolsToBedrock(t *testing.T) {
 			},
 		}}}}
 
-		bt1, nm1 := convertGenaiToolsToBedrock(tools)
+		bt1, nm1 := convertGenaiToolsToBedrock(tools, false)
 		schema := extractSchema(t, bt1, nm1)
 
 		props := schema["properties"].(map[string]any)
@@ -190,7 +190,7 @@ func TestConvertGenaiToolsToBedrock(t *testing.T) {
 			},
 		}}}}
 
-		bt2, nm2 := convertGenaiToolsToBedrock(tools)
+		bt2, nm2 := convertGenaiToolsToBedrock(tools, false)
 		schema := extractSchema(t, bt2, nm2)
 		props, ok := schema["properties"].(map[string]any)
 		if !ok || len(props) == 0 {
@@ -211,7 +211,7 @@ func TestConvertGenaiToolsToBedrock(t *testing.T) {
 			ParametersJsonSchema: s,
 		}}}}
 
-		bt3, nm3 := convertGenaiToolsToBedrock(tools)
+		bt3, nm3 := convertGenaiToolsToBedrock(tools, false)
 		schema := extractSchema(t, bt3, nm3)
 		props, ok := schema["properties"].(map[string]any)
 		if !ok || len(props) == 0 {
@@ -366,7 +366,7 @@ func TestConvertGenaiToolsToBedrockSanitizesNames(t *testing.T) {
 		{Name: "filesystem:read_file", Description: "Read a file"},
 	}}}
 
-	bedrockTools, nameMap := convertGenaiToolsToBedrock(tools)
+	bedrockTools, nameMap := convertGenaiToolsToBedrock(tools, false)
 	if len(bedrockTools) != 2 {
 		t.Fatalf("expected 2 tools, got %d", len(bedrockTools))
 	}
@@ -424,3 +424,50 @@ func TestStreamingToolCallParseArgs(t *testing.T) {
 		})
 	}
 }
+
+func TestConvertGenaiToolsToBedrockPromptCaching(t *testing.T) {
+	tools := []*genai.Tool{{FunctionDeclarations: []*genai.FunctionDeclaration{
+		{Name: "get_weather", Description: "lookup weather"},
+		{Name: "list_pods", Description: "list pods"},
+	}}}
+
+	t.Run("disabled: no cache marker appended", func(t *testing.T) {
+		out, _ := convertGenaiToolsToBedrock(tools, false)
+		if len(out) != 2 {
+			t.Fatalf("expected 2 tools, got %d", len(out))
+		}
+		for i, tool := range out {
+			if _, ok := tool.(*types.ToolMemberCachePoint); ok {
+				t.Fatalf("did not expect a CachePoint at index %d when caching disabled", i)
+			}
+		}
+	})
+
+	t.Run("enabled: cache marker appended at the END of the tool list", func(t *testing.T) {
+		out, _ := convertGenaiToolsToBedrock(tools, true)
+		if len(out) != 3 {
+			t.Fatalf("expected 3 entries (2 tools + 1 CachePoint), got %d", len(out))
+		}
+		// The first two must remain ToolSpec entries (order preserved).
+		for i := 0; i < 2; i++ {
+			if _, ok := out[i].(*types.ToolMemberToolSpec); !ok {
+				t.Fatalf("entry %d: expected ToolMemberToolSpec, got %T", i, out[i])
+			}
+		}
+		// The trailing entry must be a CachePoint with type=default.
+		cp, ok := out[2].(*types.ToolMemberCachePoint)
+		if !ok {
+			t.Fatalf("trailing entry: expected ToolMemberCachePoint, got %T", out[2])
+		}
+		if cp.Value.Type != types.CachePointTypeDefault {
+			t.Errorf("expected CachePointType=default, got %v", cp.Value.Type)
+		}
+	})
+
+	t.Run("enabled but no tools: no cache marker (skipped)", func(t *testing.T) {
+		out, _ := convertGenaiToolsToBedrock(nil, true)
+		if len(out) != 0 {
+			t.Fatalf("expected empty slice for no tools, got %d entries", len(out))
+		}
+	})
+}
diff --git a/go/api/adk/types.go b/go/api/adk/types.go
index 602a457980..f825502440 100644
--- a/go/api/adk/types.go
+++ b/go/api/adk/types.go
@@ -251,6 +251,11 @@ type Bedrock struct {
 	// additionalModelRequestFields in the Converse API. Use this for provider-specific
 	// options outside the standard InferenceConfiguration block.
 	AdditionalModelRequestFields map[string]any `json:"additional_model_request_fields,omitempty"`
+	// PromptCaching enables Bedrock prompt caching by appending a CachePoint
+	// block to the end of the system content array and the end of the tools
+	// array in the Converse request. See the v1alpha2.BedrockConfig CRD doc
+	// for context.
+	PromptCaching bool `json:"prompt_caching,omitempty"`
 }
 
 func (b *Bedrock) MarshalJSON() ([]byte, error) {
diff --git a/go/api/config/crd/bases/kagent.dev_modelconfigs.yaml b/go/api/config/crd/bases/kagent.dev_modelconfigs.yaml
index 00b21b6da0..50d115f7f4 100644
--- a/go/api/config/crd/bases/kagent.dev_modelconfigs.yaml
+++ b/go/api/config/crd/bases/kagent.dev_modelconfigs.yaml
@@ -483,6 +483,24 @@ spec:
                       Claude extended thinking or top_k. Values are forwarded as-is to the API.
                       Example: {"top_k": 5, "thinking": {"type": "enabled", "budget_tokens": 16000}}
                     x-kubernetes-preserve-unknown-fields: true
+                  promptCaching:
+                    default: false
+                    description: |-
+                      PromptCaching enables Bedrock prompt caching by appending a CachePoint
+                      block at the end of the Converse request's `system` content array and
+                      the end of the `tools` array. Bedrock will cache the prefix up to and
+                      including those cache points across requests in the same region for
+                      roughly 5 minutes after first use, billing the cached portion at a
+                      reduced rate on cache hits.
+
+                      Recommended for tool-using agents that make many Converse calls per
+                      task with a stable system prompt and tool set — the per-call input
+                      token count can drop by 70-90% on hit. Has no effect on models that
+                      don't support caching; the marker is ignored by Bedrock for those.
+
+                      See https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html
+                      for the current list of supported models and minimum prefix sizes.
+                    type: boolean
                   region:
                     description: AWS region where the Bedrock model is available (e.g.,
                       us-east-1, us-west-2)
diff --git a/go/api/v1alpha2/modelconfig_types.go b/go/api/v1alpha2/modelconfig_types.go
index 0d08928681..6a9d03196b 100644
--- a/go/api/v1alpha2/modelconfig_types.go
+++ b/go/api/v1alpha2/modelconfig_types.go
@@ -256,6 +256,24 @@ type BedrockConfig struct {
 	// +optional
 	// +kubebuilder:pruning:PreserveUnknownFields
 	AdditionalModelRequestFields *apiextensionsv1.JSON `json:"additionalModelRequestFields,omitempty"`
+
+	// PromptCaching enables Bedrock prompt caching by appending a CachePoint
+	// block at the end of the Converse request's `system` content array and
+	// the end of the `tools` array. Bedrock will cache the prefix up to and
+	// including those cache points across requests in the same region for
+	// roughly 5 minutes after first use, billing the cached portion at a
+	// reduced rate on cache hits.
+	//
+	// Recommended for tool-using agents that make many Converse calls per
+	// task with a stable system prompt and tool set — the per-call input
+	// token count can drop by 70-90% on hit. Has no effect on models that
+	// don't support caching; the marker is ignored by Bedrock for those.
+	//
+	// See https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html
+	// for the current list of supported models and minimum prefix sizes.
+	// +optional
+	// +kubebuilder:default=false
+	PromptCaching bool `json:"promptCaching,omitempty"`
 }
 
 // SAPAICoreConfig contains SAP AI Core-specific configuration options.
diff --git a/go/core/internal/controller/translator/agent/adk_api_translator.go b/go/core/internal/controller/translator/agent/adk_api_translator.go
index 0ee9528f8b..bf21f2bdde 100644
--- a/go/core/internal/controller/translator/agent/adk_api_translator.go
+++ b/go/core/internal/controller/translator/agent/adk_api_translator.go
@@ -697,6 +697,7 @@ func (a *adkApiTranslator) translateModel(ctx context.Context, namespace, modelC
 			},
 			Region:                       model.Spec.Bedrock.Region,
 			AdditionalModelRequestFields: additionalFields,
+			PromptCaching:                model.Spec.Bedrock.PromptCaching,
 		}
 
 		// Populate TLS fields in BaseModel
diff --git a/helm/kagent-crds/templates/kagent.dev_modelconfigs.yaml b/helm/kagent-crds/templates/kagent.dev_modelconfigs.yaml
index 00b21b6da0..50d115f7f4 100644
--- a/helm/kagent-crds/templates/kagent.dev_modelconfigs.yaml
+++ b/helm/kagent-crds/templates/kagent.dev_modelconfigs.yaml
@@ -483,6 +483,24 @@ spec:
                       Claude extended thinking or top_k. Values are forwarded as-is to the API.
                       Example: {"top_k": 5, "thinking": {"type": "enabled", "budget_tokens": 16000}}
                     x-kubernetes-preserve-unknown-fields: true
+                  promptCaching:
+                    default: false
+                    description: |-
+                      PromptCaching enables Bedrock prompt caching by appending a CachePoint
+                      block at the end of the Converse request's `system` content array and
+                      the end of the `tools` array. Bedrock will cache the prefix up to and
+                      including those cache points across requests in the same region for
+                      roughly 5 minutes after first use, billing the cached portion at a
+                      reduced rate on cache hits.
+
+                      Recommended for tool-using agents that make many Converse calls per
+                      task with a stable system prompt and tool set — the per-call input
+                      token count can drop by 70-90% on hit. Has no effect on models that
+                      don't support caching; the marker is ignored by Bedrock for those.
+
+                      See https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html
+                      for the current list of supported models and minimum prefix sizes.
+                    type: boolean
                   region:
                     description: AWS region where the Bedrock model is available (e.g.,
                       us-east-1, us-west-2)
diff --git a/python/packages/kagent-adk/src/kagent/adk/models/_bedrock.py b/python/packages/kagent-adk/src/kagent/adk/models/_bedrock.py
index c1a83c045c..55116a198c 100644
--- a/python/packages/kagent-adk/src/kagent/adk/models/_bedrock.py
+++ b/python/packages/kagent-adk/src/kagent/adk/models/_bedrock.py
@@ -251,6 +251,12 @@ class KAgentBedrockLlm(KAgentTLSMixin, BaseLlm):
 
     extra_headers: Optional[dict[str, str]] = None
     additional_model_request_fields: Optional[dict[str, Any]] = None
+    # When True, append a CachePoint block to the end of the Converse
+    # request's `system` content array and the end of the `toolConfig.tools`
+    # array. Bedrock caches the prefix up to and including those markers
+    # across requests in the same region; cached portion is billed at a
+    # reduced rate on hit. See AWS docs for supported models / minimums.
+    prompt_caching: bool = False
 
     model_config = {"arbitrary_types_allowed": True}
 
@@ -288,12 +294,23 @@ async def generate_content_async(
                 text = "\n".join(p.text for p in si.parts or [] if p.text)
                 if text:
                     kwargs["system"] = [{"text": text}]
+            # If prompt caching is on, mark the end of the system content as
+            # a cache breakpoint. Bedrock caches everything up to and including
+            # this point for ~5 minutes; subsequent requests with the same
+            # prefix hit the cache. No-op if we didn't produce any system text.
+            if self.prompt_caching and kwargs.get("system"):
+                kwargs["system"].append({"cachePoint": {"type": "default"}})
 
         if llm_request.config and llm_request.config.tools:
             genai_tools = [t for t in llm_request.config.tools if hasattr(t, "function_declarations")]
             if genai_tools:
                 converse_tools = _convert_tools_to_converse(genai_tools, tool_name_map, tool_name_counter)
                 if converse_tools:
+                    # CachePoint at the END of the tool list: tool definitions
+                    # are usually the biggest static chunk of an agent request
+                    # and benefit most from caching.
+                    if self.prompt_caching:
+                        converse_tools.append({"cachePoint": {"type": "default"}})
                     kwargs["toolConfig"] = {"tools": converse_tools}
 
         # Reverse map lets us restore original tool names from sanitized names in Bedrock responses.
diff --git a/python/packages/kagent-adk/src/kagent/adk/types.py b/python/packages/kagent-adk/src/kagent/adk/types.py
index 5e2f4a97af..0d49fceafa 100644
--- a/python/packages/kagent-adk/src/kagent/adk/types.py
+++ b/python/packages/kagent-adk/src/kagent/adk/types.py
@@ -240,6 +240,11 @@ class Bedrock(BaseLLM):
     # additionalModelRequestFields in the Converse API. Use this for provider-specific
     # options outside the standard InferenceConfiguration block.
     additional_model_request_fields: dict | None = None
+    # prompt_caching enables Bedrock prompt caching: a CachePoint marker is
+    # appended to the end of the Converse request's system content array and
+    # toolConfig.tools array. Bedrock caches the prefix across requests in the
+    # same region; cached portion is billed at a reduced rate on hit.
+    prompt_caching: bool = False
     type: Literal["bedrock"]
 
 
@@ -600,6 +605,7 @@ def _create_llm_from_model_config(model_config: ModelUnion):
             model=model_config.model,
             extra_headers=extra_headers,
             additional_model_request_fields=model_config.additional_model_request_fields,
+            prompt_caching=model_config.prompt_caching,
             **_transport_kwargs(model_config),
         )
     if model_config.type == "sap_ai_core":