Commit 37f3c29

litianningdatadog, shreyamalpani, duncanista, astuyve, and jchrostek-dd authored
Merge Lambda Managed Instance feature branch (#947)
https://datadoghq.atlassian.net/browse/SVLS-8080

## Overview
Merge Lambda Managed Instance feature branch

## Testing
Covered by individual commits

Co-authored-by: shreyamalpani <shreya.malpani@datadoghq.com>
Co-authored-by: duncanista <30836115+duncanista@users.noreply.github.com>
Co-authored-by: astuyve <aj.stuyvenberg@datadoghq.com>
Co-authored-by: jchrostek-dd <john.chrostek@datadoghq.com>
Co-authored-by: tianning.li <tianning.li@datadoghq.com>
1 parent 39a0f5e commit 37f3c29

32 files changed

Lines changed: 3149 additions & 215 deletions

bottlecap/README.md

Lines changed: 156 additions & 1 deletion
@@ -9,7 +9,7 @@ One time setup:
Then: `./runBottlecap.sh`

## Developing using Codespaces
-Step 1: Create a codespace (code > codespaces > create codespace on main)
+Step 1: Create a codespace (code > codespaces > create codespace on main)

![img](./codespace.png)

@@ -18,3 +18,158 @@ Step 2: Hack in the `bottlecap` folder
Step 3: Test your change running `./runBottlecap.sh`

![img](./runBottlecap.png)

## Flush Strategies

Bottlecap supports several flush strategies that control when and how observability data (metrics, logs, traces) is sent to Datadog. The strategy is configured via the `DD_SERVERLESS_FLUSH_STRATEGY` environment variable.

**Important**: Flush strategies behave differently depending on the Lambda execution mode:
- **Managed Instance**: Uses continuous background flushing (flush strategies are ignored)
- **On-Demand**: Uses configurable flush strategies

### Managed Instance Mode vs On-Demand Mode

#### Managed Instance Mode
Lambda Managed Instances run your functions on EC2 instances (managed by AWS) with multi-concurrent invocations. This requires setting up a **capacity provider** - a configuration that defines VPC settings, instance requirements, and scaling parameters for the managed instances.

- **Activation**: Detected automatically via the `AWS_LAMBDA_INITIALIZATION_TYPE` environment variable. When this equals `"lambda-managed-instances"`, Bottlecap enters Managed Instance mode
- **Flush Behavior**:
  - A dedicated background task continuously flushes data at regular intervals (default: 30 seconds)
  - All flushes are **non-blocking** and run concurrently with invocation processing
  - Prevents resource buildup by skipping a flush cycle if the previous flush is still in progress
  - `DD_SERVERLESS_FLUSH_STRATEGY` is **ignored** in this mode
- **Shutdown Behavior**:
  - Background flusher waits for pending flushes to complete before shutdown
  - Final flush ensures all remaining data is sent before the execution environment terminates
- **Execution Model**: Multi-concurrent invocations where one execution environment handles multiple invocations simultaneously (unlike traditional Lambda's one-invocation-per-environment model)
- **Use case**: Steady-state, high-volume workloads where optimizing costs with predictable capacity is desired
- **Key advantage**: Zero flush overhead per invocation - flushing happens independently in the background
- **Infrastructure**: Lambda launches 3 instances by default for availability zone resiliency when a function version is published to a capacity provider
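The skip-if-busy rule described under Flush Behavior can be sketched as follows. This is a minimal illustration under stated assumptions, not Bottlecap's code: `PendingFlushes`, `tick`, and `wait_all` are hypothetical names, and plain `std::thread` handles stand in for the async tasks the extension actually spawns.

```rust
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for the extension's pending flush handles.
/// std threads replace the async tasks used by the real extension.
#[derive(Default)]
pub struct PendingFlushes {
    handles: Vec<JoinHandle<()>>,
}

impl PendingFlushes {
    /// True while any previously spawned flush has not finished.
    pub fn has_pending(&self) -> bool {
        self.handles.iter().any(|h| !h.is_finished())
    }

    /// Called on every timer tick. Starts a new flush only when the previous
    /// cycle has completed; returns whether a flush was started.
    pub fn tick<F>(&mut self, flush: F) -> bool
    where
        F: FnOnce() + Send + 'static,
    {
        if self.has_pending() {
            // Previous flush still running: skip this cycle.
            return false;
        }
        // Reap finished handles before starting the next cycle.
        for handle in self.handles.drain(..) {
            let _ = handle.join();
        }
        self.handles.push(thread::spawn(flush));
        true
    }

    /// Shutdown path: block until every in-flight flush completes.
    pub fn wait_all(&mut self) {
        for handle in self.handles.drain(..) {
            let _ = handle.join();
        }
    }
}
```

Driving `tick` from a 30-second timer gives the behavior described above: a slow flush makes later ticks return `false` instead of piling up work, and `wait_all` models the wait for pending flushes at shutdown.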

#### On-Demand Mode (Traditional Mode)
- **Activation**: Default mode for standard Lambda execution (one invocation at a time)
- **Flush Behavior**:
  - Respects the configured `DD_SERVERLESS_FLUSH_STRATEGY`
  - Flush timing is tied to invocation lifecycle events
  - Can be blocking or non-blocking depending on the chosen strategy
- **Use case**: Standard Lambda functions with sequential invocation processing
- **Key advantage**: Fine-grained control over flush timing and behavior

### Available Strategies (On-Demand Mode Only)

#### `Default` (Recommended)
- **Configuration**: Set automatically when no strategy is specified, or explicitly via `DD_SERVERLESS_FLUSH_STRATEGY=default`
- **Behavior**: Adaptive - changes based on invocation frequency
  - **Initial behavior** (first ~20 invocations): Flushes at end of each invocation (blocking)
  - **After 20 invocations**: Switches to non-blocking continuous flushes
- **Interval**: 60 seconds
- **Use case**: Recommended for most serverless workloads - automatically optimizes for your traffic pattern
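As a rough sketch of the adaptive switch, the function below picks a behavior from the number of invocations seen so far. The names `DefaultBehavior`, `default_behavior`, and `WARMUP_INVOCATIONS` are hypothetical, and treating the cutoff as exactly 20 is an assumption: the text above says "~20" and ties the behavior to invocation frequency, so the real decision may weigh timing as well.

```rust
/// What the Default strategy does at a given point (illustrative names).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum DefaultBehavior {
    /// Blocking flush at the end of the invocation.
    FlushAtEnd,
    /// Non-blocking flushes on a fixed interval.
    Continuous { interval_ms: u64 },
}

/// Assumed cutoff; the README only says "~20 invocations".
pub const WARMUP_INVOCATIONS: u64 = 20;
/// 60 seconds, the interval listed for the Default strategy.
pub const DEFAULT_INTERVAL_MS: u64 = 60_000;

/// Chooses the behavior from the count of invocations already handled.
pub fn default_behavior(invocations_seen: u64) -> DefaultBehavior {
    if invocations_seen < WARMUP_INVOCATIONS {
        DefaultBehavior::FlushAtEnd
    } else {
        DefaultBehavior::Continuous {
            interval_ms: DEFAULT_INTERVAL_MS,
        }
    }
}
```

Keeping the decision in a pure function like this makes the switch point easy to unit test, independent of any flushing machinery.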

#### `End`
- **Configuration**: `DD_SERVERLESS_FLUSH_STRATEGY=end`
- **Behavior**: Always flushes at the end of each invocation (blocking)
- **Interval**: 15 minutes (effectively disables periodic flushing)
- **Use case**: Minimize flushing overhead - only flush once per invocation when the invocation is complete

#### `EndPeriodically`
- **Configuration**: `DD_SERVERLESS_FLUSH_STRATEGY=end,<milliseconds>` (e.g., `end,1000`)
- **Behavior**: Flushes both at the end of invocation AND periodically during long-running invocations (blocking)
- **Interval**: User-specified (in milliseconds)
- **Use case**: Long-running Lambda functions where you want data visibility during execution, not just at the end

#### `Periodically`
- **Configuration**: `DD_SERVERLESS_FLUSH_STRATEGY=periodically,<milliseconds>` (e.g., `periodically,60000`)
- **Behavior**: Always flushes at the specified interval (blocking)
- **Interval**: User-specified (in milliseconds)
- **Use case**: Predictable periodic flushing when you want guaranteed flush timing

#### `Continuously`
- **Configuration**: `DD_SERVERLESS_FLUSH_STRATEGY=continuously,<milliseconds>` (e.g., `continuously,60000`)
- **Behavior**: Spawns non-blocking async flush tasks at the specified interval
- **Interval**: User-specified (in milliseconds)
- **Use case**: High-throughput scenarios where invocation latency is critical and you can't afford to wait for flushes
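The configuration formats listed for each strategy could be parsed along these lines. This is a sketch, not the deserializer in `flush_strategy.rs`: the `FlushStrategy` enum shape and `parse_flush_strategy` are illustrative, and falling back to `Default` for unrecognized or malformed values is an assumption.

```rust
/// Illustrative model of the strategies described above.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum FlushStrategy {
    Default,
    End,
    EndPeriodically(u64),
    Periodically(u64),
    Continuously(u64),
}

/// Parses values such as "end", "end,1000" or "periodically,60000".
/// Anything unrecognized maps to Default (an assumption of this sketch).
pub fn parse_flush_strategy(raw: &str) -> FlushStrategy {
    // Split "name,<milliseconds>" into its two parts, if a comma is present.
    let (name, arg) = match raw.trim().split_once(',') {
        Some((name, arg)) => (name.trim(), Some(arg.trim())),
        None => (raw.trim(), None),
    };
    let interval_ms = arg.and_then(|ms| ms.parse::<u64>().ok());

    match (name, arg, interval_ms) {
        ("end", None, _) => FlushStrategy::End,
        ("end", Some(_), Some(ms)) => FlushStrategy::EndPeriodically(ms),
        ("periodically", Some(_), Some(ms)) => FlushStrategy::Periodically(ms),
        ("continuously", Some(_), Some(ms)) => FlushStrategy::Continuously(ms),
        _ => FlushStrategy::Default,
    }
}
```

For example, `parse_flush_strategy("end,1000")` yields `FlushStrategy::EndPeriodically(1000)`, while a value with a non-numeric interval such as `"end,abc"` falls back to `FlushStrategy::Default` in this sketch.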

### Summary Table

| Mode | Strategy | Blocking? | Adapts? | Best For |
|------|----------|-----------|---------|----------|
| **Managed Instance** | *Always Continuous* | ❌ No | ❌ No | Steady-state high-volume workloads with multi-concurrent invocations |
| **On-Demand** | Default | Initially yes, then no | ✅ Yes | General use - auto-optimizes |
| **On-Demand** | End | ✅ Yes | ❌ No | Minimal overhead, sporadic invocations |
| **On-Demand** | EndPeriodically | ✅ Yes | ❌ No | Long-running functions with progress visibility |
| **On-Demand** | Periodically | ✅ Yes | ❌ No | Predictable flush timing |
| **On-Demand** | Continuously | ❌ No | ❌ No | High-throughput, latency-sensitive |

### Implementation Details

#### Managed Instance Mode Implementation
Located in `bottlecap/src/bin/bottlecap/main.rs`:
- **Mode Detection** (`bottlecap/src/config/aws.rs`):
  - Checks if `AWS_LAMBDA_INITIALIZATION_TYPE` environment variable equals `"lambda-managed-instances"`
- **Event Subscription** (`bottlecap/src/extension/mod.rs`):
  - Only subscribes to `SHUTDOWN` events (not `INVOKE` events)
  - On-Demand mode subscribes to both `INVOKE` and `SHUTDOWN` events
- **Flush Strategy Override**:
  - Function: `get_flush_strategy_for_mode()`
  - If the user configures a non-continuous strategy, it's overridden to continuous with a warning
  - Uses `DEFAULT_CONTINUOUS_FLUSH_INTERVAL` (30 seconds) from `flush_control.rs`
- **Main Event Loop**:
  - Processes events from the event bus (telemetry events like `platform.start`, `platform.report`)
  - Does not call the `/next` endpoint for each invocation (only for shutdown)
  - Uses `tokio::select!` with biased ordering to prioritize telemetry events over shutdown signals
- **Background Flusher Task**:
  - Spawns at startup and runs until shutdown
  - Uses `tokio::select!` to handle periodic flush ticks and shutdown signals
  - Calls `PendingFlushHandles::spawn_non_blocking_flushes()` for each flush cycle
  - Skips flush if previous flush handles are still pending
- **Non-Blocking Flush Spawning**:
  - Method: `PendingFlushHandles::spawn_non_blocking_flushes()`
  - Spawns separate async tasks for logs, traces, metrics, stats, and proxy flushes
  - Each task runs independently without blocking the main event loop
  - Failed payloads are tracked for retry in `await_flush_handles()`
- **Shutdown Handling**:
  - Separate task waits for the `SHUTDOWN` event from the Extensions API
  - Cancels background flusher and signals main event loop
- **Final Flush**:
  - Function: `blocking_flush_all()`
  - Ensures all remaining data is sent before termination
  - Uses blocking flush with `force_flush_trace_stats=true`

#### On-Demand Mode Implementation
Located in `bottlecap/src/bin/bottlecap/main.rs`:
- **Flush Control** (`bottlecap/src/lifecycle/flush_control.rs`):
  - Function: `evaluate_flush_decision()`
  - Evaluates flush strategy and invocation history
  - Returns `FlushDecision` enum: `End`, `Periodic`, `Continuous`, or `Dont`
  - Adaptive behavior: After ~20 invocations, Default strategy switches from End to Continuous
- **Event Loop**: Uses `FlushControl::evaluate_flush_decision()` to determine flush behavior
  - `FlushDecision::End`: Waits for `platform.runtimeDone`, then performs blocking flush
  - `FlushDecision::Periodic`: Performs blocking flush at configured interval
  - `FlushDecision::Continuous`: Spawns non-blocking flush tasks (similar to Managed Instance)
  - `FlushDecision::Dont`: Skips flushing for this cycle
- **Final Flush**:
  - Function: `blocking_flush_all()`
  - Blocking flush with `force_flush_trace_stats=true`
  - Ensures all remaining data is sent before shutdown
- **Configuration** (`bottlecap/src/config/flush_strategy.rs`):
  - Deserializes `DD_SERVERLESS_FLUSH_STRATEGY` environment variable
  - Supports formats: `"end"`, `"end,<ms>"`, `"periodically,<ms>"`, `"continuously,<ms>"`

### Key Architectural Differences

| Aspect | Managed Instance Mode | On-Demand Mode |
|--------|----------------------|----------------|
| **Event Source** | Telemetry API (platform events) | Extensions API `/next` endpoint |
| **Invocation Model** | Multi-concurrent (one environment handles multiple invocations) | Single-concurrent (one invocation at a time per environment) |
| **Scaling** | Asynchronous, CPU-based scaling | Reactive scaling with cold starts |
| **Pricing** | EC2 instance-based | Per-request duration-based |
| **Flush Trigger** | Background interval timer | Invocation lifecycle + interval |
| **Strategy Config** | Ignored (always continuous) | Configurable via env var |
| **Main Loop** | Event bus processing | `/next` + event bus processing |
| **Shutdown Detection** | Separate task monitors `/next` | Main loop receives from `/next` |
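
Several rows of this table come down to a mode check made once at startup. The sketch below shows one way that check, the event subscription, and the strategy override could fit together. All type and function names are hypothetical, the interval constant mirrors the 30-second default with an illustrative `_MS` suffix, and keeping an explicitly configured continuous interval in Managed Instance mode is an assumption; the extension also logs a warning on override, which is omitted here.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum ExecutionMode {
    ManagedInstance,
    OnDemand,
}

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Strategy {
    Default,
    End,
    Periodically(u64),
    Continuously(u64),
}

pub const MANAGED_INSTANCE_INIT_TYPE: &str = "lambda-managed-instances";
/// 30 seconds, matching the documented default continuous interval.
pub const DEFAULT_CONTINUOUS_FLUSH_INTERVAL_MS: u64 = 30_000;

/// `init_type` is the value of AWS_LAMBDA_INITIALIZATION_TYPE, if set.
pub fn detect_mode(init_type: Option<&str>) -> ExecutionMode {
    match init_type {
        Some(MANAGED_INSTANCE_INIT_TYPE) => ExecutionMode::ManagedInstance,
        _ => ExecutionMode::OnDemand,
    }
}

/// Extensions API events each mode subscribes to.
pub fn events_to_subscribe(mode: ExecutionMode) -> Vec<&'static str> {
    match mode {
        ExecutionMode::ManagedInstance => vec!["SHUTDOWN"],
        ExecutionMode::OnDemand => vec!["INVOKE", "SHUTDOWN"],
    }
}

/// Strategy actually used: On-Demand honors the configuration, while
/// Managed Instance mode forces continuous flushing.
pub fn effective_strategy(mode: ExecutionMode, configured: Strategy) -> Strategy {
    match (mode, configured) {
        (ExecutionMode::OnDemand, s) => s,
        // Assumption: a user-supplied continuous interval is preserved.
        (ExecutionMode::ManagedInstance, Strategy::Continuously(ms)) => {
            Strategy::Continuously(ms)
        }
        (ExecutionMode::ManagedInstance, _) => {
            Strategy::Continuously(DEFAULT_CONTINUOUS_FLUSH_INTERVAL_MS)
        }
    }
}
```

In a real binary the mode would come from something like `detect_mode(std::env::var("AWS_LAMBDA_INITIALIZATION_TYPE").ok().as_deref())`, with the result passed to the other two functions.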

## References

### AWS Lambda Managed Instances Documentation
- [Introducing AWS Lambda Managed Instances: Serverless simplicity with EC2 flexibility](https://aws.amazon.com/blogs/aws/introducing-aws-lambda-managed-instances-serverless-simplicity-with-ec2-flexibility/) - AWS Blog announcement
- [Lambda Managed Instances - AWS Lambda Developer Guide](https://docs.aws.amazon.com/lambda/latest/dg/lambda-managed-instances.html) - Official AWS documentation
