Commit 21bb040

docs: simplify architecture to use single shared Vector agent
1 parent 832ed5b commit 21bb040

3 files changed

Lines changed: 32 additions & 45 deletions

docs/diagrams/http-metering-c4.puml

Lines changed: 2 additions & 4 deletions
@@ -9,8 +9,7 @@ Person(client, "End User / Client", "Requests services exposed via Datum Cloud E

 System_Boundary(edge_cluster, "Edge Cluster") {
 Container(envoy, "Envoy Gateway Proxy", "Envoy/Go", "Handles ingress HTTP traffic, terminates TLS, enforces WAF/rate-limiting, emits JSON access logs to stdout")
-Container(vector_parser, "Vector Agent (Log Parser)", "Vector DaemonSet", "Tails Envoy container logs, parses JSON access logs, and translates them to CloudEvents")
-Container(vector_collector, "billing-usage-collector-vector", "Vector DaemonSet (Billing)", "Receives CloudEvents over HTTP, buffers to local disk, and forwards them to the Billing System")
+Container(vector_collector, "billing-usage-collector-vector", "Vector DaemonSet (Billing)", "Tails Envoy container logs, parses JSON access logs, translates to CloudEvents, and forwards them to the Billing System")
 Container(nso, "Network Services Operator", "Go", "Deploys Envoy Gateway and configures EnvoyProxy logging policies")
 }

@@ -20,8 +19,7 @@ System_Boundary(control_plane, "Platform Control Plane") {

 Rel(client, envoy, "Sends HTTPS requests to", "HTTPS")
 Rel(nso, envoy, "Configures & manages", "Kubernetes API / EnvoyProxy CR")
-Rel_D(envoy, vector_parser, "Outputs JSON access logs to", "stdout / container logs")
-Rel_R(vector_parser, vector_collector, "Sends CloudEvents to", "HTTP POST (localhost:9880)")
+Rel_D(envoy, vector_collector, "Outputs JSON access logs to", "stdout / container logs")
 Rel_D(vector_collector, billing_system, "Forwards batched events to", "HTTPS CloudEvents")

 @enduml

docs/diagrams/http-metering-sequence.puml

Lines changed: 7 additions & 14 deletions
@@ -5,8 +5,7 @@ skinparam ParticipantPadding 10
 actor Client as client
 box "Edge Cluster Node" #LightBlue
 participant "Envoy Gateway (Proxy)" as envoy
-participant "Vector Agent\n(Log Parser)" as parser
-participant "billing-usage-collector-vector\n(DaemonSet)" as collector
+participant "billing-usage-collector-vector\n(DaemonSet)" as vector
 end box

 box "Platform Control Plane" #LightYellow
@@ -20,23 +19,17 @@ envoy -> client : 2. HTTP Response (200 OK with Egress Bytes)
 deactivate envoy

 note over envoy : Request completed
-envoy -> parser : 3. Write structured JSON access log to stdout\n(contains bytes, duration, route name/namespace)
-activate parser
+envoy -> vector : 3. Write structured JSON access log to stdout\n(contains bytes, duration, route name/namespace)
+activate vector

-parser -> parser : Parse JSON log\nand map to CloudEvent\n(No enrichment)
+vector -> vector : Tail logs, parse JSON,\nand map to CloudEvent\n(No enrichment)

-parser -> collector : 4. Send CloudEvent via HTTP POST\n(port 9880/cloudevents)
-activate collector
-
-collector -> billing : 5. Forward batched CloudEvents\n(HTTPS batch ingest)
+vector -> billing : 4. Forward batched CloudEvents\n(HTTPS batch ingest)
 activate billing

 billing -> billing : Validate, attribute, and persist
-billing --> collector : 200 OK / 202 Accepted
+billing --> vector : 200 OK / 202 Accepted
 deactivate billing
-
-collector --> parser : 200 OK
-deactivate collector
-deactivate parser
+deactivate vector

 @enduml

docs/enhancements/http-traffic-metering.md

Lines changed: 23 additions & 27 deletions
@@ -35,9 +35,9 @@ latest-milestone: "v0.x"

 Network Services operates an Envoy Gateway-based edge proxy that routes HTTP traffic, terminates TLS, and enforces WAF and rate-limit policies on behalf of platform customers. Today, it lacks a billing presence: there is no registration in the service catalog, no meter definitions, and no integration with the durable usage pipeline.

-This enhancement defines the architecture, data structures, and roadmap to bring HTTP traffic metering and catalog registration to Network Services. The work is split into two phases:
-- **Phase 1 (Catalog & Metadata):** Declare a `Service` and a companion `ServiceConfiguration` resource (`services.miloapis.com/v1alpha1`) carrying the monitored-resource and meter declarations inline. This is a YAML-only delivery packaged in the `config/services/` bundle.
-- **Phase 2 (Emission & Integration):** Configure Envoy Gateway proxy logging and deploy a custom Vector Agent to scrape access logs, parse billing signals into CloudEvents, and forward them to the local `billing-usage-collector-vector` DaemonSet.
+This enhancement defines the architecture, data structures, and metadata required to bring HTTP traffic metering and service catalog registration to Network Services. Under this design:
+- Network Services' identity and billable metrics are declared via a standard `Service` and companion `ServiceConfiguration` resource (`services.miloapis.com/v1alpha1`).
+- Envoy Gateway proxies are instrumented to write structured JSON access logs, which are scraped, parsed, and forwarded as CloudEvents to the platform's billing pipeline by the existing `billing-usage-collector-vector` DaemonSet.

 ## Motivation

@@ -47,26 +47,23 @@ Because `MeterDefinition` fields (such as `meterName` and `measurement.unit`) ar

 ### Goals

-- Define a standard `Service` and `ServiceConfiguration` to register Network Services under the service domain `networking.datumapis.com`.
-- Define the core billing meters for HTTP traffic: request count, egress bytes, ingress bytes, and connection seconds.
-- Choose a scalable, low-risk telemetry collection approach that bridges Envoy Gateway's telemetry and the Billing Ingestion Gateway.
-- Map the end-to-end data flow with sequence and architecture diagrams.
+- Establish Network Services' identity in the platform service catalog to make it discoverable and activatable by platform consumers.
+- Define clear, usage-based billing metrics for HTTP traffic (requests, bandwidth, and connection time) so customers pay proportionally to their consumption.
+- Design a reliable, zero-data-loss telemetry collection path that ensures accurate billing without impacting proxy performance or request latency.
+- Provide clear architectural visibility into the edge-to-billing data flow for platform operators.

 ### Non-Goals

-- Defining rate cards, pricing models, tiers, or billing/invoice generation logic.
-- Altering the `MeterDefinition` schema or billing pipeline contract.
-- Implementation of the core Billing SDK (owned by the Billing Team).
-- Shared-infrastructure cost attribution or cross-project billing logic.
-- **Deploying or modifying the Billing System or `billing-usage-collector-vector` DaemonSet.** These components are pre-existing, shared platform infrastructure. Our work is limited to deploying a custom Vector Agent (Log Parser) to parse logs and forward them to this existing collector.
+- **Pricing tiers, currencies, and billing cycle schedules:** This design only concerns measuring and reporting raw usage quantities. Determining pricing rates, tier discounts, billing schedules, and invoice calculations is out of scope.
+- **Runtime traffic enforcement and quota limits:** Telemetry collection does not gate or throttle traffic. Rate limiting, WAF enforcement, and bandwidth capping remain governed by separate gateway policies, not by the billing pipeline.

 ## Proposal

 We propose to register the service and implement HTTP traffic metering via access log scraping.

 - The **Monitored Resource** is the Kubernetes Gateway API `HTTPRoute` resource, representing the customer-facing HTTP endpoint.
-- **Phase 1** registers the service with the service catalog via declarative YAML configurations. The service catalog fan-out controller automatically creates `MonitoredResourceType` and `MeterDefinition` resources in the billing namespace.
-- **Phase 2** instruments the Envoy Gateway instances to write structured JSON access logs to stdout. A node-level Vector Agent (Log Parser) tails these logs, parses and maps the raw logs into CloudEvents, and forwards them locally via HTTP to the `billing-usage-collector-vector` DaemonSet. The billing collector then handles local disk buffering and reliably forwards them to the Billing System.
+- **Catalog Registration** is handled via declarative YAML configurations. The service catalog fan-out controller automatically creates `MonitoredResourceType` and `MeterDefinition` resources in the billing namespace.
+- **Telemetry Emission** is handled by instrumenting the Envoy Gateway instances to write structured JSON access logs to stdout. The node-level `billing-usage-collector-vector` DaemonSet (already deployed as part of the billing pipeline) tails these log files directly, parses and maps the raw logs into CloudEvents, handles local disk buffering, and reliably forwards them to the central Billing System.

 ### Data and Control Flow Diagrams

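Reviewer note on the Telemetry Emission bullet in the hunk above: the parse-and-map step that now lives in `billing-usage-collector-vector` could look roughly like the Python sketch below. It is only an illustration of the mapping, not the collector's actual configuration, which would be expressed in Vector's own transform language. The log keys (`route_name`, `route_namespace`, `bytes_sent`, `bytes_received`, `duration_ms`), the event type string, the `source` value, and the `data` field names are all assumptions made for the example; the enhancement does not specify them.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical event type; the enhancement does not name one.
EVENT_TYPE = "example.networking.http.request.completed"


def access_log_to_cloudevent(raw_line: str, source: str) -> dict:
    """Map one JSON access-log line to a CloudEvents 1.0 envelope.

    The log keys read below are assumed names, not the real Envoy log format.
    """
    log = json.loads(raw_line)
    return {
        # Required CloudEvents context attributes.
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": source,
        "type": EVENT_TYPE,
        # Optional context attributes.
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        # The subject identifies the HTTPRoute, i.e. the monitored resource.
        "subject": f"{log['route_namespace']}/{log['route_name']}",
        # No enrichment: only values already present in the log line.
        "data": {
            "request_count": 1,
            "egress_bytes": int(log["bytes_sent"]),
            "ingress_bytes": int(log["bytes_received"]),
            "connection_seconds": float(log["duration_ms"]) / 1000.0,
        },
    }
```

Each log line becomes exactly one event, which matches the "No enrichment" note in the sequence diagram: nothing is looked up from the control plane, so the mapping stays a pure function of the log line.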
@@ -273,7 +270,7 @@ config/services/

 #### How can this feature be enabled / disabled in a live cluster?
 - **Other**
-- Describe the mechanism: Phase 1 is enabled by deploying the `Service` and `ServiceConfiguration` manifests. Phase 2 (emission) is enabled by configuring the Envoy Gateway access logs via `EnvoyProxy` and updating the Vector Agent config.
+- Describe the mechanism: Catalog registration is enabled by deploying the `Service` and `ServiceConfiguration` manifests. Telemetry emission is enabled by configuring the Envoy Gateway access logs via `EnvoyProxy` and updating the `billing-usage-collector-vector` DaemonSet configuration.
 - Will enabling / disabling the feature require downtime of the control plane? No.
 - Will enabling / disabling the feature require downtime or reprovisioning of a node? No, Envoy Gateway supports dynamic configuration updates without dropping active traffic.

@@ -284,7 +281,7 @@ No, it only adds background logging and telemetric forwarding.
 Yes, reverting the `EnvoyProxy` configuration to its previous state disables access logging.

 #### What happens if we reenable the feature if it was previously rolled back?
-Logging resumes, and Vector Agent resumes parsing from the end of the log stream.
+Logging resumes, and `billing-usage-collector-vector` resumes parsing from the end of the log stream.

 ---

@@ -298,7 +295,7 @@ Rollouts do not affect traffic handling directly. A malformed access log format
 - Billing Ingestion Gateway event rejection rate.

 #### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
-TBD during Phase 2 implementation.
+TBD during telemetry emission implementation.

 ---

@@ -313,23 +310,22 @@ By checking the customer billing dashboard or querying the billing API for resou
 #### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 - **Metrics**
 - Metric name: `vector_transform_errors_total`, `ingestion_gateway_request_count` (with status codes).
-- Components exposing the metric: Vector Agent, Billing Ingestion Gateway.
+- Components exposing the metric: `billing-usage-collector-vector`, Billing Ingestion Gateway.

 ---

 ### Dependencies

 #### Does this feature depend on any specific services running in the cluster?
-- **Vector Agent (Log Parser):** Tail Envoy logs and forward events to `billing-usage-collector-vector`.
-- **billing-usage-collector-vector DaemonSet:** Accept and stream usage events to the Billing System.
+- **`billing-usage-collector-vector` DaemonSet:** Tails Envoy container logs, parses JSON access logs, translates to CloudEvents, and forwards events to the Billing System.

 ---

 ### Scalability

 #### Will enabling / using this feature result in any new API calls?
 - Logs are written locally to stdout; there are no new Kube API calls for logging.
-- Vector Agent (Log Parser) performs HTTP POST requests to the local `billing-usage-collector-vector`. Throughput scales linearly with request volume.
+- `billing-usage-collector-vector` performs HTTPS batch POST requests to the Billing System. Throughput scales linearly with request volume.

 #### Will enabling / using this feature result in introducing new API types?
 No new Go-level types are introduced in the operator. `Service` and `ServiceConfiguration` are existing types in the platform's service catalog.
@@ -339,11 +335,11 @@ No new Go-level types are introduced in the operator. `Service` and `ServiceConf
 ### Troubleshooting

 #### How does this feature react if the API server is unavailable?
-Telemetry generation and log scraping are independent of the Kubernetes API server. Vector will continue to tail files and forward events.
+Telemetry generation and log scraping are independent of the Kubernetes API server. `billing-usage-collector-vector` will continue to tail files and forward events.

 #### What are other known failure modes?
-- **Vector pipeline backlog:** If the Ingestion Gateway is slow, Vector Agent buffers events locally on the node disk.
-- **Log rotation race:** Very high traffic might trigger rapid log rotation, which could cause minor data loss if Vector falls too far behind.
+- **Vector pipeline backlog:** If the Ingestion Gateway/Billing System is slow, `billing-usage-collector-vector` buffers events locally on the node disk.
+- **Log rotation race:** Very high traffic might trigger rapid log rotation, which could cause minor data loss if the collector falls too far behind.

 ## Open Decisions

@@ -355,10 +351,10 @@ The following decisions are tracked for the implementation of this enhancement:
 | OD-2 | Canonical `serviceName` | **Resolved** | `networking.datumapis.com`. |
 | OD-3 | `producerProjectRef.name` | **Resolved** | `datum-cloud`. |
 | OD-4 | Bundle layout | **Resolved** | Per-service-domain directory under `config/services/networking.datumapis.com/`, matching `datum-cloud/datum/config/services/<service-domain>/`. |
-| OD-5 | Is the Vector Agent DaemonSet planned to run on the edge cluster nodes that host Envoy Gateway pods? | **Resolved** | Yes. The shared platform `billing-usage-collector-vector` runs as a DaemonSet in the `billing-system` namespace. We will deploy our custom Vector Agent (Log Parser) as a DaemonSet alongside it to handle log tailing and parsing. |
+| OD-5 | Is the Vector Agent DaemonSet planned to run on the edge cluster nodes that host Envoy Gateway pods? | **Resolved** | Yes. The shared platform `billing-usage-collector-vector` runs as a DaemonSet in the `billing-system` namespace. Under this design, this pre-existing agent will directly tail and parse the Envoy stdout logs, avoiding the need for a separate custom log-parsing agent. |
 | OD-6 | Can the network-services-operator patch the `EnvoyProxy` CR to inject access log configuration? | **Resolved** | Yes. NSO configures and manages the Envoy Gateway proxies and can patch EnvoyProxy resources to enable structured JSON stdout logging. |
-| OD-7 | Is the billing SDK published as a consumable Go module? | **N/A** | We do not compile or use the Billing Go SDK for proxy traffic metering. Instead, the custom Vector Agent (Log Parser) directly parses raw Envoy stdout logs, formats them into CloudEvents, and POSTs them via HTTP to the local `billing-usage-collector-vector` daemon. |
-| OD-8 | Enrichment-sidecar placement: per-node alongside Vector, or central in front of the Ingestion Gateway? | **N/A** | At this stage, we will not enrich the event information with additional control-plane data. The custom Vector Agent (Log Parser) will only parse the raw properties from the JSON logs and map them directly to the CloudEvent schema. |
+| OD-7 | Is the billing SDK published as a consumable Go module? | **N/A** | We do not compile or use the Billing Go SDK for proxy traffic metering. Instead, the `billing-usage-collector-vector` DaemonSet directly parses raw Envoy stdout logs, formats them into CloudEvents, and forwards them. |
+| OD-8 | Enrichment-sidecar placement: per-node alongside Vector, or central in front of the Ingestion Gateway? | **N/A** | At this stage, we will not enrich the event information with additional control-plane data. The `billing-usage-collector-vector` DaemonSet will only parse the raw properties from the JSON logs and map them directly to the CloudEvent schema. |

 ---
