feat (spike): Package opentelemetry forwarding to xray#115
Conversation
| export( | ||
| spans: ReadableSpan[], | ||
| resultCallback: (result: ExportResult) => void | ||
| ): void { | ||
| console.log("Exporting!!!", spans); | ||
| const sym = Symbol(); | ||
| const promise = client | ||
| .send( | ||
| new PutLogEventsCommand({ | ||
| logGroupName: this.logGroupName, | ||
| logStreamName: this.logStreamName, | ||
| logEvents: spans.map((s) => ({ | ||
| message: this.serializeSpan(s), | ||
| timestamp: new Date().getTime(), | ||
| })), | ||
| }) | ||
| ) | ||
| .then(() => { |
There was a problem hiding this comment.
Does this run in a side car or within the lambda execution?
There was a problem hiding this comment.
Within the lambda, but asynchronously.
There was a problem hiding this comment.
I don't think we should do this. It couples the critical path to telemetry. Should we instead run it as a side car ?
There was a problem hiding this comment.
We could, it'd add some complexity and a bit of overhead though. What's the concern with it being on the critical path? Could just wrap the whole thing in a try/catch if failure is a worry
There was a problem hiding this comment.
The request won't complete until the (slow) put logs request succeeds.
There was a problem hiding this comment.
Agree on not polluting logs.
There was a problem hiding this comment.
Yep, but wont the lambda execution be incomplete until the side car has completed sending to cloudwatch, and terminated, anyway?
There was a problem hiding this comment.
Or are their lifetimes independent?
There was a problem hiding this comment.
Ok, having issues with sequencing the logs to cloudwatch anyway, with this model. Did some reading up, I see lambda extensions get independent lifetimes from the function, ie the function can return back to caller, while the extension gets to clean up. So yeah I'll try making a cut down build of the collector that just exports to cloudwatch.
There was a problem hiding this comment.
There's https://github.com/open-telemetry/opentelemetry-collector/tree/main/cmd/builder for making custom builds of the collector
thantos
left a comment
There was a problem hiding this comment.
Have any sample of what value this change brings?
| this.workflows.orchestrator, | ||
| "orchestrator" | ||
| ); | ||
| this.telemetry.attachToFunction(this.scheduler.forwarder, "forwarder"); |
There was a problem hiding this comment.
Not sure the scheduler forwarded needs this? What is it logging? If it does need it, then the scheuler.handler also needs it.
There was a problem hiding this comment.
Sure can take it off.
| const logStream = new LogStream(this, `LogStream${componentName}`, { | ||
| logGroup: this.logGroup, | ||
| logStreamName: componentName, | ||
| }); |
There was a problem hiding this comment.
Is this what we want? A log steam per function for all time? Or is this just an experiment?
There is a limit to the number of writes to a log stream.
5 requests per second per log stream. Additional requests are throttled. This quota can't be changed.
The orchestrator, for all workflow executions, would be limited to 5TPS.
There was a problem hiding this comment.
Hrm ok didnt realise there was a throttle. I originally had it creating a new log stream every execution, but reliased without static streams it would be difficult to attach events listeners to the streams, to forward logs to the real collector. With static streams we can just set that up in cdk.
There was a problem hiding this comment.
An option I can think of instead, actually, is just skip the logging to cloudwatch part. Instead the extension just sends the data to the otel collector running in a different lambda, over http
| const tracer = trace.getTracer(executionId, "0.0.0"); | ||
| await tracer.startActiveSpan( | ||
| "startWorkflow", | ||
| { | ||
| attributes: { workflowName, input }, | ||
| kind: SpanKind.PRODUCER, |
There was a problem hiding this comment.
Does it make sense to trace in the client or should we trace in the orchestrator (aka: those who call the client). Not all of the callers of the client will have tracing on.
There was a problem hiding this comment.
Yeah orchestrator probably makes more sense.
| provider.addSpanProcessor( | ||
| new BatchSpanProcessor(new OTLPTraceExporter({ hostname: "127.0.0.1" })) | ||
| ); | ||
| provider.addSpanProcessor(new SimpleSpanProcessor(new ConsoleSpanExporter())); |
| * Creates an entrypoint function for orchestrating a workflow | ||
| * from within an AWS Lambda Function attached to a SQS FIFO queue. | ||
| */ | ||
| const traceProvider = registerTelemetryApi(); |
There was a problem hiding this comment.
these functions compile to MJS, would it help to make the AWS clients and telemetry start in parallel (top level await)
There was a problem hiding this comment.
I initially tried that somewhere else, but the functions also compile to cjs, and breaks in that form. Its our lowest common denominator. That being said, registerTelemetryApi isnt an async function, I dont think top level await's going to help us here.
| await tracer.startActiveSpan( | ||
| "createActivityWorker", | ||
| { attributes: { command: request.command.name } }, |
There was a problem hiding this comment.
Why have a span for createActivityWorker, but not createOrchestrator? Does this one time operation need a span?
There was a problem hiding this comment.
I've only created a couple of spans for testing purposes. I'll leave it up to you guys to create more once everything's in place.
| const orchestrateSpan = tracer.startSpan("orchestrate"); | ||
| const ret = await orchestrateExecution(workflow, executionId, records); | ||
| orchestrateSpan.end(); | ||
| return ret; |
There was a problem hiding this comment.
Would the scope function be better here?
There was a problem hiding this comment.
It would be, for some reason that one isn't working right now though. Probably missing something in the setup of the sdk.
Aside from that, I'm suspecting that the traces being sent to eg. grafana could do a good job at visualising the workflow runs, and with a bunch of customisation ability, better than our own visualiser will be able to provide for some time. As for a sample, I'll put one together once everything works. |
❌ Deploy Preview for preeminent-heliotrope-380b2a failed.
|
Packages AWS opentelemetry distro for lambda into the functions, enabling capturing of opentelemetry instrumentation, and currently forwarding data to x-ray. An experiment on the impact on cold start times. Currently clocking in at 1.5s.
Have attempted to use own build of collector, enabling region-agnostic access, and enabling us to trim unnecessary modules from it, However so far haven't been able to get my build to forward to x-ray.
Next step will add some light instrumentation around activities and events to see how it turns out.