feat(webinar): add metrics #4942
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e185ad3534
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```ts
for (const dataSet of message.dataSets) {
  if (this.pendingSyncMetrics.has(dataSet.name)) {
    this.completeSyncMetrics(dataSet.name, messageReceivedAt);
```
Correlate sync completion to request, not just dataset name
Completing pending metrics by matching only dataSet.name allows unrelated hash-tree messages to close a pending sync and emit client.locus.sync.complete with incorrect latencies. If a normal main update arrives before the sync-triggered broadcast, syncMessageReceiveTime/totalTime are attributed to the wrong message and the real sync completion is effectively lost. This is especially risky because tracked datasets like main can receive frequent independent updates.
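One way to address this is to correlate completions to the originating request rather than the dataset name alone. The sketch below is illustrative, not the PR's implementation: `SyncMetricsTracker`, `startSync`, and the `requestId` scheme are hypothetical names, assuming the sync response can echo a correlation id back to the client.

```typescript
// Sketch: key pending sync metrics by a per-request id instead of dataset
// name, so an unrelated hash-tree update for the same dataset cannot close
// a pending sync. All names here are illustrative.
interface PendingSyncMetric {
  dataSetName: string;
  totalStartTime: number;
}

class SyncMetricsTracker {
  private pending = new Map<string, PendingSyncMetric>();
  private nextId = 0;

  // Called only when a POST /sync request is actually issued.
  startSync(dataSetName: string, now: number): string {
    const requestId = `${dataSetName}-${this.nextId++}`;
    this.pending.set(requestId, {dataSetName, totalStartTime: now});
    return requestId;
  }

  // Complete only when the incoming message carries the matching request id;
  // returns the total latency, or undefined for unrelated messages.
  complete(requestId: string, receivedAt: number): number | undefined {
    const entry = this.pending.get(requestId);
    if (!entry) return undefined; // unrelated or already-completed message
    this.pending.delete(requestId);
    return receivedAt - entry.totalStartTime;
  }
}
```

With this shape, a frequent independent update to `main` simply fails the id lookup and leaves the pending metric untouched.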
```ts
if (shouldCollectMetrics) {
  const randomBackoffTime = dataSet.lastBackoffTime ?? 0;
  dataSet.lastBackoffTime = undefined;

  this.pendingSyncMetrics.set(dataSet.name, {
    syncResponseReceivedAt: performance.now(),
```
Create pending metrics only when a sync request is issued
Pending sync metrics are inserted unconditionally for tracked datasets, even when no POST /sync call was made (for example, when diffHashes yields no mismatched leaves and Object.keys(leavesData).length is 0). In that path, the next message for the dataset can still emit client.locus.sync.complete, producing a fabricated "sync complete" event for a sync that never happened.
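A minimal sketch of the gating the comment asks for, assuming the mismatched-leaves object mentioned above; `shouldTrackSync` and `mismatchedLeaves` are illustrative names, not the PR's API:

```typescript
// Sketch: decide whether to record a pending sync metric based on whether a
// POST /sync request was actually issued. If diffHashes found no mismatched
// leaves, no sync happened and nothing should be tracked.
function shouldTrackSync(mismatchedLeaves: Record<string, unknown>): boolean {
  return Object.keys(mismatchedLeaves).length > 0;
}
```

The pending entry would then be created only when this returns true, so a later message for the dataset cannot fabricate a `client.locus.sync.complete` event.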
💡 Codex Review
Reviewed commit: e7c479708c
```ts
if (isInitialJoin && timings) {
  this.sendLLMConnectMetric(timings);
}

return Promise.resolve(timings);
```
Use register result instead of undefined `timings`
Replace timings with the value returned by registerAndConnect in this callback. As written, timings is never defined in scope, so any successful registerAndConnect call will hit a ReferenceError when evaluating if (isInitialJoin && timings), causing updateLLMConnection() to reject after connecting and preventing both listener setup completion and the new LLM latency metric emission during join.
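The fix amounts to binding the resolved value in the callback. This sketch uses an illustrative `registerAndConnect` signature and a standalone function rather than the PR's class method:

```typescript
// Sketch: take `timings` as the .then() callback parameter so it refers to
// the value resolved by registerAndConnect, not an out-of-scope identifier.
type ConnectTimings = {clientLLMDatachannelResponseTime: number};

function updateLLMConnection(
  registerAndConnect: () => Promise<ConnectTimings | undefined>,
  isInitialJoin: boolean,
  sendMetric: (t: ConnectTimings) => void
): Promise<ConnectTimings | undefined> {
  return registerAndConnect().then((timings) => {
    // `timings` is now the resolved value; no ReferenceError on success.
    if (isInitialJoin && timings) {
      sendMetric(timings);
    }
    return timings;
  });
}
```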
💡 Codex Review
Reviewed commit: 9dfc6a666f
```ts
this.pendingSyncMetrics.set(dataSet.name, {
  syncResponseReceivedAt: performance.now(),
  totalStartTime: totalStart,
```
Clear stale pending sync metrics before reusing parser
When a new pending metric is stored here, it can survive parser lifecycle transitions because pendingSyncMetrics is never cleared in stop(), resumeFromMessage(), or resumeFromApiResponse(). In LocusInfo, stopped parsers are explicitly resumed and reused, so a stale entry from a previous locus session can be completed by the next unrelated hash-tree message and emit a bogus client.locus.sync.complete latency. Please clear pending entries when stopping/resuming (or when replacing datasets) so metrics cannot leak across parser sessions.
💡 Codex Review
Reviewed commit: 9446a31d85
```ts
// Store pending sync metrics only when a sync request was actually issued.
// If no leaves were mismatched, no sync was needed, so no metrics should be emitted.
if (shouldCollectMetrics && syncRequestMade) {
  this.pendingSyncMetrics.set(dataSet.name, {
```
Preserve pending sync metrics across overlapping sync attempts
Avoid overwriting an existing pending entry for the same dataset here. pendingSyncMetrics is keyed only by dataset name, and set() replaces the previous request state unconditionally; because runSyncAlgorithm() is restarted on every sync, a delayed broadcast can let a second sync run first, so the first request’s latencies are lost and the eventual completion is attributed to the wrong attempt. This skews client.locus.sync.complete timing data under normal retry/latency conditions.
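The simplest non-overwriting variant looks like the sketch below; `recordSyncStart` is an illustrative helper, and keying by request id (as suggested earlier in this review) would be the fuller fix:

```typescript
// Sketch: keep the first in-flight entry for a dataset so an overlapping
// retry cannot discard the original request's start time.
const pendingSyncMetrics = new Map<string, {totalStartTime: number}>();

function recordSyncStart(name: string, now: number): void {
  if (!pendingSyncMetrics.has(name)) {
    pendingSyncMetrics.set(name, {totalStartTime: now});
  }
  // Alternative: key entries by request id so both attempts are tracked
  // independently and each completion is attributed to the right one.
}
```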
```ts
  error?: any
) {
  // @ts-ignore
  const llmWebsocketUrl = this.webex.internal.llm.getDatachannelUrl?.() || undefined;
```
Use actual WebSocket URL for llmWebsocketUrl metric field
This metric populates llmWebsocketUrl using getDatachannelUrl(), but that getter returns the registration/datachannel endpoint (sessionData.datachannelUrl), not the connected socket URL (sessionData.webSocketUrl). As a result, both client.llm.connect.response and client.locus.sync.complete emit a mislabeled identifier, which breaks downstream correlation that expects the websocket URL value.
💡 Codex Review
Reviewed commit: 6656ad6f23
```ts
// Record the actual backoff time for sync metrics
dataSet.lastBackoffTime = Math.round(performance.now() - timerSetAt);
```
Record only jitter in randomBackoffTime
randomBackoffTime is populated from elapsed timer duration here, but this timer waits for idleMs + backoff, not just the random backoff portion. As a result every emitted sync metric overstates random backoff by at least idleMs, which skews latency analysis and makes dataset-to-dataset comparisons inaccurate.
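A sketch of separating the jitter from the fixed idle delay; `scheduleSync`, `idleMs`, and `maxJitterMs` are illustrative names for the quantities the comment describes:

```typescript
// Sketch: capture the random backoff portion when it is generated, instead
// of deriving it from the full timer duration (which includes idleMs and
// would overstate backoff by at least that amount).
function scheduleSync(
  idleMs: number,
  maxJitterMs: number,
  random: () => number = Math.random
): {delayMs: number; lastBackoffTime: number} {
  const jitterMs = Math.round(random() * maxJitterMs);
  return {
    delayMs: idleMs + jitterMs, // what the timer actually waits
    lastBackoffTime: jitterMs, // what the metric should report
  };
}
```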
```ts
// Send LLM connect metric with error on failure
this.sendLLMConnectMetric({clientLLMDatachannelResponseTime: 0}, error);
```
Preserve measured latency on LLM connect failures
This failure path always emits clientLLMDatachannelResponseTime: 0, even when registerAndConnect() may have already spent non-zero time in the datachannel POST and then failed during websocket connect. That introduces systematically false latency values for failed joins and degrades diagnostic signal in client.llm.connect.response metrics.
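One way to keep the real elapsed time on failure is to measure around the connect call; `connectWithMetric` is an illustrative wrapper, not the PR's method, and the injectable `now` parameter exists only to make the sketch testable:

```typescript
// Sketch: report the time actually spent before a failure instead of a
// hardcoded 0, so failed joins carry a real latency signal.
async function connectWithMetric<T>(
  connect: () => Promise<T>,
  sendMetric: (
    timings: {clientLLMDatachannelResponseTime: number},
    error?: unknown
  ) => void,
  now: () => number = () => performance.now()
): Promise<T> {
  const start = now();
  try {
    const result = await connect();
    sendMetric({clientLLMDatachannelResponseTime: Math.round(now() - start)});
    return result;
  } catch (error) {
    // Even on failure, report the measured latency up to this point.
    sendMetric({clientLLMDatachannelResponseTime: Math.round(now() - start)}, error);
    throw error;
  }
}
```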
This pull request addresses
https://jira-eng-gpk2.cisco.com/jira/browse/SPARK-774846
by making the following changes
- Only messages satisfying `dataSet.version >= pending.dataSetVersion` can complete pending metrics, preventing stale messages from producing incorrect latencies.
- `totalTime` measures from GET /hashtree through sync completion (excludes random backoff, which is reported separately).

Change Type
The following scenarios were tested
< ENUMERATE TESTS PERFORMED, WHETHER MANUAL OR AUTOMATED >
I certify that I have followed The GAI Coding Policy And Copyright Annotation Best Practices.
Make sure to have followed the contributing guidelines before submitting.