Queue-depth ODS counters (#5913)

Closed
adityas-meta wants to merge 1 commit into pytorch:main from adityas-meta:export-D108312495

adityas-meta commented Jun 16, 2026

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2832

Add observability for the two queue depths in the RES streaming pipeline. (1) `weights_to_stream_queue_` (the MPSC stream queue in fbgemm) emits `stream_mpsc_depth` after every dequeue. (2) `tensor_to_dedup_queues_[i]` (the per-shard dedup queues in training_ps) each emit `dedup_depth.shard_{i}`, plus a `dedup_depth.total` rollup, bumped at the top of `co_stream_tensors`. Today backpressure is invisible until a trainer crashes; these signals let oncall see saturation trends and per-shard hot spots in advance.
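
A minimal sketch of the two emission points described above. It assumes a hypothetical `GaugeSink` in place of `TrainingPsOdsLogger` (whose real signature is not shown in this PR) and uses `std::deque<int>` as a stand-in for the actual queue types; only the key names come from the description.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the OBC logger. It only records samples per key
// so the emission pattern can be inspected.
struct GaugeSink {
  std::map<std::string, std::vector<int64_t>> samples;
  void bumpKeyGauge(const std::string& key, int64_t value) {
    samples[key].push_back(value);
  }
};

// (1) Stream queue: report the remaining depth after each successful dequeue.
inline bool dequeueAndReport(std::deque<int>& queue, GaugeSink& sink, int& out) {
  if (queue.empty()) {
    return false;
  }
  out = queue.front();
  queue.pop_front();
  sink.bumpKeyGauge("stream_mpsc_depth", static_cast<int64_t>(queue.size()));
  return true;
}

// (2) Dedup queues: one gauge per shard plus a rollup across all shards.
inline void reportDedupDepths(
    const std::vector<std::deque<int>>& shards, GaugeSink& sink) {
  int64_t total = 0;
  for (std::size_t i = 0; i < shards.size(); ++i) {
    const auto depth = static_cast<int64_t>(shards[i].size());
    total += depth;
    sink.bumpKeyGauge("dedup_depth.shard_" + std::to_string(i), depth);
  }
  sink.bumpKeyGauge("dedup_depth.total", total);
}
```

Sampling the stream queue after the dequeue reports what is still waiting, which is the quantity that grows under backpressure.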

Both surfaces use OBC via `TrainingPsOdsLogger::bumpKeyGauge` (emits P50+P99, the right stats for queue depth) on the `raw_embedding_streaming` category. The handler already owned `ods_logger_`; the fbgemm-side `RawEmbeddingStreamer` reuses the `TrainingPsOdsLogger` member added by the parent diff D107811590 (constructed only when streaming is enabled). No fb303: OBC reaches ODS via the host-level agent without per-process export config. (An earlier revision used an fb303 `ResGauges.h` helper for the fbgemm gauge; it was removed per review feedback on the sibling silent-failure diff.)
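
To illustrate why P50 and P99 are useful together for queue depth, here is a small sketch that computes nearest-rank percentiles over a window of depth samples. How OBC actually aggregates is not shown in this PR; the function name and the nearest-rank method are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Nearest-rank percentile over a window of depth samples.
// Assumes `samples` is non-empty and 1 <= pct <= 100.
inline int64_t percentileNearestRank(std::vector<int64_t> samples, std::size_t pct) {
  std::sort(samples.begin(), samples.end());
  // Integer ceil(pct * n / 100), which avoids floating-point rounding.
  const std::size_t rank = (pct * samples.size() + 99) / 100;
  return samples[std::max<std::size_t>(rank, 1) - 1];
}
```

For a window that is empty nine times out of ten and spikes to 40 once, P50 is 0 while P99 is 40: the median shows the steady state and the tail exposes the burst.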

Stacked on D107811590 (silent-failure ODS counters), which introduces the shared OBC logger member on RawEmbeddingStreamer. Both feed the master observability initiative T269497764.

Design note: per-shard cardinality is N OBC keys x 3 stats; `num_dedup_threads_` is configurable but typically small (4-8). `.total` gives an at-a-glance view; the per-shard keys are for hot-spot debugging.
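
For a sense of scale, the cardinality arithmetic can be sketched as below. The helper names are hypothetical, and the number of stats per key is left as a parameter because it depends on which logger helper is used.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Builds the dedup gauge key names for `numShards` shards, followed by the
// rollup key. Key names follow the PR description; the helper is illustrative.
inline std::vector<std::string> dedupDepthKeys(std::size_t numShards) {
  std::vector<std::string> keys;
  keys.reserve(numShards + 1);
  for (std::size_t i = 0; i < numShards; ++i) {
    keys.push_back("dedup_depth.shard_" + std::to_string(i));
  }
  keys.push_back("dedup_depth.total");
  return keys;
}

// Number of emitted time series: one per (key, stat) pair.
inline std::size_t seriesCount(std::size_t numKeys, std::size_t statsPerKey) {
  return numKeys * statsPerKey;
}
```

With 8 shards and 3 stats per key, the per-shard keys account for 24 series, or 27 once the rollup key is included.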

Reviewed By: FriedCosey

Differential Revision: D108312495

meta-cla Bot added the `cla signed` label Jun 16, 2026
meta-codesync Bot changed the title from "Queue-depth ODS counters" to "Queue-depth ODS counters (#5913)" Jun 17, 2026
adityas-meta added a commit to adityas-meta/FBGEMM-1 that referenced this pull request Jun 17, 2026
Summary:

Add observability for the two queue depths in the RES streaming pipeline. (1) `weights_to_stream_queue_` (MPSC stream queue in fbgemm) emits `res.queue.stream_mpsc_depth` after every dequeue. (2) `tensor_to_dedup_queues_[i]` (per-shard dedup queues in training_ps) emit `res.queue.dedup_depth.shard_{i}` per shard plus a `res.queue.dedup_depth.total` rollup, bumped at the top of `co_stream_tensors`. Backpressure today is invisible until a trainer crashes; these signals let oncall see saturation trends and per-shard hot-spots in advance.

Both surfaces use OBC via `TrainingPsOdsLogger::bumpKeyLatency` (emits AVG+P50+P99 — the right stats for depth percentiles), on the `raw_embedding_streaming` category. The handler already owned `ods_logger_`; the fbgemm-side `RawEmbeddingStreamer` reuses the `TrainingPsOdsLogger` member added by the parent diff D107811590 (constructed only when streaming is enabled). No fb303: OBC reaches ODS via the host-level agent without per-process export config. (An earlier revision used an fb303 `ResGauges.h` helper for the fbgemm gauge; removed per review feedback on the sibling silent-failure diff.)

Stacked on D107811590 (silent-failure ODS counters), which introduces the shared OBC logger member on `RawEmbeddingStreamer`. Both feed the master observability initiative T269497764.

Design note: per-shard cardinality is N OBC keys x 3 stats; `num_dedup_threads_` is configurable but typically small (4-8); `.total` is for at-a-glance, per-shard for hot-spot debug.

Differential Revision: D108312495
adityas-meta added a commit to adityas-meta/FBGEMM-1 that referenced this pull request Jun 17, 2026
Summary:
X-link: facebookresearch/FBGEMM#2832


Add observability for the two queue depths in the RES streaming pipeline. (1) `weights_to_stream_queue_` (MPSC stream queue in fbgemm) emits `res.queue.stream_mpsc_depth` after every dequeue. (2) `tensor_to_dedup_queues_[i]` (per-shard dedup queues in training_ps) emit `res.queue.dedup_depth.shard_{i}` per shard plus a `res.queue.dedup_depth.total` rollup, bumped at the top of `co_stream_tensors`. Backpressure today is invisible until a trainer crashes; these signals let oncall see saturation trends and per-shard hot-spots in advance.

Both surfaces use OBC via `TrainingPsOdsLogger::bumpKeyLatency` (emits AVG+P50+P99 — the right stats for depth percentiles), on the `raw_embedding_streaming` category. The handler already owned `ods_logger_`; the fbgemm-side `RawEmbeddingStreamer` reuses the `TrainingPsOdsLogger` member added by the parent diff D107811590 (constructed only when streaming is enabled). No fb303: OBC reaches ODS via the host-level agent without per-process export config. (An earlier revision used an fb303 `ResGauges.h` helper for the fbgemm gauge; removed per review feedback on the sibling silent-failure diff.)

Stacked on D107811590 (silent-failure ODS counters), which introduces the shared OBC logger member on `RawEmbeddingStreamer`. Both feed the master observability initiative T269497764.

Design note: per-shard cardinality is N OBC keys x 3 stats; `num_dedup_threads_` is configurable but typically small (4-8); `.total` is for at-a-glance, per-shard for hot-spot debug.

Differential Revision: D108312495
adityas-meta force-pushed the export-D108312495 branch 2 times, most recently from 76587ba to d28c754, June 22, 2026 21:20
adityas-meta added a commit to adityas-meta/FBGEMM-1 that referenced this pull request Jun 22, 2026
Summary:
X-link: facebookresearch/FBGEMM#2832


Add observability for the two queue depths in the RES streaming pipeline. (1) `weights_to_stream_queue_` (MPSC stream queue in fbgemm) emits `stream_mpsc_depth` after every dequeue. (2) `tensor_to_dedup_queues_[i]` (per-shard dedup queues in training_ps) emit `dedup_depth.shard_{i}` per shard plus a `dedup_depth.total` rollup, bumped at the top of `co_stream_tensors`. Backpressure today is invisible until a trainer crashes; these signals let oncall see saturation trends and per-shard hot-spots in advance.

Both surfaces use OBC via `TrainingPsOdsLogger::bumpKeyGauge` (emits AVG+P50+P99 — the right stats for depth percentiles), on the `raw_embedding_streaming` category. The handler already owned `ods_logger_`; the fbgemm-side `RawEmbeddingStreamer` reuses the `TrainingPsOdsLogger` member added by the parent diff D107811590 (constructed only when streaming is enabled). No fb303: OBC reaches ODS via the host-level agent without per-process export config. (An earlier revision used an fb303 `ResGauges.h` helper for the fbgemm gauge; removed per review feedback on the sibling silent-failure diff.)

Stacked on D107811590 (silent-failure ODS counters), which introduces the shared OBC logger member on `RawEmbeddingStreamer`. Both feed the master observability initiative T269497764.

Design note: per-shard cardinality is N OBC keys x 3 stats; `num_dedup_threads_` is configurable but typically small (4-8); `.total` is for at-a-glance, per-shard for hot-spot debug.

Differential Revision: D108312495
adityas-meta added a commit to adityas-meta/FBGEMM-1 that referenced this pull request Jun 23, 2026
Summary:
X-link: facebookresearch/FBGEMM#2832


Add observability for the two queue depths in the RES streaming pipeline. (1) `weights_to_stream_queue_` (MPSC stream queue in fbgemm) emits `stream_mpsc_depth` after every dequeue. (2) `tensor_to_dedup_queues_[i]` (per-shard dedup queues in training_ps) emit `dedup_depth.shard_{i}` per shard plus a `dedup_depth.total` rollup, bumped at the top of `co_stream_tensors`. Backpressure today is invisible until a trainer crashes; these signals let oncall see saturation trends and per-shard hot-spots in advance.

Both surfaces use OBC via `TrainingPsOdsLogger::bumpKeyGauge` (emits P50+P99 — the right stats for depth percentiles), on the `raw_embedding_streaming` category. The handler already owned `ods_logger_`; the fbgemm-side `RawEmbeddingStreamer` reuses the `TrainingPsOdsLogger` member added by the parent diff D107811590 (constructed only when streaming is enabled). No fb303: OBC reaches ODS via the host-level agent without per-process export config. (An earlier revision used an fb303 `ResGauges.h` helper for the fbgemm gauge; removed per review feedback on the sibling silent-failure diff.)

Stacked on D107811590 (silent-failure ODS counters), which introduces the shared OBC logger member on `RawEmbeddingStreamer`. Both feed the master observability initiative T269497764.

Design note: per-shard cardinality is N OBC keys x 3 stats; `num_dedup_threads_` is configurable but typically small (4-8); `.total` is for at-a-glance, per-shard for hot-spot debug.

Differential Revision: D108312495
adityas-meta force-pushed the export-D108312495 branch 2 times, most recently from 453b3b6 to 06e6872, June 26, 2026 18:10
adityas-meta added a commit to adityas-meta/FBGEMM-1 that referenced this pull request Jun 26, 2026
Summary:
X-link: facebookresearch/FBGEMM#2832


Add observability for the two queue depths in the RES streaming pipeline. (1) `weights_to_stream_queue_` (MPSC stream queue in fbgemm) emits `stream_mpsc_depth` after every dequeue. (2) `tensor_to_dedup_queues_[i]` (per-shard dedup queues in training_ps) emit `dedup_depth.shard_{i}` per shard plus a `dedup_depth.total` rollup, bumped at the top of `co_stream_tensors`. Backpressure today is invisible until a trainer crashes; these signals let oncall see saturation trends and per-shard hot-spots in advance.

Both surfaces use OBC via `TrainingPsOdsLogger::bumpKeyGauge` (emits P50+P99 — the right stats for depth percentiles), on the `raw_embedding_streaming` category. The handler already owned `ods_logger_`; the fbgemm-side `RawEmbeddingStreamer` reuses the `TrainingPsOdsLogger` member added by the parent diff D107811590 (constructed only when streaming is enabled). No fb303: OBC reaches ODS via the host-level agent without per-process export config. (An earlier revision used an fb303 `ResGauges.h` helper for the fbgemm gauge; removed per review feedback on the sibling silent-failure diff.)

Stacked on D107811590 (silent-failure ODS counters), which introduces the shared OBC logger member on `RawEmbeddingStreamer`. Both feed the master observability initiative T269497764.

Design note: per-shard cardinality is N OBC keys x 3 stats; `num_dedup_threads_` is configurable but typically small (4-8); `.total` is for at-a-glance, per-shard for hot-spot debug.

Reviewed By: FriedCosey

Differential Revision: D108312495
meta-codesync Bot commented Jun 30, 2026

This pull request has been merged in 969615c.
