Skip to content

Commit 599ca5c

Browse files
author
Willem Homan
committed
feat(platform): PAYMENTS-11567 Resque metrics for time spent on queue on worker processing time
1 parent bdddff6 commit 599ca5c

14 files changed

Lines changed: 861 additions & 21 deletions

File tree

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,8 @@ Changelog for the bc-prometheus-ruby gem.
22

33
### Pending Release
44

5+
- Add opt-in per-Resque-job histograms `resque_job_queue_latency_seconds` and `resque_job_perform_duration_seconds`, labelled by `job_class`. Recorded from the parent worker process via a `Module#prepend` around `Resque::Worker#perform_with_fork` (queue latency before the fork, perform duration after), so no synchronous flush is needed in the forked child. Gated by `PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED` (default off).
6+
57
## 0.8.1
68

79
- Prometheus client respects the enabled setting

README.md

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -43,6 +43,17 @@ require 'bigcommerce/prometheus'
4343
Bigcommerce::Prometheus::Instrumentors::Resque.new(app: Rails.application).start
4444
```
4545

46+
### Per-job metrics (opt-in)
47+
48+
Set `PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED=1` on Resque worker pods to enable two additional histograms recorded from the parent worker process (no per-job synchronous flush):
49+
50+
- `resque_job_queue_latency_seconds{job_class}` — time from `scheduled_at` (falling back to `enqueued_at`) until a worker picks the job up. Per attempt; retries-with-backoff anchor on `scheduled_at` so the intentional backoff doesn't show as latency.
51+
- `resque_job_perform_duration_seconds{job_class}` — total Resque child lifetime (fork → `Process.waitpid` return). Includes fork overhead, Redis reconnect, after_fork hooks, perform, and exit.
52+
53+
These are off by default because they emit one histogram observation per job per worker pod, which adds cardinality. Opt in per service.
54+
55+
ActiveJob-wrapped jobs are unwrapped automatically — the `job_class` label is the user's job class name, not `ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper`.
56+
4657
## Configuration
4758

4859
After requiring the main file, you can further configure with:
@@ -58,6 +69,7 @@ After requiring the main file, you can further configure with:
5869
| server_thread_pool_size | The number of threads used for the exporter server | `3` | `ENV['PROMETHEUS_SERVER_THREAD_POOL_SIZE']` |
5970
| process_name | What the current process name is (used in logging) | `"unknown"` | `ENV['PROCESS']` |
6071
| railtie_disabled | Opt out flag for Railtie; use `Bigcommerce::Prometheus::Instrumentors::Web.new(app: Rails.application).start` in your app's code to start it up yourself | `0` | `ENV['PROMETHEUS_DISABLE_RAILTIE']` |
72+
| resque_per_job_metrics_enabled | Enable per-job queue-latency and perform-duration histograms (parent-side, no synchronous flush) | `0` | `ENV['PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED']` |
6173

6274
## Custom Collectors
6375

lib/bigcommerce/prometheus.rb

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -35,6 +35,7 @@
3535
require_relative 'prometheus/collectors/resque'
3636
require_relative 'prometheus/type_collectors/base'
3737
require_relative 'prometheus/type_collectors/resque'
38+
require_relative 'prometheus/type_collectors/resque_job'
3839
require_relative 'prometheus/integrations/active_record'
3940
require_relative 'prometheus/type_collectors/active_record'
4041

@@ -44,6 +45,8 @@
4445
require_relative 'prometheus/integrations/railtie' if defined?(Rails)
4546
require_relative 'prometheus/integrations/puma'
4647
require_relative 'prometheus/integrations/resque'
48+
require_relative 'prometheus/integrations/resque/job_payload'
49+
require_relative 'prometheus/integrations/resque/job_metrics'
4750

4851
require_relative 'prometheus/servers/puma/server'
4952
require_relative 'prometheus/servers/puma/rack_app'

lib/bigcommerce/prometheus/configuration.rb

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -35,6 +35,7 @@ module Configuration
3535
puma_process_label: ENV.fetch('PROMETHEUS_PUMA_PROCESS_LABEL', 'web').to_s,
3636
resque_collection_frequency: ENV.fetch('PROMETHEUS_RESQUE_COLLECTION_FREQUENCY', 30).to_i,
3737
resque_process_label: ENV.fetch('PROMETHEUS_REQUEST_PROCESS_LABEL', 'resque').to_s,
38+
resque_per_job_metrics_enabled: ENV.fetch('PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED', 0).to_i.positive?,
3839

3940
# Server configuration
4041
not_found_body: ENV.fetch('PROMETHEUS_SERVER_NOT_FOUND_BODY', 'Not Found! The Prometheus Ruby Exporter only listens on /metrics and /send-metrics').to_s,

lib/bigcommerce/prometheus/instrumentors/resque.rb

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -46,6 +46,7 @@ def start
4646

4747
server.add_type_collector(PrometheusExporter::Server::ActiveRecordCollector.new)
4848
server.add_type_collector(Bigcommerce::Prometheus::TypeCollectors::Resque.new)
49+
server.add_type_collector(Bigcommerce::Prometheus::TypeCollectors::ResqueJob.new)
4950
@type_collectors.each do |tc|
5051
server.add_type_collector(tc)
5152
end

lib/bigcommerce/prometheus/integrations/resque.rb

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -34,6 +34,9 @@ def self.start(client: nil)
3434
client: client || ::Bigcommerce::Prometheus.client,
3535
frequency: ::Bigcommerce::Prometheus.resque_collection_frequency
3636
)
37+
::Bigcommerce::Prometheus::Integrations::Resque::JobMetrics.start(
38+
client: client || ::Bigcommerce::Prometheus.client
39+
)
3740
end
3841
end
3942
end
Lines changed: 157 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,157 @@
1+
# frozen_string_literal: true
2+
3+
# Copyright (c) 2019-present, BigCommerce Pty. Ltd. All rights reserved
4+
#
5+
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
6+
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation the
7+
# rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit
8+
# persons to whom the Software is furnished to do so, subject to the following conditions:
9+
#
10+
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the
11+
# Software.
12+
#
13+
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
14+
# WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
15+
# COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
16+
# OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
17+
#
18+
module Bigcommerce
19+
module Prometheus
20+
module Integrations
21+
class Resque
22+
##
23+
# Per-Resque-job histogram metrics, recorded from the parent worker
24+
# process. Hooked via a prepend around
25+
# Resque::Worker#perform_with_fork — queue latency is captured before
26+
# super, perform duration after.
27+
#
28+
# Off unless PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED=1 — emits one
29+
# histogram observation per job per worker process, which can be high
30+
# cardinality at scale.
31+
#
32+
# NOTE: queue_latency only emits for jobs whose Resque payload is
33+
# ActiveJob-shaped. The gem reads three fields from
34+
# `payload['args'][0]` (which must be a Hash):
35+
#
36+
# * job_class — the user's actual job class name; used as the
37+
# metric label.
38+
# * enqueued_at — ISO 8601 string; used as the queue-latency
39+
# anchor when scheduled_at is absent.
40+
# * scheduled_at — ISO 8601 string; preferred over enqueued_at
41+
# when present (e.g. retries-with-backoff, so
42+
# the intentional wait isn't counted as latency).
43+
#
44+
# ActiveJob produces this shape natively when jobs are enqueued via
45+
# .perform_later — the payload is wrapped by
46+
# ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper, which stamps
47+
# the three fields above into `args[0]`.
48+
#
49+
# Vanilla Resque jobs (`class MyJob; @queue = :foo; def self.perform;
50+
# end`, enqueued via Resque.enqueue) do not produce this shape —
51+
# their args are raw primitive values, not a wrapping hash. For
52+
# these jobs, queue_latency silently no-ops. perform_duration works
53+
# for both styles regardless.
54+
#
55+
# Services that want queue_latency for vanilla Resque jobs can
56+
# produce the ActiveJob-shaped payload themselves via a service-local
57+
# wrapper class — see the PR description for options.
58+
#
59+
module JobMetrics
60+
class << self
61+
##
62+
# Install the parent-side hooks if the per-job metrics feature is
63+
# enabled. Idempotent: safe to call multiple times.
64+
#
65+
# @param [PrometheusExporter::Client] client
66+
#
67+
def start(client:)
68+
return unless ::Bigcommerce::Prometheus.resque_per_job_metrics_enabled
69+
70+
@client = client
71+
install_hooks
72+
end
73+
74+
##
75+
# Push the queue-latency observation for a job that's about to be
76+
# picked up by a worker. Anchors on scheduled_at if present (so
77+
# retries-with-backoff don't show the intentional wait as latency),
78+
# falling back to enqueued_at.
79+
#
80+
# @param [JobPayload] payload
81+
#
82+
def record_queue_latency(payload)
83+
anchor = payload.anchor_time
84+
return unless anchor
85+
86+
# Clock skew between the enqueuer/scheduler and the worker can put
87+
# the anchor in the future; clamp so the histogram never records a
88+
# negative latency.
89+
latency = (Time.now - anchor).to_f.clamp(0.0..)
90+
91+
@client.send_json(
92+
type: 'resque_job',
93+
metric: 'queue_latency',
94+
value: latency,
95+
custom_labels: { job_class: payload.job_class }
96+
)
97+
rescue StandardError => e
98+
::Bigcommerce::Prometheus.logger&.warn(
99+
"[bc-prometheus-ruby] resque_job queue_latency push failed: #{e.message}"
100+
)
101+
end
102+
103+
##
104+
# Push the perform-duration observation for a completed job. Called
105+
# from the `Resque::Worker#perform_with_fork` prepend, so it measures
106+
# the full child lifetime (fork + reconnect + perform + exit).
107+
#
108+
# @param [JobPayload] payload
109+
# @param [Float] duration in seconds
110+
#
111+
def record_perform_duration(payload, duration)
112+
@client.send_json(
113+
type: 'resque_job',
114+
metric: 'perform_duration',
115+
value: duration,
116+
custom_labels: { job_class: payload.job_class }
117+
)
118+
rescue StandardError => e
119+
::Bigcommerce::Prometheus.logger&.warn(
120+
"[bc-prometheus-ruby] resque_job perform_duration push failed: #{e.message}"
121+
)
122+
end
123+
124+
private
125+
126+
def install_hooks
127+
return if @hooks_installed
128+
129+
::Resque::Worker.prepend(WorkerInstrumentation)
130+
@hooks_installed = true
131+
end
132+
end
133+
134+
##
135+
# Prepended onto Resque::Worker to capture both queue latency
136+
# (before super) and perform duration (after super) for every
137+
# job that goes through perform_with_fork. JobPayload is built
138+
# once per job and shared between the two recordings.
139+
#
140+
module WorkerInstrumentation
141+
def perform_with_fork(job, &block)
142+
started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
143+
payload = JobPayload.new(job)
144+
JobMetrics.record_queue_latency(payload)
145+
super
146+
ensure
147+
JobMetrics.record_perform_duration(
148+
payload,
149+
Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at
150+
)
151+
end
152+
end
153+
end
154+
end
155+
end
156+
end
157+
end
Lines changed: 96 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,96 @@
1+
# frozen_string_literal: true
2+
3+
# Copyright (c) 2019-present, BigCommerce Pty. Ltd. All rights reserved
4+
#
5+
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
6+
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation the
7+
# rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit
8+
# persons to whom the Software is furnished to do so, subject to the following conditions:
9+
#
10+
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the
11+
# Software.
12+
#
13+
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
14+
# WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
15+
# COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
16+
# OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
17+
#
18+
require 'time'
19+
20+
module Bigcommerce
21+
module Prometheus
22+
module Integrations
23+
class Resque
24+
##
25+
# Extracts the fields per-job metrics need from a Resque::Job's
26+
# payload. Eagerly parses in #initialize and exposes plain
27+
# attr_readers — does not hold a reference to the Resque::Job after
28+
# construction.
29+
#
30+
# See JobMetrics's class-level docs for the ActiveJob-shaped payload
31+
# contract this class consumes.
32+
#
33+
class JobPayload
34+
# @return [String] the user's actual job class name. For
35+
# ActiveJob-shaped payloads this is `args[0]['job_class']`.
36+
# Falls back to the top-level Resque `class` field for vanilla
37+
# Resque payloads. `'unknown'` if neither is available, or if
38+
# the payload is malformed (nil, non-Hash, etc.).
39+
attr_reader :job_class
40+
41+
# @return [Time, nil] the queue-latency anchor. Prefers
42+
# `scheduled_at` over `enqueued_at` (so retries-with-backoff
43+
# don't count the intentional wait). nil for non-ActiveJob-shaped
44+
# payloads (where `args[0]` isn't a Hash), for payloads where
45+
# both timestamps are absent or unparseable, and for malformed
46+
# payloads.
47+
attr_reader :anchor_time
48+
49+
# @param [Resque::Job] resque_job
50+
def initialize(resque_job)
51+
payload = resque_job.payload || {}
52+
inner = extract_activejob_inner(payload)
53+
54+
@job_class = extract_job_class(payload, inner)
55+
@anchor_time = extract_anchor_time(inner)
56+
end
57+
58+
private
59+
60+
# Returns the inner ActiveJob-shaped payload (`args[0]` if it's a
61+
# Hash), or nil for any payload shape that isn't ActiveJob-wrapped
62+
# (vanilla Resque jobs with primitive args, non-Hash payloads,
63+
# `args` not being an Array, etc.).
64+
def extract_activejob_inner(payload)
65+
return nil unless payload.is_a?(Hash)
66+
67+
args = payload['args']
68+
first = args.is_a?(Array) ? args.first : nil
69+
first.is_a?(Hash) ? first : nil
70+
end
71+
72+
def extract_job_class(payload, inner)
73+
inner_class = inner.is_a?(Hash) ? inner['job_class'] : nil
74+
outer_class = payload.is_a?(Hash) ? payload['class'] : nil
75+
inner_class || outer_class || 'unknown'
76+
end
77+
78+
def extract_anchor_time(inner)
79+
return nil unless inner.is_a?(Hash)
80+
81+
parse_time(inner['scheduled_at']) || parse_time(inner['enqueued_at'])
82+
end
83+
84+
def parse_time(value)
85+
return value if value.is_a?(Time)
86+
return nil if value.nil? || value.to_s.empty?
87+
88+
Time.iso8601(value.to_s)
89+
rescue ArgumentError
90+
nil
91+
end
92+
end
93+
end
94+
end
95+
end
96+
end

lib/bigcommerce/prometheus/type_collectors/resque.rb

Lines changed: 5 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -19,12 +19,14 @@ module Bigcommerce
1919
module Prometheus
2020
module TypeCollectors
2121
##
22-
# Collect resque data from collectors and parse them into metrics
22+
# Aggregates the periodic worker/queue state pushed from
23+
# `Bigcommerce::Prometheus::Collectors::Resque#collect`. Per-job
24+
# histograms (`queue_latency`, `perform_duration`) live in
25+
# `TypeCollectors::ResqueJob` so the upstream router can dispatch each
26+
# envelope type to its own collector.
2327
#
2428
class Resque < Bigcommerce::Prometheus::TypeCollectors::Base
2529
##
26-
# Initialize the collector
27-
#
2830
# @return [Hash]
2931
#
3032
def build_metrics
@@ -38,9 +40,6 @@ def build_metrics
3840
}
3941
end
4042

41-
##
42-
# Collect resque metrics from input data
43-
#
4443
def collect_metrics(data:, labels: {})
4544
metric(:workers_total).observe(data['workers_total'], labels)
4645
metric(:jobs_failed_total).observe(data['jobs_failed_total'], labels)

0 commit comments

Comments
 (0)