Skip to content

Commit 39b0fbd

Browse files
author
Willem Homan
committed
feat(platform): PAYMENTS-11567 Resque metrics for time spent on queue on worker processing time
1 parent bdddff6 commit 39b0fbd

16 files changed

Lines changed: 1086 additions & 0 deletions

File tree

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,8 @@ Changelog for the bc-prometheus-ruby gem.
22

33
### Pending Release
44

5+
- Add opt-in per-Resque-job histograms `resque_job_queue_latency_seconds` and `resque_job_perform_duration_seconds`, labelled by `job_class`.
6+
57
## 0.8.1
68

79
- Prometheus client respects the enabled setting

README.md

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -43,6 +43,18 @@ require 'bigcommerce/prometheus'
4343
Bigcommerce::Prometheus::Instrumentors::Resque.new(app: Rails.application).start
4444
```
4545

46+
### Per-job metrics (opt-in)
47+
48+
Set `PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED=1` on Resque worker pods to enable two additional histograms recorded from the parent worker process.
49+
50+
- `resque_job_queue_latency_seconds{job_class}` — time from `scheduled_at` (falling back to `enqueued_at`) until a worker picks the job up. Per attempt; retries-with-backoff anchor on `scheduled_at` so the intentional backoff doesn't show as latency.
51+
- `resque_job_perform_duration_seconds{job_class}` — total Resque child lifetime (fork → `Process.waitpid` return). Includes fork overhead, Redis reconnect, after_fork hooks, perform, and exit.
52+
53+
These are off by default because they emit one histogram observation per job per worker pod, which adds cardinality. Opt in per service.
54+
55+
`resque_job_queue_latency_seconds` is supported for jobs enqueued via ActiveJob (`.perform_later`) — the enqueue timestamps come from ActiveJob's serialized payload, and the `job_class` label is the user's job class name, not `ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper`.
56+
Vanilla Resque jobs (`Resque.enqueue`) carry no enqueue timestamps, so `queue_latency` silently no-ops for them; `perform_duration` works for every job regardless.
57+
4658
## Configuration
4759

4860
After requiring the main file, you can further configure with:
@@ -58,6 +70,7 @@ After requiring the main file, you can further configure with:
5870
| server_thread_pool_size | The number of threads used for the exporter server | `3` | `ENV['PROMETHEUS_SERVER_THREAD_POOL_SIZE']` |
5971
| process_name | What the current process name is (used in logging) | `"unknown"` | `ENV['PROCESS']` |
6072
| railtie_disabled | Opt out flag for Railtie; use `Bigcommerce::Prometheus::Instrumentors::Web.new(app: Rails.application).start` in your app's code to start it up yourself | `0` | `ENV['PROMETHEUS_DISABLE_RAILTIE']` |
73+
| resque_per_job_metrics_enabled | Enable per-job queue-latency and perform-duration histograms (parent-side, no synchronous flush) | `0` | `ENV['PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED']` |
6174

6275
## Custom Collectors
6376

lib/bigcommerce/prometheus.rb

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -35,6 +35,7 @@
3535
require_relative 'prometheus/collectors/resque'
3636
require_relative 'prometheus/type_collectors/base'
3737
require_relative 'prometheus/type_collectors/resque'
38+
require_relative 'prometheus/type_collectors/resque_job'
3839
require_relative 'prometheus/integrations/active_record'
3940
require_relative 'prometheus/type_collectors/active_record'
4041

@@ -44,6 +45,10 @@
4445
require_relative 'prometheus/integrations/railtie' if defined?(Rails)
4546
require_relative 'prometheus/integrations/puma'
4647
require_relative 'prometheus/integrations/resque'
48+
require_relative 'prometheus/integrations/resque/active_job_payload'
49+
require_relative 'prometheus/integrations/resque/vanilla_resque_payload'
50+
require_relative 'prometheus/integrations/resque/job_payload'
51+
require_relative 'prometheus/integrations/resque/job_metrics'
4752

4853
require_relative 'prometheus/servers/puma/server'
4954
require_relative 'prometheus/servers/puma/rack_app'

lib/bigcommerce/prometheus/configuration.rb

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -35,6 +35,7 @@ module Configuration
3535
puma_process_label: ENV.fetch('PROMETHEUS_PUMA_PROCESS_LABEL', 'web').to_s,
3636
resque_collection_frequency: ENV.fetch('PROMETHEUS_RESQUE_COLLECTION_FREQUENCY', 30).to_i,
3737
resque_process_label: ENV.fetch('PROMETHEUS_REQUEST_PROCESS_LABEL', 'resque').to_s,
38+
resque_per_job_metrics_enabled: ENV.fetch('PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED', 0).to_i.positive?,
3839

3940
# Server configuration
4041
not_found_body: ENV.fetch('PROMETHEUS_SERVER_NOT_FOUND_BODY', 'Not Found! The Prometheus Ruby Exporter only listens on /metrics and /send-metrics').to_s,

lib/bigcommerce/prometheus/instrumentors/resque.rb

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -46,6 +46,7 @@ def start
4646

4747
server.add_type_collector(PrometheusExporter::Server::ActiveRecordCollector.new)
4848
server.add_type_collector(Bigcommerce::Prometheus::TypeCollectors::Resque.new)
49+
server.add_type_collector(Bigcommerce::Prometheus::TypeCollectors::ResqueJob.new)
4950
@type_collectors.each do |tc|
5051
server.add_type_collector(tc)
5152
end

lib/bigcommerce/prometheus/integrations/resque.rb

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -34,6 +34,9 @@ def self.start(client: nil)
3434
client: client || ::Bigcommerce::Prometheus.client,
3535
frequency: ::Bigcommerce::Prometheus.resque_collection_frequency
3636
)
37+
::Bigcommerce::Prometheus::Integrations::Resque::JobMetrics.start(
38+
client: client || ::Bigcommerce::Prometheus.client
39+
)
3740
end
3841
end
3942
end
Lines changed: 65 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,65 @@
1+
# frozen_string_literal: true
2+
3+
# Copyright (c) 2019-present, BigCommerce Pty. Ltd. All rights reserved
4+
#
5+
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
6+
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation the
7+
# rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit
8+
# persons to whom the Software is furnished to do so, subject to the following conditions:
9+
#
10+
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the
11+
# Software.
12+
#
13+
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
14+
# WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
15+
# COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
16+
# OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
17+
#
18+
require 'time'
19+
20+
module Bigcommerce
21+
module Prometheus
22+
module Integrations
23+
class Resque
24+
##
25+
# Payload fields for an ActiveJob-shaped Resque job, read from the
26+
# inner hash at `args[0]`. ActiveJob's JobWrapper stamps the three
27+
# fields the per-job metrics consume:
28+
#
29+
# * job_class — the user's actual job class name; used as the
30+
# metric label.
31+
# * enqueued_at — ISO 8601 string; queue-latency anchor when
32+
# scheduled_at is absent.
33+
# * scheduled_at — ISO 8601 string; preferred over enqueued_at
34+
# when present (e.g. retries-with-backoff, so the
35+
# intentional wait isn't counted as latency).
36+
class ActiveJobPayload
37+
# @return [String] the user's actual job class name
38+
attr_reader :job_class
39+
40+
# @return [Time, nil] the queue-latency anchor; nil when both
41+
# timestamps are absent or unparseable
42+
attr_reader :anchor_time
43+
44+
# @param [Hash] inner the ActiveJob-shaped hash at `args[0]`;
45+
# JobPayload.for guarantees a truthy 'job_class'
46+
def initialize(inner)
47+
@job_class = inner['job_class']
48+
@anchor_time = parse_time(inner['scheduled_at']) || parse_time(inner['enqueued_at'])
49+
end
50+
51+
private
52+
53+
def parse_time(value)
54+
return value if value.is_a?(Time)
55+
return nil if value.nil? || value.to_s.empty?
56+
57+
Time.iso8601(value.to_s)
58+
rescue ArgumentError
59+
nil
60+
end
61+
end
62+
end
63+
end
64+
end
65+
end
Lines changed: 159 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,159 @@
1+
# frozen_string_literal: true
2+
3+
# Copyright (c) 2019-present, BigCommerce Pty. Ltd. All rights reserved
4+
#
5+
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
6+
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation the
7+
# rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit
8+
# persons to whom the Software is furnished to do so, subject to the following conditions:
9+
#
10+
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the
11+
# Software.
12+
#
13+
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
14+
# WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
15+
# COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
16+
# OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
17+
#
18+
module Bigcommerce
19+
module Prometheus
20+
module Integrations
21+
class Resque
22+
##
23+
# Per-Resque-job histogram metrics, recorded from the parent worker process.
24+
# Hooked via a prepend around Resque::Worker#perform_with_fork.
25+
# Queue latency is captured before super, perform duration after.
26+
#
27+
# Off unless PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED=1
28+
# Emits one histogram observation per job per worker process, which can be high cardinality at scale.
29+
#
30+
# NOTE: queue_latency is supported for jobs enqueued via ActiveJob
31+
# The gem reads three fields from
32+
# `payload['args'][0]` (which must be a Hash):
33+
#
34+
# * job_class — the user's actual job class name; used as the
35+
# metric label.
36+
# * enqueued_at — ISO 8601 string; used as the queue-latency
37+
# anchor when scheduled_at is absent.
38+
# * scheduled_at — ISO 8601 string; preferred over enqueued_at
39+
# when present (e.g. retries-with-backoff, so
40+
# the intentional wait isn't counted as latency).
41+
#
42+
# ActiveJob produces this shape natively — the payload is wrapped by
43+
# ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper, which stamps
44+
# the three fields above into `args[0]`.
45+
#
46+
# Vanilla Resque jobs enqueued via Resque.enqueue carry no enqueue timestamps.
47+
# class MyJob
48+
# @queue = :foo;
49+
# def self.perform;
50+
# end
51+
# Their args are raw primitive values, not a wrapping hash.
52+
# For these jobs, queue_latency silently no-ops.
53+
# perform_duration works for both styles regardless.
54+
#
55+
# Payloads that replicate the three fields above are read the same way.
56+
# Detection is by shape, not by wrapper class name.
57+
# This means a vanilla job can opt in to queue_latency either by
58+
# - converting to ActiveJob
59+
# - enqueueing through a small wrapper class that stamps these fields into args[0].
60+
#
61+
module JobMetrics
62+
class << self
63+
##
64+
# Install the parent-side hooks if the per-job metrics feature is enabled.
65+
# Idempotent: safe to call multiple times.
66+
#
67+
# @param [PrometheusExporter::Client] client
68+
#
69+
def start(client:)
70+
return unless ::Bigcommerce::Prometheus.resque_per_job_metrics_enabled
71+
72+
@client = client
73+
install_hooks
74+
end
75+
76+
##
77+
# Push the queue-latency observation for a job that's about to be picked up by a worker.
78+
# Anchors on scheduled_at if present so retries-with-backoff don't show the intentional wait as latency.
79+
# Falls back to enqueued_at if scheduled_at isn't present.
80+
#
81+
# @param [ActiveJobPayload, VanillaResquePayload] payload
82+
#
83+
def record_queue_latency(payload)
84+
anchor = payload.anchor_time
85+
return unless anchor
86+
87+
# Clock skew between the enqueuer/scheduler and the worker can put the anchor in the future.
88+
# Clamp to zero so the histogram never records a negative latency.
89+
latency = (Time.now - anchor).to_f.clamp(0.0..)
90+
91+
@client.send_json(
92+
type: 'resque_job',
93+
metric: 'queue_latency',
94+
value: latency,
95+
custom_labels: { job_class: payload.job_class }
96+
)
97+
rescue StandardError => e
98+
::Bigcommerce::Prometheus.logger&.warn(
99+
"[bigcommerce-prometheus] resque_job queue_latency push failed: #{e.message}"
100+
)
101+
end
102+
103+
##
104+
# Push the perform-duration observation for a completed job.
105+
# Called from the `Resque::Worker#perform_with_fork` prepend, so it measures the full child lifetime:
106+
# fork + reconnect + perform + exit
107+
#
108+
# The duration is computed here, not at the call site: the caller invokes this from an
109+
# ensure block, which must never raise over an exception already propagating. Keeping
110+
# the arithmetic inside this rescue absorbs every recording failure — including a nil
111+
# started_at when a catastrophic error fired before timing began.
112+
#
113+
# @param [ActiveJobPayload, VanillaResquePayload] payload
114+
# @param [Float] started_at monotonic timestamp taken just before the fork
115+
#
116+
def record_perform_duration(payload, started_at)
117+
@client.send_json(
118+
type: 'resque_job',
119+
metric: 'perform_duration',
120+
value: Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at,
121+
custom_labels: { job_class: payload.job_class }
122+
)
123+
rescue StandardError => e
124+
::Bigcommerce::Prometheus.logger&.warn(
125+
"[bigcommerce-prometheus] resque_job perform_duration push failed: #{e.message}"
126+
)
127+
end
128+
129+
private
130+
131+
def install_hooks
132+
return if @hooks_installed
133+
134+
::Resque::Worker.prepend(WorkerInstrumentation)
135+
@hooks_installed = true
136+
end
137+
end
138+
139+
##
140+
# Prepended onto Resque::Worker to capture for every job that goes through perform_with_fork:
141+
# - queue latency: before super
142+
# - perform duration: after super
143+
module WorkerInstrumentation
144+
def perform_with_fork(job, &block)
145+
payload = JobPayload.for(job)
146+
JobMetrics.record_queue_latency(payload)
147+
started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
148+
super
149+
ensure
150+
# Bare locals only: argument expressions here evaluate outside the
151+
# recorder's rescue, so they must not be able to raise.
152+
JobMetrics.record_perform_duration(payload, started_at)
153+
end
154+
end
155+
end
156+
end
157+
end
158+
end
159+
end
Lines changed: 74 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,74 @@
1+
# frozen_string_literal: true
2+
3+
# Copyright (c) 2019-present, BigCommerce Pty. Ltd. All rights reserved
4+
#
5+
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
6+
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation the
7+
# rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit
8+
# persons to whom the Software is furnished to do so, subject to the following conditions:
9+
#
10+
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the
11+
# Software.
12+
#
13+
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
14+
# WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
15+
# COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
16+
# OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
17+
#
18+
module Bigcommerce
19+
module Prometheus
20+
module Integrations
21+
class Resque
22+
##
23+
# Classifies a Resque::Job's payload and builds the matching
24+
# shape-specific payload object for per-job metrics.
25+
#
26+
# A payload is ActiveJob-shaped when `args[0]` is a Hash carrying a
27+
# truthy 'job_class' — the shape
28+
# ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper produces
29+
# natively. Detection is by shape rather than by wrapper class name:
30+
# the fields are ActiveJob's stable serialization format (persisted
31+
# payloads must survive Rails upgrades), while the wrapper's class
32+
# name is a private Rails constant — matching on it would silently
33+
# kill the metric if Rails ever moved it. Payloads that replicate
34+
# these fields are read the same way, by mechanism. Everything
35+
# else — vanilla Resque jobs with primitive args, nil or non-Hash
36+
# payloads, `args` not being an Array — is treated as vanilla.
37+
#
38+
# Both payload classes expose the same interface: #job_class
39+
# (String) and #anchor_time (Time or nil).
40+
#
41+
module JobPayload
42+
class << self
43+
##
44+
# Never raises: instrumentation must not break job execution, so a payload object is always returned.
45+
# Unexpected failures degrade to a vanilla payload labelled 'unknown'.
46+
#
47+
# @param [Resque::Job] resque_job
48+
# @return [ActiveJobPayload, VanillaResquePayload]
49+
#
50+
def for(resque_job)
51+
payload = resque_job.payload
52+
payload = {} unless payload.is_a?(Hash)
53+
54+
inner = activejob_inner(payload)
55+
inner ? ActiveJobPayload.new(inner) : VanillaResquePayload.new(payload)
56+
rescue StandardError
57+
VanillaResquePayload.new({})
58+
end
59+
60+
private
61+
62+
def activejob_inner(payload)
63+
args = payload['args']
64+
first = args.is_a?(Array) ? args.first : nil
65+
return nil unless first.is_a?(Hash) && first['job_class']
66+
67+
first
68+
end
69+
end
70+
end
71+
end
72+
end
73+
end
74+
end

0 commit comments

Comments
 (0)