Skip to content

Commit fe8acb7

Browse files
author
Willem Homan
committed
feat(platform): PAYMENTS-11567 Extract resque job metrics to separate class
1 parent bdddff6 commit fe8acb7

14 files changed

Lines changed: 844 additions & 21 deletions

File tree

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,8 @@ Changelog for the bc-prometheus-ruby gem.
22

33
### Pending Release
44

5+
- Add opt-in per-Resque-job histograms `resque_job_queue_latency_seconds` and `resque_job_perform_duration_seconds`, labelled by `job_class`. Recorded from the parent worker process (via `Resque.before_fork` and a `Module#prepend` around `Resque::Worker#perform_with_fork`), so no synchronous flush is needed in the forked child. Gated by `PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED` (default off).
6+
57
## 0.8.1
68

79
- Prometheus client respects the enabled setting

README.md

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -43,6 +43,17 @@ require 'bigcommerce/prometheus'
4343
Bigcommerce::Prometheus::Instrumentors::Resque.new(app: Rails.application).start
4444
```
4545

46+
### Per-job metrics (opt-in)
47+
48+
Set `PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED=1` on Resque worker pods to enable two additional histograms recorded from the parent worker process (no per-job synchronous flush):
49+
50+
- `resque_job_queue_latency_seconds{job_class}` — time from `scheduled_at` (falling back to `enqueued_at`) until a worker picks the job up. Per attempt; retries-with-backoff anchor on `scheduled_at` so the intentional backoff doesn't show as latency.
51+
- `resque_job_perform_duration_seconds{job_class}` — total Resque child lifetime (fork → `Process.waitpid` return). Includes fork overhead, Redis reconnect, after_fork hooks, perform, and exit.
52+
53+
These are off by default because they emit one histogram observation per job per worker pod, which adds cardinality. Opt in per service.
54+
55+
ActiveJob-wrapped jobs are unwrapped automatically — the `job_class` label is the user's job class name, not `ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper`.
56+
4657
## Configuration
4758

4859
After requiring the main file, you can further configure with:
@@ -58,6 +69,7 @@ After requiring the main file, you can further configure with:
5869
| server_thread_pool_size | The number of threads used for the exporter server | `3` | `ENV['PROMETHEUS_SERVER_THREAD_POOL_SIZE']` |
5970
| process_name | What the current process name is (used in logging) | `"unknown"` | `ENV['PROCESS']` |
6071
| railtie_disabled | Opt out flag for Railtie; use `Bigcommerce::Prometheus::Instrumentors::Web.new(app: Rails.application).start` in your app's code to start it up yourself | `0` | `ENV['PROMETHEUS_DISABLE_RAILTIE']` |
72+
| resque_per_job_metrics_enabled | Enable per-job queue-latency and perform-duration histograms (parent-side, no synchronous flush) | `0` | `ENV['PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED']` |
6173

6274
## Custom Collectors
6375

lib/bigcommerce/prometheus.rb

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -35,6 +35,7 @@
3535
require_relative 'prometheus/collectors/resque'
3636
require_relative 'prometheus/type_collectors/base'
3737
require_relative 'prometheus/type_collectors/resque'
38+
require_relative 'prometheus/type_collectors/resque_job'
3839
require_relative 'prometheus/integrations/active_record'
3940
require_relative 'prometheus/type_collectors/active_record'
4041

@@ -44,6 +45,8 @@
4445
require_relative 'prometheus/integrations/railtie' if defined?(Rails)
4546
require_relative 'prometheus/integrations/puma'
4647
require_relative 'prometheus/integrations/resque'
48+
require_relative 'prometheus/integrations/resque/job_payload'
49+
require_relative 'prometheus/integrations/resque/job_metrics'
4750

4851
require_relative 'prometheus/servers/puma/server'
4952
require_relative 'prometheus/servers/puma/rack_app'

lib/bigcommerce/prometheus/configuration.rb

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -35,6 +35,7 @@ module Configuration
3535
puma_process_label: ENV.fetch('PROMETHEUS_PUMA_PROCESS_LABEL', 'web').to_s,
3636
resque_collection_frequency: ENV.fetch('PROMETHEUS_RESQUE_COLLECTION_FREQUENCY', 30).to_i,
3737
resque_process_label: ENV.fetch('PROMETHEUS_REQUEST_PROCESS_LABEL', 'resque').to_s,
38+
resque_per_job_metrics_enabled: ENV.fetch('PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED', 0).to_i.positive?,
3839

3940
# Server configuration
4041
not_found_body: ENV.fetch('PROMETHEUS_SERVER_NOT_FOUND_BODY', 'Not Found! The Prometheus Ruby Exporter only listens on /metrics and /send-metrics').to_s,

lib/bigcommerce/prometheus/instrumentors/resque.rb

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -46,6 +46,7 @@ def start
4646

4747
server.add_type_collector(PrometheusExporter::Server::ActiveRecordCollector.new)
4848
server.add_type_collector(Bigcommerce::Prometheus::TypeCollectors::Resque.new)
49+
server.add_type_collector(Bigcommerce::Prometheus::TypeCollectors::ResqueJob.new)
4950
@type_collectors.each do |tc|
5051
server.add_type_collector(tc)
5152
end

lib/bigcommerce/prometheus/integrations/resque.rb

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -34,6 +34,9 @@ def self.start(client: nil)
3434
client: client || ::Bigcommerce::Prometheus.client,
3535
frequency: ::Bigcommerce::Prometheus.resque_collection_frequency
3636
)
37+
::Bigcommerce::Prometheus::Integrations::Resque::JobMetrics.start(
38+
client: client || ::Bigcommerce::Prometheus.client
39+
)
3740
end
3841
end
3942
end
Lines changed: 152 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,152 @@
1+
# frozen_string_literal: true
2+
3+
# Copyright (c) 2019-present, BigCommerce Pty. Ltd. All rights reserved
4+
#
5+
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
6+
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation the
7+
# rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit
8+
# persons to whom the Software is furnished to do so, subject to the following conditions:
9+
#
10+
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the
11+
# Software.
12+
#
13+
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
14+
# WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
15+
# COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
16+
# OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
17+
#
18+
module Bigcommerce
19+
module Prometheus
20+
module Integrations
21+
class Resque
22+
##
23+
# Per-Resque-job histogram metrics, recorded from the parent worker
24+
# process. Hooked via a prepend around
25+
# Resque::Worker#perform_with_fork — queue latency is captured before
26+
# super, perform duration after.
27+
#
28+
# Off unless PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED=1 — emits one
29+
# histogram observation per job per worker process, which can be high
30+
# cardinality at scale.
31+
#
32+
# NOTE: queue_latency only emits for jobs whose Resque payload is
33+
# ActiveJob-shaped. The gem reads three fields from
34+
# `payload['args'][0]` (which must be a Hash):
35+
#
36+
# * job_class — the user's actual job class name; used as the
37+
# metric label.
38+
# * enqueued_at — ISO 8601 string; used as the queue-latency
39+
# anchor when scheduled_at is absent.
40+
# * scheduled_at — ISO 8601 string; preferred over enqueued_at
41+
# when present (e.g. retries-with-backoff, so
42+
# the intentional wait isn't counted as latency).
43+
#
44+
# ActiveJob produces this shape natively when jobs are enqueued via
45+
# .perform_later — the payload is wrapped by
46+
# ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper, which stamps
47+
# the three fields above into `args[0]`.
48+
#
49+
# Vanilla Resque jobs (`class MyJob; @queue = :foo; def self.perform;
50+
# end`, enqueued via Resque.enqueue) do not produce this shape —
51+
# their args are raw primitive values, not a wrapping hash. For
52+
# these jobs, queue_latency silently no-ops. perform_duration works
53+
# for both styles regardless.
54+
#
55+
# Services that want queue_latency for vanilla Resque jobs can
56+
# produce the ActiveJob-shaped payload themselves via a service-local
57+
# wrapper class — see the PR description for options.
58+
#
59+
module JobMetrics
60+
class << self
61+
##
62+
# Install the parent-side hooks if the per-job metrics feature is
63+
# enabled. Idempotent: safe to call multiple times.
64+
#
65+
# @param [PrometheusExporter::Client] client
66+
#
67+
def start(client:)
68+
return unless ::Bigcommerce::Prometheus.resque_per_job_metrics_enabled
69+
70+
@client = client
71+
install_hooks
72+
end
73+
74+
##
75+
# Push the queue-latency observation for a job that's about to be
76+
# picked up by a worker. Anchors on scheduled_at if present (so
77+
# retries-with-backoff don't show the intentional wait as latency),
78+
# falling back to enqueued_at.
79+
#
80+
# @param [JobPayload] payload
81+
#
82+
def record_queue_latency(payload)
83+
anchor = payload.anchor_time
84+
return unless anchor
85+
86+
@client.send_json(
87+
type: 'resque_job',
88+
metric: 'queue_latency',
89+
value: (Time.now - anchor).to_f,
90+
custom_labels: { job_class: payload.job_class }
91+
)
92+
rescue StandardError => e
93+
::Bigcommerce::Prometheus.logger&.warn(
94+
"[bc-prometheus-ruby] resque_job queue_latency push failed: #{e.message}"
95+
)
96+
end
97+
98+
##
99+
# Push the perform-duration observation for a completed job. Called
100+
# from the `Resque::Worker#perform_with_fork` prepend, so it measures
101+
# the full child lifetime (fork + reconnect + perform + exit).
102+
#
103+
# @param [JobPayload] payload
104+
# @param [Float] duration in seconds
105+
#
106+
def record_perform_duration(payload, duration)
107+
@client.send_json(
108+
type: 'resque_job',
109+
metric: 'perform_duration',
110+
value: duration,
111+
custom_labels: { job_class: payload.job_class }
112+
)
113+
rescue StandardError => e
114+
::Bigcommerce::Prometheus.logger&.warn(
115+
"[bc-prometheus-ruby] resque_job perform_duration push failed: #{e.message}"
116+
)
117+
end
118+
119+
private
120+
121+
def install_hooks
122+
return if @hooks_installed
123+
124+
::Resque::Worker.prepend(WorkerInstrumentation)
125+
@hooks_installed = true
126+
end
127+
end
128+
129+
##
130+
# Prepended onto Resque::Worker to capture both queue latency
131+
# (before super) and perform duration (after super) for every
132+
# job that goes through perform_with_fork. JobPayload is built
133+
# once per job and shared between the two recordings.
134+
#
135+
module WorkerInstrumentation
136+
def perform_with_fork(job, &block)
137+
started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
138+
payload = JobPayload.new(job)
139+
JobMetrics.record_queue_latency(payload)
140+
super
141+
ensure
142+
JobMetrics.record_perform_duration(
143+
payload,
144+
Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at
145+
)
146+
end
147+
end
148+
end
149+
end
150+
end
151+
end
152+
end
Lines changed: 94 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,94 @@
1+
# frozen_string_literal: true
2+
3+
# Copyright (c) 2019-present, BigCommerce Pty. Ltd. All rights reserved
4+
#
5+
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
6+
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation the
7+
# rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit
8+
# persons to whom the Software is furnished to do so, subject to the following conditions:
9+
#
10+
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the
11+
# Software.
12+
#
13+
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
14+
# WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
15+
# COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
16+
# OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
17+
#
18+
module Bigcommerce
19+
module Prometheus
20+
module Integrations
21+
class Resque
22+
##
23+
# Extracts the fields per-job metrics need from a Resque::Job's
24+
# payload. Eagerly parses in #initialize and exposes plain
25+
# attr_readers — does not hold a reference to the Resque::Job after
26+
# construction.
27+
#
28+
# See JobMetrics's class-level docs for the ActiveJob-shaped payload
29+
# contract this class consumes.
30+
#
31+
class JobPayload
32+
# @return [String] the user's actual job class name. For
33+
# ActiveJob-shaped payloads this is `args[0]['job_class']`.
34+
# Falls back to the top-level Resque `class` field for vanilla
35+
# Resque payloads. `'unknown'` if neither is available, or if
36+
# the payload is malformed (nil, non-Hash, etc.).
37+
attr_reader :job_class
38+
39+
# @return [Time, nil] the queue-latency anchor. Prefers
40+
# `scheduled_at` over `enqueued_at` (so retries-with-backoff
41+
# don't count the intentional wait). nil for non-ActiveJob-shaped
42+
# payloads (where `args[0]` isn't a Hash), for payloads where
43+
# both timestamps are absent or unparseable, and for malformed
44+
# payloads.
45+
attr_reader :anchor_time
46+
47+
# @param [Resque::Job] resque_job
48+
def initialize(resque_job)
49+
payload = resque_job.payload || {}
50+
inner = extract_activejob_inner(payload)
51+
52+
@job_class = extract_job_class(payload, inner)
53+
@anchor_time = extract_anchor_time(inner)
54+
end
55+
56+
private
57+
58+
# Returns the inner ActiveJob-shaped payload (`args[0]` if it's a
59+
# Hash), or nil for any payload shape that isn't ActiveJob-wrapped
60+
# (vanilla Resque jobs with primitive args, non-Hash payloads,
61+
# `args` not being an Array, etc.).
62+
def extract_activejob_inner(payload)
63+
return nil unless payload.is_a?(Hash)
64+
65+
args = payload['args']
66+
first = args.is_a?(Array) ? args.first : nil
67+
first.is_a?(Hash) ? first : nil
68+
end
69+
70+
def extract_job_class(payload, inner)
71+
inner_class = inner.is_a?(Hash) ? inner['job_class'] : nil
72+
outer_class = payload.is_a?(Hash) ? payload['class'] : nil
73+
inner_class || outer_class || 'unknown'
74+
end
75+
76+
def extract_anchor_time(inner)
77+
return nil unless inner.is_a?(Hash)
78+
79+
parse_time(inner['scheduled_at']) || parse_time(inner['enqueued_at'])
80+
end
81+
82+
def parse_time(value)
83+
return value if value.is_a?(Time)
84+
return nil if value.nil? || value.to_s.empty?
85+
86+
Time.iso8601(value.to_s)
87+
rescue ArgumentError
88+
nil
89+
end
90+
end
91+
end
92+
end
93+
end
94+
end

lib/bigcommerce/prometheus/type_collectors/resque.rb

Lines changed: 5 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -19,12 +19,14 @@ module Bigcommerce
1919
module Prometheus
2020
module TypeCollectors
2121
##
22-
# Collect resque data from collectors and parse them into metrics
22+
# Aggregates the periodic worker/queue state pushed from
23+
# `Bigcommerce::Prometheus::Collectors::Resque#collect`. Per-job
24+
# histograms (`queue_latency`, `perform_duration`) live in
25+
# `TypeCollectors::ResqueJob` so the upstream router can dispatch each
26+
# envelope type to its own collector.
2327
#
2428
class Resque < Bigcommerce::Prometheus::TypeCollectors::Base
2529
##
26-
# Initialize the collector
27-
#
2830
# @return [Hash]
2931
#
3032
def build_metrics
@@ -38,9 +40,6 @@ def build_metrics
3840
}
3941
end
4042

41-
##
42-
# Collect resque metrics from input data
43-
#
4443
def collect_metrics(data:, labels: {})
4544
metric(:workers_total).observe(data['workers_total'], labels)
4645
metric(:jobs_failed_total).observe(data['jobs_failed_total'], labels)

0 commit comments

Comments
 (0)