Skip to content

Commit b6c0708

Browse files
committed
LiveQueriesCollector bug fix | P95 sorting | MongoPoolCollector listener cleanup | shared utils | container memory fix
1 parent 99fb526 commit b6c0708

13 files changed

Lines changed: 137 additions & 81 deletions

.versions

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -46,7 +46,7 @@ reactive-var@1.0.13
4646
reload@1.3.2
4747
retry@1.1.1
4848
routepolicy@1.1.2
49-
skysignal:agent@1.0.17
49+
skysignal:agent@1.0.18
5050
socket-stream-client@0.6.1
5151
tracker@1.3.4
5252
typescript@5.9.3

README.md

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -854,6 +854,16 @@ Main agent singleton instance.
854854

855855
## Changelog
856856

857+
### v1.0.19 (Bug Fixes & Code Quality)
858+
859+
- **Fix container memory usage reporting >100%** - `SystemMetricsCollector` previously calculated container memory as `processMemory.rss / constrainedMemory * 100`, which could exceed 100% because RSS (Resident Set Size) includes shared library pages, memory-mapped files, and kernel page cache that don't count against the container's cgroup memory limit. Now uses `process.availableMemory()` (Node 19+), which reads directly from the cgroup memory controller and accounts for reclaimable buffers, to compute usage as `(constrainedMemory - availableMemory) / constrainedMemory * 100`. Falls back to `heapUsed / constrainedMemory * 100` on older Node versions. This aligns reported memory with what container orchestrators (e.g., Meteor Galaxy) actually report.
860+
- **Fix observer stop logging crash** - `LiveQueriesCollector._wrapHandle()` used `this._log()` inside a regular `function()` callback where `this` referred to the handle object, not the collector instance. Changed to `self._log()` to use the captured closure variable. Previously, calling `handle.stop()` would throw `TypeError: this._log is not a function`, silently preventing observer lifecycle metrics from being recorded.
861+
- **Fix P95 percentile off-by-one** - `MongoPoolCollector`, `DnsTimingCollector`, and `DiagnosticsChannelCollector` all used `Math.floor(count * 0.95)` to index into a sorted array, which overshoots the true 95th percentile by one position (e.g., for 100 items, returns the 96th element instead of the 95th). Changed to `Math.ceil(count * p) - 1` across all three collectors. Extracted to shared `percentile()` utility in `lib/utils/percentile.js`.
862+
- **Fix MongoPoolCollector.stop() killing other event listeners** - `stop()` called `client.removeAllListeners(eventName)` for each pool event, which removed ALL listeners for that event — including those registered by the application or other collectors. Now stores individual handler references in `start()` and calls `client.removeListener(eventName, handler)` in `stop()` to remove only the collector's own handlers.
863+
- **Fix circular buffer read after wrap-around** - `MongoPoolCollector._calculateCheckoutMetrics()` used `checkoutSamples.slice(0, count)` to extract samples, which returns incorrect data after the circular buffer wraps (old data mixed with new). Now correctly reads from the current write index forward using modular arithmetic to reconstruct the proper time-ordered sequence.
864+
- **Shared percentile utility** - Extracted percentile calculation to `lib/utils/percentile.js` with `percentile(sorted, p)` and `percentiles(values)` functions, replacing duplicated math in `MongoPoolCollector`, `DnsTimingCollector`, and `DiagnosticsChannelCollector`.
865+
- **Shared buffer eviction utility** - Extracted array trimming to `lib/utils/buffer.js` with `trimToMaxSize(array, maxSize)`, replacing duplicated `splice(0, length - max)` patterns in `DnsTimingCollector`, `DiagnosticsChannelCollector`, and `MongoPoolCollector._recordPoolWaitTime`.
866+
857867
### v1.0.18 (Container-Aware Metrics)
858868

859869
- **Container-aware memory usage** - `SystemMetricsCollector` now uses `process.constrainedMemory()` (Node 19+) to detect cgroup memory limits in containerized deployments. When a cgroup limit is present, memory usage is calculated as `processMemory.rss / constrainedMemory * 100` instead of `(os.totalmem() - os.freemem()) / os.totalmem() * 100`. The OS-level calculation counts kernel buffer/cache as "used", which dramatically overstates actual memory pressure in containers (e.g., reporting 89% when real RSS usage is 27%).

lib/collectors/DiagnosticsChannelCollector.js

Lines changed: 8 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,9 @@
1212
* used by OpenTelemetry, Undici, and Node's native fetch.
1313
*/
1414

15+
import { percentile } from "../utils/percentile";
16+
import { trimToMaxSize } from "../utils/buffer";
17+
1518
let diagnostics_channel;
1619
try {
1720
diagnostics_channel = require("diagnostics_channel");
@@ -223,10 +226,8 @@ export default class DiagnosticsChannelCollector {
223226
timestamp: Date.now(),
224227
});
225228

226-
// Drop oldest entries in-place instead of allocating a new array
227-
if (this._requests.length > this._maxRequests) {
228-
this._requests.splice(0, this._requests.length - this._maxRequests);
229-
}
229+
// Keep buffer bounded
230+
trimToMaxSize(this._requests, this._maxRequests);
230231
}
231232

232233
_collect() {
@@ -265,9 +266,9 @@ export default class DiagnosticsChannelCollector {
265266

266267
// Percentiles
267268
times.sort((a, b) => a - b);
268-
const p50 = times[Math.floor(times.length * 0.5)] || 0;
269-
const p95 = times[Math.floor(times.length * 0.95)] || 0;
270-
const p99 = times[Math.floor(times.length * 0.99)] || 0;
269+
const p50 = percentile(times, 0.5);
270+
const p95 = percentile(times, 0.95);
271+
const p99 = percentile(times, 0.99);
271272

272273
// Build top external hosts
273274
const topHosts = Object.entries(byHost)

lib/collectors/DnsTimingCollector.js

Lines changed: 7 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,7 @@
11
import dns from "dns";
22
import { performance } from "perf_hooks";
3+
import { percentile } from "../utils/percentile";
4+
import { trimToMaxSize } from "../utils/buffer";
35

46
/**
57
* DnsTimingCollector
@@ -129,10 +131,8 @@ export default class DnsTimingCollector {
129131
timestamp: Date.now(),
130132
});
131133

132-
// Keep buffer bounded — drop oldest entries in-place instead of allocating a new array
133-
if (this._samples.length > this._maxSamples) {
134-
this._samples.splice(0, this._samples.length - this._maxSamples);
135-
}
134+
// Keep buffer bounded
135+
trimToMaxSize(this._samples, this._maxSamples);
136136
}
137137

138138
_collect() {
@@ -164,9 +164,9 @@ export default class DnsTimingCollector {
164164

165165
// Calculate percentiles
166166
durations.sort((a, b) => a - b);
167-
const p50 = durations[Math.floor(durations.length * 0.5)] || 0;
168-
const p95 = durations[Math.floor(durations.length * 0.95)] || 0;
169-
const p99 = durations[Math.floor(durations.length * 0.99)] || 0;
167+
const p50 = percentile(durations, 0.5);
168+
const p95 = percentile(durations, 0.95);
169+
const p99 = percentile(durations, 0.99);
170170

171171
// Build top hostnames by count
172172
const topHostnames = Object.entries(byHostname)

lib/collectors/LiveQueriesCollector.js

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -340,7 +340,7 @@ export default class LiveQueriesCollector {
340340
observer.status = "stopped";
341341
observer.stoppedAt = Date.now();
342342
observer.observerLifespan = Math.round((observer.stoppedAt - observer.createdAt) / 1000);
343-
this._log(`Observer stopped: ${observer.collectionName} - ${observerId}`);
343+
self._log(`Observer stopped: ${observer.collectionName} - ${observerId}`);
344344
}
345345
}
346346
return originalStop.call(this);

lib/collectors/MongoPoolCollector.js

Lines changed: 36 additions & 49 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,7 @@
11
import { Meteor } from 'meteor/meteor';
22
import { MongoInternals } from 'meteor/mongo';
3+
import { percentile } from '../utils/percentile';
4+
import { trimToMaxSize } from '../utils/buffer';
35

46
/**
57
* MongoPoolCollector
@@ -311,39 +313,21 @@ export default class MongoPoolCollector {
311313
// (connectionPoolCreated event already fired before we started)
312314
this._capturePoolConfigFromClient();
313315

314-
// Track pool creation and configuration (for reconnections)
315-
this.client.on('connectionPoolCreated', (event) => {
316-
this._onPoolCreated(event);
317-
});
318-
319-
// Track connection lifecycle
320-
this.client.on('connectionCreated', (event) => {
321-
this._onConnectionCreated(event);
322-
});
323-
324-
this.client.on('connectionClosed', (event) => {
325-
this._onConnectionClosed(event);
326-
});
327-
328-
// Track when connection is checked in (returned to pool)
329-
this.client.on('connectionCheckedIn', (event) => {
330-
this._onConnectionCheckedIn(event);
331-
});
332-
333-
// Track when connection checkout starts (wait begins)
334-
this.client.on('connectionCheckOutStarted', (event) => {
335-
this._onCheckoutStarted(event);
336-
});
337-
338-
// Track when connection is successfully checked out (wait ends)
339-
this.client.on('connectionCheckedOut', (event) => {
340-
this._onConnectionCheckedOut(event);
341-
});
316+
// Store bound handler references so stop() can remove them individually
317+
// (removeAllListeners would nuke handlers from other collectors/app code)
318+
this._handlers = {
319+
connectionPoolCreated: (event) => this._onPoolCreated(event),
320+
connectionCreated: (event) => this._onConnectionCreated(event),
321+
connectionClosed: (event) => this._onConnectionClosed(event),
322+
connectionCheckedIn: (event) => this._onConnectionCheckedIn(event),
323+
connectionCheckOutStarted: (event) => this._onCheckoutStarted(event),
324+
connectionCheckedOut: (event) => this._onConnectionCheckedOut(event),
325+
connectionCheckOutFailed: (event) => this._onCheckoutFailed(event),
326+
};
342327

343-
// Track checkout failures (timeouts, errors)
344-
this.client.on('connectionCheckOutFailed', (event) => {
345-
this._onCheckoutFailed(event);
346-
});
328+
for (const [eventName, handler] of Object.entries(this._handlers)) {
329+
this.client.on(eventName, handler);
330+
}
347331

348332
// Start periodic snapshot collection
349333
this.snapshotTimer = setInterval(() => {
@@ -374,15 +358,12 @@ export default class MongoPoolCollector {
374358
this.snapshotTimer = null;
375359
}
376360

377-
// Remove all event listeners
378-
if (this.client) {
379-
this.client.removeAllListeners('connectionPoolCreated');
380-
this.client.removeAllListeners('connectionCreated');
381-
this.client.removeAllListeners('connectionClosed');
382-
this.client.removeAllListeners('connectionCheckedIn');
383-
this.client.removeAllListeners('connectionCheckOutStarted');
384-
this.client.removeAllListeners('connectionCheckedOut');
385-
this.client.removeAllListeners('connectionCheckOutFailed');
361+
// Remove only our own event listeners (not all listeners for the event)
362+
if (this.client && this._handlers) {
363+
for (const [eventName, handler] of Object.entries(this._handlers)) {
364+
this.client.removeListener(eventName, handler);
365+
}
366+
this._handlers = null;
386367
}
387368

388369
// Clear state
@@ -631,10 +612,8 @@ export default class MongoPoolCollector {
631612
...metadata
632613
});
633614

634-
// Keep only recent samples — batch eviction instead of per-item shift() which is O(n)
635-
if (global._skySignalPoolWaitTimes.length > 1100) {
636-
global._skySignalPoolWaitTimes.splice(0, global._skySignalPoolWaitTimes.length - 1000);
637-
}
615+
// Keep only recent samples
616+
trimToMaxSize(global._skySignalPoolWaitTimes, 1000);
638617

639618
} catch (error) {
640619
this._warn(' Error recording pool wait time:', error);
@@ -918,8 +897,17 @@ export default class MongoPoolCollector {
918897
};
919898
}
920899

921-
// Extract valid samples from circular buffer
922-
const samples = this.poolState.checkoutSamples.slice(0, count);
900+
// Extract valid samples from circular buffer in correct order
901+
const buffer = this.poolState.checkoutSamples;
902+
const idx = this.poolState.checkoutSampleIndex;
903+
let samples;
904+
if (count < 1000) {
905+
// Buffer hasn't wrapped yet — first `count` elements are valid
906+
samples = buffer.slice(0, count);
907+
} else {
908+
// Buffer has wrapped — read from current index forward (oldest first)
909+
samples = buffer.slice(idx).concat(buffer.slice(0, idx));
910+
}
923911

924912
// Calculate average
925913
let sum = 0;
@@ -933,8 +921,7 @@ export default class MongoPoolCollector {
933921

934922
// Calculate P95 (95th percentile) - only sort when needed
935923
const sorted = samples.slice().sort((a, b) => a - b);
936-
const p95Index = Math.floor(count * 0.95);
937-
const p95CheckoutTime = sorted[p95Index] || 0;
924+
const p95CheckoutTime = percentile(sorted, 0.95);
938925

939926
return {
940927
avgCheckoutTime,

lib/collectors/SystemMetricsCollector.js

Lines changed: 18 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,7 @@ import v8 from "v8";
1010
const execAsync = promisify(exec);
1111

1212
// Agent version - must be updated alongside package.js on each release
13-
const AGENT_VERSION = '1.0.18';
13+
const AGENT_VERSION = '1.0.19';
1414

1515
/**
1616
* SystemMetricsCollector
@@ -181,12 +181,25 @@ export default class SystemMetricsCollector {
181181
// Constrained memory (cgroup limit, Node 19+)
182182
const constrainedMemory = this._getConstrainedMemory();
183183

184-
// Container-aware memory: os.freemem() excludes reclaimable buffer/cache,
185-
// overstating memory pressure. When a cgroup limit is available, use
186-
// RSS / cgroup-limit which matches what container orchestrators report.
184+
// Container-aware memory: when a cgroup limit is available, compute
185+
// actual usage via process.availableMemory() (Node 19+) which accounts
186+
// for reclaimable cache/buffers. Falls back to heapUsed / limit.
187+
// NOTE: RSS is NOT suitable here — it includes shared library pages and
188+
// memory-mapped files, often exceeding the cgroup limit (producing >100%).
187189
let effectiveMemoryUsage = memoryUsage;
188190
if (constrainedMemory && constrainedMemory > 0) {
189-
effectiveMemoryUsage = (processMemory.rss / constrainedMemory) * 100;
191+
let availableMemory = null;
192+
try {
193+
if (typeof process.availableMemory === "function") {
194+
availableMemory = process.availableMemory();
195+
}
196+
} catch (e) { /* ignore */ }
197+
198+
if (availableMemory != null && availableMemory >= 0) {
199+
effectiveMemoryUsage = ((constrainedMemory - availableMemory) / constrainedMemory) * 100;
200+
} else {
201+
effectiveMemoryUsage = (processMemory.heapUsed / constrainedMemory) * 100;
202+
}
190203
}
191204

192205
// Process-level CPU: measures actual Node.js CPU consumption rather than

lib/utils/buffer.js

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,13 @@
1+
/**
2+
* Trim an array to maxSize by dropping the oldest (first) entries in-place.
3+
* Only triggers when the array exceeds maxSize, and drops the excess plus a
4+
* small batch margin (10%) to avoid triggering on every single push.
5+
*
6+
* @param {Array} array - The array to trim (mutated in place)
7+
* @param {number} maxSize - Maximum allowed length
8+
*/
9+
export function trimToMaxSize(array, maxSize) {
10+
if (array.length > maxSize) {
11+
array.splice(0, array.length - maxSize);
12+
}
13+
}

lib/utils/percentile.js

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,32 @@
1+
/**
2+
* Calculate a percentile value from a sorted array.
3+
* Uses the ceiling method: index = ceil(count * p) - 1
4+
*
5+
* @param {number[]} sorted - Pre-sorted array of numeric values (ascending)
6+
* @param {number} p - Percentile as a fraction (e.g., 0.95 for P95)
7+
* @returns {number} The percentile value, or 0 if array is empty
8+
*/
9+
export function percentile(sorted, p) {
10+
if (!sorted || sorted.length === 0) return 0;
11+
const index = Math.ceil(sorted.length * p) - 1;
12+
return sorted[Math.max(0, index)] || 0;
13+
}
14+
15+
/**
16+
* Calculate common percentiles (P50, P95, P99) from an unsorted array.
17+
* Sorts a copy of the input array.
18+
*
19+
* @param {number[]} values - Array of numeric values (will not be mutated)
20+
* @returns {{ p50: number, p95: number, p99: number }}
21+
*/
22+
export function percentiles(values) {
23+
if (!values || values.length === 0) {
24+
return { p50: 0, p95: 0, p99: 0 };
25+
}
26+
const sorted = values.slice().sort((a, b) => a - b);
27+
return {
28+
p50: percentile(sorted, 0.5),
29+
p95: percentile(sorted, 0.95),
30+
p99: percentile(sorted, 0.99),
31+
};
32+
}

package.js

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
Package.describe({
22
name: "skysignal:agent",
3-
version: "1.0.18",
3+
version: "1.0.19",
44
summary:
55
"SkySignal APM agent for Meteor applications - monitors performance, errors, and system metrics",
66
git: "https://github.com/skysignalapm/agent.git",

0 commit comments

Comments
 (0)