
Commit 199fd38

sirugh, claude, and keoko authored
Working Branch for PLP fixes (#271)
* adds local_fs capability, to debug without aio. bumps timeouts and memory

  Signed-off-by: Stephen Rugh <rugh@adobe.com>

* increasing timeout to 3 hours and commenting out triggers for now

  Signed-off-by: Stephen Rugh <rugh@adobe.com>

* add claude tool to summarize and check errors in activations

* batch getUrlKey GraphQL queries in mark-up-clean-up to stay within catalog 100-product limit

  Passing all published SKUs in a single query fails when the catalog exceeds 100 products (observed at 23,398 SKUs: "Product count exceeds the maximum allowed (100)"). With no query result, every published product appears redundant and risks mass unpublishing.

  Fix: split SKUs into batches of 50 (reusing createBatches/BATCH_SIZE) and fan out requests in parallel, then merge results before the redundancy check.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* add script to get the state file. add note to runbook about log availability. add watchdog exit script to exit early in case of recurring failures at commerce api (504, etc).

* Updates comments on watchdog and ensures we don't exit while publishing batches.

* Apply suggestion from @sirugh

* Apply suggestions from code review

  Co-authored-by: Natxo Cabré <natxo.cabre@gmail.com>

* missed loglevel env read.

* extend get-state-csv to download both check-product-changes and render-all-categories; output to local-data/

  - adds render-all-categories prefix alongside check-product-changes
  - saves files to local-data/{prefix}/ (project-root-anchored via __dirname)
  - uses path.basename instead of replace to derive local filenames
  - anchors dotenv to project root so script works from any cwd

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: use CATALOG_BATCH_SIZE=100 for GetUrlKeyQuery batching in mark-up-clean-up

  Catalog Service GraphQL rejects queries with >100 SKUs ("Product count exceeds the maximum allowed (100)"). The batching added in 4625e42 originally used BATCH_SIZE=50, which was safe, but 22f211d raised BATCH_SIZE to 600 for the AEM Admin rate limit, inadvertently breaking the catalog query batching.

  Fix: add CATALOG_BATCH_SIZE=100 constant, make createBatches accept an optional size param (defaulting to BATCH_SIZE), and pass CATALOG_BATCH_SIZE explicitly at the GetUrlKeyQuery call site.

  The AEM unpublish batches keep BATCH_SIZE=600. Limit confirmed against live API: 100 SKUs succeed, 101 fails.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* add comment explaining pLimit reduction from 50 to 20 in enrichProductWithRenderedHash

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Move check-errors and add readme for it

* fix tests

---------

Signed-off-by: Stephen Rugh <rugh@adobe.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Natxo Cabré <natxo.cabre@gmail.com>
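The `createBatches` change described in the commit message can be illustrated with a minimal sketch. The real helper lives in `actions/utils` and is not shown on this page, so the body below is an assumption; only the names `createBatches`, `BATCH_SIZE`, and `CATALOG_BATCH_SIZE` and the values 600 and 100 come from the message.

```javascript
// Hypothetical values mirroring the constants named in the commit message.
const BATCH_SIZE = 600; // AEM Admin batch size
const CATALOG_BATCH_SIZE = 100; // Catalog Service per-query SKU limit

// Split `items` into consecutive slices of at most `size` elements.
// `size` defaults to BATCH_SIZE so existing callers keep their behavior.
function createBatches(items, size = BATCH_SIZE) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// 23,398 placeholder SKUs, the catalog size quoted in the commit message.
const skus = Array.from({ length: 23398 }, (_, i) => `sku-${i}`);

const adminBatches = createBatches(skus); // default size (600)
const catalogBatches = createBatches(skus, CATALOG_BATCH_SIZE); // explicit size (100)

console.log(adminBatches.length, catalogBatches.length); // prints "39 234"
```

Passing the size explicitly at the catalog call site decouples the two limits, so a later change to `BATCH_SIZE` for the AEM Admin API would not silently push catalog queries past 100 SKUs again.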
1 parent b9714b0 commit 199fd38

24 files changed

Lines changed: 412 additions & 74 deletions
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Run `bash tools/check-errors.sh $ARGUMENTS` and evaluate the results, summarizing the error types, affected actions, and likely root causes.

.claude/settings.local.json

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+{
+  "permissions": {
+    "allow": [
+      "WebFetch(domain:www.aem.live)"
+    ]
+  }
+}

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -47,3 +47,6 @@ logs
 
 # AI
 .cursor/
+
+# LOCAL_FS=true files
+local-data/

CLAUDE.md

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+# Project Rules
+
+## App Builder Deployments
+
+Use `npm run deploy` instead of `aio app deploy` for all App Builder deployments.

actions/check-product-changes/index.js

Lines changed: 11 additions & 3 deletions
@@ -10,12 +10,17 @@ governing permissions and limitations under the License.
 */
 
 const { Core, State, Files } = require('@adobe/aio-sdk');
+const { localFilesLib } = require('../lib/localFilesLib');
 const { poll } = require('./poller');
 const { StateManager } = require('../lib/state');
 const { ObservabilityClient } = require('../lib/observability');
 const { getRuntimeConfig } = require('../lib/runtimeConfig');
 const { handleActionError } = require('../lib/errorHandler');
 
+// Must match timeout in app.config.yaml. The mutex TTL is derived from this so the
+// lock auto-expires if the runtime kills the process before the finally block runs.
+const ACTION_TIMEOUT_MS = 10800000; // 3 hours
+
 /**
  * Entry point for the "Product changes check" action.
  * @param {Object} params
@@ -38,8 +43,11 @@ async function main(params) {
   });
 
   // Init SDK libs and state manager
-  const stateLib = await State.init(params.libInit || {});
-  const filesLib = await Files.init(params.libInit || {});
+  const isLocal = !!params.LOCAL_FS;
+  const stateLib = isLocal
+    ? { get: async () => null, put: async () => {}, delete: async () => {} }
+    : await State.init(params.libInit || {});
+  const filesLib = isLocal ? localFilesLib : await Files.init(params.libInit || {});
   const stateMgr = new StateManager(stateLib, { logger });
 
   let activationResult;
@@ -61,7 +69,7 @@ async function main(params) {
 
   try {
     // Mark job as running with TTL to avoid permanent lock on unexpected failures
-    await stateMgr.put('running', 'true', { ttl: 3600 });
+    await stateMgr.put('running', 'true', { ttl: ACTION_TIMEOUT_MS / 1000 });
 
     // Core logic
     activationResult = await poll(cfg, { stateLib: stateMgr, filesLib }, logger);
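The reason the `running` lock's TTL is tied to the action timeout can be shown with a small simulation. This is a sketch under stated assumptions: `createMemoryState` and `lockHeldAfter` are hypothetical helpers with an injectable clock, not part of `@adobe/aio-sdk` or of this repository; only `ACTION_TIMEOUT_MS`, the `running` key, and the old TTL of 3600 seconds come from the diff.

```javascript
const ACTION_TIMEOUT_MS = 10800000; // 3 hours, the value used in the diff

// Hypothetical in-memory store with per-key TTL and an injectable clock.
function createMemoryState(now = () => Date.now()) {
  const entries = new Map();
  return {
    async put(key, value, { ttl } = {}) {
      const expiresAt = ttl ? now() + ttl * 1000 : Infinity;
      entries.set(key, { value, expiresAt });
    },
    async get(key) {
      const entry = entries.get(key);
      if (!entry || entry.expiresAt <= now()) return null;
      return entry.value;
    },
    async delete(key) {
      entries.delete(key);
    },
  };
}

// Report whether a lock written at t=0 with `ttlSeconds` is still held after `elapsedMs`.
async function lockHeldAfter(ttlSeconds, elapsedMs) {
  let clock = 0;
  const state = createMemoryState(() => clock);
  await state.put('running', 'true', { ttl: ttlSeconds });
  clock = elapsedMs;
  return (await state.get('running')) !== null;
}

const TWO_HOURS_MS = 2 * 60 * 60 * 1000;

async function demo() {
  return {
    oldTtl: await lockHeldAfter(3600, TWO_HOURS_MS),
    newTtl: await lockHeldAfter(ACTION_TIMEOUT_MS / 1000, TWO_HOURS_MS),
  };
}

demo().then((result) => console.log(result)); // { oldTtl: false, newTtl: true }
```

With the old one-hour TTL, a run that is still working two hours in no longer holds the lock, so a second scheduled run could start alongside it. Deriving the TTL from the three-hour timeout keeps the lock for the longest time the action can legitimately run.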

actions/check-product-changes/poller.js

Lines changed: 48 additions & 8 deletions
@@ -35,6 +35,10 @@ const { JobFailedError, ERROR_CODES } = require('../lib/errorHandler');
 const crypto = require('crypto');
 const BATCH_SIZE = 50;
 const DATA_KEY = 'skus';
+// Exit early rather than burning the full 3-hour timeout when processing stalls.
+// Activity is recorded after each render completes and after each AEM batch completes.
+const WATCHDOG_TIMEOUT_MS = 5 * 60 * 1000;
+const WATCHDOG_CHECK_INTERVAL_MS = 30 * 1000;
 
 function checkParams(params) {
   validateRequiredParams(params, [
@@ -95,8 +99,12 @@ async function enrichProductWithRenderedHash(product, context) {
   const { logger } = context;
   const { sku, urlKey, path } = product;
 
+  // Reduced from 50 → 20: each concurrent render makes a full-detail GraphQL
+  // request to the Commerce API (ProductQuery) plus HTML compilation. At 50
+  // concurrent renders the Commerce API returned 504s, stalling the action and
+  // triggering the watchdog.
   if (!renderLimit$) {
-    renderLimit$ = import('p-limit').then(({ default: pLimit }) => pLimit(50));
+    renderLimit$ = import('p-limit').then(({ default: pLimit }) => pLimit(20));
   }
 
   return (await renderLimit$)(async () => {
@@ -122,6 +130,7 @@ async function enrichProductWithRenderedHash(product, context) {
       logger.error(`Error generating product HTML for SKU ${sku}:`, e);
     }
 
+    context.recordActivity?.();
     return product;
   });
 }
@@ -284,11 +293,31 @@ async function poll(params, aioLibs, logger) {
 
   let stateText = 'completed';
 
+  // Watchdog: fires if no progress (render or AEM batch completion) for WATCHDOG_TIMEOUT_MS.
+  // recordActivity is called after each render (success or failure) and after each AEM batch completes.
+  let lastActivityAt = Date.now();
+  let watchdogIntervalId;
+  const watchdogPromise = new Promise((_, reject) => {
+    watchdogIntervalId = setInterval(() => {
+      const idleMs = Date.now() - lastActivityAt;
+      if (idleMs > WATCHDOG_TIMEOUT_MS) {
+        clearInterval(watchdogIntervalId);
+        const err = Object.assign(
+          new Error(`Watchdog: no processing activity for ${Math.floor(idleMs / 60000)} minutes — aborting`),
+          { isWatchdog: true },
+        );
+        logger.error(err.message);
+        reject(err);
+      }
+    }, WATCHDOG_CHECK_INTERVAL_MS);
+  });
+  sharedContext.recordActivity = () => { lastActivityAt = Date.now(); };
+
   try {
     // start processing preview and publish queues
     await adminApi.startProcessing();
 
-    const results = await Promise.all(
+    const localeProcessing = Promise.all(
       locales.map(async (locale) => {
         const timings = new Timings();
         const context = { ...sharedContext, startTime: new Date() };
@@ -343,13 +372,14 @@ async function poll(params, aioLibs, logger) {
             const records = products.map(({ sku, path, renderedAt }) => ({ sku, path, renderedAt }));
             return adminApi
               .previewAndPublish(records, locale, batchNumber + 1)
-              .then((publishedBatch) =>
-                processPublishedBatch(publishedBatch, state, counts, products, aioLibs, {
+              .then((publishedBatch) => {
+                context.recordActivity?.();
+                return processPublishedBatch(publishedBatch, state, counts, products, aioLibs, {
                   dataKey: DATA_KEY,
                   keyField: 'sku',
                   filePrefix: FILE_PREFIX,
-                }),
-              )
+                });
+              })
               .catch((error) => {
                 // Handle batch errors gracefully - don't fail the entire job
                 if (error.code === ERROR_CODES.BATCH_ERROR) {
@@ -386,6 +416,8 @@ async function poll(params, aioLibs, logger) {
       }),
     );
 
+    const results = await Promise.race([localeProcessing, watchdogPromise]);
+
     await adminApi.stopProcessing();
 
     // aggregate timings
@@ -406,8 +438,14 @@ async function poll(params, aioLibs, logger) {
       code: e.code,
       stack: e.stack,
     });
-    // wait for queues to finish, even in error case
-    await adminApi.stopProcessing();
+    if (e.isWatchdog) {
+      // Don't drain queues — processing is stalled. Abort immediately so the mutex is
+      // cleared promptly and the next scheduled run can start.
+      adminApi.abortProcessing();
+    } else {
+      // wait for queues to finish, even in error case
+      await adminApi.stopProcessing();
+    }
     stateText = 'failure';
 
     // If it's a JobFailedError, re-throw it
@@ -422,6 +460,8 @@ async function poll(params, aioLibs, logger) {
       e.statusCode || 500,
       { originalError: e.message },
     );
+  } finally {
+    clearInterval(watchdogIntervalId);
   }
 
   // get memory usage
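The watchdog pattern in this file (an interval that rejects a promise once no activity has been recorded for a while, raced against the real work) can be isolated into a small sketch. `createWatchdog`, `runWithWatchdog`, `steadyWork`, and `stalledWork` are hypothetical names introduced for illustration, and the timings are shortened to milliseconds; the diff itself uses 5 minutes and 30 seconds.

```javascript
// Build a watchdog whose promise rejects once no activity has been
// recorded for longer than `timeoutMs`, checked every `checkIntervalMs`.
function createWatchdog({ timeoutMs, checkIntervalMs }) {
  let lastActivityAt = Date.now();
  let intervalId;
  const promise = new Promise((_, reject) => {
    intervalId = setInterval(() => {
      const idleMs = Date.now() - lastActivityAt;
      if (idleMs > timeoutMs) {
        clearInterval(intervalId);
        reject(Object.assign(new Error(`Watchdog: idle for ${idleMs} ms`), { isWatchdog: true }));
      }
    }, checkIntervalMs);
  });
  return {
    promise,
    recordActivity: () => { lastActivityAt = Date.now(); },
    stop: () => clearInterval(intervalId),
  };
}

// Race `work` against the watchdog and always clear the interval afterwards,
// mirroring the `finally { clearInterval(...) }` added in the diff.
async function runWithWatchdog(work, options) {
  const watchdog = createWatchdog(options);
  try {
    return await Promise.race([work(watchdog.recordActivity), watchdog.promise]);
  } finally {
    watchdog.stop();
  }
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Work that reports progress regularly and then finishes.
async function steadyWork(recordActivity) {
  for (let i = 0; i < 5; i += 1) {
    await sleep(20);
    recordActivity();
  }
  return 'done';
}

// Work that never makes progress and never settles.
const stalledWork = () => new Promise(() => {});

const options = { timeoutMs: 200, checkIntervalMs: 20 };
runWithWatchdog(steadyWork, options).then((value) => console.log(value));
runWithWatchdog(stalledWork, options).catch((err) => console.log(err.isWatchdog));
```

Note that `Promise.race` only stops waiting; it does not cancel the losing promise. That is presumably why the error path in the diff calls `adminApi.abortProcessing()` for watchdog errors instead of awaiting `stopProcessing()`, which would wait on queues that are not draining.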

actions/fetch-all-products/index.js

Lines changed: 2 additions & 1 deletion
@@ -13,6 +13,7 @@ governing permissions and limitations under the License.
 */
 
 const { Core, Files } = require('@adobe/aio-sdk');
+const { localFilesLib } = require('../lib/localFilesLib');
 const { getConfig, getSiteType, requestSaaS, SITE_TYPES, FILE_PREFIX } = require('../utils');
 const { ProductsQuery, ProductCountQuery } = require('../queries');
 const { Timings } = require('../lib/benchmark');
@@ -224,7 +225,7 @@ async function main(params) {
     const allProducts = await getAllProducts(siteType, context, cfg.categoryFamilies);
 
     timings.sample('getAllProducts');
-    const filesLib = await Files.init(params.libInit || {});
+    const filesLib = params.LOCAL_FS ? localFilesLib : await Files.init(params.libInit || {});
     timings.sample('saveFile');
     const productsFileName = `${FILE_PREFIX}/${stateFilePrefix}-products.json`;
     logger.debug(`Saving products to ${productsFileName}`);

actions/lib/aem.js

Lines changed: 9 additions & 4 deletions
@@ -31,10 +31,10 @@ class AdminAPI {
   inflight = [];
   MAX_RETRIES = 3;
   RETRY_DELAY = 5000;
-  /** Max number of pending jobs (queues + inflight). Keep low to avoid "noisy neighbor" effect. */
-  MAX_PENDING_JOBS = 20;
-  /** Poll interval for job status (ms). Avoid polling too frequently (e.g. once every 5 seconds). */
-  JOB_STATUS_POLL_INTERVAL_MS = 5000;
+  /** Max number of pending jobs (queues + inflight). With BATCH_SIZE=600, keep small to avoid large memory backlog. */
+  MAX_PENDING_JOBS = 4;
+  /** Poll interval for job status (ms). At BATCH_SIZE=600, jobs take ~60s so 15s polling is sufficient. */
+  JOB_STATUS_POLL_INTERVAL_MS = 15000;
   /** Resolvers for backpressure: call when pending drops below MAX_PENDING_JOBS */
   _backpressureResolvers = [];
 
@@ -134,6 +134,11 @@ class AdminAPI {
     return this.stopProcessing$;
   }
 
+  abortProcessing() {
+    clearInterval(this.interval);
+    this.interval = null;
+  }
+
   trackInFlight(name, callback) {
     const executeTask = () => {
       const promise = new Promise(callback);

actions/lib/localFilesLib.js

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+const fs = require('fs');
+const path = require('path');
+
+const BASE_DIR = path.resolve(__dirname, '../../local-data');
+
+function resolvePath(filePath) {
+  return path.join(BASE_DIR, filePath);
+}
+
+const localFilesLib = {
+  async write(filePath, content) {
+    const fullPath = resolvePath(filePath);
+    fs.mkdirSync(path.dirname(fullPath), { recursive: true });
+    fs.writeFileSync(fullPath, content);
+  },
+
+  async read(filePath) {
+    return fs.readFileSync(resolvePath(filePath));
+  },
+
+  async delete(filePath) {
+    const fullPath = resolvePath(filePath);
+    if (fs.existsSync(fullPath)) fs.unlinkSync(fullPath);
+  },
+
+  async list(prefix) {
+    const fullPath = resolvePath(prefix);
+    if (!fs.existsSync(fullPath)) return [];
+    return fs.readdirSync(fullPath).map((name) => ({ name: path.join(prefix, name) }));
+  },
+};
+
+module.exports = { localFilesLib };
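A round trip through a file lib of this shape might look like the sketch below. To keep it self-contained and avoid writing into the repository, it uses a hypothetical factory `createLocalFilesLib(baseDir)` pointed at a temporary directory instead of the fixed `BASE_DIR` above; the method bodies follow the diff, while the file name `example.csv` is made up for illustration.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical variant of localFilesLib with a configurable base directory.
function createLocalFilesLib(baseDir) {
  const resolvePath = (filePath) => path.join(baseDir, filePath);
  return {
    async write(filePath, content) {
      const fullPath = resolvePath(filePath);
      fs.mkdirSync(path.dirname(fullPath), { recursive: true });
      fs.writeFileSync(fullPath, content);
    },
    async read(filePath) {
      return fs.readFileSync(resolvePath(filePath)); // returns a Buffer
    },
    async delete(filePath) {
      const fullPath = resolvePath(filePath);
      if (fs.existsSync(fullPath)) fs.unlinkSync(fullPath);
    },
    async list(prefix) {
      const fullPath = resolvePath(prefix);
      if (!fs.existsSync(fullPath)) return [];
      return fs.readdirSync(fullPath).map((name) => ({ name: path.join(prefix, name) }));
    },
  };
}

// Write, read back, list, and delete one file under a temporary directory.
async function demo() {
  const baseDir = fs.mkdtempSync(path.join(os.tmpdir(), 'local-data-'));
  const files = createLocalFilesLib(baseDir);
  const filePath = 'check-product-changes/example.csv'; // made-up file name

  await files.write(filePath, 'sku,hash\n');
  const content = (await files.read(filePath)).toString();
  const listing = await files.list('check-product-changes');
  await files.delete(filePath);
  const afterDelete = await files.list('check-product-changes');

  fs.rmSync(baseDir, { recursive: true, force: true });
  return { content, listing, afterDelete };
}

demo().then((result) => console.log(result));
```

`read` returns a `Buffer`, so callers that expect text need `.toString()`. `list` reads a single directory level only, which may differ from how the remote Files SDK treats prefixes; that is worth checking before relying on `LOCAL_FS` output for anything beyond debugging.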

actions/mark-up-clean-up/index.js

Lines changed: 8 additions & 3 deletions
@@ -11,6 +11,7 @@ governing permissions and limitations under the License.
 */
 
 const { Core, Files } = require('@adobe/aio-sdk');
+const { localFilesLib } = require('../lib/localFilesLib');
 const { ObservabilityClient } = require('../lib/observability');
 const { GetUrlKeyQuery } = require('../queries');
 const { getRuntimeConfig } = require('../lib/runtimeConfig');
@@ -19,6 +20,7 @@ const {
   requestSaaS,
   requestPublishedProductsIndex,
   createBatches,
+  CATALOG_BATCH_SIZE,
 } = require('../utils');
 const { getHtmlFilePath } = require('../renderUtils');
 
@@ -66,8 +68,11 @@ async function markUpCleanUP(context, filesLib, logger, adminApi) {
   try {
     const publishedProducts = await requestPublishedProductsIndex(context);
     const publishedSkus = publishedProducts.data.map((product) => product.sku);
-    let queryResult = await requestSaaS(GetUrlKeyQuery, 'getUrlKey', { skus: publishedSkus }, context);
-    queryResult = queryResult.data.products;
+    const skuBatches = createBatches(publishedSkus, CATALOG_BATCH_SIZE);
+    const batchResults = await Promise.all(
+      skuBatches.map((batch) => requestSaaS(GetUrlKeyQuery, 'getUrlKey', { skus: batch }, context))
+    );
+    const queryResult = batchResults.flatMap((result) => result.data.products);
 
     const redundantpublishedProducts = publishedProducts.data.filter((product) => !urlkeymatch(product, queryResult, context))
     context.counts.detected = redundantpublishedProducts.length;
@@ -117,7 +122,7 @@ async function main(params) {
     org: cfg.org,
     site: cfg.site
   });
-  const filesLib = await Files.init(params.libInit || {});
+  const filesLib = params.LOCAL_FS ? localFilesLib : await Files.init(params.libInit || {});
 
   const {
     // required
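The fan-out and merge step in `markUpCleanUP` can be exercised against a stand-in for the Catalog Service. In this sketch `fakeGetUrlKeys`, `chunk`, and `fetchUrlKeys` are hypothetical names; the fake only imitates the documented behavior (reject more than 100 SKUs, return `data.products`) and is not the real `requestSaaS`.

```javascript
const CATALOG_BATCH_SIZE = 100; // limit quoted in the commit message

// Hypothetical stand-in for the Catalog Service call: rejects oversized
// queries and otherwise returns one product per requested SKU.
async function fakeGetUrlKeys(skus) {
  if (skus.length > CATALOG_BATCH_SIZE) {
    throw new Error('Product count exceeds the maximum allowed (100)');
  }
  return { data: { products: skus.map((sku) => ({ sku, urlKey: `url-${sku}` })) } };
}

// Split `items` into consecutive slices of at most `size` elements.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Issue one request per batch in parallel, then flatten the per-batch
// product arrays into a single list in the original SKU order.
async function fetchUrlKeys(skus, request = fakeGetUrlKeys) {
  const batchResults = await Promise.all(
    chunk(skus, CATALOG_BATCH_SIZE).map((batch) => request(batch)),
  );
  return batchResults.flatMap((result) => result.data.products);
}

const skus = Array.from({ length: 250 }, (_, i) => `sku-${i}`);
fetchUrlKeys(skus).then((products) => console.log(products.length)); // prints 250
```

`Promise.all` resolves in input order and rejects as soon as any batch rejects, so in this sketch a failed batch cannot produce a partial product list that would make published products look redundant. Whether the real `requestSaaS` rejects on GraphQL errors is not visible in this diff, so that guarantee is an assumption here. All batches are also sent at once; at 23,398 SKUs that is 234 concurrent requests.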
