Commit 1afbf7c

[FIX] fs_attachment: paginate autovacuum GC loop to bound worker memory
``FsFileGC._gc_files_unsafe`` loaded the entire backlog of orphan files into a single Python list (via ``array_agg(store_fname)`` grouped by storage) and iterated ``fs.rm`` over all of them in one shot. With the Azure Blob backend (``adlfs``; the same class of issue applies to ``s3fs`` and other fsspec-based clients) and tens of thousands of queued orphans, each HEAD+DELETE pair held onto response buffers and connection-pool state inside the SDK that was only released when the worker exited.

In production (Odoo.sh, 30k+ orphans, Azure Blob backend) every autovacuum run hit Odoo's ``limit_memory_hard`` and received ``SIGKILL`` at around minute 17: 60k+ blob requests per run, zero ``DELETE`` committed, the queue never drained, and the next worker re-ran the same failing loop. 14 kills were observed in a single 24 h window.

Fix: paginate both the ``SELECT`` and the ``fs.rm`` loop per storage, in batches of ``_GC_BATCH_SIZE = 500``, with an explicit ``gc.collect()`` between batches to reclaim the SDK's buffered state. The caller ``_gc_files`` still holds the ``SHARE`` lock on ``fs_file_gc`` / ``ir_attachment`` and performs the final commit, so consistency guarantees and transactional semantics are unchanged.

Signed-off-by: TecnologiaIG <tecnologia@intensegroupgt.com>
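The batching pattern the fix applies can be sketched outside Odoo. Below is a minimal, runnable illustration with ``sqlite3`` standing in for PostgreSQL and a plain dict standing in for the fsspec filesystem; ``drain_orphans``, ``fake_fs`` and ``BATCH_SIZE`` are illustrative names, not part of the module:

```python
import gc
import sqlite3

BATCH_SIZE = 500  # mirrors _GC_BATCH_SIZE in the patch


def drain_orphans(cr: sqlite3.Connection, fake_fs: dict, code: str) -> int:
    """Remove queued orphan files for one storage, one page at a time.

    Only BATCH_SIZE file names are ever held in memory, and rows are
    dequeued only after their blob is actually removed, as in the
    patched loop.
    """
    removed = 0
    while True:
        rows = cr.execute(
            "SELECT store_fname FROM fs_file_gc"
            " WHERE fs_storage_code = ? LIMIT ?",
            (code, BATCH_SIZE),
        ).fetchall()
        if not rows:
            break
        deleted = []
        for (store_fname,) in rows:
            # store_fname looks like "<code>://<path>"
            path = store_fname.partition("://")[2]
            if fake_fs.pop(path, None) is not None:
                deleted.append(store_fname)
        if not deleted:
            break  # nothing removable left in this page; avoid spinning
        cr.executemany(
            "DELETE FROM fs_file_gc WHERE store_fname = ?",
            [(name,) for name in deleted],
        )
        removed += len(deleted)
        gc.collect()  # drop per-batch buffers before fetching the next page
    return removed
```

The key property is that peak memory is proportional to ``BATCH_SIZE``, not to the backlog size, which is what keeps the worker under ``limit_memory_hard``.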
1 parent a482807 commit 1afbf7c

3 files changed

Lines changed: 69 additions & 40 deletions


fs_attachment/__manifest__.py

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
 {
     "name": "Base Attachment Object Store",
     "summary": "Store attachments on external object store",
-    "version": "16.0.2.0.1",
+    "version": "16.0.2.0.3",
     "author": "Camptocamp, ACSONE SA/NV, Odoo Community Association (OCA)",
     "license": "AGPL-3",
     "development_status": "Beta",

fs_attachment/models/fs_file_gc.py

Lines changed: 52 additions & 39 deletions
@@ -1,5 +1,6 @@
 # Copyright 2023 ACSONE SA/NV
 # License AGPL-3.0 or later (https://www.gnu.org/licenses/agpl).
+import gc
 import logging
 import threading
 from contextlib import closing, contextmanager
@@ -118,51 +119,63 @@ def _gc_files(self) -> None:
         # commit to release the lock
         cr.commit()  # pylint: disable=invalid-commit
 
+    # Max orphan files processed per batch. Bounds the per-run memory
+    # footprint of the autovacuum job when many orphans are queued:
+    # fsspec backend clients (adlfs, s3fs, ...) keep response buffers
+    # and connection-pool state alive until their referents are
+    # collected. Loading tens of thousands of fnames into a single
+    # Python list and iterating ``fs.rm`` in one pass was enough to
+    # hit Odoo's ``limit_memory_hard`` on production workers.
+    _GC_BATCH_SIZE = 500
+
     def _gc_files_unsafe(self) -> None:
         # get the list of fs.storage codes that must be autovacuumed
         codes = (
             self.env["fs.storage"].search([]).filtered("autovacuum_gc").mapped("code")
         )
         if not codes:
             return
-        # we process by batch of storage codes.
-        self._cr.execute(
-            """
-            SELECT
-                fs_storage_code,
-                array_agg(store_fname)
-
-            FROM
-                fs_file_gc
-            WHERE
-                fs_storage_code IN %s
-                AND NOT EXISTS (
-                    SELECT 1
-                    FROM ir_attachment
-                    WHERE store_fname = fs_file_gc.store_fname
-                )
-            GROUP BY
-                fs_storage_code
-            """,
-            (tuple(codes),),
-        )
-        for code, store_fnames in self._cr.fetchall():
+        # Process one storage at a time and paginate both the SELECT and
+        # the fs.rm loop so neither the Python list of file names nor
+        # the storage SDK's response buffers can grow unbounded.
+        for code in codes:
             self.env["fs.storage"].get_by_code(code)
             fs = self.env["fs.storage"].get_fs_by_code(code)
-            for store_fname in store_fnames:
-                try:
-                    file_path = store_fname.partition("://")[2]
-                    fs.rm(file_path)
-                except Exception:
-                    _logger.debug("Failed to remove file %s", store_fname)
-
-        # delete the records from the table fs_file_gc
-        self._cr.execute(
-            """
-            DELETE FROM
-                fs_file_gc
-            WHERE
-                fs_storage_code IN %s
-            """,
-            (tuple(codes),),
-        )
+            while True:
+                self._cr.execute(
+                    """
+                    SELECT store_fname
+                    FROM fs_file_gc
+                    WHERE fs_storage_code = %s
+                        AND NOT EXISTS (
+                            SELECT 1
+                            FROM ir_attachment
+                            WHERE store_fname = fs_file_gc.store_fname
+                        )
+                    LIMIT %s
+                    """,
+                    (code, self._GC_BATCH_SIZE),
+                )
+                rows = self._cr.fetchall()
+                if not rows:
+                    break
+                deleted = []
+                for (store_fname,) in rows:
+                    try:
+                        file_path = store_fname.partition("://")[2]
+                        fs.rm(file_path)
+                        deleted.append(store_fname)
+                    except Exception:
+                        _logger.debug("Failed to remove file %s", store_fname)
+                if deleted:
+                    self._cr.execute(
+                        "DELETE FROM fs_file_gc WHERE store_fname = ANY(%s)",
+                        (deleted,),
+                    )
+                # Force a collection between batches to reclaim response
+                # buffers and connection objects held by the storage SDK
+                # that would otherwise only be freed on worker exit. Do
+                # NOT commit here: the caller (_gc_files) holds a SHARE
+                # lock on fs_file_gc and ir_attachment for consistency
+                # and commits at the end.
+                gc.collect()
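One detail of the new DELETE worth noting: psycopg2 adapts a Python list to a PostgreSQL array, so ``store_fname = ANY(%s)`` dequeues the whole batch in a single round trip. The same one-statement batch delete can be sketched on ``sqlite3``, which lacks ``ANY()``, by expanding an ``IN`` list of placeholders instead; ``delete_batch`` is an illustrative name, not from the module:

```python
import sqlite3


def delete_batch(cr: sqlite3.Connection, fnames: list) -> int:
    """Dequeue a batch of gc rows in one statement.

    With psycopg2/PostgreSQL the equivalent is
    ``DELETE FROM fs_file_gc WHERE store_fname = ANY(%s)`` with the
    Python list passed directly as the parameter; sqlite3 has no
    ANY(), so an IN (...) placeholder list is built instead.
    """
    if not fnames:
        return 0
    marks = ",".join("?" * len(fnames))
    cur = cr.execute(
        "DELETE FROM fs_file_gc WHERE store_fname IN (%s)" % marks,
        fnames,
    )
    return cur.rowcount
```

Either form keeps the number of DELETE statements per batch at one, instead of one per file name.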
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+Paginate the autovacuum GC loop to bound worker memory.
+
+``FsFileGC._gc_files_unsafe`` used to load the entire backlog of orphan
+files into a single Python list via ``array_agg(store_fname)`` and iterate
+``fs.rm`` over all of them in one pass. With the Azure Blob backend and
+tens of thousands of orphans queued, each HEAD+DELETE pair retained
+response buffers and connection-pool state inside the adlfs client that
+was only released when the worker exited. The autovacuum cron hit Odoo's
+``limit_memory_hard`` and got ``SIGKILL``'d mid-run every time, so the
+queue never drained and the next worker ran the same failing loop.
+
+The SELECT and the ``fs.rm`` loop are now paginated in batches of 500 per
+storage, with an explicit ``gc.collect()`` between batches. The caller
+(``_gc_files``) still holds the ``SHARE`` lock and performs the final
+commit, so the consistency guarantees and transactional semantics are
+unchanged.
