Cap archive row limits on legacy MEDIUMBLOB archive_blob tables #24403

Draft
sgiehl wants to merge 2 commits into 5.x-dev from archive-blob-row-limit-cap

@sgiehl sgiehl commented Apr 21, 2026

Problem

Matomo's archive_blob_YYYY_MM partitions were switched from MEDIUMBLOB to LONGBLOB in a recent schema change, but existing installs still carry MEDIUMBLOB partitions for months created before that change. MEDIUMBLOB holds 16 MB max.

Operators can legitimately configure very high datatable_archiving_maximum_rows_* values. When those limits produce a gzip-compressed serialized DataTable larger than 16 MB, the value is silently truncated on insert into a MEDIUMBLOB column and the archive ends up corrupt. No error surfaces at write time. The reports just come out wrong, and the cause is effectively invisible to the operator.
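
To make the size limit concrete, here is a minimal Python sketch of the check that the write path currently lacks. It is illustrative only and not code from this PR: `fits_in_mediumblob` and the sample payload are hypothetical, and `zlib` stands in for whatever compression the archiver applies. The only fixed fact is the MEDIUMBLOB maximum of 2^24 - 1 bytes.

```python
import zlib

# Maximum byte length of a MySQL MEDIUMBLOB value (2**24 - 1 bytes, just under 16 MB).
MEDIUMBLOB_MAX_BYTES = 2**24 - 1


def fits_in_mediumblob(serialized: bytes) -> bool:
    """Return True if the compressed payload would fit in a MEDIUMBLOB column.

    Hypothetical helper: zlib is used here as a stand-in compressor.
    """
    compressed = zlib.compress(serialized)
    return len(compressed) <= MEDIUMBLOB_MAX_BYTES


# Hypothetical stand-in for a serialized DataTable with 1000 small rows.
small_payload = b"label=/page;nb_visits=1\n" * 1000
print(fits_in_mediumblob(small_payload))
```

A payload that fails this check is exactly the case the row cap is meant to prevent from being produced in the first place.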

What this PR does

Automatically caps the effective archive row limit at 50 000 whenever all three conditions are met:

  1. The configured limit exceeds 100 000.
  2. The target archive_blob_YYYY_MM partition is detected as MEDIUMBLOB.
  3. The new [database] archive_blob_tables_may_contain_mediumblob flag is set.

Applies to every archiving path (log-based day, non-day aggregation, flat-first), not just one record type.
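
The three conditions can be sketched as a single pure function. This is a Python illustration of the rule as described above, not the `ArchiveBlobRowCap` code itself: the function and parameter names are hypothetical, and the handling of null/unlimited limits is deliberately left out because it is not spelled out here.

```python
TRIGGER_THRESHOLD = 100_000  # configured limits above this value trigger the cap
CAPPED_LIMIT = 50_000        # effective limit once the cap applies


def effective_row_limit(configured: int, flag_set: bool, target_is_mediumblob: bool) -> int:
    """Return the row limit to use for one archive write (hypothetical sketch)."""
    # Condition 3: without the config flag, the configured value is used as-is.
    if not flag_set:
        return configured
    # Condition 1: only limits strictly above the trigger are affected.
    if configured <= TRIGGER_THRESHOLD:
        return configured
    # Condition 2: the target partition must actually be MEDIUMBLOB.
    if not target_is_mediumblob:
        return configured
    return CAPPED_LIMIT


print(effective_row_limit(200_000, flag_set=True, target_is_mediumblob=True))
```

Note that, under this rule, a configured limit between 50 001 and 100 000 passes through unchanged; only values above the trigger are lowered.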

Why a config flag rather than always checking the schema

A flag avoids any INFORMATION_SCHEMA queries on fresh installs that have never had MEDIUMBLOB archive tables — zero runtime cost for the common case.

  • The migration in core/Updates/5.10.0-b1.php inspects the current schema on upgrade. If any archive_blob_% partition with a MEDIUMBLOB value column is found, it emits a Migration\Config\Set call that sets the flag to 1. If the install is already LONGBLOB-only, no migration is emitted and the flag stays unset.
  • A new CLI command, ./console core:recheck-archive-blob-types, re-runs the check on demand, and two calls in core/Updater.php (placed next to the existing ServerFilesGenerator::createFilesForSecurity() calls, following that precedent) re-run it after updates. Once retention has purged the last legacy partition, the next recheck clears the flag and the cap logic disengages.
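
The flag lifecycle described in these two points can be modelled roughly as follows. This Python sketch assumes the migration is the only place that sets the flag and that the recheck only ever clears it; the function names, the dict-based config, and the returned status strings are all hypothetical.

```python
from typing import Dict

FLAG_KEY = "archive_blob_tables_may_contain_mediumblob"


def has_mediumblob(column_types: Dict[str, str]) -> bool:
    """True if any archive_blob table's value column is MEDIUMBLOB."""
    return any(t.lower() == "mediumblob" for t in column_types.values())


def run_migration(column_types: Dict[str, str], database_config: Dict[str, int]) -> None:
    """Upgrade step: set the flag only when a legacy partition exists."""
    if has_mediumblob(column_types):
        database_config[FLAG_KEY] = 1


def recheck_and_update_flag(column_types: Dict[str, str], database_config: Dict[str, int]) -> str:
    """Recheck step: clear the flag once no MEDIUMBLOB partition remains."""
    if FLAG_KEY not in database_config:
        return "flag unset"
    if has_mediumblob(column_types):
        return "mediumblob still present"
    del database_config[FLAG_KEY]
    return "flag cleared"


config: Dict[str, int] = {}
tables = {"archive_blob_2020_01": "mediumblob", "archive_blob_2026_01": "longblob"}
run_migration(tables, config)
print(recheck_and_update_flag(tables, config))

# Retention purges the legacy partition; the next recheck clears the flag.
del tables["archive_blob_2020_01"]
print(recheck_and_update_flag(tables, config))
```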

Why thresholds are hardcoded

The trigger (100 000) and cap (50 000) are private const in ArchiveBlobRowCap, intentionally not config-tunable. The whole point of the feature is that a too-high limit is unsafe on MEDIUMBLOB; exposing the cap would let an operator defeat the safety and reintroduce the truncation risk.

Files

New

  • core/ArchiveProcessor/ArchiveBlobRowCap.php — cap applicator with hardcoded constants; short-circuits to the configured value when the flag is unset.
  • core/DataAccess/ArchiveBlobColumnType.php — INFORMATION_SCHEMA.COLUMNS detector with per-request static cache, CONFIG_KEY constant, recheckAndUpdateFlag() lifecycle helper, fail-safe behavior on empty/error results.
  • core/Updates/5.10.0-b1.php — upgrade migration.
  • plugins/CoreUpdater/Commands/RecheckArchiveBlobTypes.php — core:recheck-archive-blob-types CLI command.
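
For reviewers, the detector's caching and fail-safe contract can be pictured with the Python sketch below. It is not the `ArchiveBlobColumnType` implementation: the class name, the injected `fetch_column_type` callable (standing in for the INFORMATION_SCHEMA.COLUMNS lookup), and the choice to cache fail-safe results are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional


class BlobColumnTypeDetector:
    """Caches one lookup per table and treats unknown results as MEDIUMBLOB."""

    def __init__(self, fetch_column_type: Callable[[str], Optional[str]]):
        self._fetch = fetch_column_type
        self._cache: Dict[str, bool] = {}

    def is_medium_blob(self, table: str) -> bool:
        if table in self._cache:
            return self._cache[table]
        try:
            column_type = self._fetch(table)
        except Exception:
            column_type = None  # lookup failed: fall through to the fail-safe
        # Fail-safe: an empty or missing result keeps the cap engaged.
        result = not column_type or column_type.lower() == "mediumblob"
        self._cache[table] = result
        return result


calls: List[str] = []


def fake_fetch(table: str) -> Optional[str]:
    """Hypothetical stand-in for the schema query; records each lookup."""
    calls.append(table)
    known = {"archive_blob_2020_01": "mediumblob", "archive_blob_2026_01": "longblob"}
    return known.get(table)


detector = BlobColumnTypeDetector(fake_fetch)
detector.is_medium_blob("archive_blob_2020_01")
detector.is_medium_blob("archive_blob_2020_01")  # served from the cache
detector.is_medium_blob("archive_blob_2026_01")
detector.is_medium_blob("archive_blob_missing")  # no result: treated as MEDIUMBLOB
print(calls)
```

Erring toward "MEDIUMBLOB" on uncertainty means a failed lookup can only lower a row limit, never allow a truncating write.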

Modified

  • core/ArchiveProcessor/RecordBuilder.php — wraps `maxRowsInTable` / `maxRowsInSubtable` through the cap before `getSerialized()`.
  • core/ArchiveProcessor.php — same wrap in `aggregateDataTableRecords`.
  • core/Updater.php — two direct calls to `ArchiveBlobColumnType::recheckAndUpdateFlag()` right after the `ServerFilesGenerator::createFilesForSecurity()` calls (same pattern).

Tests

  • tests/PHPUnit/Unit/ArchiveProcessor/ArchiveBlobRowCapTest.php — 13 unit tests covering boundaries, null/unlimited, LONGBLOB pass-through, fail-safe paths.
  • tests/PHPUnit/Integration/DataAccess/ArchiveBlobColumnTypeTest.php — 8 integration tests including mixed schemas, per-request caching, prefix-scoped table discovery (non-Matomo `archive_blob_*` tables are ignored).
  • tests/PHPUnit/Integration/Updates/Updates5100b1Test.php — 3 tests: MEDIUMBLOB present → migration emitted; LONGBLOB only → no migration; no tables → no migration.
  • plugins/CoreUpdater/tests/Integration/Commands/RecheckArchiveBlobTypesTest.php — 3 tests for the CLI's three state transitions (flag unset, MEDIUMBLOB still present, all LONGBLOB).

Behavior on various installs

| Install state | Flag | DB queries at archive time | Effect |
| --- | --- | --- | --- |
| Fresh 5.10+ install | unset | 0 | No cap, no overhead |
| Long-running install, upgraded past this migration, still has MEDIUMBLOB months | set | 1 `INFORMATION_SCHEMA` query per `archive_blob_*` partition per request, cached | Cap applies only when the target partition is MEDIUMBLOB and configured limit > 100 000 |
| Same install after retention purges all MEDIUMBLOB partitions | cleared on next update or via `core:recheck-archive-blob-types` | 0 | No cap, no overhead |

Non-goals

  • Does not `ALTER TABLE` legacy `MEDIUMBLOB` partitions to `LONGBLOB` — that's a much heavier operation and out of scope.
  • Does not attempt to recover archives that have already been silently truncated — separate concern.
  • Thresholds are intentionally not exposed as config.

Test plan

  • On a fixture DB with a MEDIUMBLOB `archive_blob_YYYY_MM` partition and `datatable_archiving_maximum_rows_actions = 200000`: resulting archive blob contains ≤ 50 001 top-level rows (50 000 + summary).
  • On a LONGBLOB-only schema: row count tracks the configured limit, no cap applied.
  • After running `./console core:update` against a fixture with mixed partitions: `config.ini.php` gains `[database] archive_blob_tables_may_contain_mediumblob = 1` and the change shows in the pre-update migration list.
  • After manually `ALTER TABLE`ing the last MEDIUMBLOB partition to LONGBLOB and running `./console core:recheck-archive-blob-types`: the config entry is removed; subsequent archive writes skip the cap logic.
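
The "50 000 + summary" expectation in the first test-plan item follows from how a row limit truncates a table: rows beyond the limit are folded into one extra summary row. The Python sketch below illustrates that arithmetic at a smaller scale. It is a simplified model, not Matomo's DataTable code; the function name, the `nb_visits` metric, and the "Others" label are assumptions.

```python
from typing import Dict, List


def truncate_with_summary(rows: List[Dict], limit: int, metric: str = "nb_visits") -> List[Dict]:
    """Keep the top `limit` rows by `metric` and fold the rest into one summary row."""
    if len(rows) <= limit:
        return list(rows)
    ranked = sorted(rows, key=lambda row: row[metric], reverse=True)
    kept, rest = ranked[:limit], ranked[limit:]
    summary = {"label": "Others", metric: sum(row[metric] for row in rest)}
    return kept + [summary]


# Scaled-down fixture: 200 rows truncated to a limit of 50.
rows = [{"label": f"/page/{i}", "nb_visits": i + 1} for i in range(200)]
truncated = truncate_with_summary(rows, limit=50)
print(len(truncated))
```

With the capped limit of 50 000, the same logic yields at most 50 001 top-level rows, which is the bound the test plan checks.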

Checklist

  • [✔] I have understood, reviewed, and tested all AI outputs before use
  • [✔] All AI instructions respect security, IP, and privacy rules

@sgiehl sgiehl added this to the 5.10.0 milestone Apr 21, 2026
@sgiehl sgiehl force-pushed the archive-blob-row-limit-cap branch 2 times, most recently from fe60fa5 to 6646b99 Compare April 21, 2026 10:24
@chippison chippison modified the milestones: 5.10.0, 5.11.0 May 1, 2026
@github-actions github-actions Bot added the Stale label May 16, 2026
sgiehl and others added 2 commits May 26, 2026 13:46
Installs that upgraded before the MEDIUMBLOB→LONGBLOB schema change can
have archive_blob tables where gzip-compressed DataTable blobs may exceed
the 16 MB MEDIUMBLOB limit and be silently truncated when row limits are
set above 100 000.

Add ArchiveBlobRowCap (hardcoded trigger=100k, cap=50k) and
ArchiveBlobColumnType (INFORMATION_SCHEMA detector + per-request cache).
Apply the cap at both blob-write sites in ArchiveProcessor and
RecordBuilder, guarded by a [database] config flag so fresh installs pay
zero I/O overhead. A 5.10.0-b1 migration sets the flag on affected
installs; core:recheck-archive-blob-types CLI removes it once all tables
are LONGBLOB. Updater.php calls the recheck helper after each update.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Bump version to 5.10.0-b1
- Centralise the MEDIUMBLOB-list + flag-clear logic in
  ArchiveBlobColumnType::recheckAndUpdateFlag() and have the CLI command
  delegate to it, removing the duplicated write path.
- Make ArchiveBlobColumnType::isMediumBlob() return true on empty/missing
  INFORMATION_SCHEMA results, matching the documented fail-safe contract.
- Add ArchiveBlobColumnType::CONFIG_KEY constant and use it everywhere the
  flag is read or written, preventing silent key drift.
- Narrow the archive_blob table discovery LIKE pattern to the configured
  table prefix.
- Use DatabaseConfig instead of Config class

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sgiehl sgiehl force-pushed the archive-blob-row-limit-cap branch from 6646b99 to 1b4e77d Compare May 26, 2026 11:51
@github-actions github-actions Bot removed the Stale label May 27, 2026