Cap archive row limits on legacy MEDIUMBLOB archive_blob tables#24403
Draft
sgiehl wants to merge 2 commits into
Draft
Cap archive row limits on legacy MEDIUMBLOB archive_blob tables#24403sgiehl wants to merge 2 commits into
sgiehl wants to merge 2 commits into
Conversation
fe60fa5 to
6646b99
Compare
This comment was marked as outdated.
This comment was marked as outdated.
Installs that upgraded before the MEDIUMBLOB→LONGBLOB schema change can have archive_blob tables where gzip-compressed DataTable blobs may exceed the 16 MB MEDIUMBLOB limit and be silently truncated when row limits are set above 100 000. Add ArchiveBlobRowCap (hardcoded trigger=100k, cap=50k) and ArchiveBlobColumnType (INFORMATION_SCHEMA detector + per-request cache). Apply the cap at both blob-write sites in ArchiveProcessor and RecordBuilder, guarded by a [database] config flag so fresh installs pay zero I/O overhead. A 5.10.0-b1 migration sets the flag on affected installs; core:recheck-archive-blob-types CLI removes it once all tables are LONGBLOB. Updater.php calls the recheck helper after each update. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Bump version to 5.10.0-b1 - Centralise the MEDIUMBLOB-list + flag-clear logic in ArchiveBlobColumnType::recheckAndUpdateFlag() and have the CLI command delegate to it, removing the duplicated write path. - Make ArchiveBlobColumnType::isMediumBlob() return true on empty/missing INFORMATION_SCHEMA results, matching the documented fail-safe contract. - Add ArchiveBlobColumnType::CONFIG_KEY constant and use it everywhere the flag is read or written, preventing silent key drift. - Narrow the archive_blob table discovery LIKE pattern to the configured table prefix. - Use DatabaseConfig instead of Config class Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6646b99 to
1b4e77d
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Matomo's
archive_blob_YYYY_MMpartitions were switched fromMEDIUMBLOBtoLONGBLOBin a recent schema change, but existing installs still carryMEDIUMBLOBpartitions for months created before that change.MEDIUMBLOBholds 16 MB max.Operators can legitimately configure very high
datatable_archiving_maximum_rows_*values. When those limits produce a gzip-compressed serialized DataTable that exceeds 16 MB, the insert silently truncates into aMEDIUMBLOBcolumn and the archive ends up corrupt. No error surfaces at write time — the reports just come out wrong and the cause is effectively invisible to the operator.What this PR does
Automatically caps the effective archive row limit at 50 000 whenever all three conditions are met:
archive_blob_YYYY_MMpartition is detected asMEDIUMBLOB.[database] archive_blob_tables_may_contain_mediumblobflag is set.Applies to every archiving path (log-based day, non-day aggregation, flat-first), not just one record type.
Why a config flag rather than always checking the schema
A flag avoids any
INFORMATION_SCHEMAqueries on fresh installs that have never hadMEDIUMBLOBarchive tables — zero runtime cost for the common case.core/Updates/5.10.0-b1.phpinspects the current schema on upgrade. If anyarchive_blob_%partition with aMEDIUMBLOBvaluecolumn is found, it emits aMigration\Config\Setcall that sets the flag to1. If the install is already LONGBLOB-only, no migration is emitted and the flag stays unset../console core:recheck-archive-blob-types+ two calls fromcore/Updater.php(co-located with the existingServerFilesGenerator::createFilesForSecurity()calls, following that precedent) re-run the check after updates. Once retention has purged the last legacy partition, the flag is automatically cleared and the cap logic disengages.Why thresholds are hardcoded
The trigger (100 000) and cap (50 000) are
private constinArchiveBlobRowCap, intentionally not config-tunable. The whole point of the feature is that a too-high limit is unsafe on MEDIUMBLOB; exposing the cap would let an operator defeat the safety and reintroduce the truncation risk.Files
New
core/ArchiveProcessor/ArchiveBlobRowCap.php— cap applicator with hardcoded constants; short-circuits to the configured value when the flag is unset.core/DataAccess/ArchiveBlobColumnType.php—INFORMATION_SCHEMA.COLUMNSdetector with per-request static cache,CONFIG_KEYconstant,recheckAndUpdateFlag()lifecycle helper, fail-safe behavior on empty/error results.core/Updates/5.10.0-b1.php— upgrade migration.plugins/CoreUpdater/Commands/RecheckArchiveBlobTypes.php—core:recheck-archive-blob-typesCLI command.Modified
core/ArchiveProcessor/RecordBuilder.php— wraps `maxRowsInTable` / `maxRowsInSubtable` through the cap before `getSerialized()`.core/ArchiveProcessor.php— same wrap in `aggregateDataTableRecords`.core/Updater.php— two direct calls to `ArchiveBlobColumnType::recheckAndUpdateFlag()` right after the `ServerFilesGenerator::createFilesForSecurity()` calls (same pattern).Tests
tests/PHPUnit/Unit/ArchiveProcessor/ArchiveBlobRowCapTest.php— 13 unit tests covering boundaries, null/unlimited, LONGBLOB pass-through, fail-safe paths.tests/PHPUnit/Integration/DataAccess/ArchiveBlobColumnTypeTest.php— 8 integration tests including mixed schemas, per-request caching, prefix-scoped table discovery (non-Matomo `archive_blob_*` tables are ignored).tests/PHPUnit/Integration/Updates/Updates5100b1Test.php— 3 tests: MEDIUMBLOB present → migration emitted; LONGBLOB only → no migration; no tables → no migration.plugins/CoreUpdater/tests/Integration/Commands/RecheckArchiveBlobTypesTest.php— 3 tests for the CLI's three state transitions (flag unset, MEDIUMBLOB still present, all LONGBLOB).Behavior on various installs
Non-goals
Test plan
Checklist