#139 - Support checksums and conditional loading for a given CSV row#322
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces opt-in per-row checksum tracking for CSV loaders so that, when enabled, unchanged rows can be skipped and only changed rows are processed on subsequent loads.
Changes:
- Add
initializer.row.checksums.enabledconfiguration and expose it viaInitializerConfig. - Implement row-level checksum computation and persistence (
*.rows.checksum) inConfigDirUtil, and use it inBaseCsvLoaderto skip unchanged rows and persist hashes for successfully processed rows. - Add unit/integration tests to validate row checksum stability and basic sidecar-file behavior.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| api/src/main/java/org/openmrs/module/initializer/api/loaders/BaseCsvLoader.java | Skips unchanged CSV rows based on previously persisted row hashes; persists new row hashes after processing. |
| api/src/main/java/org/openmrs/module/initializer/api/ConfigDirUtil.java | Adds row checksum computation plus read/write/delete support for per-row checksum sidecar files. |
| api/src/main/java/org/openmrs/module/initializer/InitializerConstants.java | Adds the initializer.row.checksums.enabled property constant. |
| api/src/main/java/org/openmrs/module/initializer/InitializerConfig.java | Loads and exposes the new row-checksums-enabled flag. |
| api/src/test/java/org/openmrs/module/initializer/api/ConfigDirUtilTest.java | Adds tests for row checksum stability and disk round-tripping. |
| api/src/test/java/org/openmrs/module/initializer/api/LocationsLoaderRowChecksumIntegrationTest.java | Adds an integration test around row-checksum sidecar creation and reading. |
Comments suppressed due to low confidence (1)
api/src/main/java/org/openmrs/module/initializer/api/loaders/BaseCsvLoader.java:156
- When per-row checksums are enabled and some/all rows are skipped,
totalCountstill reflects all CSV rows, but the loader may process/save fewer (or zero) rows in this run. This makes the success log message misleading (e.g., it can claim alltotalCountentities were saved even if all rows were skipped). Track and log a separate processed/saved count (and optionally skipped count) when row checksums are enabled.
if (isEmpty(result.getFailingLines())) {
log.info(file.getName() + " ('" + getDomainName() + "' domain) was entirely successfully processed.");
log.info(totalCount + " entities were saved.");
return;
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
mogoodrich
left a comment
There was a problem hiding this comment.
LGTM, thanks @mseaton ! Though maybe (ask Claude to) update the README with how to turn this on? Is enabling an all-or-nothing thing, or can we enable it just for certain domain objects (probably fine if all or nothing, just wondering).
And when upgrading to this method, all iniz domains using the new method will need to be reloaded (again, not a blocker/issue but just noting).
Good point @mogoodrich - I have added a bullet point to the release notes and to the runtime properties README configuration.
No. The first guard is still the file-level checksum. As long as a file hasn't changed, it won't need to be reloaded. If a file is changed and this is enabled, then it will reload everything in that file if no row-level checksums have previously been generated. Then, on subsequent changes it will be able to identify the specific changed rows. If one wanted to "preload" all of the row-level changes to guard against performance issues in the future, one could force a reload of all domains after enabling this at a predictable time so that any future updates and restarts would benefit from these improvements. |
|
I plan to merge this PR later today unless I hear otherwise @ibacher or others... |
Sounds great. |
Fine with me |
No description provided.