Skip to content

refactor(backend, website, ingest)!: Improve naming of data at different stages in lifecycle#6629

Open
anna-parker wants to merge 9 commits into
mainfrom
download-submitted-data
Open

refactor(backend, website, ingest)!: Improve naming of data at different stages in lifecycle#6629
anna-parker wants to merge 9 commits into
mainfrom
download-submitted-data

Conversation

@anna-parker

@anna-parker anna-parker commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

resolves #

When reviewing #5856 I realized it is quite confusing that we call an endpoint get-original-data but we actually return data from the unprocessed_data column.

Additionally, I noticed that "unprocessed data" is used inconsistently in the code as the mutable copy of the submitted data and the copy of the submitted data plus additional fields such as submitter, time, accession (I introduced this in https://github.com/loculus-project/loculus/pull/6254/changes so my bad!). This is confusing.

I decided it would be best to introduce a consistent naming pattern that is reflected in how the data is called in different stages of its lifecycle but also in the endpoints themselves (let me know what you think about the names I can also change this!)

ArchiveOfSubmittedData (renamed from original_data, stored as immutable column in db for administrators)
        ↓
SubmittedData        (renamed from unprocessed_data, initially copy of OriginalSubmissionData but can be modified for db surgery and should be used as input for producing unprocessedData and data returned to the user)
        ↓ + accession/version/submitter/submittedAt
UnprocessedData       (name kept as is)
        ↓ processing service
ProcessedData

Screenshot

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: https://download-submitted-data.loculus.org

@claude claude Bot added backend related to the loculus backend component website Tasks related to the web application ingest Ingest pipeline labels Jun 7, 2026
@anna-parker anna-parker added preview Triggers a deployment to argocd update_db_schema labels Jun 7, 2026
@anna-parker

Copy link
Copy Markdown
Contributor Author

@claude review

@claude

This comment was marked as outdated.

@anna-parker

Copy link
Copy Markdown
Contributor Author

@claude review

@claude

claude Bot commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

Claude finished @anna-parker's task in 3m 33s —— View job


Code Review

  • Gather context and read changed files
  • Review naming consistency across all changed files
  • Review database migration
  • Review API/endpoint changes
  • Review documentation updates
  • Review test coverage
  • Post final review

Summary

This is a solid naming cleanup PR that introduces a clear, documented hierarchy (OriginalSubmissionData → SubmissionData → UnprocessedData → ProcessedData). The motivation and implementation are both clear. I found a few naming inconsistencies that were missed in the rename sweep.


Issues

1. SequenceEntriesTable.kt:20 — Kotlin variable not renamed to match new DB column

The DB column was renamed original_dataoriginal_submission_data, but the Kotlin accessor variable is still originalDataColumn:

// Still named originalDataColumn, but maps to DB column "original_submission_data"
val originalDataColumn = jacksonSerializableJsonb<SubmittedData<CompressedSequence>>(
    "original_submission_data",
).nullable()

Meanwhile the paired column follows the new convention: submissionDataColumn"submission_data". This inconsistency will confuse readers.

SubmissionDatabaseService.kt:1001 also still uses SequenceEntriesTable.originalDataColumn — this reads correctly at runtime (maps to the right column) but the Kotlin name originalDataColumn no longer describes what the column is (immutable original submission copy).

Fix this →


2. DownloadSubmittedDataButton.tsx — Several old names not updated

The file was renamed from DownloadOriginalDataButton.tsx but three things inside still use the old naming:

  • Line 10: export const MAX_ORIGINAL_DATA_DOWNLOAD_ENTRIES = 500 — should be MAX_SUBMITTED_DATA_DOWNLOAD_ENTRIES (also imported under the old name in SearchFullUI.tsx:8)
  • Line 12: type DownloadOriginalDataButtonProps — should be DownloadSubmittedDataButtonProps
  • Line 96: const filename = \${organism}original_data${timestamp}.zip`— inconsistent with the backend, which now generates${organism}submitted_data${timestamp}.zip(seeSubmissionController.kt:submittedDataDownloadFilename`)

Fix this →


3. Test files/classes not renamed for consistency

Several test files/helpers were partially updated but not fully renamed:

  • GetUnprocessedMetadataEndpointTest.kt (file + class name) — tests the /get-submitted-metadata endpoint now; should be renamed GetSubmittedMetadataEndpointTest
  • OriginalDataRevisionWorkflowTest.kt (class name) — tests the getSubmittedData endpoint; class still named OriginalDataRevisionWorkflowTest
  • PreparedOriginalData.kt (file + variable names) — defaultOriginalData and emptyOriginalData now hold SubmittedData values, but their names imply OriginalData

These are test-only inconsistencies and not runtime bugs, but they undermine the goal of a consistent naming convention through the codebase.


Minor Nit

SubmissionTypes.kt:319-320 — Redundant typealias chain

typealias SubmittedContent<SequenceType> = SubmittedContentInternal<SequenceType, FileIdAndName>
typealias SubmittedData<SequenceType> = SubmittedContent<SequenceType>  // alias of an alias

SubmittedData<T> is just an alias for SubmittedContent<T> — both refer to the same thing. The old code had the same pattern (OriginalData = UnprocessedDataContent) to distinguish it from UnprocessedDataContentWithFileUrls, but a single well-named alias would be cleaner. Collapsing to just SubmittedData<T> = SubmittedContentInternal<T, FileIdAndName> (and keeping SubmittedContentWithFileUrls) would be clearer. Not a blocker.


What looks good

  • The documented hierarchy (OriginalSubmissionData → SubmissionData → UnprocessedData → ProcessedData) is clear and the motivation in the PR description is well articulated
  • runtime_view.md now includes an excellent explanation of why original_submission_data and submission_data are separate (the host validation example is perfect)
  • The migration (V1.28__rename_columns.sql) is correct: renames both columns, drops the view with CASCADE, and fully recreates it with the new column names
  • UploadDatabaseService.kt raw SQL column names are updated in sync with the migration
  • The ingest/scripts/loculus_client.py endpoint URL update is correctly aligned

@anna-parker anna-parker marked this pull request as ready for review June 7, 2026 20:35
@claude

claude Bot commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

Claude finished @anna-parker's task in 3m 12s —— View job


Code Review

  • Gather context and understand the PR
  • Review DB migration
  • Review backend API/types
  • Review backend services and database layer
  • Review tests
  • Review website/ingest changes
  • Post findings

This is a well-executed rename refactor. The lifecycle naming hierarchy in the PR description (OriginalSubmissionData → SubmissionData → UnprocessedData → ProcessedData) is clear and the changes are applied consistently across the stack. The DB migration is sound — RENAME COLUMN is safe and the DROP VIEW ... CASCADE + recreation correctly handles the view dependency.

A few items worth addressing:

DownloadSubmittedDataButton.tsx — incomplete rename (3 issues)

The file was renamed and the component/function were updated, but these three things were missed:

  1. Constant not renamed (line 10): MAX_ORIGINAL_DATA_DOWNLOAD_ENTRIES → should be MAX_SUBMITTED_DATA_DOWNLOAD_ENTRIES (the backend already uses the new name). Also referenced on lines 41, 74, 120. The import in SearchFullUI.tsx also still imports the old constant name.

  2. Props interface not renamed (line 12): type DownloadOriginalDataButtonPropstype DownloadSubmittedDataButtonProps.

  3. Generated filename not updated (line 96): ${organism}_original_data_${timestamp}.zip → should be ${organism}_submitted_data_${timestamp}.zip to match what the backend now generates.

GetUnprocessedMetadataEndpointTest.kt — file not renamed

GetOriginalDataEndpointTest.kt was correctly renamed to GetSubmittedDataEndpointTest.kt, but GetUnprocessedMetadataEndpointTest.kt was not renamed to GetSubmittedMetadataEndpointTest.kt. These should be treated consistently since both endpoints were renamed.

CompressionService.kt — the private method compressOriginalSequence (line 31) is still named with "Original". Not a blocking issue since it's private, but worth renaming to compressSequence for consistency.

SubmissionTypes.ktSubmittedData<T> and SubmittedContent<T> are both typealiases for SubmittedContentInternal<T, FileIdAndName> (one aliasing the other). This was carried forward from the OriginalData = UnprocessedDataContent pattern. Consider collapsing to a single alias in a follow-up.

Comment thread backend/src/main/resources/db/migration/V1.28__rename_columns.sql Outdated
@anna-parker anna-parker changed the base branch from download-original-data to main June 7, 2026 20:49
@anna-parker anna-parker changed the base branch from main to download-original-data June 7, 2026 20:49
Comment thread backend/src/main/kotlin/org/loculus/backend/api/SubmissionTypes.kt
@anna-parker

Copy link
Copy Markdown
Contributor Author

@claude review

@anna-parker anna-parker requested a review from theosanderson June 8, 2026 11:15
@claude

claude Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Claude finished @anna-parker's task in 2m 54s —— View job


Code Review

  • Gather context and read changed files
  • Check whether issues from previous reviews were addressed
  • Review naming consistency across all changed files
  • Review DB migration and schema
  • Review API/endpoint changes
  • Post final review

Summary

Good progress since the previous reviews — most flagged issues have been resolved. The Theo-suggested naming changes (SubmittedData, archive_of_submitted_data) are applied correctly and the typealias chain has been cleaned up. Two small remaining issues, one of which was flagged in both prior reviews.


What was fixed ✅

  • SequenceEntriesTable.kt: originalDataColumnarchiveOfSubmittedDataColumn, correctly maps to "archive_of_submitted_data" column
  • DownloadSubmittedDataButton.tsx: all three issues addressed — constant renamed, props type renamed, filename updated to ${organism}_submitted_data_${timestamp}.zip
  • Test files all renamed: GetUnprocessedMetadataEndpointTestGetSubmittedMetadataEndpointTest, OriginalDataRevisionWorkflowTestSubmittedDataRevisionWorkflowTest, PreparedOriginalDataPreparedSubmittedData with variables renamed consistently
  • SubmittedData typealias chain collapsed — SubmittedData<T> now directly aliases SubmittedContentInternal<T, FileIdAndName> (no alias-of-alias)
  • Theo's naming suggestions applied: SubmittedData (not SubmissionData), archive_of_submitted_data (not original_data)
  • runtime_view.md documentation is excellent, with a concrete host-validation example that clearly explains the archive_of_submitted_data/submitted_data split

Remaining Issues

1. CompressionService.kt:31compressOriginalSequence not renamed

This was flagged in both previous reviews and is still unaddressed. The outer functions were correctly renamed (compressSequencesInOriginalDatacompressSequencesInSubmittedData), but the inner public method still uses the old naming:

// CompressionService.kt:31 — public method still named with "Original"
fun compressOriginalSequence(sequenceData: GeneticSequence, organism: Organism) = compress(...)

// CompressionService.kt:81 — call site inside compressSequencesInSubmittedData
else -> compressOriginalSequence(sequenceData, organism)

The mismatch is jarring: compressSequencesInSubmittedData calls compressOriginalSequence internally. Also called from UploadDatabaseService.kt:150.

Suggested rename: compressOriginalSequencecompressSubmittedSequence

Fix this →


2. SubmissionTypes.kt:172 — stale @Schema description

// Line 172 — still says "original Data"
description = "The key is the segment name, the value is the fastaHeader of the original Data",

The original Data phrasing is a leftover and should be updated (e.g. "the fastaHeader of the submitted data" or simply "the FASTA header").


Minor Nit

Button text inconsistency: DownloadSubmittedDataButton.tsx:46,49 renders "Download originally submitted data (…)". The word "originally" was presumably kept for user clarity, but it's a slight inconsistency with the internal rename. If the intent is for user-facing text to say "submitted data" (consistent with the endpoint name), this could be simplified to "Download submitted data (…)". Not a blocking issue.


@chaoran-chen chaoran-chen force-pushed the download-original-data branch 2 times, most recently from a4927a7 to a5f4387 Compare June 9, 2026 06:48
Base automatically changed from download-original-data to main June 9, 2026 07:23
@anna-parker anna-parker force-pushed the download-submitted-data branch 2 times, most recently from 78479f8 to c579cfc Compare June 9, 2026 08:43
@theosanderson

Copy link
Copy Markdown
Member

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 452ca84aba

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread ingest/scripts/loculus_client.py
Comment thread preprocessing/nextclade/src/loculus_preprocessing/embl.py Outdated

@corneliusroemer corneliusroemer left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found a few remnants of submission instead of submitted

@anna-parker anna-parker force-pushed the download-submitted-data branch from 9d6505a to 18eeda6 Compare June 9, 2026 15:28
@anna-parker

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d8f3a87e08

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread ingest/scripts/loculus_client.py
@theosanderson

Copy link
Copy Markdown
Member

@claude review very carefully

@claude

claude Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Claude finished @theosanderson's task in 3m 24s —— View job


Code Review

  • Gather context and understand the PR
  • Read all changed files carefully
  • Check issues flagged in previous reviews
  • Review DB migration correctness
  • Review backend API/types and services
  • Review ingest changes
  • Review website changes
  • Review test coverage and consistency
  • Post findings

This is a well-executed naming cleanup that introduces a clear, well-documented lifecycle hierarchy. Most of the issues raised in prior reviews are fixed. The migration is structurally sound. I found three remaining issues in a careful pass.


Issues

1. OriginalDataDownloadTestHelpers.kt — file not renamed at all (missed by all prior reviews)

This file was not part of the PR diff. It still has the old name, an old function name, and an old docstring, and is called 13 times in GetSubmittedDataEndpointTest.kt:

// OriginalDataDownloadTestHelpers.kt (entire file unchanged)
/**
 * Extracts the metadata.tsv and sequences.fasta contents from a downloaded original-data zip.
 */
fun extractOriginalDataZipContents(zipContent: ByteArray): Pair<String, String> { ... }

GetSubmittedDataEndpointTest.kt calls it 13 times as extractOriginalDataZipContents(...). The file should be renamed to SubmittedDataDownloadTestHelpers.kt and the function renamed to extractSubmittedDataZipContents.

Fix this →


2. website/src/types/backend.ts + EditPage.spec.tsxUnprocessedMetadataRecord type and method not renamed

backend.ts:47-48 defines and exports an internal Zod schema type that still carries the old name:

const unprocessedMetadataRecord = z.record(z.string());
export type UnprocessedMetadataRecord = z.infer<typeof unprocessedMetadataRecord>;

This type is referenced at backend.ts:178,195,226 and re-used in EditPage.spec.tsx:118:

unprocessedMetadata: (metadata: UnprocessedMetadataRecord): void => ...

The test helper object on EditPage.spec.tsx:118 still has unprocessedMetadata as its method name, and is called on lines 78, 85, 106, 113 as expectTextInSequenceData.unprocessedMetadata(...).

Suggested renames: UnprocessedMetadataRecordSubmittedMetadataRecord, unprocessedMetadataRecordsubmittedMetadataRecord, and the helper method name updated to match.


3. website/vitest.setup.ts:53 — test fixture key still says "unprocessed"

export const metadataKey = 'unprocessedMetadataField';

This string value ('unprocessedMetadataField') is used as a metadata field name in test fixtures. With the naming PR, this should probably be 'submittedMetadataField'. Minor, but undermines naming consistency in tests.


What looks good ✅

  • DB migration V1.29__rename_columns.sql is correct: RENAME COLUMN is safe, DROP VIEW ... CASCADE handles the dependency properly, and the recreated view correctly references the new column names. Historical migrations (V1.24–V1.28) correctly keep old column names since they run before V1.29.
  • CompressionService.ktcompressOriginalSequencecompressSubmittedSequence is done; all prior review flags are resolved
  • SubmissionTypes.kt — typealias chain collapsed cleanly: SubmittedData<T> directly aliases SubmittedContentInternal<T, FileIdAndName>
  • compare_hashes.pyunprocessedMetadatasubmittedMetadata key rename is complete throughout, including construct_submitted_dict and test fixtures in previous_submissions.ndjson
  • Test files renamed: GetSubmittedDataEndpointTest, GetSubmittedMetadataEndpointTest, PreparedSubmittedData, SubmittedDataRevisionWorkflowTest
  • runtime_view.md is excellent — the concrete host-validation example explaining archive_of_submitted_data vs submitted_data makes the distinction immediately clear
  • DownloadSubmittedDataButton.tsx — all three previously flagged issues resolved: constant renamed, props type renamed, filename updated to ${organism}_submitted_data_${timestamp}.zip
  • SequenceEntriesTable.ktarchiveOfSubmittedDataColumn and submittedDataColumn correctly match DB column names

Minor Nit

UI button text: The button renders "Download originally submitted data" while the endpoint is get-submitted-data. The word "originally" is a small residue of the old name, though it may be intentional for user clarity. The integration test expects this exact string (/Download originally submitted data/), so any change needs to be made in both places.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend related to the loculus backend component ingest Ingest pipeline preview Triggers a deployment to argocd update_db_schema website Tasks related to the web application

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants