bug: missing commas in Alliance column lists cause silent string concatenation #118

@AaryanCode69

Labels

bug, data-integrity, alliance, data-generation


Description

The Alliance of Genome Resources column definition lists in src/data_generation/alliance/__init__.py are missing commas between adjacent string literals. Python silently concatenates adjacent string literals that are not separated by a comma (this is a language feature, not a syntax error), so the column lists end up with fewer entries than intended and contain wrong column names.

This causes a silent schema mismatch when parsing Alliance TSV files: columns after the concatenation point get mapped to the wrong names, leading to incorrect metadata on ingested documents with no error raised.
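The shift can be illustrated with a minimal sketch. The shortened column list and the row values below are hypothetical, and pairing names with values through zip() is an assumption about how a loader might do the mapping, not the project's actual parsing code:

```python
# Hypothetical, shortened column list reproducing the Location 1 pattern.
columns = [
    "Interaction annotation(s)",
    "Host organism(s)" "Interaction parameter(s)",  # missing comma
    "Creation date",
]

# Hypothetical TSV row holding the four values the file really contains.
row = ["annotation-value", "host-value", "parameter-value", "date-value"]

# zip() stops at the shorter sequence, so the length mismatch raises no error.
record = dict(zip(columns, row))

for name, value in record.items():
    print(f"{name!r}: {value!r}")
```

In this sketch "Creation date" receives the interaction-parameter value, the real date value is dropped, and "Host organism(s)" never appears as a key.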


Affected Locations

Location 1 : molecular_interaction column list (line ~141)

"Host organism(s)" "Interaction parameter(s)",

Expected: Two list entries → "Host organism(s)" and "Interaction parameter(s)"
Actual: One entry → "Host organism(s)Interaction parameter(s)" (concatenated)

This position holds 1 entry instead of 2, so the list is one entry short. Every column name after this point is shifted by 1.

Location 2 : genetic_interaction column list (lines ~180–184)

"Annotation(s) interactor A"
"Annotation(s) interactor B"
"Interaction annotation(s)"
"Host organism(s)"
"Interaction parameter(s)"
"Creation date",

Expected: Six separate list entries
Actual: As written, all six strings concatenate into one, because "Interaction parameter(s)" also lacks a trailing comma:

"Annotation(s) interactor AAnnotation(s) interactor BInteraction annotation(s)Host organism(s)Interaction parameter(s)Creation date"

The comma after "Creation date" only separates it from the entry that follows. These six lines yield 1 entry instead of 6, so 5 standalone column names are missing.


Impact

  • Data integrity: Column-to-name mapping is silently wrong for all Alliance molecular_interaction and genetic_interaction data. Fields like "Host organism(s)", "Interaction parameter(s)", "Annotation(s) interactor A/B", and "Interaction annotation(s)" are never properly indexed as standalone columns.
  • Retrieval quality: Any metadata-based filtering or search against these column names will match nothing, since the names don't exist as standalone entries in the schema.
  • Silent failure: No error or warning is raised at any point; the data loads successfully but with incorrect field assignments.
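One way to turn this silent failure into a loud one is a length check before names are mapped to values. The following is a minimal sketch, assuming the loader has access to one raw tab-separated line; check_column_count is a hypothetical helper and does not exist in the repository:

```python
def check_column_count(columns, line, delimiter="\t"):
    """Raise ValueError if a raw TSV line and the column list differ in length.

    Hypothetical helper for illustration only.
    """
    n_fields = len(line.rstrip("\n").split(delimiter))
    if n_fields != len(columns):
        raise ValueError(
            f"expected {len(columns)} columns, found {n_fields} fields in line"
        )


# A list with a missing comma has one name but the line has two fields.
broken = ["Host organism(s)" "Interaction parameter(s)"]
try:
    check_column_count(broken, "host-value\tparameter-value\n")
except ValueError as exc:
    print(f"schema mismatch: {exc}")
```

A check like this would not say which comma is missing, but it would stop ingestion before wrongly named metadata is written.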

Steps to Reproduce

# Quick verification in a Python REPL:
columns = [
    "Annotation(s) interactor A"
    "Annotation(s) interactor B"
    "Interaction annotation(s)"
    "Host organism(s)"
    "Interaction parameter(s)"
    "Creation date",
]
print(len(columns))   # Expected: 6, Actual: 1 (all six literals concatenate)
print(columns[0])     # Shows the concatenated mega-string

Proposed Fix

Add the missing commas after each string literal:

Location 1

-            "Host organism(s)" "Interaction parameter(s)",
+            "Host organism(s)",
+            "Interaction parameter(s)",

Location 2

-            "Annotation(s) interactor A"
-            "Annotation(s) interactor B"
-            "Interaction annotation(s)"
-            "Host organism(s)"
-            "Interaction parameter(s)"
+            "Annotation(s) interactor A",
+            "Annotation(s) interactor B",
+            "Interaction annotation(s)",
+            "Host organism(s)",
+            "Interaction parameter(s)",
             "Creation date",

Files to change

  • src/data_generation/alliance/__init__.py
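To guard against a regression, a test could scan the file for string literals that directly follow another string literal. This is a sketch using the standard library tokenize module; find_implicit_concatenations is a hypothetical name, and the approach assumes the column lists use plain string literals (on Python 3.12 and later, f-strings produce different token types and are not covered):

```python
import io
import tokenize


def find_implicit_concatenations(source):
    """Return (line, column) starts of string literals that directly
    follow another string literal with no token in between."""
    hits = []
    prev_was_string = False
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Line breaks inside brackets and comments do not separate literals.
        if tok.type in (tokenize.NL, tokenize.COMMENT):
            continue
        if tok.type == tokenize.STRING:
            if prev_was_string:
                hits.append(tok.start)
            prev_was_string = True
        else:
            prev_was_string = False
    return hits


sample = (
    "columns = [\n"
    '    "Host organism(s)" "Interaction parameter(s)",\n'
    '    "Creation date",\n'
    "]\n"
)
for line, col in find_implicit_concatenations(sample):
    print(f"implicit string concatenation at line {line}, column {col}")
```

A test that reads the module's source and asserts the result is empty would flag both locations in this issue. Intentional concatenations elsewhere in the file, if any, would need an allowlist.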
