
Commit 998ec20

feat(silo-import, helm, website): Revive hierarchical filter functionality (#6575)
This PR reintroduces the hierarchical filtering functionality that was added in #6302 and #6404 but reverted so that we could first roll out the required breaking change from `hostTaxonId: int` to `hostTaxonId: string` on its own. Below is an updated description taken from the original PR:

## Summary

Introduces a new "hierarchical filter" concept for metadata fields whose values form a hierarchy resolved dynamically at SILO import time (rather than statically pinned per pipeline version, the way lineageSystem works).

The first use case is hostTaxonId: at import time the SILO importer collects the set of host taxon IDs observed in the released data, asks the taxonomy service to render a SILO-format lineage YAML for those taxa, and writes it into SILO's input directory. SILO then indexes the column as a lineage, which gives the website a hierarchical search UI ("Culicidae" matches records of all mosquito species, etc.).

This PR introduces changes to the helm chart, the SILO importer, and the website.

## Helm / config

- New metadata-field property hierarchicalFilter: <URL>. Mutually exclusive with lineageSystem (enforced in values.schema.json). When set, the field is treated like a lineage system for SILO indexing (generateLineageIndex) and surfaces a new hierarchicalSearch: true flag in the website config.
- _hierarchical-filters-for-organism.tpl mirrors the existing lineageSystemForOrganism helper. The SILO database config combines both lists into lineageDefinitionFilenames.
- silo-deployment.yaml builds and injects the HIERARCHICAL_FILTERS env var (JSON map) consumed by the importer.
- West-nile is given hierarchicalFilter: *taxonomy_service_url so that the new feature can be observed on previews.
- New section in helm-chart-config.mdx documenting the feature.

## SILO import

- HIERARCHICAL_FILTERS env var parsed at startup.
- update_hierarchical_filters checks whether the most recent import cycle returned new values for the hierarchical filter field, and fetches an updated lineage if so. The response is written to <input_dir>/<field>.yaml. If the service responds 413 (file exceeds SILO's Jackson YAML size limit), the importer retries with prune=true.
- New tests in tests/test_runner.py cover both the happy path and the skip-when-unchanged path; new tests in tests/test_config.py cover env-var parsing.

## Frontend

- New config options (types/config.ts). Two optional fields added to the metadata schema:
  - hierarchicalSearch: boolean, which opts a field into the hierarchical filter UI.
  - hierarchicalSearchLabel: string, a custom label for the "include sublineages/subcategories" checkbox.
- LineageField.tsx is renamed and extended to HierarchicalField.tsx. The component now has a mode prop ('lineage' | 'default') with a MODE_CONFIGS table:

  | | `lineage` mode | `default` (hierarchical) mode |
  |---|---|---|
  | default "include sublineages" | off | **on** |
  | `showAlias` | false | **true** |
  | checkbox label | "include sublineages" | "include subcategories" |
  | `includeZeroCounts` | true | true |

- SearchForm.tsx renders the HierarchicalField when either field.lineageSearch or field.hierarchicalSearch is set, passing mode='lineage' for the former and 'default' for the latter.
- The lineage options hook in AutoCompleteOptions.ts gains two parameters:
  1. showAlias: when a lineage/node has aliases, display the first alias as the user-facing label (the value stays the canonical name). This is the hierarchical filter's user-facing label.
  2. includeZeroCounts: when false, options with a count of 0 are filtered out.
- Added a valueToLabel map to SingleChoiceAutoCompleteField.tsx so the input's displayValue shows the human-readable label (e.g., the alias) instead of the raw stored value.

## Screenshots & manual testing

I configured a hierarchical filter for hostTaxonId and set up a preview. On the preview, I tested filtering west-nile sequences using the new hierarchical filter autocomplete. It seems to work as intended, and the interplay with the existing host name - common / scientific fields also looks reasonable: if I set a filter for the species 'Culex' using host name - scientific, all the options with zero count (birds etc.) no longer show up in the hierarchical filter autocomplete.

This is how the autocomplete looks:

<img width="274" height="122" alt="grafik" src="https://github.com/user-attachments/assets/8224989c-a64e-4154-b45a-72f4551faa0b" />
<img width="274" height="343" alt="grafik" src="https://github.com/user-attachments/assets/89b0eb28-4f9e-4dd8-9273-a01dd0702d65" />

### PR Checklist

- [ ] All necessary documentation has been adapted.
- [ ] The implemented feature is covered by appropriate, automated tests.
- [ ] Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: https://revive-hierarchical-filte.loculus.org

---------

Co-authored-by: Anna (Anya) Parker <50943381+anna-parker@users.noreply.github.com>
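The import-time behaviour described under "SILO import" can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the function name `refresh_hierarchy_file`, the injected `fetch` callable with its `(status, body)` return shape, and the way previously seen values are tracked are all assumptions made for the example. Only the skip-when-unchanged rule, the retry with `prune=true` after a 413, and the `<input_dir>/<field>.yaml` output path come from the description above.

```python
from pathlib import Path
from typing import Callable

# Hypothetical transport: takes the observed values and a prune flag and
# returns (HTTP status code, response body). The real request format used
# against the taxonomy service is not shown here.
Fetch = Callable[[set[str], bool], tuple[int, str]]

PAYLOAD_TOO_LARGE = 413


def refresh_hierarchy_file(
    field: str,
    observed: set[str],
    previously_seen: set[str],
    fetch: Fetch,
    input_dir: Path,
) -> bool:
    """Write <input_dir>/<field>.yaml if new values appeared; return True if written."""
    # Skip-when-unchanged: this import cycle brought no new values for the field.
    if observed <= previously_seen:
        return False

    status, body = fetch(observed, False)
    if status == PAYLOAD_TOO_LARGE:
        # The full hierarchy was too large, so ask for a pruned one instead.
        status, body = fetch(observed, True)
    if status != 200:
        msg = f"hierarchy service returned HTTP {status} for field {field!r}"
        raise RuntimeError(msg)

    (input_dir / f"{field}.yaml").write_text(body, encoding="utf-8")
    return True
```

Passing the transport in as a callable keeps the retry logic testable without a running taxonomy service, which is one plausible way to cover the happy path and the skip path separately.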
1 parent 987fe18 commit 998ec20

25 files changed

Lines changed: 623 additions & 56 deletions

docs/src/content/docs/reference/helm-chart-config.mdx

Lines changed: 9 additions & 0 deletions
@@ -81,6 +81,15 @@ lineageSystemDefinitions:

 Note: Lineages should be determined by the preprocessing pipeline and not the user, set the lineage field to `noInput: true` to reflect this or alternatively only allow users to submit valid lineage values by adding an `options` list.

+### Hierarchical filter sources
+
+For metadata fields whose values form a hierarchy that's resolved at SILO import
+time (rather than fetched as a static file per pipeline version), set
+`hierarchicalFilter: <URL>`, where URL is the URL of the service that will create the
+hierarchical file for SILO. The importer collects observed
+values for the field, sends them to that service, and writes the returned YAML
+into SILO's input directory as `<MetadataName>.yaml`.
+
 ### Organism (type)

 Each organism object has the following fields:

kubernetes/loculus/templates/_common-metadata.tpl

Lines changed: 6 additions & 0 deletions
@@ -372,6 +372,12 @@ organisms:
 {{- if .lineageSystem }}
 lineageSearch: true
 {{- end }}
+{{- if .hierarchicalFilter }}
+hierarchicalSearch: true
+{{- end }}
+{{- if and .hierarchicalFilter .hierarchicalSearchLabel }}
+hierarchicalSearchLabel: {{ .hierarchicalSearchLabel }}
+{{- end }}
 {{- if .hideOnSequenceDetailsPage }}
 hideOnSequenceDetailsPage: {{ .hideOnSequenceDetailsPage }}
 {{- end }}
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+{{- define "loculus.hierarchicalFiltersForOrganism" -}}
+{{- $organism := . -}}
+{{- $schema := $organism.schema | include "loculus.patchMetadataSchema" | fromYaml }}
+{{- $filters := dict }}
+{{- range $entry := $schema.metadata }}
+{{- if hasKey $entry "hierarchicalFilter" }}
+{{- $_ := set $filters $entry.name $entry.hierarchicalFilter }}
+{{- end }}
+{{- end }}
+{{- $filters | toYaml -}}
+{{- end }}

kubernetes/loculus/templates/_siloDatabaseConfig.tpl

Lines changed: 4 additions & 3 deletions
@@ -1,12 +1,13 @@
 {{- define "loculus.siloDatabaseShared" -}}
 {{- $type := default "string" .type -}}
+{{- $lineageName := ternary .name (default "" .lineageSystem) (not (empty .hierarchicalFilter)) -}}
 - type: {{ ($type | eq "timestamp") | ternary "int" (($type | eq "authors") | ternary "string" $type) }}
-{{- if .generateIndex }}
+{{- if and .generateIndex (not $lineageName) }}
 generateIndex: {{ .generateIndex }}
 {{- end }}
-{{- if .lineageSystem }}
+{{- if $lineageName }}
 generateIndex: true
-generateLineageIndex: {{ .lineageSystem }} {{- /* must match the file name in the lineageDefinitionFilenames */}}
+generateLineageIndex: {{ $lineageName }} {{- /* must match the file name in the lineageDefinitionFilenames */}}
 {{- end }}
 {{- end }}

kubernetes/loculus/templates/lapis-silo-database-config.yaml

Lines changed: 8 additions & 3 deletions
@@ -5,6 +5,11 @@
 {{- $organismContent := $item.contents }}

 {{- $lineageSystem := $organismContent | include "loculus.lineageSystemForOrganism" | fromYamlArray }}
+{{- $hierarchicalFilters := $organismContent | include "loculus.hierarchicalFiltersForOrganism" | fromYaml }}
+{{- $allLineageNames := $lineageSystem }}
+{{- range $name, $_ := $hierarchicalFilters }}
+{{- $allLineageNames = append $allLineageNames $name }}
+{{- end }}
 ---
 apiVersion: v1
 kind: ConfigMap
@@ -24,10 +29,10 @@ data:
 outputDirectory: /preprocessing/output
 ndjsonInputFilename: data.ndjson.zst
 referenceGenomeFilename: reference_genomes.json
-{{- if $lineageSystem }}
+{{- if $allLineageNames }}
 lineageDefinitionFilenames:
-{{- range $ls := $lineageSystem }}
-- {{ printf "%s.yaml" $ls }}
+{{- range $name := $allLineageNames }}
+- {{ printf "%s.yaml" $name }}
 {{- end }}
 {{- end }}
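The two template changes above have to stay consistent: the name used for `generateLineageIndex` must match one of the `lineageDefinitionFilenames`. A rough Python rendering of that logic, assuming each metadata field is a plain dict and using a made-up lineage system name (`someLineageSystem`) purely for illustration, might look like this:

```python
from typing import Any


def lineage_index_name(field: dict[str, Any]) -> str:
    """Mirror of the `$lineageName` ternary: the field's own name when a
    hierarchical filter is set, otherwise the lineage system (or "")."""
    if field.get("hierarchicalFilter"):
        return field["name"]
    return field.get("lineageSystem") or ""


def lineage_definition_filenames(
    lineage_systems: list[str], hierarchical_filters: dict[str, str]
) -> list[str]:
    """Lineage systems first, then hierarchical filter field names, as <name>.yaml."""
    # Go templates visit map keys in sorted order, hence sorted() here.
    names = [*lineage_systems, *sorted(hierarchical_filters)]
    return [f"{name}.yaml" for name in names]


fields = [
    {"name": "lineage", "lineageSystem": "someLineageSystem"},
    {"name": "hostTaxonId", "hierarchicalFilter": "http://loculus-taxonomy-service:5000"},
    {"name": "country"},
]
filters = {f["name"]: f["hierarchicalFilter"] for f in fields if f.get("hierarchicalFilter")}
filenames = lineage_definition_filenames(["someLineageSystem"], filters)

# Every field that gets a lineage index should have a matching definition file.
for f in fields:
    name = lineage_index_name(f)
    assert not name or f"{name}.yaml" in filenames
```

For a hierarchical filter the file is named after the metadata field rather than after a shared lineage system, which matches the `<MetadataName>.yaml` convention in the documentation change.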

kubernetes/loculus/templates/silo-deployment.yaml

Lines changed: 5 additions & 0 deletions
@@ -5,6 +5,7 @@
 {{- $key := $item.key }}
 {{- $organismContent := $item.contents }}
 {{- $lineageSystem := $organismContent | include "loculus.lineageSystemForOrganism" | fromYamlArray }}
+{{- $hierarchicalFilters := $organismContent | include "loculus.hierarchicalFiltersForOrganism" | fromYaml }}
 ---
 apiVersion: apps/v1
 kind: Deployment
@@ -94,6 +95,10 @@ spec:
 - name: LINEAGE_DEFINITIONS
 value: {{ $defsBySystem | toJson | quote }}
 {{- end }}
+{{- if $hierarchicalFilters }}
+- name: HIERARCHICAL_FILTERS
+value: {{ $hierarchicalFilters | toJson | quote }}
+{{- end }}
 - name: SILO_RUN_TIMEOUT_SECONDS
 value: {{ $.Values.siloImport.siloTimeoutSeconds | quote }}
 - name: HARD_REFRESH_INTERVAL
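To make the shape of `HIERARCHICAL_FILTERS` concrete, here is a small Python sketch of what the helper template and the deployment do together. It assumes the metadata is already a list of dicts and skips the `loculus.patchMetadataSchema` step; the function names are invented for this example, and the exact whitespace of Helm's `toJson` output may differ from `json.dumps`.

```python
import json
from typing import Any, Optional


def hierarchical_filters_for_organism(metadata: list[dict[str, Any]]) -> dict[str, str]:
    """Collect {field name: service URL} for fields with a hierarchicalFilter key."""
    return {
        entry["name"]: entry["hierarchicalFilter"]
        for entry in metadata
        if "hierarchicalFilter" in entry
    }


def hierarchical_filters_env(metadata: list[dict[str, Any]]) -> Optional[str]:
    """JSON value for HIERARCHICAL_FILTERS, or None when the variable is omitted."""
    filters = hierarchical_filters_for_organism(metadata)
    return json.dumps(filters, sort_keys=True) if filters else None


metadata = [
    {"name": "hostTaxonId", "hierarchicalFilter": "http://loculus-taxonomy-service:5000"},
    {"name": "country"},
]
print(hierarchical_filters_env(metadata))
```

An organism without any hierarchical filter field yields no environment variable at all, which is why the importer has to treat a missing value as "feature disabled".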

kubernetes/loculus/values.schema.json

Lines changed: 28 additions & 0 deletions
@@ -232,6 +232,16 @@
       "groups": ["metadata"],
       "type": "string",
       "description": "Use this on string fields that contain lineages, if you want to enable searches that can include sublineages. The value needs to be a lineage system that is defined under the `lineageSystemDefinitions` key."
+    },
+    "hierarchicalFilter": {
+      "groups": ["metadata"],
+      "type": "string",
+      "description": "Use this on string fields whose values form a hierarchy resolved at SILO import time. The value should be an URL to the lineage file."
+    },
+    "hierarchicalSearchLabel": {
+      "groups": ["metadata"],
+      "type": "string",
+      "description": "CheckBox label for hierarchical search - only applies when using hierarchicalFilter."
     }
   },
   "required": ["name"],
@@ -290,6 +300,24 @@
       "errorMessage": "When lineageSystem is set, type must be 'string' or 'authors'"
     }
   },
+  {
+    "if": {
+      "properties": { "hierarchicalFilter": { "type": "string" } },
+      "required": ["hierarchicalFilter"]
+    },
+    "then": {
+      "properties": {
+        "type": { "enum": ["string", "authors"] }
+      },
+      "errorMessage": "When hierarchicalFilter is set, type must be 'string' or 'authors'"
+    }
+  },
+  {
+    "not": {
+      "description": "lineageSystem and hierarchicalFilter cannot both be set",
+      "required": ["lineageSystem", "hierarchicalFilter"]
+    }
+  },
   {
     "if": {
       "properties": { "rangeSearch": { "const": true } },

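The two schema rules added above can be read as the following plain-Python check. This is only a paraphrase for illustration: it does not use a JSON Schema validator, the function name is invented, and the error strings are taken from the schema.

```python
from typing import Any

ALLOWED_TYPES = {"string", "authors"}


def hierarchical_filter_errors(field: dict[str, Any]) -> list[str]:
    """Return the rule violations for one metadata field definition."""
    errors: list[str] = []
    if "hierarchicalFilter" not in field:
        return errors
    # JSON Schema `properties` only constrains keys that are present,
    # so a field without an explicit `type` passes the first rule.
    if "type" in field and field["type"] not in ALLOWED_TYPES:
        errors.append("When hierarchicalFilter is set, type must be 'string' or 'authors'")
    # `not: {required: [a, b]}` fails exactly when both keys are present.
    if "lineageSystem" in field:
        errors.append("lineageSystem and hierarchicalFilter cannot both be set")
    return errors


print(hierarchical_filter_errors(
    {"name": "hostTaxonId", "type": "int", "hierarchicalFilter": "http://example.invalid"}
))
```

The `not` / `required` combination is the usual JSON Schema idiom for mutual exclusion; it rejects a field only when both keys appear together.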
kubernetes/loculus/values.yaml

Lines changed: 7 additions & 4 deletions
@@ -1256,7 +1256,7 @@ defaultOrganismConfig: &defaultOrganismConfig
 hostNameScientific: hostNameScientific
 hostTaxonId: hostTaxonId
 args:
-taxonomy_service_url: "http://loculus-taxonomy-service:5000"
+taxonomy_service_url: &taxonomy_service_url http://loculus-taxonomy-service:5000
 - name: hostNameScientific
 displayName: Host name - scientific
 definition: "The scientific name of the host from which the sample was collected."
@@ -1276,7 +1276,7 @@ defaultOrganismConfig: &defaultOrganismConfig
 hostTaxonId: processed.hostTaxonId
 hostNameScientific: hostNameScientific
 args:
-taxonomy_service_url: "http://loculus-taxonomy-service:5000"
+taxonomy_service_url: *taxonomy_service_url
 - name: hostNameCommon
 displayName: Host name - common
 definition: "The common name of the host from which the sample was collected."
@@ -1295,7 +1295,7 @@ defaultOrganismConfig: &defaultOrganismConfig
 inputs:
 hostTaxonId: processed.hostTaxonId
 args:
-taxonomy_service_url: "http://loculus-taxonomy-service:5000"
+taxonomy_service_url: *taxonomy_service_url
 - name: isLabHost
 displayName: Is lab host
 definition: "If a laboratory host (e.g. cultured cell line) was used to propagate the sample."
@@ -1535,7 +1535,7 @@ defaultOrganismConfig: &defaultOrganismConfig
 - "prepro"
 replicas: 1
 taxonomyServiceArgs: &preprocessingTaxonomyServiceArgs
-taxonomy_service_url: "http://loculus-taxonomy-service:5000"
+taxonomy_service_url: *taxonomy_service_url
 configFile: &preprocessingConfigFile
 log_level: DEBUG
 batch_size: 100
@@ -1657,6 +1657,9 @@ defaultOrganisms:
 regex_pattern: '^(?:[^/]+/)?[^/]+/(?P<identifier>[^/]+)/\d{4}(?:-\d{2}){0,2}$'
 - <<: *hostTaxonIdConfig
 required: true
+hierarchicalFilter: *taxonomy_service_url
+hierarchicalSearchLabel: "include subtaxa"
+initiallyVisible: true
 - name: lineage
 displayName: Lineage
 definition: "Assigned clade label from the nearest reference tree node."

loculus-silo/README.md

Lines changed: 2 additions & 1 deletion
@@ -4,9 +4,10 @@ Python service that retrieves released datasets from the Loculus backend, transf
 for SILO, and runs SILO preprocessing directly.

 The importer downloads data from the Loculus backend, [transforms it into the format required by SILO](https://github.com/GenSpectrum/LAPIS-SILO/tree/main/tools/legacyNdjsonTransformer), and runs SILO preprocessing only if the following conditions hold:
+
 1. The data is in a valid format (e.g. ndjson format where each line is a valid json, has number of expected records, and the pipeline version exists and is a valid integer if a lineage definition is required).
 2. The lineage definitions file can be produced if it is required.
-3. The data has changed since the last download or it has been over more than `HARD_REFRESH_INTERVAL` since the last hard refresh. We determine if the data has changed from the header (e.g. 304 not modified) and by comparing a hash of the data.
+3. The data has changed since the last download or it has been over more than `HARD_REFRESH_INTERVAL` since the last hard refresh. We determine if the data has changed from the header (e.g. 304 not modified) and by comparing a hash of the data.

 ## Local development

loculus-silo/src/silo_import/config.py

Lines changed: 26 additions & 0 deletions
@@ -5,6 +5,9 @@
 from dataclasses import dataclass
 from pathlib import Path

+MetadataField = str
+HierarchicalServiceUrl = str
+

 @dataclass(frozen=True)
 class ImporterConfig:
@@ -16,6 +19,7 @@ class ImporterConfig:
     root_dir: Path
     silo_binary: Path
     preprocessing_config: Path
+    hierarchical_filters: dict[MetadataField, HierarchicalServiceUrl] | None = None

     @classmethod
     def from_env(cls) -> ImporterConfig:
@@ -47,6 +51,8 @@ def from_env(cls) -> ImporterConfig:
         except TypeError as exc:
             raise RuntimeError(str(exc)) from exc

+        hierarchical_filters = _parse_hierarchical_filters(env.get("HIERARCHICAL_FILTERS"))
+
         hard_refresh_interval = int(env.get("HARD_REFRESH_INTERVAL", "3600"))
         poll_interval = int(env.get("SILO_IMPORT_POLL_INTERVAL_SECONDS", "30"))
         silo_run_timeout = int(env.get("SILO_RUN_TIMEOUT_SECONDS", "3600"))
@@ -66,8 +72,28 @@ def from_env(cls) -> ImporterConfig:
             root_dir=root_dir,
             silo_binary=silo_binary,
             preprocessing_config=preprocessing_config,
+            hierarchical_filters=hierarchical_filters,
         )

     @property
     def released_data_endpoint(self) -> str:
         return f"{self.backend_base_url}/get-released-data?compression=zstd"
+
+
+def _parse_hierarchical_filters(
+    raw: str | None,
+) -> dict[MetadataField, HierarchicalServiceUrl] | None:
+    if not raw:
+        return None
+
+    try:
+        data = json.loads(raw)
+    except json.JSONDecodeError as exc:
+        msg = "HIERARCHICAL_FILTERS must be valid JSON"
+        raise RuntimeError(msg) from exc
+
+    if not isinstance(data, dict):
+        msg = f"HIERARCHICAL_FILTERS must be a JSON object, got: {raw}"
+        raise TypeError(msg)
+
+    return data or None
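The edge cases of the new parser are easier to see when exercised directly. The snippet below restates the function from the diff (renamed to `parse_hierarchical_filters` and without the type aliases so that it runs on its own) and adds a hypothetical `outcome` helper that is not part of the commit.

```python
import json
from typing import Optional


def parse_hierarchical_filters(raw: Optional[str]) -> Optional[dict[str, str]]:
    """Standalone restatement of _parse_hierarchical_filters from the diff."""
    if not raw:
        return None
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise RuntimeError("HIERARCHICAL_FILTERS must be valid JSON") from exc
    if not isinstance(data, dict):
        raise TypeError(f"HIERARCHICAL_FILTERS must be a JSON object, got: {raw}")
    return data or None


def outcome(raw: Optional[str]) -> str:
    """Illustration only: describe the result or the exception type for `raw`."""
    try:
        return repr(parse_hierarchical_filters(raw))
    except (RuntimeError, TypeError) as exc:
        return type(exc).__name__


samples = (
    None,
    "",
    "{}",
    '{"hostTaxonId": "http://loculus-taxonomy-service:5000"}',
    "[1]",
    "not json",
)
for raw in samples:
    print(f"{raw!r} -> {outcome(raw)}")
```

An unset variable, an empty string, and an empty JSON object all collapse to `None`, so downstream code only needs a single "no hierarchical filters" check. Malformed JSON surfaces as `RuntimeError` while a non-object JSON value surfaces as `TypeError`; whether callers handle both is not visible in this diff.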
