Upgrade Elasticsearch to OpenSearch 3.x #3576

@davidsmejia

Goal

Move off Elasticsearch 6.3 (the version aws_elasticsearch_domain.es currently runs) and land on OpenSearch 3.x as the long-term target. Update the AWS Terraform resource to aws_opensearch_domain. Migrate the Python client stack from the legacy elasticsearch-py 6.x line to opensearch-py. Migrate the Django integration (django-elasticsearch-dsl + django-elasticsearch-dsl-drf) to whatever OpenSearch-compatible equivalents exist.

This is a multi-step, multi-PR effort. The actual landing version may be OpenSearch 2.x as an intermediate step, with 3.x as the eventual target. Discovery during implementation will pin that down.

Prerequisite

Land the Django 5.2 upgrade first (see upgrade-django-to-5-2.md). The Django upgrade will already have flushed out most of the django-elasticsearch-dsl / djangorestframework compatibility constraints. This issue then focuses purely on the server-side jump and the client-side rewrite.

If the Django upgrade discovery surfaces that the ES stack cannot move forward without first upgrading ES, the two issues may need to swap order or merge - that is itself a discovery item below.

Why

  • AWS renamed Amazon Elasticsearch Service to Amazon OpenSearch Service in 2021. The aws_elasticsearch_domain Terraform resource is the legacy path; new domains created today use aws_opensearch_domain.
  • ES 6.x is end-of-life. No security patches, no new features.
  • The 6.5.0 pin on django-elasticsearch-dsl and the version coupling to elasticsearch==6.8.2 / elasticsearch-dsl==6.4.0 mean we cannot move Django ecosystem packages forward in any meaningful way without addressing this first.
  • ES 6.x predates the removal of mapping types. The current documents.py and query code likely encodes assumptions (e.g. _type, mapping shape) that were deprecated in ES 7.0 (2019) and removed outright in ES 8.0 and OpenSearch 2.0.

Current state

  • Server: aws_elasticsearch_domain.es with elasticsearch_version = "6.3", instance type m5.large.elasticsearch, single node, 10 GB EBS volume. Defined at infrastructure/instances.tf:32-78.
  • Python clients: elasticsearch==6.8.2, elasticsearch-dsl==6.4.0 (resolved in api/requirements.txt:29-30).
  • Django integration: django-elasticsearch-dsl==6.5.0, django-elasticsearch-dsl-drf==0.22.5 (pulled in via api/, common/, foreman/).
  • Document definitions: common/data_refinery_common/models/documents.py.
  • Index management command: api/data_refinery_api/management/commands/update_es_index.py.
  • Heaviest query/filter surface: api/data_refinery_api/views/experiment_document.py (faceted search, term aggregations, DocumentViewSet).
  • bin/rbio es:rebuild and es:reset exist for local index management.

Target landing point

OpenSearch 3.x server-side, opensearch-py + opensearch-dsl client-side, with the Django integration migrated to whichever maintained package provides DRF-compatible document/search views against OpenSearch.

Realistic phased path:

  1. ES 6.3 -> ES 6.8 (minor bump within AWS). Smoke test that nothing else breaks.
  2. ES 6.8 -> ES 7.10 (7.x deprecates mapping types and makes the typeless APIs the default). Likely the most invasive client code change.
  3. ES 7.10 -> OpenSearch 1.x (fork point; drop-in for ES 7.10 wire compat).
  4. OpenSearch 1.x -> 2.x (in-place upgrade supported by AWS).
  5. OpenSearch 2.x -> 3.x (in-place upgrade supported by AWS).

Whether each of these gets its own PR or they are batched depends on discovery findings.
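
The phased path above can be sanity-checked mechanically once the real compatibility data is in hand. The following is a minimal sketch, assuming a source-to-targets map shaped like the CompatibleVersions list that boto3's get_compatible_versions call returns for a domain. The ASSUMED_COMPATIBLE table and its version strings are illustrative placeholders, not confirmed AWS data, and both function names are hypothetical.

```python
from collections import deque

# Illustrative placeholder only: source version -> versions reachable in one
# in-place upgrade. Replace with the real data reported by AWS for the domain.
ASSUMED_COMPATIBLE = {
    "Elasticsearch_6.3": ["Elasticsearch_6.8"],
    "Elasticsearch_6.8": ["Elasticsearch_7.10", "OpenSearch_1.3"],
    "Elasticsearch_7.10": ["OpenSearch_1.3"],
    "OpenSearch_1.3": ["OpenSearch_2.19"],
    "OpenSearch_2.19": ["OpenSearch_3.1"],
}


def compat_map_from_response(response):
    """Flatten a get_compatible_versions-style response into a plain dict."""
    return {
        entry["SourceVersion"]: list(entry["TargetVersions"])
        for entry in response.get("CompatibleVersions", [])
    }


def shortest_upgrade_path(compatible, source, target):
    """Breadth-first search for the fewest in-place hops from source to target."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in compatible.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no in-place path found; snapshot-restore may be required
```

A None result is the signal that an in-place route does not exist for the given map, which is exactly the case where the snapshot-restore fallback would need planning.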

Discovery items

Things that need to be answered during implementation, captured here so the implementer does not skip them:

  • Is the AWS managed domain already OpenSearch-engine under the hood? The assumption that AWS auto-converted some ES domains is unverified, so check rather than assume: read the actual engine version of the running prod domain (e.g. via aws opensearch describe-domain). It may already be "OpenSearch_x.y" rather than "Elasticsearch_6.3". If so, the server-side migration is partially done and the work is mostly client-side.
  • Does any current code touch _type / mapping types directly? Grep documents.py, the views in api/data_refinery_api/views/, and update_es_index.py for doc_type, _type, DocType. Anything found needs a 6 -> 7 mapping cleanup.
  • What is the actual index size and document count in prod? Determines whether a reindex is minutes or hours and informs the cutover strategy (in-place reindex vs blue/green index alias swap).
  • Is django-elasticsearch-dsl maintained for OpenSearch? Check the upstream repo activity and OpenSearch compatibility. If not, candidates to evaluate: django-opensearch-dsl (community fork), or hand-rolling the Document layer against opensearch-py directly.
  • Is django-elasticsearch-dsl-drf maintained? This is the package with the heaviest API surface for us (faceted search, filter backends, document viewsets). If unmaintained, the DRF-side query and filter code in experiment_document.py may need a manual rewrite.
  • What does the AWS-managed-OpenSearch upgrade path look like for our cluster size and version? Check that 6.3 is on a supported in-place upgrade path. If not, a snapshot-restore migration may be required.
  • Single-node EBS 10GB - is that the prod sizing or just dev? A reindex during cutover may need more headroom temporarily. Confirm prod sizing before planning.
  • Query parity testing strategy? Captured production query patterns are not in the repo. Decide whether to record-and-replay production queries against the new cluster, or rely on the existing test suite plus manual validation against the live API.
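
The grep for mapping-type usage in the list above can be scripted so the result is reproducible and can be attached to the discovery summary. This is a minimal sketch: the three tokens come from the discovery item, while the function names and the report shape are assumptions made for illustration.

```python
import re
from pathlib import Path

# Tokens named in the discovery item. The word boundaries keep "_type" from
# matching inside longer identifiers such as "content_type".
MAPPING_TYPE_PATTERN = re.compile(r"\b(doc_type|_type|DocType)\b")


def find_mapping_type_usages(source_text):
    """Return (line_number, token, stripped_line) for every hit in the text."""
    hits = []
    for number, line in enumerate(source_text.splitlines(), start=1):
        for match in MAPPING_TYPE_PATTERN.finditer(line):
            hits.append((number, match.group(1), line.strip()))
    return hits


def scan_paths(paths):
    """Scan files on disk and return {path: hits} for files with any hit."""
    report = {}
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        hits = find_mapping_type_usages(text)
        if hits:
            report[str(path)] = hits
    return report
```

Calling scan_paths with documents.py, update_es_index.py, and the files under api/data_refinery_api/views/ would cover the locations the item names. An empty report suggests the 6 -> 7 mapping cleanup is small, though it does not rule out type assumptions hidden in index templates or stored queries.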

Scope of changes

Infrastructure

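  • Replace aws_elasticsearch_domain.es with aws_opensearch_domain.os (or rename inline). Update the engine_version to the chosen OpenSearch version; aws_opensearch_domain takes engine_version in place of elasticsearch_version, and its instance types use the .search suffix instead of .elasticsearch. Because the resource type changes, Terraform will plan a destroy-and-recreate unless the existing domain is moved in state first (terraform state rm on the old address, then terraform import on the new one).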
  • Update all references in infrastructure/instances.tf, infrastructure/variables.tf, infrastructure/networking.tf that name the resource.
  • Update Terraform outputs (elasticsearch_endpoint, etc.) and any deploy-time scripts that read them.

Python deps

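  • Replace elasticsearch==6.8.2 / elasticsearch-dsl==6.4.0 with opensearch-py / opensearch-dsl. opensearch-py 2.2.0 and later bundle the DSL and the standalone opensearch-dsl package is deprecated, so opensearch-py alone may be sufficient.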
  • Replace django-elasticsearch-dsl and django-elasticsearch-dsl-drf with their OpenSearch equivalents (decision pending discovery).
  • Regenerate requirements.txt across api/, common/, foreman/.
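
After regenerating, each requirements.txt can be checked for leftover legacy pins. The sketch below assumes pip-style requirement lines; the package set is taken from this issue, and the helper names are hypothetical.

```python
import re

# Legacy packages named in this issue, in normalized (lowercase, hyphen) form.
LEGACY_PACKAGES = {
    "elasticsearch",
    "elasticsearch-dsl",
    "django-elasticsearch-dsl",
    "django-elasticsearch-dsl-drf",
}


def normalize(name):
    """Normalize a distribution name so '_', '.' and '-' compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def legacy_pins(requirements_text):
    """Return the requirement lines that still reference a legacy package."""
    found = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments such as "# via ..."
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match and normalize(match.group(0)) in LEGACY_PACKAGES:
            found.append(line)
    return found
```

Running this over the files in api/, common/, and foreman/ and failing CI on a non-empty result would turn the dependency acceptance criterion into an automated check.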

Django code

  • common/data_refinery_common/models/documents.py: rewrite imports against the chosen OpenSearch DSL package; remove mapping-type assumptions.
  • api/data_refinery_api/views/experiment_document.py: rewrite all django_elasticsearch_dsl_drf imports and filter backends.
  • api/data_refinery_api/management/commands/update_es_index.py: rewrite against the new registry API.
  • api/data_refinery_api/settings.py: update INSTALLED_APPS entries.
  • bin/lib/es.py (es:rebuild, es:reset commands): update any client-specific paths.

Data migration

  • Define and document the cutover plan: in-place AWS-managed upgrade, blue/green via index alias swap, or snapshot-restore.
  • Validate all documents reindex cleanly and counts match pre/post.
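
For the blue/green option, the alias swap itself is a single request against the _aliases API, which applies all listed actions atomically. The sketch below builds that request body. It assumes the application reads through an alias (this issue does not confirm that), and the alias and index names in the usage note are hypothetical. The client argument is assumed to expose indices.update_aliases(body=...), as opensearch-py and elasticsearch-py clients do.

```python
def alias_swap_actions(alias, old_index, new_index):
    """Build an _aliases request body that repoints alias in one atomic step."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def swap_alias(client, alias, old_index, new_index):
    """Send the swap through a client exposing indices.update_aliases."""
    body = alias_swap_actions(alias, old_index, new_index)
    return client.indices.update_aliases(body=body)
```

For example, swap_alias(client, "experiments", "experiments_v1", "experiments_v2") would move readers to the rebuilt index without a window in which the alias resolves to nothing. The old index stays on disk until it is deleted explicitly, so storage headroom still needs to cover both copies.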

Risks

  • The mapping-types removal between ES 6 and 7 is the single largest source of code churn. Any code that selects, filters, or aggregates by _type needs to be rewritten.
  • The Django integration packages are community-maintained with varying activity. If django-elasticsearch-dsl-drf has no OpenSearch successor in a usable state, the DRF filter / facet surface in experiment_document.py will need to be hand-rolled.
  • Multi-step server upgrades on AWS managed clusters can have multi-hour blue/green migrations per step. Plan for read-mostly windows during cutover.
  • Reindex storage spike: the new index lives alongside the old during cutover. Confirm EBS headroom or temporarily bump the cluster size.
  • Query semantics around scoring, analyzers, and tokenizers can shift subtly between versions. The analyzer and token_filter config in documents.py should be reviewed against the target version's defaults.

Acceptance criteria

  • Terraform aws_elasticsearch_domain.es replaced with aws_opensearch_domain running OpenSearch 3.x (or the agreed phased target).
  • elasticsearch-py / elasticsearch-dsl-py no longer present in any requirements.txt; opensearch-py / opensearch-dsl in their place.
  • Django integration packages migrated; no remaining references to django_elasticsearch_dsl or django_elasticsearch_dsl_drf.
  • bin/rbio es:rebuild and es:reset work against the new cluster locally.
  • Document counts post-reindex match counts pre-reindex.
  • API endpoints exercised by experiment_document.py return equivalent results pre/post migration on a representative sample of production query patterns.
  • All test suites pass (bin/rbio test:common, test:api, test:foreman, test:workers).
  • No references to elasticsearch_version, _type, doc_type (other than archived migration files) remain in the repo.
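
The count-parity criterion above lends itself to a small scripted check. This is a minimal sketch, assuming counts are captured as {logical_index_name: document_count} before and after the reindex. The function names are hypothetical, and snapshot_counts assumes a client whose count(index=...) call returns a dict with a "count" key, as opensearch-py and elasticsearch-py clients do.

```python
def snapshot_counts(client, index_names):
    """Collect {index_name: document_count} using the client's count API."""
    return {name: client.count(index=name)["count"] for name in index_names}


def count_mismatches(before, after):
    """Return {name: (pre, post)} for every index whose counts differ.

    An index present on only one side is reported with None for the other.
    """
    mismatches = {}
    for name in sorted(set(before) | set(after)):
        pre, post = before.get(name), after.get(name)
        if pre != post:
            mismatches[name] = (pre, post)
    return mismatches
```

An empty result from count_mismatches is the pass condition. If the cutover renames physical indices, the snapshots should be keyed by alias or logical name so that a rename alone is not reported as a mismatch.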

Suggested sequencing

  1. Discovery pass: run through the discovery items above and produce a one-page summary. Decide whether to do a single big-bang migration to OpenSearch 3.x or phase it.
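  2. Client-side rewrite first, against the existing 6.3 cluster: get the code talking through an abstracted client so the server-side switch is as close to a config change as possible. elasticsearch-py 6.x has no compatibility mode, and opensearch-py is not expected to work against an ES 6.x server, so the actual package swap probably has to wait until the server reaches ES 7.10 or OpenSearch 1.x.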
  3. Server-side phased upgrades: ES 6.3 -> 6.8 -> 7.10 -> OS 1.x -> 2.x -> 3.x, one PR per step, with smoke tests in between.
  4. Cleanup PR to remove any remaining ES branding, comments, and Terraform output names that still say "elasticsearch".
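
The abstracted client mentioned in step 2 could be as small as a single seam that application code depends on instead of a client package. The sketch below is one possible shape, not the project's design: SearchBackend, ClientBackend, matching_ids, and the title field are all hypothetical. It deliberately ignores the DSL and django-elasticsearch-dsl layers, which would need their own seam. The only library behavior it relies on is that elasticsearch-py and opensearch-py clients both accept search(index=..., body=...).

```python
from typing import Any, Dict, List, Protocol


class SearchBackend(Protocol):
    """Hypothetical seam: the only search call application code may use."""

    def search(self, index: str, body: Dict[str, Any]) -> Dict[str, Any]:
        ...


class ClientBackend:
    """Wraps any client object exposing search(index=..., body=...)."""

    def __init__(self, client: Any) -> None:
        self._client = client

    def search(self, index: str, body: Dict[str, Any]) -> Dict[str, Any]:
        return self._client.search(index=index, body=body)


def matching_ids(backend: SearchBackend, index: str, term: str) -> List[str]:
    """Example caller: depends on the seam, not on a client package."""
    body = {"query": {"match": {"title": term}}}  # "title" is a placeholder
    response = backend.search(index=index, body=body)
    return [hit["_id"] for hit in response["hits"]["hits"]]
```

With a seam like this, swapping packages means changing which client object is passed to ClientBackend, and tests can pass a stub instead of a live cluster.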

References

  • infrastructure/instances.tf:32-78 (AWS ES domain definition)
  • common/data_refinery_common/models/documents.py (Document models)
  • api/data_refinery_api/views/experiment_document.py (DRF integration, biggest query surface)
  • api/data_refinery_api/management/commands/update_es_index.py (index management)
  • bin/lib/es.py (rbio es:* commands)
  • Related upgrade work: .workspace/upgrade-django-to-5-2.md, .workspace/upgrade-base-image-from-ubuntu-18.md
