HDDS-14913. Implement Scalable CSV Export for Unhealthy Containers in Recon UI.#10162

Draft
ArafatKhan2198 wants to merge 1 commit into apache:master from ArafatKhan2198:csvExport2
Conversation


@ArafatKhan2198 commented Apr 30, 2026

What changes were proposed in this pull request?

ExportJob

A small record of one export run: who asked (userId), which unhealthy type (state), the job id, the current stage (queued / running / done / failed), how many rows have been written so far, an estimate of the total rows (used to compute a percentage), where the TAR file will live, any error text, and the queue position while the job is waiting. Think of it as the status object the API returns to the UI.
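The status object described above could be sketched roughly as below; the field and method names are assumptions based on this description, not the exact names in the patch.

```java
// Hypothetical sketch of the ExportJob status object; names are illustrative.
import java.util.UUID;

public class ExportJob {
  public enum Stage { QUEUED, RUNNING, DONE, FAILED }

  private final String jobId = UUID.randomUUID().toString();
  private final String userId;        // who asked for the export
  private final String state;         // which unhealthy type, e.g. "MISSING"
  private volatile Stage stage = Stage.QUEUED;
  private volatile long rowsWritten;
  private volatile long estimatedTotalRows;
  private volatile int queuePosition; // position while waiting
  private volatile String tarPath;    // where the TAR file will live
  private volatile String errorText;

  public ExportJob(String userId, String state) {
    this.userId = userId;
    this.state = state;
  }

  /** Progress percentage derived from rows written vs. the estimated total. */
  public int progressPercent() {
    if (estimatedTotalRows <= 0) {
      return 0;
    }
    return (int) Math.min(100, rowsWritten * 100 / estimatedTotalRows);
  }

  public void setEstimatedTotalRows(long rows) { this.estimatedTotalRows = rows; }
  public void addRowsWritten(long rows) { this.rowsWritten += rows; }
  public Stage getStage() { return stage; }
  public void setStage(Stage stage) { this.stage = stage; }
}
```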

ExportJobManager

The brain of the feature. It accepts new export requests, queues them (up to a fixed maximum), runs them one at a time on a background thread, reads unhealthy rows from the DB in a streaming fashion, writes CSV files in chunks (e.g. 500k rows per file), packs them into a TAR, deletes the temporary CSV folders, and updates the job's status. It also handles cancel (stop work, clean up) and shutdown when Recon stops.
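The accept/queue behaviour can be sketched with a single-threaded executor over a bounded queue; this is a minimal illustration of the idea, assuming one worker thread and a fixed cap on queued jobs, with illustrative names rather than the exact ones in the patch.

```java
// Minimal sketch: one worker thread + a bounded queue that rejects when full.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ExportJobManager {
  private final ExecutorService worker;

  public ExportJobManager(int maxQueuedJobs) {
    // A single worker thread so export DB reads never overlap; the bounded
    // queue enforces the global cap on pending jobs.
    this.worker = new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
        new LinkedBlockingQueue<>(maxQueuedJobs));
  }

  /** Returns false when the queue is full; the caller maps that to HTTP 429. */
  public boolean trySubmit(Runnable exportTask) {
    try {
      worker.execute(exportTask);
      return true;
    } catch (RejectedExecutionException queueFull) {
      return false;
    }
  }

  public void shutdown() {
    worker.shutdownNow();
  }
}
```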

ContainerEndpoint

The HTTP layer for containers. For exports it exposes: start an export (POST), check status (GET with job id), download the finished TAR (GET … /download), and cancel (DELETE). It checks that state is valid, calls ExportJobManager, and turns “queue full” into a 429 Too Many Requests style response.
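The status-code mapping the endpoint performs might look like the following sketch; plain ints stand in for a JAX-RS Response, and the method name is an assumption for illustration.

```java
// Illustrative mapping from export-manager outcomes to HTTP status codes.
public class ExportHttpMapper {
  public static int startExportStatus(boolean validState, boolean accepted) {
    if (!validState) {
      return 400; // unknown unhealthy container state
    }
    // 202: job accepted and queued (poll the status endpoint);
    // 429: queue is full -> Too Many Requests.
    return accepted ? 202 : 429;
  }
}
```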

ContainerHealthSchemaManager

This class already knows how to talk to Recon's Derby unhealthy-container data. New pieces: getUnhealthyContainersCount ("how many rows will this export roughly have?", used for progress) and getUnhealthyContainersCursor (a streaming cursor over the rows for a given unhealthy state, so the export doesn't load millions of rows into memory at once).
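The streaming idea behind such a cursor can be sketched generically: pull rows in fetch-size batches instead of materialising the whole result set. The page-fetch function below stands in for the real Derby/jOOQ cursor, and all names are illustrative.

```java
// Sketch: an iterator that pulls rows page-by-page via a (offset, limit) fetch.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.BiFunction;

public class BatchedCursor<T> implements Iterator<T> {
  private final BiFunction<Long, Integer, List<T>> fetchPage; // (offset, limit) -> rows
  private final int fetchSize;
  private final Deque<T> buffer = new ArrayDeque<>();
  private long offset;
  private boolean exhausted;

  public BatchedCursor(BiFunction<Long, Integer, List<T>> fetchPage, int fetchSize) {
    this.fetchPage = fetchPage;
    this.fetchSize = fetchSize;
  }

  @Override
  public boolean hasNext() {
    if (buffer.isEmpty() && !exhausted) {
      // Pull the next page; a short page means the result set is finished.
      List<T> page = fetchPage.apply(offset, fetchSize);
      offset += page.size();
      buffer.addAll(page);
      exhausted = page.size() < fetchSize;
    }
    return !buffer.isEmpty();
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return buffer.poll();
  }
}
```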

containers.tsx (Recon UI)

The Containers page: adds an "Export CSV" button for the currently selected unhealthy tab. It starts the job, polls every few seconds for status, shows the queued / running progress, and offers the finished TAR for download once the job completes.

Config keys (in ReconServerConfigKeys)

| Config key | Default | Purpose |
| --- | --- | --- |
| `ozone.recon.export.directory` | `/tmp/recon/exports` | Directory under which finished TAR archives (and per-job temp CSV dirs) are created. |
| `ozone.recon.export.worker.threads` | 1 | Intended to control how many background worker threads run exports (comment in the patch: keep DB access from overlapping). |
| `ozone.recon.export.max.jobs.total` | 10 | Intended global cap on how many export jobs can be queued / accepted before new ones are rejected. |
| `ozone.recon.unhealthy.container.fetch.size` | 10000 | Intended JDBC fetch size (rows per round-trip) when streaming unhealthy rows for export. |
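For reviewers, these keys would presumably be set in ozone-site.xml like other Recon settings; the snippet below just restates two of the defaults from the table and is illustrative, not taken from the patch.

```xml
<property>
  <name>ozone.recon.export.directory</name>
  <value>/tmp/recon/exports</value>
</property>
<property>
  <name>ozone.recon.unhealthy.container.fetch.size</name>
  <value>10000</value>
</property>
```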

Not a config (hardcoded in ExportJobManager)

| Constant | Value | Purpose |
| --- | --- | --- |
| CSV chunk size | 500,000 | Max rows per CSV file before starting the next partNNN.csv inside the TAR. |

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-14913

How was this patch tested?

Log output from a test run:


```
2026-04-30 09:46:48,962 [pool-56-thread-1] INFO api.ExportJobManager: Starting export job ac16b513-f3f0-4e2d-a124-f208155697c3
2026-04-30 09:46:54,625 [pool-56-thread-1] INFO api.ExportJobManager: Export job ac16b513-f3f0-4e2d-a124-f208155697c3 will process approximately 3040000 records
2026-04-30 09:46:54,628 [pool-56-thread-1] INFO api.ExportJobManager: Created CSV file: part1
2026-04-30 09:47:28,413 [pool-56-thread-1] INFO api.ExportJobManager: Created CSV file: part2
2026-04-30 09:47:57,420 [pool-56-thread-1] INFO api.ExportJobManager: Created CSV file: part3
2026-04-30 09:47:58,876 [pool-56-thread-1] INFO api.ExportJobManager: Created CSV file: part4
2026-04-30 09:48:00,646 [pool-56-thread-1] INFO api.ExportJobManager: Created CSV file: part5
2026-04-30 09:48:02,488 [pool-56-thread-1] INFO api.ExportJobManager: Created CSV file: part6
2026-04-30 09:48:04,261 [pool-56-thread-1] INFO api.ExportJobManager: Created CSV file: part7
2026-04-30 09:48:04,429 [pool-56-thread-1] INFO api.ExportJobManager: Export job ac16b513-f3f0-4e2d-a124-f208155697c3 wrote 3040000 records across 7 files
2026-04-30 09:48:05,730 [pool-56-thread-1] INFO api.ExportJobManager: Created TAR archive: /tmp/recon/exports/export_missing_webui_ac16b513.tar
2026-04-30 09:48:05,755 [pool-56-thread-1] INFO api.ExportJobManager: Deleted temporary CSV files for job ac16b513-f3f0-4e2d-a124-f208155697c3
2026-04-30 09:48:05,755 [pool-56-thread-1] INFO api.ExportJobManager: Completed export job ac16b513-f3f0-4e2d-a124-f208155697c3 (3040000 records)
```
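As a sanity check on the run above: 3,040,000 records at the hardcoded 500,000-row chunk size should yield ceil(3040000 / 500000) = 7 part files, which matches the part1..part7 lines in the log. The helper name below is illustrative.

```java
// Ceiling division relating total rows and chunk size to the part-file count.
public class ChunkMath {
  public static int partFiles(long totalRows, long rowsPerChunk) {
    return (int) ((totalRows + rowsPerChunk - 1) / rowsPerChunk);
  }
}
```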
Screen recording: video2985213205.mp4

@devmadhuu devmadhuu self-requested a review April 30, 2026 10:17
@devmadhuu

@ArafatKhan2198 as discussed, please design the solution server-side for a single Recon user. We don't have user-based logins in Recon, and we should not localize the job-progress logic in the browser: all browser windows opened on multiple machines showing the Recon page should see the same job and its progress. Only one job should be allowed to run at a time, and the remaining ones should go into the queue.
