Feature/gcs bronze sync databricks api #239

Draft
chapmanhk wants to merge 4 commits into develop from feature/gcs-bronze-sync-databricks-api

chapmanhk commented May 18, 2026

feat(data): trigger Databricks GCS→bronze sync after file validation (Edvise/Legacy)

Description

After a successful file validation (POST .../input/validate-upload/{file_name} and SFTP validate path), the API schedules a non-blocking background task that starts the Databricks job edvise_validated_gcs_to_bronze_sync to copy the object from GCS validated/ into the institution's bronze volume (gcs_uploads). Validation and batch creation are unchanged; sync failures are logged and do not fail the validation response.
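
The "log the failure, never fail validation" behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `run_sync_in_background` and the `trigger` callable are hypothetical names, and the broad `except` is a simplification (the PR itself catches specific exceptions).

```python
import json
import logging
from typing import Callable

logger = logging.getLogger("gcs_bronze_sync_sketch")


def run_sync_in_background(trigger: Callable[[], int], correlation_id: str) -> dict:
    """Run the trigger, log the outcome, and never raise to the caller.

    `trigger` is a hypothetical zero-argument callable that starts the
    Databricks job and returns its run ID.
    """
    record = {
        "event": "gcs_bronze_sync_background_done",
        "correlation_id": correlation_id,
    }
    try:
        record["databricks_job_run_id"] = trigger()
        record["outcome"] = "success"
    except Exception as exc:  # broad on purpose in this sketch only
        record["outcome"] = "trigger_failed"
        record["error"] = str(exc)
    # One JSON object per log line so the outcome can be searched later.
    logger.info(json.dumps(record))
    return record
```

In a FastAPI handler, a function like this would typically be scheduled with `background_tasks.add_task(run_sync_in_background, trigger, correlation_id)`, so the validation response is returned before the Databricks call starts.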

Behavior

  • Runs only for institutions with edvise_id or legacy_id (PDP-only institutions are skipped).
  • Uses existing Databricks auth (DATABRICKS_HOST_URL, GCP service account).
  • Resolves the job by optional DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID, otherwise by job name (with duplicate-name detection).
  • Structured JSON trace logs: validation_request, gcs_bronze_sync_background_start, gcs_bronze_sync_background_done with outcome (success | trigger_failed | skipped) and correlation_id for cross-log lookup.
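
For reviewers, the job-resolution rule in the list above (explicit job ID first, otherwise a unique match by name) could look something like the sketch below. The function name and the shape of the `jobs` records are assumptions made for illustration; the real lookup lives in src/webapp/databricks.py and may differ.

```python
from typing import Any, Iterable, Mapping, Optional

JOB_NAME = "edvise_validated_gcs_to_bronze_sync"


def resolve_job_id(
    configured_job_id: Optional[str],
    jobs: Iterable[Mapping[str, Any]],
    job_name: str = JOB_NAME,
) -> int:
    """Prefer the configured ID; otherwise match exactly one job by name.

    `jobs` is a hypothetical list of {"job_id": ..., "name": ...} records,
    standing in for whatever the Databricks jobs listing returns.
    """
    if configured_job_id and configured_job_id.strip():
        return int(configured_job_id)  # raises ValueError if not numeric

    matches = [int(j["job_id"]) for j in jobs if j.get("name") == job_name]
    if not matches:
        raise ValueError(f"Job named {job_name!r} not found")
    if len(matches) > 1:
        # Duplicate names are ambiguous, so refuse to guess.
        raise ValueError(f"Multiple jobs named {job_name!r}: {sorted(matches)}")
    return matches[0]
```

Setting DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID skips the name lookup entirely, which is why it is the escape hatch if a workspace ever ends up with two jobs of the same name.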

New / updated

  • src/webapp/databricks.py: run_validated_gcs_to_bronze_sync, job resolution, bundle-aligned job parameters
  • src/webapp/routers/data.py: BackgroundTasks hook in validation_helper
  • src/webapp/databricks_test.py, src/webapp/routers/data_test.py
  • src/webapp/.env.example: documents optional DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID

Kill switch: ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION=false (default: enabled).
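
A rough sketch of how the kill switch and the institution check might combine into a single skip decision. The false-like spellings and the reason strings are assumptions for illustration only; the PR's actual parsing and skip reasons may differ.

```python
from typing import Mapping, Optional

ENV_FLAG = "ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION"
_FALSE_VALUES = {"0", "false", "no", "off"}  # assumed spellings


def sync_enabled(env: Mapping[str, str]) -> bool:
    """Enabled unless the flag is explicitly set to a false-like value."""
    return env.get(ENV_FLAG, "true").strip().lower() not in _FALSE_VALUES


def skip_reason(
    env: Mapping[str, str],
    edvise_id: Optional[str],
    legacy_id: Optional[str],
) -> Optional[str]:
    """Return a (hypothetical) skip reason, or None when the sync should run."""
    if not sync_enabled(env):
        return "disabled_by_env"
    if not (edvise_id or legacy_id):
        return "pdp_only_institution"
    return None
```

Passing `os.environ` as `env` gives the runtime behavior; passing a plain dict keeps the check easy to unit test.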

Deployment Readiness*

Testing

Describe or check:

  • Created or updated unit, feature, and/or integration tests
  • Typical manual testing in the local env browser, dev pipeline, etc.

Automated: databricks_test.py (job ID resolution, run_now params contract); data_test.py (Edvise/Legacy trigger paths, PDP-only skip, env disabled).

Manual (dev): Deployed the feature branch to dev and validated an upload as a Legacy institution. Validation succeeded; the background task logged gcs_bronze_sync_background_start, then outcome: trigger_failed with "Job named 'edvise_validated_gcs_to_bronze_sync' not found". This failure is expected until the edvise bundle (pipelines/ingestion/shared) is deployed to the dev Databricks workspace. Will re-test after the bundle deploy to confirm outcome: success and databricks_job_run_id in the logs.

Deployment Notes

Describe or check:

  • No special deployment steps required
  • Special deployment steps required (see below)
  1. edvise-api: Deploy as usual. No new required env vars if job resolution by name works.
  2. Databricks (per environment): Deploy the edvise ingestion shared bundle so job edvise_validated_gcs_to_bronze_sync exists in the workspace matching DATABRICKS_HOST_URL (dev: pipelines/ingestion/shared, databricks bundle deploy --target=dev, or deploy-manual / version tag).
  3. Optional: ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION=false to disable without redeploying edvise.

Rollback Plan

Describe or check:

  • Standard revert is sufficient (git revert)

Revert the merge commit. Optionally set ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION=false immediately if a hot disable is needed before revert ships. Validation and existing Databricks flows are unaffected.

Reviewer Guidance / Questions*

  • Noting that testing of the full workflow cannot be done until the Databricks bundle has been deployed to dev. Will leave this as a draft until deployment is done.
  • Noting that job parameters are pinned to the edvise bundle contract (github_validated_bronze_sync.yml); changes there need a matching API update.

Screenshots / Testing Evidence*

Dev log (validation OK, job missing pre-bundle deploy):

{"event":"validation_request","correlation_id":"...","inst_id":"...","bucket":"...","file_name":"..."}
{"event":"gcs_bronze_sync_background_start","correlation_id":"...","inst_id":"...","bucket":"...","file_name":"..."}
{"event":"gcs_bronze_sync_background_done","outcome":"trigger_failed","databricks_job_name":"edvise_validated_gcs_to_bronze_sync",...}

Expected after bundle deploy: outcome:"success" with databricks_job_run_id; corresponding run visible under Workflows → Jobs in Databricks.
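
To illustrate how correlation_id ties the three events together for cross-log lookup, here is a small sketch. `trace_event` and `events_for` are hypothetical helpers, and the sample field values are made up; only the event names and field names come from the log excerpt above.

```python
import json
import uuid
from typing import Iterable, List


def trace_event(event: str, correlation_id: str, **fields: object) -> str:
    """Serialize one trace event as a compact, single-line JSON object."""
    payload = {"event": event, "correlation_id": correlation_id, **fields}
    return json.dumps(payload, separators=(",", ":"))


def events_for(lines: Iterable[str], correlation_id: str) -> List[str]:
    """Return event names, in order, for one correlation_id."""
    parsed = (json.loads(line) for line in lines)
    return [p["event"] for p in parsed if p.get("correlation_id") == correlation_id]


cid = uuid.uuid4().hex  # generated once per validation request
common = {"inst_id": "inst-123", "bucket": "example-bucket", "file_name": "a.csv"}
log_lines = [
    trace_event("validation_request", cid, **common),
    trace_event("gcs_bronze_sync_background_start", cid, **common),
    trace_event("gcs_bronze_sync_background_done", cid, outcome="trigger_failed"),
    trace_event("validation_request", "some-other-request", **common),
]
print(events_for(log_lines, cid))
```

Filtering on the ID returns only the three events for that request, in the order they were logged, even when other requests are interleaved.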

SOC 2 Change Management Checklist

  • None of the below are true in this code
  • New roles/permissions are introduced without review and approval by the product manager
  • Hardcoded credentials, secrets, or API keys are present in this code
  • Secrets are being managed outside of the approved secrets management process (e.g., GitHub Secrets, environment variables)
  • PII or sensitive data handling is introduced or changed without being reviewed against our data classification policy
  • Sensitive data is written to logs
  • Input validation and sanitization is missing
  • An unnecessary attack surface has been introduced (e.g., unused endpoints, open ports, debug modes left enabled)
  • Common vulnerabilities have been introduced in the code (inc. any dependencies added or updated)
  • No review for common vulnerabilities has been conducted
  • Not tested in a non-production environment
  • Breaking changes to existing APIs or integrations are introduced without downstream consumers being notified
  • Performance impact has not been considered or is not acceptable
  • Appropriate audit logging is missing for any security-relevant actions introduced by this change
  • Log entries contain sensitive or PII data
  • All existing tests do not pass locally (./vendor/bin/pest)

Provide justification if you are submitting a PR with any boxes checked other than the first.


Reminder for Reviewers: By approving this PR you are confirming that you have reviewed the code for correctness, security, and compliance with our engineering and SOC 2 standards. Do not approve PRs where SOC 2 checklist items are checked without documented justification.


chapmanhk and others added 4 commits April 28, 2026 15:09
- Add run_validated_gcs_to_bronze_sync and job edvise_validated_gcs_to_bronze_sync
  with include_blob_paths_json for validated/{file_name}.
- Call after successful validate-upload / validate-sftp when edvise_id or legacy_id
  is set; ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION (default true) to disable.
- Failures to start the job are logged and do not fail validation.
- Extend data tests with DatabricksControl mock and assertions.

Made-with: Cursor
…lution

Schedule GCS-to-bronze Databricks run_now in BackgroundTasks after validated/
writes. Add correlation_id and JSON trace logs (validation_request, background
start/done). Optional DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID; resolve job by
name with duplicate detection when unset. Refine skip reasons for PDP vs
Edvise/Legacy.

Co-authored-by: Cursor <cursoragent@cursor.com>

Extract Databricks helpers and job-parameter constants, use specific
exceptions (ValueError, DatabricksError), and split background logging
into focused functions under 50 lines. Add tests for PDP-only and env
kill-switch skips plus run_now parameter contract coverage.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>