Feature/gcs bronze sync databricks api #239
Draft
chapmanhk wants to merge 4 commits into
- Add run_validated_gcs_to_bronze_sync and job edvise_validated_gcs_to_bronze_sync
with include_blob_paths_json for validated/{file_name}.
- Call after successful validate-upload / validate-sftp when edvise_id or legacy_id
is set; ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION (default true) to disable.
- Failures to start the job are logged and do not fail validation.
- Extend data tests with DatabricksControl mock and assertions.
Made-with: Cursor
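For reviewers, the gating described in this commit can be sketched roughly as below. This is an illustration, not the code in src/webapp. The helper names (sync_enabled, skip_reason, trigger_sync_safely), the skip-reason strings, and the set of values treated as "false" are assumptions. Only the env var name, its default, the edvise_id/legacy_id condition, and the validated/{file_name} path come from the commit message.

```python
import logging
import os
from typing import Callable, Mapping, Optional

logger = logging.getLogger(__name__)

ENV_FLAG = "ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION"


def sync_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Kill switch: enabled by default, off only for an explicit false-like value."""
    env = os.environ if env is None else env
    # Assumption: the PR only documents "=false"; the other spellings are a guess.
    return env.get(ENV_FLAG, "true").strip().lower() not in {"false", "0", "no", "off"}


def skip_reason(
    edvise_id: Optional[str],
    legacy_id: Optional[str],
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return a (hypothetical) skip reason, or None when the sync should be triggered."""
    if not sync_enabled(env):
        return "disabled_by_env"
    if not (edvise_id or legacy_id):
        return "pdp_only"  # institution has neither an Edvise nor a Legacy ID
    return None


def trigger_sync_safely(start_job: Callable[[str], int], file_name: str) -> Optional[int]:
    """Start the job for validated/{file_name}; log and swallow start failures."""
    blob_path = f"validated/{file_name}"
    try:
        return start_job(blob_path)
    except Exception:  # broad only to keep the sketch short
        logger.exception("Failed to start GCS-to-bronze sync for %s", blob_path)
        return None
```

Here start_job stands in for whatever actually calls Databricks. Returning None on failure, instead of raising, is what keeps a failed job start from turning into a failed validation response.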
…lution
Schedule GCS-to-bronze Databricks run_now in BackgroundTasks after validated/ writes. Add correlation_id and JSON trace logs (validation_request, background start/done). Optional DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID; resolve job by name with duplicate detection when unset. Refine skip reasons for PDP vs Edvise/Legacy.
Co-authored-by: Cursor <cursoragent@cursor.com>
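The job resolution mentioned in this commit (explicit job ID from the environment, otherwise lookup by name with duplicate detection) could look roughly like the sketch below. It is a standalone illustration under assumptions: resolve_job_id and the list_jobs_by_name callable are hypothetical names, and the lookup is abstracted as a function returning (job_id, name) pairs so the example does not depend on any particular Databricks client.

```python
from typing import Callable, Iterable, Mapping, Tuple

JOB_NAME = "edvise_validated_gcs_to_bronze_sync"
JOB_ID_ENV = "DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID"

# Hypothetical lookup type: given a job name, yield (job_id, job_name) pairs.
JobLister = Callable[[str], Iterable[Tuple[int, str]]]


def resolve_job_id(env: Mapping[str, str], list_jobs_by_name: JobLister) -> int:
    """Prefer the env override; otherwise require exactly one job with JOB_NAME."""
    override = env.get(JOB_ID_ENV, "").strip()
    if override:
        try:
            return int(override)
        except ValueError as exc:
            raise ValueError(f"{JOB_ID_ENV} must be an integer, got {override!r}") from exc

    # Re-check the name in case the lister does prefix or fuzzy matching.
    matches = [job_id for job_id, name in list_jobs_by_name(JOB_NAME) if name == JOB_NAME]
    if not matches:
        raise ValueError(f"Job named '{JOB_NAME}' not found")
    if len(matches) > 1:
        raise ValueError(f"Multiple jobs named '{JOB_NAME}': {sorted(matches)}")
    return matches[0]
```

Setting the env var skips the listing call entirely, which is presumably why it is offered as an option. Failing on duplicate names avoids silently running the wrong job.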
Extract Databricks helpers and job-parameter constants, use specific exceptions (ValueError, DatabricksError), and split background logging into focused functions under 50 lines. Add tests for PDP-only and env kill-switch skips plus run_now parameter contract coverage.
Co-authored-by: Cursor <cursoragent@cursor.com>
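As a rough picture of what the job-parameter constants and focused logging functions might cover, here is a minimal sketch. Everything in it is illustrative: the function names are hypothetical, include_blob_paths_json is the only job parameter named in this PR, and encoding it as a JSON list of blob paths is an assumption. The log field names follow the dev log quoted in the PR description.

```python
import json
from typing import Any, Dict, Optional

JOB_NAME = "edvise_validated_gcs_to_bronze_sync"
PARAM_INCLUDE_BLOB_PATHS_JSON = "include_blob_paths_json"
OUTCOMES = ("success", "trigger_failed", "skipped")


def build_job_parameters(file_name: str) -> Dict[str, str]:
    """Job parameters for one validated file (assumed shape: JSON list of paths)."""
    return {PARAM_INCLUDE_BLOB_PATHS_JSON: json.dumps([f"validated/{file_name}"])}


def trace_event(event: str, correlation_id: str, **fields: Any) -> str:
    """Serialize one JSON trace log line carrying the shared correlation_id."""
    return json.dumps({"event": event, "correlation_id": correlation_id, **fields})


def done_event(correlation_id: str, outcome: str, run_id: Optional[int] = None) -> str:
    """Build the gcs_bronze_sync_background_done line; reject unknown outcomes."""
    if outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome!r}")
    fields: Dict[str, Any] = {"outcome": outcome, "databricks_job_name": JOB_NAME}
    if run_id is not None:
        # Only present when the job was actually started.
        fields["databricks_job_run_id"] = run_id
    return trace_event("gcs_bronze_sync_background_done", correlation_id, **fields)
```

Keeping the parameter key in a constant gives the tests a single name to assert against, so a rename in the bundle shows up as a failing contract test instead of a silent mismatch.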
Co-authored-by: Cursor <cursoragent@cursor.com>
feat(data): trigger Databricks GCS→bronze sync after file validation (Edvise/Legacy)
Description
After a successful file validation (POST .../input/validate-upload/{file_name} and the SFTP validate path), the API schedules a non-blocking background task that starts the Databricks job edvise_validated_gcs_to_bronze_sync to copy the object from GCS validated/ into the institution's bronze volume (gcs_uploads). Validation and batch creation are unchanged; sync failures are logged and do not fail the validation response.
Behavior
- Runs only when edvise_id or legacy_id is set (PDP-only institutions are skipped).
- (… DATABRICKS_HOST_URL, GCP service account).
- Job is resolved by DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID when set, otherwise by job name (with duplicate-name detection).
- JSON trace logs: validation_request, gcs_bronze_sync_background_start, gcs_bronze_sync_background_done with outcome (success | trigger_failed | skipped) and correlation_id for cross-log lookup.
New / updated
- src/webapp/databricks.py: run_validated_gcs_to_bronze_sync, job resolution, bundle-aligned job parameters
- src/webapp/routers/data.py: BackgroundTasks hook in validation_helper
- src/webapp/databricks_test.py, src/webapp/routers/data_test.py
- src/webapp/.env.example: documents optional DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID
Kill switch: ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION=false (default: enabled).
Deployment Readiness*
Testing
Describe or check:
Automated: databricks_test.py (job ID resolution, run_now params contract); data_test.py (Edvise/Legacy trigger paths, PDP-only skip, env disabled).
Manual (dev): Deployed feature branch to dev and validated an upload as a Legacy institution. Validation succeeded; the background task logged gcs_bronze_sync_background_start, then outcome: trigger_failed with "Job named 'edvise_validated_gcs_to_bronze_sync' not found". This is expected until the edvise bundle (pipelines/ingestion/shared) is deployed to the dev Databricks workspace. Will re-test after bundle deploy for outcome: success and databricks_job_run_id in logs.
Deployment Notes
Describe or check:
- edvise_validated_gcs_to_bronze_sync exists in the workspace matching DATABRICKS_HOST_URL (dev: pipelines/ingestion/shared, databricks bundle deploy --target=dev, or deploy-manual / version tag).
- ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION=false to disable without redeploying edvise.
Rollback Plan
Describe or check:
- git revert)
Revert the merge commit. Optionally set ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION=false immediately if a hot disable is needed before the revert ships. Validation and existing Databricks flows are unaffected.
Reviewer Guidance / Questions*
- github_validated_bronze_sync.yml); changes there need a matching API update.
Screenshots / Testing Evidence*
Dev log (validation OK, job missing pre-bundle deploy):
{"event":"validation_request","correlation_id":"...","inst_id":"...","bucket":"...","file_name":"..."}
{"event":"gcs_bronze_sync_background_start","correlation_id":"...","inst_id":"...","bucket":"...","file_name":"..."}
{"event":"gcs_bronze_sync_background_done","outcome":"trigger_failed","databricks_job_name":"edvise_validated_gcs_to_bronze_sync",...}
Expected after bundle deploy: outcome:"success" with databricks_job_run_id; corresponding run visible under Workflows → Jobs in Databricks.
SOC 2 Change Management Checklist
- ./vendor/bin/pest)
Provide justification if you are submitting a PR with any boxes checked other than the first.
Reminder for Reviewers: By approving this PR you are confirming that you have reviewed the code for correctness, security, and compliance with our engineering and SOC 2 standards. Do not approve PRs where SOC 2 checklist items are checked without documented justification.