Skip to content

feat: ingest logic to use FileService - BED-8317,BED-8318#2808

Merged
mykeelium merged 34 commits into
mainfrom
mcuomo/BED-8317
Jun 24, 2026
Merged

feat: ingest logic to use FileService - BED-8317,BED-8318#2808
mykeelium merged 34 commits into
mainfrom
mcuomo/BED-8317

Conversation

@mykeelium

@mykeelium mykeelium commented May 21, 2026

Copy link
Copy Markdown
Contributor

Description

This change migrates the ingestion logic to use the new FileService. In addition, the cleanup daemon was updated to be able to delete files using the new FileService.

Motivation and Context

Resolves BED-8317,BED-8318

How Has This Been Tested?

The application was started and multiple ingested files were created. The data from these files were ingested correctly. In addition new tests were added.

For the cleanup, files were added to the directories, and both with exemptions and without PruneData was run to ensure the proper files were deleted.

Screenshots (optional):

Types of changes

  • Chore (a change that does not modify the application functionality)
  • New feature (non-breaking change which adds functionality)

Checklist:

Summary by CodeRabbit

  • New Features

    • Retained ingest files can now be downloaded as a gzip-compressed tar archive.
    • Disabling ingest file retention now triggers automatic best-effort background cleanup.
  • Bug Fixes & Improvements

    • Improved retention handling and lock behavior, including safer failure cleanup and correct lock-conflict responses.
    • Orphan cleanup now unifies storage-backed cleanup with legacy local remnants and scratch cleanup using stronger exclusion/retention rules.
    • Uploads and ingest archive extraction now use storage-backed temp-file handling for more reliable lifecycle management.
  • Tests

    • Expanded/updated unit and integration tests for retention downloads, cleanup, uploads, and storage-backed ingest workflows.

@mykeelium mykeelium self-assigned this May 21, 2026
@mykeelium mykeelium added enhancement New feature or request api A pull request containing changes affecting the API code. labels May 21, 2026
@coderabbitai

coderabbitai Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR refactors ingest file handling from filesystem paths to a storage.FileService abstraction across the entire ingest pipeline. Ingest files are now uploaded via storage with concurrent validation, retained files are managed through a file service, archives are extracted using scratch directories, temporary files are cleaned up through the service, and orphan cleanup atomically handles storage, legacy local, and scratch cleanup. The legacy cmd/api/src/services/fs package is removed in favor of storage-layer abstractions.

Changes

Storage-backed ingest file lifecycle refactor

Layer / File(s) Summary
Retained ingest file management
cmd/api/src/api/tools/ingest.go, cmd/api/src/api/tools/ingest_test.go
Tools API daemon now injects storage.FileService for retained files. FetchRetainedIngestFiles lists files via storage and streams a gzip+tar response; DisableIngestFileRetention uses DeleteFile to clean up asynchronously via a cancellation-free context. Tests verify path sanitization, tar generation, endpoint behavior under enabled/disabled/error conditions, and lock conflict handling.
Upload validation and storage persistence
cmd/api/src/services/upload/upload.go, cmd/api/src/services/upload/upload_test.go, cmd/api/src/api/v2/fileingest.go, cmd/api/src/api/v2/fileingest_test.go
Upload refactored to accept context.Context and storage.FileService; SaveIngestFile and WriteAndValidateFile perform concurrent streaming validation while writing to storage via fileService.WriteTempFile. On error, cleanup uses fileService.DeleteFile. API handler resolves ingest file service with error handling. Tests mock storage operations and verify validation/write error precedence and cleanup behavior.
Archive extraction via scratch directories
cmd/api/src/services/graphify/ingest_storage.go, cmd/api/src/services/graphify/ingest_storage_test.go
Archive extraction refactored to spool stored files to scratch via SpoolToScratch, extract entries via WriteArchiveFileToStorage, and clean up via fileService.DeleteFile. JSON files bypass extraction. ZIP archives are spooled, expanded, and originals deleted. Tests use in-memory zip creation and storage mocks to verify spooling, opening, writing, and per-file error recording.
Per-file ingest processing with storage-aware cleanup
cmd/api/src/services/graphify/service.go, cmd/api/src/services/graphify/tasks.go, cmd/api/src/services/graphify/tasks_integration_test.go
New IngestFileData struct and processSingleFile helper centralize per-file ingest with measured logging; cleanup always executes: closing scratch reader, removing scratch path (ignoring missing-file errors), and deleting original stored file via fileService.DeleteFile. GraphifyService stores resolver for per-task service resolution. Integration tests wire local ingest storage for all test cases.
Unified cleanup: storage, legacy, and scratch
cmd/api/src/daemons/datapipe/cleanup.go, cmd/api/src/daemons/datapipe/cleanup_internal_test.go, cmd/api/src/daemons/datapipe/cleanup_test.go
OrphanFileSweeper now handles storage-backed ingest files via storage.FileService, legacy local temp entries, and scratch cleanup under one atomic TryLock. Supports excluded storage prefixes, expected files, and age-based cutoff across all phases. Internal tests cover prefix normalization, exclusion/expected-path logic, and per-phase behavior.
Service and daemon construction with FileServiceResolver
cmd/api/src/daemons/datapipe/pipeline.go, cmd/api/src/daemons/api/toolapi/api.go, cmd/api/src/services/entrypoint.go, cmd/api/src/services/entrypoint_test.go, cmd/api/src/services/graphify/graphify_integration_test.go, cmd/api/src/daemons/datapipe/datapipe_integration_test.go, cmd/api/src/daemons/datapipe/pipeline_integration_test.go, cmd/api/src/daemons/datapipe/delete_integration_test.go
All daemon and service constructors updated to accept storage.FileServiceResolver. NewPipeline stores resolver and resolves ingest service before orphan cleanup. toolapi.NewDaemon accepts resolved retained file service. Entrypoint validates and propagates resolver. All integration tests initialize file-service infrastructure and wire services into ingest operations.
Legacy fs package removal
cmd/api/src/services/fs/fs.go, cmd/api/src/services/fs/mocks/fs.go
The Service interface and Client implementation wrapping os.CreateTemp and os.ReadFile are removed, along with their generated mocks, as storage-layer abstractions supersede this functionality.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • SpecterOps/BloodHound#2800: Both PRs modify the shared ingest-storage layer—adding/updating cmd/api/src/services/graphify/ingest_storage.go helpers like ExtractIngestFiles and refactoring cmd/api/src/services/graphify/tasks.go to delegate ingestion/extraction to those helpers—so the changes are directly connected at the code/function level.

Suggested reviewers

  • superlinkx
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 8.45% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: migrating ingest logic to use FileService and references the relevant Jira tickets (BED-8317, BED-8318).
Description check ✅ Passed The description includes all required sections: Description, Motivation and Context with ticket references, testing methodology, types of changes, and completed checklist items.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch mcuomo/BED-8317

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@mykeelium mykeelium changed the title feat: ingest logic to use FileService - BED-8317 feat: ingest logic to use FileService - BED-8317,BED-8318 May 21, 2026
}
defer sourceFile.Close()

scratchFile, err = os.CreateTemp(scratchDirectory, "archive-*")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little confused why we're doing direct OS file operations here, doesn't file service have a local storage driver?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does, but this is supposed to be a stage earlier than the fileService, in this case spooling is done to take the users file and validate it prior to saving it to the file service. In the event that we have a file service that is not on the os, this handles the resources ensuring they are available for the fileService. Not in this scope, but maybe the upload could be shifted to a presigned url so we do not need this step at all.

Comment thread cmd/api/src/services/upload/upload.go Outdated
Comment thread cmd/api/src/services/upload/upload.go
@mykeelium mykeelium force-pushed the mcuomo/BED-8316 branch 2 times, most recently from 37fba2f to aed5917 Compare June 12, 2026 15:27
Base automatically changed from mcuomo/BED-8316 to main June 12, 2026 15:42
@mykeelium mykeelium marked this pull request as ready for review June 12, 2026 16:45
@coderabbitai coderabbitai Bot added documentation Improvements or additions to documentation infrastructure A pull request containing changes affecting the infrastructure code. go Pull requests that update go code labels Jun 12, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/api/src/api/tools/ingest.go`:
- Around line 128-151: The handler currently calls
response.WriteHeader(http.StatusOK) before any archive data is produced, so
clients can get a 200 with an empty/broken .tar.gz; change this by delaying the
success status until the archive is actually created—either build the tar+gzip
into a temporary sink (e.g., a buffer or temp file) and only set headers and
stream it after the first successful file write, or defer writing the 200 until
after s.writeRetainedFileToTar has successfully written the first entry; locate
response.WriteHeader(http.StatusOK), gzip.NewWriter, tar.NewWriter and
s.writeRetainedFileToTar to implement the chosen approach and ensure errors from
GetFile/file writes result in an error status and not a premature 200.
- Around line 128-130: The response currently sets both Content-Type and
Content-Encoding for the attachment; remove the line that sets Content-Encoding
to "gzip" (the call response.Header().Set(headers.ContentEncoding.String(),
"gzip")) so the handler that builds the attachment only sets Content-Type and
Content-Disposition (the existing
response.Header().Set(headers.ContentType.String(), ...) and
response.Header().Set(headers.ContentDisposition.String(), ...)) and does not
advertise an HTTP-level gzip encoding for the attachment body.
- Around line 197-221: The cleanup goroutine currently uses
context.WithoutCancel(request.Context()) which can hang indefinitely; replace it
with a bounded background context (e.g. ctx, cancel :=
context.WithTimeout(context.Background(), 30*time.Second)) and use that ctx for
retainedFileService.ListFiles and DeleteFile calls, and ensure you call defer
cancel() inside the goroutine so the timeout is enforced; update the cleanupCtx
variable usage and add the time import as needed, keeping the defer
s.retainedFileLock.Unlock() in place.

In `@cmd/api/src/api/v2/fileingest.go`:
- Around line 173-176: The error path uses the original request.Context() when
calling ingestFileService.DeleteFile, so if the request context is canceled the
cleanup won't run; change the delete call to use a non-cancelable short-lived
context (mirror upload.cleanupTempFile): create a ctx using
context.WithoutCancel(request.Context()) combined with a small timeout (e.g.,
context.WithTimeout(..., 5*time.Second)) and pass that ctx into
ingestFileService.DeleteFile; update imports if needed and ensure the cancel
function is deferred where applicable.

In `@cmd/api/src/daemons/datapipe/cleanup.go`:
- Around line 232-245: Normalize the storage path used for map lookup and
deletion by applying the same cleaning used elsewhere: replace uses of file.Path
with a cleaned version (logicalPath := path.Clean(file.Path)) before checking
expectedFiles and before calling ingestFileService.DeleteFile; ensure
addExpectedStoragePath and the expectedFiles map keys match that cleaned form,
and update the loop that references isExcludedStoragePath,
expectedFiles[logicalPath], and ingestFileService.DeleteFile to all use the same
logicalPath variable so lookups and deletions are consistent.
- Around line 293-307: The directory branch currently removes directories
immediately; modify the loop in cleanup.go (where cutoff is computed and entries
are iterated) to apply the same orphanMinimumAge gate used for files: call
entry.Info() and check the entry.Info().ModTime() against cutoff (as you do for
files) before invoking s.fileOps.RemoveAll(fullPath) for directories (keeping
isExpectedLocalEntry logic intact); ensure errors from entry.Info() are handled
similarly to file handling so directories younger than cutoff are skipped.

In `@cmd/api/src/services/entrypoint.go`:
- Around line 113-116: Entrypoint currently only checks that
dependencies.FileServiceResolver is non-nil but doesn't verify it can actually
provide the ingest service; update Entrypoint to preflight the resolver by
attempting to obtain the ingest service (e.g., call the resolver method used
elsewhere to produce storage.FileServiceIngest) and return an error if that call
fails or returns nil so the app fails fast on startup; apply the same preflight
check wherever CreateRuntimeDependencies expects a populated FileServiceResolver
(also update the second nil-check block referenced around lines 170-173) so
runtime ingest paths can't later 500 due to a missing storage.FileServiceIngest.

In `@cmd/api/src/services/graphify/tasks.go`:
- Around line 118-121: In ProcessIngestFile, do not early-return when
ExtractIngestFiles returns (fileData, err); instead assign the returned fileData
regardless of err and proceed into the per-file processing loop so valid members
are handled (keep ExtractIngestFiles, ProcessIngestFile and fileData references
to locate the code). Remove the "return fileData, err" early exit, optionally
log or stash the err for diagnostics, and let each fileData[i].Errors drive
skipping/failure logic in the loop; return the collected fileData and a combined
or nil error only after processing the loop.

In `@cmd/api/src/services/upload/upload.go`:
- Around line 100-113: The pipeline currently persists the raw request bytes
(teeReader := io.TeeReader(fileData, pw)) while the validator (validationFunc)
writes its normalized output to io.Discard, so the on-disk temp file can differ
from what validation accepted (e.g. UTF-16 -> UTF-8). Fix by making the
validator produce the stream that is persisted: invoke validationFunc with the
original request reader as src and the pipe writer (pw) as dst
(validationFunc(fileData, pw)) in the goroutine, then have the main writer read
from the pipe reader (pr) for fileService.WriteTempFile instead of using
io.TeeReader; remove the TeeReader usage so WriteTempFile consumes the
validator-normalized stream.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 059df95f-e49a-42e8-8ca3-96bc1ddc13a8

📥 Commits

Reviewing files that changed from the base of the PR and between 2fad80f and 48491d4.

📒 Files selected for processing (24)
  • cmd/api/src/api/tools/ingest.go
  • cmd/api/src/api/tools/ingest_test.go
  • cmd/api/src/api/v2/fileingest.go
  • cmd/api/src/api/v2/fileingest_test.go
  • cmd/api/src/daemons/api/toolapi/api.go
  • cmd/api/src/daemons/datapipe/cleanup.go
  • cmd/api/src/daemons/datapipe/cleanup_internal_test.go
  • cmd/api/src/daemons/datapipe/cleanup_test.go
  • cmd/api/src/daemons/datapipe/datapipe_integration_test.go
  • cmd/api/src/daemons/datapipe/delete_integration_test.go
  • cmd/api/src/daemons/datapipe/pipeline.go
  • cmd/api/src/daemons/datapipe/pipeline_integration_test.go
  • cmd/api/src/services/entrypoint.go
  • cmd/api/src/services/entrypoint_test.go
  • cmd/api/src/services/fs/fs.go
  • cmd/api/src/services/fs/mocks/fs.go
  • cmd/api/src/services/graphify/graphify_integration_test.go
  • cmd/api/src/services/graphify/ingest_storage.go
  • cmd/api/src/services/graphify/ingest_storage_test.go
  • cmd/api/src/services/graphify/service.go
  • cmd/api/src/services/graphify/tasks.go
  • cmd/api/src/services/graphify/tasks_integration_test.go
  • cmd/api/src/services/upload/upload.go
  • cmd/api/src/services/upload/upload_test.go
💤 Files with no reviewable changes (2)
  • cmd/api/src/services/fs/mocks/fs.go
  • cmd/api/src/services/fs/fs.go

Comment thread cmd/api/src/api/tools/ingest.go
Comment thread cmd/api/src/api/tools/ingest.go
Comment thread cmd/api/src/api/tools/ingest.go Outdated
Comment thread cmd/api/src/api/v2/fileingest.go
Comment thread cmd/api/src/daemons/datapipe/cleanup.go Outdated
Comment thread cmd/api/src/daemons/datapipe/cleanup.go
Comment thread cmd/api/src/services/entrypoint.go
Comment thread cmd/api/src/services/graphify/tasks.go
Comment thread cmd/api/src/services/upload/upload.go Outdated
@mykeelium mykeelium force-pushed the mcuomo/BED-8317 branch 4 times, most recently from b540eb4 to 70013a1 Compare June 18, 2026 16:56
mykeelium and others added 5 commits June 23, 2026 23:10
* feat: move storage to packages

* fix: remove config from test

* chore: pfc

* chore: pfc

* chore: pfc

* fix: move mockgen to appropriate file name
@mykeelium mykeelium merged commit 2aa9ad9 into main Jun 24, 2026
12 checks passed
@mykeelium mykeelium deleted the mcuomo/BED-8317 branch June 24, 2026 03:27
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 24, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

api A pull request containing changes affecting the API code. documentation Improvements or additions to documentation enhancement New feature or request go Pull requests that update go code infrastructure A pull request containing changes affecting the infrastructure code.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants