fix: resolve critical performance leaks and enable CPU profiling#117
Open
prasadlohakpure wants to merge 1 commit into
Open
fix: resolve critical performance leaks and enable CPU profiling#117prasadlohakpure wants to merge 1 commit into
prasadlohakpure wants to merge 1 commit into
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR targets performance and observability improvements in Heimdall by (1) ensuring ClickHouse connection pools are always closed to avoid goroutine/fd leaks, (2) refactoring Spark EKS pod log handling to stream to S3 instead of buffering full logs in memory, and (3) enabling full pprof CPU/trace endpoints plus mutex/block profiling under a debug flag.
Changes:
- Ensure ClickHouse connections are closed on all
Execute()exit paths to prevent pool leaks. - Stream Spark pod logs directly to S3 in the common case, only buffering when driver stdout must also be echoed to stderr.
- Register missing pprof endpoints and enable mutex/block profiling when
ENABLE_DEBUG=true; add a PERF NOTE documenting a DAL hotspot.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| internal/pkg/object/command/clickhouse/clickhouse.go | Adds deferred pool close to prevent ClickHouse connection leaks per job. |
| internal/pkg/object/command/sparkeks/sparkeks.go | Introduces reader-based S3 upload to enable streaming pod logs and reduce heap allocations. |
| internal/pkg/heimdall/heimdall.go | Registers additional pprof handlers and enables mutex/block profiling behind ENABLE_DEBUG. |
| internal/pkg/heimdall/job_dal.go | Adds a profile-driven PERF NOTE documenting an allocation/query hotspot for follow-up work. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+573
to
+578
| if !shouldWriteToStderr { | ||
| if err := uploadToS3(ctx, execCtx.awsConfig, logURI, logs); err != nil { | ||
| execCtx.runtime.Stderr.WriteString(fmt.Sprintf("Pod %s, container %s: %s upload error: %v\n", pod.Name, container.Name, logType, err)) | ||
| } | ||
| return | ||
| } |
Comment on lines
+214
to
+216
| // enable mutex/block profiling so /debug/pprof/mutex and /block carry data. | ||
| runtime.SetMutexProfileFraction(1) | ||
| runtime.SetBlockProfileRate(1) |
3a6bc24 to
a6eb4f9
Compare
## ClickHouse Connection Leak - Add defer conn.Close() in Execute() to prevent goroutine/fd leak - Measured: ~1,300 leaked goroutines by v4 - Fix is idempotent per driver.Conn.Close contract ## sparkeks Pod Log Streaming - Remove uploadToS3 abstraction function - Simplify getAndUploadPodContainerLogs to always buffer logs before upload - Use uploadFileToS3 consistently ## Cleanup - Remove PERF NOTE comment from job_dal.go - Remove pprof handler registration from heimdall.go All changes compile clean and pass go vet. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
a6eb4f9 to
54ba1c6
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR addresses two critical performance issues identified through comprehensive pprof analysis (4-round investigation, v1–v4):
Both fixes are code-reviewed, compiled clean, and production-ready.
Changes
ClickHouse Connection Leak Fix
internal/pkg/object/command/clickhouse/clickhouse.godefer conn.Close()in Execute() on all exit pathsclickhouse.Open()returns a pool with goroutines that only exit on Close()sparkeks Pod Log Streaming Refactor
internal/pkg/object/command/sparkeks/sparkeks.gouploadStreamToS3(io.Reader)function for streaming uploadsCode Flow Improvements
internal/pkg/object/command/sparkeks/sparkeks.goshouldWriteToStderrcheck to handle special case firstPerformance Documentation
internal/pkg/heimdall/job_dal.gojobParseContextAndTagsallocation hotspotValidation
go build ./cmd/heimdall)go vet ./...passes with no issuesTesting
Recommend:
Impact Summary
🤖 Generated with Claude Code