Skip to content

[feat] [evaluate] code evaluator commercial#210

Merged
tpfz merged 5 commits into
mainfrom
feat/wzq/code_evaluator_commercial
Sep 28, 2025
Merged

[feat] [evaluate] code evaluator commercial#210
tpfz merged 5 commits into
mainfrom
feat/wzq/code_evaluator_commercial

Conversation

@tpfz
Copy link
Copy Markdown
Collaborator

@tpfz tpfz commented Sep 24, 2025

What type of PR is this?

Check the PR title.

  • This PR title match the format: [<type>][<scope>]: <description>. For example: [fix][backend] flaky fix
  • The description of this PR title is user-oriented and clear enough for others to understand.
  • Add documentation if the current PR requires user awareness at the usage level.
  • This PR is written in English. PRs not in English will not be reviewed.

(Optional) Translate the PR title into Chinese.

(Optional) More detailed description for this PR(en: English/zh: Chinese).

en:
zh(optional):

(Optional) Which issue(s) this PR fixes:

@codecov
Copy link
Copy Markdown

codecov Bot commented Sep 25, 2025

Codecov Report

❌ Patch coverage is 83.18662% with 382 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...ation/domain/service/evaluator_source_code_impl.go 84.08% 110 Missing and 31 partials ⚠️
...nd/modules/evaluation/application/evaluator_app.go 78.68% 47 Missing and 8 partials ⚠️
backend/api/router/coze/loop/apis/middleware.go 0.00% 36 Missing ⚠️
...tion/domain/service/evaluator_code_builder_impl.go 90.59% 16 Missing and 11 partials ⚠️
...d/api/handler/coze/loop/apis/experiment_service.go 69.69% 15 Missing and 5 partials ⚠️
...kend/modules/evaluation/domain/entity/evaluator.go 90.90% 19 Missing and 1 partial ⚠️
...ackend/api/router/coze/loop/apis/coze.loop.apis.go 0.00% 15 Missing ⚠️
...nd/api/handler/coze/loop/apis/evaluator_service.go 0.00% 13 Missing ⚠️
...odules/evaluation/domain/service/evaluator_impl.go 60.71% 7 Missing and 4 partials ⚠️
...modules/evaluation/domain/entity/sandbox_config.go 0.00% 7 Missing ⚠️
... and 13 more

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #210      +/-   ##
==========================================
+ Coverage   61.60%   62.75%   +1.14%     
==========================================
  Files         470      479       +9     
  Lines       50017    52927    +2910     
==========================================
+ Hits        30813    33213    +2400     
- Misses      16934    17379     +445     
- Partials     2270     2335      +65     
Flag Coverage Δ
unittests 62.75% <83.18%> (+1.14%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
backend/cmd/consumer.go 0.00% <ø> (ø)
...cation/convertor/evaluator/evaluator_input_data.go 100.00% <100.00%> (+58.33%) ⬆️
.../modules/evaluation/application/eval_target_app.go 70.08% <100.00%> (+3.62%) ⬆️
...d/modules/evaluation/application/experiment_app.go 74.09% <100.00%> (ø)
...valuation/domain/component/metrics/metrics_code.go 85.10% <100.00%> (ø)
...dules/evaluation/domain/entity/evaluator_record.go 100.00% <ø> (+40.00%) ⬆️
...luation/domain/service/evaluation_set_item_impl.go 100.00% <100.00%> (ø)
...tion/domain/service/evaluation_set_version_impl.go 91.17% <100.00%> (ø)
...es/evaluation/domain/service/expt_annotate_impl.go 74.50% <100.00%> (ø)
...modules/evaluation/domain/service/expt_run_impl.go 90.69% <100.00%> (ø)
... and 41 more

... and 9 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e4b7973...cbb223d. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

tpfz added a commit that referenced this pull request Sep 26, 2025
… coverage

(LogID: 20250926135253010091110134060FD66)

Co-Authored-By: Coda <coda@bytedance.com>
tpfz added a commit that referenced this pull request Sep 27, 2025
(LogID: 20250927194646172020010002380FB80)

Co-Authored-By: Coda <coda@bytedance.com>
@tpfz tpfz force-pushed the feat/wzq/code_evaluator_commercial branch 3 times, most recently from 370557e to 95d4509 Compare September 28, 2025 05:55
@tpfz tpfz changed the title Feat/wzq/code evaluator commercial feat: commercial code evaluator Sep 28, 2025
…、FaaS集成、监控配置等完整功能实现

(LogID: 20250928143922010091110134199CBD2)

Co-Authored-By: Coda <coda@bytedance.com>
@tpfz tpfz changed the title feat: commercial code evaluator [feat] [evaluate] code evaluator commercial Sep 28, 2025
@tpfz tpfz force-pushed the feat/wzq/code_evaluator_commercial branch from c6d5e14 to c36439a Compare September 28, 2025 06:58
@tpfz tpfz merged commit d027d8b into main Sep 28, 2025
19 of 20 checks passed
@tpfz tpfz deleted the feat/wzq/code_evaluator_commercial branch September 28, 2025 11:45
Copy link
Copy Markdown
Collaborator

@CozeLoop CozeLoop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aime CR Stage 2 summary:

  • Must-Fix: New evaluator/experiment routes register with middleware stubs returning nil; add auth/rate/audit/trace/recover chains. Fix ValidateEvaluator and insight_analysis handlers to call kitex services and return structured errors.
  • Should-Fix: Avoid raw err.Error() in handlers; protect mock_output; verify ACL on records.
  • Suggestion: Verify Hertz v0.10.1 compatibility and strengthen HTTP client connection reuse.
    See inline comments for precise lines and proposals.


func _validateevaluatorMw(handler *apis.APIHandler) []app.HandlerFunc {
// your code...
return nil
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 Must-Fix: Middleware stub returns nil — missing security and operational guarantees.

Problem: This middleware currently returns nil, which effectively attaches no middleware to the new /evaluators/validate route. That leaves the endpoint without authentication, rate limiting, audit logging, or standardized error handling. In CozeLoop and ByteDance service specs, public-facing write endpoints must enforce authn/authz, request limits, and trace/audit propagation.

Why it matters: Without these guards, the route can be abused (e.g., batch requests, unauthenticated access), observability is reduced, and consistent error/metrics propagation breaks. Hertz v0.10.1 does not add defaults; explicit handlers are required.

Proposal: Provide a middleware chain that at minimum includes: auth (token/session), rate limiter, request size/timeout limits, audit/trace injection, and a panic/error recovery shim. Example (illustrative):

func _validateevaluatorMw(handler *apis.APIHandler) []app.HandlerFunc {
    return []app.HandlerFunc{
        handler.Middleware.Auth(),           // authn/authz
        handler.Middleware.RateLimit(100, time.Minute), // QoS per user/workspace
        handler.Middleware.RequestLimit(1<<20),         // 1MB body max
        handler.Middleware.Trace(),          // tracing span/context
        handler.Middleware.Audit("evaluator.validate"),
        handler.Middleware.Recover(),        // uniform error mapping
    }
}

Ensure the same baseline is applied to other newly-added evaluator endpoints (batch_debug, etc.), tuned by route sensitivity.


func _batchdebugevaluatorMw(handler *apis.APIHandler) []app.HandlerFunc {
// your code...
return nil
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 Must-Fix: Batch debug route lacks protective middleware.

Problem: _batchdebugevaluatorMw returns nil, so /evaluators/batch_debug runs without auth, rate limiting, payload size limits, or audit/trace context. Batch operations are resource-intensive and can be abused for DoS if not guarded.

Impact: Elevated compute and sandbox invocations without QoS lead to performance hotspots and noisy neighbors. Missing audit breaks accountability.

Fix: Attach a strict middleware chain tailored for batch ops:

func _batchdebugevaluatorMw(handler *apis.APIHandler) []app.HandlerFunc {
    return []app.HandlerFunc{
        handler.Middleware.Auth(),
        handler.Middleware.RateLimitPerUser(20, time.Minute), // conservative per-user burst
        handler.Middleware.ConcurrencyLimit(4),               // bound concurrent batch execution
        handler.Middleware.RequestLimit(1<<20),               // cap body size
        handler.Middleware.Trace(),
        handler.Middleware.Audit("evaluator.batch_debug"),
        handler.Middleware.Recover(),
    }
}

Also consider deferring long-running execution via job queue with polling APIs to avoid tying up request threads.


func _insightanalysisexperimentMw(handler *apis.APIHandler) []app.HandlerFunc {
// your code...
return nil
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 Must-Fix: Insight analysis endpoints are exposed without middleware.

Problem: _insightanalysisexperimentMw returns nil, so /experiments/:expt_id/insight_analysis executes without authn/authz checks, rate limiting, or audit/trace context. Insight analysis is typically expensive and should be permission-scoped to the experiment’s space/owner.

Fix: Attach auth (space-scoped), rate/QoS, audit, and tracing middleware. Example:

func _insightanalysisexperimentMw(handler *apis.APIHandler) []app.HandlerFunc {
    return []app.HandlerFunc{
        handler.Middleware.AuthSpaceScoped(), // verify access to :expt_id
        handler.Middleware.RateLimit(30, time.Minute),
        handler.Middleware.Trace(),
        handler.Middleware.Audit("experiment.insight_analysis"),
        handler.Middleware.Recover(),
    }
}

Consider async job submission with status polling to avoid request blocking for long-running analysis.


func _insight_analysis_recordsMw(handler *apis.APIHandler) []app.HandlerFunc {
// your code...
return nil
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 Must-Fix: Records/feedback/comments routes lack auth and audit middleware.

Problem: _insight_analysis_recordsMw (and related _insight_analysis_record_idMw, _feedbackexptinsightanalysisreportMw, _listexptinsightanalysiscommentMw, etc.) all return nil. These endpoints modify/access experiment-scoped records; without authn/authz, rate limiting, and audit, they are vulnerable to unauthorized access and abuse.

Fix: Define a shared middleware set for the records group that enforces space-scoped permission checks, rate limits, and audit/trace propagation:

func _insight_analysis_recordsMw(handler *apis.APIHandler) []app.HandlerFunc {
    return []app.HandlerFunc{
        handler.Middleware.AuthSpaceScoped(),
        handler.Middleware.RateLimit(60, time.Minute),
        handler.Middleware.Trace(),
        handler.Middleware.Audit("experiment.insight_analysis.records"),
        handler.Middleware.Recover(),
    }
}

Apply complementary guards for child routes (DELETE/POST) to ensure proper authorization on record IDs.


func _mockevaltargetoutputMw(handler *apis.APIHandler) []app.HandlerFunc {
// your code...
return nil
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚨 Should-Fix: Mock output endpoint should be guarded to avoid misuse.

Problem: _mockevaltargetoutputMw returns nil for /eval_targets/mock_output. Even for mock/testing paths, unauthenticated use in production environments can leak target metadata or be abused for traffic amplification.

Suggestion: Attach lightweight but explicit guards: auth (or admin-only), rate limiting, and audit tagging so mock usage is observable.

func _mockevaltargetoutputMw(handler *apis.APIHandler) []app.HandlerFunc {
    return []app.HandlerFunc{
        handler.Middleware.Auth(),
        handler.Middleware.RateLimit(60, time.Minute),
        handler.Middleware.Audit("eval_target.mock_output"),
        handler.Middleware.Trace(),
    }
}

_expt_id.POST("/associate_tag", append(_associateannotationtagMw(handler), apis.AssociateAnnotationTag)...)
_expt_id.POST("/clone", append(_cloneexperimentMw(handler), apis.CloneExperiment)...)
_expt_id.DELETE("/delete_tag", append(_deleteannotationtagMw(handler), apis.DeleteAnnotationTag)...)
_expt_id.POST("/insight_analysis", append(_insightanalysisexperimentMw(handler), apis.InsightAnalysisExperiment)...)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚨 Should-Fix: Insight analysis should verify experiment-level permissions and guard resource usage.

Observation: /experiments/:expt_id/insight_analysis is registered with _insightanalysisexperimentMw(handler) which returns nil. This endpoint should verify access rights to :expt_id (space-scoped), add rate limiting, and emit audit/trace. Consider async job submission to avoid request blocking for long-running analysis.

_export_records.POST("/list", append(_listexptresultexportrecordMw(handler), apis.ListExptResultExportRecord)...)
}
{
_insight_analysis_records := _expt_id.Group("/insight_analysis_records", _insight_analysis_recordsMw(handler)...)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚨 Should-Fix: Records and nested routes need ACL checks and audit.

Observation: The group /experiments/:expt_id/insight_analysis_records attaches _insight_analysis_recordsMw(handler) which returns nil. Ensure space-scoped ACL, rate limiting, trace/audit, and consistent error handling across nested routes (DELETE, POST list/get, feedback, comments).


# Database configuration - use SQLite for testing
sqlite:
path: "/tmp/coze_loop.db"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚨 Should-Fix: Local-only infrastructure.yaml risks environment parity and accidental production use.

Observation: New config defaults to SQLite and disables Redis/ClickHouse/MQ, suitable for local testing but problematic if committed without environment scoping. Database规范 requires parity across Docker/K8s deployments and proper SQL init paths.

Recommendation:

  • Move this into a clearly local/dev profile (e.g., conf/infrastructure.local.yaml) and exclude from production packaging.
  • Ensure docker-compose and helm chart paths continue to apply the same SQL init changes; do not diverge schema.
  • Add comments and build-time gating to prevent accidental use in production.

Python = 1
JS = 2
}
typedef string LanguageType(ts.enum="true")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Suggestion: Confirm generated code and usage align with LanguageType as string enum.

Observation: LanguageType moved from numeric enum to typedef string with constants. Ensure downstream code (builders, runtime selection, validation) treat language type as canonical strings and normalize appropriately. Domain code CodeEvaluatorVersion.ValidateBaseInfo appears to normalize casing — good. Verify kitex-generated types and conversion paths are updated accordingly.


// 检查HTTP状态码
if resp.StatusCode < 200 || resp.StatusCode >= 300 {
bodyBytes, _ := io.ReadAll(resp.Body)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Suggestion: Improve HTTP client robustness and connection reuse.

  • Consider draining response body on non-2xx before close to allow persistent connection reuse (e.g., io.Copy(io.Discard, resp.Body)).
  • For Body typed as io.Reader, defaulting Content-Type to text/plain may be surprising; prefer requiring explicit content-type from caller or default to application/octet-stream.
  • Optionally add retry/backoff for transient 5xx, with context-aware cancellation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants