[feat] [evaluate] code evaluator commercial by tpfz · Pull Request #210 · coze-dev/coze-loop

tpfz · 2025-09-24T14:15:32Z

What type of PR is this?

Check the PR title.

This PR title match the format: [<type>][<scope>]: <description>. For example: [fix][backend] flaky fix
The description of this PR title is user-oriented and clear enough for others to understand.
Add documentation if the current PR requires user awareness at the usage level.
This PR is written in English. PRs not in English will not be reviewed.

(Optional) Translate the PR title into Chinese.

(Optional) More detailed description for this PR(en: English/zh: Chinese).

en:
zh(optional):

(Optional) Which issue(s) this PR fixes:

codecov · 2025-09-25T08:40:45Z

Codecov Report

❌ Patch coverage is 83.18662% with 382 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...ation/domain/service/evaluator_source_code_impl.go	84.08%	110 Missing and 31 partials ⚠️
...nd/modules/evaluation/application/evaluator_app.go	78.68%	47 Missing and 8 partials ⚠️
backend/api/router/coze/loop/apis/middleware.go	0.00%	36 Missing ⚠️
...tion/domain/service/evaluator_code_builder_impl.go	90.59%	16 Missing and 11 partials ⚠️
...d/api/handler/coze/loop/apis/experiment_service.go	69.69%	15 Missing and 5 partials ⚠️
...kend/modules/evaluation/domain/entity/evaluator.go	90.90%	19 Missing and 1 partial ⚠️
...ackend/api/router/coze/loop/apis/coze.loop.apis.go	0.00%	15 Missing ⚠️
...nd/api/handler/coze/loop/apis/evaluator_service.go	0.00%	13 Missing ⚠️
...odules/evaluation/domain/service/evaluator_impl.go	60.71%	7 Missing and 4 partials ⚠️
...modules/evaluation/domain/entity/sandbox_config.go	0.00%	7 Missing ⚠️
... and 13 more

@@            Coverage Diff             @@
##             main     #210      +/-   ##
==========================================
+ Coverage   61.60%   62.75%   +1.14%     
==========================================
  Files         470      479       +9     
  Lines       50017    52927    +2910     
==========================================
+ Hits        30813    33213    +2400     
- Misses      16934    17379     +445     
- Partials     2270     2335      +65

Flag	Coverage Δ
unittests	`62.75% <83.18%> (+1.14%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
backend/cmd/consumer.go	`0.00% <ø> (ø)`
...cation/convertor/evaluator/evaluator_input_data.go	`100.00% <100.00%> (+58.33%)`	⬆️
.../modules/evaluation/application/eval_target_app.go	`70.08% <100.00%> (+3.62%)`	⬆️
...d/modules/evaluation/application/experiment_app.go	`74.09% <100.00%> (ø)`
...valuation/domain/component/metrics/metrics_code.go	`85.10% <100.00%> (ø)`
...dules/evaluation/domain/entity/evaluator_record.go	`100.00% <ø> (+40.00%)`	⬆️
...luation/domain/service/evaluation_set_item_impl.go	`100.00% <100.00%> (ø)`
...tion/domain/service/evaluation_set_version_impl.go	`91.17% <100.00%> (ø)`
...es/evaluation/domain/service/expt_annotate_impl.go	`74.50% <100.00%> (ø)`
...modules/evaluation/domain/service/expt_run_impl.go	`90.69% <100.00%> (ø)`
... and 41 more

... and 9 files with indirect coverage changes

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e4b7973...cbb223d. Read the comment docs.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

… coverage (LogID: 20250926135253010091110134060FD66) Co-Authored-By: Coda <coda@bytedance.com>

(LogID: 20250927194646172020010002380FB80) Co-Authored-By: Coda <coda@bytedance.com>

…、FaaS集成、监控配置等完整功能实现 (LogID: 20250928143922010091110134199CBD2) Co-Authored-By: Coda <coda@bytedance.com>

CozeLoop

Aime CR Stage 2 summary:

Must-Fix: New evaluator/experiment routes register with middleware stubs returning nil; add auth/rate/audit/trace/recover chains. Fix ValidateEvaluator and insight_analysis handlers to call kitex services and return structured errors.
Should-Fix: Avoid raw err.Error() in handlers; protect mock_output; verify ACL on records.
Suggestion: Verify Hertz v0.10.1 compatibility and strengthen HTTP client connection reuse.
See inline comments for precise lines and proposals.

CozeLoop · 2025-09-28T12:09:34Z

+
+func _validateevaluatorMw(handler *apis.APIHandler) []app.HandlerFunc {
+	// your code...
+	return nil


🚫 Must-Fix: Middleware stub returns nil — missing security and operational guarantees.

Problem: This middleware currently returns nil, which effectively attaches no middleware to the new /evaluators/validate route. That leaves the endpoint without authentication, rate limiting, audit logging, or standardized error handling. In CozeLoop and ByteDance service specs, public-facing write endpoints must enforce authn/authz, request limits, and trace/audit propagation.

Why it matters: Without these guards, the route can be abused (e.g., batch requests, unauthenticated access), observability is reduced, and consistent error/metrics propagation breaks. Hertz v0.10.1 does not add defaults; explicit handlers are required.

Proposal: Provide a middleware chain that at minimum includes: auth (token/session), rate limiter, request size/timeout limits, audit/trace injection, and a panic/error recovery shim. Example (illustrative):

func _validateevaluatorMw(handler *apis.APIHandler) []app.HandlerFunc { return []app.HandlerFunc{ handler.Middleware.Auth(), // authn/authz handler.Middleware.RateLimit(100, time.Minute), // QoS per user/workspace handler.Middleware.RequestLimit(1<<20), // 1MB body max handler.Middleware.Trace(), // tracing span/context handler.Middleware.Audit("evaluator.validate"), handler.Middleware.Recover(), // uniform error mapping } }

Ensure the same baseline is applied to other newly-added evaluator endpoints (batch_debug, etc.), tuned by route sensitivity.

CozeLoop · 2025-09-28T12:09:43Z

+
+func _batchdebugevaluatorMw(handler *apis.APIHandler) []app.HandlerFunc {
+	// your code...
+	return nil


🚫 Must-Fix: Batch debug route lacks protective middleware.

Problem: _batchdebugevaluatorMw returns nil, so /evaluators/batch_debug runs without auth, rate limiting, payload size limits, or audit/trace context. Batch operations are resource-intensive and can be abused for DoS if not guarded.

Impact: Elevated compute and sandbox invocations without QoS lead to performance hotspots and noisy neighbors. Missing audit breaks accountability.

Fix: Attach a strict middleware chain tailored for batch ops:

func _batchdebugevaluatorMw(handler *apis.APIHandler) []app.HandlerFunc { return []app.HandlerFunc{ handler.Middleware.Auth(), handler.Middleware.RateLimitPerUser(20, time.Minute), // conservative per-user burst handler.Middleware.ConcurrencyLimit(4), // bound concurrent batch execution handler.Middleware.RequestLimit(1<<20), // cap body size handler.Middleware.Trace(), handler.Middleware.Audit("evaluator.batch_debug"), handler.Middleware.Recover(), } }

Also consider deferring long-running execution via job queue with polling APIs to avoid tying up request threads.

CozeLoop · 2025-09-28T12:09:51Z

+
+func _insightanalysisexperimentMw(handler *apis.APIHandler) []app.HandlerFunc {
+	// your code...
+	return nil


🚫 Must-Fix: Insight analysis endpoints are exposed without middleware.

Problem: _insightanalysisexperimentMw returns nil, so /experiments/:expt_id/insight_analysis executes without authn/authz checks, rate limiting, or audit/trace context. Insight analysis is typically expensive and should be permission-scoped to the experiment’s space/owner.

Fix: Attach auth (space-scoped), rate/QoS, audit, and tracing middleware. Example:

func _insightanalysisexperimentMw(handler *apis.APIHandler) []app.HandlerFunc { return []app.HandlerFunc{ handler.Middleware.AuthSpaceScoped(), // verify access to :expt_id handler.Middleware.RateLimit(30, time.Minute), handler.Middleware.Trace(), handler.Middleware.Audit("experiment.insight_analysis"), handler.Middleware.Recover(), } }

Consider async job submission with status polling to avoid request blocking for long-running analysis.

CozeLoop · 2025-09-28T12:09:59Z

+
+func _insight_analysis_recordsMw(handler *apis.APIHandler) []app.HandlerFunc {
+	// your code...
+	return nil


🚫 Must-Fix: Records/feedback/comments routes lack auth and audit middleware.

Problem: _insight_analysis_recordsMw (and related _insight_analysis_record_idMw, _feedbackexptinsightanalysisreportMw, _listexptinsightanalysiscommentMw, etc.) all return nil. These endpoints modify/access experiment-scoped records; without authn/authz, rate limiting, and audit, they are vulnerable to unauthorized access and abuse.

Fix: Define a shared middleware set for the records group that enforces space-scoped permission checks, rate limits, and audit/trace propagation:

func _insight_analysis_recordsMw(handler *apis.APIHandler) []app.HandlerFunc { return []app.HandlerFunc{ handler.Middleware.AuthSpaceScoped(), handler.Middleware.RateLimit(60, time.Minute), handler.Middleware.Trace(), handler.Middleware.Audit("experiment.insight_analysis.records"), handler.Middleware.Recover(), } }

Apply complementary guards for child routes (DELETE/POST) to ensure proper authorization on record IDs.

CozeLoop · 2025-09-28T12:10:06Z

+
+func _mockevaltargetoutputMw(handler *apis.APIHandler) []app.HandlerFunc {
+	// your code...
+	return nil


🚨 Should-Fix: Mock output endpoint should be guarded to avoid misuse.

Problem: _mockevaltargetoutputMw returns nil for /eval_targets/mock_output. Even for mock/testing paths, unauthenticated use in production environments can leak target metadata or be abused for traffic amplification.

Suggestion: Attach lightweight but explicit guards: auth (or admin-only), rate limiting, and audit tagging so mock usage is observable.

func _mockevaltargetoutputMw(handler *apis.APIHandler) []app.HandlerFunc { return []app.HandlerFunc{ handler.Middleware.Auth(), handler.Middleware.RateLimit(60, time.Minute), handler.Middleware.Audit("eval_target.mock_output"), handler.Middleware.Trace(), } }

CozeLoop · 2025-09-28T12:10:57Z

 					_expt_id.POST("/associate_tag", append(_associateannotationtagMw(handler), apis.AssociateAnnotationTag)...)
 					_expt_id.POST("/clone", append(_cloneexperimentMw(handler), apis.CloneExperiment)...)
 					_expt_id.DELETE("/delete_tag", append(_deleteannotationtagMw(handler), apis.DeleteAnnotationTag)...)
+					_expt_id.POST("/insight_analysis", append(_insightanalysisexperimentMw(handler), apis.InsightAnalysisExperiment)...)


🚨 Should-Fix: Insight analysis should verify experiment-level permissions and guard resource usage.

Observation: /experiments/:expt_id/insight_analysis is registered with _insightanalysisexperimentMw(handler) which returns nil. This endpoint should verify access rights to :expt_id (space-scoped), add rate limiting, and emit audit/trace. Consider async job submission to avoid request blocking for long-running analysis.

CozeLoop · 2025-09-28T12:11:03Z

 						_export_records.POST("/list", append(_listexptresultexportrecordMw(handler), apis.ListExptResultExportRecord)...)
 					}
+					{
+						_insight_analysis_records := _expt_id.Group("/insight_analysis_records", _insight_analysis_recordsMw(handler)...)


🚨 Should-Fix: Records and nested routes need ACL checks and audit.

Observation: The group /experiments/:expt_id/insight_analysis_records attaches _insight_analysis_recordsMw(handler) which returns nil. Ensure space-scoped ACL, rate limiting, trace/audit, and consistent error handling across nested routes (DELETE, POST list/get, feedback, comments).

CozeLoop · 2025-09-28T12:11:11Z

+
+# Database configuration - use SQLite for testing
+sqlite:
+  path: "/tmp/coze_loop.db"


🚨 Should-Fix: Local-only infrastructure.yaml risks environment parity and accidental production use.

Observation: New config defaults to SQLite and disables Redis/ClickHouse/MQ, suitable for local testing but problematic if committed without environment scoping. Database规范 requires parity across Docker/K8s deployments and proper SQL init paths.

Recommendation:

Move this into a clearly local/dev profile (e.g., conf/infrastructure.local.yaml) and exclude from production packaging.

Ensure docker-compose and helm chart paths continue to apply the same SQL init changes; do not diverge schema.

Add comments and build-time gating to prevent accidental use in production.

CozeLoop · 2025-09-28T12:11:32Z

-    Python = 1
-    JS = 2
-}
+typedef string LanguageType(ts.enum="true")


💡 Suggestion: Confirm generated code and usage align with LanguageType as string enum.

Observation: LanguageType moved from numeric enum to typedef string with constants. Ensure downstream code (builders, runtime selection, validation) treat language type as canonical strings and normalize appropriately. Domain code CodeEvaluatorVersion.ValidateBaseInfo appears to normalize casing — good. Verify kitex-generated types and conversion paths are updated accordingly.

CozeLoop · 2025-09-28T12:11:57Z

+
+	// 检查HTTP状态码
+	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
+		bodyBytes, _ := io.ReadAll(resp.Body)


💡 Suggestion: Improve HTTP client robustness and connection reuse.

Consider draining response body on non-2xx before close to allow persistent connection reuse (e.g., io.Copy(io.Discard, resp.Body)).

For Body typed as io.Reader, defaulting Content-Type to text/plain may be surprising; prefer requiring explicit content-type from caller or default to application/octet-stream.

Optionally add retry/backoff for transient 5xx, with context-aware cancellation.

tpfz added a commit that referenced this pull request Sep 26, 2025

test: [Coda] add comprehensive unit tests for PR #210 with 80%+ patch…

24ba1b5

… coverage (LogID: 20250926135253010091110134060FD66) Co-Authored-By: Coda <coda@bytedance.com>

tpfz added a commit that referenced this pull request Sep 27, 2025

test: [Coda] add comprehensive unit tests for PR #210 with 80%+ coverage

fd0652a

(LogID: 20250927194646172020010002380FB80) Co-Authored-By: Coda <coda@bytedance.com>

tpfz force-pushed the feat/wzq/code_evaluator_commercial branch 3 times, most recently from 370557e to 95d4509 Compare September 28, 2025 05:55

tpfz changed the title ~~Feat/wzq/code evaluator commercial~~ feat: commercial code evaluator Sep 28, 2025

tpfz changed the title ~~feat: commercial code evaluator~~ [feat] [evaluate] code evaluator commercial Sep 28, 2025

tpfz force-pushed the feat/wzq/code_evaluator_commercial branch from c6d5e14 to c36439a Compare September 28, 2025 06:58

tpfz added 4 commits September 28, 2025 15:17

fix lint

f145568

fix lint

b25ebb3

fix err

b852773

fix lint

cbb223d

dsf86 approved these changes Sep 28, 2025

View reviewed changes

kasarolzzw approved these changes Sep 28, 2025

View reviewed changes

tpfz merged commit d027d8b into main Sep 28, 2025
19 of 20 checks passed

tpfz deleted the feat/wzq/code_evaluator_commercial branch September 28, 2025 11:45

CozeLoop reviewed Sep 28, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[feat] [evaluate] code evaluator commercial#210

[feat] [evaluate] code evaluator commercial#210
tpfz merged 5 commits into
mainfrom
feat/wzq/code_evaluator_commercial

tpfz commented Sep 24, 2025 •

edited

Loading

Uh oh!

codecov Bot commented Sep 25, 2025 •

edited

Loading

Uh oh!

Uh oh!

CozeLoop left a comment

Uh oh!

CozeLoop Sep 28, 2025

Uh oh!

CozeLoop Sep 28, 2025

Uh oh!

CozeLoop Sep 28, 2025

Uh oh!

CozeLoop Sep 28, 2025

Uh oh!

CozeLoop Sep 28, 2025

Uh oh!

CozeLoop Sep 28, 2025

Uh oh!

CozeLoop Sep 28, 2025

Uh oh!

CozeLoop Sep 28, 2025

Uh oh!

CozeLoop Sep 28, 2025

Uh oh!

CozeLoop Sep 28, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

tpfz commented Sep 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

Check the PR title.

(Optional) Translate the PR title into Chinese.

(Optional) More detailed description for this PR(en: English/zh: Chinese).

(Optional) Which issue(s) this PR fixes:

Uh oh!

codecov Bot commented Sep 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

CozeLoop left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

tpfz commented Sep 24, 2025 •

edited

Loading

codecov Bot commented Sep 25, 2025 •

edited

Loading