[BUG] Experiment Checkpoint Recovery Failure in FailRetry Mode #400

@Lysssyo

📋 CheckList

  • I have searched existing issues to avoid duplicates
  • I am using a recently maintained version of Coze Loop
  • I have provided all required information
  • I understand this is a bug report and not a feature request
  • I have submitted this report in English (otherwise it will not be processed)

🐛 Bug Description

In the FailRetry execution mode, the system is designed to load historical results from previous runs and skip evaluators that have already successfully completed.

However, two critical defects have been identified that cause this mechanism to fail completely.


Defect 1: Incorrect Historical Result ID Mapping (Root Cause)

1.1 Method Context & Responsibility

  • Method: GetExptItemTurnResults
  • Location: backend/modules/evaluation/domain/service/expt_result_impl.go
  • Caller: ExptRecordEvalModeFailRetry.PreEval (in expt_run_item_event_impl.go), called during the initialization phase of FailRetry mode.
  • Business Goal: To fetch detailed results (Turn Results) of the Item's previous run from the database. These results are converted into a RunLog and stored in the expt_turn_result_run_log table, serving as the "historical memory" for the Worker execution.

1.2 Method Signature & Data Structure

  • Input: exptID, itemID (identifying the specific experiment and data item)
  • Output: []*entity.ExptTurnResult
    • This structure contains a critical field EvaluatorResults (Type: *entity.EvaluatorResults).
    • Internally, it holds a Map: EvalVerIDToResID map[int64]int64 (Expected Key: Evaluator Version ID -> Value: Evaluator Record ID).

1.3 Defect Analysis

When assembling the EvaluatorResults Map, the code incorrectly assigns EvaluatorVersionID as the Value, whereas it should assign EvaluatorResultID.

Code Snippet (The Bug):

// backend/modules/evaluation/domain/service/expt_result_impl.go

// refs are records fetched from the intermediate table expt_turn_evaluator_result_ref
// Contains: {ExptTurnResultID, EvaluatorVersionID, EvaluatorResultID}
for _, ref := range refs {
    // ...
    // [CRITICAL BUG]
    // Expected: Value = ref.EvaluatorResultID (Primary Key of evaluator_record table, e.g., 748392...)
    // Actual: Value = ref.EvaluatorVersionID (Config ID of the evaluator, e.g., 1001)
    turnEvaluatorVerIDToResultID[ref.ExptTurnResultID][ref.EvaluatorVersionID] = ref.EvaluatorVersionID 
}

1.4 Downstream Chain Reaction (Consequences)

This error triggers a domino effect of failures:

  1. Persistence Phase (PreEval):

    • The PreEval method calls ToRunLogDO(), serializing this incorrect Map {VerID: VerID} into JSON.
    • This JSON is stored in the evaluator_result_ids field of the expt_turn_result_run_log table.
    • DB Actual: {"EvalVerIDToResID":{"1001":1001}} (Incorrect, points to VersionID)
    • DB Expected: {"EvalVerIDToResID":{"1001":74839201}} (Correct, points to RecordID)
  2. Worker Loading Phase (buildExptTurnEvalCtx):

    • When the Worker (ExptItemEvalCtxExecutor) starts, it reads the Log and parses out {1001: 1001}.
    • It calls evaluatorRecordService.BatchGetEvaluatorRecord(ids=[1001]).
    • Failure: The system attempts to find a record with ID=1001 in the evaluator_record table. Since 1001 is a VersionID and not a RecordID, the query returns empty.
  3. Execution Decision Phase (CallEvaluators):

    • The Worker finds no historical records loaded in memory.
    • The check if existResult != nil evaluates to false because existResult is nil.
    • Result: The Worker assumes the evaluator has never run and initiates a new RPC call, causing duplicate billing.
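
The persistence and loading phases above can be sketched in isolation. This is a minimal illustration, not the project's code: evaluatorResults, evaluatorRecord, batchGet, and loadFromRunLog are hypothetical stand-ins, and an in-memory map plays the role of the evaluator_record table. It only assumes the JSON shape shown above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// evaluatorResults mirrors the JSON shape described above (hypothetical stand-in
// for entity.EvaluatorResults).
type evaluatorResults struct {
	EvalVerIDToResID map[int64]int64
}

// evaluatorRecord is a hypothetical, trimmed-down row of evaluator_record.
type evaluatorRecord struct {
	ID                 int64
	EvaluatorVersionID int64
}

// batchGet is a hypothetical in-memory replacement for BatchGetEvaluatorRecord:
// it returns only the records whose primary key appears in ids.
func batchGet(table map[int64]*evaluatorRecord, ids []int64) []*evaluatorRecord {
	var out []*evaluatorRecord
	for _, id := range ids {
		if rec, ok := table[id]; ok {
			out = append(out, rec)
		}
	}
	return out
}

// loadFromRunLog parses the stored JSON and fetches the records it points to.
func loadFromRunLog(table map[int64]*evaluatorRecord, raw []byte) ([]*evaluatorRecord, error) {
	var res evaluatorResults
	if err := json.Unmarshal(raw, &res); err != nil {
		return nil, err
	}
	ids := make([]int64, 0, len(res.EvalVerIDToResID))
	for _, resID := range res.EvalVerIDToResID {
		ids = append(ids, resID)
	}
	return batchGet(table, ids), nil
}

func main() {
	table := map[int64]*evaluatorRecord{
		74839201: {ID: 74839201, EvaluatorVersionID: 1001},
	}

	// Serialize the buggy mapping (VerID -> VerID) and the intended one (VerID -> RecordID).
	buggy, _ := json.Marshal(evaluatorResults{EvalVerIDToResID: map[int64]int64{1001: 1001}})
	fixed, _ := json.Marshal(evaluatorResults{EvalVerIDToResID: map[int64]int64{1001: 74839201}})

	for _, raw := range [][]byte{buggy, fixed} {
		recs, err := loadFromRunLog(table, raw)
		fmt.Println(string(raw), "records found:", len(recs), "err:", err)
	}
}
```

With the buggy mapping the lookup by 1001 finds nothing, which is the empty result that leads the Worker to re-run the evaluator.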

Defect 2: Worker Context Cache Key Mismatch (Secondary Issue)

2.1 Problem Description

Even if Defect 1 is fixed (ensuring the database stores the correct RecordID), the Worker's internal context caching mechanism has a logic flaw where the Write Key and Read Key do not match, causing cache lookups to fail.

2.2 Key Objects & Variables

  • Execution Context (Worker Context): etec (Type: *entity.ExptTurnEvalCtx)
  • Cache Map Variable: etec.ExptTurnRunResult.EvaluatorResults (Type: map[int64]*entity.EvaluatorRecord)
    • Responsibility: Caches loaded historical evaluator records in the Worker's memory for fast lookup in subsequent steps to avoid duplicate calls.

2.3 Logic Conflict Analysis

Writer Side (Map Construction)

  • Location: backend/modules/evaluation/domain/service/expt_run_item_impl.go
  • Method: buildExptTurnEvalCtx
  • Timing: Runs before the Worker starts processing a Turn, when the execution context is built.
  • Behavior: Upon successfully fetching evaluatorRecords from the DB, it builds the Map.
  • Flawed Logic:
    // Uses record.ID (RecordID) as the Map Key
    recordMap[record.ID] = record 
    etec.ExptTurnRunResult.EvaluatorResults = recordMap
  • Result: In-memory Map structure is {74839201: RecordObject} (Key is RecordID).

Reader Side (Map Lookup)

  • Location: backend/modules/evaluation/domain/service/expt_run_item_turn_impl.go
  • Method: CallEvaluators (calls GetEvaluatorRecord)
  • Timing: Runs after the Worker finishes calling the LLM, when it is about to execute evaluators one by one.
  • Behavior: Checks the cache to decide whether to skip an evaluator.
  • Flawed Logic:
    // Uses evaluatorVersion.GetEvaluatorVersionID() (VersionID) as the Key
    existResult := etec.ExptTurnRunResult.GetEvaluatorRecord(evaluatorVersion.GetEvaluatorVersionID())
    (Underlying implementation: return e.EvaluatorResults[evaluatorVersionID])
  • Lookup Action: Attempts to find Key 1001 in the Map.

2.4 Consequence: Cache Never Hits

This is a classic Key Mismatch.

  • The memory Map stores {74839201: Record}.
  • The code tries to retrieve Map[1001].
  • Result: Returns nil.
  • Business Impact: The Worker misjudges the evaluator as "not executed" and re-executes it, rendering the checkpoint recovery mechanism ineffective.
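
The mismatch can be reproduced with a few lines of standalone Go. The sketch below uses hypothetical stand-ins (evaluatorRecord, turnRunResult, indexByRecordID, indexByVersionID) rather than the real entity types; only the keying behavior described above is modeled.

```go
package main

import "fmt"

// evaluatorRecord is a hypothetical, trimmed-down stand-in for entity.EvaluatorRecord.
type evaluatorRecord struct {
	ID                 int64
	EvaluatorVersionID int64
}

// turnRunResult is a hypothetical stand-in for the cache holder on the Worker context.
type turnRunResult struct {
	EvaluatorResults map[int64]*evaluatorRecord
}

// GetEvaluatorRecord looks the cache up by evaluator version ID, as the reader side does.
func (t *turnRunResult) GetEvaluatorRecord(versionID int64) *evaluatorRecord {
	if t == nil || t.EvaluatorResults == nil {
		return nil
	}
	return t.EvaluatorResults[versionID]
}

// indexByRecordID reproduces the writer side as reported: the map key is the record ID.
func indexByRecordID(records []*evaluatorRecord) map[int64]*evaluatorRecord {
	m := make(map[int64]*evaluatorRecord, len(records))
	for _, r := range records {
		m[r.ID] = r
	}
	return m
}

// indexByVersionID keys the cache by evaluator version ID, matching the reader.
func indexByVersionID(records []*evaluatorRecord) map[int64]*evaluatorRecord {
	m := make(map[int64]*evaluatorRecord, len(records))
	for _, r := range records {
		m[r.EvaluatorVersionID] = r
	}
	return m
}

func main() {
	records := []*evaluatorRecord{{ID: 74839201, EvaluatorVersionID: 1001}}

	mismatched := &turnRunResult{EvaluatorResults: indexByRecordID(records)}
	aligned := &turnRunResult{EvaluatorResults: indexByVersionID(records)}

	fmt.Println("record-ID keys hit:", mismatched.GetEvaluatorRecord(1001) != nil)
	fmt.Println("version-ID keys hit:", aligned.GetEvaluatorRecord(1001) != nil)
}
```

Indexing a Go map with a missing key returns the zero value, so the mismatched lookup silently yields nil instead of failing loudly.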

🔄 Steps to Reproduce

This issue can be reliably reproduced by observing the logs and database state during a retry operation.

  1. Prepare an Experiment: Create an experiment with at least one evaluator (e.g., an LLM-based evaluator).
  2. Run & Fail: Run the experiment and ensure it fails after the evaluator has successfully run (e.g., simulate a timeout or error in a subsequent step, or manually interrupt it). Ensure the evaluator_record table has a successful record for this run.
  3. Trigger Retry: Restart the experiment in FailRetry mode.
  4. Observe Logs & DB:
    • Check DB: Inspect the expt_turn_result_run_log table for the new run. The evaluator_result_ids JSON field will show {"1001": 1001} (where 1001 is the VersionID), instead of {"1001": 74839...} (where 74839... is the RecordID).
    • Check Worker Logs: The Worker logs will show it calling the LLM/Evaluator again, instead of logging "skip evaluator".
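
When inspecting many rows, the database check can be automated with a small helper. The following is a heuristic sketch only: selfMappedVersionIDs and runLogEvaluatorResults are hypothetical names, the JSON shape is assumed to be the {"EvalVerIDToResID": {...}} form shown in Defect 1, and a value equal to its key is treated as suspicious even though a legitimate record ID could in principle coincide with a version ID.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// runLogEvaluatorResults is a hypothetical stand-in for the JSON stored in
// the evaluator_result_ids column.
type runLogEvaluatorResults struct {
	EvalVerIDToResID map[int64]int64
}

// selfMappedVersionIDs returns, in ascending order, the version IDs whose mapped
// value equals the key itself, which is the signature of the reported bug.
func selfMappedVersionIDs(raw []byte) ([]int64, error) {
	var parsed runLogEvaluatorResults
	if err := json.Unmarshal(raw, &parsed); err != nil {
		return nil, err
	}
	var suspicious []int64
	for verID, resID := range parsed.EvalVerIDToResID {
		if verID == resID {
			suspicious = append(suspicious, verID)
		}
	}
	sort.Slice(suspicious, func(i, j int) bool { return suspicious[i] < suspicious[j] })
	return suspicious, nil
}

func main() {
	rows := []string{
		`{"EvalVerIDToResID":{"1001":1001}}`,
		`{"EvalVerIDToResID":{"1001":74839201}}`,
	}
	for _, raw := range rows {
		ids, err := selfMappedVersionIDs([]byte(raw))
		fmt.Println(raw, "suspicious version IDs:", ids, "err:", err)
	}
}
```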

✅ Expected Behavior

  1. Database: The expt_turn_result_run_log table should store a correct mapping of EvaluatorVersionID -> EvaluatorResultID (e.g., {"1001": 74839201}).
  2. Worker Logic:
    • The Worker should successfully load the EvaluatorRecord using the ID from the log.
    • The Worker should populate its internal cache map using EvaluatorVersionID as the key.
    • When iterating through evaluators, the GetEvaluatorRecord(VersionID) check should return the cached record.
    • The evaluator execution should be skipped.
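
The expected skip decision can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: splitEvaluators, recordStatus, and the status constants are hypothetical, 1002 is a made-up second evaluator version, and the success check reflects the stated goal of skipping only evaluators that have already completed successfully.

```go
package main

import "fmt"

// recordStatus is a hypothetical status enum for an evaluator record.
type recordStatus int

const (
	statusUnknown recordStatus = iota
	statusSuccess
	statusFail
)

// evaluatorRecord is a hypothetical, trimmed-down stand-in for entity.EvaluatorRecord.
type evaluatorRecord struct {
	ID                 int64
	EvaluatorVersionID int64
	Status             recordStatus
}

// splitEvaluators partitions evaluator version IDs into those that can be skipped
// (a successful historical record is cached under the version ID) and those that
// still have to run. The cache must be keyed by evaluator version ID.
func splitEvaluators(versionIDs []int64, cache map[int64]*evaluatorRecord) (skip, run []int64) {
	for _, verID := range versionIDs {
		if rec := cache[verID]; rec != nil && rec.Status == statusSuccess {
			skip = append(skip, verID)
			continue
		}
		run = append(run, verID)
	}
	return skip, run
}

func main() {
	cache := map[int64]*evaluatorRecord{
		1001: {ID: 74839201, EvaluatorVersionID: 1001, Status: statusSuccess},
	}
	skip, run := splitEvaluators([]int64{1001, 1002}, cache)
	fmt.Println("skip:", skip, "run:", run)
}
```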

❌ Actual Behavior

  1. Database: The expt_turn_result_run_log table stores an incorrect mapping of EvaluatorVersionID -> EvaluatorVersionID (e.g., {"1001": 1001}). Note that this value is eventually corrected: once the evaluator has been re-executed, the results are written back and evaluator_result_ids in expt_turn_result_run_log ends up holding the correct values.
  2. Worker Logic:
    • The Worker attempts to query evaluator_record with ID 1001 and finds nothing.
    • Even if the ID were correct, the Worker builds its cache map using RecordID as the key (Map[74839201] = Record).
    • The lookup logic tries to find Map[1001], which returns nil.
    • The system proceeds to re-execute the evaluator.

🚨 Severity

Low - Cosmetic issue or minor inconvenience

🔧 Component

None

💻 Environment

No response

🔧 Go Environment

No response

📋 Logs

📝 Additional Context

To fully resolve this issue, fixes must be applied to both the data source generation and the Worker's cache logic.

Fix 1: Correct Data Mapping in Domain Service

File: backend/modules/evaluation/domain/service/expt_result_impl.go
Method: GetExptItemTurnResults

Change the assignment to use the correct EvaluatorResultID:

// Before
turnEvaluatorVerIDToResultID[ref.ExptTurnResultID][ref.EvaluatorVersionID] = ref.EvaluatorVersionID

// After (Fix)
turnEvaluatorVerIDToResultID[ref.ExptTurnResultID][ref.EvaluatorVersionID] = ref.EvaluatorResultID
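
As a regression check, the corrected assignment can be exercised in isolation. The sketch below is not the body of GetExptItemTurnResults: evaluatorResultRef and buildVerIDToResultID are hypothetical names, the turn result ID 1 is made up, and the inner-map initialization is an assumption about how the elided part of the loop behaves.

```go
package main

import "fmt"

// evaluatorResultRef is a hypothetical stand-in for a row of
// expt_turn_evaluator_result_ref.
type evaluatorResultRef struct {
	ExptTurnResultID   int64
	EvaluatorVersionID int64
	EvaluatorResultID  int64
}

// buildVerIDToResultID groups refs by turn result ID and maps each evaluator
// version ID to its evaluator result (record) ID.
func buildVerIDToResultID(refs []evaluatorResultRef) map[int64]map[int64]int64 {
	out := make(map[int64]map[int64]int64)
	for _, ref := range refs {
		if out[ref.ExptTurnResultID] == nil {
			out[ref.ExptTurnResultID] = make(map[int64]int64)
		}
		// The value must be the result ID, not the version ID.
		out[ref.ExptTurnResultID][ref.EvaluatorVersionID] = ref.EvaluatorResultID
	}
	return out
}

func main() {
	refs := []evaluatorResultRef{
		{ExptTurnResultID: 1, EvaluatorVersionID: 1001, EvaluatorResultID: 74839201},
	}
	got := buildVerIDToResultID(refs)
	fmt.Println("maps to record ID:", got[1][1001] == 74839201)
}
```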

Fix 2: Align Worker Cache Key

File: backend/modules/evaluation/domain/service/expt_run_item_impl.go
Method: buildExptTurnEvalCtx

Change the map key to EvaluatorVersionID to match the lookup logic in GetEvaluatorRecord:

// Before
recordMap[record.ID] = record

// After (Fix)
recordMap[record.EvaluatorVersionID] = record

I have identified and fixed the issue, and the test results meet expectations. Please assign this issue to me, and I will submit a PR to resolve it.
