[BUG] Experiment Checkpoint Recovery Failure in FailRetry Mode

### 📋 CheckList

- [x] I have searched existing issues to avoid duplicates
- [x] I am using a recently maintained version of Coze Loop
- [x] I have provided all required information
- [x] I understand this is a bug report and not a feature request
- [x] I have submitted this report in English (otherwise it will not be processed)

### 🐛 Bug Description

In the **FailRetry** execution mode, the system is designed to load historical results from previous runs and skip evaluators that have already successfully completed. 

However, two critical defects have been identified that cause this mechanism to fail completely.

---

### Defect 1: Incorrect Historical Result ID Mapping (Root Cause)

#### 1.1 Method Context & Responsibility
*   **Method**: `GetExptItemTurnResults`
*   **Location**: `backend/modules/evaluation/domain/service/expt_result_impl.go`
*   **Caller**: `ExptRecordEvalModeFailRetry.PreEval` (in `expt_run_item_event_impl.go`), called during the initialization phase of FailRetry mode.
*   **Business Goal**: To fetch detailed results (Turn Results) of the Item's previous run from the database. These results are converted into a `RunLog` and stored in the `expt_turn_result_run_log` table, serving as the "historical memory" for the Worker execution.

#### 1.2 Method Signature & Data Structure
*   **Input**: `exptID`, `itemID` (identifying the specific experiment and data item)
*   **Output**: `[]*entity.ExptTurnResult`
    *   This structure contains a critical field `EvaluatorResults` (Type: `*entity.EvaluatorResults`).
    *   Internally, it holds a Map: `EvalVerIDToResID map[int64]int64` (Expected Key: Evaluator Version ID -> Value: Evaluator Record ID).

#### 1.3 Defect Analysis
When assembling the `EvaluatorResults` Map, the code incorrectly assigns **EvaluatorVersionID** as the Value, whereas it should assign **EvaluatorResultID**.

**Code Snippet (The Bug)**:
```go
// backend/modules/evaluation/domain/service/expt_result_impl.go

// refs are records fetched from the intermediate table expt_turn_evaluator_result_ref
// Contains: {ExptTurnResultID, EvaluatorVersionID, EvaluatorResultID}
for _, ref := range refs {
    // ...
    // [CRITICAL BUG]
    // Expected: Value = ref.EvaluatorResultID (Primary Key of evaluator_record table, e.g., 748392...)
    // Actual: Value = ref.EvaluatorVersionID (Config ID of the evaluator, e.g., 1001)
    turnEvaluatorVerIDToResultID[ref.ExptTurnResultID][ref.EvaluatorVersionID] = ref.EvaluatorVersionID 
}
```

#### 1.4 Downstream Chain Reaction (Consequences)
This error triggers a domino effect of failures:

1.  **Persistence Phase (PreEval)**:
    *   The `PreEval` method calls `ToRunLogDO()`, serializing this incorrect Map `{VerID: VerID}` into JSON.
    *   This JSON is stored in the `evaluator_result_ids` field of the `expt_turn_result_run_log` table.
    *   **DB Actual**: `{"EvalVerIDToResID":{"1001":1001}}` (Incorrect, points to VersionID)
    *   **DB Expected**: `{"EvalVerIDToResID":{"1001":74839201}}` (Correct, points to RecordID)

2.  **Worker Loading Phase (buildExptTurnEvalCtx)**:
    *   When the Worker (`ExptItemEvalCtxExecutor`) starts, it reads the Log and parses out `{1001: 1001}`.
    *   It calls `evaluatorRecordService.BatchGetEvaluatorRecord(ids=[1001])`.
    *   **Failure**: The system attempts to find a record with `ID=1001` in the `evaluator_record` table. Since 1001 is a VersionID and not a RecordID, the query returns empty.

3.  **Execution Decision Phase (CallEvaluators)**:
    *   The Worker finds no historical records loaded in memory.
    *   The check `if existResult != nil` evaluates to false.
    *   **Result**: The Worker assumes the evaluator has never run and initiates a new RPC call, causing duplicate billing.
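
The persistence step of this chain can be illustrated with a minimal, self-contained sketch. It is not the project's code: `turnEvaluatorRef`, `evaluatorResults`, `buildMapping`, and `toJSON` are hypothetical stand-ins that model only the three IDs and the `EvalVerIDToResID` field named in this report, and the serialized shape is assumed to match the "DB Actual" / "DB Expected" examples above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// turnEvaluatorRef is a hypothetical stand-in for a row of
// expt_turn_evaluator_result_ref; only the three IDs from the report are modeled.
type turnEvaluatorRef struct {
	ExptTurnResultID   int64
	EvaluatorVersionID int64
	EvaluatorResultID  int64
}

// evaluatorResults mirrors only the EvalVerIDToResID field described in 1.2.
type evaluatorResults struct {
	EvalVerIDToResID map[int64]int64
}

// buildMapping groups refs by turn result ID. With useResultID=false it
// reproduces the reported bug (value = version ID); with true it applies the fix.
func buildMapping(refs []turnEvaluatorRef, useResultID bool) map[int64]map[int64]int64 {
	out := make(map[int64]map[int64]int64)
	for _, ref := range refs {
		if out[ref.ExptTurnResultID] == nil {
			out[ref.ExptTurnResultID] = make(map[int64]int64)
		}
		val := ref.EvaluatorVersionID
		if useResultID {
			val = ref.EvaluatorResultID
		}
		out[ref.ExptTurnResultID][ref.EvaluatorVersionID] = val
	}
	return out
}

// toJSON serializes one turn's mapping in the assumed run-log shape.
func toJSON(m map[int64]int64) string {
	b, err := json.Marshal(evaluatorResults{EvalVerIDToResID: m})
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	refs := []turnEvaluatorRef{
		{ExptTurnResultID: 1, EvaluatorVersionID: 1001, EvaluatorResultID: 74839201},
	}
	fmt.Println("buggy:", toJSON(buildMapping(refs, false)[1]))
	fmt.Println("fixed:", toJSON(buildMapping(refs, true)[1]))
}
```

With the buggy assignment every entry maps a version ID to itself, so any later lookup in `evaluator_record` by that value cannot find the stored record.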

---

### Defect 2: Worker Context Cache Key Mismatch (Secondary Issue)

#### 2.1 Problem Description
Even if Defect 1 is fixed (ensuring the database stores the correct RecordID), the Worker's internal **context caching mechanism** has a logic flaw where the **Write Key** and **Read Key** do not match, causing cache lookups to fail.

#### 2.2 Key Objects & Variables
*   **Execution Context (Worker Context)**: `etec` (Type: `*entity.ExptTurnEvalCtx`)
*   **Cache Map Variable**: `etec.ExptTurnRunResult.EvaluatorResults` (Type: `map[int64]*entity.EvaluatorRecord`)
    *   **Responsibility**: Caches loaded historical evaluator records in the Worker's memory for fast lookup in subsequent steps to avoid duplicate calls.

#### 2.3 Logic Conflict Analysis

**Writer Side (Map Construction)**
*   **Location**: `backend/modules/evaluation/domain/service/expt_run_item_impl.go`
*   **Method**: `buildExptTurnEvalCtx`
*   **Timing**: Before the Worker starts processing a Turn, responsible for building the execution context.
*   **Behavior**: Upon successfully fetching `evaluatorRecords` from the DB, it builds the Map.
*   **Flawed Logic**:
    ```go
    // Uses record.ID (RecordID) as the Map Key
    recordMap[record.ID] = record 
    etec.ExptTurnRunResult.EvaluatorResults = recordMap
    ```
*   **Result**: In-memory Map structure is `{74839201: RecordObject}` (Key is RecordID).

**Reader Side (Map Lookup)**
*   **Location**: `backend/modules/evaluation/domain/service/expt_run_item_turn_impl.go`
*   **Method**: `CallEvaluators` (calls `GetEvaluatorRecord`)
*   **Timing**: After the Worker finishes calling the LLM, preparing to execute evaluators one by one.
*   **Behavior**: Checks the cache to decide whether to skip an evaluator.
*   **Flawed Logic**:
    ```go
    // Uses evaluatorVersion.GetEvaluatorVersionID() (VersionID) as the Key
    existResult := etec.ExptTurnRunResult.GetEvaluatorRecord(evaluatorVersion.GetEvaluatorVersionID())
    ```
    (Underlying implementation: `return e.EvaluatorResults[evaluatorVersionID]`)
*   **Lookup Action**: Attempts to find Key `1001` in the Map.

#### 2.4 Consequence: Cache Never Hits
This is a classic **Key Mismatch**.
*   The memory Map stores `{74839201: Record}`.
*   The code tries to retrieve `Map[1001]`.
*   **Result**: Returns `nil`.
*   **Business Impact**: The Worker wrongly concludes that the evaluator has not been executed and runs it again, rendering the checkpoint recovery mechanism ineffective.
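
The mismatch can be reproduced in isolation with a small sketch. The types below are simplified, hypothetical stand-ins for `entity.EvaluatorRecord` and `ExptTurnRunResult`; only the `ID` and `EvaluatorVersionID` fields and the lookup shape quoted in 2.3 are assumed.

```go
package main

import "fmt"

// evaluatorRecord is a simplified stand-in for entity.EvaluatorRecord.
type evaluatorRecord struct {
	ID                 int64
	EvaluatorVersionID int64
}

// exptTurnRunResult is a simplified stand-in holding only the cache map.
type exptTurnRunResult struct {
	EvaluatorResults map[int64]*evaluatorRecord
}

// GetEvaluatorRecord looks the cache up by evaluator version ID,
// matching the underlying implementation quoted in 2.3.
func (e *exptTurnRunResult) GetEvaluatorRecord(evaluatorVersionID int64) *evaluatorRecord {
	if e == nil {
		return nil
	}
	return e.EvaluatorResults[evaluatorVersionID]
}

// cacheByRecordID reproduces the writer side as reported (key = record ID).
func cacheByRecordID(records []*evaluatorRecord) map[int64]*evaluatorRecord {
	m := make(map[int64]*evaluatorRecord, len(records))
	for _, r := range records {
		m[r.ID] = r
	}
	return m
}

// cacheByVersionID keys the cache the way the reader side expects.
func cacheByVersionID(records []*evaluatorRecord) map[int64]*evaluatorRecord {
	m := make(map[int64]*evaluatorRecord, len(records))
	for _, r := range records {
		m[r.EvaluatorVersionID] = r
	}
	return m
}

func main() {
	records := []*evaluatorRecord{{ID: 74839201, EvaluatorVersionID: 1001}}

	buggy := &exptTurnRunResult{EvaluatorResults: cacheByRecordID(records)}
	fixed := &exptTurnRunResult{EvaluatorResults: cacheByVersionID(records)}

	fmt.Println("buggy cache hit:", buggy.GetEvaluatorRecord(1001) != nil)
	fmt.Println("fixed cache hit:", fixed.GetEvaluatorRecord(1001) != nil)
}
```

Because both sides use `int64` keys, the compiler cannot catch the mix-up; distinct named types for record IDs and version IDs would be one way to make such a mismatch a compile-time error.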

### 🔄 Steps to Reproduce

This issue can be reliably reproduced by observing the logs and database state during a retry operation.

1.  **Prepare an Experiment**: Create an experiment with at least one evaluator (e.g., an LLM-based evaluator).
2.  **Run & Fail**: Run the experiment and ensure it fails *after* the evaluator has successfully run (e.g., simulate a timeout or error in a subsequent step, or manually interrupt it). Ensure the `evaluator_record` table has a successful record for this run.
3.  **Trigger Retry**: Restart the experiment in **FailRetry** mode.
4.  **Observe Logs & DB**:
    *   **Check DB**: Inspect the `expt_turn_result_run_log` table for the new run. The `evaluator_result_ids` JSON field will show `{"1001": 1001}` (where `1001` is the VersionID), instead of `{"1001": 74839...}` (where `74839...` is the RecordID).
    *   **Check Worker Logs**: The Worker logs will show it calling the LLM/Evaluator again, instead of logging "skip evaluator".
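
For the database check, a small helper can flag suspicious entries in a copied `evaluator_result_ids` value. This is only a sketch: `selfMappedVersions` and `evaluatorResultIDs` are hypothetical names, and the JSON is assumed to have the `{"EvalVerIDToResID":{...}}` shape shown in section 1.4.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// evaluatorResultIDs models the assumed JSON shape of evaluator_result_ids.
type evaluatorResultIDs struct {
	EvalVerIDToResID map[int64]int64
}

// selfMappedVersions returns, in ascending order, the version IDs whose
// mapped value equals the key, which is the symptom described in Defect 1.
func selfMappedVersions(raw string) ([]int64, error) {
	var parsed evaluatorResultIDs
	if err := json.Unmarshal([]byte(raw), &parsed); err != nil {
		return nil, err
	}
	var bad []int64
	for verID, resID := range parsed.EvalVerIDToResID {
		if verID == resID {
			bad = append(bad, verID)
		}
	}
	sort.Slice(bad, func(i, j int) bool { return bad[i] < bad[j] })
	return bad, nil
}

func main() {
	raw := `{"EvalVerIDToResID":{"1001":1001,"1002":74839202}}`
	bad, err := selfMappedVersions(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println("self-mapped version IDs:", bad)
}
```

The check is a heuristic: a key equal to its value strongly suggests the bug, but it does not by itself prove that the value is not a real record ID.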

### ✅ Expected Behavior

1.  **Database**: The `expt_turn_result_run_log` table should store a correct mapping of `EvaluatorVersionID -> EvaluatorResultID` (e.g., `{"1001": 74839201}`).
2.  **Worker Logic**:
    *   The Worker should successfully load the `EvaluatorRecord` using the ID from the log.
    *   The Worker should populate its internal cache map using `EvaluatorVersionID` as the key.
    *   When iterating through evaluators, the `GetEvaluatorRecord(VersionID)` check should return the cached record.
    *   The evaluator execution should be skipped.
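
The expected flow can be sketched end to end as follows. Everything here is illustrative: `recordStore` is a hypothetical in-memory stand-in for the `evaluator_record` table, `batchGet` only approximates the role of `BatchGetEvaluatorRecord`, and `evaluatorsToRun` is an invented helper that condenses the load, cache, and skip decisions into one function.

```go
package main

import "fmt"

// evaluatorRecord is a simplified stand-in for entity.EvaluatorRecord.
type evaluatorRecord struct {
	ID                 int64
	EvaluatorVersionID int64
}

// recordStore is a hypothetical in-memory stand-in for the evaluator_record
// table, keyed by record ID (the primary key).
type recordStore map[int64]*evaluatorRecord

// batchGet returns the records whose primary keys appear in ids;
// unknown IDs are silently skipped, like an empty query result.
func (s recordStore) batchGet(ids []int64) []*evaluatorRecord {
	var out []*evaluatorRecord
	for _, id := range ids {
		if r, ok := s[id]; ok {
			out = append(out, r)
		}
	}
	return out
}

// evaluatorsToRun loads historical records via the run-log mapping, caches
// them by evaluator version ID, and returns the version IDs that still need
// to be executed, in the order given by versionIDs.
func evaluatorsToRun(store recordStore, verIDToResID map[int64]int64, versionIDs []int64) []int64 {
	ids := make([]int64, 0, len(verIDToResID))
	for _, resID := range verIDToResID {
		ids = append(ids, resID)
	}

	cache := make(map[int64]*evaluatorRecord)
	for _, rec := range store.batchGet(ids) {
		cache[rec.EvaluatorVersionID] = rec
	}

	var pending []int64
	for _, verID := range versionIDs {
		if cache[verID] == nil {
			pending = append(pending, verID)
		}
	}
	return pending
}

func main() {
	store := recordStore{74839201: {ID: 74839201, EvaluatorVersionID: 1001}}
	versions := []int64{1001, 1002}

	// Correct mapping: evaluator 1001 is skipped, only 1002 remains.
	fmt.Println(evaluatorsToRun(store, map[int64]int64{1001: 74839201}, versions))
	// Mapping as written by Defect 1: nothing loads, so both are re-run.
	fmt.Println(evaluatorsToRun(store, map[int64]int64{1001: 1001}, versions))
}
```

The second call shows why both conditions matter: with a wrong ID in the run log nothing is loaded at all, so even a correctly keyed cache has nothing to return.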

### ❌ Actual Behavior

1.  **Database**: The `expt_turn_result_run_log` table stores an incorrect mapping of `EvaluatorVersionID -> EvaluatorVersionID` (e.g., `{"1001": 1001}`). Note that this incorrect value does not persist: when the Worker re-calls the LLM and re-executes the evaluator, the new result is written back, so `evaluator_result_ids` in `expt_turn_result_run_log` ultimately holds the correct values.
2.  **Worker Logic**:
    *   The Worker attempts to query `evaluator_record` with ID `1001` and finds nothing.
    *   Even if the ID were correct, the Worker builds its cache map using `RecordID` as the key (`Map[74839201] = Record`).
    *   The lookup logic tries to find `Map[1001]`, which returns `nil`.
    *   The system proceeds to re-execute the evaluator.

### 🚨 Severity

Low - Cosmetic issue or minor inconvenience

### 🔧 Component

None

### 💻 Environment

_No response_

### 🔧 Go Environment

_No response_

### 📋 Logs

```shell

```

### 📝 Additional Context

To fully resolve this issue, fixes must be applied to both the data source generation and the Worker's cache logic.

### Fix 1: Correct Data Mapping in Domain Service
**File**: `backend/modules/evaluation/domain/service/expt_result_impl.go`
**Method**: `GetExptItemTurnResults`

Change the assignment to use the correct `EvaluatorResultID`:

```go
// Before
turnEvaluatorVerIDToResultID[ref.ExptTurnResultID][ref.EvaluatorVersionID] = ref.EvaluatorVersionID

// After (Fix)
turnEvaluatorVerIDToResultID[ref.ExptTurnResultID][ref.EvaluatorVersionID] = ref.EvaluatorResultID
```

### Fix 2: Align Worker Cache Key
**File**: `backend/modules/evaluation/domain/service/expt_run_item_impl.go`
**Method**: `buildExptTurnEvalCtx`

Change the map key to `EvaluatorVersionID` to match the lookup logic in `GetEvaluatorRecord`:

```go
// Before
recordMap[record.ID] = record

// After (Fix)
recordMap[record.EvaluatorVersionID] = record
```

### I have accurately identified and fixed the issue, and the test results meet expectations. Please assign this task to me, and I will submit a PR to resolve it.

