Commit a0e168c

sjarmak and claude committed
fix: create sg-evals mirrors and fix MCP config for 8 SWE-bench Pro tasks
The 8 new SWE-bench Pro tasks (1 teleport, 5 protonmail, 2 element-web) had MIRROR_NOT_FOUND in their Dockerfile.sg_only clone manifests, causing the verifier to fail (score=0) on MCP runs. Also missing SOURCEGRAPH_REPOS env var, so agents searched upstream repos instead of pinned mirrors.

- Created 8 sg-evals mirrors pinned to task base_commits
- Fixed clone manifests to reference correct mirrors
- Added SOURCEGRAPH_REPOS env var to all 8 Dockerfile.sg_only
- Regenerated instruction_mcp.md with correct repo: filter scopes
- Triggered Sourcegraph sync — all 8 mirrors indexed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 9e22153 commit a0e168c

18 files changed (+752, -57 lines changed)

benchmarks/csb_sdlc_fix/element-web-roomheaderbuttons-can-crash-fix-001/environment/Dockerfile.sg_only

Lines changed: 3 additions & 1 deletion
@@ -19,7 +19,9 @@ RUN cd /app && git config user.email "agent@example.com" && \
 git add -A && git commit -m "sg_only truncation" --allow-empty --quiet
 
 # Clone manifest for verifier (clone-at-verify strategy)
-RUN echo '{"workdir":"/app","repos":[{"mirror":"MIRROR_NOT_FOUND","target_dir":"."}]}' > /tmp/.sg_only_clone_manifest.json
+RUN echo '{"workdir":"/app","repos":[{"mirror":"sg-evals/element-web--8ebdcab7","target_dir":"."}]}' > /tmp/.sg_only_clone_manifest.json
+
+ENV SOURCEGRAPH_REPOS="sg-evals/element-web--8ebdcab7"
 
 # Mark sg_only mode
 RUN touch /tmp/.sg_only_mode && echo '/app' > /tmp/.sg_only_workdir
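
Both problems named in the commit message (a placeholder mirror in the clone manifest and a missing `SOURCEGRAPH_REPOS` variable) can be detected statically. The following is a minimal sketch of such a check, not part of this commit: the function name `check_sg_only_dockerfile` is hypothetical, and treating `SOURCEGRAPH_REPOS` as a comma-separated list is an assumption, since the diff only ever shows a single value.

```python
import json
import re
from typing import List

# Matches the manifest-writing RUN line in the form shown in Dockerfile.sg_only.
MANIFEST_RE = re.compile(r"RUN echo '(\{.*\})' > /tmp/\.sg_only_clone_manifest\.json")
ENV_RE = re.compile(r'^ENV SOURCEGRAPH_REPOS="([^"]*)"', re.MULTILINE)
PLACEHOLDER = "MIRROR_NOT_FOUND"


def check_sg_only_dockerfile(text: str) -> List[str]:
    """Return a list of problems found in the text of a Dockerfile.sg_only."""
    match = MANIFEST_RE.search(text)
    if match is None:
        return ["no clone manifest line found"]

    manifest = json.loads(match.group(1))
    mirrors = [repo["mirror"] for repo in manifest["repos"]]

    problems: List[str] = []
    if PLACEHOLDER in mirrors:
        problems.append("clone manifest contains MIRROR_NOT_FOUND")

    env = ENV_RE.search(text)
    if env is None:
        problems.append("SOURCEGRAPH_REPOS is not set")
    # Assumption: multiple repos would be comma-separated.
    elif set(env.group(1).split(",")) != set(mirrors):
        problems.append("SOURCEGRAPH_REPOS does not match the manifest mirrors")
    return problems
```

Run against the pre-commit form of the file, this would report both problems; against the post-commit form it would report none.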

benchmarks/csb_sdlc_fix/element-web-roomheaderbuttons-can-crash-fix-001/instruction_mcp.md

Lines changed: 67 additions & 6 deletions
@@ -1,15 +1,76 @@
 # IMPORTANT: Source Code Access
 
-The source code files in this repository have been truncated or are not fully present
-in the local environment. You MUST use Sourcegraph MCP tools to search, read, and navigate
-the codebase to find the information you need to complete this task.
+**Local source files are not present.** Your workspace does not contain source code. You **MUST** use Sourcegraph MCP tools to discover, read, and understand code before making any changes.
 
-Available Sourcegraph MCP tools: search (keyword), search_nls (semantic search),
-read_file, list_directory, get_context, find_references, get_hover_info.
+**Target Repositories (version-pinned mirrors):**
+
+- `github.com/sg-evals/element-web--8ebdcab7` — use `repo:^github.com/sg-evals/element-web--8ebdcab7$` filter
+
+Scope ALL keyword_search/nls_search queries to these repos.
+Use the repo name as the `repo` parameter for read_file/go_to_definition/find_references.
+
+
+## Required Workflow
+
+1. **Search first** — Use MCP tools to find relevant files and understand existing patterns
+2. **Read remotely** — Use `sg_read_file` to read full file contents from Sourcegraph
+3. **Edit locally** — Use Edit, Write, and Bash to create or modify files in your working directory
+4. **Verify locally** — Run tests with Bash to check your changes
+
+## Tool Selection
+
+| Goal | Tool |
+|------|------|
+| Exact symbol/string | `sg_keyword_search` |
+| Concepts/semantic search | `sg_nls_search` |
+| Trace usage/callers | `sg_find_references` |
+| See implementation | `sg_go_to_definition` |
+| Read full file | `sg_read_file` |
+| Browse structure | `sg_list_files` |
+| Find repos | `sg_list_repos` |
+| Search commits | `sg_commit_search` |
+| Track changes | `sg_diff_search` |
+| Compare versions | `sg_compare_revisions` |
+
+**Decision logic:**
+1. Know the exact symbol? → `sg_keyword_search`
+2. Know the concept, not the name? → `sg_nls_search`
+3. Need definition of a symbol? → `sg_go_to_definition`
+4. Need all callers/references? → `sg_find_references`
+5. Need full file content? → `sg_read_file`
+
+## Scoping (Always Do This)
+
+```
+repo:^github.com/ORG/REPO$ # Exact repo (preferred)
+repo:github.com/ORG/ # All repos in org
+file:.*\.ts$ # TypeScript only
+file:src/api/ # Specific directory
+```
+
+Start narrow. Expand only if results are empty.
+
+## Efficiency Rules
+
+- Chain searches logically: search → read → references → definition
+- Don't re-search for the same pattern; use results from prior calls
+- Prefer `sg_keyword_search` over `sg_nls_search` when you have exact terms
+- Read 2-3 related files before synthesising, rather than one at a time
+- Don't read 20+ remote files without writing code — once you understand the pattern, start implementing
+
+## If Stuck
+
+If MCP search returns no results:
+1. Broaden the search query (synonyms, partial identifiers)
+2. Try `sg_nls_search` for semantic matching
+3. Use `sg_list_files` to browse the directory structure
+4. Use `sg_list_repos` to verify the repository name
+
+---
 
 # Bug Fix Task
 
-**Repository:** element-hq/element-web
+**Repository:** github.com/sg-evals/element-web--8ebdcab7 (mirror of element-hq/element-web)
 
 ## Problem Description
 
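
The regenerated instructions pair each pinned mirror with an exact-match `repo:` filter. As an illustration only, the mapping from a mirror name to that filter could be sketched as below. The helper names `repo_filter` and `scoped_query` are hypothetical and do not come from this repository; the filter string deliberately reproduces the form shown in the instructions, including the unescaped dot in `github.com`.

```python
def repo_filter(mirror: str, host: str = "github.com") -> str:
    """Build the exact-repo filter in the form used by instruction_mcp.md."""
    return f"repo:^{host}/{mirror}$"


def scoped_query(query: str, mirror: str) -> str:
    """Prefix a search query with the exact-repo filter for one mirror."""
    return f"{repo_filter(mirror)} {query}"
```

For example, `scoped_query("RoomHeaderButtons", "sg-evals/element-web--8ebdcab7")` yields a query that starts with the same filter the instructions list for this task.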

benchmarks/csb_sdlc_fix/element-web-unread-indicators-diverge-fix-001/environment/Dockerfile.sg_only

Lines changed: 3 additions & 1 deletion
@@ -19,7 +19,9 @@ RUN cd /app && git config user.email "agent@example.com" && \
 git add -A && git commit -m "sg_only truncation" --allow-empty --quiet
 
 # Clone manifest for verifier (clone-at-verify strategy)
-RUN echo '{"workdir":"/app","repos":[{"mirror":"MIRROR_NOT_FOUND","target_dir":"."}]}' > /tmp/.sg_only_clone_manifest.json
+RUN echo '{"workdir":"/app","repos":[{"mirror":"sg-evals/element-web--526645c7","target_dir":"."}]}' > /tmp/.sg_only_clone_manifest.json
+
+ENV SOURCEGRAPH_REPOS="sg-evals/element-web--526645c7"
 
 # Mark sg_only mode
 RUN touch /tmp/.sg_only_mode && echo '/app' > /tmp/.sg_only_workdir

benchmarks/csb_sdlc_fix/element-web-unread-indicators-diverge-fix-001/instruction_mcp.md

Lines changed: 67 additions & 6 deletions
@@ -1,15 +1,76 @@
 # IMPORTANT: Source Code Access
 
-The source code files in this repository have been truncated or are not fully present
-in the local environment. You MUST use Sourcegraph MCP tools to search, read, and navigate
-the codebase to find the information you need to complete this task.
+**Local source files are not present.** Your workspace does not contain source code. You **MUST** use Sourcegraph MCP tools to discover, read, and understand code before making any changes.
 
-Available Sourcegraph MCP tools: search (keyword), search_nls (semantic search),
-read_file, list_directory, get_context, find_references, get_hover_info.
+**Target Repositories (version-pinned mirrors):**
+
+- `github.com/sg-evals/element-web--526645c7` — use `repo:^github.com/sg-evals/element-web--526645c7$` filter
+
+Scope ALL keyword_search/nls_search queries to these repos.
+Use the repo name as the `repo` parameter for read_file/go_to_definition/find_references.
+
+
+## Required Workflow
+
+1. **Search first** — Use MCP tools to find relevant files and understand existing patterns
+2. **Read remotely** — Use `sg_read_file` to read full file contents from Sourcegraph
+3. **Edit locally** — Use Edit, Write, and Bash to create or modify files in your working directory
+4. **Verify locally** — Run tests with Bash to check your changes
+
+## Tool Selection
+
+| Goal | Tool |
+|------|------|
+| Exact symbol/string | `sg_keyword_search` |
+| Concepts/semantic search | `sg_nls_search` |
+| Trace usage/callers | `sg_find_references` |
+| See implementation | `sg_go_to_definition` |
+| Read full file | `sg_read_file` |
+| Browse structure | `sg_list_files` |
+| Find repos | `sg_list_repos` |
+| Search commits | `sg_commit_search` |
+| Track changes | `sg_diff_search` |
+| Compare versions | `sg_compare_revisions` |
+
+**Decision logic:**
+1. Know the exact symbol? → `sg_keyword_search`
+2. Know the concept, not the name? → `sg_nls_search`
+3. Need definition of a symbol? → `sg_go_to_definition`
+4. Need all callers/references? → `sg_find_references`
+5. Need full file content? → `sg_read_file`
+
+## Scoping (Always Do This)
+
+```
+repo:^github.com/ORG/REPO$ # Exact repo (preferred)
+repo:github.com/ORG/ # All repos in org
+file:.*\.ts$ # TypeScript only
+file:src/api/ # Specific directory
+```
+
+Start narrow. Expand only if results are empty.
+
+## Efficiency Rules
+
+- Chain searches logically: search → read → references → definition
+- Don't re-search for the same pattern; use results from prior calls
+- Prefer `sg_keyword_search` over `sg_nls_search` when you have exact terms
+- Read 2-3 related files before synthesising, rather than one at a time
+- Don't read 20+ remote files without writing code — once you understand the pattern, start implementing
+
+## If Stuck
+
+If MCP search returns no results:
+1. Broaden the search query (synonyms, partial identifiers)
+2. Try `sg_nls_search` for semantic matching
+3. Use `sg_list_files` to browse the directory structure
+4. Use `sg_list_repos` to verify the repository name
+
+---
 
 # Bug Fix Task
 
-**Repository:** element-hq/element-web
+**Repository:** github.com/sg-evals/element-web--526645c7 (mirror of element-hq/element-web)
 
 ## Problem Description
 

Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
+# IMPORTANT: Source Code Access
+
+**Local source files are not present.** Your workspace does not contain source code. You **MUST** use Sourcegraph MCP tools to discover, read, and understand code before making any changes.
+
+**Target Repositories (version-pinned mirrors):**
+
+- `github.com/sg-evals/flink--0cc95fcc` — use `repo:^github.com/sg-evals/flink--0cc95fcc$` filter
+
+Scope ALL keyword_search/nls_search queries to these repos.
+Use the repo name as the `repo` parameter for read_file/go_to_definition/find_references.
+
+
+## Required Workflow
+
+1. **Search first** — Use MCP tools to find relevant files and understand existing patterns
+2. **Read remotely** — Use `sg_read_file` to read full file contents from Sourcegraph
+3. **Edit locally** — Use Edit, Write, and Bash to create or modify files in your working directory
+4. **Verify locally** — Run tests with Bash to check your changes
+
+## Tool Selection
+
+| Goal | Tool |
+|------|------|
+| Exact symbol/string | `sg_keyword_search` |
+| Concepts/semantic search | `sg_nls_search` |
+| Trace usage/callers | `sg_find_references` |
+| See implementation | `sg_go_to_definition` |
+| Read full file | `sg_read_file` |
+| Browse structure | `sg_list_files` |
+| Find repos | `sg_list_repos` |
+| Search commits | `sg_commit_search` |
+| Track changes | `sg_diff_search` |
+| Compare versions | `sg_compare_revisions` |
+
+**Decision logic:**
+1. Know the exact symbol? → `sg_keyword_search`
+2. Know the concept, not the name? → `sg_nls_search`
+3. Need definition of a symbol? → `sg_go_to_definition`
+4. Need all callers/references? → `sg_find_references`
+5. Need full file content? → `sg_read_file`
+
+## Scoping (Always Do This)
+
+```
+repo:^github.com/ORG/REPO$ # Exact repo (preferred)
+repo:github.com/ORG/ # All repos in org
+file:.*\.ts$ # TypeScript only
+file:src/api/ # Specific directory
+```
+
+Start narrow. Expand only if results are empty.
+
+## Efficiency Rules
+
+- Chain searches logically: search → read → references → definition
+- Don't re-search for the same pattern; use results from prior calls
+- Prefer `sg_keyword_search` over `sg_nls_search` when you have exact terms
+- Read 2-3 related files before synthesising, rather than one at a time
+- Don't read 20+ remote files without writing code — once you understand the pattern, start implementing
+
+## If Stuck
+
+If MCP search returns no results:
+1. Broaden the search query (synonyms, partial identifiers)
+2. Try `sg_nls_search` for semantic matching
+3. Use `sg_list_files` to browse the directory structure
+4. Use `sg_list_repos` to verify the repository name
+
+---
+
+**Sourcegraph Repositories:** `github.com/sg-evals/flink--0cc95fcc`
+
+# Task: Fix Late Data Side Output Handling for Merging Windows in Flink
+
+## Objective
+Fix a bug in Apache Flink's windowing operator where late-arriving data elements that should be emitted to a side output (via `OutputTag`) are silently dropped when using session windows (merging windows). The bug occurs because the merging window operator's `processElement()` method checks lateness BEFORE merging, so an element that arrives after the watermark but could extend a session window is incorrectly classified as late and then the side output emission is skipped because the output tag is null at that point in the code path.
+
+## Bug Description
+When using event-time session windows with `allowedLateness(Time.seconds(0))` and a `sideOutputLateData(lateTag)` configuration:
+1. An element arrives with a timestamp past the current watermark
+2. The `WindowOperator.processElement()` method calls `isElementLate()` which returns true
+3. For merging windows, this element could have been merged into an existing window that hasn't been cleaned up yet
+4. The late element handling code path attempts `sideOutput()` but the output collector may not have the tag registered, causing the element to be silently dropped
+
+## Requirements
+
+1. **Locate the root cause** in the window operator:
+- Find `WindowOperator.processElement()` or `EvictingWindowOperator.processElement()`
+- Identify where `isElementLate()` is called relative to the merge operation
+- Trace how `sideOutputLateData` interacts with the `OutputTag` in the late path
+
+2. **Fix the late element handling**:
+- Ensure late elements for merging windows check whether the element could be merged into an existing (non-expired) window BEFORE marking as late
+- OR ensure the side output path correctly emits the element via the registered OutputTag
+- The fix should be in `flink-runtime/src/main/java/org/apache/flink/streaming/runtime/operators/windowing/`
+
+3. **Ensure correct OutputTag wiring**:
+- Verify that `sideOutput()` receives the correct `OutputTag<T>` for late data
+- Check `AbstractStreamOperator` or `Output` interface for side output dispatch
+
+4. **Write a test case**:
+- Create or extend a test in `flink-streaming-java/src/test/java/.../windowing/`
+- Test: session window with `sideOutputLateData(tag)` + element arriving after watermark
+- Verify: late element appears in the side output stream, not silently dropped
+
+## Key Reference Files
+- `flink-runtime/src/main/java/org/apache/flink/streaming/runtime/operators/windowing/WindowOperator.java` — main window operator
+- `flink-runtime/src/main/java/org/apache/flink/streaming/runtime/operators/windowing/EvictingWindowOperator.java` — evicting variant
+- `flink-runtime/src/main/java/org/apache/flink/streaming/api/windowing/assigners/MergingWindowAssigner.java` — merging window base
+- `flink-streaming-java/src/main/java/org/apache/flink/streaming/api/datastream/SingleOutputStreamOperator.java` — sideOutputLateData API
+- `flink-runtime/src/main/java/org/apache/flink/streaming/api/operators/AbstractStreamOperator.java` — base operator with output handling
+- `flink-streaming-java/src/test/java/org/apache/flink/streaming/runtime/operators/windowing/WindowOperatorTest.java` — existing tests
+
+## Success Criteria
+- Root cause identified: late element check order relative to window merge
+- Fix applied to window operator's processElement or late-element path
+- OutputTag correctly wired for side output emission
+- Side output late data emission works for merging windows
+- Test case added covering the late data + session window scenario
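
The task text above turns on the order of two checks: whether the element itself is behind the watermark, and whether the session window it would merge into has already expired. The toy model below is a sketch of that distinction only. It is plain Python, not Flink code, every name in it is hypothetical, and it assumes existing sessions are pairwise disjoint and that touching windows merge.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Window:
    start: int  # inclusive
    end: int    # exclusive

    def max_timestamp(self) -> int:
        return self.end - 1


def element_is_late(ts: int, watermark: int, allowed_lateness: int = 0) -> bool:
    """Element-level check: looks only at the element's own timestamp."""
    return ts + allowed_lateness <= watermark


def merge_session(ts: int, gap: int, sessions: Iterable[Window]) -> Window:
    """Merge the element's proto-window [ts, ts + gap) with overlapping sessions."""
    merged = Window(ts, ts + gap)
    for s in sessions:  # assumption: existing sessions are pairwise disjoint
        if s.start <= merged.end and merged.start <= s.end:
            merged = Window(min(s.start, merged.start), max(s.end, merged.end))
    return merged


def window_is_late(w: Window, watermark: int, allowed_lateness: int = 0) -> bool:
    """Window-level check: looks at the merged window's last timestamp."""
    return w.max_timestamp() + allowed_lateness <= watermark
```

With one open session `Window(10, 25)`, a gap of 10, and a watermark of 20, an element at timestamp 18 is late by the element-level check, but the merged window `Window(10, 28)` is not late by the window-level check. An element at timestamp 5 with a gap of 3 merges with nothing and is late under both checks.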

benchmarks/csb_sdlc_fix/teleport-users-can-delete-fix-001/environment/Dockerfile.sg_only

Lines changed: 3 additions & 1 deletion
@@ -19,7 +19,9 @@ RUN cd /app && git config user.email "agent@example.com" && \
 git add -A && git commit -m "sg_only truncation" --allow-empty --quiet
 
 # Clone manifest for verifier (clone-at-verify strategy)
-RUN echo '{"workdir":"/app","repos":[{"mirror":"MIRROR_NOT_FOUND","target_dir":"."}]}' > /tmp/.sg_only_clone_manifest.json
+RUN echo '{"workdir":"/app","repos":[{"mirror":"sg-evals/teleport--4b11dc4a","target_dir":"."}]}' > /tmp/.sg_only_clone_manifest.json
+
+ENV SOURCEGRAPH_REPOS="sg-evals/teleport--4b11dc4a"
 
 # Mark sg_only mode
 RUN touch /tmp/.sg_only_mode && echo '/app' > /tmp/.sg_only_workdir
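
The Dockerfile comment describes a clone-at-verify strategy, but the verifier itself is not part of this diff. As a rough sketch of how a consumer might turn the manifest into clone targets: the function `clone_plan`, the `https://github.com/<mirror>.git` URL form, and the choice to fail fast on the placeholder are all assumptions made for illustration, not the verifier's actual behavior.

```python
import json
from pathlib import PurePosixPath
from typing import List, Tuple


def clone_plan(manifest_json: str, host: str = "https://github.com") -> List[Tuple[str, str]]:
    """Return (clone URL, target directory) pairs for a clone manifest.

    Nothing is cloned here; the function only computes what would be cloned.
    """
    manifest = json.loads(manifest_json)
    workdir = PurePosixPath(manifest["workdir"])
    plan = []
    for repo in manifest["repos"]:
        if repo["mirror"] == "MIRROR_NOT_FOUND":
            # Assumption: an unresolved placeholder should stop verification early.
            raise ValueError("clone manifest still contains MIRROR_NOT_FOUND")
        target = workdir / repo["target_dir"]  # "." collapses to the workdir
        plan.append((f"{host}/{repo['mirror']}.git", str(target)))
    return plan
```

For the teleport manifest above, this returns a single pair whose target directory is `/app`.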
