This tutorial walks you through how to automatically curate new issue-resolving tasks from real GitHub issues.
Dependencies: python, git, docker
docker ps # make sure Docker is running and you have privileges
git clone --recursive https://github.com/microsoft/SWE-bench-Live
pip install -e .
pip install -e launch/.
To set up RepoLaunch on Windows, see the special tips in Development-Windows.md.
This step crawls the initial list of source repositories, from which we find issues. Prepare GitHub tokens in advance to unlock a higher API rate limit.
- Crawl raw repositories within a given star range, supporting multiple tokens for higher rate limits
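The multi-token support boils down to rotating to the next token whenever one hits the GitHub rate limit. A minimal sketch of that idea (the class and its methods are hypothetical, not crawl_repo.py's actual internals):

```python
import itertools

class TokenRotator:
    """Round-robin over GitHub tokens, switching on rate-limit errors."""

    def __init__(self, tokens):
        self._cycle = itertools.cycle(tokens)
        self.current = next(self._cycle)

    def headers(self):
        # GitHub accepts a personal access token via the Authorization header
        return {"Authorization": f"token {self.current}"}

    def rotate(self):
        # Call this when a request returns HTTP 403 (rate limited)
        self.current = next(self._cycle)
        return self.current
```

With N tokens the effective rate limit is roughly N times the per-token limit, since each token has its own quota.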
cd curation
mkdir -p output
# max_stars is optional
python crawl_repo.py \
--language Python \
--min_stars 10000 \
--max_stars 100000 \
--tokens_file tokens.txt \
--output_file output/raw_repos.jsonl
- Filter the crawled raw repositories based on predefined quality-control criteria.
# More than 200 pulls and issues
# More than 200 forks
# More than 60% of the code should be in the main language
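The three criteria above amount to a simple predicate over repository metadata. A sketch, assuming hypothetical field names (filter_repo.py may use different keys):

```python
def passes_quality_filter(repo: dict, language: str = "Python") -> bool:
    """Apply the quality-control criteria: >200 pulls+issues,
    >200 forks, and the main language covering >60% of the code.

    Field names here are assumptions for illustration.
    """
    lang_bytes = repo.get("languages", {})       # {language: bytes of code}
    total = sum(lang_bytes.values()) or 1        # avoid division by zero
    main_ratio = lang_bytes.get(language, 0) / total
    return (
        repo.get("pulls_and_issues", 0) > 200
        and repo.get("forks", 0) > 200
        and main_ratio > 0.60
    )
```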
python filter_repo.py \
--input_file output/raw_repos.jsonl \
--output_file output/filtered_repos.jsonl \
--tokens_file tokens.txt \
--language Python \
--max_workers 20
This step crawls Issue-PR pairs created after the cut-off date from the given repositories and converts them into SWE-bench-style task instances.
mkdir -p job_status
./swe_task_crawling/run_get_tasks_pipeline.sh \
--repos-jsonl output/filtered_repos.jsonl \
--token-file tokens.txt \
--cutoff-date 20250501 \
--path-prs output/prs \
--path-tasks output/tasks \
--output-dir output/split_jobs
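The cut-off date check at the core of this step is a simple timestamp comparison against the `--cutoff-date` argument. A hypothetical sketch (field names are assumptions; the pipeline's actual logic lives in the crawling scripts):

```python
from datetime import datetime, timezone

# Parse the --cutoff-date argument (YYYYMMDD) into an aware datetime
CUTOFF = datetime.strptime("20250501", "%Y%m%d").replace(tzinfo=timezone.utc)

def after_cutoff(pr: dict) -> bool:
    """Keep only Issue-PR pairs created after the cut-off date.

    Assumes GitHub's ISO-8601 created_at timestamps ("...Z" suffix).
    """
    created = datetime.fromisoformat(pr["created_at"].replace("Z", "+00:00"))
    return created > CUTOFF
```

Crawling only issues created after a fixed cut-off is what keeps the benchmark "live": tasks postdate the training cut-offs of the models under evaluation.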
python swe_task_crawling/merge_tasks.py \
--input_folder output/tasks \
--input_repos output/filtered_repos.jsonl \
--output output/raw_tasks.jsonl
This step follows the idea of SWE-bench Verified to filter out instances with:
- Vague problem statements;
- Test patches that check requirements not stated in the problem statement;
- Answers leaked in the problem statement.
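Conceptually, the LLM filter asks a model to judge each instance against the three criteria and keeps only the clean ones. A hypothetical sketch of the prompt assembly and verdict parsing (llm_filter.verify's real prompt and output format will differ):

```python
FILTER_CRITERIA = [
    "The problem statement is vague or underspecified.",
    "The test patch checks requirements not stated in the problem statement.",
    "The problem statement reveals the answer (the fix itself).",
]

def build_verify_prompt(problem_statement: str, test_patch: str) -> str:
    """Assemble a verification prompt covering the three criteria above."""
    criteria = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(FILTER_CRITERIA))
    return (
        "Decide whether this task instance should be kept.\n"
        f"Reject it if any of the following hold:\n{criteria}\n\n"
        f"Problem statement:\n{problem_statement}\n\n"
        f"Test patch:\n{test_patch}\n\n"
        "Answer with a single word: KEEP or REJECT."
    )

def parse_verdict(llm_output: str) -> bool:
    """Return True when the model's answer says the instance should be kept."""
    return "KEEP" in llm_output.upper().split()
```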
Prepare your LLM API key.
export OPENAI_API_KEY=...
python -m llm_filter.verify \
--input_dir output/raw_tasks.jsonl \
--output_dir output/verified_tasks.jsonl \
--llm_provider AOAI \
--model_name gpt-5-20250807
python -m llm_filter.split_os \
--input_file output/raw_tasks.jsonl \
--windows_file output/windows_tasks.jsonl \
--general_file output/general_tasks.jsonl \
--llm_provider AOAI \
--model_name gpt-5-20250807
Next, we use RepoLaunch to try to create an execution environment for each task instance so that tests can be run.
Create a run config for RepoLaunch and save it as launch/data/your_experiment/config.json. The example config.json in launch/data/examples looks like this:
{
"mode": {
"setup": true,
"organize": true
},
"llm_provider_name": "OpenAI",
"model_config": {
"model_name": "gpt-4.1-20250414"
},
"workspace_root": "data/examples/",
"dataset": "data/examples/dataset.jsonl",
"print_to_console": false,
"first_N_repos": -1,
"overwrite": false,
"max_workers": 5,
"os": "linux",
"max_trials": 2,
"max_steps_setup": 60,
"max_steps_verify": 20,
"max_steps_organize": 40,
"cmd_timeout": 60,
"image_prefix": "repolaunch/dev"
}
Prepare your LLM API key.
export OPENAI_API_KEY=...
export TAVILY_API_KEY=...
Fire your RepoLaunch run!
cd ../launch
# run in a tmux session (recommended); this takes a long time
python -m launch.run --config-path data/your_experiment/config.json
Note: Some instances require many file descriptors. If you see a "too many open files" error, try:
ulimit -a          # check the current limits
ulimit -n 32768    # raise the open-file limit
Note: We observe that as an execution run grows very long, Docker operations (docker run, docker commit, docker rm) become slower and slower and may even return None. In this case:
- stop the running launch
- restart Docker
- run docker container prune
- start the launch run again
In this step we apply gold patches to the instances, run the test cases, and obtain the FAIL_TO_PASS and PASS_TO_PASS test sets for each instance.
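The status labels boil down to comparing each test's outcome before and after the gold patch is applied. A minimal sketch of that comparison (the real validation.py also reruns the suite and handles many more statuses):

```python
def label_tests(before: dict, after: dict) -> dict:
    """Split tests into FAIL_TO_PASS / PASS_TO_PASS.

    before: test name -> "PASS"/"FAIL" with only the test patch applied.
    after:  same mapping after the gold patch is also applied.
    """
    fail_to_pass = [t for t, s in after.items()
                    if s == "PASS" and before.get(t) == "FAIL"]
    pass_to_pass = [t for t, s in after.items()
                    if s == "PASS" and before.get(t) == "PASS"]
    return {"FAIL_TO_PASS": sorted(fail_to_pass),
            "PASS_TO_PASS": sorted(pass_to_pass)}
```

FAIL_TO_PASS tests are the ones that actually verify the fix; PASS_TO_PASS tests guard against regressions.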
# cd in repo root
cd ../
# On Windows, if there are decoding issues: $env:PYTHONUTF8="1" ; $env:PYTHONIOENCODING="utf-8"
# Get Fail to Pass
# apply test_patch -> build -> apply gold_patch -> build
# tests are run 3 times automatically in validation.py to filter flaky instances
# --platform can be linux or windows; set --overwrite 1 to redo existing results
python -m evaluation.validation \
--input_dir launch/data/examples/organize.jsonl \
--output_dir logs/val \
--platform linux \
--workers 4 \
--overwrite 0
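The flaky-instance filtering mentioned above (running the tests 3 times) amounts to keeping only tests that pass in every run; a sketch of that idea:

```python
def stable_passes(runs: list) -> set:
    """Given the set of passing tests from each run, keep only tests
    that passed every time; anything intermittent is treated as flaky."""
    result = set(runs[0])
    for passed in runs[1:]:
        result &= passed
    return result
```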
# filter out instances that fail under: apply test_patch -> apply gold_patch -> build
# --platform can be linux or windows; set --overwrite 1 to redo existing results
python -m evaluation.evaluation \
--dataset logs/val/validated_instances.jsonl \
--output_dir logs/eval \
--patch_dir gold \
--platform linux \
--workers 4 \
--overwrite 0
The result is saved to logs/eval/gold_patch_evaluated_instances.jsonl.
To upload the dataset to Hugging Face:
cd curation
hf auth login
python push_dataset/push_multilang.py
To upload Docker images to Docker Hub:
cd launch
docker login
python -m launch.scripts.upload_docker --dataset ... --clear_after_push 0