Commit 525ee60

fix markdown format
1 parent 57b63ab commit 525ee60

14 files changed: 49 additions, 1 deletion

ajet/context_tracker/timeline_merging/README.md
Lines changed: 3 additions & 0 deletions

@@ -3,6 +3,7 @@
 In complex multi-agent LLM interactions, we define a Timeline as the token trajectory generated by repeatedly invoking an LLM during a task execution process.
 
 A Timeline contains the following elements:
+
 - Text message list
 - Note: In most Qwen models, messages start with `<|im_start|>` and end with `<|im_end|>`, depending on the model's tokenizer and chat_template
 - Token sequence message list

@@ -48,6 +49,7 @@ T_n\left(M_\text{n}, m_\text{n}, a_\text{n}\right)
 \rbrace$
 
 Where:
+
 - $T_i$ represents the $i$-th (unmerged) timeline. $T_i = [T_{i}^{[1]}, T_{i}^{[2]}, \dots, T_{i}^{[|T_{i}|]}]$.
 - The last item $T_{i}^{[|T_{i}|]} = m_\text{i}$: always the output of this LLM request.
 - The first $|T_{i}|-1$ items: always the input $M_\text{i}$ of this LLM request.

@@ -90,6 +92,7 @@ Note: Loss Mask is calculated in detail during post-processing based on the $\te
 In practice, we found that when a token sequence is decoded into text and then re-encoded back into a token sequence by the tokenizer, it sometimes cannot be precisely converted back to the original token sequence.
 
 Therefore, the following situation often occurs in reality:
+
 - $\text{Author}(T_{i}^{[k]}) = \text{llm}$
 - $\text{Author}(T_{j}^{[k]}) \neq \text{llm}$
 - $\text{Text}(T_{j}^{[k]}) = \text{Text}(T_{i}^{[k]})$
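The decode/re-encode mismatch described in this hunk can be illustrated with a toy tokenizer (the vocabulary below is purely hypothetical; real BPE tokenizers such as Qwen's exhibit the same effect through merge ambiguity): two different token sequences can decode to the same text, so re-encoding cannot always recover the original sequence.

```python
# Toy vocabulary: "ab" can be produced either by the single token 2
# or by the pair (0, 1). This mimics BPE merge ambiguity.
VOCAB = {0: "a", 1: "b", 2: "ab"}

def detokenize(tokens):
    """Decode a token sequence into text."""
    return "".join(VOCAB[t] for t in tokens)

def tokenize(text):
    """Greedy longest-match encoding, as BPE-style tokenizers do."""
    tokens, i = [], 0
    pieces = sorted(VOCAB.items(), key=lambda kv: -len(kv[1]))
    while i < len(text):
        for tok_id, piece in pieces:
            if text.startswith(piece, i):
                tokens.append(tok_id)
                i += len(piece)
                break
    return tokens

original = [0, 1]                      # decodes to "ab"
round_trip = tokenize(detokenize(original))
print(original, "->", round_trip)      # [0, 1] -> [2]: same text, different tokens
```

This is exactly the situation above: $\text{Text}$ matches across timelines while the token-level identity (and hence the $\text{Author}$ attribution) does not.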

ajet/default_config/trinity/README.md
Lines changed: 2 additions & 0 deletions

@@ -24,6 +24,7 @@ You can configure mappings via the `ajet/default_config/trinity/config_auto_conv
 ## Trinity Hyperparameter Quick Guide 📊
 
 Trinity adopts a typical producer (explorer)-consumer (trainer) architecture:
+
 - 🏭 **Producer**: Uses VLLM to generate samples
 - 🧠 **Consumer**: Consumes samples to update the model
 Both operate on different runtime schedules.

@@ -59,5 +60,6 @@ meanwhile
 ### Training Memory Control 💾
 
 Same as VERL, control training memory with the following parameters:
+
 - `trainer.max_token_len_per_gpu`
 - `ulysses_sequence_parallel_size`

docs/en/context_timeline.md
Lines changed: 3 additions & 0 deletions

@@ -3,6 +3,7 @@
 In complex multi-agent LLM interactions, we define a Timeline as the token trajectory generated by repeatedly invoking an LLM during a task execution process.
 
 A Timeline contains the following elements:
+
 - Text message list
 - Note: In most Qwen models, messages start with `<|im_start|>` and end with `<|im_end|>`, depending on the model's tokenizer and chat_template
 - Token sequence message list

@@ -48,6 +49,7 @@ T_n\left(M_\text{n}, m_\text{n}, a_\text{n}\right)
 \rbrace$
 
 Where:
+
 - $T_i$ represents the $i$-th (unmerged) timeline. $T_i = [T_{i}^{[1]}, T_{i}^{[2]}, \dots, T_{i}^{[|T_{i}|]}]$.
 - The last item $T_{i}^{[|T_{i}|]} = m_\text{i}$: always the output of this LLM request.
 - The first $|T_{i}|-1$ items: always the input $M_\text{i}$ of this LLM request.

@@ -90,6 +92,7 @@ Note: Loss Mask is calculated in detail during post-processing based on the $\te
 In practice, we found that when a token sequence is decoded into text and then re-encoded back into a token sequence by the tokenizer, it sometimes cannot be precisely converted back to the original token sequence.
 
 Therefore, the following situation often occurs in reality:
+
 - $\text{Author}(T_{i}^{[k]}) = \text{llm}$
 - $\text{Author}(T_{j}^{[k]}) \neq \text{llm}$
 - $\text{Text}(T_{j}^{[k]}) = \text{Text}(T_{i}^{[k]})$

docs/en/example_werewolves.md
Lines changed: 2 additions & 0 deletions

@@ -137,6 +137,7 @@ If you need a more fine-grained evaluation (e.g., giving partial credit for key
 > **Visualization:** Training curves are generated by SwanLab. See [Visualization Tools](./visualization.md) for setup and usage.
 
 As training progresses, win rate increases. This usually means the agent becomes more stable on **two things**:
+
 - **Role-playing consistency**: the agent learns to maintain its werewolf cover under pressure, avoiding self-exposure even when voted out.
 - **Social deception skills**: it develops strategies to mislead opponents, sow suspicion among villagers, and implicitly coordinate with teammates.

@@ -153,6 +154,7 @@ Significant role-playing improvement is observed during the experiment.
 > **Token-level Visualization:** These detailed logs are generated by Beast-Logger. See [Beast-Logger Usage](./beast_logger.md) for more details.
 
 2. The agent develops multiple strategies for winning. For example:
+
 - **Misleading opponents**: "Let's keep an eye on the seer and the witch. They could be werewolves trying to hide".
 - **Appealing to reason**: "We need to be wary of fake seers and watch for inconsistencies in stories, Player-Y as hunter should act carefully".

docs/en/swarm.md
Lines changed: 3 additions & 0 deletions

@@ -13,6 +13,7 @@ However, the AgentJet Swarm mode has pioneered a brand-new training approach. Co
 you can freely launch multiple "mother ships" (corresponding to multiple LLM models to be trained) on one or more servers.
 Then, from an "airport" (e.g., your workstation, server, or even your Mac), you can "take off" any number of "Jets" to act as "worker bees" running the Agent workflow awaiting training,
 forming a many-to-many training system:
+
 - "Jets" are responsible for reading datasets, running the Agent workflow, and finally sending reward signals back to each "mother ship".
 - "Mother ships" are responsible for providing vllm/sglang API interfaces (with AgentJet’s automatic context tracking & timeline merging capabilities that significantly accelerate training), coordinating and computing samples.

@@ -48,6 +49,7 @@ Notes:
 ## (2/2) Launching Swarm Clients ("jets")
 
 You can run any amount of swarm client:
+
 - on any devices (macbook, workstation, the same machine you run swarm-server, **wherever you want**).
 - at any time (before or in the middle of a training, **whenever you want**)

@@ -82,6 +84,7 @@ swarm_worker.auto_sync_train_config_and_start_engine(yaml_job)
 ```
 
 The swarm server can be in the following states and transition between them as follows:
+
 - **OFFLINE**: The swarm server is started but does not load any models or perform any training. It enters this state directly after startup. Additionally, it transitions to this state upon receiving a `stop_engine` command from (any) client while in any other state.
 - **BOOTING**: The swarm server enters this state upon receiving a configuration followed by an explicit `begin_engine` command. In this state, it loads model parameters, initializes FSDP, and initializes vLLM.
 - **ROLLING**: The swarm server enters this state automatically after completing **BOOTING** or after finishing the **WEIGHT_SYNCING** state. This represents the sampling phase.
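The state transitions documented in this hunk can be sketched as a minimal state machine. This is only an illustration of the documented transitions, not AgentJet's actual implementation; the event names (`begin_engine` aside) are invented for the sketch, and **WEIGHT_SYNCING** is taken from the surrounding docs.

```python
# Allowed transitions of the swarm server, per the documentation.
# Event names other than begin_engine/stop_engine are hypothetical.
TRANSITIONS = {
    ("OFFLINE", "begin_engine"): "BOOTING",
    ("BOOTING", "boot_done"): "ROLLING",
    ("ROLLING", "pool_full"): "WEIGHT_SYNCING",
    ("WEIGHT_SYNCING", "sync_done"): "ROLLING",
}

def step(state: str, event: str) -> str:
    if event == "stop_engine":  # stop_engine is valid from every state
        return "OFFLINE"
    if (state, event) not in TRANSITIONS:
        raise ValueError(f"illegal transition: {state} + {event}")
    return TRANSITIONS[(state, event)]

state = "OFFLINE"
for event in ["begin_engine", "boot_done", "pool_full", "sync_done"]:
    state = step(state, event)
print(state)  # ROLLING
```

The key property the sketch captures is that `stop_engine` is an escape hatch from any state, while the BOOTING → ROLLING ⇄ WEIGHT_SYNCING loop is the normal training cycle.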

docs/en/swarm_best_practice.md
Lines changed: 2 additions & 0 deletions

@@ -49,6 +49,7 @@ swarm_worker.start_engine()
 ```
 
 Hints:
+
 - You can `yaml_job.dump_job_as_yaml('./config.yaml')` to take a look at the full configuration.
 - You can `yaml_job.build_job_from_yaml('./config.yaml')` to load yaml configuration as override. (there are some configurations that must be edited from yaml).

@@ -93,6 +94,7 @@ def rollout(task) -> float | None:
 ```
 
 One important thing to note is that before each episode begins, you need to call `begin_episode` to obtain the `base_url` and `api_key`. At the same time, you will receive an episode identifier, `episode_uuid`. The `swarm_worker` is thread-safe and does not hold the state of the `episode`, so you can safely invoke multiple `begin_episode` calls concurrently. When your agent finishes running, remember to call `end_episode` to send the reward signal back to the swarm server (with the `episode_uuid` parameter). Additionally, if you wish to discard an episode for reasons such as:
+
 - **Reward miscalculation**
 - **External API out of credit**
 - **Debugging**
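The `begin_episode` / `end_episode` / `abort_episode` contract described in this hunk can be sketched as follows. The `StubSwarmWorker` below is a stand-in for the real `swarm_worker` (method names follow the docs, but signatures, return shapes, and the stub's internals are assumptions for illustration):

```python
import uuid

class StubSwarmWorker:
    """Stand-in for the real swarm worker: only tracks open episodes."""
    def __init__(self):
        self.open_episodes = {}

    def begin_episode(self):
        # The real worker returns routing credentials for the trained model.
        episode_uuid = str(uuid.uuid4())
        self.open_episodes[episode_uuid] = True
        return "http://stub-base-url", "stub-api-key", episode_uuid

    def end_episode(self, episode_uuid, reward):
        # Sends the reward signal back to the swarm server.
        del self.open_episodes[episode_uuid]
        return reward

    def abort_episode(self, episode_uuid):
        # Discards the episode (reward miscalculation, API out of credit, debugging).
        self.open_episodes.pop(episode_uuid, None)

worker = StubSwarmWorker()
base_url, api_key, ep = worker.begin_episode()
try:
    reward = 1.0                      # run your agent here and score it
    worker.end_episode(ep, reward)    # commit the episode with its reward
except Exception:
    worker.abort_episode(ep)          # discard instead of poisoning the batch
print(len(worker.open_episodes))      # 0
```

Because the real worker is thread-safe and holds no episode state, the same begin/try/end-or-abort pattern can run in many threads concurrently, each keyed by its own `episode_uuid`.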

docs/en/swarm_deepdive.md
Lines changed: 11 additions & 0 deletions

@@ -39,6 +39,7 @@ In the following section, we will deep dive into the AgentJet Swarm.
 
 This gif displays the life cycle of a Swarm Server.
 The possible states and transitions of the swarm server are as follows:
+
 - **OFFLINE**: The swarm server starts but has not loaded any models and is not running any training. The swarm server enters this state directly after startup. Additionally, it enters this state after receiving a `stop_engine` command from any client while in any other state.
 - **BOOTING**: The swarm server enters this state after receiving configuration and then an explicit `begin_engine` command, performing model parameter loading, FSDP initialization, and vLLM initialization.
 - **ROLLING**: The swarm server sample collection state. It automatically enters this state when **BOOTING** ends or when the **WEIGHT_SYNCING** state ends.

@@ -112,6 +113,7 @@ swarm_client.start_engine()
 ```
 
 Practical tips:
+
 - **Treat YAML as the source of truth**: you can inspect it with `yaml_job.dump_job_as_yaml("./config.yaml")` and load overrides via `yaml_job.build_job_from_yaml("./config.yaml")`.
 - **Idempotency**: `auto_sync_train_config_and_start_engine()` is designed to be safe if the engine is already **ROLLING** (it will do nothing) and will wait if the engine is **BOOTING / WEIGHT_SYNCING**.
 - **Monitoring**: run `ajet-swarm overwatch --swarm-url=http://your-swarm-server:10086` (or `python -m ajet.launcher --swarm-overwatch=...`) to watch the server states and rollout pool.

@@ -155,10 +157,12 @@ def rollout(task: Task) -> float | None:
 ```
 
 Abort semantics (why it is safe for debugging):
+
 - When the server is **ENGINE.ROLLING**, `abort_episode` typically **reverts** the episode back to the unclaimed pool, so other clients can pick it up.
 - When the server is in **ENGINE.ROLLING_POST**, `abort_episode` will **delete** the episode record instead of re-queueing it, so weight syncing won’t be blocked by zombie episodes.
 
 Timeouts you should understand:
+
 - `discard_episode_timeout` (server-side): if an episode is **idle** (no LLM requests) for too long, the server can discard it.
 - Client-side protection: the client records an internal max lifetime (currently `max_episode_time = 2 × discard_episode_timeout`). If you submit too late, `end_episode` will be converted into an `abort_episode` to avoid poisoning the pool.

@@ -201,13 +205,15 @@ swarm_client.auto_sync_train_config_and_start_engine(yaml_job)
 4) Drive training by repeatedly running batches of episodes
 
 The usual batching relationship is:
+
 - remote `batch_size` is the number of tasks in one policy-gradient batch (server side)
 - local `num_repeat` (a.k.a. rollout.n / GRPO N) is the number of rollouts per task
 - so one “full” batch roughly needs `batch_size × num_repeat` completed episodes.
 
 The helper `run_episodes_until_all_complete(tasks, func=rollout, auto_retry=True)` is just a convenience thread pool; you can implement your own scheduling.
 
 Operational notes:
+
 - Use `ajet-swarm overwatch --swarm-url=...` to watch **running episodes** and whether the pool is close to triggering **WEIGHT_SYNCING**.
 - If you need to change training YAML, call `swarm_client.stop_engine()` first (server returns to **ENGINE.OFFLINE**), then sync again.

@@ -258,10 +264,12 @@ def rollout(task: Task):
 ```
 
 Key design constraint:
+
 - A “logical” rollout is only valid if you **commit/abort all involved episodes together**.
   If one model’s episode is ended but the other is aborted (or hangs), you create asynchronous noise across models.
 
 Batching rule of thumb:
+
 - Keep `num_repeat` aligned across servers.
 - It’s simplest when both servers use the same `batch_size` and you drive the outer loop by one of them (as in the best-practice example).

@@ -276,6 +284,7 @@ The one rule for the debug client is exactly what you noted:
 **do not contribute data to the training batch**.
 
 The simplest discipline is:
+
 - Debug client still calls `begin_episode()` to obtain valid routing credentials.
 - Debug client runs the agent.
 - Debug client always ends with `abort_episode(episode_uuid)` (never `end_episode`).

@@ -293,9 +302,11 @@ def debug_once(task: Task):
 ```
 
 Why this works:
+
 - `abort_episode` returns the claimed episode to the pool (or deletes it in **ROLLING_POST**), so your debugging does not change the reward statistics used for the next weight update.
 
 Practical cautions:
+
 - Keep debug parallelism low. If the debug client claims too many episodes and holds them, training clients may temporarily see “No available episodes to claim”.
 - Prefer short `discard_episode_timeout` for debugging so stuck runs get cleaned up fast.
 - Keep `ajet-swarm overwatch` open to ensure debug episodes are quickly aborted and not piling up.
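The batching relationship documented in this file reduces to simple arithmetic (the numbers below are illustrative, not AgentJet defaults):

```python
def episodes_per_batch(batch_size: int, num_repeat: int) -> int:
    """One full policy-gradient batch needs batch_size tasks,
    each rolled out num_repeat times (rollout.n / GRPO N)."""
    return batch_size * num_repeat

# e.g. a remote batch_size of 32 with 8 rollouts per task:
print(episodes_per_batch(32, 8))  # 256 completed episodes per weight update
```

This is also the number to watch in `ajet-swarm overwatch`: once roughly this many episodes complete, the server can leave **ROLLING** for **WEIGHT_SYNCING**.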

docs/en/swarm_intro_blog_en.md
Lines changed: 4 additions & 0 deletions

@@ -56,6 +56,7 @@ on the other hand it supports any number of sampling nodes. -->
 
 
 Previous Agentic RL training modes had some implicit assumptions:
+
 - First, no matter how many agents are in the task to be trained, these agents can only share the same fine-tunable LLM model (shared "brain").
   The reason for this phenomenon is that most training backends represented by VERL and TRL typically configure only one LLM model for fine-tuning.
 - Second, in the reinforcement learning sample collection stage, all current training frameworks forcibly bind the agent Rollout task process.

@@ -218,6 +219,7 @@ AgentJet has invested heavily in engineering quality to ensure that every traini
 
 **Version-by-Version Performance Tracking**:
 We maintain a public [Performance Tracking Dashboard](https://benchmark.agentjet.top/), continuously recording AgentJet's training curves and final performance on multiple standard tasks (mathematical reasoning, code generation, tool use, etc.), across major Git versions, and across different training backends (VERL, etc.). With every code update, the test bot executes benchmarks, and any performance regression is immediately detected. This means:
+
 - When upgrading AgentJet versions, you can clearly know how the new version performs on the tasks you care about.
 - If an update introduces a hidden bug causing a decline in training effectiveness, we will capture it immediately.
 - Researchers can confidently cite AgentJet's experimental results because they are reproducible.

@@ -283,6 +285,7 @@ AgentJet is fully open-sourced on GitHub. Researchers and developers in the comm
 <!--
 
 All possible states of the swarm server, and the transitions between them, are as follows:
+
 - **OFFLINE**: The swarm server starts but has not loaded any models and does not run any training. The swarm server enters this state directly after startup. Additionally, it enters this state after receiving a `stop_engine` command from (any) client while in any other state.
 - **BOOTING**: The swarm server enters this state after receiving a configuration and then an explicit `begin_engine` command, performing model parameter loading, FSDP initialization, and vLLM initialization.
 - **ROLLING**: The swarm server's sample-collection state. It enters this state automatically after **BOOTING** ends or after the **WEIGHT_SYNCING** state ends.

@@ -292,6 +295,7 @@ All possible states of the swarm server, and the transitions between them, are as follows:
 
 
 Only one thing needs attention: before each episode begins, you need to call `begin_episode` to obtain the `base_url` and `api_key`, and at the same time receive an episode identifier, `episode_uuid`. The `swarm_worker` is thread-safe and does not hold `episode` state, so you can freely issue multiple concurrent `begin_episode` calls. When your agent finishes running, remember to call `end_episode` to send the reward signal back to the swarm server (with the `episode_uuid` parameter). Additionally, if for reasons such as:
+
 - **The reward was miscalculated**
 - **An external API ran out of credit**
 - **Debugging**

docs/en/swarm_vibe_coding.md
Lines changed: 1 addition & 0 deletions

@@ -6,6 +6,7 @@ Here is an example:
 
 ```txt
 Your task:
+
 - Write an intelligent agent that learns the CountDown task (You are an agent specialized in solving countdown number puzzles. Given a target number and a list of source numbers, find a way to reach the target number using basic arithmetic operations (+, -, *, /). Each source number can only be used once.)
 - I hope to use the base model '/mnt/data_cpfs/model_cache/modelscope/hub/Qwen/Qwen/Qwen2.5-7B-Instruct'
 - Train using 8 GPUs
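The CountDown puzzle quoted in this prompt can be checked with a short brute-force solver. This is a reference sketch for the puzzle itself, not part of AgentJet; it only tries left-to-right combinations, which covers small instances but not every parenthesization of larger ones.

```python
from itertools import permutations, product

def solve_countdown(sources, target, eps=1e-9):
    """Try every ordering of the source numbers and every operator
    sequence, combining left to right; each source is used exactly once."""
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b,
           "/": lambda a, b: a / b if abs(b) > eps else None}
    for nums in permutations(sources):
        for op_seq in product(ops, repeat=len(nums) - 1):
            acc, expr = nums[0], str(nums[0])
            for op, n in zip(op_seq, nums[1:]):
                acc = ops[op](acc, n)
                if acc is None:          # division by ~zero: dead end
                    break
                expr = f"({expr} {op} {n})"
            if acc is not None and abs(acc - target) < eps:
                return expr
    return None

print(solve_countdown([3, 5, 7], 26))  # prints a valid expression, e.g. ((3 * 7) + 5)
```

A solver like this is also handy as a reward function for the trained agent: score 1.0 if the agent's expression evaluates to the target using each source once, else 0.0.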

docs/en/tune_your_first_agent.md
Lines changed: 2 additions & 1 deletion

@@ -208,7 +208,7 @@ Now, we have obtained all materials required to train the agent.
 # ------------------ do not modify ------------------
 defaults:
-  - trinity_default
+  - verl_default
   - ajet_default
   - _self_

@@ -473,6 +473,7 @@ ajet-swarm overwatch --swarm-url=http://localhost:10086
 ```
 
 The Swarm Server will:
+
 - Load the model specified by the client
 - Provide vLLM API endpoints for inference
 - Compute gradients and update model parameters
