Commit 8b37277

style: apply /style-guide pass to models/launch (#2685)
## Summary

This PR applies the `/style-guide` skill (Google Developer Style Guide + CoreWeave conventions) to documentation under `models/launch`. The run was automated; no technical content was intentionally changed.

## Files edited

- `models/launch/evaluate-hosted-model.mdx`
- `models/launch/evaluate-model-checkpoint.mdx`
- `models/launch/evaluations.mdx`

## Recommendations for technical review

**Prerequisites**

- Confirm whether `WANDB_API_KEY` being required-but-unused for Serverless Inference (`evaluate-hosted-model.mdx` line 23) is still accurate, and whether a Note callout should explain this behavior.
- Prerequisite 3 of `evaluate-hosted-model.mdx` (line 24) mentions a team-scoped secret holding the model's API key but doesn't specify a required name/format or where it's selected in the flow. Add guidance if needed.
- In `evaluate-model-checkpoint.mdx`, the prerequisite item references "OpenAPI API key" but surrounding text and the secret name use OpenAI (`OPENAI_API_KEY`). Likely a typo; confirm and correct.
- Confirm whether a specific W&B role (beyond team-admin for secrets) is required to launch evaluation jobs. Add to prerequisites if so.
- In `evaluations.mdx`, the credentials section only links to the secrets doc. Consider also linking to role/permission documentation and to the "Evaluate a model checkpoint" / "Evaluate a hosted API model" pages from the credentials section (not only from "Next steps").

**Verification steps**

- After "Click **Launch**" in both `evaluate-hosted-model.mdx` (line 44) and `evaluate-model-checkpoint.mdx` (step 11), no confirmation cue (toast, queued status, log location) is described before directing the reader to the recent-run modal. Consider adding a confirmation description.
- `evaluate-model-checkpoint.mdx` doesn't describe expected outcome states for the evaluation job (running, succeeded, failed) or how the reader confirms successful completion.
- Neither launch page has troubleshooting guidance for failed benchmarks, missing secrets, unreachable model URLs, or runs that don't appear in the recent runs list.

**Technical accuracy**

- `evaluate-hosted-model.mdx` line 38: "custom **OpenAPI-compliant** model". Verify whether this should be "OpenAI-compatible" (matching the rest of the page) or whether "OpenAPI" is intentional.
- `evaluate-hosted-model.mdx` line 38: the custom-model syntax `openai-api/wandb/[MODEL-NAME]` appears identical to the Serverless Inference syntax on line 36. Confirm whether the custom case should use a different prefix.
- Confirm the "AI Security Institute" attribution in `evaluate-hosted-model.mdx` (line 35) is current and correct.
- In `evaluate-model-checkpoint.mdx`, "VLLM-compatible format" is used but not defined or linked. Add a definition or link.
- Confirm the `AutoModelForCausalLM.from_pretrained` / `save_pretrained` example in `evaluate-model-checkpoint.mdx` produces VLLM-compatible output for all supported architectures, or document when additional steps are required.
- `evaluate-model-checkpoint.mdx` step 5 mentions an "up to four benchmarks" limit without explanation or reference. Link or document the limit.
- In `evaluations.mdx`, confirm the OpenAI Scorer column shows `Yes` (not `true`) in the product UI. Pass 8 aligned the prose to match the table cells; verify both reflect what users see.
- Confirm the exact field labels `Scorer API key` and `Hugging Face Token` in `evaluations.mdx` match the strings in the product UI.
- The hidden HTML comment in `evaluations.mdx` (lines 23–26) points to source-of-truth files in other repos. Confirm those URLs are current and the catalog is in sync.

**Missing content**

- `evaluate-hosted-model.mdx` doesn't link to a secrets-management page on first mention of "team-scoped secret". Consider linking for self-containment.
- `evaluate-hosted-model.mdx` has no "Next steps" pointer beyond imported snippets. Optional editorial improvement.
- The `wandb.init` call in `evaluate-model-checkpoint.mdx` uses placeholder strings but doesn't clarify whether `entity` and `project` must already exist or are created on the fly.
- In `evaluations.mdx`, `SOSBench` in the Safety table is the only entry without a hyperlink on its name; likely an oversight.
- The "Next steps" list in `evaluations.mdx` mixes link-label items with one longer descriptive item. Consider rephrasing for parallel structure (e.g., "Browse all benchmarks at AISI Inspect Evals").
- Some benchmark descriptions in the `evaluations.mdx` tables don't fully expand their acronyms on first use within the cell. Confirm whether the table is self-explanatory or if a glossary link would help.

## How to review

- Each file's changes are style edits only. Compare side-by-side and flag any that change technical meaning.
- Approve and merge to accept the edits, or close to reject them.
1 parent 37769c9 commit 8b37277

3 files changed

Lines changed: 45 additions & 31 deletions

models/launch/evaluate-hosted-model.mdx

Lines changed: 19 additions & 14 deletions
Original file line number | Diff line number | Diff line change
@@ -1,6 +1,7 @@
11
---
22
description: "Evaluate a hosted API model using infrastructure managed by CoreWeave"
33
title: "Evaluate a hosted API model"
4+
keywords: ["LLM evaluation jobs", "hosted model benchmarks", "leaderboard", "OpenAI-compatible API", "Inspect AI"]
45
---
56
import ReviewEvaluationResults from "/snippets/_includes/llm-eval-jobs/review-evaluation-results.mdx";
67
import RerunEvaluation from "/snippets/_includes/llm-eval-jobs/rerun-evaluation.mdx";
@@ -9,33 +10,37 @@ import PreviewLink from '/snippets/_includes/llm-eval-jobs/preview.mdx';
910

1011
<PreviewLink />
1112

12-
This page shows how to use [LLM Evaluation Jobs](/models/launch) to run a series of evaluation benchmarks on a hosted API model at a publicly-accessible URL, using infrastructure managed by CoreWeave. To evaluate a model checkpoint saved as an artifact in W&B Models, see [Evaluate a model checkpoint](/models/launch/evaluate-model-checkpoint) instead.
13+
This page shows how to use [LLM Evaluation Jobs](/models/launch) to run a series of evaluation benchmarks on a hosted API model at a publicly accessible URL, using infrastructure managed by CoreWeave. Running these benchmarks helps you compare model performance, validate model quality, and publish results to a shared leaderboard without managing your own evaluation infrastructure. To evaluate a model checkpoint saved as an artifact in W&B Models, see [Evaluate a model checkpoint](/models/launch/evaluate-model-checkpoint) instead.
1314

1415
## Prerequisites
16+
17+
Before you create an evaluation job, complete the following:
18+
1519
1. Review the [requirements and limitations](/models/launch#more-details) for LLM Evaluation Jobs.
1620
1. To run certain benchmarks, a team admin must add the required API keys as team-scoped secrets. Any team member can specify the secret when configuring an evaluation job.
17-
- An **OpenAPI API key**: Used by benchmarks that use OpenAI models for scoring. Required if the field **Scorer API key** appears after you select a benchmark. The secret must be named `OPENAI_API_KEY`.
18-
- A **Hugging Face user access token**: Required for certain benchmarks like `lingoly` and `lingoly2` that require access to one or more gated Hugging Face datasets. Required if the field **Hugging Face Token** appears after selecting a benchmark. The API key must have access to the relevant dataset. See the Hugging Face documentation for [User access tokens](https://huggingface.co/docs/hub/en/security-tokens) and [accessing gated datasets](https://huggingface.co/docs/hub/en/datasets-gated#access-gated-datasets-as-a-user).
19-
- To evaluate a model provided by [Serverless Inference](/inference), an organization or team admin must create `WANDB_API_KEY` with any value. The secret is not actually used for authentication.
21+
- An **OpenAI API key**: Used by benchmarks that use OpenAI models for scoring. Required if the field **Scorer API key** appears after you select a benchmark. The secret must be named `OPENAI_API_KEY`.
22+
- A **Hugging Face user access token**: Required for certain benchmarks like `lingoly` and `lingoly2` that require access to one or more gated Hugging Face datasets. Required if the field **Hugging Face Token** appears after you select a benchmark. The API key must have access to the relevant dataset. See the Hugging Face documentation for [User access tokens](https://huggingface.co/docs/hub/en/security-tokens) and [accessing gated datasets](https://huggingface.co/docs/hub/en/datasets-gated#access-gated-datasets-as-a-user).
23+
- To evaluate a model provided by [Serverless Inference](/inference), an organization or team admin must create `WANDB_API_KEY` with any value. The secret isn't used for authentication.
2024
1. The model to evaluate must be available at a publicly accessible URL. An organization or team admin must create a team-scoped secret with the API key for authentication.
2125
1. Create a new [W&B project](/models/track/project-page) for the evaluation results. From the project sidebar, click **Create new project**.
22-
1. Review the documentation for a given benchmark to understand how it works and learn about specific requirements. For convenience, the [Available evaluation benchmarks](/models/launch/evaluations) reference includes relevant links.
26+
1. Review the documentation for a given benchmark to understand how it works and learn about specific requirements. The [Available evaluation benchmarks](/models/launch/evaluations) reference includes relevant links.
2327

2428
## Evaluate your model
25-
Follow these steps to set up and launch an evaluation job:
29+
30+
Follow these steps to set up and launch an evaluation job. When you finish, your benchmark runs are queued on CoreWeave-managed infrastructure and their results appear in the destination W&B project you specify.
2631

2732
1. Log in to W&B, then click **Launch** in the project sidebar. The **LLM Evaluation Jobs** page displays.
2833
1. Click **Evaluate hosted API model** to set up the evaluation.
2934
1. Select a destination project to save the evaluation results to.
30-
1. In the **Model** section, specify the base URL and model name to evaluate, and select the API key to use for authentication. Provide the model name in OpenAI-compatible format defined by the [AI Security Institute](https://inspect.aisi.org.uk/providers.html#openai-api). For example, specify an OpenAI mode in the following syntax: `openai/<model-name>`. For a comprehensive list of hosted model providers and models, see [AI Security Institute's model provider reference](https://inspect.aisi.org.uk/providers.html).
31-
- To evaluate a model provided by [Serverless Inference](/inference), set the base URL to `https://api.inference.wandb.ai/v1` and specify the model name in the following syntax: `openai-api/wandb/<model_id>`. Refer to the [Inference model catalog](/inference/models) for details.
32-
- To use the [OpenRouter](https://inspect.aisi.org.uk/providers.html#openrouter) provider, prefix the model name with `openrouter` in the following syntax: `openrouter/<model-name>`.
33-
- To evaluate a custom OpenAPI-compliant model, specify the model name in the following syntax: `openai-api/wandb/<model-name>`.
35+
1. In the **Model** section, specify the base URL and model name to evaluate, and select the API key to use for authentication. Provide the model name in OpenAI-compatible format defined by the [AI Security Institute](https://inspect.aisi.org.uk/providers.html#openai-api). For example, specify an OpenAI model in the following syntax, where `[MODEL-NAME]` is the name of the model: `openai/[MODEL-NAME]`. For a list of hosted model providers and models, see [AI Security Institute's model provider reference](https://inspect.aisi.org.uk/providers.html).
36+
- To evaluate a model provided by [Serverless Inference](/inference), set the base URL to `https://api.inference.wandb.ai/v1` and specify the model name in the following syntax, where `[MODEL-ID]` is the model ID: `openai-api/wandb/[MODEL-ID]`. Refer to the [Inference model catalog](/inference/models) for details.
37+
- To use the [OpenRouter](https://inspect.aisi.org.uk/providers.html#openrouter) provider, prefix the model name with `openrouter` in the following syntax, where `[MODEL-NAME]` is the name of the model: `openrouter/[MODEL-NAME]`.
38+
- To evaluate a custom OpenAPI-compliant model, specify the model name in the following syntax, where `[MODEL-NAME]` is the name of the model: `openai-api/wandb/[MODEL-NAME]`.
3439
1. Click **Select evaluations**, then select up to four benchmarks to run.
35-
1. If you select benchmarks that use OpenAI models for scoring, the **Scorer API key** field displays. Click it, then select the `OPENAI_API_KEY` secret. For convenience, a team admin can create a secret from this drawer by clicking **Create secret**.
40+
1. If you select benchmarks that use OpenAI models for scoring, the **Scorer API key** field displays. Click it, then select the `OPENAI_API_KEY` secret. A team admin can create a secret from this drawer by clicking **Create secret**.
3641
1. If you select benchmarks that require access to gated datasets in Hugging Face, a **Hugging Face token** field displays. [Request access to the relevant dataset](https://huggingface.co/docs/hub/en/datasets-gated#access-gated-datasets-as-a-user), then select the secret that contains the Hugging Face user access token.
37-
1. Optionally, set **Sample limit** to a positive integer to limit the maximum number of benchmark samples to evaluate. Otherwise, all samples in the task are included.
38-
1. To create a leaderboard automatically, click **Publish results to leaderboard**. The leaderboard will display all evaluations together in a workspace panel, and you can also share it in a report.
42+
1. Optional: Set **Sample limit** to a positive integer to limit the maximum number of benchmark samples to evaluate. Otherwise, all samples in the task are included.
43+
1. To create a leaderboard automatically, click **Publish results to leaderboard**. The leaderboard displays all evaluations together in a workspace panel, and you can also share it in a report.
3944
1. Click **Launch** to launch the evaluation job.
4045
1. Click the circular arrow icon at the top of the page to open the recent run modal. Evaluation jobs appear with your other recent runs. Click the name of a finished run to open it in single-run view, or click the **Leaderboard** link to open the leaderboard directly. For details, see [View the results](#view-the-results).
4146

@@ -45,7 +50,7 @@ This example job runs the `simpleqa` benchmark against the OpenAI model `o4-mini
4550
![Example hosted model evaluation job](/images/models/llm-evaluation-jobs/hosted-model-job-example.png)
4651
</Frame>
4752

48-
This example leaderboard visualizes the performance of several OpenAI models together:
53+
If you published results to a leaderboard, you can compare evaluations side by side. This example leaderboard visualizes the performance of several OpenAI models together:
4954

5055
<Frame>
5156
![Example leaderboard visualizing the performance of several hosted models](/images/models/llm-evaluation-jobs/hosted-model-leaderboard-example.png)
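
Reviewer note: the model-name syntaxes listed in the **Model** step of this file can be sketched as plain string formatting. This is a minimal illustration and not part of the commit. The helper `format_model_name` and the provider keys are hypothetical names chosen for this sketch; only the prefixes are copied from the diff. It also makes visible the point raised in the commit message: as documented, the Serverless Inference and custom-model cases share the `openai-api/wandb/` prefix.

```python
# Hypothetical helper, not part of W&B or Inspect AI. It only concatenates
# the prefixes quoted in the diff with a caller-supplied model name.
PREFIXES = {
    "openai": "openai/",
    "serverless-inference": "openai-api/wandb/",
    "openrouter": "openrouter/",
    "custom": "openai-api/wandb/",  # as documented; flagged for review
}


def format_model_name(provider: str, model: str) -> str:
    """Return the prefixed model name for one of the documented cases."""
    if provider not in PREFIXES:
        raise ValueError(f"unknown provider: {provider!r}")
    if not model:
        raise ValueError("model name must not be empty")
    return PREFIXES[provider] + model


if __name__ == "__main__":
    # Print one example per documented case, using a placeholder model name.
    for provider in PREFIXES:
        print(provider, "->", format_model_name(provider, "my-model"))
```

If the custom case turns out to need a different prefix, only the `custom` entry in this sketch would change.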

models/launch/evaluate-model-checkpoint.mdx

Lines changed: 14 additions & 8 deletions
Original file line number | Diff line number | Diff line change
@@ -1,6 +1,7 @@
11
---
22
description: "Evaluate a VLLM-compatible model checkpoint using infrastructure managed by CoreWeave"
33
title: "Evaluate a model checkpoint"
4+
keywords: ["LLM Evaluation Jobs", "VLLM", "benchmark", "leaderboard", "model checkpoint"]
45
---
56
import ReviewEvaluationResults from "/snippets/_includes/llm-eval-jobs/review-evaluation-results.mdx";
67
import RerunEvaluation from "/snippets/_includes/llm-eval-jobs/rerun-evaluation.mdx";
@@ -9,19 +10,23 @@ import PreviewLink from '/snippets/_includes/llm-eval-jobs/preview.mdx';
910

1011
<PreviewLink />
1112

12-
This page shows how to use [LLM Evaluation Jobs](/models/launch) to run a series of evaluation benchmarks on a fine-tuned model in W&B Models, using infrastructure managed by CoreWeave. To evaluate a hosted API model served at a publicly-accessible URL, see [Evaluate an API-hosted model](/models/launch/evaluate-hosted-model) instead, or run a small benchmark against a public OpenAI model endpoint with a streamlined [Quickstart](/models/launch#quickstart).
13+
This page shows how to use [LLM Evaluation Jobs](/models/launch) to run a series of evaluation benchmarks on a fine-tuned model in W&B Models, using infrastructure managed by CoreWeave. To evaluate a hosted API model served at a publicly accessible URL, see [Evaluate an API-hosted model](/models/launch/evaluate-hosted-model) instead, or run a small benchmark against a public OpenAI model endpoint with a streamlined [Quickstart](/models/launch#quickstart).
1314

1415
## Prerequisites
16+
17+
Before you evaluate a model checkpoint, complete the following:
18+
1519
1. Review the [requirements and limitations](/models/launch#more-details) for LLM Evaluation Jobs.
1620
1. To run certain benchmarks, a team admin must add the required API keys as [team-scoped secrets](/platform/secrets#add-a-secret). Any team member can specify the secret when configuring an evaluation job. See [Evaluation model catalog](/models/launch/evaluations) for requirements.
17-
- An **OpenAPI API key**: Used by benchmarks that use OpenAI models for scoring. Required if the field **Scorer API key** appears after you select a benchmark. The secret must be named `OPENAI_API_KEY`.
18-
- A **Hugging Face user access token**: Required for certain benchmarks like `lingoly` and `lingoly2` that require access to one or more gated Hugging Face datasets. Required if the field **Hugging Face Token** appears after selecting a benchmark. The API key must have access to the relevant dataset. See the Hugging Face documentation for [User access tokens](https://huggingface.co/docs/hub/en/security-tokens) and [accessing gated datasets](https://huggingface.co/docs/hub/en/datasets-gated#access-gated-datasets-as-a-user).
21+
- An **OpenAPI API key**: used by benchmarks that use OpenAI models for scoring. Required if the field **Scorer API key** appears after you select a benchmark. The secret must be named `OPENAI_API_KEY`.
22+
- A **Hugging Face user access token**: required for certain benchmarks like `lingoly` and `lingoly2` that require access to one or more gated Hugging Face datasets. Required if the field **Hugging Face token** appears after selecting a benchmark. The API key must have access to the relevant dataset. See the Hugging Face documentation for [User access tokens](https://huggingface.co/docs/hub/en/security-tokens) and [accessing gated datasets](https://huggingface.co/docs/hub/en/datasets-gated#access-gated-datasets-as-a-user).
1923
1. Create a new [W&B project](/models/track/project-page) for the evaluation results. From the project sidebar, click **Create new project**.
20-
1. Package the model in VLLM-compatible format and save it as an artifact in W&B Models. An attempt to benchmark any other type of artifact will fail. For one approach, see [Example: Prepare a model](#example-prepare-your-model) at the end of this page.
24+
1. Package the model in VLLM-compatible format and save it as an artifact in W&B Models. An attempt to benchmark any other type of artifact fails. For one approach, see the following [Example: Prepare a model](#example-prepare-a-model) section.
2125
1. Review the documentation for a given benchmark to understand how it works and learn about specific requirements. For convenience, the [Available evaluation benchmarks](/models/launch/evaluations) reference includes relevant links.
2226

2327
## Evaluate your model
24-
Follow these steps to set up and launch an evaluation job:
28+
29+
After you complete the prerequisites, follow these steps to set up and launch an evaluation job:
2530

2631
1. Log in to W&B, then click **Launch** in the project sidebar. The **LLM Evaluation Jobs** page displays.
2732
1. Click **Evaluate model checkpoint** to set up the evaluation job.
@@ -30,8 +35,8 @@ Follow these steps to set up and launch an evaluation job:
3035
1. Click **Evaluations**, then select up to four benchmarks.
3136
1. If you select benchmarks that use OpenAI models for scoring, the **Scorer API key** field displays. Click it, then select the `OPENAI_API_KEY` secret. For convenience, a team admin can create a secret from this drawer by clicking **Create secret**.
3237
1. If you select benchmarks that require access to gated datasets in Hugging Face, a **Hugging Face token** field displays. [Request access to the relevant dataset](https://huggingface.co/docs/hub/en/datasets-gated#access-gated-datasets-as-a-user), then select the secret that contains the Hugging Face user access token.
33-
1. Optionally, set **Sample limit** to a positive integer to limit the maximum number of benchmark samples to evaluate. Otherwise, all samples in the task are included.
34-
1. To create a leaderboard automatically, click **Publish results to leaderboard**. The leaderboard will display all evaluations together in a workspace panel, and you can also share it in a report.
38+
1. Optional: Set **Sample limit** to a positive integer to limit the maximum number of benchmark samples to evaluate. Otherwise, the job includes all samples in the task.
39+
1. To create a leaderboard automatically, click **Publish results to leaderboard**. The leaderboard displays all evaluations together in a workspace panel, and you can also share it in a report.
3540
1. Click **Launch** to launch the evaluation job.
3641
1. Click the circular arrow icon at the top of the page to open the recent run modal. Evaluation jobs appear with your other recent runs. Click the name of a finished run to open it in single-run view, or click the **Leaderboard** link to open the leaderboard directly. For details, see [View the results](#view-the-results).
3742

@@ -58,7 +63,8 @@ This example leaderboard visualizes the performance of several models together:
5863
<ExportEvaluation />
5964

6065
## Example: Prepare a model
61-
To prepare your model, you load it in W&B Models, package the model weights in VLLM-compatible format, and save the result. This example shows one way to do this:
66+
67+
Before you can evaluate a model checkpoint, you must package it in VLLM-compatible format and save it as an artifact in W&B Models. This example shows one way to do this:
6268

6369
```python lines
6470
import os
