RHOAIENG-38790: Cursor Rule Files

DavidAdaRH · openshift-merge-bot[bot] · commit 4c48972ec475 · 2026-03-27T08:23:30.000Z
rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
diff --git a/.cursor/rules/01-project-context.mdc b/.cursor/rules/01-project-context.mdc
@@ -0,0 +1,84 @@
+---
+description: Codeflare SDK core context, grounding, and user personas — apply to every chat
+globs:
+alwaysApply: true
+---
+
+# Codeflare SDK — Core context & personas
+
+## What this is
+- **Codeflare SDK**: Python SDK for batch resource requesting, Ray clusters, job submission, Kueue. Apache-2.0.
+- **Python**: ^3.11 in pyproject.toml; CI runs 3.12.
+- **Repo**: [project-codeflare/codeflare-sdk](https://github.com/project-codeflare/codeflare-sdk). Main package: `src/codeflare_sdk/`.
+
+## Project structure
+```
+src/codeflare_sdk/
+  __init__.py              # Public API — export new public classes/functions here
+  conftest.py              # Global test fixtures (auto-mocks K8s API clients)
+  common/
+    kubernetes_cluster/    # Auth, API client, error handling
+    kueue/                 # Local queue listing, default queue resolution
+    utils/                 # Constants, helpers, validation
+    widgets/               # Jupyter/IPython widgets
+  ray/
+    cluster/               # Cluster create/config/status/delete (main entry: Cluster)
+    rayjobs/               # RayJob submit, tracking, runtime env
+    client/                # Ray JobSubmissionClient wrapper
+  vendored/                # DO NOT MODIFY — vendored KubeRay client
+tests/
+  e2e/, e2e_v2/            # E2E (KinD/Kueue/KubeRay); not run in “new code” flow
+```
+
+## Public API
+- **Export new public classes and functions** in `src/codeflare_sdk/__init__.py`. Do not add public API without listing it there.
+
+## Grounding — avoid hallucination
+- **Only use APIs, types, and file paths that exist in this codebase.** If unsure, search the repo (e.g. `codeflare_sdk`, `Cluster`, `ClusterConfiguration`, `get_api_client`) before suggesting code.
+- **Do not invent** new modules, env vars, or config keys unless the user explicitly asks to add them.
+- **Prefer referencing** existing modules: `codeflare_sdk.ray.cluster`, `codeflare_sdk.common.kueue`, `codeflare_sdk.common.kubernetes_cluster.auth`, etc.
+- When adding features, follow existing patterns in the same package (e.g. `kueue.py` for Kueue, cluster code under `ray/cluster/`).
+
+## Conventions (high level)
+- New code: type hints, Apache-2.0 header. Tests: pytest; coverage ≥90% project, ≥85% patch. Format: pre-commit.
+- **Reuse existing enums and types** for status/state (e.g. `RayClusterStatus`); do not introduce new string-based status fields for concepts already modeled in the codebase.
+- **Scope of changes**: Do not refactor, rename, or change code outside the scope of the user's request unless required for the change to work (e.g. a type used by new code). Prefer minimal, targeted edits to keep diffs small and avoid unrelated "improvements" that can break things or waste review time.
+
+---
+
+## User personas
+
+Use these personas to keep code, APIs, and docs user-relevant.
+
+### Cluster / platform admin
+- Manages Kubernetes/OpenShift, Kueue, quotas, namespaces.
+- Cares about: auth (kubeconfig, OIDC, tokens), RBAC, resource limits, default queues, priority classes.
+- Prefer: clear errors, `config_check()`, namespace/queue handling, security and quotas.
+
+### Data scientist / ML engineer
+- Runs Ray jobs, uses notebooks, wants minimal YAML/K8s.
+- Cares about: `Cluster`/`ClusterConfiguration`, job submission, runtime env, demos (`codeflare_sdk.copy_demo_nbs()`).
+- Prefer: simple APIs, good defaults, demos and docs that match notebook workflows.
+
+### Application developer
+- Integrates SDK into apps or pipelines.
+- Cares about: programmatic API, status checks, timeouts, error handling, idempotency.
+- Prefer: stable function signatures, logging, and predictable behavior.
+
+When writing or changing code, consider: "Which persona does this serve?" and keep their use case in mind.
+
+---
+
+## Suggesting rule improvements
+
+At the **end of the conversation**, if there is clear evidence in **this chat** that a new or updated rule would help, suggest one or two concrete improvements for the user to accept or deny.
+
+**Only suggest when:**
+- You had to correct the same type of mistake more than once, or
+- The user had to explicitly ask for something the existing rules could have enforced, or
+- You had to guess on style, structure, or behavior that could be turned into a guardrail.
+
+**How to suggest:**
+- Propose **concrete** rule text (a bullet or short paragraph) and say which `.mdc` file it belongs in (e.g. `02-python-standards.mdc`, `03-testing-and-ci.mdc`).
+- Do **not** suggest rules that are already covered by the existing `.cursor/rules` content.
+- Present suggestions as optional: "You could add the following to …" or "Consider adding a rule: …". The user may accept or deny; do not edit `.mdc` files unless the user explicitly asks you to.
diff --git a/.cursor/rules/02-python-standards.mdc b/.cursor/rules/02-python-standards.mdc
@@ -0,0 +1,80 @@
+---
+description: Python coding standards — quality, style, and matching existing codebase patterns
+globs: **/*.py
+alwaysApply: false
+---
+# Python coding standards
+
+When adding or editing code, **match the style and patterns of the surrounding code and the same subpackage**. Prefer consistency with existing files over introducing new conventions.
+
+---
+
+## Style & tooling
+- **Formatting / pre-commit**: Use pre-commit for all checks. Run `pre-commit run --show-diff-on-failure --color=always --all-files` before committing.
+- **Naming**: snake_case for functions, variables, modules; PascalCase for classes.
+- **Apache-2.0 header** at top of new files (see any `src/codeflare_sdk/**/*.py`).
+- **Type hints** for function parameters and return types (e.g. `Optional[str]`, `List[...]`). Use `typing`: `Optional`, `List`, `Dict`, `Tuple`. Prefer **dataclasses** for configuration or data-holding types (see `ClusterConfiguration` in config.py).
+
+---
+
+## Docstrings
+- **Format**: Google-style. One-line summary; then `Args:` with `name (type):` or `name (type, optional):`; then `Returns:`; add `Note:` or `Raises:` if needed. Required for public functions.
+- **Example** (match this style):
+  ```
+  Args:
+      namespace (str):
+          The Kubernetes namespace to query.
+  Returns:
+      Optional[str]:
+          The name of the default queue, or None.
+  ```
+- Optional multi-line description before Args for non-trivial behavior (see `get_default_kueue_name` in kueue.py).
+- **Module docstring**: Optional but used in submodules (e.g. cluster, config, auth) — short description of what the module provides.
+
+---
+
+## Imports
+- **Order**: stdlib → third-party → local. Blank line between groups.
+- **Local imports**: Use **relative** imports within the same package (e.g. `from ...common.utils import get_current_namespace`, `from .config import ClusterConfiguration`). Use **absolute** `from codeflare_sdk....` when importing from another top-level package or in tests.
+- **K8s/auth**: Prefer `from codeflare_sdk.common.kubernetes_cluster.auth import config_check, get_api_client` or `from ...common.kubernetes_cluster.auth import ...`. Use `from codeflare_sdk.common import _kube_api_error_handling` (or relative `...common`) for API error handling.
+
+---
+
+## Structure & naming
+- **Private helpers**: Prefix with `_` (e.g. `_fetch_local_queues`, `_find_default_queue_name`). Use for logic that is not part of the public API.
+- **Logging**: `logger = logging.getLogger(__name__)` at module level; use `logger` instead of `print` for runtime or debug output. (Some legacy code still uses `print`; new code should use `logger`.)
+- **Deprecation**: Use `@deprecated` from `typing_extensions` and `warnings.warn(..., DeprecationWarning, stacklevel=2)`; see `auth.py` for the pattern.
+
+---
+
+## Kubernetes / API errors
+- Use **`_kube_api_error_handling(e)`** for `ApiException` (from `kubernetes.client.exceptions` or `kubernetes.client.rest`). Do not add new ad-hoc exception handling patterns; follow kueue.py and build_ray_cluster.py.
+- Call **`config_check()`** before K8s API calls when the code path expects a configured client.
+- Use **`get_api_client()`** to obtain the client; do not instantiate new Kubernetes clients directly for the default SDK-configured client.
+
+## Parsing Kubernetes / API response dicts
+- Assume CR or list/get response fields can be **missing or wrong type**. Use safe access (e.g. `.get()`, `try/except` for KeyError, IndexError, TypeError) and coerce numeric fields with **`int(...)`** where appropriate (e.g. replicas, counts) so string or missing values do not crash.
+- When parsing Kubernetes Custom Resources (dictionaries), use **isolated try/except blocks** for distinct sections (e.g. metadata, spec, status). Do not let a missing status field prevent metadata or spec from being parsed.
+- **Never** use raw strings to represent application or cluster states. Always search for and reuse existing Enums (e.g. `RayClusterStatus` in `ray/cluster/status.py`). Map invalid or missing values to the enum’s unknown/default (e.g. `RayClusterStatus.UNKNOWN`) inside a try/except; do not introduce new string constants for the same concept.
+
+---
+
+## Canonical examples (copy patterns from these)
+- **Kueue / K8s API helpers**: `src/codeflare_sdk/common/kueue/kueue.py` — docstrings, `_helper` functions, ApiException handling, `config_check()`/`get_api_client()`.
+- **Cluster / high-level API**: `src/codeflare_sdk/ray/cluster/cluster.py` — class docstrings, relative imports from `...common`, use of `_kube_api_error_handling`.
+- **Config / dataclasses**: `src/codeflare_sdk/ray/cluster/config.py` — module docstring, `@dataclass`, Google-style Args with type in parens.
+- **Auth**: `src/codeflare_sdk/common/kubernetes_cluster/auth.py` — module docstring, deprecation pattern, abstract base classes.
+
+When in doubt, open the nearest existing file in the same package and mirror its style.
+
+---
+
+## Don’t
+- Add new dependencies without updating `pyproject.toml` (Poetry). Use existing test deps: pytest, pytest-mock, pytest-timeout, coverage.
+- Change Python version: keep ^3.11 per pyproject.toml.
+
+## Common pitfalls
+- **Don’t import from vendored** — use the SDK’s own wrappers (e.g. `RayjobApi` via codeflare_sdk, not raw vendored modules).
+- **Don’t hardcode Ray image tags** — use `common/utils/constants.py` (maps Python versions to default images).
+- **Don’t skip `config_check()`** — call it before K8s API calls when the code path expects a configured client.
+- **Unit test timeout is 900s** — long-running tests are killed after 15 minutes (pyproject.toml).
diff --git a/.cursor/rules/03-testing-and-ci.mdc b/.cursor/rules/03-testing-and-ci.mdc
@@ -0,0 +1,126 @@
+---
+description: Testing and coverage when new code is added — pre-commit, unit tests, coverage (no E2E/notebooks)
+globs: "**/test_*.py,**/*_test.py,**/tests/**/*.py,src/**/*.py,.github/workflows/*.yml,.github/workflows/*.yaml"
+alwaysApply: true
+---
+
+# Testing & coverage (new code)
+
+When **new code is added**, run this validation pipeline: **pre-commit**, **unit tests**, and **coverage**. Do **not** run E2E or notebook tests (they take too long).
+
+---
+
+## Cursor: no write/CRUD git commands
+
+- **Cursor MUST NOT run any git commands that modify state** (e.g. no `git commit`, `git push`, `git add`, `git merge`, `git rebase`, `git reset --hard`, etc.). The user runs those themselves.
+- **Cursor MAY run read-only git commands** when useful (e.g. `git status`, `git log`, `git diff`, `git show`, `git branch -a`). The one permitted exception to the rule above is `git checkout -- <path>` to restore a file, which does modify the working tree.
+---
+
+## Validation when adding new code
+
+- **When to run**: If you added or changed code in this session, run the validation pipeline **once, just before ending the conversation**. Do not run pre-commit, unit tests, or coverage after every small edit; run them only at the end so the user sees a single pass/fail before they leave.
+
+**Quick commands** (unit tests only, no E2E/notebooks; then check **patch** coverage only, not full codebase):
+```bash
+pre-commit run --show-diff-on-failure --color=always --all-files
+poetry install --with test
+coverage run --omit="src/**/test_*.py,src/codeflare_sdk/common/utils/unit_test_support.py,src/codeflare_sdk/vendored/**" -m pytest --ignore=tests/e2e --ignore=tests/e2e_v2 --ignore=tests/upgrade --ignore=demo-notebooks --ignore=ui-tests
+coverage report -m
+# Then, for source files changed in this session only: coverage report -m --include="path/to/changed1.py,path/to/changed2.py"  (require ≥85% for that subset; ignore overall %)
+```
+
+### 1. Pre-commit
+- Run before considering changes done: `pre-commit run --all-files` (or the full command above).
+- Do not skip or alter pre-commit hooks.
+
+### 2. Unit tests (no E2E, no notebooks)
+- Run **only unit tests**. Do **not** run E2E or notebook tests (they take too long). Exclude: `tests/e2e`, `tests/e2e_v2`, `tests/upgrade`, `demo-notebooks`, `ui-tests`.
+- Use the coverage + pytest command from **Quick commands** above.
+
+### 3. Coverage requirements (in this session)
+- **Check only patch coverage**: In Cursor, only validate coverage for **code added or changed in this chat** (the “patch”), not the full codebase. Full codebase coverage is enforced in GitHub CI and will already be ≥90% on main; local runs often show low overall % (e.g. below 30%) due to vendored code, kuberay client, and other excluded paths—**do not fail the session on that**.
+- **Patch target**: For source files you added or modified in this session, require **≥85%** coverage. After running the coverage commands above, run `coverage report -m --include="path/to/changed1.py,path/to/changed2.py,..."` with the actual paths you changed and ensure that subset is ≥85%. If you only changed tests or config, there is no patch coverage to check.
+- **CI / codecov**: GitHub runs full pytest and enforces ≥90% project coverage; codecov.yml uses patch 85%, threshold 2.5%. Ignore: `**/*.ipynb`, `demo-notebooks/**`, `**/__init__.py`.
+
+---
+
+## Unit test stack (do not replace)
+- **Runner**: pytest 9.x. **Extras**: pytest-mock, pytest-timeout (default timeout 900s in pyproject.toml). **Coverage**: coverage 7.x.
+
+### Where unit tests live
+- Tests next to code: `src/codeflare_sdk/**/test_*.py`. Ignore: `src/codeflare_sdk/vendored/**`, `unit_test_support.py`.
+- **conftest.py**: The global `src/codeflare_sdk/conftest.py` auto-mocks K8s API clients; tests inherit these mocks. Override with `mocker` or `monkeypatch` when testing specific K8s/config behavior. Never make real Kubernetes API calls in unit tests.
+
+### Pytest markers (use when relevant)
+- `smoke` — quick validation. `tier1` — standard suite.
+- `kind`, `openshift`, `nvidia_gpu` — environment-specific (E2E only; do not run in the “new code” flow).
+
+### Writing tests
+- Use **mocker** (pytest-mock) for K8s/API calls; see `test_kueue.py` and `test_auth*.py` for patterns.
+- **NEVER** hardcode raw Kubernetes JSON payloads in test files. You **MUST** use or extend the helper functions in `src/codeflare_sdk/common/utils/unit_test_support.py` (e.g. `get_ray_obj_with_status`, `get_obj_none`, `get_local_queue`, `create_cluster_config`, `apply_template`).
+- New code in `src/codeflare_sdk` should have corresponding tests so patch coverage (for that new code) stays ≥85%; full project coverage is enforced in CI.
+- **Edge-case tests for API/CR parsing**: When adding code that parses Kubernetes Custom Resource or list/get API response dicts, add at least one test that uses **malformed or partial payloads**: empty `items`, missing `spec` or `status`, empty or missing nested lists (e.g. `workerGroupSpecs`). Assert safe defaults (e.g. 0 for counts, UNKNOWN or equivalent for status). Test error-handling by mocking the API to raise (e.g. `ApiException`); assert the handler is invoked (e.g. mock `_kube_api_error_handling` and `assert_called_once()`) rather than asserting exact stdout text.
+
+---
+
+## CI workflows (reference only; E2E/notebooks not in “new code” flow)
+
+### Versions (CI env)
+- **KUEUE_VERSION**: v0.13.4. **KUBERAY_VERSION**: v1.4.2 (opendatahub-io/kuberay fork, RHOAI features). **Python**: 3.12. **Common repo**: project-codeflare/codeflare-common @ main (KinD/GPU setup).
+
+### Pre-commit (every PR / workflow_dispatch)
+- **Run before pushing**: `pre-commit run --all-files`. Image: `quay.io/project-codeflare/codeflare-sdk-precommit:v0.0.1`. Do not skip or alter pre-commit hooks.
+
+### Unit tests (every PR)
+- `poetry install --with test` then pytest with coverage ≥90%. No paths-ignore for this workflow.
+
+### RayJob E2E (PR to main, release-*, ray-jobs-feature)
+- **Paths-ignore**: docs/**, **.adoc, **.md, LICENSE. **Runner**: gpu-t4-4-core. KinD + NVIDIA GPU operator + Kueue + KubeRay.
+- **Command**: `poetry run pytest -v -s ./tests/e2e/rayjob/`
+- **RBAC**: sdk-user with limited permissions (rayclusters, rayjobs, localqueues, clusterqueues, resourceflavors, pods, services, secrets, workloads, etc.). Do not assume cluster-admin.
+
+### General E2E (same branches as RayJob E2E)
+- **Command**: `poetry run pytest -v -s ./tests/e2e/ -m 'kind and nvidia_gpu'`
+- E2E tests in `tests/e2e/` must use `@pytest.mark.kind` and `@pytest.mark.nvidia_gpu` to run in this workflow. Place RayJob e2e in `tests/e2e/rayjob/`; other e2e in `tests/e2e/`. Install for e2e: `poetry install --with test,docs`.
+
+### Guided notebooks (label: test-guided-notebooks)
+- KinD + Kueue + KubeRay; no GPU. Notebooks: 0_basic_ray, 4_rayjob_existing_cluster, 5_submit_rayjob_cr. Run with papermill (see Demo notebooks below).
+
+### UI notebooks (labels: test-guided-notebooks or test-ui-notebooks)
+- Job: verify-3_widget_example. Playwright (chromium) in `ui-tests/`; notebook `demo-notebooks/guided-demos/3_widget_example.ipynb`.
+
+### Additional notebooks (label: test-additional-notebooks)
+- local_interactive.ipynb and ray_job_client.ipynb are **skipped** in CI (mTLS/OpenShift required; not available in KinD).
+
+---
+
+## Demo notebooks & CI
+
+### Notebooks executed in CI
+
+| Notebook | Workflow | Notes |
+|----------|----------|--------|
+| 0_basic_ray.ipynb | Guided | KinD: namespace='default', dashboard_check=False, remove auth cells |
+| 4_rayjob_existing_cluster.ipynb | Guided | KinD: namespace='default', GPU 0, remove oc login cell |
+| 5_submit_rayjob_cr.ipynb | Guided | KinD: namespace='default', remove oc login cell |
+| 3_widget_example.ipynb | UI | Playwright in ui-tests/; namespace='default', view_clusters('default'), remove auth cells |
+
+**Skipped in CI** (require mTLS/OpenShift): local_interactive.ipynb, ray_job_client.ipynb.
+
+### KinD-specific adaptations (CI applies these)
+- **Auth**: Remove cells that do auth/login (e.g. "Create authentication object for user permissions", `auth.logout()`, `oc login`) — KinD doesn't support token auth the same way.
+- **Namespace**: Use `namespace='default'` where the SDK needs it. Replace `namespace="your-namespace"` with `namespace="default"` in notebooks.
+- **Dashboard**: Use `cluster.wait_ready(dashboard_check=False)` in KinD (no HTTPRoute/Route).
+- **GPU**: In KinD jobs without GPU, set GPU requests to 0 (e.g. `head_extended_resource_requests={'nvidia.com/gpu':0}`).
+- **Widget**: For 3_widget_example, call `view_clusters('default')` with explicit namespace.
+
+When editing guided demos, keep them runnable on both real OpenShift and KinD; CI runs on KinD with the above edits applied in the workflow.
+
+### How notebooks are run
+- **Guided**: `poetry run papermill <notebook>.ipynb <notebook>_out.ipynb --log-output --execution-timeout 600` from `demo-notebooks/guided-demos`. Install: `poetry install --with test,docs`; for papermill also `pip install papermill ipython ipykernel`.
+- **UI**: From `ui-tests/`, `poetry run yarn test` (Playwright). Dependencies: `yarn install`, `yarn playwright install chromium`.
+
+### Adding a new notebook that should run in CI
+- **Guided**: Add a job in the Guided notebooks workflow (similar to verify-0_basic_ray) and apply the same KinD adaptations in the workflow steps.
+- **UI**: Add to ui-tests and follow the 3_widget_example pattern (namespace, auth removal) if it uses cluster/widget APIs.
+- Do not rely on mTLS or OpenShift-only features if the notebook should run in current KinD CI.
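
Review note on `03-testing-and-ci.mdc` (not part of the commit): the edge-case testing guidance under "Writing tests" could translate into tests shaped like the ones below. Everything here is a hypothetical stand-in so the snippet runs without a cluster or the SDK: `ApiException` mimics the Kubernetes client exception, `count_workers` and `api.list_clusters` are invented for illustration, and the error handler is passed in as a parameter instead of being patched. The sketch uses `unittest.mock` from the standard library; in the repo, pytest-mock's `mocker` fixture and the helpers in `unit_test_support.py` would be used instead.

```python
from typing import Any, Callable
from unittest.mock import MagicMock


class ApiException(Exception):
    """Hypothetical stand-in for kubernetes.client.exceptions.ApiException."""


def count_workers(
    api: Any, namespace: str, on_error: Callable[[Exception], None]
) -> int:
    """Hypothetical function under test: sum worker replicas across listed items."""
    try:
        # list_clusters is an invented method name, not a real client call.
        response = api.list_clusters(namespace)
    except ApiException as e:
        on_error(e)
        return 0

    total = 0
    for item in response.get("items") or []:
        try:
            groups = item["spec"]["workerGroupSpecs"]
            # The sum is computed fully before adding, so a malformed item adds nothing.
            total += sum(int(group["replicas"]) for group in groups)
        except (KeyError, TypeError, ValueError):
            continue
    return total


def test_malformed_payloads_yield_safe_defaults() -> None:
    api = MagicMock()
    api.list_clusters.return_value = {
        "items": [
            {},  # missing spec
            {"spec": {"workerGroupSpecs": []}},  # empty nested list
            {"spec": {"workerGroupSpecs": [{"replicas": "2"}]}},  # string count
        ]
    }
    handler = MagicMock()
    assert count_workers(api, "default", handler) == 2
    handler.assert_not_called()


def test_api_error_invokes_handler() -> None:
    api = MagicMock()
    api.list_clusters.side_effect = ApiException("boom")
    handler = MagicMock()
    assert count_workers(api, "default", handler) == 0
    # Assert on the handler call, not on any printed text.
    handler.assert_called_once()
```

Asserting that the handler was called, rather than matching stdout, keeps the test stable if the error message wording changes later.
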