Merged
Changes from 7 commits
19 changes: 19 additions & 0 deletions .ci/scripts/test_backend.sh
@@ -85,7 +85,26 @@ else
fi
CMAKE_ARGS="$EXTRA_BUILD_ARGS" ${CONDA_RUN_CMD} $SETUP_SCRIPT --build-tool cmake --build-mode Release --editable true

GOLDEN_DIR="${ARTIFACT_DIR}/golden-artifacts"
export GOLDEN_ARTIFACTS_DIR="${GOLDEN_DIR}"

Comment on lines +88 to +90

Copilot AI Feb 24, 2026

GOLDEN_ARTIFACTS_DIR is exported unconditionally, so the operators suite will also generate golden inputs/outputs and .pte files even though the packaging job only collects *-models artifacts. This will increase artifact size and I/O for operators runs; consider only setting this env var (or only zipping) when SUITE=models (or when a separate opt-in flag is set).
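
As a rough illustration, the gate could also live next to the resolver in conftest.py instead of in the shell script. This is only a sketch: the `suite` parameter and the `GOLDEN_ARTIFACT_SUITES` set are hypothetical, and the resolver in this PR reads only `GOLDEN_ARTIFACTS_DIR`.

```python
import os
from typing import Mapping, Optional

# Hypothetical: suites that are allowed to emit golden artifacts.
GOLDEN_ARTIFACT_SUITES = frozenset({"models"})


def resolve_artifact_dir(
    flow_name: str,
    suite: str,
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the per-flow golden directory, or None when dumping is disabled."""
    base = environ.get("GOLDEN_ARTIFACTS_DIR")
    # Dumping stays off unless the env var is set AND the suite opted in.
    if not base or suite not in GOLDEN_ARTIFACT_SUITES:
        return None
    return os.path.join(base, flow_name)
```

With this shape the export in the script could stay unconditional, since operators runs would resolve to `None` and skip the dump.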

EXIT_CODE=0
${CONDA_RUN_CMD} pytest -c /dev/nul -n auto backends/test/suite/$SUITE/ -m flow_$FLOW --json-report --json-report-file="$REPORT_FILE" || EXIT_CODE=$?

Copilot AI Feb 24, 2026

pytest -c /dev/nul looks like a typo: /dev/nul typically doesn’t exist and will cause pytest to fail to start when trying to load the config file. This should likely be /dev/null (commonly used to ignore repo pytest.ini).

Suggested change
- ${CONDA_RUN_CMD} pytest -c /dev/nul -n auto backends/test/suite/$SUITE/ -m flow_$FLOW --json-report --json-report-file="$REPORT_FILE" || EXIT_CODE=$?
+ ${CONDA_RUN_CMD} pytest -c /dev/null -n auto backends/test/suite/$SUITE/ -m flow_$FLOW --json-report --json-report-file="$REPORT_FILE" || EXIT_CODE=$?

# Generate markdown summary.
${CONDA_RUN_CMD} python -m executorch.backends.test.suite.generate_markdown_summary_json "$REPORT_FILE" > ${GITHUB_STEP_SUMMARY:-"step_summary.md"} --exit-code $EXIT_CODE

# Package golden artifacts into per-model zips for downstream consumers.
if [[ -d "${GOLDEN_DIR}/${FLOW}" ]]; then
  pushd "${GOLDEN_DIR}/${FLOW}"
  # Group files by model name prefix and zip each model's artifacts.
  for pte in *.pte; do
    [[ -f "$pte" ]] || continue
    model_name="${pte%.pte}"
    zip -j "${GOLDEN_DIR}/${model_name}_golden.zip" \
      "${model_name}.pte" \
      ${model_name}_input*.bin \
      ${model_name}_expected_output*.bin \
      2>/dev/null || true

Copilot AI Feb 24, 2026

Per-model zips are written to ${GOLDEN_DIR}/${model_name}_golden.zip (outside the per-flow directory). In the workflow matrix, multiple flows can produce the same model_name, which will silently overwrite zips from earlier flows. Include $FLOW in the zip filename or keep the per-model zips under ${GOLDEN_DIR}/${FLOW}/ to avoid collisions.
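
For illustration, here is the first option (flow in the filename) sketched in Python rather than shell. The helper name `zip_model_artifacts` and the `<flow>_<model>_golden.zip` naming are assumptions made for this example, not part of the PR.

```python
import zipfile
from pathlib import Path


def zip_model_artifacts(flow_dir: Path, out_dir: Path, flow: str) -> list[Path]:
    """Write one zip per .pte in flow_dir, named <flow>_<model>_golden.zip."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pte in sorted(flow_dir.glob("*.pte")):
        model = pte.stem
        members = [pte]
        members += sorted(flow_dir.glob(f"{model}_input*.bin"))
        members += sorted(flow_dir.glob(f"{model}_expected_output*.bin"))
        # The flow is part of the name, so two flows cannot clobber each other.
        zip_path = out_dir / f"{flow}_{model}_golden.zip"
        with zipfile.ZipFile(zip_path, "w") as zf:
            for member in members:
                # arcname drops the directory part, mirroring `zip -j`.
                zf.write(member, arcname=member.name)
        written.append(zip_path)
    return written
```

Two flows that both export the same model name would then leave two distinct zips in the output directory.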

  done
  popd
fi
45 changes: 45 additions & 0 deletions .github/workflows/_test_backend.yml
@@ -59,6 +59,51 @@ jobs:

source .ci/scripts/test_backend.sh "${{ matrix.suite }}" "${{ matrix.flow }}" "${RUNNER_ARTIFACT_DIR}"

  package-golden-artifacts:
    if: ${{ inputs.run-linux }}
    needs: test-backend-linux
    runs-on: linux.2xlarge
    steps:
      - name: Download model test artifacts
        uses: actions/download-artifact@v4
        with:
          pattern: test-report-*-models

Copilot AI Feb 24, 2026

The pattern 'test-report-*-models' only downloads artifacts from the 'models' suite, but not from the 'operators' suite. According to the test-backend-linux job matrix, both 'models' and 'operators' suites are run (line 47), and both could potentially generate golden artifacts. If golden artifacts are also expected from operator tests, this pattern should be 'test-report-*' to include both suites, or the pattern should explicitly include operators as well.

Suggested change
-           pattern: test-report-*-models
+           pattern: test-report-*
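
To make the difference between the two patterns concrete, here is a small sketch using Python's `fnmatch`. The artifact names are hypothetical, assuming a `test-report-<flow>-<suite>` naming scheme, and `fnmatch` only approximates the glob matcher that `actions/download-artifact` uses.

```python
from fnmatch import fnmatchcase

# Hypothetical artifact names following a test-report-<flow>-<suite> scheme.
artifacts = [
    "test-report-xnnpack-models",
    "test-report-xnnpack-operators",
    "test-report-xnnpack_static_int8-models",
]


def select(names, pattern):
    """Return the names matched by a shell-style glob pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


# The current pattern keeps only the models suite.
models_only = select(artifacts, "test-report-*-models")
# The suggested pattern also pulls in the operators suite.
everything = select(artifacts, "test-report-*")
```

Under these assumed names, `models_only` leaves out the operators report while `everything` returns all three.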

          path: downloaded/

      - name: Package golden artifacts
        run: |
          set -eux
          TIMESTAMP=$(date -u +%y%m%d%H)
          mkdir -p golden_combined

          find downloaded/ \( -name '*.pte' -o -name '*_input*.bin' -o -name '*_expected_output*.bin' \) \
            -exec cp {} golden_combined/ \;

          if ls golden_combined/*.pte 1>/dev/null 2>&1; then
            (cd golden_combined && zip -r "../golden_artifacts_${TIMESTAMP}.zip" .)

Copilot AI Feb 25, 2026

The PR description mentions "These artifacts are packaged into per-model zips and a combined golden_artifacts_yymmddhh.zip", but the implementation only creates a combined zip file (line 92). There are no per-model zips being created. Either update the PR description to match the implementation, or add the per-model zip creation step if it was intended.

            echo "Created golden_artifacts_${TIMESTAMP}.zip with $(ls golden_combined/*.pte | wc -l) models."

Copilot AI Feb 24, 2026

The packaging step flattens all .pte/.bin files from downloaded/ into a single golden_combined/ directory via cp. Since artifacts are produced per-flow, identical filenames across flows (same model/test name) will overwrite each other and the combined zip will silently drop files. Preserve directory structure (e.g. copy with --parents or zip from the original tree) or prefix filenames with flow/suite to keep them unique.

Suggested change
-   -exec cp {} golden_combined/ \;
- if ls golden_combined/*.pte 1>/dev/null 2>&1; then
-   (cd golden_combined && zip -r "../golden_artifacts_${TIMESTAMP}.zip" .)
-   echo "Created golden_artifacts_${TIMESTAMP}.zip with $(ls golden_combined/*.pte | wc -l) models."
+   -exec cp --parents {} golden_combined/ \;
+ if find golden_combined -name '*.pte' -print -quit | grep -q .; then
+   (cd golden_combined && zip -r "../golden_artifacts_${TIMESTAMP}.zip" .)
+   echo "Created golden_artifacts_${TIMESTAMP}.zip with $(find golden_combined -name '*.pte' | wc -l) models."
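
The same idea, keeping each file's path relative to the download root so identical filenames from different flows cannot collide, can be sketched in Python. The helper name `collect_golden` is hypothetical; the three patterns are the ones the `find` command in this step already uses.

```python
import shutil
from fnmatch import fnmatchcase
from pathlib import Path

# Same filename patterns as the find invocation in the packaging step.
PATTERNS = ("*.pte", "*_input*.bin", "*_expected_output*.bin")


def collect_golden(src_root: Path, dst_root: Path) -> int:
    """Copy golden files from src_root into dst_root, keeping relative paths."""
    copied = 0
    for src in sorted(src_root.rglob("*")):
        if not src.is_file():
            continue
        if not any(fnmatchcase(src.name, pattern) for pattern in PATTERNS):
            continue
        # Re-create the source subdirectory (for example the per-flow folder).
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied += 1
    return copied
```

The sketch assumes `dst_root` is not nested inside `src_root`; otherwise the walk could pick up its own copies.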

          else
            echo "No golden artifacts found."
          fi

      - name: Upload combined golden artifacts
        uses: actions/upload-artifact@v4
        with:
          name: golden-artifacts-${{ inputs.backend }}
          path: golden_artifacts_*.zip
          if-no-files-found: ignore

      - name: Upload golden artifacts to S3
        uses: seemethere/upload-artifact-s3@v5
        if: ${{ hashFiles('golden_artifacts_*.zip') != '' }}

Copilot AI Feb 24, 2026

The condition checks for the existence of golden_artifacts_*.zip files to determine whether to upload to S3, but this check happens in the step itself (line 98). If for some reason the file doesn't exist at that point, the step will be skipped silently. However, the step name suggests it should "Upload golden artifacts to S3" unconditionally if the package-golden-artifacts job succeeded. Consider whether the conditional should be on the job level (line 63) rather than the step level, or if the conditional logic needs adjustment to match the intended behavior.

Suggested change
if: ${{ hashFiles('golden_artifacts_*.zip') != '' }}

        with:
          s3-bucket: gha-artifacts
          s3-prefix: |
            ${{ github.repository }}/${{ github.run_id }}/artifact/golden-artifacts-${{ inputs.backend }}
          retention-days: 90
          if-no-files-found: ignore
          path: golden_artifacts_*.zip

  test-backend-macos:
    if: ${{ inputs.run-macos }}
    strategy:
3 changes: 3 additions & 0 deletions .github/workflows/test-backend-xnnpack.yml
@@ -12,6 +12,9 @@ on:
    paths:
      - .github/workflows/test-backend-xnnpack.yml
      - .github/workflows/_test_backend.yml
      - .ci/scripts/test_backend.sh
      - backends/test/harness/**
      - backends/test/suite/**
  workflow_dispatch:

concurrency:
42 changes: 42 additions & 0 deletions backends/test/harness/tester.py
@@ -3,6 +3,8 @@
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

import logging
import os
import random
from collections import Counter, OrderedDict
from typing import Any, Callable, Dict, List, Optional, Tuple
@@ -317,11 +319,14 @@ def run_method_and_compare_outputs(
        rtol=1e-03,
        qtol=0,
        statistics_callback: Callable[[ErrorStatistics], None] | None = None,
        artifact_dir: Optional[str] = None,
        artifact_name: Optional[str] = None,
    ):
        number_of_runs = 1 if inputs is not None else num_runs
        reference_stage = self.stages[StageType.EXPORT]

        stage = stage or self.cur
        artifacts_saved = False

        for _ in range(number_of_runs):
            inputs_to_run = inputs if inputs else next(self.generate_random_inputs())
@@ -346,8 +351,45 @@ def run_method_and_compare_outputs(
                statistics_callback,
            )

            if artifact_dir and artifact_name and not artifacts_saved:
                self._dump_golden_artifacts(
                    artifact_dir, artifact_name, inputs_to_run, reference_output
                )

Copilot AI Feb 24, 2026

Golden artifact dumping can raise and fail the test run: the call to _dump_golden_artifacts(...) isn’t wrapped, so any filesystem/serialization issue (permissions, full disk, unsupported dtype -> .numpy(), etc.) will turn an otherwise-successful correctness check into a test failure. Since artifacts are optional, catch exceptions around this call and log a warning (similar to the .pte dump logic in runner.py).

Suggested change
-                 self._dump_golden_artifacts(
-                     artifact_dir, artifact_name, inputs_to_run, reference_output
-                 )
+                 try:
+                     self._dump_golden_artifacts(
+                         artifact_dir, artifact_name, inputs_to_run, reference_output
+                     )
+                 except Exception as e:
+                     logger = logging.getLogger(__name__)
+                     logger.warning(
+                         "Failed to dump golden artifacts for '%s': %s",
+                         artifact_name,
+                         e,
+                     )

                artifacts_saved = True

        return self

    @staticmethod
    def _dump_golden_artifacts(
        artifact_dir: str,
        artifact_name: str,
        inputs: Tuple[torch.Tensor],
        reference_output,
    ):
        logger = logging.getLogger(__name__)
        os.makedirs(artifact_dir, exist_ok=True)

Copilot AI Feb 24, 2026

The artifact directory creation should be done earlier to catch errors during the actual test run rather than silently failing later. Currently, if os.makedirs fails, the exception is caught and logged as a warning, but the test continues. Since this is called after successful output comparison, there's a risk that test results could be marked as successful even though artifact generation failed. Consider whether artifact generation failures should be treated as test failures, or at minimum, ensure that the directory creation happens before the comparison so that filesystem issues are caught early.

        for i, inp in enumerate(inputs):
            if isinstance(inp, torch.Tensor):
                suffix = "" if len(inputs) == 1 else f"_{i}"
                path = os.path.join(artifact_dir, f"{artifact_name}_input{suffix}.bin")
                inp.contiguous().numpy().tofile(path)
                logger.info(f"Saved golden input to {path}")
Comment on lines +381 to +386

Copilot AI Feb 24, 2026

The loop only saves inputs that are torch.Tensor instances, silently skipping any non-tensor inputs. This could lead to incomplete golden artifact sets if models accept mixed tensor and non-tensor inputs (e.g., integers, floats, booleans). While this might be intentional for simplicity, it should be documented or a warning should be logged when non-tensor inputs are skipped, so that users are aware that the golden artifacts may not fully represent the test case.
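
A minimal sketch of the warning variant, kept free of torch so it runs standalone. The helper `split_dumpable` and its `is_dumpable` predicate are hypothetical; in tester.py the predicate would presumably be an `isinstance(value, torch.Tensor)` check.

```python
import logging
from typing import Any, Callable, Iterable, List, Tuple

logger = logging.getLogger(__name__)


def split_dumpable(
    values: Iterable[Any],
    is_dumpable: Callable[[Any], bool],
    what: str,
) -> List[Tuple[int, Any]]:
    """Return (index, value) pairs that can be dumped; warn about the rest."""
    kept = []
    for i, value in enumerate(values):
        if is_dumpable(value):
            kept.append((i, value))
        else:
            # Surface the gap instead of dropping the value silently.
            logger.warning(
                "Skipping non-tensor %s %d of type %s; golden artifacts will be incomplete",
                what,
                i,
                type(value).__name__,
            )
    return kept
```

Returning the original index alongside each value keeps the `_{i}` file suffixes stable even when some entries are skipped.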

        if isinstance(reference_output, torch.Tensor):
            reference_output = (reference_output,)
        elif isinstance(reference_output, OrderedDict):
            reference_output = tuple(reference_output.values())

Copilot AI Feb 24, 2026

The function does not handle the case where reference_output is already a tuple. According to the existing _compare_outputs method (lines 474-477), the code handles torch.Tensor and OrderedDict, but if reference_output is already a tuple (which is a valid case), it will not be normalized. This could lead to issues if the tuple contains non-tensor elements or needs further processing. Consider adding a check for tuple type or ensuring all possible output types are handled consistently.

Suggested change
-             reference_output = tuple(reference_output.values())
+             reference_output = tuple(reference_output.values())
+         elif isinstance(reference_output, (list, tuple)):
+             reference_output = tuple(reference_output)
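
The normalization could also be pulled into a small helper so every output shape is handled in one place. This is a sketch under assumptions: `normalize_outputs` is a hypothetical name, and unlike the code in this PR it treats any value that is not a dict, list, or tuple as a single output instead of checking for `torch.Tensor` explicitly.

```python
from collections import OrderedDict
from typing import Any, Tuple


def normalize_outputs(reference_output: Any) -> Tuple[Any, ...]:
    """Coerce a model's reference output into a flat tuple of values."""
    if isinstance(reference_output, OrderedDict):
        return tuple(reference_output.values())
    if isinstance(reference_output, (list, tuple)):
        return tuple(reference_output)
    # Anything else (for example a single tensor) counts as one output.
    return (reference_output,)
```

The dump loop can then iterate over the returned tuple without caring how the model packaged its outputs.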

        for i, out in enumerate(reference_output):
            if isinstance(out, torch.Tensor):
                suffix = "" if len(reference_output) == 1 else f"_{i}"
                path = os.path.join(
                    artifact_dir, f"{artifact_name}_expected_output{suffix}.bin"
                )
                out.contiguous().numpy().tofile(path)
                logger.info(f"Saved golden output to {path}")
Comment on lines +393 to +400

Copilot AI Feb 24, 2026

Similar to the input handling, the loop only saves outputs that are torch.Tensor instances. If reference_output contains non-tensor elements after being converted to a tuple, those elements will be silently skipped. This could result in incomplete output files. Consider logging a warning when non-tensor outputs are encountered and skipped.

    @staticmethod
    def _assert_outputs_equal(
        model_output,
9 changes: 9 additions & 0 deletions backends/test/suite/conftest.py
@@ -1,3 +1,4 @@
import os
from typing import Any

import pytest
@@ -32,6 +33,13 @@ def __init__(self, flow, test_name, test_base_name):
        self._test_base_name = test_base_name
        self._subtest = 0
        self._results = []
        self._artifact_dir = self._resolve_artifact_dir()

    def _resolve_artifact_dir(self) -> str | None:
        base = os.environ.get("GOLDEN_ARTIFACTS_DIR")
        if not base:
            return None
        return os.path.join(base, self._flow.name)

    def lower_and_run_model(
        self,
@@ -50,6 +58,7 @@ def lower_and_run_model(
            None,
            generate_random_test_inputs=generate_random_test_inputs,
            dynamic_shapes=dynamic_shapes,
            artifact_dir=self._artifact_dir,
        )

        self._subtest += 1
22 changes: 22 additions & 0 deletions backends/test/suite/runner.py
@@ -1,6 +1,8 @@
import argparse
import hashlib
import importlib
import logging
import os
import random
import re
import time
@@ -92,6 +94,7 @@ def run_test( # noqa: C901
    params: dict | None,
    dynamic_shapes: Any | None = None,
    generate_random_test_inputs: bool = True,
    artifact_dir: str | None = None,
) -> TestCaseSummary:
    """
    Top-level test run function for a model, input set, and tester. Handles test execution
@@ -201,6 +204,11 @@ def build_result(
            # We can do this if we ever see to_executorch() or serialize() fail due a backend issue.
            return build_result(TestResult.UNKNOWN_FAIL, e)

        # Derive a clean model name for golden artifacts (e.g. "test_mobilenet_v3_small" -> "mobilenet_v3_small").
        artifact_name = None
        if artifact_dir:
            artifact_name = test_base_name.removeprefix("test_")

        # TODO We should consider refactoring the tester slightly to return more signal on
        # the cause of a failure in run_method_and_compare_outputs. We can look for
        # AssertionErrors to catch output mismatches, but this might catch more than that.
@@ -210,11 +218,25 @@ def build_result(
                statistics_callback=lambda stats: error_statistics.append(stats),
                atol=1e-1,
                rtol=4e-2,
                artifact_dir=artifact_dir,
                artifact_name=artifact_name,
            )
        except AssertionError as e:
            return build_result(TestResult.OUTPUT_MISMATCH_FAIL, e)
        except Exception as e:
            return build_result(TestResult.PTE_RUN_FAIL, e)

        # Dump .pte after successful comparison.
        if artifact_dir and artifact_name and flow.supports_serialize:
            logger = logging.getLogger(__name__)
            try:
                pte_path = os.path.join(artifact_dir, f"{artifact_name}.pte")
                tester.stages[StageType.SERIALIZE].dump_artifact(pte_path)
                logger.info(f"Saved golden .pte to {pte_path}")
            except Exception:
                logger.warning(
                    f"Failed to save .pte for {artifact_name}", exc_info=True
                )
    else:
        # Skip the test if nothing is delegated
        return build_result(TestResult.SUCCESS_UNDELEGATED)