120 changes: 120 additions & 0 deletions .github/workflows/partition-benchmark.yaml
@@ -0,0 +1,120 @@
name: Partition Benchmark

# Runs on every PR targeting main to detect runtime regressions.
# Can also be triggered manually to establish or inspect a new baseline.
on:
  pull_request:
    branches: [main]
  workflow_dispatch:

permissions:
  contents: read

env:
  NLTK_DATA: ${{ github.workspace }}/nltk_data
  PYTHON_VERSION: "3.12"
  # Number of times to run the full benchmark suite.
  NUM_ITERATIONS: "3"
  # Regression threshold as a fraction (0.20 = 20%). Initial value; tune later.
  REGRESSION_THRESHOLD: "0.20"
  # Increment to change the cache key when benchmark-affecting dependencies are updated, so runs start from a clean slate.
  CACHE_VERSION: "v2"
  # S3 location for metrics; matches the core-product convention.
  S3_METRICS_BUCKET_KEY: utic-metrics/ci-metrics
  S3_BENCHMARK_PATH: open-source/partition-benchmark/benchmark_best.json

jobs:
  setup:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/base-cache
        with:
          python-version: ${{ env.PYTHON_VERSION }}

  benchmark:
    name: Measure and compare partition() runtime
    runs-on: ubuntu-latest
    needs: [setup]

    steps:
      - uses: actions/checkout@v4

      - uses: ./.github/actions/base-cache
        with:
          python-version: ${{ env.PYTHON_VERSION }}

      - name: Install system dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y libmagic-dev poppler-utils libreoffice
          sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
          sudo apt-get update
          sudo apt-get install -y tesseract-ocr tesseract-ocr-kor

      - name: Restore HuggingFace model cache
        uses: actions/cache/restore@v4
        with:
          path: ~/.cache/huggingface
          key: hf-models-${{ runner.os }}-${{ env.CACHE_VERSION }}-${{ github.sha }}
          restore-keys: |
            hf-models-${{ runner.os }}-${{ env.CACHE_VERSION }}-
            hf-models-${{ runner.os }}-

      - name: Run partition benchmark
        env:
          NUM_ITERATIONS: ${{ env.NUM_ITERATIONS }}
        run: |
          uv run --no-sync python scripts/performance/benchmark_partition.py \
            benchmark_results.json

      - name: Save HuggingFace model cache
        uses: actions/cache/save@v4
        with:
          path: ~/.cache/huggingface
          key: hf-models-${{ runner.os }}-${{ env.CACHE_VERSION }}-${{ github.sha }}

      - name: Download previous best from S3
        continue-on-error: true
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.S3_EVAL_ACCESS_KEY }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.S3_EVAL_SECRET_KEY }}
        run: |
          aws s3 cp \
            "s3://${{ env.S3_METRICS_BUCKET_KEY }}/${{ env.S3_BENCHMARK_PATH }}" \
            benchmark_best.json

      - name: Compare results against stored best
        id: compare
        run: |
          uv run --no-sync python scripts/performance/compare_benchmark.py \
            benchmark_results.json \
            benchmark_best.json \
            ${{ env.REGRESSION_THRESHOLD }}

      - name: Upload best result to S3
        continue-on-error: true
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.S3_EVAL_ACCESS_KEY }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.S3_EVAL_SECRET_KEY }}
        run: |
          aws s3 cp \
            benchmark_best.json \
            "s3://${{ env.S3_METRICS_BUCKET_KEY }}/${{ env.S3_BENCHMARK_PATH }}"

      - name: Upload benchmark artifacts
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results-${{ github.sha }}
          path: |
            benchmark_results.json
            benchmark_best.json
          retention-days: 30
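
Reviewer note: `scripts/performance/compare_benchmark.py` is not part of this diff, so the following is only a minimal sketch of what a threshold comparison along the lines of the "Compare results against stored best" step could look like. The JSON layout (a flat mapping of document name to seconds) and the helper names `load_timings`, `find_regressions`, and `merge_best` are assumptions made for illustration, not the script's actual interface.

```python
import json
from pathlib import Path


def load_timings(path: str) -> dict[str, float]:
    """Load {name: seconds} from a JSON file; return {} if the file is missing.

    A missing file is treated as "no baseline yet", which mirrors the
    continue-on-error S3 download step in the workflow (assumed behavior).
    """
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}


def find_regressions(
    current: dict[str, float], best: dict[str, float], threshold: float
) -> dict[str, float]:
    """Return {name: fractional slowdown} for entries slower than the
    stored best by more than `threshold` (e.g. 0.20 for 20%)."""
    regressions: dict[str, float] = {}
    for name, seconds in current.items():
        baseline = best.get(name)
        if baseline is None or baseline <= 0:
            continue  # no usable baseline for this entry
        slowdown = (seconds - baseline) / baseline
        if slowdown > threshold:
            regressions[name] = slowdown
    return regressions


def merge_best(current: dict[str, float], best: dict[str, float]) -> dict[str, float]:
    """Keep the fastest time seen so far for each entry."""
    merged = dict(best)
    for name, seconds in current.items():
        merged[name] = min(seconds, merged.get(name, seconds))
    return merged
```

In this sketch a run would fail the job when `find_regressions(...)` is non-empty and otherwise write `merge_best(...)` back to `benchmark_best.json` for the upload step. Whether the real script works this way is not confirmed by the diff.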
13 changes: 12 additions & 1 deletion .gitignore
@@ -209,4 +209,15 @@ annotated/
pcaps
python-output

.vs/
.vs/
# Partition benchmark generated output
benchmark_results.json
scripts/performance/partition-speed-test/benchmark_results.json
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,6 @@
## 0.21.4
- Add a GitHub Actions workflow that detects regressions in `partition()` runtime performance

## 0.21.3

### Enhancements
@@ -17,6 +20,15 @@
### Fixes
- **Replace NLTK with spaCy to remediate CVE-2025-14009**: NLTK's downloader uses `zipfile.extractall()` without path validation, enabling RCE via malicious packages (CVSS 10.0, no patch available). spaCy models install as pip packages, eliminating the vulnerable downloader entirely.

## 0.20.9
- Add an action to test the partition speed for the `hi_res` and `fast` strategies

## 0.20.8

### Fixes