Merge submission v6.0 into master #1146

Open
anhappdev wants to merge 35 commits into master from submission-v6.0

Collaborator

@anhappdev anhappdev commented May 29, 2026

  • Merge fast forward instead of squash to preserve commit history.

mohitmundhragithub and others added 30 commits November 4, 2025 14:28
* WIP LLM pipeline and dataset implementation

* fixed issues preventing libraries from compiling, runtime errors not included

* upgrade TensorFlow to 2.18.0

* upgraded llm pipeline to use TFLite C++ api + small bug fixes

* basic flutter app support for icon and dataset

* added linux x86_64 config for internal testing

* updated bazel config to use SSE/MMX instructions

* fixed incorrect answer format and compression

* got pipeline and dataset to produce proper results + fixed issues where pipeline cannot handle an input size larger than the max prefill size

* added support for loadgen's token based performance measurement + implemented performance benchmark for LLM pipeline

* fixed bugs in inference process, first token function now handles only input and issue_query only handles output tokens

* optimized tensor retrieval for inference + added check for input size vs KV cache size

* clang-format

* mmlu dataset cleanup and formatting

* slight code cleanup

* fixed issue with genai ops import

* code/config cleanup

* add zero-shot option to MMLU constructor

* use function to detect which token is answer letter

* quick initial implementation of first token callback

* moved tokenizer to dataset side (possibly needs cleanup)

* added files needed for MMLU utils

* clang-format

* continued formatting

* code cleanup / issue_query signature update to vendor backends

* signature update for QTI/Samsung backends

* format

* formatted clang and bazel using docker based formatter

* reverted issue_query change for samsung + bazel formatting

* fix for MSVC C7555 error

* rough IFEval implementation using llm_instruction benchmark

* disabled XNNPACK AVX-VNNI for windows due to C2440 error

* moved accuracy calculation away from ProcessOutput, ifeval accuracy is calculated per instruction not per sample

* fixed issue with app not finding model/tokenizer

* properly format 0-shot prompts + allow for file/directory for model path

* formatting

* potential fix for windows C2440

* fix for aligned free for windows

* potential fix for IOS / windows CI issues

* ifeval check cleanup and bugfixes

* formatting

* all possible configs for removing eigen exceptions

* removed objc opts

* use token latencies in app

* enable exceptions for IOS

* disable FP16 AVX for x86 simulator

* attempt to enable exceptions for eigen

* 2nd attempt at enabling exceptions for IOS eigen

* fixed fexceptions syntax

* kitchen-sink approach to enable exceptions for IOS

* attempt to undefine EIGEN_EXCEPTIONS for IOS

* add global override to disable eigen exceptions

* use ARM based macos for IOS build

* fixed and re-enabled eigen patch

* further fix for eigen patch

* even more patch fixing

* fixed typo in eigen patch

* fixed incorrect count in eigen patch

* force arm64 ios build

* use ARM64 simulator for IOS build

* use arm64 simulator for tflite on IOS

* set ios cpu argument for cpuinfo

* removed ios_sim prefix

* attempt at using arm64 simulator for IOS instead of x86

* attempt to force flutter to build ITs for arm64 only

* force arm64 for pods

* disable f16 instead of building for arm64

* more bazel config lines to disable fp16

* removed unavailable compiler flags

* provide patched fp16 lib with math workaround

* typo

* added patch arg

* created a math workaround patch compatible with fp16 version used by xnnpack

* datasets now provide token limits as inputs to pipeline

---------

Co-authored-by: Farook Al-Sammarraie <farook.a@scopicsoftware.com>
use the entire dataset for performance
update loadgen to 5.1.1
fixed android build failing on macos
* added powershell specific command for loadgen bazel genrule

* undo cmd_bash for macos support

* escaped $ characters for bazel
* added input token limit to pipeline backend config

* changed input limit to be part of the dataset

* formatting
* Added TPS and TTFT to results screen (and results log)

* fixed potentially null values

* fixed unit tests

* fixed unused import

* added units for result screen

* formatting

* fixed UI bugs, added missing tooltips and values
* added json and language validators for IFEval

* formatting
use config to dictate thread count
* fixed ifeval evaluation implementation bugs

* further improved sentence counter

* overhauled keyword evaluation by using stemming and a plural word map

* formatting

* use stemming library as external dependency

* format
* truncate larger than max input for MMLU instead of ignoring

* fix bug in vector::erase

* format

* differentiate between shots when truncating + connect IFEval benchmark to app

* added new llm icons

* formatting

* added llama tflite model to cdn
This PR adds a new element (BenchmarkSet) which bundles together
benchmarks that are mostly similar but need to be run separately (i.e.
different models or datasets but same function).

Under the hood the benchmarks work exactly the same, no C++ logic has
been changed. The added configuration is only for the frontend.

The way it works is by bundling similar benchmarks under a set, and
having each benchmark be active only when all of the options it requires (one or more) are
active. For example, if we take LLM, let's say we have 3 models and 3
dataset implementations to test, `ModelA-DatasetB`, `ModelC-DatasetA`,
and so on, that'll be 9 benchmarks.
Benchmark `ModelA-DatasetC` will define 2 required options, `Model-A`
and `Dataset-C`, then the Benchmark Set will contain 6 options in 2
categories, Models (A,B,C) and DataSets (A,B,C).
If a user then enables Models A and C and Dataset A, the set will
automatically activate `ModelA-DatasetA` and `ModelC-DatasetA` and
disable all the others.
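
The activation rule described above can be sketched as follows. This is a minimal illustration in Python, not the app's actual frontend code: the `Benchmark` class and the `active_benchmarks` function are hypothetical names, and the sketch assumes a benchmark is active only when every option it requires is enabled, which is what the Models A and C / Dataset A example implies.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Benchmark:
    """Hypothetical stand-in for one benchmark inside a benchmark set."""
    name: str
    required_options: frozenset


def active_benchmarks(benchmarks, enabled_options):
    """Return the names of benchmarks whose required options are all enabled."""
    enabled = set(enabled_options)
    return [b.name for b in benchmarks if b.required_options <= enabled]


# 3 models x 3 datasets = 9 benchmarks, each requiring one option per category.
models = ["A", "B", "C"]
datasets = ["A", "B", "C"]
benchmarks = [
    Benchmark(f"Model{m}-Dataset{d}", frozenset({f"Model-{m}", f"Dataset-{d}"}))
    for m, d in product(models, datasets)
]

# The user enables Models A and C and Dataset A (3 of the 6 options).
enabled = {"Model-A", "Model-C", "Dataset-A"}
print(active_benchmarks(benchmarks, enabled))
# prints ['ModelA-DatasetA', 'ModelC-DatasetA']
```

The set exposes only the 6 options to the user; the 9 underlying benchmarks stay unchanged and are simply filtered by the enabled options.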

The benefit of this approach is that instead of having 9 benchmarks
that are basically the same, we'll have 1 set containing 6 options, while
the core benchmarking code never sees the sets or options.

This PR also applies the above described implementation to
`image_classification_v2`, combining the default and offline versions
into a set, and providing 2 options to enable and disable the
benchmarks. This is only a secondary improvement, since this system is
meant to tidy up the (at least) 4 benchmarks that LLM will add.

I've also included a video of the system in action:


https://github.com/user-attachments/assets/9c833086-60fc-4d6f-a5bd-bf1bb10cab0a

Closes #1082
This PR fixes an issue described in #1098 where accuracy benchmarks
would crash because `loadgen_info.dart` attempts to extract performance
data from loadgen's accuracy logs.

I'm honestly not sure how this code ever worked, but this fix should
work since performance data isn't used in accuracy benchmark anyway.
This PR addresses the points discussed in #1098 regarding datasets
(specifically MMLU).

It does the following:
- Prevent performance mode from running when query count is 0

- Set input/output token limits to 2048/1024 for IFEval and 2048/(4|128)
for MMLU.

- Update accuracy string format for MMLU and IFEval (from `Accuracy:
50%` to `50%` for an accuracy of `0.5`)
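
The accuracy string change in the last item can be illustrated with a small sketch. `format_accuracy` is a hypothetical helper written in Python for illustration only; the PR does not say how many decimal places the real implementation keeps, so the `g` format used here is an assumption.

```python
def format_accuracy(accuracy: float) -> str:
    """Format an accuracy in the range 0..1 as a bare percentage string.

    Hypothetical helper: the old format would have been "Accuracy: 50%",
    the new one drops the label and keeps only "50%".
    """
    return f"{accuracy * 100:g}%"


print(format_accuracy(0.5))
# prints 50%
```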
This PR removes the checksum values for LLM models, so that custom
models can be used in testing.
This PR makes it so that MMLU and IFEval datasets produce the raw output
from the model.
This PR allows options to be hidden and adds a results screen element for
benchmark sets.
This PR adds app and backend benchmark configs for 3B and 8B LLM models;
the model files are not uploaded yet.
This PR does a few things:

- Adds detailed IFEval accuracy log.
- Uses OTPS instead of TPS for throughput.
- Fixes minor UI issue where hidden options would still count towards
the total shown.

Closes #1121
This PR fixes the incorrect model filename for 3B and 8B benchmarks, and
disables all benchmark set options by default. It also fixes 2 bugs concerning offline and LLM benchmarks.
* Exynos 2600 v6.0 Submission

* Update format/linting

* Update model and lib paths for Samsung submission v6.0 (#57)

* Update Samsung lib to v6.0_20260505

* Update Samsung model paths to mlcommons-storage URLs

* Expand LLM zip placeholders into individual model files

* Update format/lint

* Add model_checksum values for exynos2600

* Add LLM model_checksum values for exynos2600

* Removing base_zip_dir value in 2600 pbtxt

* Withdraw llm benchmarks

---------

Co-authored-by: Ahmed Youssef <ahmed.e@samsung.com>

---------

Co-authored-by: Ahmed Youssef <ahmed.e@samsung.com>
Co-authored-by: Anh <anh.app.dev@gmail.com>
* Update Mediatek backend for mt6993

* Format C/C++/Proto code with clang-format (Google style)

* Update MTK mt6993 model URLs and checksums (#60)

Replace local:// paths with v6_0 mediatek storage URLs and update
model_checksum to the actual MD5 values for all six MTK DLA models.

---------

Co-authored-by: Anh <anh.app.dev@gmail.com>
* QTI submission for V6.0 submission

co-author: Utkarsh Mishra <utkarshm@qti.qualcomm.com>
co-author: Mohit Mundhra <mmundhra@qti.qualcomm.com>
co-author: Aswin B <aswib@qti.qualcomm.com>

* Adding license headers

* Removing 8cxg3

* Fixing the issue with submission mode

* Format and linting

* fixing markdown

* fix the output token for LLM (instruct) (#61)

* Update model and lib paths for Qualcomm submission v6.0 (#58)

* Update QTI SDK to v2.45.0.260326 from v6.0 lib bucket

* Restrict workflow push triggers to submission-v<N>.<N> branches

* Revert "Restrict workflow push triggers to submission-v<N>.<N> branches"

This reverts commit 61670880dae023cd963fd7df0952afe14072cbec.

* Update QTI model URLs to v6_0 and refresh model_checksum values

* Expand LLM model_file entries to per-file v6_0 URLs with checksums

* Update Makefile to use the correct output folder (#62)

* Update Makefile to use the correct output folder

* Fix SNPE_MOBILEDET_DIR

---------

Co-authored-by: anhappdev <anh.app.dev@gmail.com>

* Update DLC

* Update android-build-test-linux.yml (#64)

Include building LLM Models

* Add llama3_1b.spm.model tokenizer download to QTI Genie LLM settings

The tokenizer_filename custom_setting referenced llama3_1b.spm.model but
no model_file entry downloaded it, so backend_model_path/llama3_1b.spm.model
would be missing at runtime for MMLU/IFEval datasets. Add the same tokenizer
model_file used by the TFLite backend to both llm-8b and llm-8b-instruct
delegate_choice blocks.

* Add tokenizer.json download to QTI Genie LLM settings

Per review feedback, Genie requires the Llama 3.1 tokenizer.json alongside
the NPU model shards. Add the file (hosted at v6_0/qualcomm/llama3_npu/)
to both llm-8b and llm-8b-instruct delegate_choice blocks.

---------

Co-authored-by: Mohit Mundhra <mmundhra@qti.qualcomm.com>

* Update the batch size of offline case

---------

Co-authored-by: Utkarsh Mishra <utkarshm@qti.qualcomm.com>
Co-authored-by: Anh <anh.app.dev@gmail.com>
Co-authored-by: anhappdev <anh.app.dev@gmail.com>
farook-edev and others added 2 commits May 27, 2026 05:38
submission branch is protected, so I can't merge my UI changes without a
PR @freedomtan @Mostelk @anhappdev

---------

Co-authored-by: Anh <anh.app.dev@gmail.com>
mention MTK Dimensity 9500 in the doc
@anhappdev anhappdev requested a review from a team as a code owner May 29, 2026 08:07

github-actions Bot commented May 29, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

…hmark for M32 device (#1143)

The empty model_checksum entries (introduced in #1113 to ease testing
with custom models) caused the integration test validateSettings() to
fail for every benchmark, since it iterates allBenchmarks regardless of
which benchmark is under test.

Checksums computed from the published model files:
- llama_q8_ekv3072.tflite    -> c618e9bbb89ce52eedcab4f61b2dc3a4
- llama_3b_q8_ekv3072.tflite -> 6bde54c22e2be27bdad5303a0bd9cf72
- llama_8b_q8_ekv3072.tflite -> ec19cf40f881684f15db0b862883ff00
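
For reference, a checksum like the ones listed above can be recomputed locally with a short script. This is a sketch under assumptions: it assumes the values are MD5 digests (the 32-character hex format matches, and the MediaTek commit in this PR explicitly mentions MD5), and `md5_of_file` and `checksum_matches` are hypothetical helper names, not functions from the app.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected):
    """Return True only if `expected` is non-empty and equals the file's MD5.

    Treating an empty checksum as a mismatch is an assumption that mirrors
    the validation failure described in this commit message.
    """
    return bool(expected) and md5_of_file(path) == expected.lower()
```

With a downloaded model at hand, `checksum_matches("llama_q8_ekv3072.tflite", "c618e9bbb89ce52eedcab4f61b2dc3a4")` would report whether the local file matches the published value.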

Disable the LLM benchmark for the Samsung Galaxy M32 used in CI tests as
it's not supported.
@anhappdev anhappdev marked this pull request as draft May 29, 2026 11:30
@anhappdev anhappdev closed this May 30, 2026
@github-actions github-actions Bot locked and limited conversation to collaborators May 30, 2026
**Conflict resolution:**
- `flutter/pubspec.yaml`: kept submission's `6.0.0+1` over master's
`5.0.5+1`.
- `WORKSPACE`: kept submission's TF 2.18.0 patch list. Dropped master's
`ndk_25_r14.diff` (TF-2.14-only; removed on this branch during the TF
2.18 upgrade), but kept `xcode26_compat.patch` — it only patches
`tensorflow/lite/kernels/elementwise.cc`, whose `std::abs<T>` usage is
identical in TF 2.18, so it applies cleanly and fixes the Clang 21 `no
matching function for call to 'EvalImpl'` error.
@anhappdev anhappdev reopened this May 30, 2026
@anhappdev anhappdev closed this May 30, 2026
@anhappdev anhappdev reopened this May 30, 2026
Resolves conflicts in WORKSPACE and flutter/pubspec.yaml:
- pubspec.yaml: keep v6.0.0+1 (supersedes master's 5.0.5+1).
- WORKSPACE: keep TF 2.18.0 patch list from v6.0. Master's
  ndk_25_r14.diff reference is dropped because that patch file
  no longer exists on v6.0 (removed during the TF 2.14 -> 2.18
  upgrade) and the xcode26_compat.patch is already applied.
@anhappdev anhappdev marked this pull request as ready for review May 30, 2026 14:16