Merge submission v6.0 into master by anhappdev · Pull Request #1146 · mlcommons/mobile_app_open

anhappdev · 2026-05-29T08:07:24Z

Merge fast forward instead of squash to preserve commit history.

* WIP LLM pipeline and dataset implementation
* fixed issues preventing libraries from compiling, runtime errors not included
* upgrade TensorFlow to 2.18.0
* upgraded llm pipeline to use TFLite C++ api + small bug fixes
* basic flutter app support for icon and dataset
* added linux x86_64 config for internal testing
* updated bazel config to use SSE/MMX instructions
* fixed incorrect answer format and compression
* got pipeline and dataset to produce proper results + fixed issues where pipeline cannot handle an input size larger than the max prefill size
* added support for loadgen's token based performance measurement + implemented performance benchmark for LLM pipeline
* fixed bugs in inference process, first token function now handles only input and issue_query only handles output tokens
* optimized tensor retrieval for inference + added check for input size vs KV cache size
* clang-format
* mmlu dataset cleanup and formatting
* slight code cleanup
* fixed issue with genai ops import
* code/config cleanup
* add zero-shot option to MMLU constructor
* use function to detect which token is answer letter
* quick initial implementation of first token callback
* moved tokenizer to dataset side (possibly needs cleanup)
* added files needed for MMLU utils
* clang-format
* continued formatting
* code cleanup / issue_query signature update to vendor backends
* signature update for QTI/Samsung backends
* format
* formatted clang and bazel using docker based formatter
* reverted issue_query change for samsung + bazel formatting
* fix for MSVC C7555 error
* rough IFEval implementation using llm_instruction benchmark
* disabled XNNPACK AVX-VNNI for windows due to C2440 error
* moved accuracy calculation away from ProcessOutput, ifeval accuracy is calculated per instruction not per sample
* fixed issue with app not finding model/tokenizer
* properly format 0-shot prompts + allow for file/directory for model path
* formatting
* potential fix for windows C2440
* fix for aligned free for windows
* potential fix for IOS / windows CI issues
* ifeval check cleanup and bugfixes
* formatting
* all possible configs for removing eigen exceptions
* removed objc opts
* use token latencies in app
* enable exceptions for IOS
* disable FP16 AVX for x86 simulator
* attempt to enable exceptions for eigen
* 2nd attempt at enabling exceptions for IOS eigen
* fixed fexceptions syntax
* kitchen-sink approach to enable exceptions for IOS
* attempt to undefine EIGEN_EXCEPTIONS for IOS
* add global override to disable eigen exceptions
* use ARM based macos for IOS build
* fixed and re-enabled eigen patch
* further fix for eigen patch
* even more patch fixing
* fixed typo in eigen patch
* fixed incorrect count in eigen patch
* force arm64 ios build
* use ARM64 simulator for IOS build
* use arm64 simulator for tflite on IOS
* set ios cpu argument for cpuinfo
* removed ios_sim prefix
* attempt at using arm64 simulator for IOS instead of x86
* attempt to force flutter to build ITs for arm64 only
* force arm64 for pods
* disable f16 instead of building for arm64
* more bazel config lines to disable fp16
* removed unavailable compiler flags
* provide patched fp16 lib with math workaround
* typo
* added patch arg
* created a math workaround patch compatible with fp16 version used by xnnpack
* datasets now provide token limits as inputs to pipeline

---------

Co-authored-by: Farook Al-Sammarraie <farook.a@scopicsoftware.com>


This PR adds a new element (BenchmarkSet) that bundles benchmarks which are mostly similar but need to be run separately (e.g. different models or datasets but the same function). Under the hood the benchmarks work exactly the same; no C++ logic has been changed. The added configuration is only for the frontend.

It works by bundling similar benchmarks under a set, with each benchmark active only when the option or options it requires are active. Take LLM as an example: with 3 models and 3 dataset implementations to test (`ModelA-DatasetB`, `ModelC-DatasetA`, and so on), that makes 9 benchmarks. Benchmark `ModelA-DatasetC` defines 2 required options, `Model-A` and `Dataset-C`, and the Benchmark Set contains 6 options in 2 categories, Models (A, B, C) and Datasets (A, B, C). If a user then enables Models A and C and Dataset A, the set automatically activates `ModelA-DatasetA` and `ModelC-DatasetA` and disables all the others. The benefit of this approach is that instead of having 9 benchmarks that are basically the same, we have 1 set containing 6 options, while the core benchmarking does not see the sets or options.

This PR also applies the implementation described above to `image_classification_v2`, combining the default and offline versions into a set and providing 2 options to enable and disable the benchmarks. This is only a secondary improvement, since the system is meant to tidy up the (at least) 4 benchmarks that LLM will add.

I've also included a video of the system in action: https://github.com/user-attachments/assets/9c833086-60fc-4d6f-a5bd-bf1bb10cab0a

Closes #1082
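The option-to-benchmark activation rule in the BenchmarkSet description can be sketched roughly as follows. This is an illustrative Python model only, not the app's frontend code: the class names `Benchmark` and `BenchmarkSet`, the field names, and the helper `build_llm_set` are all hypothetical, and the rule "a benchmark is active when every option it requires is enabled" is inferred from the worked example in the description.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Benchmark:
    """A benchmark plus the option ids it requires (hypothetical structure)."""
    benchmark_id: str
    required_options: frozenset


@dataclass
class BenchmarkSet:
    """Bundles similar benchmarks together with the options that toggle them."""
    benchmarks: list
    enabled_options: set = field(default_factory=set)

    def active_benchmarks(self):
        # Assumption: a benchmark is active only when all of its
        # required options are currently enabled.
        return [
            b.benchmark_id
            for b in self.benchmarks
            if b.required_options <= self.enabled_options
        ]


def build_llm_set(models, datasets):
    """Create one benchmark per model/dataset pair, e.g. ModelA-DatasetB."""
    benchmarks = [
        Benchmark(f"Model{m}-Dataset{d}", frozenset({f"Model-{m}", f"Dataset-{d}"}))
        for m in models
        for d in datasets
    ]
    return BenchmarkSet(benchmarks)


# 3 models x 3 datasets = 9 benchmarks, controlled by 6 options.
llm_set = build_llm_set("ABC", "ABC")
llm_set.enabled_options = {"Model-A", "Model-C", "Dataset-A"}
print(llm_set.active_benchmarks())  # ['ModelA-DatasetA', 'ModelC-DatasetA']
```

Keeping the set purely as a filter over an unchanged benchmark list mirrors the description's point that the core benchmarking never sees sets or options.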

This PR fixes an issue described in #1098 where accuracy benchmarks would crash because `loadgen_info.dart` attempts to extract performance data from loadgen's accuracy logs. I'm honestly not sure how this code ever worked, but this fix should work since performance data isn't used in accuracy benchmarks anyway.

This PR addresses the points discussed in #1098 regarding datasets (specifically MMLU). It does the following:
- Prevents performance mode from running when the query count is 0.
- Sets input/output token limits to 2048/1024 for IFEval and 2048/(4|128) for MMLU.
- Updates the accuracy string format for MMLU and IFEval (from `Accuracy: 50%` to `50%` for an accuracy of `0.5`).
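As a rough illustration of the first and last points, the guard and the new accuracy string could look like the Python sketch below. The function names `should_run_performance` and `format_accuracy` are hypothetical, the real change lives in the app's C++/Dart code, and the number of decimals the app actually prints is not stated in the PR, so the formatting here is an assumption.

```python
def should_run_performance(query_count: int) -> bool:
    """Skip performance mode when the dataset yields no queries."""
    return query_count > 0


def format_accuracy(accuracy: float) -> str:
    """Render a fractional accuracy as a bare percentage (no 'Accuracy: ' prefix).

    Assumption: trailing zeros are dropped, so 0.5 becomes '50%'.
    """
    return f"{accuracy * 100:g}%"


print(should_run_performance(0))  # False
print(format_accuracy(0.5))       # 50%
```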


* Exynos 2600 v6.0 Submission
* Update format/linting
* Update model and lib paths for Samsung submission v6.0 (#57)
* Update Samsung lib to v6.0_20260505
* Update Samsung model paths to mlcommons-storage URLs
* Expand LLM zip placeholders into individual model files
* Update format/lint
* Add model_checksum values for exynos2600
* Add LLM model_checksum values for exynos2600
* Removing base_zip_dir value in 2600 pbtxt
* Withdraw llm benchmarks

---------

Co-authored-by: Ahmed Youssef <ahmed.e@samsung.com>

---------

Co-authored-by: Ahmed Youssef <ahmed.e@samsung.com>
Co-authored-by: Anh <anh.app.dev@gmail.com>

* Update Mediatek backend for mt6993
* Format C/C++/Proto code with clang-format (Google style)
* Update MTK mt6993 model URLs and checksums (#60)
  Replace local:// paths with v6_0 mediatek storage URLs and update model_checksum to the actual MD5 values for all six MTK DLA models.

---------

Co-authored-by: Anh <anh.app.dev@gmail.com>

* QTI submission for V6.0 submission
  co-author: Utkarsh Mishra <utkarshm@qti.qualcomm.com>
  co-author: Mohit Mundhra <mmundhra@qti.qualcomm.com>
  co-author: Aswin B <aswib@qti.qualcomm.com>
* Adding license headers
* Removing 8cxg3
* Fixing the issue with submission mode
* Format and linting
* fixing markdown
* fix the output token for LLM (instruct) (#61)
* Update model and lib paths for Qualcomm submission v6.0 (#58)
* Update QTI SDK to v2.45.0.260326 from v6.0 lib bucket
* Restrict workflow push triggers to submission-v<N>.<N> branches
* Revert "Restrict workflow push triggers to submission-v<N>.<N> branches"
  This reverts commit 61670880dae023cd963fd7df0952afe14072cbec.
* Update QTI model URLs to v6_0 and refresh model_checksum values
* Expand LLM model_file entries to per-file v6_0 URLs with checksums
* Update Makefile to use the correct output folder (#62)
* Update Makefile to use the correct output folder
* Fix SNPE_MOBILEDET_DIR

---------

Co-authored-by: anhappdev <anh.app.dev@gmail.com>

* Update DLC
* Update android-build-test-linux.yml (#64)
  Include building LLM Models
* Add llama3_1b.spm.model tokenizer download to QTI Genie LLM settings
  The tokenizer_filename custom_setting referenced llama3_1b.spm.model but no model_file entry downloaded it, so backend_model_path/llama3_1b.spm.model would be missing at runtime for MMLU/IFEval datasets. Add the same tokenizer model_file used by the TFLite backend to both llm-8b and llm-8b-instruct delegate_choice blocks.
* Add tokenizer.json download to QTI Genie LLM settings
  Per review feedback, Genie requires the Llama 3.1 tokenizer.json alongside the NPU model shards. Add the file (hosted at v6_0/qualcomm/llama3_npu/) to both llm-8b and llm-8b-instruct delegate_choice blocks.

---------

Co-authored-by: Mohit Mundhra <mmundhra@qti.qualcomm.com>

* Update the batch size of offline case

---------

Co-authored-by: Utkarsh Mishra <utkarshm@qti.qualcomm.com>
Co-authored-by: Anh <anh.app.dev@gmail.com>


@freedomtan


github-actions · 2026-05-29T08:07:36Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

…hmark for M32 device (#1143)

The empty model_checksum entries (introduced in #1113 to ease testing with custom models) caused the integration test validateSettings() to fail for every benchmark, since it iterates allBenchmarks regardless of which benchmark is under test.

Checksums computed from the published model files:
- llama_q8_ekv3072.tflite -> c618e9bbb89ce52eedcab4f61b2dc3a4
- llama_3b_q8_ekv3072.tflite -> 6bde54c22e2be27bdad5303a0bd9cf72
- llama_8b_q8_ekv3072.tflite -> ec19cf40f881684f15db0b862883ff00

Disable the LLM benchmark for the Samsung Galaxy M32 used in CI tests as it's not supported.
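For anyone wanting to check a downloaded model against the values above, a minimal Python sketch is shown below. It assumes the checksums are MD5 digests (they are 32 hex digits, and the MediaTek commit in this PR describes `model_checksum` as MD5). The helper names `md5_of_file` and `verify_model` are hypothetical, and treating an empty or unknown checksum as a failure is an assumption of this sketch, not a description of what `validateSettings()` does.

```python
import hashlib
from pathlib import Path

# Checksums quoted in the commit message above.
EXPECTED_MD5 = {
    "llama_q8_ekv3072.tflite": "c618e9bbb89ce52eedcab4f61b2dc3a4",
    "llama_3b_q8_ekv3072.tflite": "6bde54c22e2be27bdad5303a0bd9cf72",
    "llama_8b_q8_ekv3072.tflite": "ec19cf40f881684f15db0b862883ff00",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so multi-gigabyte models are not loaded at once."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path):
    """Return True only if the file's MD5 matches the expected value for its name."""
    expected = EXPECTED_MD5.get(Path(path).name)
    if not expected:
        # Assumption: an empty or missing checksum counts as a failure.
        return False
    return md5_of_file(path) == expected
```

A call such as `verify_model("models/llama_q8_ekv3072.tflite")` (the path is illustrative) would return `True` only for a byte-identical copy of the published file.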

**Conflict resolution:**
- `flutter/pubspec.yaml`: kept submission's `6.0.0+1` over master's `5.0.5+1`.
- `WORKSPACE`: kept submission's TF 2.18.0 patch list. Dropped master's `ndk_25_r14.diff` (TF-2.14-only; removed on this branch during the TF 2.18 upgrade), but kept `xcode26_compat.patch`: it only patches `tensorflow/lite/kernels/elementwise.cc`, whose `std::abs<T>` usage is identical in TF 2.18, so it applies cleanly and fixes the Clang 21 `no matching function for call to 'EvalImpl'` error.

Resolves conflicts in WORKSPACE and flutter/pubspec.yaml:
- pubspec.yaml: keep v6.0.0+1 (supersedes master's 5.0.5+1).
- WORKSPACE: keep TF 2.18.0 patch list from v6.0. Master's ndk_25_r14.diff reference is dropped because that patch file no longer exists on v6.0 (removed during the TF 2.14 -> 2.18 upgrade) and the xcode26_compat.patch is already applied.

mohitmundhragithub and others added 30 commits November 4, 2025 14:28

Use the full dataset for performance benchmarks (#1069)

2510839

use the entire dataset for performance

Merge branch 'master' into submission-v6.0

b4559b4

Update loadgen to 5.1.1 (#1070)

f5b9cfd

update loadgen to 5.1.1

Fix android build on MacOS (#1076)

ed06ef2

fixed android build failing on macos

Add windows specific cmd for loadgen config (#1079)

9181973

* added powershell specific command for loadgen bazel genrule
* undo cmd_bash for macos support
* escaped $ characters for bazel

Add input token limit to LLM (#1077)

cd22a42

* added input token limit to pipeline backend config
* changed input limit to be part of the dataset
* formatting

Results for Token throughput based benchmarks (#1071)

da0b0f7

* Added TPS and TTFT to results screen (and results log)
* fixed potentially null values
* fixed unit tests
* fixed unused import
* added units for result screen
* formatting
* fixed UI bugs, added missing tooltips and values

Add JSON and Language validators for IFEval (#1078)

a7da562

* added json and language validators for IFEval
* formatting

Merge branch 'master' into submission-v6.0

8d2e5c7

Config based LLM pipeline thread count (#1094)

a85a7e7

use config to dictate thread count

IFEval evaluation code overhaul (#1093)

5470700

* fixed ifeval evaluation implementation bugs
* further improved sentence counter
* overhauled keyword evaluation by using stemming and a plural word map
* formatting
* use stemming library as external dependency
* format

MMLU + IFEval Edits (#1097)

a5e1c45

* truncate larger than max input for MMLU instead of ignoring
* fix bug in vector::erase
* format
* differentiate between shots when truncating + connect IFEval benchmark to app
* added new llm icons
* formatting
* added llama tflite model to cdn

Merge branch 'master' into submission-v6.0

07a87b8

Remove LLM checksum for testing (#1113)

34fea88

This PR removes the checksum values for LLM models, so that custom models can be used in testing.

fix - 0 perf-sample dataset (#1115)

d24a7dd

LLM raw output (#1117)

0621e33

This PR makes it so that MMLU and IFEval datasets produce the raw output from the model.

BenchmarkSet Hidden options (#1116)

06b89c3

This PR allows options to be hidden and adds results screen element for benchmark sets.

3B and 8B configs (#1118)

394e9d9

This PR adds app and backend benchmark configs for 3B and 8B LLM models; currently the model files are not uploaded.

Merge branch 'master' into submission-v6.0

08c2771

Update random seeds and app version for v6.0 (#1120)

702148e

* Closes #1119

LLM fixes (#1123)

055b442

This PR does a few things:
- Adds detailed IFEval accuracy log.
- Uses OTPS instead of TPS for throughput.
- Fixes minor UI issue where hidden options would still count towards the total shown.

Closes #1121

small fixes for LLM config (#1130)

5e5472d

This PR fixes the incorrect model filename for 3B and 8B benchmarks and disables all benchmark set options by default. It also fixes 2 bugs concerning offline and LLM benchmarks.

Update qti_settings_sd8elite.pbtxt (#1141)

e1155e6

Co-authored-by: anhappdev <anh.app.dev@gmail.com>

farook-edev and others added 2 commits May 27, 2026 05:38

V6.0 ps (#1142)

4d9a843

submission branch is protected, so I can't merge my UI changes without a PR @freedomtan @Mostelk @anhappdev --------- Co-authored-by: Anh <anh.app.dev@gmail.com>

Update the supported backend for MediaTek (#1145)

f2271a5

mention MTK Dimensity 9500 in the doc

anhappdev requested a review from a team as a code owner May 29, 2026 08:07

anhappdev marked this pull request as draft May 29, 2026 11:30

anhappdev closed this May 30, 2026

github-actions Bot locked and limited conversation to collaborators May 30, 2026

anhappdev reopened this May 30, 2026

anhappdev closed this May 30, 2026

anhappdev reopened this May 30, 2026

anhappdev marked this pull request as ready for review May 30, 2026 14:16

anhappdev requested review from Mostelk, freedomtan and mohitmundhragithub May 30, 2026 14:17


anhappdev wants to merge 35 commits into master from submission-v6.0


5 participants
