Merge submission v6.0 into master #1146
Open
anhappdev wants to merge 35 commits into
* WIP LLM pipeline and dataset implementation
* fixed issues preventing libraries from compiling, runtime errors not included
* upgrade TensorFlow to 2.18.0
* upgraded llm pipeline to use TFLite C++ api + small bug fixes
* basic flutter app support for icon and dataset
* added linux x86_64 config for internal testing
* updated bazel config to use SSE/MMX instructions
* fixed incorrect answer format and compression
* got pipeline and dataset to produce proper results + fixed issues where pipeline cannot handle an input size larger than the max prefill size
* added support for loadgen's token based performance measurement + implemented performance benchmark for LLM pipeline
* fixed bugs in inference process, first token function now handles only input and issue_query only handles output tokens
* optimized tensor retrieval for inference + added check for input size vs KV cache size
* clang-format
* mmlu dataset cleanup and formatting
* slight code cleanup
* fixed issue with genai ops import
* code/config cleanup
* add zero-shot option to MMLU constructor
* use function to detect which token is answer letter
* quick initial implementation of first token callback
* moved tokenizer to dataset side (possibly needs cleanup)
* added files needed for MMLU utils
* clang-format
* continued formatting
* code cleanup / issue_query signature update to vendor backends
* signature update for QTI/Samsung backends
* format
* formatted clang and bazel using docker based formatter
* reverted issue_query change for samsung + bazel formatting
* fix for MSVC C7555 error
* rough IFEval implementation using llm_instruction benchmark
* disabled XNNPACK AVX-VNNI for windows due to C2440 error
* moved accuracy calculation away from ProcessOutput, ifeval accuracy is calculated per instruction not per sample
* fixed issue with app not finding model/tokenizer
* properly format 0-shot prompts + allow for file/directory for model path
* formatting
* potential fix for windows C2440
* fix for aligned free for windows
* potential fix for IOS / windows CI issues
* ifeval check cleanup and bugfixes
* formatting
* all possible configs for removing eigen exceptions
* removed objc opts
* use token latencies in app
* enable exceptions for IOS
* disable FP16 AVX for x86 simulator
* attempt to enable exceptions for eigen
* 2nd attempt at enabling exceptions for IOS eigen
* fixed fexceptions syntax
* kitchen-sink approach to enable exceptions for IOS
* attempt to undefine EIGEN_EXCEPTIONS for IOS
* add global override to disable eigen exceptions
* use ARM based macos for IOS build
* fixed and re-enabled eigen patch
* further fix for eigen patch
* even more patch fixing
* fixed typo in eigen patch
* fixed incorrect count in eigen patch
* force arm64 ios build
* use ARM64 simulator for IOS build
* use arm64 simulator for tflite on IOS
* set ios cpu argument for cpuinfo
* removed ios_sim prefix
* attempt at using arm64 simulator for IOS instead of x86
* attempt to force flutter to build ITs for arm64 only
* force arm64 for pods
* disable f16 instead of building for arm64
* more bazel config lines to disable fp16
* removed unavailable compiler flags
* provide patched fp16 lib with math workaround
* typo
* added patch arg
* created a math workaround patch compatible with fp16 version used by xnnpack
* datasets now provide token limits as inputs to pipeline
---------
Co-authored-by: Farook Al-Sammarraie <farook.a@scopicsoftware.com>
use the entire dataset for performance
update loadgen to 5.1.1
fixed android build failing on macos
* added powershell specific command for loadgen bazel genrule
* undo cmd_bash for macos support
* escaped $ characters for bazel
* added input token limit to pipeline backend config
* changed input limit to be part of the dataset
* formatting

* Added TPS and TTFT to results screen (and results log)
* fixed potentially null values
* fixed unit tests
* fixed unused import
* added units for result screen
* formatting
* fixed UI bugs, added missing tooltips and values

* added json and language validators for IFEval
* formatting
use config to dictate thread count
* fixed ifeval evaluation implementation bugs
* further improved sentence counter
* overhauled keyword evaluation by using stemming and a plural word map
* formatting
* use stemming library as external dependency
* format

* truncate larger than max input for MMLU instead of ignoring
* fix bug in vector::erase
* format
* differentiate between shots when truncating + connect IFEval benchmark to app
* added new llm icons
* formatting
* added llama tflite model to cdn
This PR adds a new element (BenchmarkSet) which bundles together benchmarks that are mostly similar but need to be run separately (i.e. different models or datasets but the same function). Under the hood the benchmarks work exactly the same; no C++ logic has been changed. The added configuration is only for the frontend.

It works by bundling similar benchmarks under a set and making each benchmark active only when all the options it requires are active. For example, take LLM with 3 models and 3 dataset implementations to test: `ModelA-DatasetB`, `ModelC-DatasetA`, and so on, for 9 benchmarks in total. Benchmark `ModelA-DatasetC` defines 2 required options, `Model-A` and `Dataset-C`, and the Benchmark Set contains 6 options in 2 categories, Models (A, B, C) and Datasets (A, B, C). If a user then enables Models A and C and Dataset A, the set automatically activates `ModelA-DatasetA` and `ModelC-DatasetA` and disables all the others. The benefit of this approach is that instead of 9 benchmarks that are basically the same, we have 1 set containing 6 options, while the core benchmarking does not see the sets or options.

This PR also applies the implementation described above to `image_classification_v2`, combining the default and offline versions into a set and providing 2 options to enable and disable the benchmarks. This is only a secondary improvement, since the system is meant to tidy up the (at least) 4 benchmarks that LLM will add.

I've also included a video of the system in action: https://github.com/user-attachments/assets/9c833086-60fc-4d6f-a5bd-bf1bb10cab0a

Closes #1082
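The option-to-benchmark mapping in the example can be sketched roughly as follows. This is a minimal Python illustration, not the app's Dart code: the `Benchmark` and `BenchmarkSet` classes, their fields, and the rule that a benchmark runs only when every option it requires is enabled are assumptions inferred from the Models A/C plus Dataset A example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Benchmark:
    """A benchmark and the option ids it requires (hypothetical structure)."""
    benchmark_id: str
    required_options: frozenset


@dataclass
class BenchmarkSet:
    """Bundles similar benchmarks with the options that toggle them."""
    benchmarks: list
    enabled_options: set = field(default_factory=set)

    def active_benchmarks(self):
        # Assumed rule: a benchmark is active only when every option it
        # requires is currently enabled in the set.
        return [
            b.benchmark_id
            for b in self.benchmarks
            if b.required_options <= self.enabled_options
        ]


def build_llm_set():
    # 3 models x 3 datasets = 9 benchmarks, toggled by 6 options.
    models = ["A", "B", "C"]
    datasets = ["A", "B", "C"]
    benchmarks = [
        Benchmark(f"Model{m}-Dataset{d}",
                  frozenset({f"Model-{m}", f"Dataset-{d}"}))
        for m in models
        for d in datasets
    ]
    return BenchmarkSet(benchmarks)


llm_set = build_llm_set()
llm_set.enabled_options = {"Model-A", "Model-C", "Dataset-A"}
print(llm_set.active_benchmarks())  # ['ModelA-DatasetA', 'ModelC-DatasetA']
```

Keeping the required options on each benchmark, rather than on the set, means the set only tracks which options are enabled and derives the active benchmarks from that.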
This PR fixes an issue described in #1098 where accuracy benchmarks would crash because `loadgen_info.dart` attempts to extract performance data from loadgen's accuracy logs. I'm honestly not sure how this code ever worked, but this fix should work since performance data isn't used in accuracy benchmarks anyway.
This PR addresses the points discussed in #1098 regarding datasets (specifically MMLU). It does the following:
- Prevents performance mode from running when the query count is 0.
- Sets input/output token limits to 2048/1024 for IFEval and 2048/(4|128) for MMLU.
- Updates the accuracy string format for MMLU and IFEval (from `Accuracy: 50%` to `50%` for an accuracy of `0.5`).
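As a rough illustration of the accuracy string change, the sketch below contrasts the two formats in Python. The function names are hypothetical, and rounding to a whole percent is an assumption; the description only shows `50%` for `0.5`, and the actual formatting lives in the datasets' own code.

```python
def format_accuracy_old(accuracy: float) -> str:
    """Previous style: label plus percentage (hypothetical helper)."""
    return f"Accuracy: {accuracy * 100:.0f}%"


def format_accuracy(accuracy: float) -> str:
    """New style: bare percentage for a fraction in [0, 1].

    Assumption: whole-percent rounding; the source does not state precision.
    """
    return f"{accuracy * 100:.0f}%"


print(format_accuracy_old(0.5))  # Accuracy: 50%
print(format_accuracy(0.5))      # 50%
```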
This PR removes the checksum values for LLM models, so that custom models can be used in testing.
This PR makes it so that MMLU and IFEval datasets produce the raw output from the model.
This PR allows options to be hidden and adds a results screen element for benchmark sets.
This PR adds app and backend benchmark configs for 3B and 8B LLM models. The model files are not uploaded yet.
This PR does a few things:
- Adds a detailed IFEval accuracy log.
- Uses OTPS instead of TPS for throughput.
- Fixes a minor UI issue where hidden options would still count towards the total shown.

Closes #1121
This PR fixes the incorrect model filename for the 3B and 8B benchmarks and disables all benchmark set options by default. It also fixes 2 bugs concerning offline and LLM benchmarks.
* Exynos 2600 v6.0 Submission
* Update format/linting
* Update model and lib paths for Samsung submission v6.0 (#57)
* Update Samsung lib to v6.0_20260505
* Update Samsung model paths to mlcommons-storage URLs
* Expand LLM zip placeholders into individual model files
* Update format/lint
* Add model_checksum values for exynos2600
* Add LLM model_checksum values for exynos2600
* Removing base_zip_dir value in 2600 pbtxt
* Withdraw llm benchmarks
---------
Co-authored-by: Ahmed Youssef <ahmed.e@samsung.com>
---------
Co-authored-by: Ahmed Youssef <ahmed.e@samsung.com>
Co-authored-by: Anh <anh.app.dev@gmail.com>

* Update Mediatek backend for mt6993
* Format C/C++/Proto code with clang-format (Google style)
* Update MTK mt6993 model URLs and checksums (#60)
  Replace local:// paths with v6_0 mediatek storage URLs and update model_checksum to the actual MD5 values for all six MTK DLA models.
---------
Co-authored-by: Anh <anh.app.dev@gmail.com>
* QTI submission for V6.0 submission
  co-author: Utkarsh Mishra <utkarshm@qti.qualcomm.com>
  co-author: Mohit Mundhra <mmundhra@qti.qualcomm.com>
  co-author: Aswin B <aswib@qti.qualcomm.com>
* Adding license headers
* Removing 8cxg3
* Fixing the issue with submission mode
* Format and linting
* fixing markdown
* fix the output token for LLM (instruct) (#61)
* Update model and lib paths for Qualcomm submission v6.0 (#58)
* Update QTI SDK to v2.45.0.260326 from v6.0 lib bucket
* Restrict workflow push triggers to submission-v<N>.<N> branches
* Revert "Restrict workflow push triggers to submission-v<N>.<N> branches"
  This reverts commit 61670880dae023cd963fd7df0952afe14072cbec.
* Update QTI model URLs to v6_0 and refresh model_checksum values
* Expand LLM model_file entries to per-file v6_0 URLs with checksums
* Update Makefile to use the correct output folder (#62)
* Update Makefile to use the correct output folder
* Fix SNPE_MOBILEDET_DIR
---------
Co-authored-by: anhappdev <anh.app.dev@gmail.com>
* Update DLC
* Update android-build-test-linux.yml (#64)
  Include building LLM Models
* Add llama3_1b.spm.model tokenizer download to QTI Genie LLM settings
  The tokenizer_filename custom_setting referenced llama3_1b.spm.model but no model_file entry downloaded it, so backend_model_path/llama3_1b.spm.model would be missing at runtime for MMLU/IFEval datasets. Add the same tokenizer model_file used by the TFLite backend to both llm-8b and llm-8b-instruct delegate_choice blocks.
* Add tokenizer.json download to QTI Genie LLM settings
  Per review feedback, Genie requires the Llama 3.1 tokenizer.json alongside the NPU model shards. Add the file (hosted at v6_0/qualcomm/llama3_npu/) to both llm-8b and llm-8b-instruct delegate_choice blocks.
---------
Co-authored-by: Mohit Mundhra <mmundhra@qti.qualcomm.com>
* Update the batch size of offline case
---------
Co-authored-by: Utkarsh Mishra <utkarshm@qti.qualcomm.com>
Co-authored-by: Anh <anh.app.dev@gmail.com>
Co-authored-by: anhappdev <anh.app.dev@gmail.com>
submission branch is protected, so I can't merge my UI changes without a PR @freedomtan @Mostelk @anhappdev --------- Co-authored-by: Anh <anh.app.dev@gmail.com>
mention MTK Dimensity 9500 in the doc
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
…hmark for M32 device (#1143)

The empty model_checksum entries (introduced in #1113 to ease testing with custom models) caused the integration test validateSettings() to fail for every benchmark, since it iterates allBenchmarks regardless of which benchmark is under test.

Checksums computed from the published model files:
- llama_q8_ekv3072.tflite -> c618e9bbb89ce52eedcab4f61b2dc3a4
- llama_3b_q8_ekv3072.tflite -> 6bde54c22e2be27bdad5303a0bd9cf72
- llama_8b_q8_ekv3072.tflite -> ec19cf40f881684f15db0b862883ff00

Disable the LLM benchmark for the Samsung Galaxy M32 used in CI tests as it's not supported.
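For anyone re-checking the checksums listed above against a downloaded model file, a minimal Python sketch follows. It assumes the `model_checksum` values are MD5 hex digests (they are 32 hex characters, and the MediaTek commit in this PR calls its checksums MD5); the function names and the local file path are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's MD5 against an expected value, ignoring hex case."""
    return md5_of_file(path) == expected_hex.strip().lower()


# Usage, assuming the model was downloaded to the working directory:
# matches_checksum("llama_q8_ekv3072.tflite",
#                  "c618e9bbb89ce52eedcab4f61b2dc3a4")
```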
**Conflict resolution:**
- `flutter/pubspec.yaml`: kept submission's `6.0.0+1` over master's `5.0.5+1`.
- `WORKSPACE`: kept submission's TF 2.18.0 patch list. Dropped master's `ndk_25_r14.diff` (TF-2.14-only; removed on this branch during the TF 2.18 upgrade), but kept `xcode26_compat.patch`: it only patches `tensorflow/lite/kernels/elementwise.cc`, whose `std::abs<T>` usage is identical in TF 2.18, so it applies cleanly and fixes the Clang 21 `no matching function for call to 'EvalImpl'` error.

Resolves conflicts in WORKSPACE and flutter/pubspec.yaml:
- pubspec.yaml: keep v6.0.0+1 (supersedes master's 5.0.5+1).
- WORKSPACE: keep TF 2.18.0 patch list from v6.0. Master's ndk_25_r14.diff reference is dropped because that patch file no longer exists on v6.0 (removed during the TF 2.14 -> 2.18 upgrade) and the xcode26_compat.patch is already applied.