Feat/distribution check 2 stage #2
Open
staaason wants to merge 1056 commits into feat/sequence-check-validator from
Conversation
Auto build/auto upgrade testnet/add k8s support to testermint:

## Auto Build
1. Add `publish_upgrade_binaries.yml`.
2. Fix the build process so upgrades work properly in GitHub Actions, by using build options that work on plain Linux machines (not Apple Silicon).

## k8s Support
1. Abstract the CLI layer into a class (CliExecutor).
2. Create DockerExecutor for the existing logic.
3. Add KubernetesExecutor for k8s clusters.
4. For now, k8s tests need to stay separate. They'll always need some kind of tag at a minimum, since not everything required to run the full test suite is possible on k8s.

## Upgrade
There's a test that pulls RELEASETAG from the environment and upgrades to that tag for k8s clusters. There's a workflow, triggered when the create-release workflow finishes, that kicks off the script.
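The executor split described above can be sketched as a small interface with two backends. This is an illustrative Go sketch (the actual testermint code may be structured differently and in another language); the type and method names mirror the ones named in the commit message, but the command strings returned here are placeholders rather than the real invocation logic.

```go
package main

import (
	"fmt"
	"strings"
)

// CliExecutor hides where a CLI command runs; tests depend only on
// this interface, so the same test can target Docker or Kubernetes.
type CliExecutor interface {
	Exec(cmd ...string) (string, error)
}

// DockerExecutor wraps the pre-existing docker-based logic.
type DockerExecutor struct{ container string }

func (d DockerExecutor) Exec(cmd ...string) (string, error) {
	// Placeholder: a real implementation would shell out to docker.
	return "docker exec " + d.container + " " + strings.Join(cmd, " "), nil
}

// KubernetesExecutor targets a pod in a k8s cluster instead.
type KubernetesExecutor struct{ pod, namespace string }

func (k KubernetesExecutor) Exec(cmd ...string) (string, error) {
	// Placeholder: a real implementation would use the k8s API or kubectl.
	return "kubectl exec -n " + k.namespace + " " + k.pod + " -- " + strings.Join(cmd, " "), nil
}

func run(e CliExecutor) {
	out, _ := e.Exec("inferenced", "status")
	fmt.Println(out)
}

func main() {
	run(DockerExecutor{container: "node"})
	run(KubernetesExecutor{pod: "node-0", namespace: "gonka"})
}
```

Tests that must run on both backends accept a `CliExecutor`; k8s-only tests (point 4 above) get their own tag and construct a `KubernetesExecutor` directly.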
* fix wildcard
* fix wildcard
Fix the path to the artifact.
* Create new "sanity" test (just a few tests) Also add AI guidelines for better agent behavior and fix one weird edge case * More changes to fix downloading tags
* Create new "sanity" test (just a few tests) Also add AI guidelines for better agent behavior and fix one weird edge case * move to a cancel context for shutdown * Fix API upgrades 1. Make sure the test fails if the upgrade fails. 2. Make sure `decentralized-api` returns an error on upgrade so cosmovisor will upgrade correctly 3. Update to new cosmovisor that gets upgrade height the old way (via inferenced status).
# Conflicts:
#   decentralized-api/apiconfig/config.go
#   tmkms/Makefile
* Add a log of the node version for debugging
* Create new "sanity" test (just a few tests) Also add AI guidelines for better agent behavior and fix one weird edge case * Add upgrade for v0.1.4 --------- Co-authored-by: Gleb Morgachev <gleb@productscience.ai>
* Reset cosmovisor * Scripts refactoring * docker refactoring * Upd reset * Another clear cosmovisor * Add lost line
# Conflicts:
#   inference-chain/scripts/init-docker.sh
…PoC submissions (gonka-ai#162) * Large PoC payload test * Empty arrays for MsgSubmitPocValidation * change how we make a PoC decision on participant's submission * Add comment on why we're zeroing out all arrays * log votedWeight * tweak log statements * Add upgrade handler * 1_6 > 1_7 * test that will break it! * refactor * test simplification * Fix * log currentValidatorWeights * Deny if no validators are found * Improve logging * change epochParams * accept if no validators * skip test * 0.1.7 > 0.1.8
* Just some cline files to start. More to come. * Initial checkin for the log-examiner * Add prompts and a README.md file * Fix for README.md
* Several attempts at making k8s more reliable 1. Longer timeouts for https requests. 2. retries for some CLI commands in certain cases. Flexible implementation, we can control the errors to retry as well as the commands to retry (currently get only) 3. Better logging of the pair involved in the logs 4. Logging as k8s for activities related to k8s connections 5. Fix port forwarding logic and move to a proper class 6. Added a local cluster upgrade from live test to help debug upgrade issues
* fix attempt * Revert "fix attempt" This reverts commit 2956ba2. * Alternative mappings * template approach * delete unused * fix template
* Upgrade restart * Readme * do nothing if state is as intended * Filter commands * Upgrade v0.1.9 * Timeout, cleanup logs * Change default state
* dynamic resolve for docker address * Enhance dynamic resolve
* add filtering to chainvalidation.go * add filtering to ValidateReceivedBatches * ... * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: DimaOrekhovPS <dima.orekhov@productscience.ai> * ... * guard against len mismatch --------- Signed-off-by: DimaOrekhovPS <dima.orekhov@productscience.ai> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Inference validation proposal We propose an update to the current inference validation system that protects against pre-fill attacks. * Update proposals/inference-validation/inference-validation.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: mtvnastya <mtvnastya@gmail.com> * Update proposals/inference-validation/inference-validation.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: mtvnastya <mtvnastya@gmail.com> * Update proposals/inference-validation/inference-validation.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: mtvnastya <mtvnastya@gmail.com> * Distance check plots Added plots for distance check comparing Qwen3-32B fp8 vs int4 * format * format 2 * Format 3 * notation fix Updated mathematical notation for clarity and consistency. Signed-off-by: mtvnastya <mtvnastya@gmail.com> * notation Signed-off-by: mtvnastya <mtvnastya@gmail.com> * Change notation from curly brackets to braces Updated notation for the full probability distribution in the inference validation proposal. Signed-off-by: mtvnastya <mtvnastya@gmail.com> * minor clean up Updated the purpose of Stage 1 and clarified the token sampling process in the validator. --------- Signed-off-by: mtvnastya <mtvnastya@gmail.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Gleb Morgachev <morgachev.g@gmail.com>
This PR fixes a critical issue where queries intended to fetch all items were silently truncated to the first 100 results due to default Cosmos SDK pagination settings. This could lead to incomplete data processing and inconsistent state. ### Key Changes * **`SettleAccounts`**: Updated to read all participants directly from the store, which is more appropriate and efficient for on-chain logic. Added tests to make sure we're settling for all the participants in the state and not only for the first 100. * **`get_participants_handler`**: Implemented manual per-page fetching for the public API endpoint. Queries are pinned to a specific block height to ensure data consistency across pages. * **`GetPartialUpgrades`**: Corrected to use a new `GetAllWithPagination` utility, ensuring all partial upgrade plans are fetched from the chain. These fixes prevent silent data loss and ensure the reliability of account settlements and API responses. All changes are covered by comprehensive tests as detailed in the implementation plan.
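The per-page fetching described above follows the standard cursor pattern: request a page, append its items, and continue with the returned "next key" until it is empty. The sketch below illustrates that loop in Go under assumed names (`Page`, `fetchAll`, `pagedStore` are hypothetical; the real code uses the Cosmos SDK's `query.PageRequest`/`query.PageResponse` and store iterators).

```go
package main

import "fmt"

// Page models one page of results plus an opaque cursor, analogous to
// the Cosmos SDK query.PageResponse.NextKey.
type Page struct {
	Items   []string
	NextKey []byte // nil means this was the last page
}

// fetchAll keeps requesting pages until the cursor is exhausted,
// mirroring the intent of the fix: never stop at the first
// (default 100-item) page.
func fetchAll(fetch func(key []byte) Page) []string {
	var all []string
	var key []byte
	for {
		page := fetch(key)
		all = append(all, page.Items...)
		if page.NextKey == nil {
			return all
		}
		key = page.NextKey
	}
}

// pagedStore simulates a store that serves items pageSize at a time.
func pagedStore(items []string, pageSize int) func(key []byte) Page {
	return func(key []byte) Page {
		start := 0
		if key != nil {
			fmt.Sscanf(string(key), "%d", &start)
		}
		end := start + pageSize
		if end >= len(items) {
			return Page{Items: items[start:]}
		}
		return Page{Items: items[start:end], NextKey: []byte(fmt.Sprint(end))}
	}
}

func main() {
	// 250 items served 100 per page: a naive single query would see 100.
	fetch := pagedStore(make([]string, 250), 100)
	fmt.Println(len(fetchAll(fetch)))
}
```

Pinning each page's query to the same block height, as the `get_participants_handler` change does, is what keeps the concatenated pages mutually consistent when the state changes between requests.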
The commit introduces multiple fixes and security improvements:

#### Missed `node_id` in PoCBatch (`inference-chain/x/inference/keeper/msg_server_submit_poc_batch.go`)
`node_id` is used to detect which node produced the nonce's batch.

#### Remove legacy weight distribution for batches without `node_id`
`inference-chain/x/inference/module/model_assignment.go` previously distributed batches without `node_id` among the other nodes. Since all MLNodes now return `node_id`, this is no longer needed.

#### Remove MLNodes without HardwareNodes or models supported by Governance
`unassignedMLNodes` from `inference-chain/x/inference/module/model_assignment.go` are no longer counted in the total weight. The participant's total weight is recomputed after all filtering.

#### Statistical validation for missed inference and validation
A binomial test (`inference-chain/x/inference/calculations/stats.go`) is now used to:
- withhold the reward if a statistically significant share (> 10%) of requests were missed (`inference-chain/x/inference/keeper/accountsettle.go`)
- withhold the claim if a statistically significant share (> 10%) of validations were missed (`inference-chain/x/inference/keeper/msg_server_claim_rewards.go`)

instead of a hard check for all validations.

#### Set Participant status to `ACTIVE` for all ActiveParticipants on epoch switch
`inference-chain/x/inference/module/module.go:moveUpcomingToEffectiveGroup`

#### Fix counter for successful inferences
`inference-chain/x/inference/keeper/msg_server_start_inference.go` `inference-chain/x/inference/keeper/msg_server_finish_inference.go`

#### Fix for interrupted inference validation

#### Logging
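The binomial test above replaces a hard "any miss fails" check with a statistical one: withhold payment only when the observed miss count is significantly above the 10% tolerance. A minimal Go sketch of such a one-sided test is below; the function names, the exact alpha, and the decision rule are assumptions for illustration, not the contents of `stats.go`.

```go
package main

import (
	"fmt"
	"math"
)

// binomTailAtLeast returns P(X >= k) for X ~ Binomial(n, p), computed
// from the exact pmf via log-gamma to stay stable for large n.
func binomTailAtLeast(k, n int, p float64) float64 {
	tail := 0.0
	for i := k; i <= n; i++ {
		logPmf := lgammaInt(n+1) - lgammaInt(i+1) - lgammaInt(n-i+1) +
			float64(i)*math.Log(p) + float64(n-i)*math.Log(1-p)
		tail += math.Exp(logPmf)
	}
	return tail
}

func lgammaInt(x int) float64 {
	v, _ := math.Lgamma(float64(x))
	return v
}

// missedTooMany flags a participant only when the miss count is
// statistically significantly above the 10% tolerance (assumed
// alpha = 0.05): small overruns consistent with noise are forgiven.
func missedTooMany(missed, total int) bool {
	return binomTailAtLeast(missed, total, 0.10) < 0.05
}

func main() {
	fmt.Println(missedTooMany(11, 100)) // 11% of 100: within noise
	fmt.Println(missedTooMany(20, 100)) // 20% of 100: significant
}
```

The point of the change: with 100 requests, 11 misses is entirely consistent with a true 10% rate, so punishing it (as a hard check would) penalizes honest participants for variance rather than behavior.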
* regenerate seed * ? --------- Co-authored-by: dima <dima.orekhov@productscience.ai>
This reverts commit b627a72.
* Batch processing for DetectMissedValidations inferences * AutoRewardRecovery on startup * Fix: retry Txs * Tests * Recover * Auto upgrade * Fixes * Logs * Replace TX subscriptions with block querying to prevent subscription channels overflow (gonka-ai#376) * block_monitoring.go sketch * first sketch * rename file * remove unnecessary things * block_observer_test.go * real event * ... * ... * ... * ... * Fix: add lastProcessHeight to log * adapt tests * minimal version of signalAllEventsRead * minimal version of signalAllEventsRead * Update decentralized-api/internal/event_listener/block_observer_test.go Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: DimaOrekhovPS <dima.orekhov@productscience.ai> * add comment --------- Signed-off-by: DimaOrekhovPS <dima.orekhov@productscience.ai> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Gleb Morgachev <morgachev.g@gmail.com> * Instruction * Instruction * Text * Text * Remove old upgrade info * Reproducible seed based on epoch signature (gonka-ai#377) * regenerate seed * fix integration_test.go * remove GenerateSeed mock * remove GenerateSeed mock * return the seed number calculation logic as it was * Regenerate Seed if missing * Fix --------- Co-authored-by: 0xBECEDA <sakharovasonya1@gmail.com> Co-authored-by: David and Daniil Liberman <da@liberman.net> * Certik Audit fixes (straightforward changes) (gonka-ai#356) * Potential division by 0 (GOI-15) * Confusion in pass by ref/value (GOI-05) Technically, the code worked, but looks wrong. 
* Fix PubKey nil checks (GOI-19) The claim_rewards instances were largely replaced with the delegatees logic, however I did fix an edge case where if no keys were found the signature would be considered valid (not sure if it's possible, but no harm in fixing) * Make RunMembershipService match RunManager (GOI-29) * Move consts (GOI-07) * Consistent Declaration pattern (GOI-08) * Remove unused errors (GOI-31) * Remove unused DefaultMaxTokens (GOI-33) * Remove extra DefaultMaxTokens (GOI-33) * found != true => !found (GOI-35) * Optimize newMLNodesMap (GOI-09) This is also cleaner code * Rename method (GOC-19) Doesn't actually filter for zero balance. Not really clear if this method is even being used. * Upgrade dependencies (GOC-20) The vulnerability in github.com/cosmos/ibc-go/v8 8.2.0 is severe. The rest of the updates are just required * Participants should start in RAMPING (GOC-29) Note that there is (or should be) no operational difference between RAMPING and ACTIVE at present. * Full token enforcing (gonka-ai#256) * Extract new artifact * Change validation * New validation by default * Fix listener * Get back error * BUG: Invalid Request to create new participant can break validator (gonka-ai#304) * don't submit new participant from `dapi` * Fix guardian power tests * Rename method (GOC-24) * Add log of error (GOC-32) * remove unneeded printlns (GOC-33) * Move to const (GOC-34) * Add err check (GOC-35) * A variety of comment fixes (GOC-38) 1. TODOs that had been done 2. out-of-date comments 3. badly worded/misspelled NOTE: This was done with a good bit of AI assist (on finding them) and only on the `inference` core module. * Fix possible panics for negative coins (gonka-ai#309) * Fix possible panics for negative coins `sdk.NewInt64Coin` WILL panic if it gets a negative value. 
Add error checking for every instance where we use NewInt64Coin (minus a few in Genesis we can ignore) Add a Safe method for logging subaccount transactions to reduce the (now) needed boilerplate Move from using uint64 in our interior methods. In the future, we need to strictly use uint64 at the edges (messages, apis) only. The risk of 0 - 1 = uint64.Max is real, and Go idioms avoid uints in general. * Fix typo * fix build * Missed validations recovery system (gonka-ai#353) ## Problem During an epoch, a participant may legitimately miss validating some inferences due to network instability, hardware changes, or other temporary issues. Currently, once accounts are settled at epoch transition, missed validations cannot be recovered. This leads to: * Gaps in inference validation coverage. * Potentially lower reputation and compute credit for participants. * Inconsistent incentives between those who missed validations for legitimate reasons and those who didn’t. * A risk that some invalid inferences remain undetected if they were missed by validators. --- ## Solution Introduce a recovery mechanism that allows participants to “catch up” on missed validations **after account settlement but before claiming rewards**. The key aspects are: 1. **Automatic Detection of Missed Validations** * Each participant node queries for any inferences it should have validated but did not. * These missed inferences are filtered against the models available to the participant. 2. **Recovery Validations Before Claim** * Any missed validations are executed and submitted prior to submitting the claim transaction. * This ensures participants remain consistent with the validation process, even if late. 2. **Validations Retries** - Validation failures will now be retried several times before giving up, in an attempt to reduce the need for missed validations work. 3. 
**Modified Treatment of Recovery Validations** * Participants receive **validation credit** (reputation / proof of compute) but **no shared coin for work**, since payment has already been settled. They do still receive earned Bitcoin style rewards. * If a late validation identifies an invalid inference: - Invalidation verification and voting still happens on the network * If invalid, the inference submitter’s reputation is still penalized. * However, the requester does **not** receive a refund, since funds have already been distributed and cannot be clawed back. We may revisit this in the future for finding a way to claw back funds from the invalid executor. 4. **System Safeguards** * Ensure that no negative coin balances during settlement can cause a panic, preventing consensus-breaking chain halts. (While this PR will not cause such situations, testing revealed this possible error which needed to be addressed for chain stability) --- ## Implementation * **Validation Logic Refactor** * Extract reusable validation logic into `shouldValidateInference()`. * **Missed Validation Detection** * Implement `DetectMissedValidations()` to identify per-epoch gaps in validation. * **Execution of Recovery Validations** * Add `ExecuteRecoveryValidations()` with parallel goroutine execution for efficiency. * **Integration with Epoch Transition** * Trigger recovery validations at `IsSetNewValidatorsStage`, before claim submission. * Run background processing to minimize delays. * **Infrastructure Reuse & Logging** * Use existing validation infrastructure for consistency. * Add logging for monitoring and debugging missed validation recovery. * Handle mismatching error better (GOC-22) The check in GetPreviousEpochMLNodesWithInferenceAllocation `currentEpochGroup.GroupData.EpochIndex != upcomingEpoch.Index-1` should absolutely never happen, but if it does stopping processing is better than going off of bad data. * Distributed remainder correctly (GOI-20) 1. 
Ensure we NEVER distribute more remainder than available (via the `Remainder` type) 2. Fix logic for adding created MLNodeInfo for remainder only situations. 3. Add tests. * Fix decentralized API build * Update README.md Signed-off-by: Tania Charchian <tatiana.charchian@productscience.ai> * TestNet (gonka-ai#363) * MLNode on-chain upgrade (gonka-ai#364)

# MLNode Upgrade & API Stability

**Architecture Overview:**

```
┌─────────────────┐    ┌──────────────┐    ┌─────────────────┐
│ decentralized-  │───▶│   ML Proxy   │───▶│  MLNode v3.0.6  │
│ api             │    │   (NGINX)    │    │  (old version)  │
└─────────────────┘    │              │    └─────────────────┘
                       │              │    ┌─────────────────┐
                       │              │───▶│  MLNode v3.0.8  │
                       └──────────────┘    │  (new version)  │
                                           └─────────────────┘
```

## Key Changes:
- MLNode version is now stored on-chain and managed by partial upgrades.
- `api` service fetches the version at startup and updates automatically when an upgrade height is reached.
- `api` stores the last-used MLNode version to its config. If the current version changes, it refreshes all MLNode clients and calls `.stop()` on the old version's clients.
- Fixed a bug where partial upgrades were ignored because the `Name` field was empty.
- Removed local version plan storage from the `api` config.
- Refactored and stabilized `testermint` integration tests to reduce flakiness.
- Upgrade handler and migration for on-chain upgrade

* dynamic resolve for docker address (gonka-ai#354) * dynamic resolve for docker address * Enhance dynamic resolve * lost file * PROPOSAL 2: new models (gonka-ai#350) * register migrations in test * Filter duplicate nonces (gonka-ai#366) * add filtering to chainvalidation.go * add filtering to ValidateReceivedBatches * ... * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: DimaOrekhovPS <dima.orekhov@productscience.ai> * ... 
* guard against len mismatch --------- Signed-off-by: DimaOrekhovPS <dima.orekhov@productscience.ai> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Fix: total num nodes according to current state * Fix build * Typo snuck in * Weird formatting change * Necessary change to support moving to ibc-go >8.7.0 * See what version changes have broken * Check last version as in main --------- Signed-off-by: Tania Charchian <tatiana.charchian@productscience.ai> Signed-off-by: DimaOrekhovPS <dima.orekhov@productscience.ai> Co-authored-by: Gleb Morgachev <gleb@productscience.ai> Co-authored-by: Tania Charchian <tatiana.charchian@productscience.ai> Co-authored-by: Gleb Morgachev <morgachev.g@gmail.com> Co-authored-by: DimaOrekhovPS <dima.orekhov@productscience.ai> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Upgrade handler * bump cosmos-sdk * Migrations * Proxy SSL (gonka-ai#373) proxy-ssl is a small HTTP service that issues TLS certificates via ACME DNS-01 (Let’s Encrypt) for subdomains of a single base domain. It is used in Gonka deployments to automatically issue certs for hosts like explorer., api., rpc., etc. How it works: clients submit a CSR and the desired FQDNs; the service performs DNS-01 challenges using your DNS provider credentials, then returns a certificate bundle. Security: requests must be authorized with a JWT (CERT_ISSUER_JWT_SECRET). Only subdomains listed in CERT_ISSUER_ALLOWED_SUBDOMAINS under CERT_ISSUER_DOMAIN are allowed. Storage: issued bundles are written under cert_storage_path (default /app/certs). Providers: Route53, Cloudflare, Google Cloud DNS, Azure DNS, DigitalOcean DNS, Hetzner DNS. If configuration is missing/invalid, the container runs in a disabled mode and serves only /health for liveness checks. 
--------- Co-authored-by: Gleb Morgachev <morgachev.g@gmail.com> * bump version * deploy docker compose set defaults * remove docker volumes * deploy docker compose set missing defaults * update cosmos-sdk * Update cosmos-sdk version * Description --------- Signed-off-by: DimaOrekhovPS <dima.orekhov@productscience.ai> Signed-off-by: Tania Charchian <tatiana.charchian@productscience.ai> Co-authored-by: David and Daniil Liberman <da@liberman.net> Co-authored-by: DimaOrekhovPS <dima.orekhov@productscience.ai> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: 0xBECEDA <sakharovasonya1@gmail.com> Co-authored-by: John Long <john.long@productscience.ai> Co-authored-by: Tania Charchian <tatiana.charchian@productscience.ai> Co-authored-by: GLiberman <gabriel@liberman.net>
* Update cosmos-sdk * Restrict pub key changes * bump cosmos-sdk * Instruction * Clean * backward compatibility * New instruction * Update containers * Fetching blocks
# Upgrade Proposal: v0.2.4

This document outlines the proposed changes for on-chain software upgrade v0.2.4. The `Changes` section details the major modifications, and the `Upgrade Plan` section describes the process for applying these changes.

## Upgrade Plan

This PR updates the code for the `api` and `node` services and modifies the container versions in `deploy/join/docker-compose.yml`. The binary versions will be updated via an on-chain upgrade proposal. For more information on the upgrade process, refer to `/docs/upgrades.md`.

Existing participants are **not** required to upgrade their `api` and `node` containers. The updated container versions are intended for new participants who join after the on-chain upgrade is complete.

**Proposed Process:**

1. Active participants review this proposal on GitHub.
2. Once the PR is approved by a majority, a `v0.2.4` release will be created from this branch, and an on-chain upgrade proposal for this version will be submitted.
3. If the on-chain proposal is approved, this PR will be merged immediately after the upgrade is executed on-chain.

Creating the release from this branch (instead of `main`) minimizes the time that the `/deploy/join/` directory on the `main` branch contains container versions that do not match the on-chain binary versions, ensuring a smoother onboarding experience for new participants.

The new MLNode container `v3.0.10` is fully compatible with `v3.0.9` and can be updated asynchronously at any time. **Important:** to support new features such as model auto-downloading, `HF_HUB_OFFLINE` should be disabled (env variable removed from `deploy/join/docker-compose.mlnode.yml`).

## Testing

### Testnet

The on-chain upgrade from version `v0.2.3-patch2` to `v0.2.4` has been successfully deployed and verified on the testnet. We encourage all reviewers to request access to our testnet environment to validate the upgrade. Alternatively, reviewers can test the on-chain upgrade process on their own private testnets.
## Changes

---

### Training Security

See the commits related to this change [here](gonka-ai@9bf03ec) and [here](gonka-ai@c115a25)

#### Overview

Training in Gonka is not ready for broad use. However, all training messages are currently available, causing both confusion and an attack surface for DoS attacks.

#### Solution

All training-related messages are now behind two allow lists (EXEC to run training tasks, START to initiate training). The lists are controlled via governance votes. This also addresses several Certik audit issues.

---

### Major Certik Audit fixes

This is a set of fixes that address issues found in the Certik Audit that required more substantial changes to the code. References to the specific Certik issue are in parentheses. The overall changes can be seen [here](gonka-ai@ba1657c)

#### Pubkey must match address

When adding a new Participant, the pubkey must match the address (GOC-12). Specific changes [here](gonka-ai@bd00f16)

#### Panic avoidance during EndBlock

Panics during EndBlock will cause _consensus failure_. These are a set of changes to avoid possible panics, either explicitly called or through methods starting with `Must`. (GOC-13, GOI-24) Changes are [here](gonka-ai@c739b1c) and [here](gonka-ai@b3a31ec)

#### Validate inference timestamps on chain

While replay attacks are primarily avoided via signature dedupe for InferenceId, after pruning, older inferences could be replayed. This adds on-chain checks against old inferences being replayed. (GOI-01). Changes are [here](gonka-ai@54da2f5)

#### Pruning Improvements

The first version of pruning was crude and liable to issues as scale increased, since all pruning took place in a single block. This introduces a more scalable version of pruning that prunes a given number of items each block until all items are pruned, and introduces a generic `Pruner` that can be reused for this logic for other items as needed. (GOC-14). 
Changes are [here](gonka-ai@0bf6334) and [here](gonka-ai@dbdfcda)

#### Validation Limits

`MsgValidate` had zero limits, allowing even non-participants to submit invalidations, requiring an expensive chain-wide revalidation each time. Even inadvertent invalidations could cause significant strain on the chain, and have caused one chain halt. There are several fixes to address this:

1. No longer include the `ResponsePayload` in `MsgValidate` (this was 90%+ of the message size)
2. Check that a validator is an active participant AND has the model for the Inference being validated
3. Add limits to the number of Invalidations allowed for each Participant (based on Power and Reputation)

Code is [here](gonka-ai@000f1a7) and [here](gonka-ai@d6c78cf)

---

### Config Management Improvements for API nodes

#### Config Storage

Config management was entirely file based for API nodes. While this was adequate for infrequently or never-changing attributes such as URLs, account addresses, or network settings, it is not performant or safe for frequently changing values such as block height, node data, or seed info.

#### Solution

Introduce a file-based SQLite DB to handle changing values and synchronize them for the API node. Source code is [here](https://github.com/gonka-ai/gonka/pull/385/files/ba1657cc692e9bc61d03b6329e50235dd3225230..6bec0cb82c03e5e8dd90268c50d74328d579b02b)

#### MLNode management API

The REST API is becoming the primary way to manage MLNodes. `node-config.json` is used only at first load. A new endpoint, `PUT /admin/nodes/:id`, is introduced to update MLNode info without deleting it.

---

### Unordered Transaction Timeout fix

#### Context

We use unordered transactions. Each transaction has a TTL: the time window within which the transaction must be executed. 
Some transactions are sent with a retry: we send repeatedly until we confirm that the transaction has made it to the chain.

#### Problem

Under heavy load or other extreme circumstances, chain blocks can begin to slow or even halt. Since transactions are compared with _block time_ for TTL but signed with _node machine time_, the drift would result in all messages being rejected as being ahead of block time. They are then resent, further propagating the error and putting _additional_ strain on the network as it tries to recover.

#### Solution

- Use the latest block time instead of node machine time to sign and verify TTL for transactions
- Detect slow or halted chains and stop sending retries to allow better chain recovery
- Cap the number of retry attempts (at 100 for now)

Source code is [here](gonka-ai@e5d7cb9)

---

### Support small GPUs on TestNet

#### Problem

`testnet` should support smaller GPUs in order to encourage broader participation in testnet, increasing tests and therefore network robustness.

#### Solution

- Add testenv-specific environment variables to detect when the chain is testnet vs mainnet
- Add testenv-specific params to allow different proof-of-compute configurations (enabling smaller GPUs and faster testing)

Full changes are [here](gonka-ai@dbce2ab)

---

### MLNode Management APIs & vLLM Stability Improvements

Far more details are available in the PR description at the top [here](gonka-ai@807a62f), but a summary:

1. Add new endpoints to the ML Node that allow checking the status of model downloads and other properties
2. Add new endpoints to the ML Node that check GPU status and drivers
3. Improvements in vLLM deployment stability, as well as additional status APIs to check on progress
4. Integration of the above with the API node, with background pre-downloading of upcoming epochs and updating hardware on the chain

---

### Health endpoints for MLNode and experimental auto-detection for node setup issues

[here](gonka-ai@7559630)
This document outlines the proposed changes for on-chain software upgrade v0.2.5. The `Changes` section details the major modifications, and the `Upgrade Plan` section describes the process for applying these changes.

## Upgrade Plan

This PR updates the code for the `api` and `node` services and introduces the new service `bridge` for the native bridge with Ethereum. The PR modifies the container versions in `deploy/join/docker-compose.yml`. The binary versions will be updated via an on-chain upgrade proposal. For more information on the upgrade process, refer to [`/docs/upgrades.md`](https://github.com/gonka-ai/gonka/blob/upgrade-v0.2.5/docs/upgrades.md).

Existing hosts are **not** required to upgrade their `api` and `node` containers. The updated container versions are intended for new hosts who join after the on-chain upgrade is complete.

## Proposed Process

1. Active hosts review this proposal on GitHub.
2. Once the PR is approved by a majority, a `v0.2.5` release will be created from this branch, and an on-chain upgrade proposal for this version will be submitted.
3. If the on-chain proposal is approved, this PR will be merged immediately after the upgrade is executed on-chain.

Creating the release from this branch (instead of `main`) minimizes the time that the `/deploy/join/` directory on the `main` branch contains container versions that do not match the on-chain binary versions, ensuring a smoother onboarding experience for new hosts.

The `bridge` container can be started any time after the upgrade by:

1. Pulling the latest changes from the `main` branch (after `upgrade-v0.2.5` is merged):
   ```
   git pull
   ```
2. Starting the service:
   ```
   source config.env && docker compose up bridge -d
   ```

It'll take some time to synchronize.

The new MLNode container `v3.0.11` is fully compatible with `v3.0.10` and can be updated asynchronously at any time. Additionally, the version `v3.0.11-blackwell` is introduced for Blackwell GPUs (CUDA 12.8+ required).
## Further Steps

The PR introduces 3 contracts:

- [liquidity pool](https://github.com/gonka-ai/gonka/tree/upgrade-v0.2.5/inference-chain/contracts/liquidity-pool/)
- [wrapped token](https://github.com/gonka-ai/gonka/tree/upgrade-v0.2.5/inference-chain/contracts/wrapped-token/)
- [Ethereum contract](https://github.com/gonka-ai/gonka/blob/upgrade-v0.2.5/proposals/ethereum-bridge-contact/BridgeContract.sol)

All contracts might be proposed for voter approval via separate proposals.

## Testing

### Testnet

The on-chain upgrade from version `v0.2.4` to `v0.2.5` has been successfully deployed and verified on the testnet. We encourage all reviewers to request access to our testnet environment to validate the upgrade. Alternatively, reviewers can test the on-chain upgrade process on their own private testnets.

## Migration

The on-chain migration logic and default values for new parameters are defined in [`upgrades.go`](https://github.com/gonka-ai/gonka/blob/upgrade-v0.2.5/inference-chain/app/upgrades/v0_2_5/upgrades.go). Specific data migrations are implemented in:

- [`migrations_confirmation_weight.go`](https://github.com/gonka-ai/gonka/blob/upgrade-v0.2.5/inference-chain/x/inference/keeper/migrations_confirmation_weight.go): Initializes confirmation weights for the current epoch.
- [`migrations_bridge.go`](https://github.com/gonka-ai/gonka/blob/upgrade-v0.2.5/inference-chain/x/inference/keeper/migrations_bridge.go): Removes legacy bridge state and artifacts.

**Note on Inactive Participant Exclusion:** The parameters for the continuous exclusion of inactive participants (SPRT) are initialized with values that effectively disable the mechanism (requiring ~32k consecutive failures). This ensures the feature remains inactive until explicitly enabled via governance.

**Note on Confirmation PoC:** The Confirmation PoC parameters are initialized to require 1 Confirmation PoC per Epoch.

## Changes

---

### Native Bridge

Commit: [f7470c1](gonka-ai@168f7a8). 
This commit introduces primitives for the native bridge to the Ethereum blockchain and contracts for its integration. Details can be found [here](https://github.com/gonka-ai/gonka/blob/upgrade-v0.2.5/proposals/governance-artifacts/update-v0.2.5/bridge.md).

---

### BLS Signature fix

Commit: [f7470c1](gonka-ai@f7470c1).

This commit fixes a bug in BLS Group Public Key generation.

---

### Participant Status Update

Commit: [1010622](gonka-ai@1010622)

This commit fixes the procedure for removing invalid and unavailable hosts from the EpochGroup. It also introduces a mechanism for continuously excluding inactive participants using SPRT. Details: [here](https://github.com/gonka-ai/gonka/blob/upgrade-v0.2.5/proposals/invalid-participant-exclusion/README.md)

---

### Confirmation PoC

Commit: [e9dbf13](gonka-ai@e9dbf13)

This commit introduces Random Confirmation PoC, a new layer to verify that inference-serving nodes maintain computational capacity throughout the epoch. Details: [here](https://github.com/gonka-ai/gonka/blob/upgrade-v0.2.5/proposals/random-poc/README.md)

---

### New Schedule for `POC_SLOT=true` (nodes that serve inference during PoC)

Commit: [9ce1b60](gonka-ai@9ce1b60)

The chain automatically assigns a portion of MLNodes to serve inference during the next PoC phase to keep inference working. The initial version assigned 50% of weight per participant per model. To raise security, this commit proposes allocating `POC_SLOT=true` by model weight percentages instead of per-participant halves, using a random subset of participants who served this model in the previous epoch. Details: [here](https://github.com/gonka-ai/gonka/blob/upgrade-v0.2.5/proposals/poc-schedule-v2/README.md)

---

### Blackwell Support for MLNode and Fixes

Commit: [b77dcac](gonka-ai@b77dcac)

Fix for vLLM to support Blackwell GPUs (tested on B200). 
---

### Account transfer fix

Commit: [4228a70](gonka-ai@4228a70)

The commit fixes a bug that used the full account balance for transfers instead of the spendable amount. Now locked coins and spendable coins are transferred separately.

---

### Paginator fix for `GetMembers`

Commit: [bbce9f4](gonka-ai@bbce9f4)

The commit fixes a bug with a missing paginator in the `GetMembers` function, which caused only a subset of miners to be selected. This could leave a validator showing an "unknown" status.

---

### MLNode status check fixes, retry mechanism

Commit: [1352c13](gonka-ai@1352c13)

The commit fixes the MLNode status check to assign the status "FAILED" if the node is not responding. Additionally, it adds a retry mechanism for when a host has multiple MLNodes for the model.

---

### Recalculate total weight after punishment

Commit: [cf2d393](gonka-ai@cf2d393)

The commit fixes a bug where undistributed rewards paid to the first host included rewards for invalid participants.
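The account transfer fix above can be illustrated with a toy model: a plain bank send can only move *spendable* coins, so attempting to transfer the full balance fails whenever part of it is locked (e.g. by vesting). This Go sketch uses hypothetical types and amounts purely for illustration; the real fix operates on Cosmos SDK bank and vesting state.

```go
package main

import (
	"errors"
	"fmt"
)

// account is a toy model: balance is the total, locked the portion
// that is not yet spendable.
type account struct {
	balance int64
	locked  int64
}

func (a account) spendable() int64 { return a.balance - a.locked }

// send mimics a bank transfer that rejects amounts above spendable,
// which is why transferring the full balance used to fail.
func send(a *account, amount int64) error {
	if amount > a.spendable() {
		return errors.New("insufficient spendable funds")
	}
	a.balance -= amount
	return nil
}

func main() {
	a := account{balance: 1000, locked: 300}
	// Buggy behavior: move the full balance in one send -> error.
	fmt.Println(send(&a, a.balance))
	// Fixed behavior: move the spendable portion normally; locked
	// coins are migrated through a separate path.
	fmt.Println(send(&a, a.spendable()), a.locked)
}
```

Splitting the transfer means the spendable part always succeeds, and the locked part can be moved with whatever mechanism the vesting rules require.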