Skip to content

feat: refactor ceramic simulations and new benchmark for V' features#149

Merged
dav1do merged 10 commits into
mainfrom
feat/ws1-1438-mid-simulation
Feb 21, 2024
Merged

feat: refactor ceramic simulations and new benchmark for V' features#149
dav1do merged 10 commits into
mainfrom
feat/ws1-1438-mid-simulation

Conversation

@dav1do

@dav1do dav1do commented Feb 14, 2024

Copy link
Copy Markdown
Contributor

Linear ws1-1438

Summary

The main goal was to add a new simulation to benchmark Model Instance Document creation for the E2E ceramic V' (recon) system. I ended up making a bunch of changes along the way and can potentially split this into multiple PRs if it's too hard to review.

I created a new scenario called ceramic-new-streams-benchmark that creates a model, all nodes index the model, and then 1 node creates new MIDs. We investigate how fast these sync to the other nodes in order to determine if we succeed. As there are no metrics for this available afaik, and the workers don't currently have access to prometheus, I ended up using redis to store before/after data on the workers so the manager can interrogate the results at the end. It respects the throttleRequests and successRequestTarget values.

Details

  • I had to expose the ceramic node admin DID's private key to the simulation in order to interact with the endpoints (i.e. index models). I added a did_from_pk function for this (similar to composedb), but it should probably be moved to rust-ceramic.

  • I corrected a handful of bugs: indexing was not succeeding, and some of the simulations did not seem to do exactly what they were designed to (e.g. running something per node instead of per simulation). I adjusted the global/node/user goose leader functions to support the various control flows, and corrected the requests to log their statuses to see a better summary of the simulation and request breakdown. This does change how the output looks, for example the ceramic simple scenario now shows more request info, even though the behavior should be unchanged.

  • There is now a simulation configuration object that allows setting up the model instance specific details. This includes things like how signing should be done (user specific DID key, DID per node), whether models should be shared across nodes or not. The setup makes sure this happens appropriately, and stores the goose user data/config so that the simulations can proceed without really caring what they're using (they just use the ceramic http client and the models/MIDs in the goose data struct).

    • I refactored the Ceramic simulations to use this config. They are almost all the same with a slight variation and I wanted to take this further, but already got too into the weeds. We could almost have one file to define all the operations (get/update/etc) and parameterized config to allow changing the behavior, which would allow for more interesting simulations without requiring code changes. This would entail exposing some of the config options to the simulation in order to get passed down. Anyway, for now, having a bunch of simulation names is fine.
    • The reason for supporting user DIDs is that the js-ceramic worker threads changes parallelize on the DID, so we wanted a more realistic scenario where not every user in the system was the same.
  • I noticed that the local-scripts didn't work right and fixed them (they relied on local config I had).

  • The new postgres database ignored the devMode flag when setting its resources, now it doesn't.

Results

Results from a 1 minute run on my MBP just to demonstrate that it works:

Worker 1 did not meet the target request rate: 228.16666666666666 < 300 (total requests: 13690 over 60)
Worker 2 did not meet the target request rate: 160.91666666666666 < 300 (total requests: 9655 over 60)

Follow up

When generating random data for models, I ran into an error like the following that crashed js-ceramic. I'm not yet sure what/where the bug is (that is, it might be postgres config/not setting the encoding to utf8), it might be js-ceramic, or something else. I switched to strings as a result but need to follow up.

[2024-02-18T17:27:14.774Z] IMPORTANT: Ceramic API running on 0.0.0.0:7007'
/js-ceramic/node_modules/rxjs/dist/cjs/internal/util/reportUnhandledError.js:13
            throw err;
            ^

error: insert into "kjzl6hvfrbw6ca1eadhvu4l06ssws62ks7gmgzc8285p2xmqaaiw0l57paapfnp" ("controller_did", "created_at", "first_anchored_at", "last_anchored_at", "stream_content", "stream_id", "tip", "updated_at") values ($1, $2, $3, $4, $5, $6, $7, $8) on conflict ("stream_id") do update set "stream_id" = $9,"controller_did" = $10,"stream_content" = $11,"tip" = $12,"last_anchored_at" = $13,"first_anchored_at" = $14,"updated_at" = $15 - unsupported Unicode escape sequence
    at Parser.parseErrorMessage (/js-ceramic/node_modules/pg-protocol/dist/parser.js:287:98)
    at Parser.handlePacket (/js-ceramic/node_modules/pg-protocol/dist/parser.js:126:29)
    at Parser.parse (/js-ceramic/node_modules/pg-protocol/dist/parser.js:39:38)
    at Socket.<anonymous> (/js-ceramic/node_modules/pg-protocol/dist/index.js:11:42)
    at Socket.emit (node:events:518:28)
    at Socket.emit (node:domain:488:12)
    at addChunk (node:internal/streams/readable:559:12)
    at readableAddChunkPushByteMode (node:internal/streams/readable:510:3)
    at Readable.push (node:internal/streams/readable:390:5)
    at TCP.onStreamRead (node:internal/stream_base_commons:190:23) {
  length: 250,
  severity: 'ERROR',
  code: '22P05',
  detail: '\\u0000 cannot be converted to text.',
  hint: undefined,
  position: undefined,
  internalPosition: undefined,
  internalQuery: undefined,
  where: 'JSON data, line 1: ...葎󲼯􊋁򃻐򖕟𭅨򩗒𺽣񎾏򻔻񟢽\\u0000...\n' +
    "unnamed portal parameter $5 = '...'",
  schema: undefined,
  table: undefined,
  column: undefined,
  dataType: undefined,
  constraint: undefined,
  file: 'jsonfuncs.c',
  line: '621',
  routine: 'json_ereport_error'
}

@linear

linear Bot commented Feb 14, 2024

Copy link
Copy Markdown

@dav1do dav1do force-pushed the feat/ws1-1438-mid-simulation branch from 2345de7 to 61544b6 Compare February 16, 2024 19:53
dav1do and others added 7 commits February 16, 2024 15:00
Most of the scenarios are very simlilar. Add a more config based approach, allowing changing whether models and keys are shared or unique. Added naming of requests to get more granular goose metrics.
Optionally share model instance docs across simulation
@dav1do dav1do force-pushed the feat/ws1-1438-mid-simulation branch 2 times, most recently from 3d4c112 to 4c065a3 Compare February 19, 2024 23:41
Includes some refactoring (sorry) and collecting metrics for success.
This scenario doesn't have a prom metric as easily accessible afaik, so I used redis to have the worker capture the values before/after.
@dav1do dav1do force-pushed the feat/ws1-1438-mid-simulation branch from 4c065a3 to a6976f6 Compare February 20, 2024 16:53
@dav1do dav1do marked this pull request as ready for review February 20, 2024 17:29
Comment thread runner/src/scenario/ceramic/model_instance.rs
Comment thread runner/src/scenario/ceramic/mod.rs

@nathanielc nathanielc left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks for taking on this refactor along with the end goal.

As there are no metrics for this available afaik, and the workers don't currently have access to prometheus, I ended up using redis to store before/after data on the workers so the manager can interrogate the results at the end.

This seems fine, using redis makes it quick and easy. My only question is should we invest the time now to add the metrics to the js-ceramic process and expose it to the goose workers? Is that the best long term strategy? Not that we will block this PR on that work but rather should we capture this need as issue and start on it right away?

Comment thread runner/src/scenario/ceramic/mod.rs
Comment thread runner/src/simulate.rs Outdated
@dav1do

dav1do commented Feb 20, 2024

Copy link
Copy Markdown
Contributor Author

This seems fine, using redis makes it quick and easy. My only question is should we invest the time now to add the metrics to the js-ceramic process and expose it to the goose workers? Is that the best long term strategy? Not that we will block this PR on that work but rather should we capture this need as issue and start on it right away?

I think we should definitely have some metrics in js-ceramic that make it easy for someone to tell what's going on with their node but I'm not sure what we currently have. I'm not sure how urgent/important exposing them to the simulations is but it seems like it could be quite useful, especially as we use keramik and the simulations to identify and investigate issues, and add more correctness validation to them.

Right now, correctness checks require doing scenario specific things, but with the shared setup, we could also have some "shared correctness" checks at the end if metrics are exposed. While some validations are going to be scenario specific, having some sanity validations and/or re-publishing specific metrics to the goose output, could simplify review and understanding things. I think having a dashboard of "key metrics" for simulations would be incredibly helpful toward a holistic understanding of the network health and the simulation status/success (especially as we try to include or exclude CAS and other services in the benchmarks). Some of this might be possible already and just require providing an integrated dashboard/steps/etc in order to make it happen.

sometimes hit errors like `failed to get model count: TransactionError: request canceled because throttled load test ended (sending on a closed channel)` so we use our own reqwest client instead of the goose supplied one on the user. this only happened if the `throttleRequests` value is set
@dav1do dav1do added this pull request to the merge queue Feb 21, 2024
Merged via the queue into main with commit ff53def Feb 21, 2024
@dav1do dav1do deleted the feat/ws1-1438-mid-simulation branch February 21, 2024 19:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants