Skip to content

Remove no listeners#717

Closed
zbirenbaum wants to merge 8 commits intoTraceMachina:mainfrom
zbirenbaum:remove-no-listeners
Closed

Remove no listeners#717
zbirenbaum wants to merge 8 commits intoTraceMachina:mainfrom
zbirenbaum:remove-no-listeners

Conversation

@zbirenbaum
Copy link
Copy Markdown
Contributor

@zbirenbaum zbirenbaum commented Mar 4, 2024

Fixes #338

There might be a better way to do this if listeners can be checked before trying to send off the result of an action to them, but I don't know of a way to do that so I'm opening this as a draft. Also was wondering what the easiest way to test for this behavior would be, as I couldn't find a test case for actions without listeners.

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

This change is Reviewable

@vercel
Copy link
Copy Markdown

vercel Bot commented Mar 4, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
nativelink-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Mar 4, 2024 3:47am

@zbirenbaum zbirenbaum marked this pull request as draft March 4, 2024 03:26
Copy link
Copy Markdown
Contributor

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r5.
Reviewable status: 0 of 1 LGTMs obtained

a discussion (no related file):
We have a couple options here.

  1. We could call notify_channel.send() to tell the worker to terminate the job.
  2. We could cancel only queued actions.

I prefer 1, but it'll be more difficult.


a discussion (no related file):
At this phase, I suggest writing some tests.



nativelink-config/src/schedulers.rs line 124 at r6 (raw file):

    /// Remove action from queue after this much time has elapsed without a listener
    /// amount of time in seconds.

nit: End the sentence above and make this line say:
Units in seconds.


nativelink-scheduler/src/simple_scheduler.rs line 356 at r6 (raw file):

                    self.queued_actions_set.insert(action_info.clone());
                    self.queued_actions.insert(action_info.clone(), awaited_action);
                    self.metrics.time_without_listeners.reset();

Don't store this number on metrics. Metrics is only for statistics. We should not rely on it for any required features. Users can disable all metrics everywhere, so this may cause issues in weird ways.

@zbirenbaum
Copy link
Copy Markdown
Contributor Author

Reviewed 1 of 1 files at r5.
Reviewable status: 0 of 1 LGTMs obtained

a discussion (no related file): We have a couple options here.

1. We could call `notify_channel.send()` to tell the worker to terminate the job.

2. We could cancel only queued actions.

I prefer 1, but it'll be more difficult.

a discussion (no related file): At this phase, I suggest writing some tests.

nativelink-config/src/schedulers.rs line 124 at r6 (raw file):

    /// Remove action from queue after this much time has elapsed without a listener
    /// amount of time in seconds.

nit: End the sentence above and make this line say: Units in seconds.

nativelink-scheduler/src/simple_scheduler.rs line 356 at r6 (raw file):

                    self.queued_actions_set.insert(action_info.clone());
                    self.queued_actions.insert(action_info.clone(), awaited_action);
                    self.metrics.time_without_listeners.reset();

Don't store this number on metrics. Metrics is only for statistics. We should not rely on it for any required features. Users can disable all metrics everywhere, so this may cause issues in weird ways.

Got it, my instinct was that using notifications would be a better way but need to read more into how to get it working, the original issue only referenced the queue so got a little confused. I think doing it right will pay dividends in the long run so lets go with number 1.

Copy link
Copy Markdown
Contributor

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), macos-13, ubuntu-22.04, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04

a discussion (no related file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

At this phase, I suggest writing some tests.

Please reword the github description, as it will become the commit message.


Copy link
Copy Markdown
Contributor

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 4 files at r7.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), macos-13, ubuntu-22.04, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04


nativelink-scheduler/src/simple_scheduler.rs line 210 at r8 (raw file):

    workers: Workers,
    active_actions: HashMap<Arc<ActionInfo>, RunningAction>,
    actions_no_listeners: HashMap<Arc<ActionInfoHashKey>, SystemTime>,

nit: Lets track this in field in AwaitedAction with something like last_client_event. This field will need to be updated on every connect of a client and a timer will need to update it periodically if it still has listeners.


nativelink-scheduler/src/worker.rs line 208 at r8 (raw file):

    }

    pub fn drop_action(&mut self, action_info: &Arc<ActionInfo>) {

nit: We don't want to do anything here, we need to worker to send the RPC message back to the scheduler saying it errored/completed and let the normal flow take it.


nativelink-util/src/action_messages.rs line 634 at r8 (raw file):

    /// Worker is executing the action.
    Executing,
    // TODO(zbirenbaum) Add dropped status and rpc handling for actions which were cancelled due to disconnection

nit: This is not the RPC message you are looking for, check worker_api.proto.

Copy link
Copy Markdown
Contributor

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), macos-13, ubuntu-22.04, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04


nativelink-scheduler/src/simple_scheduler.rs line 61 at r2 (raw file):

/// Default timeout for actions without any listeners
/// If this changes, remember to change the documentation in the config.
const DEFAULT_NO_LISTENERS_TIMEOUT_S: u64 = 10;

nit: Lets make this 60 seconds.

@zbirenbaum
Copy link
Copy Markdown
Contributor Author

Closing this and continuing in #778 which is much more polished and addresses most of the problems here

@zbirenbaum zbirenbaum closed this Mar 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Actions with no listeners that are old should be removed from queue

2 participants