Commit 0f25c85

docs(Echo): add performance optimization roadmap and contribution guide
Added comprehensive TODO.md outlining prioritized enhancements for the Echo scheduler component of Mountain's task execution system. This document:

- Aligns with Land's focus on extreme performance in the Rust backend
- Targets key areas: work-stealing optimization (fastrand), async worker sleep (tokio::Notify)
- Proposes advanced features like CPU affinity and LIFO slots to reduce latency
- Establishes benchmarking foundation via criterion for data-driven optimizations
- Prepares observability through tracing integration and metrics collection

Provides structured path for community contributions while maintaining architectural coherence with Mountain's async task processing model. Serves as a reference for upcoming scheduler improvements critical to editor responsiveness.
1 parent 6d265a0 commit 0f25c85

1 file changed

Lines changed: 249 additions & 0 deletions

docs/TODO.md

@@ -0,0 +1,249 @@
### **Help Us Boost Performance: A Call for Contributions!** 🚀

`Echo` is built on a high-performance foundation, but there's always room to
push the boundaries of speed and efficiency. We invite the community to help us
implement the following performance enhancements. These are fantastic
opportunities to learn about low-level systems programming and make a real
impact on the project.

---

### **Level 1: Quick Wins & Low-Hanging Fruit** ✅

These tasks are ideal for first-time contributors. They provide measurable
performance gains with low implementation complexity.

#### **TODO 1: Implement Faster Random Number Generation**

- **The Goal:** Replace the cryptographically secure (and slower) `rand::rng()`
  with a much faster non-cryptographic pseudo-random number generator (PRNG) for
  selecting steal victims.
- **The Problem:** The current `Steal` method in `Queue::StealingQueue` uses
  `rand::rng()`, a thread-local, cryptographically secure generator that is
  seeded (and periodically reseeded) from the OS. That strength is overkill for
  picking a steal victim and adds unnecessary overhead in a hot loop.
- **Proposed Solution:**
    1. Add the `fastrand` crate as a dependency.
    2. Modify the `Steal` method in `Echo/Source/Queue/StealingQueue.rs` to use
       `fastrand::usize(..)`, bounded by the number of candidate queues, for
       choosing the starting index for stealing.
    3. Remove the `rand` crate dependency if it's no longer used elsewhere.
- **Impact:** Reduces random-number generation overhead during steal attempts,
  improving performance when workers are contending for tasks.
- **Difficulty:** Low
- **Skills:** Basic Rust, dependency management.
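
The victim-selection idea can be sketched without any dependency. The block below is a minimal illustration and not Echo's actual `Steal` method: `XorShift64` is a std-only stand-in for `fastrand`, and `victim_order`, `me`, and `workers` are hypothetical names. With `fastrand` itself, the starting index would presumably come from `fastrand::usize(..workers)`.

```rust
/// Tiny xorshift64 generator; a std-only stand-in for `fastrand` in this sketch.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // Xorshift must never hold an all-zero state, so substitute a constant.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Index in `0..n` (requires `n > 0`). The slight modulo bias is
    /// acceptable when all we need is a varied starting point.
    pub fn index(&mut self, n: usize) -> usize {
        (self.next_u64() % n as u64) as usize
    }
}

/// Order in which worker `me` probes its peers: start at a random index,
/// wrap around once, and skip the worker's own queue.
pub fn victim_order(rng: &mut XorShift64, me: usize, workers: usize) -> Vec<usize> {
    if workers < 2 {
        return Vec::new(); // nobody to steal from
    }
    let start = rng.index(workers);
    (0..workers)
        .map(|offset| (start + offset) % workers)
        .filter(|&victim| victim != me)
        .collect()
}
```

Randomizing only the starting index keeps the cost to one PRNG call per steal attempt while still spreading contention across victims.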
33+
34+
---
35+
36+
### Level 2: Architectural Enhancements 🌍
37+
38+
This task involves a more significant change to the scheduler's core logic,
39+
focusing on improving idle performance and latency.
40+
41+
#### **TODO 2: Implement True Worker Sleep with a Notification System**
42+
43+
- **The Goal:** Eliminate busy-waiting in idle workers. Instead of `sleep(1ms)`,
44+
workers should enter a deep, OS-level sleep and only be woken up when new work
45+
is available.
46+
- **The Problem:** The current `tokio::time::sleep()` loop in an idle `Worker`
47+
consumes CPU cycles and introduces up to 1ms of latency for newly submitted
48+
tasks.
49+
- **Proposed Solution:**
50+
1. Introduce a `tokio::sync::Notify` primitive into the
51+
`Queue::StealingQueue::Share` struct.
52+
2. In `Queue::StealingQueue::Submit()`, after a task is successfully pushed
53+
to an injector, call `Notifier.notify_one()` to wake up a single
54+
sleeping worker.
55+
3. In `Scheduler::Worker::Run()`, replace the `else { sleep(...) }` block
56+
with a call to `await` the `Notifier.notified()` future.
57+
- **Impact:** Drastically reduces CPU usage for an idle scheduler and minimizes
58+
latency for the first task submitted to an idle system. This is crucial for
59+
GUI applications and servers with bursty workloads.
60+
- **Difficulty:** Medium
61+
- **Skills:** `async` Rust, understanding of `tokio` synchronization primitives.
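
The park-until-notified pattern can be illustrated with the standard library alone. The sketch below is a thread-based analog built on `Mutex` and `Condvar`, not the `tokio::sync::Notify` integration proposed above. `WakeSignal` and `park` are hypothetical names, and the stored boolean only approximates the single-permit behaviour of `notify_one()`.

```rust
use std::sync::{Condvar, Mutex};

/// Single-permit wake signal (hypothetical type for this sketch).
pub struct WakeSignal {
    permit: Mutex<bool>,
    wake: Condvar,
}

impl WakeSignal {
    pub fn new() -> Self {
        WakeSignal {
            permit: Mutex::new(false),
            wake: Condvar::new(),
        }
    }

    /// Submit path: call after a task has been pushed to a queue.
    /// The permit is stored even when no worker is currently parked.
    pub fn notify_one(&self) {
        let mut permit = self.permit.lock().unwrap();
        *permit = true;
        self.wake.notify_one();
    }

    /// Idle path: block the calling thread until a permit is available,
    /// then consume it. The loop guards against spurious wake-ups.
    pub fn park(&self) {
        let mut permit = self.permit.lock().unwrap();
        while !*permit {
            permit = self.wake.wait(permit).unwrap();
        }
        *permit = false;
    }
}
```

A worker loop would call `park()` only after finding every queue empty, and re-check the queues once it returns.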
62+
63+
---
64+
65+
### Level 3: Expert-Level Tuning & Measurement ⚙️
66+
67+
These tasks are for experienced developers who are passionate about
68+
systems-level performance, benchmarking, and hardware affinity.
69+
70+
#### **TODO 3: Establish a Comprehensive Benchmarking Suite**
71+
72+
- **The Goal:** Create a suite of benchmarks using the `criterion` crate to
73+
rigorously measure scheduler performance and validate the impact of
74+
optimizations.
75+
- **The Problem:** Without benchmarks, we are "flying blind." We cannot prove
76+
that changes are actually improving performance.
77+
- **Proposed Solution:**
78+
1. Add `criterion = { version = "0.5", features = ["async_tokio"] }` as a
79+
`[dev-dependency]`.
80+
2. Create a `benches/` directory at the root of the project.
81+
3. Implement several benchmark scenarios in `benches/scheduler_bench.rs`,
82+
such as:
83+
- **Throughput:** Measure the time to submit and execute a massive
84+
number of tiny tasks (e.g., 1,000,000).
85+
- **Latency:** Measure the time from `Submit()` to completion for a
86+
single task on an idle scheduler.
87+
- **Contention:** Benchmark performance when all workers are heavily
88+
contending for tasks from the global queue.
89+
- **Impact:** Provides the entire project with a tool to make data-driven
90+
performance decisions. This is foundational for any serious performance work.
91+
- **Difficulty:** Medium
92+
- **Skills:** Benchmarking practices, `criterion` usage, `async` benchmarking.
93+
94+
#### **TODO 4: Implement CPU Core Affinity (Thread Pinning)**
95+
96+
- **The Goal:** Allow the `Scheduler` to pin each of its worker threads to a
97+
specific CPU core.
98+
- **The Problem:** On modern multi-socket servers (NUMA architecture), a thread
99+
running on one CPU that accesses memory allocated by another CPU suffers a
100+
significant performance penalty. OS thread scheduling can also migrate threads
101+
between cores, causing cache misses.
102+
- **Proposed Solution:**
103+
1. Add the `core_affinity` crate as a dependency.
104+
2. In `Scheduler::Scheduler::Create()`, before spawning each `tokio` task,
105+
get the list of available core IDs.
106+
3. Use `tokio::task::spawn_blocking` in conjunction with
107+
`core_affinity::set_for_current()` to pin the thread to a specific core
108+
ID before starting the `Worker::Run` async loop. This is a complex task
109+
that requires careful integration with Tokio's threading model.
110+
- **Impact:** Can provide a massive performance boost on server-grade hardware
111+
by maximizing cache locality and eliminating NUMA cross-socket memory access
112+
penalties. This is an expert-level optimization for achieving bare-metal
113+
performance.
114+
- **Difficulty:** High
115+
- **Skills:** Deep understanding of OS schedulers, CPU architecture (NUMA), and
116+
the `tokio` runtime's threading model.
117+
118+
### **Level 4: Advanced Scheduling Logic & Fairness** 🧠
119+
120+
This level moves beyond raw speed and into the "intelligence" of the scheduler,
121+
focusing on fairness and preventing common concurrency pitfalls.
122+
123+
#### **TODO 5: Implement LIFO Slot for Recently Awoken Tasks**
124+
125+
- **The Goal:** Improve the performance of "ping-pong" workloads, where a task
126+
awaits a short I/O operation and then immediately needs to run again.
127+
- **The Problem:** When an `async` task completes an I/O operation (e.g., a
128+
database query), its `Waker` is called, and it gets re-submitted to the
129+
scheduler. This often pushes it to the back of a global queue, adding
130+
unnecessary latency.
131+
- **Proposed Solution:**
132+
1. Add a special, single-element "LIFO slot" to each `Worker`'s local
133+
state. This slot is separate from the `crossbeam-deque`.
134+
2. When a task is awoken by its `Waker`, instead of being pushed to the
135+
global `Injector`, it is placed directly into the LIFO slot of the _same
136+
worker_ that was running it before.
137+
3. Modify the `Worker::Run` loop to check this LIFO slot _before_ checking
138+
its main local deques.
139+
- **Impact:** Dramatically improves cache locality and reduces latency for
140+
I/O-bound tasks that frequently yield and resume. This is a key feature of the
141+
Tokio runtime itself.
142+
- **Difficulty:** High
143+
- **Skills:** Deep understanding of `async` `Future`s, `Waker`, and `Context`
144+
interaction.
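
The slot-before-deque lookup order can be sketched as a small data structure. This is only an illustration under simplifying assumptions: `LocalState`, `push_woken`, and `next_task` are hypothetical names, a `VecDeque` stands in for the worker's `crossbeam-deque`, and the hard part (routing a `Waker` call back to the right worker) is left out entirely.

```rust
use std::collections::VecDeque;

/// Per-worker state: one LIFO slot plus a regular local queue.
pub struct LocalState<T> {
    lifo_slot: Option<T>,
    local: VecDeque<T>,
}

impl<T> LocalState<T> {
    pub fn new() -> Self {
        LocalState {
            lifo_slot: None,
            local: VecDeque::new(),
        }
    }

    /// Ordinary submission goes to the back of the local queue.
    pub fn push(&mut self, task: T) {
        self.local.push_back(task);
    }

    /// A freshly woken task takes the slot. Any task already sitting
    /// there is demoted to the back of the local queue, not dropped.
    pub fn push_woken(&mut self, task: T) {
        if let Some(previous) = self.lifo_slot.replace(task) {
            self.local.push_back(previous);
        }
    }

    /// The run loop checks the slot first, then the local queue.
    pub fn next_task(&mut self) -> Option<T> {
        self.lifo_slot.take().or_else(|| self.local.pop_front())
    }
}
```

Because the slot holds a single task, the most recently woken task always runs next, while older woken tasks fall back to normal queue order.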
145+
146+
#### **TODO 6: Introduce an Anti-Starvation Mechanism**
147+
148+
- **The Goal:** Prevent low-priority tasks from _never_ running on a perpetually
149+
busy scheduler.
150+
- **The Problem:** If there is a constant, high-volume stream of `High` and
151+
`Normal` priority tasks, the scheduler's logic will always prefer them,
152+
potentially causing `Low` priority tasks to "starve" and never get a chance to
153+
execute.
154+
- **Proposed Solution:**
155+
1. Add a counter to each `Worker`'s state.
156+
2. Every N tasks a worker completes (e.g., every 61 tasks, a prime number
157+
to avoid harmonic issues), the worker is _forced_ to try stealing one
158+
task specifically from the `Low` priority queue system.
159+
3. If it finds a `Low` priority task, it executes it, then resets its
160+
counter and returns to its normal scheduling logic.
161+
- **Impact:** Guarantees fairness and ensures that even on a fully loaded
162+
system, background and maintenance tasks will eventually make progress.
163+
- **Difficulty:** Medium
164+
- **Skills:** State management within the worker loop, algorithm design.
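
The counter logic is small enough to sketch in isolation. The names below (`FairnessCounter`, `Pick`, `LOW_PRIORITY_INTERVAL`) are hypothetical. The sketch also assumes the counter resets on every forced check, whether or not a `Low` task is actually found, which is a simplification of step 3.

```rust
/// Interval from the example above: every 61 completed tasks.
pub const LOW_PRIORITY_INTERVAL: u32 = 61;

/// Where the worker should look for its next task.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Pick {
    /// Follow the usual priority order.
    Normal,
    /// Try the `Low` priority queues first.
    ForceLow,
}

/// Per-worker counter of tasks completed since the last forced check.
pub struct FairnessCounter {
    completed: u32,
}

impl FairnessCounter {
    pub fn new() -> Self {
        FairnessCounter { completed: 0 }
    }

    /// Record one completed task and decide where the next one comes from.
    pub fn on_task_completed(&mut self) -> Pick {
        self.completed += 1;
        if self.completed >= LOW_PRIORITY_INTERVAL {
            // Assumption: reset even if the forced steal finds nothing.
            self.completed = 0;
            Pick::ForceLow
        } else {
            Pick::Normal
        }
    }
}
```

Keeping the counter in worker-local state means the check needs no atomics or cross-thread coordination.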
165+
166+
---
167+
168+
### **Level 5: Observability & Introspection** 🔬
169+
170+
A high-performance system is a black box without good tooling. This level is
171+
about adding the tools needed to understand, debug, and profile the scheduler's
172+
behavior in real-time.
173+
174+
#### **TODO 7: Expose Internal Metrics**
175+
176+
- **The Goal:** Provide a mechanism to query the state and performance of the
177+
scheduler at runtime.
178+
- **The Problem:** It's currently impossible to know how many tasks are queued,
179+
how many steals have occurred, or how busy each worker is.
180+
- **Proposed Solution:**
181+
1. Create a `SchedulerMetrics` struct containing `AtomicUsize` counters for
182+
various events (e.g., `tasks_submitted`, `tasks_completed`,
183+
`steals_succeeded`, `steals_failed`, `workers_parked`).
184+
2. Add an `Arc<SchedulerMetrics>` to the `Scheduler` and pass clones to
185+
each `Worker`.
186+
3. Instrument the code: increment the appropriate counters at key points
187+
(e.g., in `Submit`, `Run`, `Steal`).
188+
4. Add a `Scheduler::Metrics()` method that returns a snapshot of the
189+
current metrics, allowing external tools to monitor the scheduler's
190+
health.
191+
- **Impact:** Enables powerful debugging, monitoring, and auto-scaling
192+
decisions. It transforms the scheduler from a black box into a transparent,
193+
observable system.
194+
- **Difficulty:** Medium
195+
- **Skills:** Concurrency primitives (`AtomicUsize`), API design.
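
A minimal sketch of the counter struct and its snapshot might look like the following. The field names come from the list above. `MetricsSnapshot` and the `snapshot` method are hypothetical, and `Ordering::Relaxed` is an assumption that fits counters used only for monitoring.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Shared counters; intended to be wrapped in an `Arc` and cloned to workers.
#[derive(Default)]
pub struct SchedulerMetrics {
    pub tasks_submitted: AtomicUsize,
    pub tasks_completed: AtomicUsize,
    pub steals_succeeded: AtomicUsize,
    pub steals_failed: AtomicUsize,
    pub workers_parked: AtomicUsize,
}

/// Plain-value copy of the counters (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MetricsSnapshot {
    pub tasks_submitted: usize,
    pub tasks_completed: usize,
    pub steals_succeeded: usize,
    pub steals_failed: usize,
    pub workers_parked: usize,
}

impl SchedulerMetrics {
    /// Read every counter once. Each load is atomic on its own, but the
    /// snapshot as a whole is not a consistent cut across all counters.
    pub fn snapshot(&self) -> MetricsSnapshot {
        MetricsSnapshot {
            tasks_submitted: self.tasks_submitted.load(Ordering::Relaxed),
            tasks_completed: self.tasks_completed.load(Ordering::Relaxed),
            steals_succeeded: self.steals_succeeded.load(Ordering::Relaxed),
            steals_failed: self.steals_failed.load(Ordering::Relaxed),
            workers_parked: self.workers_parked.load(Ordering::Relaxed),
        }
    }
}
```

Instrumented code would then call, for example, `metrics.tasks_submitted.fetch_add(1, Ordering::Relaxed)` at the relevant point.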
196+
197+
#### **TODO 8: Integrate with `tracing` for Granular Timings**
198+
199+
- **The Goal:** Provide detailed, structured logs and timing information about
200+
the entire lifecycle of a task, compatible with modern observability platforms
201+
like Jaeger or Datadog.
202+
- **The Problem:** `log` is good for simple messages, but `tracing` allows for
203+
structured, hierarchical "spans" that can measure the duration of specific
204+
operations.
205+
- **Proposed Solution:**
206+
1. Replace the `log` crate with the `tracing` crate.
207+
2. Wrap key operations in `tracing::span!` macros. For example:
208+
- Create a span in `Scheduler::Submit` that gets a unique task ID.
209+
- The `Worker` can "enter" this span when it begins executing the task.
210+
- Create sub-spans for `PopLocal` and `StealFromSystem` to see where
211+
time is being spent.
212+
3. The application using the `Echo` library can then configure a `tracing`
213+
subscriber to export this data to performance analysis tools.
214+
- **Impact:** Provides unparalleled insight into performance bottlenecks. You
215+
can visually see how long tasks wait in the queue versus how long they take to
216+
execute.
217+
- **Difficulty:** Medium
218+
- **Skills:** `tracing` crate API, structured logging concepts.
219+
220+
---
221+
222+
### **Level 6: Modular Extensibility** 🧩
223+
224+
This level focuses on making the scheduler more flexible and adaptable to
225+
different kinds of workloads.
226+
227+
#### **TODO 9: Support for Named Queues and Concurrency Limits**
228+
229+
- **The Goal:** Fully implement the `SchedulerBuilder::Queue()` API to allow
230+
users to create separate, named execution pools within the same scheduler,
231+
each with its own concurrency limit.
232+
- **The Problem:** The current scheduler has one unified pool of workers. Some
233+
applications need to limit concurrency for specific types of tasks (e.g.,
234+
"only allow 4 concurrent disk I/O operations").
235+
- **Proposed Solution:** This is a major architectural challenge.
236+
1. The `SchedulerBuilder` would collect configurations for named queues.
237+
2. The `Scheduler` would need to maintain multiple `Queue` systems or tag
238+
tasks with a queue name.
239+
3. A "supervisor" or "dispatcher" component would be needed. When a worker
240+
becomes free, it wouldn't just steal from anywhere; it would ask the
241+
dispatcher which queue it should service based on current concurrency
242+
levels. This might involve using `tokio::sync::Semaphore` to manage
243+
concurrency limits for each named queue.
244+
- **Impact:** Transforms `Echo` from a general-purpose scheduler into a highly
245+
sophisticated runtime capable of managing complex, heterogeneous workloads
246+
with fine-grained control.
247+
- **Difficulty:** Very High
248+
- **Skills:** Advanced architectural design, complex state management, deep
249+
knowledge of concurrency patterns and primitives.
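
The per-queue concurrency limit can be illustrated with a non-blocking permit counter. This is a std-only analog of the idea, not the `tokio::sync::Semaphore` approach mentioned above, and `QueueLimit`, `Permit`, and `try_acquire` are hypothetical names. A dispatcher could call `try_acquire` to decide whether a named queue may be serviced right now.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Concurrency limit for one named queue.
pub struct QueueLimit {
    limit: usize,
    in_flight: AtomicUsize,
}

/// RAII permit: dropping it frees one slot in the owning queue.
pub struct Permit<'a> {
    owner: &'a QueueLimit,
}

impl QueueLimit {
    pub fn new(limit: usize) -> Self {
        QueueLimit {
            limit,
            in_flight: AtomicUsize::new(0),
        }
    }

    /// Number of tasks from this queue currently running.
    pub fn in_flight(&self) -> usize {
        self.in_flight.load(Ordering::Acquire)
    }

    /// Claim a slot if the queue is below its limit; never blocks.
    pub fn try_acquire(&self) -> Option<Permit<'_>> {
        let mut current = self.in_flight.load(Ordering::Acquire);
        loop {
            if current >= self.limit {
                return None; // queue is saturated, service another one
            }
            match self.in_flight.compare_exchange_weak(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Some(Permit { owner: self }),
                Err(observed) => current = observed, // raced, retry
            }
        }
    }
}

impl Drop for Permit<'_> {
    fn drop(&mut self) {
        self.owner.in_flight.fetch_sub(1, Ordering::Release);
    }
}
```

Tying the slot to a guard value means a task that panics or returns early still releases its slot when the permit is dropped.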
