| 1 | +### **Help Us Boost Performance: A Call for Contributions!** 🚀 |
| 2 | + |
| 3 | +`Echo` is built on a high-performance foundation, but there's always room to |
| 4 | +push the boundaries of speed and efficiency. We invite the community to help us |
| 5 | +implement the following performance enhancements. These are fantastic |
| 6 | +opportunities to learn about low-level systems programming and make a real |
| 7 | +impact on the project. |
| 8 | + |
| 9 | +--- |
| 10 | + |
| 11 | +### **Level 1: Quick Wins & Low-Hanging Fruit** ✅ |
| 12 | + |
| 13 | +This task is ideal for first-time contributors. It provides a measurable |
| 14 | +performance gain with low implementation complexity. |
| 15 | + |
| 16 | +#### **TODO 1: Implement Faster Random Number Generation** |
| 17 | + |
| 18 | +- **The Goal:** Replace the cryptographically secure (and slower) `rand::rng()` |
| 19 | + with a much faster non-cryptographic Pseudo-Random Number Generator (PRNG) for |
| 20 | + selecting steal victims. |
| 21 | +- **The Problem:** The current `Steal` method in `Queue::StealingQueue` uses |
| 22 | + `rand::rng()`, a cryptographically secure generator seeded from the OS. That |
| 23 | + is overkill for picking a steal victim and adds overhead in a hot loop. |
| 24 | +- **Proposed Solution:** |
| 25 | + 1. Add the `fastrand` crate as a dependency. |
| 26 | + 2. Modify the `Steal` method in `Echo/Source/Queue/StealingQueue.rs` to use |
| 27 | + `fastrand::usize(..)` for choosing the starting index for stealing. |
| 28 | + 3. Remove the `rand` crate dependency if it's no longer used elsewhere. |
| 29 | +- **Impact:** Reduces random-number overhead during steal attempts, improving |
| 30 | + performance when workers are contending for tasks. |
| 31 | +- **Difficulty:** Low |
| 32 | +- **Skills:** Basic Rust, dependency management. |
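
To make the idea concrete, here is a minimal std-only sketch of choosing a random starting index for stealing with a small non-cryptographic generator. It uses a hand-written xorshift64 purely as a stand-in for what a crate such as `fastrand` would provide; `VictimPicker`, `start_index`, and `steal_order` are hypothetical names, not part of `Echo`.

```rust
/// Small non-cryptographic PRNG (xorshift64). This is only a std-only
/// stand-in for the generator a crate like `fastrand` would supply.
pub struct VictimPicker {
    state: u64,
}

impl VictimPicker {
    /// xorshift must not start from zero, so a zero seed is replaced
    /// with a fixed non-zero constant.
    pub fn new(seed: u64) -> Self {
        Self {
            state: if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed },
        }
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }

    /// Returns an index in `0..worker_count` to start stealing from.
    /// The modulo introduces a tiny bias, which is harmless here.
    pub fn start_index(&mut self, worker_count: usize) -> usize {
        assert!(worker_count > 0, "need at least one worker");
        (self.next_u64() % worker_count as u64) as usize
    }
}

/// Visits every other worker exactly once, beginning at a random offset
/// and skipping the calling worker (`me`).
pub fn steal_order(picker: &mut VictimPicker, me: usize, worker_count: usize) -> Vec<usize> {
    let start = picker.start_index(worker_count);
    (0..worker_count)
        .map(|i| (start + i) % worker_count)
        .filter(|&victim| victim != me)
        .collect()
}
```

Keeping one generator per worker avoids any shared state on the steal path. The random offset matters because it stops every idle worker from hammering the same victim first.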
| 33 | + |
| 34 | +--- |
| 35 | + |
| 36 | +### **Level 2: Architectural Enhancements** 🌍 |
| 37 | + |
| 38 | +This task involves a more significant change to the scheduler's core logic, |
| 39 | +focusing on improving idle performance and latency. |
| 40 | + |
| 41 | +#### **TODO 2: Implement True Worker Sleep with a Notification System** |
| 42 | + |
| 43 | +- **The Goal:** Eliminate polling in idle workers. Instead of looping on |
| 44 | + `sleep(1ms)`, workers should suspend and only be woken up when new work is |
| 45 | + available. |
| 46 | +- **The Problem:** The current `tokio::time::sleep()` loop wakes an idle `Worker` |
| 47 | + every millisecond even when no work exists, which wastes CPU cycles and adds |
| 48 | + up to 1 ms of latency for newly submitted tasks. |
| 49 | +- **Proposed Solution:** |
| 50 | + 1. Introduce a `tokio::sync::Notify` primitive into the |
| 51 | + `Queue::StealingQueue::Share` struct. |
| 52 | + 2. In `Queue::StealingQueue::Submit()`, after a task is successfully pushed |
| 53 | + to an injector, call `Notifier.notify_one()` to wake up a single |
| 54 | + sleeping worker. |
| 55 | + 3. In `Scheduler::Worker::Run()`, replace the `else { sleep(...) }` block |
| 56 | + with a call to `await` the `Notifier.notified()` future. |
| 57 | +- **Impact:** Drastically reduces CPU usage for an idle scheduler and minimizes |
| 58 | + latency for the first task submitted to an idle system. This is crucial for |
| 59 | + GUI applications and servers with bursty workloads. |
| 60 | +- **Difficulty:** Medium |
| 61 | +- **Skills:** `async` Rust, understanding of `tokio` synchronization primitives. |
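
The wake-up protocol in the steps above can be sketched with the standard library alone. The block below uses `Mutex` and `Condvar` as a blocking stand-in for `tokio::sync::Notify`; the `Notifier` type and its `wait` method are hypothetical names for illustration. The stored permit means that a notification arriving before the worker starts waiting is not lost.

```rust
use std::sync::{Condvar, Mutex};

/// Std-only stand-in for the wake-up primitive. In async code,
/// `tokio::sync::Notify` would play this role.
pub struct Notifier {
    /// `true` when a wake-up is pending and has not been consumed yet.
    permit: Mutex<bool>,
    wake: Condvar,
}

impl Notifier {
    pub fn new() -> Self {
        Self {
            permit: Mutex::new(false),
            wake: Condvar::new(),
        }
    }

    /// Called on the submit path after a task has been pushed.
    /// Stores a permit and wakes at most one sleeping worker.
    pub fn notify_one(&self) {
        *self.permit.lock().unwrap() = true;
        self.wake.notify_one();
    }

    /// Called by an idle worker. Blocks without polling until a permit
    /// is available, then consumes it.
    pub fn wait(&self) {
        let mut permit = self.permit.lock().unwrap();
        // The loop guards against spurious wake-ups.
        while !*permit {
            permit = self.wake.wait(permit).unwrap();
        }
        *permit = false;
    }
}
```

A worker would call `wait()` only after finding every queue empty, and should re-check the queues once it wakes, since a single permit can stand for several submissions.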
| 62 | + |
| 63 | +--- |
| 64 | + |
| 65 | +### **Level 3: Expert-Level Tuning & Measurement** ⚙️ |
| 66 | + |
| 67 | +These tasks are for experienced developers who are passionate about |
| 68 | +systems-level performance, benchmarking, and hardware affinity. |
| 69 | + |
| 70 | +#### **TODO 3: Establish a Comprehensive Benchmarking Suite** |
| 71 | + |
| 72 | +- **The Goal:** Create a suite of benchmarks using the `criterion` crate to |
| 73 | + rigorously measure scheduler performance and validate the impact of |
| 74 | + optimizations. |
| 75 | +- **The Problem:** Without benchmarks, we are "flying blind." We cannot prove |
| 76 | + that changes are actually improving performance. |
| 77 | +- **Proposed Solution:** |
| 78 | + 1. Add `criterion = { version = "0.5", features = ["async_tokio"] }` under |
| 79 | + `[dev-dependencies]`. |
| 80 | + 2. Create a `benches/` directory at the project root (bench targets need `harness = false`). |
| 81 | + 3. Implement several benchmark scenarios in `benches/scheduler_bench.rs`, |
| 82 | + such as: |
| 83 | + - **Throughput:** Measure the time to submit and execute a massive |
| 84 | + number of tiny tasks (e.g., 1,000,000). |
| 85 | + - **Latency:** Measure the time from `Submit()` to completion for a |
| 86 | + single task on an idle scheduler. |
| 87 | + - **Contention:** Benchmark performance when all workers are heavily |
| 88 | + contending for tasks from the global queue. |
| 89 | +- **Impact:** Provides the entire project with a tool to make data-driven |
| 90 | + performance decisions. This is foundational for any serious performance work. |
| 91 | +- **Difficulty:** Medium |
| 92 | +- **Skills:** Benchmarking practices, `criterion` usage, `async` benchmarking. |
| 93 | + |
| 94 | +#### **TODO 4: Implement CPU Core Affinity (Thread Pinning)** |
| 95 | + |
| 96 | +- **The Goal:** Allow the `Scheduler` to pin each of its worker threads to a |
| 97 | + specific CPU core. |
| 98 | +- **The Problem:** On modern multi-socket servers (NUMA architecture), a thread |
| 99 | + running on one CPU that accesses memory allocated by another CPU suffers a |
| 100 | + significant performance penalty. OS thread scheduling can also migrate threads |
| 101 | + between cores, causing cache misses. |
| 102 | +- **Proposed Solution:** |
| 103 | + 1. Add the `core_affinity` crate as a dependency. |
| 104 | + 2. In `Scheduler::Scheduler::Create()`, before spawning each `tokio` task, |
| 105 | + get the list of available core IDs. |
| 106 | + 3. Use `tokio::task::spawn_blocking` in conjunction with |
| 107 | + `core_affinity::set_for_current()` to pin the thread to a specific core |
| 108 | + ID, then drive the `Worker::Run` loop from that same thread. Tasks started |
| 109 | + with `tokio::spawn` may migrate between runtime threads and lose the pin. |
| 110 | +- **Impact:** Can noticeably improve performance on server-grade hardware by |
| 111 | + improving cache locality and reducing cross-socket (NUMA) memory access |
| 112 | + penalties. This is an expert-level optimization for achieving bare-metal |
| 113 | + performance. |
| 114 | +- **Difficulty:** High |
| 115 | +- **Skills:** Deep understanding of OS schedulers, CPU architecture (NUMA), and |
| 116 | + the `tokio` runtime's threading model. |
| 117 | + |
| 118 | +### **Level 4: Advanced Scheduling Logic & Fairness** 🧠 |
| 119 | + |
| 120 | +This level moves beyond raw speed and into the "intelligence" of the scheduler, |
| 121 | +focusing on fairness and preventing common concurrency pitfalls. |
| 122 | + |
| 123 | +#### **TODO 5: Implement LIFO Slot for Recently Awoken Tasks** |
| 124 | + |
| 125 | +- **The Goal:** Improve the performance of "ping-pong" workloads, where a task |
| 126 | + awaits a short I/O operation and then immediately needs to run again. |
| 127 | +- **The Problem:** When an `async` task completes an I/O operation (e.g., a |
| 128 | + database query), its `Waker` is called, and it gets re-submitted to the |
| 129 | + scheduler. This often pushes it to the back of a global queue, adding |
| 130 | + unnecessary latency. |
| 131 | +- **Proposed Solution:** |
| 132 | + 1. Add a special, single-element "LIFO slot" to each `Worker`'s local |
| 133 | + state. This slot is separate from the `crossbeam-deque`. |
| 134 | + 2. When a task is awoken by its `Waker`, instead of being pushed to the |
| 135 | + global `Injector`, it is placed directly into the LIFO slot of the _same |
| 136 | + worker_ that was running it before. |
| 137 | + 3. Modify the `Worker::Run` loop to check this LIFO slot _before_ checking |
| 138 | + its main local deques. |
| 139 | +- **Impact:** Dramatically improves cache locality and reduces latency for |
| 140 | + I/O-bound tasks that frequently yield and resume. This is a key feature of the |
| 141 | + Tokio runtime itself. |
| 142 | +- **Difficulty:** High |
| 143 | +- **Skills:** Deep understanding of `async` `Future`s, `Waker`, and `Context` |
| 144 | + interaction. |
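
As a rough illustration of the slot-before-deque ordering described above, here is a single-threaded, std-only sketch. `WorkerState`, `push_woken`, and `TaskId` are hypothetical names, and a plain `VecDeque` stands in for the `crossbeam-deque`. A real implementation would also need the slot to be reachable from whichever thread invokes the `Waker`.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a real task handle.
pub type TaskId = u32;

pub struct WorkerState {
    /// Holds at most one recently woken task.
    lifo_slot: Option<TaskId>,
    /// Stand-in for the worker's local deque.
    local: VecDeque<TaskId>,
}

impl WorkerState {
    pub fn new() -> Self {
        Self {
            lifo_slot: None,
            local: VecDeque::new(),
        }
    }

    /// Normal submission path: append to the local queue.
    pub fn push_local(&mut self, task: TaskId) {
        self.local.push_back(task);
    }

    /// Wake path: the newest woken task takes the slot. A task that was
    /// already in the slot is moved to the local queue so it is not lost.
    pub fn push_woken(&mut self, task: TaskId) {
        if let Some(previous) = self.lifo_slot.replace(task) {
            self.local.push_back(previous);
        }
    }

    /// The run loop checks the LIFO slot before the local queue.
    pub fn next_task(&mut self) -> Option<TaskId> {
        self.lifo_slot.take().or_else(|| self.local.pop_front())
    }
}
```

One design point worth deciding early is what happens when two tasks keep waking each other through the slot, since that pattern could delay everything in the local queue unless slot use is capped.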
| 145 | + |
| 146 | +#### **TODO 6: Introduce an Anti-Starvation Mechanism** |
| 147 | + |
| 148 | +- **The Goal:** Prevent low-priority tasks from _never_ running on a perpetually |
| 149 | + busy scheduler. |
| 150 | +- **The Problem:** If there is a constant, high-volume stream of `High` and |
| 151 | + `Normal` priority tasks, the scheduler's logic will always prefer them, |
| 152 | + potentially causing `Low` priority tasks to "starve" and never get a chance to |
| 153 | + execute. |
| 154 | +- **Proposed Solution:** |
| 155 | + 1. Add a counter to each `Worker`'s state. |
| 156 | + 2. Every N tasks a worker completes (e.g., every 61 tasks, a prime number |
| 157 | + to avoid harmonic issues), the worker is _forced_ to try stealing one |
| 158 | + task specifically from the `Low` priority queue system. |
| 159 | + 3. If it finds a `Low` priority task, it executes it. Either way, it resets |
| 160 | + its counter and returns to its normal scheduling logic. |
| 161 | +- **Impact:** Bounds starvation: even on a fully loaded system, background and |
| 162 | + maintenance tasks will eventually make progress. |
| 163 | +- **Difficulty:** Medium |
| 164 | +- **Skills:** State management within the worker loop, algorithm design. |
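
The counter-based check can be sketched as follows, using only the standard library. The `Worker` type here is a simplified, hypothetical model with two `VecDeque`s standing in for the priority queue system, and resetting the counter even when no `Low` task is found is a choice made by this sketch.

```rust
use std::collections::VecDeque;

/// Interval taken from the example in the text (61 tasks).
const LOW_PRIORITY_INTERVAL: u32 = 61;

pub struct Worker {
    /// Tasks handed out since the last forced low-priority check.
    completed_since_low: u32,
    /// Stand-in for the `High`/`Normal` queues.
    pub normal: VecDeque<&'static str>,
    /// Stand-in for the `Low` priority queue.
    pub low: VecDeque<&'static str>,
}

impl Worker {
    pub fn new() -> Self {
        Self {
            completed_since_low: 0,
            normal: VecDeque::new(),
            low: VecDeque::new(),
        }
    }

    /// Picks the next task, forcing a look at the low-priority queue
    /// once every `LOW_PRIORITY_INTERVAL` tasks.
    pub fn next_task(&mut self) -> Option<&'static str> {
        if self.completed_since_low >= LOW_PRIORITY_INTERVAL {
            // Reset whether or not a low-priority task is waiting.
            self.completed_since_low = 0;
            if let Some(task) = self.low.pop_front() {
                return Some(task);
            }
        }
        // Normal policy: prefer higher priority, fall back to low.
        let task = self.normal.pop_front().or_else(|| self.low.pop_front());
        if task.is_some() {
            self.completed_since_low += 1;
        }
        task
    }
}
```

With a steady supply of normal tasks, this model hands out one low-priority task after every 61 normal ones, which puts an upper bound on how long a queued `Low` task can wait per worker.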
| 165 | + |
| 166 | +--- |
| 167 | + |
| 168 | +### **Level 5: Observability & Introspection** 🔬 |
| 169 | + |
| 170 | +A high-performance system is a black box without good tooling. This level is |
| 171 | +about adding the tools needed to understand, debug, and profile the scheduler's |
| 172 | +behavior in real time. |
| 173 | + |
| 174 | +#### **TODO 7: Expose Internal Metrics** |
| 175 | + |
| 176 | +- **The Goal:** Provide a mechanism to query the state and performance of the |
| 177 | + scheduler at runtime. |
| 178 | +- **The Problem:** It's currently impossible to know how many tasks are queued, |
| 179 | + how many steals have occurred, or how busy each worker is. |
| 180 | +- **Proposed Solution:** |
| 181 | + 1. Create a `SchedulerMetrics` struct containing `AtomicUsize` counters for |
| 182 | + various events (e.g., `tasks_submitted`, `tasks_completed`, |
| 183 | + `steals_succeeded`, `steals_failed`, `workers_parked`). |
| 184 | + 2. Add an `Arc<SchedulerMetrics>` to the `Scheduler` and pass clones to |
| 185 | + each `Worker`. |
| 186 | + 3. Instrument the code: increment the appropriate counters at key points |
| 187 | + (e.g., in `Submit`, `Run`, `Steal`). |
| 188 | + 4. Add a `Scheduler::Metrics()` method that returns a snapshot of the |
| 189 | + current metrics, allowing external tools to monitor the scheduler's |
| 190 | + health. |
| 191 | +- **Impact:** Enables powerful debugging, monitoring, and auto-scaling |
| 192 | + decisions. It transforms the scheduler from a black box into a transparent, |
| 193 | + observable system. |
| 194 | +- **Difficulty:** Medium |
| 195 | +- **Skills:** Concurrency primitives (`AtomicUsize`), API design. |
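
A minimal sketch of the counters and the snapshot method might look like the block below. The field names follow the list in the proposed solution; `MetricsSnapshot` and `snapshot` are hypothetical names, and `Relaxed` ordering is an assumption that suits pure statistics but not counters used for synchronization.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Shared counters, intended to be wrapped in an `Arc` and cloned into
/// each worker.
#[derive(Default)]
pub struct SchedulerMetrics {
    pub tasks_submitted: AtomicUsize,
    pub tasks_completed: AtomicUsize,
    pub steals_succeeded: AtomicUsize,
    pub steals_failed: AtomicUsize,
    pub workers_parked: AtomicUsize,
}

/// Plain-value copy of the counters at one point in time.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MetricsSnapshot {
    pub tasks_submitted: usize,
    pub tasks_completed: usize,
    pub steals_succeeded: usize,
    pub steals_failed: usize,
    pub workers_parked: usize,
}

impl SchedulerMetrics {
    /// Reads each counter independently. The fields are not read as one
    /// atomic unit, so a snapshot can be slightly inconsistent under load.
    pub fn snapshot(&self) -> MetricsSnapshot {
        MetricsSnapshot {
            tasks_submitted: self.tasks_submitted.load(Ordering::Relaxed),
            tasks_completed: self.tasks_completed.load(Ordering::Relaxed),
            steals_succeeded: self.steals_succeeded.load(Ordering::Relaxed),
            steals_failed: self.steals_failed.load(Ordering::Relaxed),
            workers_parked: self.workers_parked.load(Ordering::Relaxed),
        }
    }
}
```

Instrumentation then reduces to calls such as `metrics.tasks_submitted.fetch_add(1, Ordering::Relaxed)` at the relevant points. If the counters show up in profiles, padding each one to its own cache line is a common next step.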
| 196 | + |
| 197 | +#### **TODO 8: Integrate with `tracing` for Granular Timings** |
| 198 | + |
| 199 | +- **The Goal:** Provide detailed, structured logs and timing information about |
| 200 | + the entire lifecycle of a task, compatible with modern observability platforms |
| 201 | + like Jaeger or Datadog. |
| 202 | +- **The Problem:** `log` is good for simple messages, but it cannot express the |
| 203 | + structured, hierarchical "spans" that `tracing` uses to measure the duration |
| 204 | + of specific operations. |
| 205 | +- **Proposed Solution:** |
| 206 | + 1. Replace the `log` crate with the `tracing` crate. |
| 207 | + 2. Wrap key operations in `tracing::span!` macros. For example: |
| 208 | + - Create a span in `Scheduler::Submit` that gets a unique task ID. |
| 209 | + - The `Worker` can "enter" this span when it begins executing the task. |
| 210 | + - Create sub-spans for `PopLocal` and `StealFromSystem` to see where |
| 211 | + time is being spent. |
| 212 | + 3. The application using the `Echo` library can then configure a `tracing` |
| 213 | + subscriber to export this data to performance analysis tools. |
| 214 | +- **Impact:** Provides detailed insight into performance bottlenecks. You |
| 215 | + can visually see how long tasks wait in the queue versus how long they take to |
| 216 | + execute. |
| 217 | +- **Difficulty:** Medium |
| 218 | +- **Skills:** `tracing` crate API, structured logging concepts. |
| 219 | + |
| 220 | +--- |
| 221 | + |
| 222 | +### **Level 6: Modular Extensibility** 🧩 |
| 223 | + |
| 224 | +This level focuses on making the scheduler more flexible and adaptable to |
| 225 | +different kinds of workloads. |
| 226 | + |
| 227 | +#### **TODO 9: Support for Named Queues and Concurrency Limits** |
| 228 | + |
| 229 | +- **The Goal:** Fully implement the `SchedulerBuilder::Queue()` API to allow |
| 230 | + users to create separate, named execution pools within the same scheduler, |
| 231 | + each with its own concurrency limit. |
| 232 | +- **The Problem:** The current scheduler has one unified pool of workers. Some |
| 233 | + applications need to limit concurrency for specific types of tasks (e.g., |
| 234 | + "only allow 4 concurrent disk I/O operations"). |
| 235 | +- **Proposed Solution:** This is a major architectural challenge. |
| 236 | + 1. The `SchedulerBuilder` would collect configurations for named queues. |
| 237 | + 2. The `Scheduler` would need to maintain multiple `Queue` systems or tag |
| 238 | + tasks with a queue name. |
| 239 | + 3. A "supervisor" or "dispatcher" component would be needed. When a worker |
| 240 | + becomes free, it wouldn't just steal from anywhere; it would ask the |
| 241 | + dispatcher which queue it should service based on current concurrency |
| 242 | + levels. This might involve using `tokio::sync::Semaphore` to manage |
| 243 | + concurrency limits for each named queue. |
| 244 | +- **Impact:** Transforms `Echo` from a general-purpose scheduler into a highly |
| 245 | + sophisticated runtime capable of managing complex, heterogeneous workloads |
| 246 | + with fine-grained control. |
| 247 | +- **Difficulty:** Very High |
| 248 | +- **Skills:** Advanced architectural design, complex state management, deep |
| 249 | + knowledge of concurrency patterns and primitives. |
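
To give a feel for the per-queue concurrency check a dispatcher would perform, here is a small std-only sketch. It uses an atomic counter per named queue as a non-blocking stand-in for `tokio::sync::Semaphore`; `QueueLimits`, `try_acquire`, and `release` are hypothetical names, and the real design may look quite different.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Tracks how many tasks from each named queue are currently running.
pub struct QueueLimits {
    /// Queue name -> (concurrency limit, tasks currently running).
    limits: HashMap<String, (usize, AtomicUsize)>,
}

impl QueueLimits {
    pub fn new(config: &[(&str, usize)]) -> Self {
        let limits = config
            .iter()
            .map(|(name, limit)| (name.to_string(), (*limit, AtomicUsize::new(0))))
            .collect();
        Self { limits }
    }

    /// Tries to reserve a slot on `queue`. Returns `false` when the queue
    /// is unknown or already at its limit.
    pub fn try_acquire(&self, queue: &str) -> bool {
        let (limit, running) = match self.limits.get(queue) {
            Some(entry) => entry,
            None => return false,
        };
        let mut current = running.load(Ordering::Acquire);
        loop {
            if current >= *limit {
                return false;
            }
            // Retry if another worker changed the count in between.
            match running.compare_exchange_weak(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return true,
                Err(actual) => current = actual,
            }
        }
    }

    /// Returns a slot after a task from `queue` finishes. Must be paired
    /// with a successful `try_acquire`.
    pub fn release(&self, queue: &str) {
        if let Some((_, running)) = self.limits.get(queue) {
            running.fetch_sub(1, Ordering::AcqRel);
        }
    }
}
```

A free worker would call `try_acquire` for a queue before taking a task from it and `release` once the task completes. Wrapping the pair in a guard type that releases on drop would make it harder to leak a slot.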