Add resource monitor#58
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds comprehensive resource monitoring capabilities to Kallichore, enabling real-time tracking of kernel resource usage. The implementation monitors CPU usage, memory consumption, and thread counts for kernel processes and their children. By default, resources are sampled every 1 second and can be configured via command-line arguments or runtime settings. Setting the interval to 0 disables monitoring entirely.
- Adds a global resource monitor that periodically collects metrics for all active kernel sessions
- Exposes resource usage data through both WebSocket messages and HTTP session queries
- Provides configuration API to adjust sampling intervals dynamically (minimum 100ms when enabled)
Reviewed changes
Copilot reviewed 17 out of 18 changed files in this pull request and generated 17 comments.
Show a summary per file
| File | Description |
|---|---|
kallichore.json |
Adds resourceUsage schema and resource_sample_interval_ms configuration field to OpenAPI spec |
crates/kcshared/src/kernel_message.rs |
Defines ResourceUpdate struct and adds ResourceUsage variant to KernelMessage enum |
crates/kcserver/tests/resource_usage_test.rs |
Adds integration tests verifying resource usage data in HTTP responses and WebSocket messages |
crates/kcserver/src/server.rs |
Threads resource_sample_interval_ms parameter through server initialization and configuration updates |
crates/kcserver/src/resource_monitor.rs |
Implements global resource monitoring task with process tree traversal and metric collection |
crates/kcserver/src/main.rs |
Adds --resource-sample-interval CLI argument with 1000ms default |
crates/kcserver/src/lib.rs |
Exports new resource_monitor module |
crates/kcserver/src/kernel_state.rs |
Adds resource_usage field to store latest metrics |
crates/kcserver/src/kernel_session/mod.rs |
Includes resource_usage in session info responses |
crates/kcserver/Cargo.toml |
Bumps version to 0.1.62 |
crates/kallichore_api/src/models.rs |
Adds ResourceUsage model and updates ActiveSession and ServerConfiguration models |
crates/kallichore_api/docs/* |
Adds documentation for new API models |
crates/kallichore_api/api/openapi.yaml |
Updates OpenAPI specification with resource usage schemas |
crates/kallichore_api/README.md |
Updates build date metadata |
Cargo.lock |
Updates version lock for kcserver |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| #[derive(Debug, Clone, PartialEq, serde::Serialize, serde::Deserialize, validator::Validate)] | ||
| #[cfg_attr(feature = "conversion", derive(frunk::LabelledGeneric))] | ||
| pub struct ResourceUsage { | ||
| /// The percentage of CPU used by the kernel process |
There was a problem hiding this comment.
The description says "The percentage of CPU used by the kernel process" but should clarify that this includes child processes, consistent with the actual implementation.
| /// The percentage of CPU used by the kernel process | |
| /// The percentage of CPU used by the kernel process, including its child processes |
| #[cfg(not(any(target_os = "linux")))] | ||
| { | ||
| // On macOS and Windows, tasks() is not available | ||
| // Assume 1 thread per process as a baseline | ||
| total_threads += 1; | ||
| } |
There was a problem hiding this comment.
On non-Linux platforms (macOS, Windows), the thread count logic assumes 1 thread per process, which significantly underestimates the actual thread count. Modern Python kernels typically use multiple threads. Consider:
- Documenting this limitation in the API
- Using platform-specific APIs to get accurate thread counts
- Making the field optional when accurate data isn't available
This could mislead users who rely on thread count metrics for capacity planning or debugging.
| ------------ | ------------- | ------------- | ------------- | ||
| **cpu_percent** | **i64** | The percentage of CPU used by the kernel process | | ||
| **memory_bytes** | **i64** | The amount of memory used by the kernel process in bytes | | ||
| **thread_count** | **i64** | The number of threads used by the kernel process | |
There was a problem hiding this comment.
The description says "The number of threads used by the kernel process" but should clarify that this includes child processes.
| **thread_count** | **i64** | The number of threads used by the kernel process | | |
| **thread_count** | **i64** | The number of threads used by the kernel process and its child processes | |
| #[serde(rename = "memory_bytes")] | ||
| pub memory_bytes: i64, | ||
|
|
||
| /// The number of threads used by the kernel process |
There was a problem hiding this comment.
The description says "The number of threads used by the kernel process" but should clarify that this includes child processes.
| /// The number of threads used by the kernel process | |
| /// The number of threads used by the kernel process, including its child processes |
| // Collect all PIDs in process tree using BFS | ||
| let mut pids_to_check = vec![pid]; | ||
| let mut visited = HashSet::new(); | ||
|
|
||
| while let Some(check_pid) = pids_to_check.pop() { | ||
| if !visited.insert(check_pid) { | ||
| continue; // Already visited | ||
| } | ||
|
|
||
| // Find children by checking parent_pid | ||
| for (child_pid, proc) in system.processes() { | ||
| if proc.parent() == Some(check_pid) && !visited.contains(child_pid) { | ||
| pids_to_check.push(*child_pid); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
The BFS process tree collection doesn't actually accumulate metrics for visited processes. The code collects all PIDs using BFS, but then only sums metrics in the second loop. However, there's a subtle issue: the visited set is built without checking if processes exist in the system. Then when summing, some PIDs might not have corresponding processes. While this won't cause errors due to the if let Some check, the logic could be clearer. Consider combining the BFS traversal with metric collection to make it more efficient and clearer.
There was a problem hiding this comment.
We know that the PIDs have corresponding processes because we just found them! I think this is OK as is.
| // Enforce a minimum interval of 100 ms so we don't try to sample | ||
| // too frequently | ||
| let new_interval_ms = if interval_ms > 0 && interval_ms < 100 { | ||
| log::warn!("Requested resource_sample_interval_ms is too low: {}. Setting to minimum of 100 ms.", interval_ms); | ||
| 100 | ||
| } else { | ||
| interval_ms as u64 | ||
| }; |
There was a problem hiding this comment.
Inconsistent validation logic: The code checks for negative values and enforces a minimum of 100ms, but the warning message at line 1976 says "Setting to minimum of 100 ms" while the actual enforcement allows 0 (disabled). The validation should either:
- Allow 0 or >= 100 (current behavior), and update the warning to clarify this
- Treat 0 specially before the minimum check
The current logic is: if value is between 1-99, set to 100; if 0, keep as 0. This is correct but the warning message doesn't reflect this nuance.
There was a problem hiding this comment.
100 is the minimum when metrics collection is enabled, not 0, so I stand by the warning.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Adds resource monitoring. By default, resources are monitored and reported every 1 second. Resource usage information is emitted on the kernel's websocket as kernel messages (rather than Jupyter messages, but on the same protocol), and is also returned when querying the active session.
The interval at which resources are collected is configurable (both at the command line and in settings), and can be set to 0 to disable resource monitoring entirely if desired for performance reasons.