Commit 39419cc

Retry on missing server at startup (#36)
Currently microgrid-rs retries when an existing connection is lost, but when no server is available at startup, it exits immediately. This PR changes that:

- `MicrogridClientHandle::try_new` returns immediately with a lazily established connection to the API.
- `LogicalMeterHandle::try_new` waits until a server is available, so that it can build the component graph.
2 parents c8c9238 + a06d7c9 commit 39419cc

5 files changed

Lines changed: 77 additions & 36 deletions

RELEASE_NOTES.md

Lines changed: 4 additions & 2 deletions
@@ -6,11 +6,13 @@

 ## Upgrading

-<!-- Here goes notes on how to upgrade from previous versions, including deprecations and what they should be replaced with -->
+- `MicrogridClientHandle::try_new`, `LogicalMeterHandle::try_new`, and `Microgrid::try_new` no longer return an error when the microgrid API server is unreachable at startup or when the server returns data that doesn't yet form a valid component graph; instead they wait for the server to recover. Callers that relied on a quick failure to detect a misconfigured or unavailable endpoint should wrap the call in `tokio::time::timeout` (or equivalent) to bound the wait. URL validation still fails fast: a malformed endpoint URL is still surfaced as `ConnectionFailure` from `MicrogridClientHandle::try_new`, and an invalid `LogicalMeterConfig` still surfaces synchronously from `LogicalMeterHandle::try_new`.

 ## New Features

-<!-- Here goes the main new features and examples or instructions on how to use them -->
+- The microgrid client now tolerates the API server being absent or returning incomplete data at startup. `MicrogridClientHandle::try_new` establishes the gRPC connection lazily, so it succeeds regardless of whether the server is reachable; transient stream errors are then handled by the existing per-stream retry loop. `LogicalMeterHandle::try_new` (and therefore `Microgrid::try_new`) wraps the entire component-graph setup (listing components, listing connections, and building the graph) in a single retry loop that sleeps 3 seconds between attempts, so applications block waiting for the server and a valid graph instead of exiting with an error.
+
+- `Bounds::combine_parallel`, `Bounds::intersect`, and `Bounds::merge_if_overlapping` are now public, allowing external callers to combine bounds without going through higher-level types.

 ## Bug Fixes
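
The upgrade note above recommends bounding the startup wait. In async code that means wrapping the `try_new` future in `tokio::time::timeout`, which yields an `Err` once the duration elapses. The sketch below is a std-only analogue of the same idea, not code from this commit: `run_with_deadline` is a hypothetical helper that runs a blocking setup closure on a worker thread and stops waiting after `limit`.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical helper (not part of microgrid-rs): runs `setup` on a worker
/// thread and waits at most `limit` for its result.
/// Returns `None` if the result did not arrive in time.
fn run_with_deadline<T, F>(limit: Duration, setup: F) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the caller already gave up, the receiver is gone and `send`
        // fails; that error is deliberately ignored.
        let _ = tx.send(setup());
    });
    // `recv_timeout` returns `Err` on timeout, which `ok()` maps to `None`.
    rx.recv_timeout(limit).ok()
}
```

One difference is worth keeping in mind: here the worker thread keeps running after a timeout, whereas with `tokio::time::timeout` the wrapped future is dropped, which cancels the pending setup.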

src/bounds.rs

Lines changed: 3 additions & 3 deletions
@@ -34,7 +34,7 @@ impl<Q: Quantity> Bounds<Q> {
     }

     /// Combines two bounds as if their components were connected in parallel.
-    pub(crate) fn combine_parallel(&self, other: &Self) -> Vec<Self> {
+    pub fn combine_parallel(&self, other: &Self) -> Vec<Self> {
         if self.intersect(other).is_none() {
             return vec![self.clone(), other.clone()];
         }
@@ -67,7 +67,7 @@ impl<Q: Quantity> Bounds<Q> {

     /// Returns the intersection of `self` and `other`, or `None` if the
     /// intersection is empty.
-    pub(crate) fn intersect(&self, other: &Self) -> Option<Self> {
+    pub fn intersect(&self, other: &Self) -> Option<Self> {
         let lower = Self::map_or_any(Q::max, self.lower, other.lower);
         let upper = Self::map_or_any(Q::min, self.upper, other.upper);
         if let (Some(lower), Some(upper)) = (lower, upper)
@@ -80,7 +80,7 @@ impl<Q: Quantity> Bounds<Q> {

     /// If `self` and `other` overlap, returns the smallest single interval
     /// that contains both; otherwise returns `None`.
-    pub(crate) fn merge_if_overlapping(&self, other: &Self) -> Option<Self> {
+    pub fn merge_if_overlapping(&self, other: &Self) -> Option<Self> {
         self.intersect(other)?;
         Some(Bounds {
             lower: self.lower.and_then(|a| other.lower.map(|b| a.min(b))),
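
To illustrate what the newly public `intersect` and `merge_if_overlapping` compute, here is a minimal standalone sketch over `f64`. `Interval` is a hypothetical stand-in for `Bounds<Q>`, with `None` meaning "unbounded on that side". Three details are assumptions because the hunks above truncate them: that `map_or_any` applies the function when both sides are set and otherwise returns whichever side is set, that `intersect` returns `None` when the resulting lower bound exceeds the upper bound, and that the `upper` field of the merged result mirrors `lower` using `max`.

```rust
/// Hypothetical stand-in for `Bounds<Q>`; `None` means unbounded on that side.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Interval {
    lower: Option<f64>,
    upper: Option<f64>,
}

/// Assumed behavior of `map_or_any`: combine when both values are present,
/// otherwise return whichever one is present (or `None` if neither is).
fn map_or_any(f: fn(f64, f64) -> f64, a: Option<f64>, b: Option<f64>) -> Option<f64> {
    match (a, b) {
        (Some(a), Some(b)) => Some(f(a, b)),
        (a, b) => a.or(b),
    }
}

impl Interval {
    /// Intersection of the two intervals, or `None` if it is empty.
    fn intersect(&self, other: &Self) -> Option<Self> {
        let lower = map_or_any(f64::max, self.lower, other.lower);
        let upper = map_or_any(f64::min, self.upper, other.upper);
        if let (Some(lo), Some(hi)) = (lower, upper) {
            if lo > hi {
                return None; // assumed emptiness check
            }
        }
        Some(Interval { lower, upper })
    }

    /// Smallest single interval containing both, if they overlap.
    fn merge_if_overlapping(&self, other: &Self) -> Option<Self> {
        self.intersect(other)?;
        Some(Interval {
            // An unbounded side on either input keeps the result unbounded.
            lower: self.lower.and_then(|a| other.lower.map(|b| a.min(b))),
            upper: self.upper.and_then(|a| other.upper.map(|b| a.max(b))),
        })
    }
}
```

Under these assumptions, `[0, 10]` and `[5, 20]` intersect to `[5, 10]` and merge to `[0, 20]`, while `[0, 10]` and `[30, ∞)` neither intersect nor merge.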

src/client/microgrid_client_handle.rs

Lines changed: 19 additions & 14 deletions
@@ -8,7 +8,7 @@

 use chrono::TimeDelta;
 use tokio::sync::{broadcast, mpsc, oneshot};
-use tonic::transport::Channel;
+use tonic::transport::{Channel, Endpoint};

 use crate::{
     Bounds, Error,
@@ -36,20 +36,25 @@ pub struct MicrogridClientHandle {
 }

 impl MicrogridClientHandle {
-    /// Creates a new `MicrogridClientHandle` that connects to the microgrid API
-    /// at the specified URL.
+    /// Creates a new `MicrogridClientHandle` for the microgrid API at the
+    /// specified URL.
+    ///
+    /// The connection is established lazily on the first RPC, so this method
+    /// succeeds even when no server is reachable yet. Per-call errors will
+    /// surface from the individual RPC methods, and the actor's per-stream
+    /// retry loop will keep attempting to reconnect telemetry streams.
+    ///
+    /// Returns an error only if `url` is not a valid endpoint URL.
     pub async fn try_new(url: impl Into<String>) -> Result<Self, Error> {
-        let client = match MicrogridClient::<Channel>::connect(url.into()).await {
-            Ok(t) => t,
-            Err(e) => {
-                tracing::error!("Could not connect to server: {e}");
-                return Err(Error::connection_failure(format!(
-                    "Could not connect to server: {e}"
-                )));
-            }
-        };
-
-        Ok(Self::new_from_client(client))
+        let url = url.into();
+        let channel = Endpoint::from_shared(url.clone())
+            .map_err(|e| {
+                Error::connection_failure(format!("Invalid microgrid API URL {url}: {e}"))
+            })?
+            .connect_lazy();
+        Ok(Self::new_from_client(MicrogridClient::<Channel>::new(
+            channel,
+        )))
     }

     pub fn new_from_client(client: impl MicrogridApiClient) -> Self {

src/logical_meter/logical_meter_handle.rs

Lines changed: 46 additions & 15 deletions
@@ -13,6 +13,7 @@ use crate::{
 };
 use frequenz_microgrid_component_graph::{self, ComponentGraph};
 use std::collections::BTreeSet;
+use std::time::Duration;
 use tokio::sync::mpsc;

 use super::{LogicalMeterConfig, logical_meter_actor::LogicalMeterActor};
@@ -26,6 +27,11 @@ pub struct LogicalMeterHandle {

 impl LogicalMeterHandle {
     /// Creates a new LogicalMeter instance.
+    ///
+    /// Listing the components and connections from the API and building the
+    /// component graph is retried indefinitely with a 3 second backoff, so
+    /// this call blocks until the server is reachable and returns data that
+    /// forms a valid graph. Returns an error only if `config` is invalid.
     pub async fn try_new(
         client: MicrogridClientHandle,
         config: LogicalMeterConfig,
@@ -39,21 +45,19 @@ impl LogicalMeterHandle {
         clock: C,
     ) -> Result<Self, Error> {
         let (sender, receiver) = mpsc::channel(8);
-        let graph = ComponentGraph::try_new(
-            client.list_electrical_components(vec![], vec![]).await?,
-            client
-                .list_electrical_component_connections(vec![], vec![])
-                .await?,
-            frequenz_microgrid_component_graph::ComponentGraphConfig {
-                allow_component_validation_failures: true,
-                allow_unconnected_components: true,
-                allow_unspecified_inverters: false,
-                disable_fallback_components: false,
-            },
-        )
-        .map_err(|e| {
-            Error::component_graph_error(format!("Unable to create a component graph: {e}"))
-        })?;
+        const RETRY_DELAY: Duration = Duration::from_secs(3);
+        let graph = loop {
+            match build_component_graph(&client).await {
+                Ok(g) => break g,
+                Err(reason) => {
+                    tracing::warn!(
+                        "Microgrid logical-meter setup failed, retrying in {:?}: {reason}",
+                        RETRY_DELAY
+                    );
+                    tokio::time::sleep(RETRY_DELAY).await;
+                }
+            }
+        };

         let logical_meter = LogicalMeterActor::try_new(receiver, client, config, clock)?;

@@ -174,6 +178,33 @@ impl LogicalMeterHandle {
     }
 }

+/// Lists the components and connections from the API and builds the
+/// component graph. Errors from each step are stringified with a prefix so
+/// the retry loop can log a concise reason.
+async fn build_component_graph(
+    client: &MicrogridClientHandle,
+) -> Result<ComponentGraph<ElectricalComponent, ElectricalComponentConnection>, String> {
+    let components = client
+        .list_electrical_components(vec![], vec![])
+        .await
+        .map_err(|e| format!("fetching components failed: {e}"))?;
+    let connections = client
+        .list_electrical_component_connections(vec![], vec![])
+        .await
+        .map_err(|e| format!("fetching component connections failed: {e}"))?;
+    ComponentGraph::try_new(
+        components,
+        connections,
+        frequenz_microgrid_component_graph::ComponentGraphConfig {
+            allow_component_validation_failures: true,
+            allow_unconnected_components: true,
+            allow_unspecified_inverters: false,
+            disable_fallback_components: false,
+        },
+    )
+    .map_err(|e| format!("building component graph failed: {e}"))
+}
+
 #[cfg(test)]
 mod tests {
     use chrono::TimeDelta;
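
The retry pattern introduced in this file can be shown in isolation. The following is a synchronous, std-only sketch rather than the crate's code: `retry_with_fixed_delay` is a hypothetical helper, it uses `std::thread::sleep` where the commit uses `tokio::time::sleep`, and it prints to stderr where the commit uses `tracing::warn!`. It also returns the failure count, which the commit does not track, purely to make the behavior observable.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical helper: calls `attempt` until it succeeds, sleeping `delay`
/// after each failure. Returns the success value together with the number
/// of failed attempts that preceded it.
fn retry_with_fixed_delay<T, E, F>(delay: Duration, mut attempt: F) -> (T, u32)
where
    E: std::fmt::Display,
    F: FnMut() -> Result<T, E>,
{
    let mut failures = 0;
    loop {
        match attempt() {
            // `break` with a value turns the loop into an expression,
            // mirroring `let graph = loop { ... }` in the diff.
            Ok(value) => break (value, failures),
            Err(reason) => {
                failures += 1;
                eprintln!("setup failed, retrying in {delay:?}: {reason}");
                thread::sleep(delay);
            }
        }
    }
}
```

As in the commit, the loop has no attempt limit, so a caller that needs to give up eventually has to bound the call from the outside.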

src/microgrid.rs

Lines changed: 5 additions & 2 deletions
@@ -21,8 +21,11 @@ impl Microgrid {
     /// Creates a new `Microgrid` instance with the given microgrid API URL and
     /// logical meter configuration.
     ///
-    /// Returns an error if the URL is unreachable, or if the component graph
-    /// cannot be created with the given configuration.
+    /// The microgrid API connection is established lazily and connection or
+    /// component-graph build errors during setup are retried indefinitely, so
+    /// this call blocks until the server is reachable and returns valid data.
+    /// Returns an error only if the URL is malformed or if the provided
+    /// logical meter configuration is invalid.
     pub async fn try_new(
         url: impl Into<String>,
         config: LogicalMeterConfig,
