
Add optional startup timeout while waiting for the initial federate connections#352

Open
NicoBiernat wants to merge 2 commits into lf-lang:main from NicoBiernat:startup-timeout

@NicoBiernat (Collaborator)

Lingua Franca supports a global timeout property, after which the application is terminated.
However, in federated LF applications, all federates first have to connect to their "neighbors" and agree on a start tag. Only after that does the timeout start counting.
But what if not all federates are available at the start?
Then the other federates wait, possibly forever, until the missing federate finally answers.

I really need to make sure that the application terminates, even in case of an error when some federates might be unavailable. In my case, termination triggers a reset of the MCU into the bootloader on multiple remote devices.

I added a STARTUP_TIMEOUT option with a default value of FOREVER, so the behavior is unchanged if you don't specify it.
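A minimal sketch of how a compile-time option with a FOREVER default could be guarded and checked, assuming times are signed 64-bit nanosecond counts with FOREVER encoded as INT64_MAX. The helper `startup_timeout_expired` and the `SEC` macro defined here are hypothetical illustrations, not the code in this PR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: instants and intervals are signed 64-bit nanosecond counts,
 * with FOREVER as the largest representable value. */
typedef int64_t instant_t;
typedef int64_t interval_t;
#define FOREVER INT64_MAX
#define SEC(t) ((interval_t)(t) * 1000000000LL)

/* Default to "no startup timeout" so programs that never set the flag keep
 * the current wait-forever behavior. Override at build time, for example
 * with -DSTARTUP_TIMEOUT=SEC(30). */
#ifndef STARTUP_TIMEOUT
#define STARTUP_TIMEOUT FOREVER
#endif

/* Hypothetical helper: returns true once the startup phase has lasted at
 * least STARTUP_TIMEOUT. Comparing the elapsed time, rather than computing
 * startup_begin + STARTUP_TIMEOUT, avoids overflow when the timeout is
 * FOREVER. Assumes 0 <= startup_begin <= now. */
static bool startup_timeout_expired(instant_t startup_begin, instant_t now) {
  if (STARTUP_TIMEOUT == FOREVER) {
    return false; /* no timeout configured: never expire */
  }
  return (now - startup_begin) >= STARTUP_TIMEOUT;
}
```

Checking elapsed time instead of precomputing a deadline is one way to sidestep the overflow that `begin + FOREVER` would otherwise cause.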

Also, I changed the wait_until call to use lf_time_add instead of plain addition, since lf_time_add handles NEVER and FOREVER correctly.
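To illustrate why plain addition is a problem: if FOREVER is encoded as INT64_MAX, then `now + FOREVER` overflows a signed 64-bit integer, which is undefined behavior in C and typically wraps around to a negative value. The sketch below shows saturating time addition in that spirit. `sketch_time_add` is a hypothetical name, and the NEVER/FOREVER encodings (and the rule that NEVER takes precedence) are assumptions, not a copy of lf_time_add.

```c
#include <stdint.h>

/* Assumption: NEVER and FOREVER are the extreme values of a signed 64-bit
 * time type. sketch_time_add is hypothetical and only illustrates the
 * saturating behavior that lf_time_add is meant to provide. */
typedef int64_t instant_t;
typedef int64_t interval_t;
#define NEVER INT64_MIN
#define FOREVER INT64_MAX

static instant_t sketch_time_add(instant_t time, interval_t interval) {
  /* NEVER and FOREVER absorb any addition (NEVER checked first). */
  if (time == NEVER || interval == NEVER) {
    return NEVER;
  }
  if (time == FOREVER || interval == FOREVER) {
    return FOREVER;
  }
  /* Saturate instead of overflowing: signed overflow is undefined in C. */
  if (interval > 0 && time > FOREVER - interval) {
    return FOREVER;
  }
  if (interval < 0 && time < NEVER - interval) {
    return NEVER;
  }
  return time + interval;
}
```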

Comment thread src/startup_coordinator.c
  }
  // This will release the critical section and allow other tasks to run.
- self->env->wait_until(self->env, self->env->get_physical_time(self->env) + wait_before_retry);
+ self->env->wait_until(self->env, lf_time_add(self->env->get_physical_time(self->env), wait_before_retry));
Collaborator

Nice catch!

@erlingrj (Collaborator) commented May 5, 2026

Thanks for contributing! Before we accept this, we need to work through the exact semantics of your proposal. Startup and shutdown are quite subtle in distributed reactor programs. The current implementation in reactor-uc requires that all federates are running and in range of their neighbors to successfully start. We were discussing the possibility of having some "optional" federates that would not be required for the startup procedure.

First, could you explain a little more about your use case? What exactly should happen if the STARTUP_TIMEOUT expires? You wish to reset the MCU in question? Then what? What about the other federates? Why is a reset necessary?

Comment thread src/startup_coordinator.c
#define TRANSIENT_WAIT_TIME MSEC(250)
#endif

#ifndef STARTUP_TIMEOUT
Member

So you are suggesting solving this via a compile-time flag instead of an annotation?

Collaborator Author

Well, I just didn't know how to add a new annotation to lfc and get its value in reactor-uc 😄
But an annotation would probably be a cleaner solution.

@NicoBiernat (Collaborator Author)


I have the requirement that the program (federate) should never hang or loop forever. Some background on why:

I want to benchmark federated reactor-uc on ~140 RP2040 + W5500 (Ethernet) boards. This hardware is part of the light art installation Project Lighthouse (https://www.youtube.com/watch?v=VU26qkZiyl8) in our office building at Kiel University, so updating the firmware manually for every benchmark is infeasible.
So, I'll be writing a bootloader and a remote firmware update mechanism to run different applications on the MCUs.
But to switch to another application, the running one must reset the MCU back into the bootloader, even in case of an error.
The regular timeout property / annotation only starts counting down after the federation has started (all federates connected, agreed on a start tag, and started executing the program).
However, if any federate is unavailable at startup (network problems, an unplugged cable, etc.), all federates will wait indefinitely without ever resetting into the bootloader, which would be very annoying.
That's what the optional STARTUP_TIMEOUT is for:
terminate the program if it is still waiting for the other federates after a specified duration.
In short: if the federation fails to start within the timeout period, each federate terminates.
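A hypothetical sketch of the startup behavior described above: retry the handshake with the neighbors until it succeeds or the startup timeout elapses, then report failure so the application can reset into the bootloader. `run_startup`, `try_handshake`, `sleep_for`, and the simulated clock are illustrative stand-ins, not reactor-uc APIs or the code in this PR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch: all names are illustrative, not reactor-uc APIs.
 * A simulated clock and a stub handshake stand in for the real network. */
typedef int64_t instant_t;
typedef int64_t interval_t;
#define FOREVER INT64_MAX
#define MSEC(t) ((interval_t)(t) * 1000000LL)

typedef enum { STARTUP_OK, STARTUP_TIMED_OUT } startup_result_t;

/* Simulated platform state. FOREVER means a neighbor never answers. */
static instant_t sim_now = 0;
static instant_t sim_neighbors_ready_at = FOREVER;

/* Stub: the handshake succeeds once all neighbors are reachable. */
static bool try_handshake(void) { return sim_now >= sim_neighbors_ready_at; }

/* Stub: advance the simulated clock instead of really sleeping. */
static void sleep_for(interval_t d) { sim_now += d; }

/* Retry the handshake until it succeeds or `timeout` has elapsed since
 * startup began. With timeout == FOREVER this waits indefinitely, which
 * matches the current behavior. */
static startup_result_t run_startup(interval_t timeout, interval_t retry_period) {
  const instant_t begin = sim_now;
  while (!try_handshake()) {
    if (timeout != FOREVER && sim_now - begin >= timeout) {
      return STARTUP_TIMED_OUT; /* caller could now reset into the bootloader */
    }
    sleep_for(retry_period);
  }
  return STARTUP_OK;
}
```

On a timeout, every federate gives up independently, so the whole federation ends up back in the bootloader even if one device never comes online.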

I know this is a niche use case, so I can keep using my fork if you don't need or want the feature.
