Add configuration to determine agent behavior policy when network connection is lost #954

Closed
dofmind opened this issue Sep 30, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@dofmind
Contributor

dofmind commented Sep 30, 2024

Please describe what you would like to see

When the network connection is lost, the agent currently disconnects and keeps trying to reconnect. I would like to add an alternative behavior in which the agent terminates with a failure. With that behavior, we can expect the following:

  • If bluechi-agent terminates, the unit services running on the agent machine can be stopped, provided they declare a systemd dependency on bluechi-agent.service using Requires= or BindsTo=.
  • systemd restarts bluechi-agent through bluechi-agent.service when that unit is configured with Restart=on-failure.

By adding a configuration option that determines the agent's behavior policy when the network connection is lost, users can choose which policy to apply.

@mwperina mwperina added the enhancement New feature or request label Sep 30, 2024
@engelmi
Member

engelmi commented Oct 9, 2024

@dofmind Thank you for opening this RFE issue!

Can you provide more context on what the systemd services with the dependency on bluechi-agent.service are intended to do? Is it important for those services that the agent terminated due to a disconnect? If so, then a "simple" BindsTo= doesn't take this into account since the bluechi-agent.service could also stop/fail due to other reasons.

Issue #858 is similar to this RFE. I think the solution proposed in that issue can be extended or adjusted to also solve this use case:

We'd have a (really) small C program which checks and listens on changes of the Status property of the agent. If it detects that the agent is offline, it simply exits.
This C program can then be wrapped in a systemd unit, e.g. bluechi-agent-is-online.service, with the following semantics:

  • unit is active = agent connected
  • unit is inactive = agent not connected
Adding UpheldBy=bluechi-agent.service would restart the unit continuously for as long as bluechi-agent is running. To prevent it from alternating between active and inactive, ExecStartPre= could be used to do an initial online check and keep the service in the activating state before it starts listening for the offline signal.
This is just a first idea. It is likely more complex in practice and can be enhanced with additional features, e.g. an initial wait time for the agent to come online instead of just checking its state once.
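The core of the idea (check the status once, then exit as soon as an offline status is observed) can be sketched in a few lines. This is only an illustration in Python rather than the C program discussed here, and it does not talk to D-Bus: `run_until_offline`, the `"online"` status string, and the choice of a non-zero exit code on offline are all assumptions made for the sketch.

```python
from typing import Iterable

# Assumed status value; the real Status property may use different strings.
ONLINE = "online"


def run_until_offline(initial_status: str, changes: Iterable[str]) -> int:
    """Return the exit code a hypothetical watcher process would use.

    initial_status: the status read once at startup.
    changes: status values delivered by change notifications, in order.
    """
    # Initial check: if the agent is not online at startup, exit right away.
    if initial_status != ONLINE:
        return 1
    # Listen phase: stay alive while every reported status is still online.
    for status in changes:
        if status != ONLINE:
            # Offline detected: exit so systemd sees the unit go inactive/failed.
            return 1
    # The notification stream ended without an offline status being seen.
    return 0
```

In a real implementation the `changes` iterable would be replaced by a property-change subscription on the agent's D-Bus interface, and the process would block there instead of iterating over a list.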

@dofmind What do you think about this approach? It requires more effort to implement, but has the advantage of explicitly mapping the bluechi-agent's online/offline status to a systemd unit (and maybe .target if needed).

@dofmind
Contributor Author

dofmind commented Oct 10, 2024

Can you provide more context on what the systemd services with the dependency on bluechi-agent.service are intended to do? Is it important for those services that the agent terminated due to a disconnect? If so, then a "simple" BindsTo= doesn't take this into account since the bluechi-agent.service could also stop/fail due to other reasons.

I agree with your comment. As you said, my solution cannot distinguish a disconnect from any other reason for bluechi-agent terminating. Our requirement is that when a node running bluechi-agent is disconnected, the units running on that node should be terminated, because they will be re-executed on another node.

@dofmind What do you think about this approach? It requires more effort to implement, but has the advantage of explicitly mapping the bluechi-agent's online/offline status to a systemd unit (and maybe .target if needed).

Looks good to me. It has the advantage of handling only the disconnection case of bluechi-agent.

@dofmind
Contributor Author

dofmind commented Oct 11, 2024

We'd have a (really) small C program which checks and listens on changes of the Status property of the agent. If it detects that the agent is offline, it simply exits.

@engelmi I'm trying to add a small C program to the client, exposed as the following bluechictl command:

bluechictl wait-for [offline|online]

wait-for is a command that waits for the BlueChi agent to reach the desired status, either offline or online.
What do you think about this?

@engelmi
Member

engelmi commented Oct 14, 2024

I think it should even be a completely new program, e.g.

$ bluechi-is-online --help
bluechi-is-online [agent|node|system] [OPTIONS]
If online, exit with 0. Otherwise 1.

Options:
--monitor: keep monitoring as long as agent|node|system is online. Exit only when offline is detected.
--initial-wait: in seconds. If not online, keep monitoring for up to n seconds.
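To make the proposed exit-code semantics concrete, here is a rough Python sketch of the command line described above. It is not the actual tool: `is_online_main`, the injected `get_status` callable, the `"online"` status string, and the polling interval are all assumptions for illustration, and polling stands in for whatever change notification a real implementation would use.

```python
import argparse
import time
from typing import Callable, Sequence


def build_parser() -> argparse.ArgumentParser:
    # Mirrors the proposed usage: bluechi-is-online [agent|node|system] [OPTIONS]
    parser = argparse.ArgumentParser(prog="bluechi-is-online")
    parser.add_argument("target", choices=["agent", "node", "system"])
    parser.add_argument("--monitor", action="store_true",
                        help="keep running while online; exit only when offline")
    parser.add_argument("--initial-wait", type=float, default=0.0,
                        help="seconds to wait for the target to come online")
    return parser


def is_online_main(argv: Sequence[str],
                   get_status: Callable[[str], str],
                   sleep: Callable[[float], None] = time.sleep,
                   poll_interval: float = 0.5) -> int:
    """Return 0 if the target is online, otherwise 1.

    get_status is a hypothetical hook that returns "online" or another
    string for the given target; a real tool would query BlueChi instead.
    """
    args = build_parser().parse_args(argv)

    # Initial check, retried until --initial-wait seconds have elapsed.
    waited = 0.0
    while get_status(args.target) != "online":
        if waited >= args.initial_wait:
            return 1
        sleep(poll_interval)
        waited += poll_interval

    if not args.monitor:
        return 0

    # Monitor mode: block while online, exit non-zero once offline is seen.
    while get_status(args.target) == "online":
        sleep(poll_interval)
    return 1
```

Passing `get_status` and `sleep` as parameters keeps the sketch free of any dependency on a running BlueChi setup; for example, `is_online_main(["agent"], lambda target: "online")` returns 0 without sleeping.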

The systemd unit would then be using it like this:

[Unit]
UpheldBy=bluechi-agent.service
...
[Service]
# will keep it in "activating" state. if it fails, will be restarted by bluechi-agent.service
ExecStartPre=/usr/bin/bluechi-is-online --initial-wait=2
ExecStart=/usr/bin/bluechi-is-online --monitor

One reason for not adding it to an existing program like bluechictl is to keep it modular: you don't need to install the is-online checking package, but if you need it, you can. Also, bluechictl is meant to be a developer tool and probably should not be installed in production (depending on the use case, of course).
Quite a bit of code can be reused, though, and the new program can be structured similarly to bluechictl (CLI option parsing etc.).

@dofmind
Contributor Author

dofmind commented Oct 15, 2024

Actually, I tried to reuse the code of bluechictl, but it seems more reasonable to create a new program. Thanks for the guide.

@engelmi
Member

engelmi commented Oct 15, 2024

Actually, I tried to reuse the code of bluechictl, but it seems more reasonable to create a new program. Thanks for the guide.

@dofmind Thank you for the good discussion and refining this RFE.
As a result, I created #962, combining this issue and #858. If you would like to take #962, please add a small comment there. Otherwise we will start working on it soon-ish, depending on our capacity.

@dofmind
Contributor Author

dofmind commented Oct 15, 2024

I don't have time to do it right now. And I think your members can do it better than me :)
