Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sync pacemakerd with sbd #1957

Closed
wants to merge 2 commits into from

Conversation

wenningerk
Copy link
Contributor

@wenningerk wenningerk commented Dec 9, 2019

This tries to both address robust detection of graceful pacemaker-shutdown
and synchronization of sbd and pacemaker in a way that sbd not being
able to connect to the cib wouldn't automatically lead to the assumption
that neither pacemaker nor resources are active. (One reason for the latter to happen
is an selinux misconfiguration leading to ipc-connection denied for sbd.)

Basically it adds a status-ping - similar to controld - to pacemakerd.
In case pacemakerd detects presence of sbd it will stall till being pinged once
before starting the sub-daemons.
During shutdown it will again stall (again just in case of detection of sbd)
when the sub-daemons are shut down till this state is fetched via ping.
cibadmin -P implements this status-ping for pacemakerd (just to the local
instance) for easy testing without need of sbd.

The implementation probably needs some polishing like e.g.:

  • enum for pacemakerd-state instead of comparing strings
  • less ugly approach than sleep(5) to make client receive ping reply in shutdown-case
  • when do we really have a clean shutdown? (when one of the sub-daemons had to
    be stopped via SIGSEGV this is e.g. probably a sign for something not being that clean)
  • just open the channel really needed in crmadmin
  • overhauling pacemakerd for a cleaner state-machine would help detecting possible races
    (this PR so far makes things rather more obfuscated)
  • possibly implement a real client-API as with cib or fencing instead of handy-crafting xml

There is a bit of ugliness in a status-ping actually influencing behavior.
But on the other hand it is very little intrusive regarding API-changes and it is local
to pacemakerd without the need for anything else to know about the new sync-mechanism.
To catch cases where sub-daemons are misbehaving we can easily add an additional
tag with a timestamp when pacemakerd has last seen all the sub-daemons alive.
For the first to satisfy the interface to sbd it is just a simple timestamp
for the moment when the ping-request is received. Later on pacemakerd can go for
checking the sub-daemons like what is done initially already now periodically (presence
of pid & ipc) and later on maybe active pings towards the sub-daemons (controld would
already support a status-ping while other daemons might still need it to be added).
But these improvements can be added over time with sbd being agnostic of the progress.

From sbd-perspective the world would look something like
ClusterLabs/sbd#106.

Copy link
Contributor

@kgaillot kgaillot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the general approach is sound. I'm a little concerned the wait at shutdown might run up against a shutdown timeout, but documenting that effect might be sufficient. (As an aside, we really need to document pacemaker's sbd usage in Pacemaker Explained's fencing chapter.)

daemons/pacemakerd/pacemakerd.c Outdated Show resolved Hide resolved
daemons/pacemakerd/pacemakerd.c Outdated Show resolved Hide resolved
daemons/pacemakerd/pacemakerd.c Outdated Show resolved Hide resolved
daemons/pacemakerd/pacemakerd.c Outdated Show resolved Hide resolved
daemons/pacemakerd/pacemakerd.c Show resolved Hide resolved
tools/crmadmin.c Show resolved Hide resolved
tools/crmadmin.c Outdated Show resolved Hide resolved
tools/crmadmin.c Outdated Show resolved Hide resolved
tools/crmadmin.c Outdated Show resolved Hide resolved
tools/crmadmin.c Outdated Show resolved Hide resolved
Make pacemakerd wait to be pinged by sbd before starting
sub-daemons. Pings further reply health-state and timestamp
of last successful check. On shutdown bring down all the
sub-daemons and wait to be polled for state by sbd before
finally exiting pacemakerd.
Copy link
Contributor

@kgaillot kgaillot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. Now we just need an API :)

daemons/pacemakerd/pacemakerd.c Outdated Show resolved Hide resolved
crm_err("Failed building ping-reply");
}
/* just proceed state on sbd pinging us */
if (from && strstr(from, "sbd")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking we should exit in the sbd + reply==NULL case. That should only be possible when out of memory.

@@ -251,7 +268,8 @@ do_work(void)
cib_t *the_cib = cib_new();
xmlNode *output = NULL;

int rc = the_cib->cmds->signon(the_cib, crm_system_name, cib_command);
int rc = the_cib->cmds->signon(the_cib,
ipc_name?ipc_name:crm_system_name, cib_command);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would simplify a bit to do "if (ipc_name == NULL) ipc_name = crm_system_name" after parsing args

tools/crmadmin.c Outdated Show resolved Hide resolved
tools/crmadmin.c Outdated Show resolved Hide resolved
} else {
pacemakerd_state = XML_PING_ATTR_PACEMAKERDSTATE_STARTINGDAEMONS;
init_children_processes(NULL);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you traced what happens when pacemakerd itself respawns? find_and_track_existing_processes() will "adopt" any child daemons already running, so maybe we shouldn't wait for a ping in that case (the assumption would be that the ping succeeded with the previous pacemakerd instance), but I'm not sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was thinking about that but for simplicity and to prevent it from coming up when it shouldn't under some respawn-race I decided to wait this additional second.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

find_and_track_existing_processes() returns the number of existing daemons found, so we can distinguish three cases:

  • 0: as above
  • SIZEOF(pcmk_children): Most likely, pacemakerd itself crashed, was respawned by systemd, and found all the daemons already running. In this case it probably makes sense to go straight to RUNNING just to reflect reality.
  • Any other positive number: Most likely, corosync dropped, all the corosync-based daemons including pacemakerd exited, systemd respawned pacemakerd, and pacemakerd found the non-corosync daemons (scheduler and executor) already running. I suspect WAITPING makes sense here too, but I'm not 100% sure.

@wenningerk
Copy link
Contributor Author

The 2nd commit has the API + adaption of crmadmin to use the API instead of handcrafting the xml. Finally they probably should probably be squashed.

@wenningerk wenningerk force-pushed the sync_with_sbd branch 2 times, most recently from 211f7f4 to f9ca5d9 Compare December 17, 2019 15:43
Add new api as not to make the xml-structure an external interface.
@wenningerk
Copy link
Contributor Author

Now that there is an enum for pacemakerd-states (+ conversion to strings back an forth) one could as well think of using that not just on the client side as abstraction for the actual xml-tags but as well in pacemakerd.c.

Copy link
Contributor

@kgaillot kgaillot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, I think it's ready with a little bit of polishing

}
#endif

#endif // PACEMAKERD_TYPES__H
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing newline

enum pacemakerd_conn_state {
pacemakerd_conn_connected,
pacemakerd_conn_disconnected
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm conflicted about the naming. I'd prefer that new public APIs use the "pcmk_" prefix but I can't think of a good choice for this.

Looking at existing client APIs, we have: crm_/crm_cluster_t, cib/cib_t, attrd_/crm_ipc_t, lrmd_/lrmd_t, and stonith_/stonith_t. I'm not a fan of the "d" prefixes because I'd rather use those within the daemons themselves, and be clearer that these APIs are for clients. Going forward I could see something like pcmk_cib_, pcmk_attr_, pcmk_exec_.

Maybe "pcmk_master_"?

Copy link
Contributor Author

@wenningerk wenningerk Dec 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All in all quite a mess as we are ending up with a mixture of the old and the new names.
We've just deprecated master/slave but I'm hesitant using master for this here still.
As we're anyway not getting away from it in the ipc what about MCP (pcmk_mcp_ then)?
For the filenames then as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mcp stands for master control process, so that only hides it :)
Maybe "launcher" or "supervisor"? pcmk_launcher_ping() / pcmk_super_ping()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jeah I know what it stands for but when you see "mcp" it doesn't pattern-match "master" and after unfolding you get what it really is about.
I'm just not feeling well using a third name for the same thing unless we make pacemakerd pacemaker-launcher but I guess that would cause a lot of issues.
What about something that can't be confused with a component like pcmk_toplevel_api ... not really impressive but as an idea to develop maybe ... or nothing at all pcmk_api ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pacemaker-launchd was a serious possibility for 2.0 but we decided against it. mcp isn't used at all publicly in 2.0, so we wouldn't be adding a third, we'd just be relabeling it like the other 1.1 names. pcmk_api_ feels redundant. I like "supervisor" as being closest to what it actually does. "global" or "core" might work.

struct pacemakerd_s {
enum pacemakerd_conn_state conn_state;
void *pacemakerd_private;
pacemakerd_api_operations_t *cmds;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the methods here are invariant, I'd recommend something more like the crm_ipc_t model:

  • Keep just the typedef in the public header, and have the entire struct definition private
  • Add a getter method for conn_state if needed
  • Use regular functions as the methods rather than function pointers in a struct

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want to break with that? crm_ipc_t is rather the exception so far while other APIs don't make much use of virtual methods either.
Getters and setters aren't optimized away as when used with C++ - right. But of course the impact is probably minor compared to the flexibility they offer over direct access.
Not seeing that we are actually using the possibilities so far but there are probably a lot of things that could be done with methods called via pointers.
Things coming to my mind is handling of deprecated stuff like on the fly insertion of a wrapper, add a new prototype of a method at the end while keeping the deprecated original with an altered name at the old place so that things compiled against the old types will still work while stuff compiled at the new versions will see the new prototype - all without version-handling and stuff.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the value for a case like this, but I'm fine with keeping pointers if you want them. I'd still hide conn_state (callers shouldn't change it), and I'd put the pointers in the main struct rather than use a separate struct for them (since we're hiding everything else anyway).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide to walk away from the pointers on the longer run and with new APIs for now I have nothing against it. Just wanted to be sure we aren't overlooking some potential of what we already have before we alter it or proceed in an inconsistent way.
Having the connection-state visible was a decision for consistency and a little bit of performance. Was as well not sure if it could make sense to put a timestamp into it as well having some kind of default ping-handler that just updates this timestamp and does the
pinging/reconnecting stuff autonomously. But of course we can read that timestamp with a
getter as well - with single-threaded pacemaker we probably won't need a wrapper for atomic-access but who knows what the future will bring.
Of course there is no need for having the pointers in an extra allocated memory once we don't
want to grow the data in the base-structure and the function-pointer-list both in a compatible manner any more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't think of being able to add to the bottom of each separately. Though even then, it would just be for visual organization -- you could mix new members and methods at the end of a single struct.

I like hiding the entire type partly because we have complete freedom over the implementation (don't have to add things at the end, can change types of existing members, etc.) and partly because it's impossible for callers to misuse it by statically declaring a struct or using sizeof() on it. Using the void* private technique gets most of that though.

#include <crm/common/mainloop.h>
#include <crm/common/pacemakerd_types.h>

CRM_TRACE_INIT_DATA(pacemakerd);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just needed once per library -- we don't need it here since libcrmcommon already does it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hadn't checked what was behind it but as I found it multiple times in other libraries I was assuming it added some static stuff as well.
But to be honest I don't see where it currently would be defined to anything else than NIL.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't look that far, I wouldn't be surprised if it was unnecessary. :) The only library that has it twice is an accident because two libraries were merged into one.

XML_PING_ATTR_PACEMAKERDSTATE_WAITPING,
XML_PING_ATTR_PACEMAKERDSTATE_RUNNING,
XML_PING_ATTR_PACEMAKERDSTATE_SHUTTINGDOWN,
XML_PING_ATTR_PACEMAKERDSTATE_SHUTDOWNCOMPLETE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking the definition of these should move from msg_xml.h to the new header, which might help shorten them a little too :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no XML_.ATTR. defines in any other header so far.
We could split msg_xml.h into multiple files and include them in msg_xml.h so that code using msg_xml.h wouldn't have to change. If we do the includes where the defines had previously
been we wouldn't even mess up the whole structure.
For this very case though we already have the enum and the converters in both directions.
So we could keep the strings entirely inside the c-file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These aren't XML element or attribute names, so I was thinking we could rename them when moving them. E.g. if we went with pcmk_core_ we would use PCMK_CORE_STATE_*.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But right, in this case we can now make them private to the file

(pacemakerd->cmds->free)) {
pacemakerd->cmds->free(pacemakerd);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing newline

crm_time_free(crm_when);

if (BE_SILENT && pacemakerd_state_enum2text(state) != NULL) {
fprintf(stderr, "%s\n", pacemakerd_state_enum2text(state));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why crmadmin displays "quiet" output to stderr, but that will probably change when we convert it to use the new output object, where only errors go to stderr. Let's just use printf() here for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that appeared broken to me.
But I wanted it to be consistent with crmd-ping as far as possible and as I wasn't sure who (CTS, RAs, ... haven't investigated) was relying on that behavior I'd like to keep it in sync so that it might be fixed together with crmd-ping that will anyway change with the new API there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine, though I don't think anything relies on it -- crmadmin was created more for debugging than anything else. The man page even calls it a "development tool" and says it is "likely to be replaced by crm_node" (though it's not).

return rv;
}

ipc_source =
mainloop_add_ipc_client(
DO_PACEMAKERD_HEALTH?CRM_SYSTEM_MCP:CRM_SYSTEM_CRMD,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need to check DO_PACEMAKERD_HEALTH past here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops a survivor from before I had adapted crmadmin to use the new api.
Well no false behavior so tests didn't show it after I had read over it.

} else {
pacemakerd_state = XML_PING_ATTR_PACEMAKERDSTATE_STARTINGDAEMONS;
init_children_processes(NULL);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

find_and_track_existing_processes() returns the number of existing daemons found, so we can distinguish three cases:

  • 0: as above
  • SIZEOF(pcmk_children): Most likely, pacemakerd itself crashed, was respawned by systemd, and found all the daemons already running. In this case it probably makes sense to go straight to RUNNING just to reflect reality.
  • Any other positive number: Most likely, corosync dropped, all the corosync-based daemons including pacemakerd exited, systemd respawned pacemakerd, and pacemakerd found the non-corosync daemons (scheduler and executor) already running. I suspect WAITPING makes sense here too, but I'm not 100% sure.

@wenningerk
Copy link
Contributor Author

Hmm ... why doesn't github allow me to reply directly to the find_and_track_existing_processes comment!?
What you describe would probably work and be a little bit more efficient in the current state and it would prevent one or the other glitch of states we are in just for very short.
But I still think the current behavior should work as well and have no really bad effects.
I didn't want to turn it around too much before we exactly know how we are gonna do the alive-checks of the processes.
Not sure if what find_and_track_existing_processes() does is enough on the long run with checking just the existence of the process and some ipc-stuff.
Doing some ping-response-game periodically could fit into something like the init_children_processes (something like check_children_processes then rather of course) logic why I wanted to go through that at least once initially before switching to running-state.
I was as well thinking of going into a cleanup when not pinged within a certain timeout as last resort.
Which is another reason I wanted to leave it in.

@kgaillot
Copy link
Contributor

How would we handle remote nodes with this? pacemakerd IPC currently isn't proxied to remote nodes, and we don't have pacemakerd starting children on a remote node.

I have thought about running pacemakerd on remote nodes and having it launch pacemaker-remoted as a child. That would be a major project though.

I suppose we don't really need the monitoring on remote nodes because pacemaker-remoted can't run resources without a cluster connection, and the cluster monitors the connection.

@wenningerk
Copy link
Contributor Author

Yeah I guess introduction of an additional process just to be monitored that in turn would then have to monitor pacemaker-remoted would be some kind of overkill.
As we don't have multiple processes we don't need a mechanism on remote nodes that reduces external (sbd) observation needs to just observing a single process as we just have this single process.
Regarding how sbd and pacemaker could potentially be working together remote nodes are somewhat different anyway unfortunately - manly due to a cib-query requiring a functional cluster more or less.
Of course pacemaker-remoted could implement this state-query/lifecycle/heartbeat or however you would like to call it API directly. That would make operation of sbd on remote-nodes and cluster-nodes much more similar. And the issues with ClusterLabs/sbd#14 should be gone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants