Abstract controller API from crm_resource #1981
Conversation
Hi @HideoYamauchi , can you test this and see if it helps the issue in CLBZ#5420? @wenningerk and @clumens , this is a rough idea for a daemon IPC model, as discussed in #1957 for pacemakerd. I think it needs considerable work to be usable for all our use cases, but it should give some ideas of a direction.
Hi Ken,
Okay! Best Regards,
Hi Ken, I checked the operation with the modified version.
I'll look into the details starting tomorrow. Best Regards,
Force-pushed from 7f6c592 to f1f43dc
Hi Hideo, Thanks for the test. I've added a commit that ensures the connection is always closed gracefully before exit; would you have a chance to test it again?
Hi Ken, Okay! Best Regards,
Hi Ken, I tested the new fix.
Best Regards,
Hi Ken, I have looked into the problem a little more. When the problem occurs, crm_ipcs_flush_events() appears to send to the client after the client has already disconnected. If another message transfer happens in between, the problem does not occur, because the controller then recognizes that the client has disconnected.
More fundamentally, in the procedure I reported, the controld daemon sends three responses, so crm_resource, which waits for only two, often cannot handle them. Since the response is returned by send_task_ok_ack(), suppressing the response to crm_resource in delete_rsc_entry() may solve the problem.
I will think a little more about how to fix this. Best Regards,
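A minimal sketch of the reply-counting pattern described above (all names are illustrative assumptions, not the actual crm_resource code):

```c
/* Illustrative sketch only: crm_resource waits for a fixed number of
 * controller replies, so an unexpected extra reply arrives after it has
 * already disconnected. Names here are assumptions, not the real code. */
#include <stdbool.h>

static int expected_replies = 2;  /* how many replies crm_resource waits for */
static int received_replies = 0;

/* Returns true while more replies are still expected */
static bool
handle_controller_reply(void)
{
    received_replies++;
    /* Once the expected count is reached, the client disconnects and exits;
     * a third reply from the controller then hits a closed connection,
     * which is what crm_ipcs_flush_events() reports on the daemon side. */
    return (received_replies < expected_replies);
}
```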
This is an initial step towards a public API in libcrmcommon.
Previously, crm_resource would close open connections and free memory before exiting only in certain circumstances. Now, it always does.
For all of the daemons except the executor, the IPC post-connection handler did nothing but log a trace message with the libqb client pointer address. This is unnecessary since the daemons log a trace when first accepting the connection, and libqb logs another when creating the client.
Previously, some callers of pcmk__client_data2xml() trace-logged the resulting XML, and some didn't. Moving the log into the function itself reduces duplication and ensures the trace is always done (at the minor cost of not being able to trace just one daemon's incoming IPC). Also remove an unused argument from the function.
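For illustration, the change amounts to something like the following sketch (the exact signature of pcmk__client_data2xml() shown here is an assumption):

```c
/* Sketch only: the parse helper traces the XML itself, so every caller
 * gets the trace without duplicating the log call. The signature is an
 * assumption for illustration, not the exact one in libcrmcommon. */
xmlNode *
pcmk__client_data2xml(pcmk__client_t *c, void *data, uint32_t *id,
                      uint32_t *flags)
{
    xmlNode *xml = NULL;

    /* ... decompress and parse the libqb IPC buffer into xml ... */

    crm_log_xml_trace(xml, "[IPC received]");  /* trace once, for every daemon */
    return xml;
}
```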
Hi @HideoYamauchi , I am still investigating this, but I've added more commits to improve the logging on the controller side. I now suspect the problem is in the controller and not crm_resource. I can reproduce the problem by running crm_resource --refresh three times in quick succession, but not if I wait a couple of seconds in between. That suggests one client connection is affecting how the controller handles the next connection if there's not some time in between. Another possibility is that the difference between a good run and a bad run has to do with whether there is a monitor operation in flight when the resource is cleaned up.
Hi @HideoYamauchi , I am getting further with the investigation. There appears to be a problem on each side (crm_resource and the controller). It occurs when refreshing a clone that hasn't been probed yet (whether at start-up or just after a refresh). On the crm_resource side, the behavior is different depending on whether the clone has been probed. If the clone has been probed, crm_resource sends the controller one request per clone instance, with each instance properly assigned to a particular node. Here is an example of part of the IPC request:
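(The snippet below is a fragment reconstructed from this description; the element name and attributes beyond id and long-id are assumptions.)

```xml
<rsc id="pinger" long-id="pinger:0" class="ocf" provider="pacemaker" type="ping"/>
```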
Note that the id is "pinger" with a long-id of "pinger:0", and in this case there will be only one request sent for pinger:0, on node1. When the clone has not been probed, crm_resource sends a request for each clone instance and node combination, using only the numbered instance name:
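(Again a reconstructed fragment; names beyond id are assumptions.)

```xml
<rsc id="pinger:0" class="ocf" provider="pacemaker" type="ping"/>
```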
Note the id is "pinger:0" with no long-id, and in this case, this request will be repeated for each node, and then requests for pinger:1 will be sent for each node, etc. My first thought is that for anonymous clones that aren't probed, the crm_resource should only send one request per node, using the unnumbered name. For unique clones the current behavior makes sense. It's possible the current behavior is there to be able to clear the history in case a unique clone was converted to anonymous, but I doubt that's the right approach. (This fix would work around the problem, and is probably a good idea, but isn't necessary.) On the controller side, when receiving the requests in the first case, everything proceeds reasonably, with one reply per request. However in the second case, for some reason I haven't figured out yet, the controller replies twice to each request for the local node. Since crm_resource is only counting replies (not matching replies to requests), this throws off its count and it exits early. Fixing the controller such that it only replies once to each reply would avoid the issue correctly, and would make the work-around on the crm_resource side unnecessary. Finally, another good idea would be to make the controller and crm_resource match requests to replies. That would allow crm_resource to discard any duplicate replies for the same request, which would also work around the problem. However this would be a major change in the controller IPC protocol; it currently does not have a "call ID" concept like the CIB, fencing, and executor IPC protocols. This is definitely a good idea but is probably a longer term project. |
Even though this PR does not fix the issue, I am inclined to merge it as-is once testing completes, since it is a good first step towards improving the controller IPC protocol, and it also provides better logs.
Hi Ken,
Let's consider this issue further and improve it in the future. Best Regards,
This isolates the controller-related code of crm_resource into new files crm_resource_controller.h and crm_resource_controller.c, creating an IPC API interface for the controller somewhat like what we have for other daemons. This is intended as a first step toward a common model for daemon IPC APIs, to be refined and eventually moved to libcrmcommon.
It is also hoped that this will fix CLBZ#5420, or at least make it easier to fix.
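As a rough sketch of what such a controller IPC API surface might look like (all names below are assumptions for illustration, not the actual interface introduced by this PR):

```c
/* crm_resource_controller.h -- illustrative sketch only; every name here
 * is an assumption, not the interface actually added in this change. */
#include <stdbool.h>
#include <libxml/tree.h>

typedef struct pcmk_controld_api_s pcmk_controld_api_t;

typedef struct pcmk_controld_api_cb_s {
    /* Invoked for each reply received from the controller */
    void (*callback)(pcmk_controld_api_t *api, void *user_data, xmlNode *msg);
    void *user_data;
} pcmk_controld_api_cb_t;

/* Allocate an API object identifying this client to the controller */
pcmk_controld_api_t *pcmk_new_controld_api(const char *client_name,
                                           const char *client_uuid);
void pcmk_free_controld_api(pcmk_controld_api_t *api);

/* Connect or disconnect the IPC session, registering dispatch and
 * connection-loss callbacks */
int pcmk_controld_api_connect(pcmk_controld_api_t *api, bool use_mainloop,
                              pcmk_controld_api_cb_t *dispatch_cb,
                              pcmk_controld_api_cb_t *destroy_cb);
void pcmk_controld_api_disconnect(pcmk_controld_api_t *api);

/* Ask the controller to clean a resource's history (a "refresh") */
int pcmk_controld_api_refresh(pcmk_controld_api_t *api,
                              const char *target_node,
                              const char *router_node,
                              const char *rsc_id, const char *rsc_long_id,
                              const char *standard, const char *provider,
                              const char *type, bool cib_only);
```

The design intent described above is that each daemon would eventually expose an object of this shape from libcrmcommon, so tools no longer open raw IPC connections themselves.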