RFC/Feature Request: vtorc should make it easy to find audit info related to recoveries #17465
Comments
VTOrc seems to be designed to be very ephemeral. As you pointed out, the backend data is ephemeral, also the

I guess what I'm saying here is that if an improved UI is made, I feel it should be located somewhere "stable" and more purpose-built, such as VTAdmin.
The backend of VTOrc is currently assumed to be ephemeral, and this RFC requests that some of that data not be ephemeral. I suggest separating out the data that requires persistence rather than breaking that assumption for all backend data.
There is an "auditing" system in VTOrc. Currently it lets you output audit data to logs, a file or the ephemeral backend DB. The usefulness of the backend and, to some degree, the file option is questionable to me in such an ephemeral daemon/pod, but 🤷. The quality/detail of the audit events could be improved in my opinion, but to me this existing "audit" concept feels like the logical place to build upon.

Thinking aloud, could support for remote, persistent audit backends and support for reading it in VTAdmin address this? Also, is a web UI required? Could
Today you can use

I think that we should start by improving what we have before we consider adding more complexity and tradeoffs that some may object to. To put it another way, we should explore any gaps today for option 1. That may just be improving the usefulness of vtorc log messages, which is always a good thing. We should also improve the output of the existing vtorc http endpoint. These are both useful improvements that nobody can reasonably object to and which would make the product much better IMO without any downsides.
Hey @ejortegau, Deepthi and I discussed this, and I think we can go ahead for now with the first proposal of having distinguishable log entries. Currently VTOrcs don't register themselves in the topo-server, which means they aren't discoverable by any other running service that doesn't explicitly have their endpoints. That makes the other two options a little trickier to implement, so let's start with the first.
This is meant to make recovery actions more easily identified from the logs. See vitessio#17465 Signed-off-by: Eduardo J. Ortega U. <[email protected]>
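A minimal sketch of what such distinguishable log entries could look like, assuming the `<Problem> Recovery <keyspace>/<shard>:` prefix from the RFC. The helper names `recoveryLogPrefix` and `logRecoveryStep`, as well as the tablet alias in `main`, are hypothetical and not taken from the VTOrc code base.

```go
package main

import (
	"fmt"
	"log"
)

// recoveryLogPrefix builds the "<Problem> Recovery <keyspace>/<shard>:" prefix
// described in the RFC. Hypothetical helper, not an existing VTOrc function.
func recoveryLogPrefix(problem, keyspace, shard string) string {
	return fmt.Sprintf("%s Recovery %s/%s:", problem, keyspace, shard)
}

// logRecoveryStep logs one recovery step with the shared prefix, so that all
// steps of a recovery can be found with a single search over the logs.
func logRecoveryStep(problem, keyspace, shard, format string, args ...any) {
	prefix := recoveryLogPrefix(problem, keyspace, shard)
	log.Printf("%s %s", prefix, fmt.Sprintf(format, args...))
}

func main() {
	// The tablet alias below is made up for illustration.
	logRecoveryStep("DeadPrimary", "commerce", "80-", "starting recovery")
	logRecoveryStep("DeadPrimary", "commerce", "80-", "promoted %s", "zone1-0000000101")
}
```

With a prefix like this, an operator with a log pipeline can filter on the problem name plus keyspace/shard to reconstruct the steps of a recovery.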
To me we're tackling 2 x topics:
Focusing on the 1st point, the audit system has "file", "backend" and "syslog" options. If it doesn't exist already, what if an interface such as Perhaps the existing "file", "backend" and "syslog" options could be migrated to implement this

I personally feel VTOrc should remain ephemeral, so I feel a persistent backend both components can access would be a remote datastore. Could we just persist the recovery data in-Vitess, perhaps in the sidecar database of the shard in a simple schema with JSON fields? Of course, this might not scale at read-time. Or at worst, in a remote store such as a plain MySQL (either as a custom

As an example, in our AWS-based infra, something like an RDS database or S3 would be viable. If it could be stored in the shards, even better.
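To make the idea of a pluggable audit backend more concrete, here is a rough sketch of what such an interface might look like. Everything in it is an assumption for illustration: `AuditEvent`, `AuditSink`, `jsonLineSink`, `memorySink` and `multiSink` are hypothetical names, and none of this reflects how VTOrc's audit code is actually structured.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
	"time"
)

// AuditEvent is a hypothetical recovery audit record; the fields are illustrative.
type AuditEvent struct {
	Time     time.Time `json:"time"`
	Keyspace string    `json:"keyspace"`
	Shard    string    `json:"shard"`
	Problem  string    `json:"problem"`
	Message  string    `json:"message"`
}

// AuditSink is a hypothetical interface that a file, syslog, backend DB or
// remote persistent store could each implement.
type AuditSink interface {
	Write(ev AuditEvent) error
}

// jsonLineSink writes one JSON object per line to any io.Writer (file, stdout, ...).
type jsonLineSink struct{ w io.Writer }

func (s jsonLineSink) Write(ev AuditEvent) error {
	return json.NewEncoder(s.w).Encode(ev) // Encode appends a trailing newline
}

// memorySink keeps events in memory, standing in for an ephemeral backend.
type memorySink struct{ events []AuditEvent }

func (m *memorySink) Write(ev AuditEvent) error {
	m.events = append(m.events, ev)
	return nil
}

// multiSink fans an event out to several sinks and returns the first error seen.
type multiSink []AuditSink

func (m multiSink) Write(ev AuditEvent) error {
	var firstErr error
	for _, s := range m {
		if err := s.Write(ev); err != nil && firstErr == nil {
			firstErr = err
		}
	}
	return firstErr
}

func main() {
	var sink AuditSink = multiSink{jsonLineSink{os.Stdout}, &memorySink{}}
	err := sink.Write(AuditEvent{
		Time:     time.Now(),
		Keyspace: "commerce",
		Shard:    "80-",
		Problem:  "DeadPrimary",
		Message:  "starting recovery",
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "audit write failed:", err)
	}
}
```

A remote, persistent store would then be one more `AuditSink` implementation, leaving VTOrc itself ephemeral.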
Feature Description
Context & prior art
Classical Orchestrator had under /web/audit-recovery/uid/<recoveryid> some details about recoveries and steps taken as part of each one of them. This allowed an operator to easily see whether the recovery had succeeded or not, as well as some information about each step taken.

Furthermore, if one had more than one Orchestrator instance (and set them up with the shared MySQL database - not sure how this worked on the raft setup with independent local databases as I never used it), no matter which Orchestrator instance was checked, the same information would be visible.
In vtorc, however, this information is (at least currently) not so clearly surfaced to the user anywhere. The current vtorc web UI only seems to show a list of recoveries with no details on each one.

Furthermore, since there is no shared state, and no cluster leader, users need to check each individual vtorc instance to see all the recoveries. This is IMHO a considerable feature gap when it comes to observability.

Feature Request
There should be some easy means to centrally view what recoveries took place and how each step of them went.
RFC
There are multiple ways to try to solve the issue above:

1. <Problem> Recovery <keyspace>/<shard>: (e.g. DeadPrimary Recovery commerce/80-:). This is probably the lowest effort one to implement, but assumes the existence of logging processing & visualization infrastructure.
2. vtorc implements a new web UI endpoint for the information. This comes with some challenges:
   - vtorc instances: vtorcs need to know about each other, and the one serving the web request fetches the information from the rest to present a single view. This would imply that, in addition to the UI, vtorc should also expose a recovery details API endpoint that can be queried by the other vtorcs to consolidate information from all of them.
   - vtorc currently uses a sqlite DB which defaults to in-memory storage. Care should be taken to ensure it's kept across vtorc process restarts if the user does not want to lose the recovery details across restarts. E.g., use an actual sqlite file instead of in-memory storage and, if running in k8s, use a persistent volume claim for its path.
3. vtadmin adds functionality to query all vtorc instances, aggregates the information about recoveries from all instances and shows it to the user in a new web UI. This requires registering all vtorc instances somewhere so that vtadmin can find them. I guess that would be the topology? In this scenario, the user should still take care of data persistence for the vtorc DB.

The last two approaches are more feature-rich/complete for the user, as they do not assume that a logging pipeline exists, but they obviously require more work. Also, perhaps the first approach could be done as a stopgap measure while the second or third one is done.
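The aggregation step shared by the last two approaches can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Recovery` fields, the `/api/recovery-details` path and the instance addresses are all hypothetical, since the RFC only proposes that such an endpoint be added.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"sort"
	"time"
)

// Recovery is a hypothetical summary of one recovery; field names are illustrative.
type Recovery struct {
	ID        int64     `json:"id"`
	Keyspace  string    `json:"keyspace"`
	Shard     string    `json:"shard"`
	Problem   string    `json:"problem"`
	StartedAt time.Time `json:"started_at"`
	Succeeded bool      `json:"succeeded"`
	Instance  string    `json:"-"` // set by the aggregator, not by the instance
}

// fetchRecoveries queries one vtorc instance. The endpoint path is an assumption.
func fetchRecoveries(ctx context.Context, client *http.Client, baseURL string) ([]Recovery, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/api/recovery-details", nil)
	if err != nil {
		return nil, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("%s: unexpected status %s", baseURL, resp.Status)
	}
	var recs []Recovery
	if err := json.NewDecoder(resp.Body).Decode(&recs); err != nil {
		return nil, err
	}
	for i := range recs {
		recs[i].Instance = baseURL // remember which instance reported it
	}
	return recs, nil
}

// mergeRecoveries flattens the per-instance lists into one view, newest first.
func mergeRecoveries(perInstance ...[]Recovery) []Recovery {
	var all []Recovery
	for _, recs := range perInstance {
		all = append(all, recs...)
	}
	sort.SliceStable(all, func(i, j int) bool {
		return all[i].StartedAt.After(all[j].StartedAt)
	})
	return all
}

func main() {
	// Hypothetical addresses; in approach 3 these would come from a registry such as the topology.
	instances := []string{"http://vtorc-0:16000", "http://vtorc-1:16000"}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	var lists [][]Recovery
	for _, addr := range instances {
		recs, err := fetchRecoveries(ctx, http.DefaultClient, addr)
		if err != nil {
			fmt.Println("skipping instance:", err)
			continue
		}
		lists = append(lists, recs)
	}
	for _, r := range mergeRecoveries(lists...) {
		fmt.Printf("%s %s/%s from %s succeeded=%v\n", r.Problem, r.Keyspace, r.Shard, r.Instance, r.Succeeded)
	}
}
```

Whether this logic lives in a vtorc instance (approach 2) or in vtadmin (approach 3), the aggregator still needs a way to discover the instances, which is the registration problem mentioned above.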
Thoughts?
Use Case(s)
As a Vitess operator, I want to be able to easily check whether vtorc detected and attempted to recover a problem, and the steps taken during that recovery.