-
Notifications
You must be signed in to change notification settings - Fork 578
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ApiListener#Start(): auto-renew CA on its owner #9891
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it really that easy to create a drop-in replacement for the CA certificate that also keeps all existing child certificates valid? So please provide a reference for which attributes have to be shared between both certificates for this to work.
a3b8468
to
b3315a4
Compare
Test protocolSetup
Now you have a 3 lvl cluster with nothing. /etc/icinga2/zones.conf
Preparation
Congratulations! Now you have a 3 lvl cluster with nothing which is about to collapse in one day. Examination
ConclusionTo do:
|
1194eb5
to
c05b877
Compare
Much better:
Prevented the cluster from collapsing! Everywhere However: 👎 Satellite and agent started w/o master are now in a re-connect loop. 🙈 |
e42f9e6
to
a1e3402
Compare
I have to correct myself: Only the agents won't need an update. |
a1e3402
to
b43f1e7
Compare
Test protocol IISetup
Now you have a 3 lvl cluster with nothing. /etc/icinga2/zones.conf
Preparation
Congratulations! Now you have a 3 lvl cluster with nothing which is about to collapse in one day. Examination
Conclusion🎉 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way this is currently implemented, there's no way to manually trigger the renewal early, is it? I don't mean from a "there's a CLI command for that" perspective but even if there was, the logic for the actual redeployment seems tied to the IsCertUptodate()
function.
As an admin, I think I'd become nervous if a CA certificate came close to its expiry. Also, I think I would be surprised if the CA certificate changed magically as it probably needs to be replaced elsewhere too. Could the logic be adapted in a way to allow an early manual renewal? The current logic of renewing in the last 30 days could stay as a last attempt to prevent the cluster from breaking if the admin failed to do that.
Can you please share the CA certificates from one of your tests from before and after the renewal (ideally showing the diff of |
👍
|
I've tested this PR in my test cluster with an additional patch that should give more frequent CA renewal, around every 5 minutes:
diff --git a/lib/base/tlsutility.cpp b/lib/base/tlsutility.cpp
index 246bd5aee..cf991dbe7 100644
--- a/lib/base/tlsutility.cpp
+++ b/lib/base/tlsutility.cpp
@@ -796,7 +796,7 @@ bool IsCertUptodate(const std::shared_ptr<X509>& cert)
bool IsCaUptodate(X509* cert)
{
- return !CertExpiresWithin(cert, LEAF_VALID_FOR);
+ return !CertExpiresWithin(cert, ROOT_VALID_FOR - 300);
}
String CertificateToString(X509* cert)
diff --git a/lib/base/tlsutility.hpp b/lib/base/tlsutility.hpp
index b06412020..74460736d 100644
--- a/lib/base/tlsutility.hpp
+++ b/lib/base/tlsutility.hpp
@@ -36,7 +36,7 @@ const unsigned int DEFAULT_CONNECT_TIMEOUT = 15;
const auto ROOT_VALID_FOR = 60 * 60 * 24 * 365 * 15;
const auto LEAF_VALID_FOR = 60 * 60 * 24 * 397;
const auto RENEW_THRESHOLD = 60 * 60 * 24 * 30;
-const auto RENEW_INTERVAL = 60 * 60 * 24;
+const auto RENEW_INTERVAL = 60;
void InitializeOpenSSL();
diff --git a/test/base-tlsutility.cpp b/test/base-tlsutility.cpp
index c20b5ed0f..15a4fa92f 100644
--- a/test/base-tlsutility.cpp
+++ b/test/base-tlsutility.cpp
@@ -87,6 +87,7 @@ BOOST_AUTO_TEST_CASE(sha1)
BOOST_AUTO_TEST_CASE(iscauptodate_ok)
{
+ return;
auto key (GenKeypair());
BOOST_CHECK(IsCaUptodate(MakeCert("Icinga CA", key, "Icinga CA", key, [](ASN1_TIME* notBefore, ASN1_TIME* notAfter) { Observations so far:
|
One more reason not to copy the CA over manually! I mean, it's still not broken, but... 🤯 Or! You can copy it over again to restore the symmetry. 🙈
Well. The CA has LEAF_VALID_FOR time to get distributed through the whole cluster. That's 397d. That's 1.09y! Even in the extreme OP issue case there's a margin of safety of 5 months. Even that's 5x RENEW_THRESHOLD. TL;DR: No. At least not quicker than leaf certs (already), would be paradox.
Tbh I prefer testing the actual code. If this PR LGTY, but you wanna be sure on X and Y, please write down X and Y. I'll manipulate a few certificates with faketime openssl and share what Icinga did. But yes, I'd expect... oh! You have to wait one day which is the minimum diff for a new CA to be considered newer! |
What are the other reasons why I wouldn't want to do this? I mean if only one master can sign certificates, I don't have redundancy for that.
But why doesn't it happen quickly? Shouldn't certificate renewal trigger reconnects which trigger certificate requests which should renew these certificates until everything is renewed?
My idea was to do this as kind of a stress test. Like change the thresholds so that it happens frequently, leave it running for a day so that you get hundreds of iterations, see if anything strange happens. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The number of signing masters is directly proportional to the times you have to run ssh and icinga2 ca list in the worst case to see where a particular CSR is.
- I agree that even the original code, not to mention mine, should re-connect on cert deployment. I can test this with my code as-is along with some hard requirements you write down. But I don't see this as a hard requirement due to the already mentioned time periods. If regular renewals work despite this "bug"(?), CA ones will do for sure. Not to mention eventual non-obvious influence of you stress test patch. Apropos...
if (requestorCA && !IsCaUptodate(requestorCA)) { | ||
int days; | ||
|
||
if (ASN1_TIME_diff(&days, nullptr, X509_get_notAfter(requestorCA), X509_get_notAfter(cacert.get())) && days > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- You can't properly speedrun renewals without updating all conditions. This one requires the new CA to expire 1d+ after the old one for propagation.
Indeed:
|
💡Yes. The satellite got a new cert+CA. And it cut off all connections. And the agent re-connects. And there's a new CA. But! It's still the master who decides who gets which cert and when! And it may or may not be connected, yet.
|
test/base-tlsutility.cpp
Outdated
return key; | ||
} | ||
|
||
static std::shared_ptr<X509> MakeCert(char* issuer, EVP_PKEY* signer, char* subject, EVP_PKEY* pubkey, std::function<void(ASN1_TIME*, ASN1_TIME*)> setTimes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The char *
parameters should be const, just noticed a new compiler warning scrolling by due to this:
.../test/base-tlsutility.cpp:92:36: warning: ISO C++ forbids converting a string constant to 'char*' [-Wwrite-strings]
92 | BOOST_CHECK(IsCaUptodate(MakeCert("Icinga CA", key, "Icinga CA", key, [](ASN1_TIME* notBefore, ASN1_TIME* notAfter) {
| ^~~~~~~~~~~
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But they're (unsigned char*)ed anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gosh, I hate OpenSSL 1.0.2 (or those Linux distributions that think it's a good idea to keep that version alive), in 1.1.1 they realized that making that parameter const
is probably a good idea.
Anyways, then please use const_cast<>()
, that's more of a "I'm intentionally doing this" and shouldn't issue a warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Don't complain about your life. When I started working here, until my final exam, we had to support RHEL 5. I.e. Python 2.4, C++03 + auto and OpenSSL 0.9.8e with TLSv1.0. Not to mention the PHP version.
- The current cast does neither despite -Wall -Wextra (and is how our actual code does this).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean we're writing software in C++, not in C, but I get it, for some reason, you and C-style casts are just inseparable.
So certificate requests are only forwarded if the parent node is currently connected? And if it's not, we're basically relying on |
otherwise it would expire.
and a newer one is available.
5c82472
to
966216f
Compare
Ah, yes! Almost forgot that timer. 👍 |
Indeed, if I speedrun that timer (actually all) with |
I added the following to my patch from #9891 (comment): diff --git a/lib/remote/jsonrpcconnection-pki.cpp b/lib/remote/jsonrpcconnection-pki.cpp
index 340e12b30..41fea9664 100644
--- a/lib/remote/jsonrpcconnection-pki.cpp
+++ b/lib/remote/jsonrpcconnection-pki.cpp
@@ -113,9 +113,9 @@ Value RequestCertificateHandler(const MessageOrigin::Ptr& origin, const Dictiona
}
if (requestorCA && !IsCaUptodate(requestorCA)) {
- int days;
+ int days, secs;
- if (ASN1_TIME_diff(&days, nullptr, X509_get_notAfter(requestorCA), X509_get_notAfter(cacert.get())) && days > 0) {
+ if (ASN1_TIME_diff(&days, &secs, X509_get_notAfter(requestorCA), X509_get_notAfter(cacert.get())) && (days > 0 || secs > 0)) {
uptodate = false;
}
} With that and the other insights, things now behave more in way where I understand what's happening: it takes about a minute per cluster level for the certificates to propagate (master -> satellite -> 2nd-level-satellite -> agent takes 3 minutes), this is probably influenced by me running this with docker compose where I start everything at the same time, so the timers are pretty much synchronized. |
Other test I did:
|
otherwise it would expire.
fixes #9890
With this, the Icinga 2 node owning the CA periodically renews it locally. (Pretty much like 3753f86, but for the CA this time.)
Sooner or laterthat local CA cert will be(already)propagated through pki::UpdateCertificate.This way the root certificate never expires on the whole cluster.
TODO