New OKD 4.7 install problem #839

tman24 · 2021-08-26T18:12:42Z

tman24
Aug 26, 2021

OKD first-timer here. I'm following a bare-metal 4.7 install guide and have reached the point where the bootstrap node is up and running, and I'm now setting up the control plane nodes. Bootstrap and control plane nodes are all CoreOS 34.20210808.3.0; the services node is CentOS 8.4.

I've created the ignition files and am now deploying the main nodes. All nodes get an IP address via static DHCP reservations. Bootstrap deployed OK, but the control plane nodes are giving me a problem. I've edited the CoreOS boot line with the correct parameters on control-plane-1, and it downloads and installs the CoreOS image OK (so networking is working, and during this phase it responds to ICMP correctly on its statically reserved IP address). After writing the image to disk, though, it reboots, and from then on it behaves as if it has no network. I just get a looping message saying it can't connect to https://apt-int....:22623/config/master. The URL works fine (tested), but I no longer get any ICMP response from the node, so either the network stack didn't come up or there's a bug or problem somewhere. Beyond that, I have no idea.

It's a bit of a showstopper right now, and any advice is appreciated.

rvanderp3 · 2021-08-26T18:32:33Z

rvanderp3
Aug 26, 2021

Hi @tman24 would you mind providing installation debug logs?

0 replies

vrutkovs · 2021-08-26T22:49:06Z

vrutkovs
Aug 26, 2021
Maintainer

> Bootstrap and control plane nodes are all CoreOS 34.20210808.3.0

We've noticed that recent FCOS builds are giving us trouble on boot. Could you try 34.20210626.3.1? See https://getfedora.org/en/coreos/download?tab=metal_virtualized&stream=stable, and replace the version in the download URL.
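The version swap suggested here can be sketched as a plain string substitution. The URL in this example is an assumption based on the usual Fedora CoreOS build-server layout; it is not quoted anywhere in this thread, so check it against the download page before relying on it.

```python
# Sketch: swap the FCOS version inside a download URL.
# ASSUMPTION: the URL layout below is illustrative and not taken from this thread.
OLD_VERSION = "34.20210808.3.0"  # version reported as failing after first boot
NEW_VERSION = "34.20210626.3.1"  # version suggested by the maintainer


def swap_version(url: str, old: str, new: str) -> str:
    """Return url with every occurrence of old replaced by new."""
    if old not in url:
        raise ValueError(f"version {old!r} not found in URL")
    return url.replace(old, new)


example_url = (
    "https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/"
    f"{OLD_VERSION}/x86_64/fedora-coreos-{OLD_VERSION}-metal.x86_64.raw.xz"
)

if __name__ == "__main__":
    print(swap_version(example_url, OLD_VERSION, NEW_VERSION))
```

The version appears twice in a URL of this shape (directory and file name), which is why the sketch replaces every occurrence rather than only the first.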

1 reply

tman24 Aug 27, 2021
Author

Thanks - I'll give that a try and report back.

tman24 · 2021-08-27T16:30:22Z

tman24
Aug 27, 2021
Author

Just to confirm: reverting to 34.20210626.3.1 fixed the problem. After the first boot, the network stack came up OK and configuration continued. As this working version is part of the current stable stream, there's probably not much reason to change at this time.

0 replies

tman24 · 2021-09-02T14:30:05Z

tman24
Sep 2, 2021
Author

Well, the problems go on. I'm following the inext.io install guide, and while the bootstrap and master nodes have deployed, bootstrap doesn't seem to be coming up properly, which is preventing the workers from deploying too!

journalctl -b -f -u release-image.service -u bootkube.service

Sep 02 15:24:40 bootstrap bootkube.sh[386903]: Starting temporary bootstrap control plane...
Sep 02 15:24:40 bootstrap bootkube.sh[386903]: Error: open /etc/kubernetes/manifests/bootstrap-pod.yaml: file exists
Sep 02 15:24:40 bootstrap bootkube.sh[386903]: Tearing down temporary bootstrap control plane...
Sep 02 15:24:40 bootstrap bootkube.sh[386903]: Error: open /etc/kubernetes/manifests/bootstrap-pod.yaml: file exists
Sep 02 15:24:40 bootstrap podman[386903]: 2021-09-02 15:24:40.249971281 +0100 BST m=+0.498884428 container died f6f26fcc8cea08d130a012523ac5bcc916ca6c045a07298cb6b9146f232ff93a (image=quay.io/openshift/okd-content@sha256:99b54b2c35145c3084d773acabbe0ec94425189ae215c639b87055b55625e066, name=elegant_hugle)
Sep 02 15:24:40 bootstrap podman[386903]: 2021-09-02 15:24:40.285875736 +0100 BST m=+0.534788810 container remove f6f26fcc8cea08d130a012523ac5bcc916ca6c045a07298cb6b9146f232ff93a (image=quay.io/openshift/okd-content@sha256:99b54b2c35145c3084d773acabbe0ec94425189ae215c639b87055b55625e066, name=elegant_hugle, io.openshift.tags=base rhel8, distribution-scope=public, io.openshift.build.commit.date=, summary=Provides the latest release of Red Hat Universal Base Image 8., com.redhat.license_terms=https://www.redhat.com/agreements, maintainer=Red Hat, Inc., io.openshift.build.commit.message=, architecture=x86_64, io.buildah.version=1.16.4, io.openshift.build.commit.author=, com.redhat.component=openshift-enterprise-base-container, com.redhat.build-host=cpt-1004.osbs.prod.upshift.rdu2.redhat.com, io.openshift.release.operator=true, vcs-type=git, vcs-url=https://github.com/openshift/cluster-bootstrap, io.openshift.build.source-context-dir=, build-date=2021-01-09T00:40:49.580557, name=openshift/ose-base, io.openshift.expose-services=, release=202101090039.11723, io.k8s.display-name=OpenShift Base, io.openshift.build.source-location=https://github.com/openshift/cluster-bootstrap, url=https://access.redhat.com/containers/#/registry.access.redhat.com/openshift/ose-base/images/v4.0-202101090039.11723, version=v4.0, description=The Universal Base Image is designed and engineered to be the base layer for all of your containerized applications, middleware and utilities. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly., vendor=Red Hat, Inc., io.openshift.build.namespace=, io.openshift.build.commit.ref=master, io.openshift.build.commit.id=6665cae3374c18d466f11c9e0b8e41a61fcb0819, vcs-ref=6665cae3374c18d466f11c9e0b8e41a61fcb0819, io.k8s.description=This is the base image from which all OpenShift images inherit., io.openshift.build.name=)
Sep 02 15:24:40 bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
Sep 02 15:24:40 bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'.
Sep 02 15:24:40 bootstrap systemd[1]: bootkube.service: Consumed 13.700s CPU time.

Yes, bootstrap-pod.yaml does exist in the specified location. Should it not? Everything on the services node checks out. The install-config.yaml is pretty simple:

apiVersion: v1
baseDomain: k8s.lan
metadata:
  name: lab

compute:
- hyperthreading: Enabled
  name: worker
  replicas: 0

controlPlane:
  hyperthreading: Enabled
  name: master
  replicas: 3

networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16

platform:
  none: {}

fips: false

pullSecret: '{"auths":{"fake":{"auth":"aWQ6cGFzcwo="}}}'
sshKey: 'ssh-rsa MYSECRET'
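As a sanity check on the networking values in this config (not a diagnosis of the bootstrap failure), the arithmetic behind cidr and hostPrefix can be sketched with Python's standard ipaddress module. The reading of hostPrefix as the per-node pod subnet size follows the usual OKD meaning of that field and is an assumption here, since the thread does not spell it out.

```python
import ipaddress

# Values copied from the install-config.yaml above.
cluster_network = ipaddress.ip_network("10.128.0.0/14")
service_network = ipaddress.ip_network("172.30.0.0/16")
host_prefix = 23

# ASSUMPTION: each node is handed one /23 slice of the cluster network for its pods.
node_subnets = 2 ** (host_prefix - cluster_network.prefixlen)
addresses_per_node = 2 ** (cluster_network.max_prefixlen - host_prefix)

# The pod range and the service range should not overlap.
ranges_overlap = cluster_network.overlaps(service_network)

print(node_subnets, addresses_per_node, ranges_overlap)  # 512 512 False
```

With these values the cluster network splits into 512 per-node subnets of 512 addresses each, and the two ranges are disjoint, so nothing in this part of the config looks suspicious.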

openshift-install 4.7.0-0.okd-2021-08-22-163618
built from commit 156120c9d62ab5d217e573a5b49776f88d6e4ebf
release image quay.io/openshift/okd@sha256:6e8ef1f76a56819a96cc70635487032f5d2b64822f26e16da34304d3fe792a17

Services node is CentOS 8.4, all others are FCOS 34. Initial deployment should not be this hard!

Thanks

0 replies

tman24 · 2021-09-02T17:41:10Z

tman24
Sep 2, 2021
Author

I've just rebuilt bootstrap again, and it seemed to get a bit further, but it is now stuck in this loop, which occurs directly after cluster-bootstrap is called.

Sep 02 17:20:30 bootstrap bootkube.sh[19404]: [#4011] failed to fetch discovery: Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
Sep 02 17:20:30 bootstrap bootkube.sh[19404]: [#4012] failed to fetch discovery: Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused

Also, I'm not sure whether this cert warning is part of the problem:

Sep 02 18:39:30 bootstrap bootkube.sh[5317]: [#691] failed to fetch discovery: Get "https://localhost:6443/api?timeout=32s": x509: certificate has expired or is not yet valid: current time 2021-09-02T17:39:30Z is after 2021-09-01T10:53:12Z
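The two timestamps in that x509 message show how long the certificate had been expired when the request was made. A minimal sketch that extracts them from the log text, assuming the message always has the "current time ... is after ..." shape seen here:

```python
import re
from datetime import datetime, timezone

# Message text copied from the log line above.
LOG_LINE = (
    'failed to fetch discovery: Get "https://localhost:6443/api?timeout=32s": '
    "x509: certificate has expired or is not yet valid: "
    "current time 2021-09-02T17:39:30Z is after 2021-09-01T10:53:12Z"
)

PATTERN = re.compile(r"current time (\S+) is after (\S+)")


def expired_for(line: str):
    """Return how long the certificate had been expired, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    now, not_after = (
        datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        for value in match.groups()
    )
    return now - not_after


print(expired_for(LOG_LINE))  # 1 day, 6:46:18
```

So the certificate had already been invalid for more than a day at that point, which points at stale credentials rather than a clock that is a few minutes off.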

For some reason, nothing on bootstrap is listening on 6443, although I do have some containers running:

Running  kube-apiserver                  6  2ffbcdf1f3c6f
Exited   kube-controller-manager         6  9b1e16b6e4092
Exited   kube-apiserver                  5  2ffbcdf1f3c6f
Running  cluster-policy-controller       0  9b1e16b6e4092
Running  kube-apiserver-insecure-readyz  0  2ffbcdf1f3c6f
Exited   setup                           0  2ffbcdf1f3c6f
Running  kube-scheduler                  0  b4eb415b91610
Running  cluster-version-operator        0  5cf7135329bfb
Running  cloud-credential-operator       0  4cdd36c496fec
Running  machine-config-server           0  871c18bb7da97
Exited   machine-config-controller       0  871c18bb7da97
Running  etcd                            0  f5e7cb7c81d09

1 reply

geofreyr Sep 7, 2021

Interesting, I am having similar issues on CoreOS 34 running under virtlib. It's strange that it's referring to localhost:6443; I believe that ought to be proxy:6443.

I think this CoreOS version has an issue with the kubelet deployment to the bootstrap node, or something similar.

tman24 · 2021-09-07T11:37:32Z

tman24
Sep 7, 2021
Author

Any feedback? I'm pretty stuck right now. Why would the bootstrap node be trying to talk to itself on 6443 when it isn't even listening on that port? What could cause the 'connection refused' message?
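In general, 'connection refused' means the host was reachable but nothing accepted connections on that port, which matches the observation that nothing is listening on 6443. A minimal sketch for checking this from the bootstrap node itself; the function name is hypothetical, and the host and port come from the log lines earlier in the thread:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # ConnectionRefusedError (nothing listening) and timeouts both land here.
        return False


if __name__ == "__main__":
    # 6443 is the API port from the bootkube log; run this on the bootstrap node.
    print(port_is_open("localhost", 6443))
```

If this keeps returning False while the kube-apiserver container shows as running with a growing restart count, the container's own logs are probably the next place to look.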

0 replies

disposab1e · 2021-09-07T11:43:10Z

disposab1e
Sep 7, 2021

FCOS: 34.20210626.3.1
OKD: 4.7.0-0.okd-2021-08-22-163618
CentOS: 8.4
UPI with KVM and static IPs, no issues.

0 replies

disposab1e · 2021-09-07T11:48:47Z

disposab1e
Sep 7, 2021

Are you working with old ignition files? The certs expire after 24 hours.

1 reply

tman24 Sep 7, 2021
Author

Thanks. If the ignition file certs expire after 24 hours, I'll re-create them and see how I go. It doesn't leave much time to test things, but it's useful to know.
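Given the 24-hour window, one way to avoid reusing stale files is to check their age before booting any node. This is a rough sketch that assumes a file's modification time is a good proxy for when its certificates were generated; the directory name is hypothetical.

```python
import time
from pathlib import Path

MAX_AGE_SECONDS = 24 * 60 * 60  # the 24-hour window mentioned above


def stale_ignition_files(directory, now=None):
    """Return names of *.ign files whose modification time is older than 24 hours."""
    now = time.time() if now is None else now
    return sorted(
        path.name
        for path in Path(directory).glob("*.ign")
        if now - path.stat().st_mtime > MAX_AGE_SECONDS
    )


if __name__ == "__main__":
    # ASSUMPTION: "install_dir" is a hypothetical path; use your own install directory.
    print(stale_ignition_files("install_dir"))
```

An empty list means every ignition file found is less than a day old; any name that is printed is a candidate for regeneration before the next attempt.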
