Kernel freeze when running workloads on OKD #957
Comments
Hi, we have exactly the same problem here when updating to 4.7.0-0.okd-2021-08-22-163618.
hey all - I've created some FCOS artifacts with a dev kernel build that reverts the kernel commit we think is the problem, and posted them over in the other kernel issue. Not sure how easy it is with OKD to switch out the base media, but maybe you can try with those artifacts or just use rpm-ostree to override-replace the kernel with something like:
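(For illustration, a minimal sketch of what such an override could look like; the RPM file names below are placeholders rather than the actual dev-build artifacts.)

```shell
# on the affected node, with the dev-build kernel RPMs downloaded locally
sudo rpm-ostree override replace \
    ./kernel-5.13.9-201.test.fc34.x86_64.rpm \
    ./kernel-core-5.13.9-201.test.fc34.x86_64.rpm \
    ./kernel-modules-5.13.9-201.test.fc34.x86_64.rpm
sudo systemctl reboot
```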
Right, that seems like the much easier path.
Hello,
hmm - would definitely be nice to get more output after the `cut here` line.
Yes, definitely! That's quite strange.
We are also experiencing this issue on OKD-4.7.0-0.okd-2021-08-22-163618. Is there any possible workaround?
We have downgraded the kernel to the previous OKD version (4.7.0-0.okd-2021-08-07-063045):
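A sketch of how such a downgrade can be done with rpm-ostree (kernel version taken from the follow-up comment below; the Koji URLs assume the usual packages/<name>/<version>/<release>/<arch> layout):

```shell
# fetch the previous Fedora 34 kernel packages on each affected node
for p in kernel kernel-core kernel-modules; do
  curl -LO "https://kojipkgs.fedoraproject.org/packages/kernel/5.12.19/300.fc34/x86_64/${p}-5.12.19-300.fc34.x86_64.rpm"
done

# pin the older kernel in place of the one shipped in the base image, then reboot
sudo rpm-ostree override replace ./kernel{,-core,-modules}-5.12.19-300.fc34.x86_64.rpm
sudo systemctl reboot

# later, to return to the kernel shipped with the OS image
sudo rpm-ostree override reset kernel kernel-core kernel-modules
```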
@depouill Thanks, did you experience any issues with that downgrade so far, and have you taken any steps to disable automatic updates (pausing machineconfigpools etc.)?
Nodes have been stable since yesterday and, with the rpm-ostree override, machineconfig is ok (no need to pause the MCP). Cluster is green.
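For anyone who would rather pause the MachineConfigPools anyway while carrying a manual override, a sketch (assumes cluster-admin access with oc; worker is the usual pool name):

```shell
# pause machine-config rollouts for the worker pool
oc patch machineconfigpool/worker --type merge --patch '{"spec":{"paused":true}}'

# resume rollouts once the override is no longer needed
oc patch machineconfigpool/worker --type merge --patch '{"spec":{"paused":false}}'
```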
@depouill Is the issue still occurring on your nodes?
No, since we downgraded to 5.12.19-300, the cluster works fine (for one week now).
Thanks for the info; so we should downgrade too. We are also facing this issue on OpenStack nodes after upgrading.
Updated to the latest 4.8 (4.8.0-0.okd-2021-10-10-030117) with kernel 5.13.13-200.fc34.x86_64 and the issue is still present (unfortunately, I still don't get any messages on the console).
Hey, thanks to the instructions from depouill, we were able to temporarily mitigate the issue with a kernel downgrade. For a more permanent fix, we investigated how we could build our own OKD node images. Unfortunately, this was quite complicated, and I documented the required steps here:
We upgraded a few days ago. Sometimes it does show BUG, stuck tasks, RCU stalls, etc. Sometimes it just stops. This is bare metal, on AMD EPYC 7502P. I am attaching some logs, including kernel output, from a few machines that experienced the issue. okd-4.8.0_linux-5.13.13-200_issues.tar.gz We will downgrade the kernel.
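In case it helps others gather similar data, a sketch of pulling kernel messages from the boot before a freeze on FCOS (only useful if anything made it to the persistent journal before the hang):

```shell
# list recorded boots, then dump kernel messages from the previous boot
journalctl --list-boots
journalctl -k -b -1 --no-pager > previous-boot-kernel.log
```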
As @baryluk mentioned, we've downgraded the kernel as well.
Updated OKD 4 to the version released yesterday (4.8.0-0.okd-2021-10-24-061736).
Still have this issue on the current OKD version. @depouill's fix doesn't work for me anymore, as my attempt to downgrade the kernel with rpm-ostree override replace fails with an error.
This is an rpm-ostree bug, fixed in v2021.12.
Yeah, rpm-ostree is really strict about this. Doing a base package replacement is not the same as removing a base package and overlaying another.
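A sketch of the two different operations (package names and versions are illustrative only):

```shell
# base package replacement: swap the base kernel for a different build of the same package
sudo rpm-ostree override replace ./kernel-5.12.19-300.fc34.x86_64.rpm

# remove-and-overlay: drop the base package and layer a differently named one instead
sudo rpm-ostree override remove kernel --install ./kernel-custom-5.12.19-300.fc34.x86_64.rpm
```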
For what it's worth, this issue is not limited to OKD/OpenShift. We're having exactly the same problem with upstream Kubernetes (v1.21.6). We deploy the cluster with kubespray, and every 1-2 days the server just crashes. We've put absolutely no pods on the server (except for the DaemonSet pods that all nodes must host, like Calico). There is also no log output when this happens. The server just "stops". Switching to the
Hey @scrayos (or anyone else). We would be overjoyed if someone could give us a reproducer for this (step-by-step instructions would be great). It sounds like you are saying you're not even deploying any applications, just running Kubernetes, and it's crashing for you?
@dustymabe Exactly. I only included the node into the cluster and it kept crashing every 1-2 days. These were the only pods on the node: just networking and the Prometheus node exporter. There was absolutely nothing else deployed on the node. The node was set up with kubespray. So essentially I did this:
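In broad strokes (placeholder inventory names, not necessarily the exact commands):

```shell
# standard kubespray flow against an existing FCOS node
git clone https://github.com/kubernetes-sigs/kubespray.git && cd kubespray
pip install -r requirements.txt
cp -r inventory/sample inventory/mycluster
# add the node to inventory/mycluster/hosts.yaml, then run the cluster playbook
ansible-playbook -i inventory/mycluster/hosts.yaml --become cluster.yml
```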
And that's about it. Then I just left the server idle, and it crashed three times in a row, each after 1-2 days. Always with abruptly ending logs (sorry, I only made screenshots). To summarize:
I hope any of this helps.
@scrayos can you please provide full HW specs? Or at least CPU, motherboard, RAM, and perhaps disks. For us, about 1 node goes down per day (AMD EPYC 7502P, Asus KRPA-U16, 512GB RAM, 2 x SAMSUNG MZQLW960HMJP-00003 960GB NVMe disks). The workload is mixed (Java, Python, Spark, to name a few). On a test node (VM) we were not able to reproduce this, but I'm trying to push some Java-based benchmark there soon in the hope of getting it to crash.
@aneagoe Sure! It's this Hetzner server with upgraded ECC RAM.
Looking at @baryluk's logs, this may be some race related to a side-effect of accessing
It seems to affect mostly AMD CPUs. It could be because of a vendor-specific path in the kernel, or just because those CPUs usually have a large number of cores.
I'm now running mixed Java workloads and have also left them running on the node.
@scrayos - unfortunately I don't have access to Hetzner. Do you think there's any chance this would reproduce with the bare metal instances from AWS? Also, you've given a lot of detail about your Ignition config (thanks!). Any chance you could share it (or preferably the Butane version of it) with anything redacted that you didn't want to share?
@dustymabe - Sure! I've actually got a Butane version of it. The butane file:

```yaml
variant: 'fcos'
version: '1.4.0'
boot_device:
# configure boot device mirroring for additional fault tolerance and robustness
mirror:
devices:
- '/dev/nvme0n1'
- '/dev/nvme1n1'
storage:
disks:
# create and partition both of the drives identically
- device: '/dev/nvme0n1'
partitions:
- label: 'root-1'
# set the size to twice the recommended minimum
size_mib: 16384
start_mib: 0
- label: 'var-1'
- device: '/dev/nvme1n1'
partitions:
- label: 'root-2'
# set the size to twice the recommended minimum
size_mib: 16384
start_mib: 0
- label: 'var-2'
raid:
# add both of the var drives to a common raid for additional fault tolerance and robustness
- name: 'md-var'
level: 'raid1'
devices:
- '/dev/disk/by-partlabel/var-1'
- '/dev/disk/by-partlabel/var-2'
filesystems:
# mount /var with the raid instead of individual hard drives
- path: '/var'
device: '/dev/md/md-var'
format: 'xfs'
wipe_filesystem: true
with_mount_unit: true
files:
# configure strict defaults for ssh connections and negotiation algorithms
- path: '/etc/ssh/sshd_config'
mode: 0600
overwrite: true
contents:
inline: |
# chroot sftp into its area and perform additional logging
Subsystem sftp internal-sftp -f AUTHPRIV -l INFO
# keep connections active
ClientAliveInterval 30
ClientAliveCountMax 2
# disable unecessary rsh-support
UseDNS no
# do not let root in - core is much more uncommon
PermitRootLogin no
AllowUsers core
# log key fingerprint on login, so we know who did what
LogLevel VERBOSE
# set log facility to authpriv so log access needs elevated permissions
SysLogFacility AUTHPRIV
# re-negotiate session key after either 500mb or one hour
ReKeyLimit 500M 1h
# only allow public-keys
PubKeyAuthentication yes
PasswordAuthentication no
ChallengeResponseAuthentication no
AuthenticationMethods publickey
# set stricter login limits
LoginGraceTime 30
MaxAuthTries 2
MaxSessions 5
MaxStartups 10:30:100
# adjust algorithmus
Ciphers [email protected],[email protected],[email protected],aes128-ctr,aes192-ctr,aes256-ctr
HostKeyAlgorithms [email protected],[email protected],[email protected],[email protected],ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,ssh-ed25519,rsa-sha2-512,rsa-sha2-256
KexAlgorithms curve25519-sha256,diffie-hellman-group18-sha512,diffie-hellman-group16-sha512,diffie-hellman-group14-sha256,[email protected],diffie-hellman-group-exchange-sha256
MACs [email protected],[email protected],[email protected]
# adjust pluggable authentication modules
# pam sends last login + coreos message
UsePAM yes
PrintLastLog no
PrintMotd no
# add ignition and afterburn keys to the allowed directories
AuthorizedKeysFile .ssh/authorized_keys .ssh/authorized_keys.d/ignition .ssh/authorized_keys.d/afterburn
# include the drop-in configurations
Include /etc/ssh/sshd_config.d/*.conf
# clear default crypto policy as we define it in ssh config manually
- path: '/etc/sysconfig/sshd'
mode: 0640
overwrite: true
contents:
inline: |
CRYPTO_POLICY=
# perform updates only in allowed time frames, so we don't have surprise downtimes
- path: '/etc/zincati/config.d/55-updates-strategy.toml'
mode: 0644
contents:
inline: |
[updates]
strategy = "periodic"
[[updates.periodic.window]]
days = [ "Fri" ]
start_time = "02:00"
length_minutes = 60
# disable SysRq keys, so they won't be accidentally pressed (and we cannot use them anyways)
- path: '/etc/sysctl.d/90-sysrq.conf'
contents:
inline: |
kernel.sysrq = 0
# enable reverse path filtering for ipv4. necessary for calico (kubespray)
- path: '/etc/sysctl.d/reverse-path-filter.conf'
contents:
inline: |
net.ipv4.conf.all.rp_filter=1
directories:
# delete all contents of the default sshd drop-ins and overwrite folder
- path: '/etc/ssh/sshd_config.d'
overwrite: true
mode: 0700
user:
name: 'root'
group:
name: 'root'
systemd:
units:
# disable docker to use cri-o (see https://github.com/coreos/fedora-coreos-tracker/issues/229)
- name: 'docker.service'
mask: true
passwd:
users:
# configure authentication
- name: 'core'
ssh_authorized_keys:
        - '{myPublicKey}'
```

I can't say anything about the bare metal instances from AWS, as I've never used AWS before. But it probably would reproduce there: I doubt everyone here uses Hetzner, yet we all see the same problem, so it's unlikely to be related to Hetzner's hardware or setup.
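In case it's useful, the config above is compiled into an Ignition file with Butane before provisioning; a sketch (file names are placeholders):

```shell
# compile the Butane config into an Ignition config
butane --pretty --strict worker.bu --output worker.ign
```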
@scrayos The issue seems to have been fixed; see this comment: #940 (comment). Would be great if you could also test this and confirm. ATM I don't have any spare bare metal to try it on :(
I've now re-ignited the node with the newest kernel (5.14.14-200.fc34.x86_64) and FCOS version (34.20211031.3.0). We'll know in a few days whether the server is stable now. 😆
I updated OKD to version 4.8.0-0.okd-2021-11-14-052418, which ships with kernel 5.14.14-200.fc34.x86_64.
The node has been running for roughly 3 days now and there has been no crash so far. Seems like it's fixed for me as well! 🎉
Thanks for the feedback. I'll close it then.
Thanks all for collaborating and helping us find when this issue was fixed. I wish we could narrow it down to the particular kernel commit that fixed the problem, but the fact that it's fixed in 5.14.14 is what matters most here.
The issue is still present in the 5.14.9-200.fc34.x86_64 kernel for OKD 4.8.
@gialloguitar that's expected, see #957 (comment). Kernel version 5.14.9 is older than the 5.14.14 kernel that was reported fixed here.
Indeed, I can confirm that.
Describe the bug
When running OKD, which uses Fedora CoreOS 34 on the nodes, the kernel sometimes freezes.
Original report on OKD bug tracker: okd-project/okd#864
Reproduction steps
Steps to reproduce the behavior:
Expected behavior
System doesn't freeze
Actual behavior
The node VM consumes 100% CPU and doesn't respond to ping or to input on the console.
Unfortunately, the console doesn't show the full kernel panic message, it stops after the line:
------------[ cut here ]------------
I tried to retrieve logs using the netconsole kernel module, hoping I could get more information, but the result is the same.
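For reference, this is roughly the netconsole setup I tried (IP addresses, interface, and MAC are placeholders):

```shell
# on the crashing node: stream kernel messages over UDP to a log host
sudo modprobe netconsole netconsole=6665@10.0.0.5/eth0,6666@10.0.0.10/aa:bb:cc:dd:ee:ff

# on the log host (nmap-ncat syntax): capture whatever arrives
nc -u -l 6666 | tee node-netconsole.log
```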
Do you have a suggestion for how to get more data from the panic, if possible?
System details
Kernel 5.13.4-200.fc34.x86_64 #1 SMP Tue Jul 20 20:27:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Ignition config
Since it's handled by OKD / machine operator, it's massive and might be difficult to sanitize.