fix: add note for drain stuck when upgrade from v1.4.0 to v1.4.1
Signed-off-by: Webber Huang <[email protected]>
WebberHuang1118 committed Jan 15, 2025
1 parent 534b8c3 commit 4c287eb
Showing 14 changed files with 164 additions and 12 deletions.
2 changes: 1 addition & 1 deletion docs/upgrade/v1-1-2-to-v1-2-0.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 7
+sidebar_position: 8
 sidebar_label: Upgrade from v1.1.2 to v1.2.0 (not recommended)
 title: "Upgrade from v1.1.2 to v1.2.0 (not recommended)"
 ---
2 changes: 1 addition & 1 deletion docs/upgrade/v1-2-0-to-v1-2-1.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 6
+sidebar_position: 7
 sidebar_label: Upgrade from v1.1.2/v1.1.3/v1.2.0 to v1.2.1
 title: "Upgrade from v1.1.2/v1.1.3/v1.2.0 to v1.2.1"
 ---
2 changes: 1 addition & 1 deletion docs/upgrade/v1-2-1-to-v1-2-2.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 5
+sidebar_position: 6
 sidebar_label: Upgrade from v1.2.1 to v1.2.2
 title: "Upgrade from v1.2.1 to v1.2.2"
 ---
2 changes: 1 addition & 1 deletion docs/upgrade/v1-2-2-to-v1-3-1.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 4
+sidebar_position: 5
 sidebar_label: Upgrade from v1.2.2/v1.3.0 to v1.3.1
 title: "Upgrade from v1.2.2/v1.3.0 to v1.3.1"
 ---
2 changes: 1 addition & 1 deletion docs/upgrade/v1-3-1-to-v1-3-2.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 3
+sidebar_position: 4
 sidebar_label: Upgrade from v1.3.1 to v1.3.2
 title: "Upgrade from v1.3.1 to v1.3.2"
 ---
2 changes: 1 addition & 1 deletion docs/upgrade/v1-3-2-to-v1-4-0.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 2
+sidebar_position: 3
 sidebar_label: Upgrade from v1.3.2 to v1.4.0
 title: "Upgrade from v1.3.2 to v1.4.0"
 ---
76 changes: 76 additions & 0 deletions docs/upgrade/v1-4-0-to-v1-4-1.md
@@ -0,0 +1,76 @@
---
sidebar_position: 2
sidebar_label: Upgrade from v1.4.0 to v1.4.1
title: "Upgrade from v1.4.0 to v1.4.1"
---

<head>
<link rel="canonical" href="https://docs.harvesterhci.io/v1.4/upgrade/v1-4-0-to-v1-4-1"/>
</head>

## General information

An **Upgrade** button appears on the **Dashboard** screen whenever a new Harvester version that you can upgrade to becomes available. For more information, see [Start an upgrade](./automatic.md#start-an-upgrade).

For air-gapped environments, see [Prepare an air-gapped upgrade](./automatic.md#prepare-an-air-gapped-upgrade).


## Known issues

---

### 1. An upgrade is stuck in the Pre-drained state due to Longhorn orphan engine(s)

You might see that an upgrade is stuck in the **Pre-drained** state:

![](/img/v1.2/upgrade/known_issues/3730-stuck.png)

At this stage, Kubernetes is supposed to drain the workloads on the node, but several conditions can cause the process to stall.

Orphaned engine processes in the Longhorn instance manager can cause this issue. To check whether that is the case, perform the following steps (a consolidated script sketch follows the list):

1. Assume the stuck node is `harvester-node-1`.
1. Check the `instance-manager` pod's name on the stuck node:

```
$ kubectl get pods -n longhorn-system --field-selector spec.nodeName=harvester-node-1 | grep instance-manager
instance-manager-d80e13f520e7b952f4b7593fc1883e2a 1/1 Running 0 3d8h
```
The output above shows that the `instance-manager-d80e13f520e7b952f4b7593fc1883e2a` pod is on the node.
1. Check the Longhorn manager logs and verify that the `instance-manager` pod can't be drained because of the engine `pvc-9ae0e9a5-a630-4f0c-98cc-b14893c74f9e-e-0`:
```
$ kubectl -n longhorn-system logs daemonsets/longhorn-manager
...
time="2025-01-14T00:00:01Z" level=info msg="Node instance-manager-d80e13f520e7b952f4b7593fc1883e2a is marked unschedulable but removing harvester-node-1 PDB is blocked: some volumes are still attached InstanceEngines count 1 pvc-9ae0e9a5-a630-4f0c-98cc-b14893c74f9e-e-0" func="controller.(*InstanceManagerController).syncInstanceManagerPDB" file="instance_manager_controller.go:823" controller=longhorn-instance-manager node=harvester-node-1
```
1. Run the following command to check whether the engine is still running on the stuck node:
```
$ kubectl -n longhorn-system get engines.longhorn.io pvc-9ae0e9a5-a630-4f0c-98cc-b14893c74f9e-e-0 -o jsonpath='{"Current state: "}{.status.currentState}{"\nNode ID: "}{.spec.nodeID}{"\n"}'
Current state: stopped
Node ID:
```
If the output shows that the engine is not running, you can conclude that this issue has occurred.
1. Before applying the workaround, verify that all attached volumes are healthy:
```
kubectl get volumes -n longhorn-system -o yaml | yq '.items[] | select(.status.state == "attached") | .status.robustness'
```
Every value in the output should be `healthy`. If this is not the case, you might need to uncordon nodes so that the volumes can become healthy again.
1. Remove the instance manager's PodDisruptionBudget (PDB):
```
kubectl delete pdb instance-manager-d80e13f520e7b952f4b7593fc1883e2a -n longhorn-system
```
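
The steps above can be strung together into a single diagnostic pass. The following shell sketch is illustrative only, not an official tool: the node name, engine name, and PDB name are the examples used on this page, so replace them with the values from your own cluster before running it.

```
#!/bin/sh
# Minimal sketch: consolidates the diagnostic steps above.
# All names below are examples from this page; adjust them to your cluster.
NODE="harvester-node-1"

# Find the instance-manager pod on the stuck node.
kubectl get pods -n longhorn-system --field-selector spec.nodeName="$NODE" | grep instance-manager

# Look for the log line that names the engine blocking PDB removal.
kubectl -n longhorn-system logs daemonsets/longhorn-manager | grep "PDB is blocked"

# Confirm that the engine named in that log line is not running.
ENGINE="pvc-9ae0e9a5-a630-4f0c-98cc-b14893c74f9e-e-0"
kubectl -n longhorn-system get engines.longhorn.io "$ENGINE" \
  -o jsonpath='{"Current state: "}{.status.currentState}{"\nNode ID: "}{.spec.nodeID}{"\n"}'

# Every attached volume must report "healthy" before you continue.
kubectl get volumes -n longhorn-system -o yaml \
  | yq '.items[] | select(.status.state == "attached") | .status.robustness'

# Only after all checks pass, remove the blocking PDB
# (its name matches the instance-manager pod name):
# kubectl delete pdb instance-manager-d80e13f520e7b952f4b7593fc1883e2a -n longhorn-system
```
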
- Related issues:
- [[BUG] v1.4.0 -> v1.4.1-rc1 upgrade stuck in Pre-drained and the node stay in Cordoned](https://github.com/harvester/harvester/issues/7366)
- [[IMPROVEMENT] Cleanup orphaned volume runtime resources if the resources already deleted](https://github.com/longhorn/longhorn/issues/6764)
2 changes: 1 addition & 1 deletion versioned_docs/version-v1.4/upgrade/v1-1-2-to-v1-2-0.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 7
+sidebar_position: 8
 sidebar_label: Upgrade from v1.1.2 to v1.2.0 (not recommended)
 title: "Upgrade from v1.1.2 to v1.2.0 (not recommended)"
 ---
2 changes: 1 addition & 1 deletion versioned_docs/version-v1.4/upgrade/v1-2-0-to-v1-2-1.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 6
+sidebar_position: 7
 sidebar_label: Upgrade from v1.1.2/v1.1.3/v1.2.0 to v1.2.1
 title: "Upgrade from v1.1.2/v1.1.3/v1.2.0 to v1.2.1"
 ---
2 changes: 1 addition & 1 deletion versioned_docs/version-v1.4/upgrade/v1-2-1-to-v1-2-2.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 5
+sidebar_position: 6
 sidebar_label: Upgrade from v1.2.1 to v1.2.2
 title: "Upgrade from v1.2.1 to v1.2.2"
 ---
2 changes: 1 addition & 1 deletion versioned_docs/version-v1.4/upgrade/v1-2-2-to-v1-3-1.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 4
+sidebar_position: 5
 sidebar_label: Upgrade from v1.2.2/v1.3.0 to v1.3.1
 title: "Upgrade from v1.2.2/v1.3.0 to v1.3.1"
 ---
2 changes: 1 addition & 1 deletion versioned_docs/version-v1.4/upgrade/v1-3-1-to-v1-3-2.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 3
+sidebar_position: 4
 sidebar_label: Upgrade from v1.3.1 to v1.3.2
 title: "Upgrade from v1.3.1 to v1.3.2"
 ---
2 changes: 1 addition & 1 deletion versioned_docs/version-v1.4/upgrade/v1-3-2-to-v1-4-0.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 2
+sidebar_position: 3
 sidebar_label: Upgrade from v1.3.2 to v1.4.0
 title: "Upgrade from v1.3.2 to v1.4.0"
 ---
76 changes: 76 additions & 0 deletions versioned_docs/version-v1.4/upgrade/v1-4-0-to-v1-4-1.md
@@ -0,0 +1,76 @@
---
sidebar_position: 2
sidebar_label: Upgrade from v1.4.0 to v1.4.1
title: "Upgrade from v1.4.0 to v1.4.1"
---

<head>
<link rel="canonical" href="https://docs.harvesterhci.io/v1.4/upgrade/v1-4-0-to-v1-4-1"/>
</head>

## General information

An **Upgrade** button appears on the **Dashboard** screen whenever a new Harvester version that you can upgrade to becomes available. For more information, see [Start an upgrade](./automatic.md#start-an-upgrade).

For air-gapped environments, see [Prepare an air-gapped upgrade](./automatic.md#prepare-an-air-gapped-upgrade).


## Known issues

---

### 1. An upgrade is stuck in the Pre-drained state due to Longhorn orphan engine(s)

You might see that an upgrade is stuck in the **Pre-drained** state:

![](/img/v1.2/upgrade/known_issues/3730-stuck.png)

At this stage, Kubernetes is supposed to drain the workloads on the node, but several conditions can cause the process to stall.

Orphaned engine processes in the Longhorn instance manager can cause this issue. To check whether that is the case, perform the following steps (a consolidated script sketch follows the list):

1. Assume the stuck node is `harvester-node-1`.
1. Check the `instance-manager` pod's name on the stuck node:

```
$ kubectl get pods -n longhorn-system --field-selector spec.nodeName=harvester-node-1 | grep instance-manager
instance-manager-d80e13f520e7b952f4b7593fc1883e2a 1/1 Running 0 3d8h
```
The output above shows that the `instance-manager-d80e13f520e7b952f4b7593fc1883e2a` pod is on the node.
1. Check the Longhorn manager logs and verify that the `instance-manager` pod can't be drained because of the engine `pvc-9ae0e9a5-a630-4f0c-98cc-b14893c74f9e-e-0`:
```
$ kubectl -n longhorn-system logs daemonsets/longhorn-manager
...
time="2025-01-14T00:00:01Z" level=info msg="Node instance-manager-d80e13f520e7b952f4b7593fc1883e2a is marked unschedulable but removing harvester-node-1 PDB is blocked: some volumes are still attached InstanceEngines count 1 pvc-9ae0e9a5-a630-4f0c-98cc-b14893c74f9e-e-0" func="controller.(*InstanceManagerController).syncInstanceManagerPDB" file="instance_manager_controller.go:823" controller=longhorn-instance-manager node=harvester-node-1
```
1. Run the following command to check whether the engine is still running on the stuck node:
```
$ kubectl -n longhorn-system get engines.longhorn.io pvc-9ae0e9a5-a630-4f0c-98cc-b14893c74f9e-e-0 -o jsonpath='{"Current state: "}{.status.currentState}{"\nNode ID: "}{.spec.nodeID}{"\n"}'
Current state: stopped
Node ID:
```
If the output shows that the engine is not running, you can conclude that this issue has occurred.
1. Before applying the workaround, verify that all attached volumes are healthy:
```
kubectl get volumes -n longhorn-system -o yaml | yq '.items[] | select(.status.state == "attached") | .status.robustness'
```
Every value in the output should be `healthy`. If this is not the case, you might need to uncordon nodes so that the volumes can become healthy again.
1. Remove the instance manager's PodDisruptionBudget (PDB):
```
kubectl delete pdb instance-manager-d80e13f520e7b952f4b7593fc1883e2a -n longhorn-system
```
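
The steps above can be strung together into a single diagnostic pass. The following shell sketch is illustrative only, not an official tool: the node name, engine name, and PDB name are the examples used on this page, so replace them with the values from your own cluster before running it.

```
#!/bin/sh
# Minimal sketch: consolidates the diagnostic steps above.
# All names below are examples from this page; adjust them to your cluster.
NODE="harvester-node-1"

# Find the instance-manager pod on the stuck node.
kubectl get pods -n longhorn-system --field-selector spec.nodeName="$NODE" | grep instance-manager

# Look for the log line that names the engine blocking PDB removal.
kubectl -n longhorn-system logs daemonsets/longhorn-manager | grep "PDB is blocked"

# Confirm that the engine named in that log line is not running.
ENGINE="pvc-9ae0e9a5-a630-4f0c-98cc-b14893c74f9e-e-0"
kubectl -n longhorn-system get engines.longhorn.io "$ENGINE" \
  -o jsonpath='{"Current state: "}{.status.currentState}{"\nNode ID: "}{.spec.nodeID}{"\n"}'

# Every attached volume must report "healthy" before you continue.
kubectl get volumes -n longhorn-system -o yaml \
  | yq '.items[] | select(.status.state == "attached") | .status.robustness'

# Only after all checks pass, remove the blocking PDB
# (its name matches the instance-manager pod name):
# kubectl delete pdb instance-manager-d80e13f520e7b952f4b7593fc1883e2a -n longhorn-system
```
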
- Related issues:
- [[BUG] v1.4.0 -> v1.4.1-rc1 upgrade stuck in Pre-drained and the node stay in Cordoned](https://github.com/harvester/harvester/issues/7366)
- [[IMPROVEMENT] Cleanup orphaned volume runtime resources if the resources already deleted](https://github.com/longhorn/longhorn/issues/6764)
