fix: add note for drain stuck when upgrade from v1.4.0 to v1.4.1
Signed-off-by: Webber Huang <[email protected]>
WebberHuang1118 committed Jan 15, 2025
1 parent 534b8c3 commit 4c287eb
Showing 14 changed files with 164 additions and 12 deletions.
2 changes: 1 addition & 1 deletion docs/upgrade/v1-1-2-to-v1-2-0.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 7
+sidebar_position: 8
 sidebar_label: Upgrade from v1.1.2 to v1.2.0 (not recommended)
 title: "Upgrade from v1.1.2 to v1.2.0 (not recommended)"
 ---
2 changes: 1 addition & 1 deletion docs/upgrade/v1-2-0-to-v1-2-1.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 6
+sidebar_position: 7
 sidebar_label: Upgrade from v1.1.2/v1.1.3/v1.2.0 to v1.2.1
 title: "Upgrade from v1.1.2/v1.1.3/v1.2.0 to v1.2.1"
 ---
2 changes: 1 addition & 1 deletion docs/upgrade/v1-2-1-to-v1-2-2.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 5
+sidebar_position: 6
 sidebar_label: Upgrade from v1.2.1 to v1.2.2
 title: "Upgrade from v1.2.1 to v1.2.2"
 ---
2 changes: 1 addition & 1 deletion docs/upgrade/v1-2-2-to-v1-3-1.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 4
+sidebar_position: 5
 sidebar_label: Upgrade from v1.2.2/v1.3.0 to v1.3.1
 title: "Upgrade from v1.2.2/v1.3.0 to v1.3.1"
 ---
2 changes: 1 addition & 1 deletion docs/upgrade/v1-3-1-to-v1-3-2.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 3
+sidebar_position: 4
 sidebar_label: Upgrade from v1.3.1 to v1.3.2
 title: "Upgrade from v1.3.1 to v1.3.2"
 ---
2 changes: 1 addition & 1 deletion docs/upgrade/v1-3-2-to-v1-4-0.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 2
+sidebar_position: 3
 sidebar_label: Upgrade from v1.3.2 to v1.4.0
 title: "Upgrade from v1.3.2 to v1.4.0"
 ---
76 changes: 76 additions & 0 deletions docs/upgrade/v1-4-0-to-v1-4-1.md
@@ -0,0 +1,76 @@
---
sidebar_position: 2
sidebar_label: Upgrade from v1.4.0 to v1.4.1
title: "Upgrade from v1.4.0 to v1.4.1"
---

<head>
<link rel="canonical" href="https://docs.harvesterhci.io/v1.4/upgrade/v1-4-0-to-v1-4-1"/>
</head>

## General information

An **Upgrade** button appears on the **Dashboard** screen whenever a new Harvester version that you can upgrade to becomes available. For more information, see [Start an upgrade](./automatic.md#start-an-upgrade).

For air-gapped environments, see [Prepare an air-gapped upgrade](./automatic.md#prepare-an-air-gapped-upgrade).


## Known issues

---

### 1. An upgrade is stuck in the Pre-drained state due to Longhorn orphan engine(s)

You might see that an upgrade is stuck in the **Pre-drained** state:

![](/img/v1.2/upgrade/known_issues/3730-stuck.png)

At this stage, Kubernetes is supposed to drain the workloads on the node, but several conditions can cause the process to stall.

Orphaned engine processes in the Longhorn instance manager can cause this issue. To check whether that is the case, perform the following steps (a consolidated script sketch follows the list):

1. Assume the stuck node is `harvester-node-1`.
1. Check the `instance-manager` pod's name on the stuck node:

```
$ kubectl get pods -n longhorn-system --field-selector spec.nodeName=harvester-node-1 | grep instance-manager
instance-manager-d80e13f520e7b952f4b7593fc1883e2a 1/1 Running 0 3d8h
```
The output above shows that the `instance-manager-d80e13f520e7b952f4b7593fc1883e2a` pod is on the node.
1. Check the Longhorn manager logs and verify that the `instance-manager` pod can't be drained because of the engine `pvc-9ae0e9a5-a630-4f0c-98cc-b14893c74f9e-e-0`:
```
$ kubectl -n longhorn-system logs daemonsets/longhorn-manager
...
time="2025-01-14T00:00:01Z" level=info msg="Node instance-manager-d80e13f520e7b952f4b7593fc1883e2a is marked unschedulable but removing harvester-node-1 PDB is blocked: some volumes are still attached InstanceEngines count 1 pvc-9ae0e9a5-a630-4f0c-98cc-b14893c74f9e-e-0" func="controller.(*InstanceManagerController).syncInstanceManagerPDB" file="instance_manager_controller.go:823" controller=longhorn-instance-manager node=harvester-node-1
```
1. Run the following command to check whether the engine is still running on the stuck node:
```
$ kubectl -n longhorn-system get engines.longhorn.io pvc-9ae0e9a5-a630-4f0c-98cc-b14893c74f9e-e-0 -o jsonpath='{"Current state: "}{.status.currentState}{"\nNode ID: "}{.spec.nodeID}{"\n"}'
Current state: stopped
Node ID:
```
If the output shows that the engine is not running, you can conclude that this issue has occurred.
1. Before applying the workaround, verify that all attached volumes are healthy:
```
kubectl get volumes -n longhorn-system -o yaml | yq '.items[] | select(.status.state == "attached") | .status.robustness'
```
Every value in the output should be `healthy`. If this is not the case, you might need to uncordon nodes so that the volumes can become healthy again.
1. Remove the instance manager's PodDisruptionBudget (PDB):
```
kubectl delete pdb instance-manager-d80e13f520e7b952f4b7593fc1883e2a -n longhorn-system
```
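
The steps above can be strung together into a single diagnostic pass. The following shell sketch is illustrative only, not an official tool: the node name, engine name, and PDB name are the examples used on this page, so replace them with the values from your own cluster before running it.

```
#!/bin/sh
# Minimal sketch: consolidates the diagnostic steps above.
# All names below are examples from this page; adjust them to your cluster.
NODE="harvester-node-1"

# Find the instance-manager pod on the stuck node.
kubectl get pods -n longhorn-system --field-selector spec.nodeName="$NODE" | grep instance-manager

# Look for the log line that names the engine blocking PDB removal.
kubectl -n longhorn-system logs daemonsets/longhorn-manager | grep "PDB is blocked"

# Confirm that the engine named in that log line is not running.
ENGINE="pvc-9ae0e9a5-a630-4f0c-98cc-b14893c74f9e-e-0"
kubectl -n longhorn-system get engines.longhorn.io "$ENGINE" \
  -o jsonpath='{"Current state: "}{.status.currentState}{"\nNode ID: "}{.spec.nodeID}{"\n"}'

# Every attached volume must report "healthy" before you continue.
kubectl get volumes -n longhorn-system -o yaml \
  | yq '.items[] | select(.status.state == "attached") | .status.robustness'

# Only after all checks pass, remove the blocking PDB
# (its name matches the instance-manager pod name):
# kubectl delete pdb instance-manager-d80e13f520e7b952f4b7593fc1883e2a -n longhorn-system
```
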
- Related issues:
- [[BUG] v1.4.0 -> v1.4.1-rc1 upgrade stuck in Pre-drained and the node stay in Cordoned](https://github.com/harvester/harvester/issues/7366)
- [[IMPROVEMENT] Cleanup orphaned volume runtime resources if the resources already deleted](https://github.com/longhorn/longhorn/issues/6764)
2 changes: 1 addition & 1 deletion versioned_docs/version-v1.4/upgrade/v1-1-2-to-v1-2-0.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 7
+sidebar_position: 8
 sidebar_label: Upgrade from v1.1.2 to v1.2.0 (not recommended)
 title: "Upgrade from v1.1.2 to v1.2.0 (not recommended)"
 ---
2 changes: 1 addition & 1 deletion versioned_docs/version-v1.4/upgrade/v1-2-0-to-v1-2-1.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 6
+sidebar_position: 7
 sidebar_label: Upgrade from v1.1.2/v1.1.3/v1.2.0 to v1.2.1
 title: "Upgrade from v1.1.2/v1.1.3/v1.2.0 to v1.2.1"
 ---
2 changes: 1 addition & 1 deletion versioned_docs/version-v1.4/upgrade/v1-2-1-to-v1-2-2.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 5
+sidebar_position: 6
 sidebar_label: Upgrade from v1.2.1 to v1.2.2
 title: "Upgrade from v1.2.1 to v1.2.2"
 ---
2 changes: 1 addition & 1 deletion versioned_docs/version-v1.4/upgrade/v1-2-2-to-v1-3-1.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 4
+sidebar_position: 5
 sidebar_label: Upgrade from v1.2.2/v1.3.0 to v1.3.1
 title: "Upgrade from v1.2.2/v1.3.0 to v1.3.1"
 ---
2 changes: 1 addition & 1 deletion versioned_docs/version-v1.4/upgrade/v1-3-1-to-v1-3-2.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 3
+sidebar_position: 4
 sidebar_label: Upgrade from v1.3.1 to v1.3.2
 title: "Upgrade from v1.3.1 to v1.3.2"
 ---
2 changes: 1 addition & 1 deletion versioned_docs/version-v1.4/upgrade/v1-3-2-to-v1-4-0.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 2
+sidebar_position: 3
 sidebar_label: Upgrade from v1.3.2 to v1.4.0
 title: "Upgrade from v1.3.2 to v1.4.0"
 ---
76 changes: 76 additions & 0 deletions versioned_docs/version-v1.4/upgrade/v1-4-0-to-v1-4-1.md
@@ -0,0 +1,76 @@
---
sidebar_position: 2
sidebar_label: Upgrade from v1.4.0 to v1.4.1
title: "Upgrade from v1.4.0 to v1.4.1"
---

<head>
<link rel="canonical" href="https://docs.harvesterhci.io/v1.4/upgrade/v1-4-0-to-v1-4-1"/>
</head>

## General information

An **Upgrade** button appears on the **Dashboard** screen whenever a new Harvester version that you can upgrade to becomes available. For more information, see [Start an upgrade](./automatic.md#start-an-upgrade).

For air-gapped environments, see [Prepare an air-gapped upgrade](./automatic.md#prepare-an-air-gapped-upgrade).


## Known issues

---

### 1. An upgrade is stuck in the Pre-drained state due to Longhorn orphan engine(s)

You might see that an upgrade is stuck in the **Pre-drained** state:

![](/img/v1.2/upgrade/known_issues/3730-stuck.png)

At this stage, Kubernetes is supposed to drain the workloads on the node, but several conditions can cause the process to stall.

Orphaned engine processes in the Longhorn instance manager can cause this issue. To check whether that is the case, perform the following steps (a consolidated script sketch follows the list):

1. Assume the stuck node is `harvester-node-1`.
1. Check the `instance-manager` pod's name on the stuck node:

```
$ kubectl get pods -n longhorn-system --field-selector spec.nodeName=harvester-node-1 | grep instance-manager
instance-manager-d80e13f520e7b952f4b7593fc1883e2a 1/1 Running 0 3d8h
```
The output above shows that the `instance-manager-d80e13f520e7b952f4b7593fc1883e2a` pod is on the node.
1. Check the Longhorn manager logs and verify that the `instance-manager` pod can't be drained because of the engine `pvc-9ae0e9a5-a630-4f0c-98cc-b14893c74f9e-e-0`:
```
$ kubectl -n longhorn-system logs daemonsets/longhorn-manager
...
time="2025-01-14T00:00:01Z" level=info msg="Node instance-manager-d80e13f520e7b952f4b7593fc1883e2a is marked unschedulable but removing harvester-node-1 PDB is blocked: some volumes are still attached InstanceEngines count 1 pvc-9ae0e9a5-a630-4f0c-98cc-b14893c74f9e-e-0" func="controller.(*InstanceManagerController).syncInstanceManagerPDB" file="instance_manager_controller.go:823" controller=longhorn-instance-manager node=harvester-node-1
```
1. Run the following command to check whether the engine is still running on the stuck node:
```
$ kubectl -n longhorn-system get engines.longhorn.io pvc-9ae0e9a5-a630-4f0c-98cc-b14893c74f9e-e-0 -o jsonpath='{"Current state: "}{.status.currentState}{"\nNode ID: "}{.spec.nodeID}{"\n"}'
Current state: stopped
Node ID:
```
If the output shows that the engine is not running, you can conclude that this issue has occurred.
1. Before applying the workaround, verify that all attached volumes are healthy:
```
kubectl get volumes -n longhorn-system -o yaml | yq '.items[] | select(.status.state == "attached") | .status.robustness'
```
Every value in the output should be `healthy`. If this is not the case, you might need to uncordon nodes so that the volumes can become healthy again.
1. Remove the instance manager's PodDisruptionBudget (PDB):
```
kubectl delete pdb instance-manager-d80e13f520e7b952f4b7593fc1883e2a -n longhorn-system
```
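
The steps above can be strung together into a single diagnostic pass. The following shell sketch is illustrative only, not an official tool: the node name, engine name, and PDB name are the examples used on this page, so replace them with the values from your own cluster before running it.

```
#!/bin/sh
# Minimal sketch: consolidates the diagnostic steps above.
# All names below are examples from this page; adjust them to your cluster.
NODE="harvester-node-1"

# Find the instance-manager pod on the stuck node.
kubectl get pods -n longhorn-system --field-selector spec.nodeName="$NODE" | grep instance-manager

# Look for the log line that names the engine blocking PDB removal.
kubectl -n longhorn-system logs daemonsets/longhorn-manager | grep "PDB is blocked"

# Confirm that the engine named in that log line is not running.
ENGINE="pvc-9ae0e9a5-a630-4f0c-98cc-b14893c74f9e-e-0"
kubectl -n longhorn-system get engines.longhorn.io "$ENGINE" \
  -o jsonpath='{"Current state: "}{.status.currentState}{"\nNode ID: "}{.spec.nodeID}{"\n"}'

# Every attached volume must report "healthy" before you continue.
kubectl get volumes -n longhorn-system -o yaml \
  | yq '.items[] | select(.status.state == "attached") | .status.robustness'

# Only after all checks pass, remove the blocking PDB
# (its name matches the instance-manager pod name):
# kubectl delete pdb instance-manager-d80e13f520e7b952f4b7593fc1883e2a -n longhorn-system
```
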
- Related issues:
- [[BUG] v1.4.0 -> v1.4.1-rc1 upgrade stuck in Pre-drained and the node stay in Cordoned](https://github.com/harvester/harvester/issues/7366)
- [[IMPROVEMENT] Cleanup orphaned volume runtime resources if the resources already deleted](https://github.com/longhorn/longhorn/issues/6764)
