From 640cfbfbe1a0c8f6fee3b75e6bc17bb8295f50fd Mon Sep 17 00:00:00 2001 From: Thomas Schraitle Date: Fri, 15 Jun 2018 13:11:09 +0200 Subject: [PATCH] Integrate bsc#1088466 (#80) * Integrate bsc#1088466 * Introduce section "Quorum Calculation" * Introduce new section about Use Case Scenarios * Carve out difference between 2-node and n-node clusters * Add Corosync excerpt of config for 2-node cluster * Add missing * Split variablelist and add additional titles Implement feedback from Tanja, Lars, and Yan. Many thanks! --- xml/ha_config_basics.xml | 405 +++++++++++++++++++++++++++++++++------ xml/ha_requirements.xml | 16 +- 2 files changed, 352 insertions(+), 69 deletions(-) diff --git a/xml/ha_config_basics.xml b/xml/ha_config_basics.xml index 6a6a16a7f..9e3792ed4 100644 --- a/xml/ha_config_basics.xml +++ b/xml/ha_config_basics.xml @@ -36,25 +36,256 @@ - - Global Cluster Options + + Use Case Scenarios + In general, clusters fall into one of two categories: + + + Two-node clusters + + + Clusters with more than two nodes. This usually means an odd number of nodes. + + + + Combined with different topologies, these categories yield different use cases. + The following use cases are the most common: + + + + + + Two-node cluster in one location + + + Configuration: + FC SAN or similar shared storage, layer 2 network. + + + Usage scenario: + Embedded clusters that focus on service high + availability and not data redundancy for data replication. + Such a setup is used for radio stations or assembly line controllers, + for example. + + + + + + Two-node clusters in two locations (most widely used) + + + Configuration: + Symmetrical stretched cluster, FC SAN, and layer 2 network + all across two locations. + + + Usage scenario: + Classical stretched clusters, focus on service high availability + and local data redundancy. For databases and enterprise + resource planning. One of the most popular setups in recent + years. 
+ + + + + + Odd number of nodes in three locations + + + Configuration: + 2×N+1 nodes, FC SAN across two main locations. Auxiliary + third site with no FC SAN, but acts as a majority maker. + Layer 2 network at least across two main locations. + + + + Usage scenario: + Classical stretched cluster, focus on service high availability + and data redundancy. For example, databases, enterprise resource planning. + + + + + + + + + + + + Quorum Determination - Global cluster options control how the cluster behaves when confronted - with certain situations. They are grouped into sets and can be viewed and - modified with the cluster management tools like &hawk2; and the - crm shell. + Whenever communication fails between one or more nodes and the rest of the + cluster, a cluster partition occurs. The nodes can only communicate with + other nodes in the same partition and are unaware of the separated nodes. + A cluster partition is defined to have quorum (is quorate) + if it has the majority of nodes (or votes). + Whether a partition is quorate is determined by quorum calculation. + Quorum is a requirement for fencing. + + + Quorum calculation has changed between &productname; 11 and + &productname; 15. For &productname; 11, quorum was calculated by + &pace;. + Starting with &productname; 12, &corosync; can handle quorum for + two-node clusters directly without changing the &pace; configuration. - - Overview - - For an overview of all global cluster options and their default values, - see &paceex;, available from . Refer to section - Available Cluster Options. - - - The predefined values can usually be kept. However, to make key + How quorum is calculated is influenced by the following factors: + + + Number of Cluster Nodes + + To keep services running, a cluster with more than two nodes + relies on quorum (majority vote) to resolve cluster partitions. + Based on the following formula, you can calculate the minimum + number of operational nodes required for the cluster to function: +
+ Formula to Calculate the Number of Operational Nodes + + + N ≥ ⌊C/2⌋ + 1 + +N = minimum number of operational nodes +C = number of cluster nodes + + +
+ For example, a five-node cluster needs a minimum of three operational + nodes, so it can tolerate the failure of two nodes. + + We strongly recommend using either a two-node cluster or an odd number + of cluster nodes. + Two-node clusters make sense for stretched setups across two sites. + Clusters with an odd number of nodes can either be built on one single + site or be spread across three sites. + +
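The majority rule described above can be sketched in a few lines of Python. This is an illustrative calculation only, not part of the patch or of any cluster tool: the function names `min_operational_nodes` and `tolerated_failures` are hypothetical, and the division is read as integer division, which matches the five-node example. The special two-node handling in &corosync; is not modeled.

```python
def min_operational_nodes(cluster_nodes: int) -> int:
    """Smallest node count that forms a strict majority (hypothetical helper)."""
    if cluster_nodes < 1:
        raise ValueError("a cluster needs at least one node")
    # Strict majority: more than half of all nodes, using integer division.
    return cluster_nodes // 2 + 1


def tolerated_failures(cluster_nodes: int) -> int:
    """Number of nodes that may fail while a majority remains (hypothetical helper)."""
    return cluster_nodes - min_operational_nodes(cluster_nodes)


if __name__ == "__main__":
    for nodes in (3, 4, 5, 6, 7):
        print(
            f"{nodes} nodes: need {min_operational_nodes(nodes)}, "
            f"tolerate {tolerated_failures(nodes)} failure(s)"
        )
```

Under this reading, a four-node cluster tolerates no more failures than a three-node cluster, which is one way to see why an odd number of nodes is recommended.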
+
+ + Corosync Configuration + + &corosync; is a messaging and membership layer, see + and + . + + + +
+ + + Global Cluster Options + Global cluster options control how the cluster behaves when + confronted with certain situations. They are grouped into sets and can be + viewed and modified with the cluster management tools like &hawk2; and + the crm shell. + The predefined values can usually be kept. However, to make key functions of your cluster work correctly, you need to adjust the following parameters after basic cluster setup: @@ -70,27 +301,10 @@
- - Learn how to adjust those parameters with the cluster management tools - of your choice: - - - - - &hawk2;: - - - -
- Option <literal>no-quorum-policy</literal> + Global Option <literal>no-quorum-policy</literal> This global option defines what to do when a cluster partition does not have quorum (no majority of nodes is part of the partition). @@ -103,31 +317,23 @@ ignore + - The quorum state does not influence the cluster behavior; resource - management is continued. + Setting no-quorum-policy to ignore makes + the cluster behave like it has quorum. Resource management is + continued. - This setting is useful for the following scenarios: + This was the default for &slsa; 11 for a two-node cluster. + Starting with &slsa; 12, this option is obsolete. + Based on configuration and conditions, &corosync; gives cluster nodes + or a single node quorum—or not. + + + For two node clusters the only meaningful behaviour is to always + react in case of quorum loss. The first step always should be + trying to fence the lost node. - - - - Resource-driven clusters: For local clusters with redundant - communication channels, a split brain scenario only has a certain - probability. Thus, a loss of communication with a node most likely - indicates that the node has crashed. The surviving nodes - should recover and start serving the resources again. - - - If no-quorum-policy is set to - ignore, a 4-node cluster can sustain concurrent - failure of three nodes before service is lost. With the other - settings, it would lose quorum after concurrent failure of two - nodes. For a two-node cluster this option and value is never set. - - - @@ -167,7 +373,8 @@ If quorum is lost, all nodes in the affected cluster partition are - fenced. + fenced. This option works only in combination with SBD, see + . 
@@ -175,7 +382,7 @@ - Option stonith-enabled + Global Option stonith-enabled This global option defines whether to apply fencing, allowing &stonith; devices to shoot failed nodes and nodes with resources that cannot be @@ -191,7 +398,7 @@ aware that this has impact on the support status for your product. Furthermore, with stonith-enabled="false", resources like the Distributed Lock Manager (DLM) and all services depending on - DLM (such as LVM2, GFS2, and OCFS2) will fail to start. + DLM (such as cLVM, GFS2, and OCFS2) will fail to start. No Support Without &stonith; @@ -200,7 +407,91 @@ + + + &corosync; Configuration for Two-Node Clusters + + When using the bootstrap scripts, the &corosync; configuration contains + a quorum section with the following options: + + + Excerpt of &corosync; Configuration for a Two-Node Cluster + quorum { + # Enable and configure quorum subsystem (default: off) + # see also corosync.conf.5 and votequorum.5 + provider: corosync_votequorum + expected_votes: 2 + two_node: 1 +} + + + As opposed to &sle; 11, the votequorum subsystem in &sle; 12 is + powered by &corosync; version 2.x. This means that the + no-quorum-policy=ignore option must not be used. + + + By default, when two_node: 1 is set, the + wait_for_all option is automatically enabled. + If wait_for_all is not enabled, the cluster should be + started on both nodes in parallel. Otherwise, the first node will perform + a startup-fencing on the missing second node. + + + + &corosync; Configuration for N-Node Clusters + When not using a two-node cluster, we strongly recommend an odd + number of nodes for your N-node cluster. With regard to quorum + configuration, you have the following options: + + + Adding additional nodes with the ha-cluster-join + command, or + + + Adapting the &corosync; configuration manually. 
+ + + + If you adjust /etc/corosync/corosync.conf manually, + use the following settings: + + + Excerpt of &corosync; Configuration for an N-Node Cluster + quorum { + provider: corosync_votequorum + expected_votes: N + wait_for_all: 1 +} + + + Use the quorum service from &corosync; + + + The number of votes to expect. This parameter can either be + provided inside the quorum section, or is + automatically calculated when the nodelist + section is available. + + + + Enables the wait for all (WFA) feature. + When WFA is enabled, the cluster will be quorate for the first time + only after all nodes have been visible. + To avoid some startup race conditions, setting + wait_for_all to 1 may help. + For example, in a five-node cluster every node has one vote and thus, + expected_votes is set to 5. + As soon as three or more nodes are visible to each other, the cluster + partition becomes quorate and can start operating. + + + + +
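The interaction between expected_votes and wait_for_all described above can be illustrated with a small Python model. This is a simplified sketch under stated assumptions, not how &corosync; implements votequorum: the class `QuorumModel` and its method names are hypothetical, every node is assumed to carry exactly one vote, and options such as two_node are not modeled.

```python
class QuorumModel:
    """Toy model of majority quorum with an optional wait-for-all gate (hypothetical)."""

    def __init__(self, expected_votes: int, wait_for_all: bool = True) -> None:
        if expected_votes < 1:
            raise ValueError("expected_votes must be at least 1")
        self.expected_votes = expected_votes
        # Without wait_for_all, the gate is open from the start.
        self._all_seen = not wait_for_all

    def is_quorate(self, visible_votes: int) -> bool:
        """Return True if a partition with visible_votes would be quorate."""
        if not self._all_seen:
            # First-time quorum requires that all nodes have been visible.
            if visible_votes < self.expected_votes:
                return False
            self._all_seen = True
        # Afterwards, a strict majority of the expected votes is enough.
        return visible_votes >= self.expected_votes // 2 + 1


if __name__ == "__main__":
    model = QuorumModel(expected_votes=5, wait_for_all=True)
    for votes in (3, 5, 3, 2):
        print(f"{votes} visible vote(s): quorate={model.is_quorate(votes)}")
```

In this model, a five-node cluster with the gate enabled stays non-quorate at three visible nodes until all five have been seen once. After that, three visible nodes are sufficient and two are not.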
+ Cluster Resources diff --git a/xml/ha_requirements.xml b/xml/ha_requirements.xml index 36d45951a..3477b40fc 100644 --- a/xml/ha_requirements.xml +++ b/xml/ha_requirements.xml @@ -228,18 +228,10 @@ Number of Cluster Nodes - - Odd Number of Cluster Nodes - - For clusters with more than three nodes, it is strongly recommended to use - an odd number of cluster nodes. - - To keep services running, a cluster with more than two nodes - relies on quorum (majority vote) to resolve cluster partitions. - A two- or three-node cluster can tolerate the failure of one node at a time, - a five-node cluster can tolerate failures of two nodes, etc. - - + For clusters with more than two nodes, it is strongly recommended to use + an odd number of cluster nodes to have quorum. For more information + about quorum, see . +