Improve resource cleanup section (#350)
* Update crm resource list to crm resource status

* Add more information about fail counts

bsc#1211019
jsc#DOCTEAM-977

* failcount -> fail count

* Improve resource cleanup procedure

bsc#1211019
jsc#DOCTEAM-977

* Apply suggestions from editorial review

Co-authored-by: Daria Vladykina <[email protected]>

---------

Co-authored-by: Daria Vladykina <[email protected]>
tahliar and dariavladykina authored Oct 11, 2023
1 parent d275162 commit 62bf34a
Showing 4 changed files with 47 additions and 29 deletions.
4 changes: 2 additions & 2 deletions xml/geo_booth_i.xml
@@ -327,7 +327,7 @@ ticket = "&ticket2;" <xref linkend="co-ha-geo-booth-config-ticket" xrefstyle="se
run on the current cluster site. That means, it checks if the cluster is
healthy enough to run the resource (all resource dependencies are
fulfilled, the cluster partition has quorum, no dirty nodes, etc.). For
-example, if a service in the dependency-chain has a failcount of
+example, if a service in the dependency-chain has a fail count of
<literal>INFINITY</literal> on all available nodes, the service cannot be
run on that site. In that case, it is of no use to claim the ticket.
</para>
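As an illustrative aside (not part of the changed files): the per-node fail count that this check relies on can be inspected with &crmsh;. A sketch, assuming a hypothetical resource <literal>rsc1</literal> and illustrative output:

<screen>&prompt.root;<command>crm resource failcount rsc1 show &node1;</command>
scope=status  name=fail-count-rsc1  value=INFINITY</screen>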
@@ -734,7 +734,7 @@ ticket = "tkt-sap-prod" <xref linkend="co-ha-geo-booth-config-ticket" xrefstyle=
cluster site. That means, it checks if the cluster is healthy enough to
run the resource (all resource dependencies are fulfilled, the cluster
partition has quorum, no dirty nodes, etc.). For example, if a service in
-the dependency-chain has a failcount of <literal>INFINITY</literal> on all
+the dependency chain has a fail count of <literal>INFINITY</literal> on all
available nodes, the service cannot be run on that site. In that case, it
is of no use to claim the ticket.
</para>
12 changes: 6 additions & 6 deletions xml/ha_management.xml
@@ -8,10 +8,10 @@
<!--taroth 2011-09-16: in accordance with kdupke, man pages are removed
from the book, except for a general overview-->
<appendix xmlns="http://docbook.org/ns/docbook" xmlns:xi="http://www.w3.org/2001/XInclude" xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0" xml:id="app-ha-management">
-<!--
+<!--
The source for Pacemaker manual page can be found at the
Mercurial repository.
1. Install the package mercurial.
2. Clone the URL:
$ hg clone http://hg.clusterlabs.org/pacemaker/doc pacemaker-doc
@@ -155,7 +155,7 @@ from the book, except for a general overview-->
<para>
The <command>crm_failcount</command> command queries the number of
failures per resource on a given node. This tool can also be used to
-reset the failcount, allowing the resource to again run on nodes where
+reset the fail count, allowing the resource to again run on nodes where
it had failed too often. See the <command>crm_failcount</command> man
page for a detailed introduction to this tool's usage and command
syntax.
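As an illustrative aside (not part of the changed files), a query and a manual reset with this tool might look as follows, assuming a resource <literal>rsc1</literal> and node <literal>&node1;</literal>:

<screen>&prompt.root;<command>crm_failcount --query --resource rsc1 --node &node1;</command>
&prompt.root;<command>crm_failcount --delete --resource rsc1 --node &node1;</command></screen>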
@@ -178,10 +178,10 @@ from the book, except for a general overview-->
</listitem>
</varlistentry>
</variablelist>
-<!--
+<!--
The source for Pacemaker manual page can be found at the
Mercurial repository.
1. Install the package mercurial.
2. Clone the URL:
$ hg clone http://hg.clusterlabs.org/pacemaker/doc pacemaker-doc
@@ -200,4 +200,4 @@ from the book, except for a general overview-->
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ha_crmshadow.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ha_crmstandby.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ha_crmverify.xml"/>-->
-</appendix>
+</appendix>
40 changes: 29 additions & 11 deletions xml/ha_managing_resources.xml
@@ -374,17 +374,17 @@ primitive admin_addr IPaddr2 \
<title>Cleaning up cluster resources</title>
<para>
A resource is automatically restarted if it fails, but each failure
-increases the resource's failcount.
+increases the resource's fail count.
</para>
<para>
If a <literal>migration-threshold</literal> has been set for the resource,
the node will no longer run the resource when the number of failures reaches
the migration threshold.
</para>
<para>
-A resource's failcount can either be reset automatically (by setting a
-<literal>failure-timeout</literal> option for the resource), or it can be
-reset manually using either &hawk2; or &crmsh;.
+By default, fail counts are not automatically reset. You can configure a fail count
+to be reset automatically by setting a <literal>failure-timeout</literal> option for the
+resource, or you can manually reset the fail count using either &hawk2; or &crmsh;.
</para>
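As a sketch (not part of the changed files), a <literal>failure-timeout</literal> could be set with &crmsh; roughly like this, assuming a resource named <literal>rsc1</literal> and a timeout of 120 seconds:

<screen>&prompt.root;<command>crm resource meta rsc1 set failure-timeout 120</command></screen>

With such a setting, failures older than the timeout no longer count toward the resource's fail count.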
<sect2 xml:id="sec-conf-hawk2-manage-cleanup">
<title>Cleaning up cluster resources with &hawk2;</title>
@@ -429,17 +429,35 @@ primitive admin_addr IPaddr2 \
<para>
Get a list of all your resources:
</para>
-<screen>&prompt.root;<command>crm resource list</command>
-...
-Resource Group: dlm-clvm:1
-dlm:1 (ocf:pacemaker:controld) Started
-clvm:1 (ocf:heartbeat:lvmlockd) Started</screen>
+<screen>&prompt.root;<command>crm resource status</command>
+Full List of Resources
+* admin-ip (ocf:heartbeat:IPaddr2): Started
+* stonith-sbd (stonith:external/sbd): Started
+* Resource Group: dlm-clvm:
+* dlm: (ocf:pacemaker:controld) Started
+* clvm: (ocf:heartbeat:lvmlockd) Started</screen>
</step>
+<step>
+<para>
+Show the fail count of a resource:
+</para>
+<screen>&prompt.root;<command>crm resource failcount <replaceable>RESOURCE</replaceable> show <replaceable>NODE</replaceable></command></screen>
+<para>
+For example, to show the fail count of the resource <literal>dlm</literal> on node
+<literal>&node1;</literal>:
+</para>
+<screen>&prompt.root;<command>crm resource failcount dlm show &node1;</command>
+scope=status name=fail-count-dlm value=2</screen>
+</step>
<step>
<para>
-To clean up the resource <literal>dlm</literal>, for example:
+Clean up the resource:
</para>
-<screen>&prompt.root;<command>crm resource cleanup dlm</command></screen>
+<screen>&prompt.root;<command>crm resource cleanup <replaceable>RESOURCE</replaceable></command></screen>
+<para>
+This command cleans up the resource on all nodes. If the resource is part of a group,
+&crmsh; also cleans up the other resources in the group.
+</para>
</step>
</procedure>
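As an illustrative aside (not part of the changed files): recent &crmsh; versions typically also accept a node name after the resource, which limits the cleanup to that node. For example, for the resource <literal>dlm</literal>:

<screen>&prompt.root;<command>crm resource cleanup dlm &node1;</command></screen>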
</sect2>
20 changes: 10 additions & 10 deletions xml/ha_resource_constraints.xml
@@ -817,7 +817,7 @@
A resource is automatically restarted if it fails. If that cannot
be achieved on the current node, or it fails <literal>N</literal> times
on the current node, it tries to fail over to another node. Each time
-the resource fails, its failcount is raised. You can define several
+the resource fails, its fail count is raised. You can define several
failures for resources (a <literal>migration-threshold</literal>), after
which they will migrate to a new node. If you have more than two nodes
in your cluster, the node a particular resource fails over to is chosen
@@ -838,12 +838,12 @@
for resource <literal>rsc1</literal> to preferably run on
<literal>&node1;</literal>. If it fails there,
<literal>migration-threshold</literal> is checked and compared to the
-failcount. If failcount &gt;= migration-threshold then the resource is
+fail count. If <literal>failcount</literal> &gt;= migration-threshold, then the resource is
migrated to the node with the next best preference.
</para>
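For illustration (not part of the changed files), such a setup might be configured along these lines, with a stand-in Dummy resource named <literal>rsc1</literal> and a threshold of 3 assumed:

<screen>&prompt.crm.conf;<command>primitive rsc1 ocf:heartbeat:Dummy \
  meta migration-threshold=3</command>
&prompt.crm.conf;<command>location rsc1-&node1; rsc1 100: &node1;</command></screen>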
<para>
After the threshold has been reached, the node will no longer be
-allowed to run the failed resource until the resource's failcount is
+allowed to run the failed resource until the resource's fail count is
reset. This can be done manually by the cluster administrator or by
setting a <literal>failure-timeout</literal> option for the resource.
</para>
@@ -862,7 +862,7 @@
<itemizedlist>
<listitem>
<para>
-Start failures set the failcount to <literal>INFINITY</literal> and
+Start failures set the fail count to <literal>INFINITY</literal> and
thus always cause an immediate migration.
</para>
</listitem>
@@ -907,7 +907,7 @@
</step>
<step>
<para>
-If you want to automatically expire the failcount for a resource, add the
+If you want to automatically expire the fail count for a resource, add the
<literal>failure-timeout</literal> meta attribute to the resource as
described in
<xref linkend="pro-conf-hawk2-primitive-add" xrefstyle="select:label title nopage"/>,
@@ -934,8 +934,8 @@
</step>
</procedure>
<para>
-Instead of letting the failcount for a resource expire automatically, you
-can also clean up failcounts for a resource manually at any time. Refer to
+Instead of letting the fail count for a resource expire automatically, you
+can also clean up fail counts for a resource manually at any time. Refer to
<xref linkend="sec-conf-hawk2-manage-cleanup"/> for details.
</para>
</sect2>
@@ -944,19 +944,19 @@
<title>Specifying resource failover nodes with &crmsh;</title>
<para>
To determine a resource failover, use the meta attribute
-<literal>migration-threshold</literal>. If failcount exceeds
+<literal>migration-threshold</literal>. If the fail count exceeds
<literal>migration-threshold</literal> on all nodes, the resource
remains stopped. For example:
</para>
<screen>&prompt.crm.conf;<command>location rsc1-&node1; rsc1 100: &node1;</command></screen>
<para>
Normally, <literal>rsc1</literal> prefers to run on <literal>&node1;</literal>.
If it fails there, <literal>migration-threshold</literal> is checked and compared
-to the failcount. If <literal>failcount</literal> &gt;= <literal>migration-threshold</literal>
+to the fail count. If <literal>failcount</literal> &gt;= <literal>migration-threshold</literal>,
then the resource is migrated to the node with the next best preference.
</para>
<para>
Whether start failures set the fail count to <literal>INFINITY</literal> depends on the
Start failures set the fail count to inf depend on the
<option>start-failure-is-fatal</option> option. Stop failures cause
fencing. If there is no STONITH defined, the resource will not migrate.
</para>
Expand Down

