[crmsh-4.6] Dev: pre-migration: implement pre-migration checks for SLES 16 (jsc#PED-11808) #1629

Open
wants to merge 23 commits into
base: crmsh-4.6
Choose a base branch
from

Conversation

nicholasyang2022
Collaborator

@nicholasyang2022 nicholasyang2022 commented Dec 13, 2024

This pull request implements pre-migration checks for SLES 16. These checks are expected to run before migrating to SLES 16.

These checks ensure that all cluster nodes run the latest version of corosync 2.x and pacemaker 2.1.x, and report any resource agents and fence agents in use that will be removed in SLES 16. They also advise on further actions to take after migration.


Use Cases

crm cluster health sles16

suse@ha-1-1:~> sudo crm cluster health sles16
------ ha-1-2 ------
[WARN] Cluster services are not running. Check results may be outdated or inaccurate.
       * corosync
       * pacemaker
[WARN] Corosync transport "udpu" will be deprecated in corosync 3.
       After migrating to SLES 16, run "crm cluster health sles16 --fix" to migrate it to transport "knet".
[PASS] Good to migrate to SLES 16.

------ localhost ------
[INFO] Checking dependency version...
[INFO] Checking service status...
[WARN] Cluster services are not running. Check results may be outdated or inaccurate.
       * corosync
       * pacemaker
[INFO] Checking used corosync features...
[WARN] Corosync transport "udpu" will be deprecated in corosync 3.
       After migrating to SLES 16, run "crm cluster health sles16 --fix" to migrate it to transport "knet".
[INFO] Checking used resource agents...
[FAIL] SAPHanaSR Classic will be removed in SLES 16.
       Before migrating to SLES 16, replace it with SAPHanaSR-angi.
[WARN] stonith:external/sbd will be deprecated.
       * After migrating to SLES 16, please replace it with stonith:fence_sbd.
[FAIL] Please fix all the "FAIL" problems above before migrating to SLES 16.

crm cluster health sles16 --fix

> sudo crm cluster health sles16 --fix
ERROR: "--fix" is only available in SLES 16.

When some of the nodes are offline

------ ha-1-2 ------
Cannot create SSH connection to root@ha-1-2: ssh: connect to host ha-1-2 port 22: No route to host

------ localhost ------
[INFO] Checking dependency version...
[INFO] Checking service status...
[WARN] Cluster services are not running. Check results may be outdated or inaccurate.
       * corosync
       * pacemaker
[INFO] Checking used corosync features...
[WARN] Corosync transport "udpu" will be deprecated in corosync 3.
       After migrating to SLES 16, run "crm cluster health sles16 --fix" to migrate it to transport "knet".
[INFO] Checking used resource agents...
[FAIL] Please fix all the "FAIL" problems above before migrating to SLES 16.
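
The output format in the use cases above can be sketched with a small result handler. This is only an illustration: `log_info` and `handle_tip` are names that appear in the reviewed code, while `TextResultHandler`, `handle_problem`, `end`, and the 0/1 return codes are assumptions made for this example, not the PR's implementation.

```python
import io
import typing


class TextResultHandler:
    """Hypothetical handler rendering results in the [INFO]/[WARN]/[FAIL] style."""

    def __init__(self, out: typing.TextIO):
        self.out = out
        self.has_problems = False

    def _write(self, tag: str, title: str, details: typing.Iterable[str] = ()):
        self.out.write(f'[{tag}] {title}\n')
        for line in details:
            # Detail lines are indented to align under the title text.
            self.out.write(f'       {line}\n')

    def log_info(self, msg: str):
        self._write('INFO', msg)

    def handle_tip(self, title: str, details: typing.Iterable[str]):
        self._write('WARN', title, details)

    def handle_problem(self, title: str, details: typing.Iterable[str]):
        # Hypothetical name: records a problem that blocks the migration.
        self.has_problems = True
        self._write('FAIL', title, details)

    def end(self) -> int:
        """Print the verdict and return a main()-style exit code (assumed 0/1)."""
        if self.has_problems:
            self._write('FAIL', 'Please fix all the "FAIL" problems above before migrating to SLES 16.')
            return 1
        self._write('PASS', 'Good to migrate to SLES 16.')
        return 0


buf = io.StringIO()
handler = TextResultHandler(buf)
handler.log_info('Checking used resource agents...')
handler.handle_problem('SAPHanaSR Classic will be removed in SLES 16.', [
    'Before migrating to SLES 16, replace it with SAPHanaSR-angi.',
])
exit_code = handler.end()
print(buf.getvalue(), end='')
```

Tips only warn, while problems flip the final verdict from PASS to FAIL, which matches the behavior visible in the sample output.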

@nicholasyang2022 nicholasyang2022 changed the title [crmsh] Dev: pre-migration: implement pre-migration checks for corosync 3 (jsc#PED-8252) [crmsh-4.6] Dev: pre-migration: implement pre-migration checks for corosync 3 (jsc#PED-8252) Dec 16, 2024

codecov bot commented Dec 16, 2024

Codecov Report

Attention: Patch coverage is 82.18182% with 49 lines in your changes missing coverage. Please review.

Project coverage is 67.00%. Comparing base (f5b1328) to head (252befd).

Files with missing lines   Patch %   Lines
crmsh/migration.py         82.37%    40 Missing ⚠️
crmsh/ui_cluster.py        75.67%    9 Missing ⚠️
Additional details and impacted files
Flag          Coverage Δ
integration   52.97% <82.18%> (+0.31%) ⬆️
unit          49.21% <28.72%> (-0.20%) ⬇️

Files with missing lines   Coverage Δ
crmsh/cibquery.py          100.00% <100.00%> (ø)
crmsh/sh.py                93.22% <ø> (ø)
crmsh/xmlutil.py           69.23% <100.00%> (+0.03%) ⬆️
crmsh/ui_cluster.py        74.21% <75.67%> (-0.65%) ⬇️
crmsh/migration.py         82.37% <82.37%> (ø)

... and 1 file with indirect coverage changes


@zzhou1 zzhou1 changed the title [crmsh-4.6] Dev: pre-migration: implement pre-migration checks for corosync 3 (jsc#PED-8252) [crmsh-4.6] Dev: pre-migration: implement pre-migration checks for corosync 3 (jsc#PED-11808) Dec 17, 2024
@nicholasyang2022 nicholasyang2022 force-pushed the ped-8252-20241204-4.6 branch 2 times, most recently from d9038c6 to a2518dd Compare December 24, 2024 05:34
@nicholasyang2022 nicholasyang2022 force-pushed the ped-8252-20241204-4.6 branch 7 times, most recently from a740b3f to 2bcad35 Compare December 31, 2024 09:28
@liangxin1300 liangxin1300 self-requested a review January 1, 2025 13:54
stonith:fence_vmware-rest
stonith:fence_wti
stonith:fence_xenapi
stonith:fence_zvm
Collaborator

@liangxin1300 liangxin1300 Jan 1, 2025

How has the above data been generated?
Can we automatically do that based on the locally installed resource-agents and fence-agents-* rpm?
Or, can we generate the data at the time crmsh using it (_load_supported_resource_agents)?

Collaborator Author

How has the above data been generated?

Extracted the file list from resource-agents/fence-agents.

Can we automatically do that based on the locally installed resource-agents and fence-agents-* rpm?

No, because we need to know what is supported by SLES 16 instead of SLES 15 SP6.
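
A minimal sketch of the kind of check discussed here, assuming the supported agents are shipped as a plain-text list with one agent per line. The function names and the sample data are hypothetical and only illustrate the idea; `_load_supported_resource_agents` in the PR may work differently.

```python
import io
import typing


def load_supported_agents(lines: typing.Iterable[str]) -> typing.Set[str]:
    """Parse a one-agent-per-line listing, skipping blank lines and comments."""
    return {
        line.strip()
        for line in lines
        if line.strip() and not line.lstrip().startswith('#')
    }


def find_unsupported(used: typing.Iterable[str], supported: typing.Set[str]) -> typing.List[str]:
    """Return the agents in use that are absent from the supported set, sorted."""
    return sorted(set(used) - supported)


# Hypothetical sample data standing in for the list shipped with crmsh.
SAMPLE_LIST = io.StringIO(
    "stonith:fence_sbd\n"
    "stonith:fence_zvm\n"
    "ocf:heartbeat:IPaddr2\n"
)
supported = load_supported_agents(SAMPLE_LIST)
unsupported = find_unsupported(
    ["stonith:external/sbd", "ocf:heartbeat:IPaddr2"], supported
)
print(unsupported)
```

Because the list describes the target release rather than the running system, it has to be bundled as data and cannot be derived from the locally installed packages.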

@nicholasyang2022 nicholasyang2022 force-pushed the ped-8252-20241204-4.6 branch 3 times, most recently from 2932b25 to 8cf8ef3 Compare January 2, 2025 00:15
@nicholasyang2022 nicholasyang2022 changed the title [crmsh-4.6] Dev: pre-migration: implement pre-migration checks for corosync 3 (jsc#PED-11808) [crmsh-4.6] Dev: pre-migration: implement pre-migration checks for corosync 3 and pacemaker 3 (jsc#PED-11808) Jan 2, 2025
@nicholasyang2022 nicholasyang2022 marked this pull request as ready for review January 2, 2025 00:57
@nicholasyang2022 nicholasyang2022 changed the title [crmsh-4.6] Dev: pre-migration: implement pre-migration checks for corosync 3 and pacemaker 3 (jsc#PED-11808) [crmsh-4.6] Dev: pre-migration: implement pre-migration checks for SLES 16 (jsc#PED-11808) Jan 2, 2025
@nicholasyang2022 nicholasyang2022 force-pushed the ped-8252-20241204-4.6 branch 4 times, most recently from 22ea8a0 to a338bc6 Compare January 2, 2025 08:57
@liangxin1300
Collaborator

liangxin1300 commented Jan 3, 2025

Suggestion:

1. Add a completer for crm cluster health. hawk2 can already be completed, so sles16 should be too.

The same point for #1422

2. Add success info

# crm cluster health sles16
------ localhost ------
[INFO] Checking dependency version...
[INFO] Checking service status...
[INFO] Checking used corosync features...
[WARN] Corosync transport "udpu" will be deprecated in corosync 3.
       After migrating to SLES 16, run "crm health sles16 --fix" to migrate it to transport "knet".
[INFO] Checking used resource agents...
[PASS]

How about changing the last line to

[PASS] All checks completed successfully.

Or

[INFO] All checks completed successfully.

I prefer the [INFO] one, since [PASS] is not a log level.
The same for #1422

3. Add failure error

# crm cluster health sles16
------ localhost ------
[INFO] Checking dependency version...
[INFO] Checking service status...
[FAIL] Cluster services are not running
       * corosync
       * pacemaker
[INFO] Checking used corosync features...
[WARN] Corosync transport "udpu" will be deprecated in corosync 3.
       After migrating to SLES 16, run "crm health sles16 --fix" to migrate it to transport "knet".
[INFO] Checking used resource agents...
[FAIL]

How about changing the last line similarly:

[ERROR] Some checks failed, please check

And change [FAIL] Cluster services are not running
to

[ERROR] Cluster services are not running

The same for #1422

4. Also write the above output to the /var/log/crmsh/crmsh.log log file

The same for #1422

5. Keep the color of node lines consistent

Would it be better to keep all node lines white (no color)?
Screenshot from 2025-01-03 14-14-14
Screenshot from 2025-01-03 14-13-58
The same for #1422

@liangxin1300
Collaborator

liangxin1300 commented Jan 3, 2025

Suggestion:

6. What if there are Failed Resource Actions?

For some reason, some RA failed. For now, crm cluster health sles16 seems to stay quiet about such an error.
I think we should hint that the user should address the failed RA first, and then do the migration.
The same for #1422

7. crm report collect pre-migration info

I think we also need to collect the output of crm cluster health sles16 in crm report.
How about implementing a collect_pre_migration function in report.collect.py and dumping the output into pre-migration.txt?

8. Inaccurate usage?

# crm cluster health sles15
usage: crm [-h] [-f] {hawk2,sles16}
crm: error: argument component: invalid choice: 'sles15' (choose from 'hawk2', 'sles16')

Should be

usage: crm cluster health {hawk2,sles16}

The same for #1422

9. Fence agents checking

[FAIL] The following fence agents will be removed in SLES 16.
       * stonith:external/sbd

Would it be better to state the alternative?

[ERROR] The following fence agents will be replaced
      - stonith:external/sbd
      + stonith:fence_sbd
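
The suggested output could be produced along these lines. This is a sketch only: `REPLACEMENTS` and `format_replacements` are hypothetical names, and the mapping contains just the one replacement named in this thread.

```python
import typing

# Only the replacement mentioned in this thread; other entries would be assumptions.
REPLACEMENTS: typing.Dict[str, str] = {
    "stonith:external/sbd": "stonith:fence_sbd",
}


def format_replacements(used: typing.Iterable[str]) -> typing.List[str]:
    """Render each agent as a '-' line, followed by a '+' line if a replacement is known."""
    lines = []
    for agent in sorted(used):
        lines.append(f"- {agent}")
        alternative = REPLACEMENTS.get(agent)
        if alternative is not None:
            lines.append(f"+ {alternative}")
    return lines


rendered = format_replacements(["stonith:external/sbd"])
print("\n".join(rendered))
```

Agents without a known replacement still get their `-` line, so nothing that needs attention is silently dropped.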

handler.log_info("Checking used corosync features...")
transport = 'udpu' if corosync.is_unicast() else 'udp'
handler.handle_tip(f'Corosync transport "{transport}" will be deprecated in corosync 3.', [
'After migrating to SLES 16, run "crm health sles16 --fix" to migrate it to transport "knet".',
Collaborator

Should be crm cluster health sles16 --fix

])
if corosync.get_value("totem.rrp_mode") in {'active', 'passive'}:
handler.handle_tip(f'Corosync RRP will be deprecated in corosync 3.', [
'After migrating to SLES 16, run "crm health sles16 --fix" to migrate it to knet multilink.',
Collaborator

Should be crm cluster health sles16 --fix

@@ -0,0 +1,25 @@
"""utilities for parsing CIB xml"""
Collaborator

Maybe it's better to move this code into xmlutil.py instead of creating a new py file?

Collaborator Author

No. xmlutil.py is too long, and its name does not reflect its content. Instead of general XML processing utilities, it is a mix of shell utils, CIB-specific routines, and even CLI handlers.

)


def has_primitive_filesystem_ocfs2(cib: lxml.etree.Element) -> bool:
Collaborator

Considering we might use this to detect gfs2, could you please make 'ocfs2' a parameter?
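
A parameterized variant might look like the sketch below. It assumes the usual CIB layout, where a Filesystem primitive stores `fstype` as an `nvpair` under `instance_attributes`, and it uses the standard-library `xml.etree.ElementTree` instead of `lxml` to stay self-contained. The function name and sample CIB are hypothetical.

```python
import xml.etree.ElementTree as ET


def has_primitive_filesystem_with_fstype(cib: ET.Element, fstype: str) -> bool:
    """Return True if any ocf:heartbeat:Filesystem primitive uses the given fstype."""
    primitives = cib.iterfind(
        ".//primitive[@class='ocf'][@provider='heartbeat'][@type='Filesystem']"
    )
    for primitive in primitives:
        for nvpair in primitive.iterfind("./instance_attributes/nvpair[@name='fstype']"):
            if nvpair.get("value") == fstype:
                return True
    return False


# Hypothetical, heavily trimmed CIB used only for this example.
SAMPLE_CIB = ET.fromstring("""
<cib><configuration><resources>
  <primitive id="fs1" class="ocf" provider="heartbeat" type="Filesystem">
    <instance_attributes id="fs1-ia">
      <nvpair id="fs1-fstype" name="fstype" value="ocfs2"/>
    </instance_attributes>
  </primitive>
</resources></configuration></cib>
""")

print(has_primitive_filesystem_with_fstype(SAMPLE_CIB, "ocfs2"))
print(has_primitive_filesystem_with_fstype(SAMPLE_CIB, "gfs2"))
```

With the file system type as an argument, the same query serves both the ocfs2 check and a possible future gfs2 check.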


def check_unsupported_resource_agents(handler: CheckResultHandler):
    handler.log_info("Checking used resource agents...")
    crm_mon = xmlutil.CrmMonXmlParser()
Collaborator

This is not used

Collaborator

How about moving this file under crmsh/crmsh/data, in case there are other txt files in the future?

Collaborator Author

I prefer not to do that. I think this would require adding a Python module crmsh.data, making our build scripts more complex.

sys.stdout.write(' Good to migrate to SLES 16.\n\n')


def check(args: typing.Sequence[str]) -> int:
Collaborator

It is better to add a docstring to:

  • explain what the --json and --local options are used for
  • introduce this function and return code

Collaborator Author

This is quite self-explanatory. Adding docstrings like "migration.check: run migration checks" or "--json: output the result in json format" does not make sense. And the return code is the common main()-style return code.

@liangxin1300
Collaborator

This usage is correct

# crm cluster health xxx
usage: crm cluster health [-h] [-f] {hawk2,sles16}
crm cluster health: error: argument component: invalid choice: 'xxx' (choose from 'hawk2', 'sles16')

But not this

# crm cluster health sles16 fs
usage: crm [-h] [--json [{oneline,pretty}]] [--local]
crm: error: unrecognized arguments: fs

@nicholasyang2022
Collaborator Author

The usage messages changed to:

suse@ha-1-1:~> sudo crm cluster health foo
usage: health [-h] [-f] {hawk2,sles16}
health: error: argument component: invalid choice: 'foo' (choose from 'hawk2', 'sles16')
suse@ha-1-1:~> sudo crm cluster health sles16 foo
usage: sles16 [-h] [--json [{oneline,pretty}]] [--local]
sles16: error: unrecognized arguments: foo
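
The prefix of these usage lines comes from the `prog` argument of `argparse.ArgumentParser`. A minimal sketch, assuming a sub-parser built separately for `sles16`; `build_parser` and the `const='pretty'` default are assumptions for illustration, not taken from the PR:

```python
import argparse


def build_parser(prog: str) -> argparse.ArgumentParser:
    """Build a parser whose usage and error messages are prefixed with `prog`."""
    parser = argparse.ArgumentParser(prog=prog)
    # const='pretty' is an assumed default for a bare --json.
    parser.add_argument('--json', nargs='?', const='pretty', choices=['oneline', 'pretty'])
    parser.add_argument('--local', action='store_true')
    return parser


parser = build_parser('crm cluster health sles16')
usage = parser.format_usage()
print(usage, end='')

parsed = parser.parse_args(['--json', '--local'])
```

Without an explicit `prog`, argparse derives the name from `sys.argv[0]`, which is where the misleading `usage: crm [...]` line came from.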

handler.handle_tip(f'Corosync transport "{transport}" will be deprecated in corosync 3.', [
'After migrating to SLES 16, run "crm cluster health sles16 --fix" to migrate it to transport "knet".',
])
if corosync.get_value("totem.rrp_mode") in {'active', 'passive'}:
Collaborator

@liangxin1300 liangxin1300 Jan 13, 2025

Although unlikely, rrp_mode can be set to none for a single ring.
Should the none value be included here?

Collaborator Author

none means SRP, which is supported and does not need any action.
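
The distinction can be written as a small predicate. This is a sketch with a hypothetical function name; it only restates the condition from the reviewed code:

```python
import typing


def needs_rrp_migration(rrp_mode: typing.Optional[str]) -> bool:
    """Only redundant-ring modes need migrating to knet multilink.

    'none' (or an unset value) means a single ring, so no action is required.
    """
    return rrp_mode in {'active', 'passive'}


for mode in ('active', 'passive', 'none', None):
    print(mode, needs_rrp_migration(mode))
```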

* `sles16`: check whether the cluster is good to migrate to SLES 16.

The optional `--fix` argument attempts to automatically resolve any detected
issues.
Collaborator

@liangxin1300 liangxin1300 Jan 13, 2025

Please add a note that, for sles16, --fix is only available on SLES 16.

if remote_ret > ret:
    ret = remote_ret
if not parsed_args.json:
    print('------ summary ------')
Collaborator

Better to change this line to

****** Summary ******

to make clear that this is not a node name

Collaborator

@liangxin1300 liangxin1300 left a comment

Looks good to me, great work!

Thanks!
