nvmeof Gateway fails to start up in brand new cluster #669
Comments
Some additional information that I forgot to include in the first message: the target version is 1.2.9, and the Ceph version is ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable). The setup is brand new, with no previous configuration in place and nothing unusual about the configuration. Any help will be greatly appreciated. Thank you very much in advance again. |
I faced similar issues with v1.2.x in 18.2.2. I had to move to v1.0.0 and it functions as expected. To set to pull v1.0.0
I faced similar challenges when I upgraded to 18.2.4 and am currently testing deployments. |
Right now the GW cannot work without a special build of ceph. The reason is that the GW depends on the new nvmeof paxos service which is a part of ceph/ceph#54671. |
These are the commands I followed to set up ceph-nvmeof v1.0.0 on 18.2.2 and 18.2.4. I'm currently looking into deploying the latest versions of ceph-nvmeof
|
Please use quay.io/ceph/nvmeof:1.2.16 |
and quay.io/ceph/nvmeof-cli:1.2.16 |
The listener add command changed. I will update the documentation upstream soon, but meanwhile this is the right command: |
With 19.1.0(rc) (upgraded from 18.2.4), I have been able to deploy ceph-nvmeof v1.2.16 and add
With cephadm deployed v18.2.4, when I tried deploying ceph-nvmeof v1.2.16, I encountered the following in service status, which showed it to be associating with v19.0.0
|
The
|
@Peratchi-Kannan what issues are you still having now? Were you able to add a namespace? |
@caroav I am not able to add namespace to the subsystem. When I try to add namespace, it throws an exception saying |
Hi @Peratchi-Kannan, I met the same issue when running the latest Ceph container image |
@Peratchi-Kannan see comment above from @xin3liang. We are planning to remove the "nvmeof gw monitor: disable by default" permanently from ceph. But this is pending on some cosmetic changes that we were asked to do. The changes are ongoing so I really hope that we could do that very soon. Meanwhile, you need to build as described in last comment. |
Hi @xin3liang , The nvmeof service does not start when using |
Hi @Peratchi-Kannan, I just verified the aarch64 image, not the x86 one. |
Hi @Peratchi-Kannan, you could try this Ceph image: |
Hi @xin3liang , I confirm nvmeof works as expected with Ceph image: Thanks |
Hello everyone, I am facing the same issue. I got nvmeof working under 18.2 but somehow it broke, so I decided to upgrade to 19.2, which was just released. I am using version 1.3.2 and I am stuck at adding a namespace (Exception: chosen ANA group is 0), basically the same as what @alarmed-ground has reported. I would like to stay with 19.2 without building it from source, and get nvmeof working, even without the HA functionality for now. Any ideas how to do that? |
Hello everyone, I am able to replicate @RobertLukan's situation. I am using |
Hello everyone, just a quick update. I was able to add a new namespace from the WebUI instead of using the command (v1.3.2). I had one namespace before the upgrade and created one after the upgrade. Interestingly, after adding the new namespace from the WebUI, I am unable to connect to either namespace from the client, but I am able to discover the subsystem. My nvme version on the client is |
As others have stated, things are not working with Reef 18.2.4. I've tried using the stock nvmeof 1.0.0 version, 1.2.16, and 1.2.17 and all fail to keep the gateway up and running. Since the 1.0.0 version is deprecated and not recommended for use, I won't provide details about that, but the run log is as follows:
|
Has anyone found a combination that works? I tried 1.2.17, 1.1, 1.3.1, and 1.3.2 without success. |
The nvmeof is not a part of the official ceph reef and squid branches. It was approved to be merged to main long after reef and squid were created. It will be a part of the next ceph upstream release. For now, anyone who needs nvmeof to work with reef or squid can build ceph from https://github.com/ceph/ceph-ci/tree/squid-nvmeof or https://github.com/ceph/ceph-ci/tree/reef-nvmeof. |
If this is the case, why do https://docs.ceph.com/en/reef/rbd/nvmeof-overview/ and https://docs.ceph.com/en/squid/rbd/nvmeof-overview/ exist? The official Ceph documentation suggests that NVMe-oF is working since version 18. |
According to ceph/ceph-nvmeof#669 (comment) "The nvmeof is not a part of the official ceph reef and squid branches." It should be removed from the documentation as currently the docs suggest that a working NVMe-oF gateway can be deployed easily with the orchestrator. Signed-off-by: Robert Sander <[email protected]>
I understand that the HA feature is not yet part of Ceph reef/squid, but I wonder why non-HA is not part of it. In particular, I managed to get it working with nvmeof version 1.0.0, but unfortunately the integration did not survive a reboot of the hosts. |
There is also the ability to deploy an NVME-oF gateway in the current Reef dashboard, so there is definitely a disconnect as to what is production ready. |
There is no non-HA mode. A single GW is also managed by the Ceph mon. I need to check the documentation, and we need to fix it if it is misleading. In any case, as I suggested, you can build Ceph from the branches I mentioned and get it working. |
@caroav, is there any word on if this will merge with a reef/squid update? I'm not familiar enough with how Ceph does feature and patching lifecycles. |
|
@caroav Which PRs need to be backported? The process would be that you backport the relevant PRs/commits and open PRs against squid/reef. |
I have tried to build these and they fail at Building CXX object src/librbd/CMakeFiles/rbd_api.dir/librbd.cc.o. Where can I go to troubleshoot this build issue? I can provide more details if needed, but I don't want to use this thread for that if I should be using some other resource. Here is one line of the error |
I am trying to set up ceph-nvmeof 1.2.9 on Reef. This is a fresh cluster, installed a few hours ago with cephadm and deployed as per the documentation. nvmeof fails to come up. The log messages I see are:
May 23 12:55:14 ceph2 bash[119529]: 2024-05-23T12:55:14.333+0000 70f605860640 0 nvmeofgw void NVMeofGwMonitorClient::tick()
May 23 12:55:14 ceph2 bash[119529]: 2024-05-23T12:55:14.333+0000 70f605860640 0 nvmeofgw bool get_gw_state(const char*, const std::map<std::pair<std::__cxx11::basic_string, std::__cxx11::basic_string >, std::map<std::__cxx11::basic_string, NvmeGwState> >&, const NvmeGroupKey&, const NvmeGwId&, NvmeGwState&) can not find group (nvme,None) old map map: {}
May 23 12:55:14 ceph2 bash[119529]: 2024-05-23T12:55:14.333+0000 70f605860640 0 nvmeofgw void NVMeofGwMonitorClient::send_beacon() sending beacon as gid 24694 availability 0 osdmap_epoch 0 gwmap_epoch 0
May 23 12:55:14 ceph2 bash[76745]: debug 2024-05-23T12:55:14.333+0000 785f205e5700 0 can't decode unknown message type 2049 MSG_AUTH=17
May 23 12:55:14 ceph2 bash[119529]: 2024-05-23T12:55:14.333+0000 70f609868640 0 client.0 ms_handle_reset on v2:10.4.3.11:3300/0
May 23 12:55:14 ceph2 bash[119529]: 2024-05-23T12:55:14.333+0000 70f609868640 0 client.0 ms_handle_reset on v2:10.4.3.11:3300/0
May 23 12:55:14 ceph2 bash[119529]: 2024-05-23T12:55:14.337+0000 70f609868640 0 nvmeofgw virtual bool NVMeofGwMonitorClient::ms_dispatch2(ceph::ref_t&) got map type 4
May 23 12:55:14 ceph2 bash[119529]: 2024-05-23T12:55:14.337+0000 70f609868640 0 ms_deliver_dispatch: unhandled message 0x5e584cc24820 mon_map magic: 0 from mon.1 v2:10.4.3.11:3300/0
Another message is
May 23 12:57:26 ceph1 bash[146371]: 1: [v2:10.4.3.11:3300/0,v1:10.4.3.11:6789/0] mon.ceph2
May 23 12:57:26 ceph1 bash[146371]: 2: [v2:10.4.3.12:3300/0,v1:10.4.3.12:6789/0] mon.ceph3
May 23 12:57:26 ceph1 bash[146371]: -12> 2024-05-23T12:57:24.746+0000 73e68f1de640 0 nvmeofgw virtual bool NVMeofGwMonitorClient::ms_dispatch2(ceph::ref_t&) got map type 4
May 23 12:57:26 ceph1 bash[146371]: -11> 2024-05-23T12:57:24.746+0000 73e68f1de640 0 ms_deliver_dispatch: unhandled message 0x5757d2e9d380 mon_map magic: 0 from mon.0 v2:10.4.3.10:3300/0
May 23 12:57:26 ceph1 bash[146371]: -10> 2024-05-23T12:57:24.746+0000 73e68f1de640 10 monclient: handle_config config(2 keys)
May 23 12:57:26 ceph1 bash[146371]: -9> 2024-05-23T12:57:24.746+0000 73e68d9db640 4 set_mon_vals callback ignored cluster_network
May 23 12:57:26 ceph1 bash[146371]: -8> 2024-05-23T12:57:24.746+0000 73e68d9db640 4 set_mon_vals callback ignored container_image
May 23 12:57:26 ceph1 bash[146371]: -7> 2024-05-23T12:57:24.746+0000 73e68d9db640 4 nvmeofgw NVMeofGwMonitorClient::init()::<lambda()> nvmeof monc config notify callback
May 23 12:57:26 ceph1 bash[146371]: -6> 2024-05-23T12:57:25.654+0000 73e68d1da640 10 monclient: tick
May 23 12:57:26 ceph1 bash[146371]: -5> 2024-05-23T12:57:25.654+0000 73e68d1da640 10 monclient: _check_auth_tickets
May 23 12:57:26 ceph1 bash[146371]: -4> 2024-05-23T12:57:26.654+0000 73e68d1da640 10 monclient: tick
May 23 12:57:26 ceph1 bash[146371]: -3> 2024-05-23T12:57:26.654+0000 73e68d1da640 10 monclient: _check_auth_tickets
May 23 12:57:26 ceph1 bash[146371]: -2> 2024-05-23T12:57:26.742+0000 73e68b1d6640 0 nvmeofgw void NVMeofGwMonitorClient::tick()
May 23 12:57:26 ceph1 bash[146371]: -1> 2024-05-23T12:57:26.742+0000 73e68b1d6640 4 nvmeofgw void NVMeofGwMonitorClient::disconnect_panic() Triggering a panic upon disconnection from the monitor, elapsed 102, configured disconnect panic duration 100
May 23 12:57:26 ceph1 bash[146371]: 0> 2024-05-23T12:57:26.746+0000 73e68b1d6640 -1 *** Caught signal (Aborted) **
May 23 12:57:26 ceph1 bash[146371]: in thread 73e68b1d6640 thread_name:safe_timer
The cluster has a cluster network configured, and I saw some messages saying that the option cannot be changed at runtime. I did add it to ceph.conf for the target, though, so that should be fine. Any help will be greatly appreciated. Thank you in advance.
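For anyone triaging the same failure, here is a minimal Python sketch that scans journal output for the two signatures visible in the logs above: the "can't decode unknown message type" line and the `disconnect_panic()` line with its elapsed and configured durations. The regular expressions are taken from the log text in this issue; the function name `scan` and the labels `unknown_message` and `disconnect_panic` are made up for illustration. Reading the first signature as "the running monitor does not know the gateway's message type" is an assumption, though it is consistent with the maintainer's comment that the GW depends on the new nvmeof paxos service.

```python
import re

# Patterns copied from the log excerpts in this issue.
UNKNOWN_MSG = re.compile(r"can't decode unknown message type (\d+)")
PANIC = re.compile(
    r"disconnect_panic\(\).*elapsed (\d+), "
    r"configured disconnect panic duration (\d+)"
)


def scan(lines):
    """Return a list of findings, one tuple per matching log line.

    The tuple labels are hypothetical names chosen for this sketch.
    """
    findings = []
    for line in lines:
        m = UNKNOWN_MSG.search(line)
        if m:
            # Some process could not decode a message type it received.
            findings.append(("unknown_message", int(m.group(1))))
            continue
        m = PANIC.search(line)
        if m:
            # The gateway aborted after losing contact with the monitor.
            findings.append(
                ("disconnect_panic", int(m.group(1)), int(m.group(2)))
            )
    return findings


# Two lines taken from the logs in this issue (timestamps shortened).
SAMPLE = [
    "ceph2 bash[76745]: debug 0 can't decode unknown message type 2049 MSG_AUTH=17",
    "ceph1 bash[146371]: 4 nvmeofgw void NVMeofGwMonitorClient::disconnect_panic() "
    "Triggering a panic upon disconnection from the monitor, elapsed 102, "
    "configured disconnect panic duration 100",
]

if __name__ == "__main__":
    for finding in scan(SAMPLE):
        print(finding)
```

If both signatures show up together, as they do in the logs above, the gateway's beacons are probably never accepted, and the abort after the configured disconnect duration is a consequence rather than a separate problem.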