Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Hanging when using string as tag #1049

Open
cctry opened this issue Jun 18, 2024 · 6 comments
Open

[BUG] Hanging when using string as tag #1049

cctry opened this issue Jun 18, 2024 · 6 comments

Comments

@cctry
Copy link

cctry commented Jun 18, 2024

Hello,

My program hangs when using string as the tag.

Client code

flag = 0
async def main():
    global flag
    ep = await ucp.create_endpoint(host, port)
    arr = torch.zeros((n_bytes, ) , dtype=torch.float32, device='cuda')
    print("Send Original torch tensor")
    await ep.send(arr, tag=str(flag))  # send the real message
    flag += 1

    print("Receive Incremented torch tensor")
    resp = torch.empty_like(arr)
    await ep.recv(resp, tag=str(flag))  # receive the echo
    await ep.close()
    assert torch.allclose(resp, arr + 1)

Server code

flag = 0
async def send(ep: ucp.Endpoint):
    global flag
    arr = torch.empty((n_bytes, ), dtype=dtype, device="cuda")
    await ep.recv(arr, tag=str(flag))
    assert torch.count_nonzero(arr).item() == 0
    print("Received torch tensor")
    flag += 1

    arr += 1
    print("Sending incremented torch tensor")
    await ep.send(arr, tag=str(flag))
    await ep.close()
    lf.close()

Replacing the above str(flag) with flag resolves the hanging.

The debug log of client indicates that it tries to resend the message

Send Original torch tensor
[1718732246.687108] [n121-014-226:3530201:0]           mpool.c:281  UCX  DEBUG mpool ud_tx_skb: allocated chunk 0x7fc112000018 of 6291432 bytes with 1489 elements
[1718732247.208728] [n121-014-226:3530201:a]           ud_ep.c:93   UCX  DEBUG ep: 0x70e97a0 ca drop@cwnd = 2 in flight: 1
[1718732247.208755] [n121-014-226:3530201:a]           ud_ep.c:1427 UCX  DEBUG ep(0x70e97a0): resending rt_psn 1 rt_max_psn 1 acked_psn 0 max_psn 2 ack_req 1
[1718732247.208762] [n121-014-226:3530201:a]           ud_ep.c:1433 UCX  DEBUG ep(0x70e97a0): resending completed
[1718732266.929299] [n121-014-226:3530201:a]           ud_ep.c:93   UCX  DEBUG ep: 0x70e97a0 ca drop@cwnd = 3 in flight: 1
[1718732266.929326] [n121-014-226:3530201:a]           ud_ep.c:1427 UCX  DEBUG ep(0x70e97a0): resending rt_psn 2 rt_max_psn 2 acked_psn 1 max_psn 3 ack_req 1
[1718732266.929338] [n121-014-226:3530201:a]           ud_ep.c:1433 UCX  DEBUG ep(0x70e97a0): resending completed
[1718732287.450339] [n121-014-226:3530201:a]           ud_ep.c:93   UCX  DEBUG ep: 0x70e97a0 ca drop@cwnd = 3 in flight: 1
[1718732287.450370] [n121-014-226:3530201:a]           ud_ep.c:1427 UCX  DEBUG ep(0x70e97a0): resending rt_psn 3 rt_max_psn 3 acked_psn 2 max_psn 4 ack_req 1
[1718732287.450398] [n121-014-226:3530201:a]           ud_ep.c:1433 UCX  DEBUG ep(0x70e97a0): resending completed

The server waits after allocating

[1718732226.689047] [n123-017-156:3656205:0]           ud_ep.c:406  UCX  DEBUG   created ep ep=0x5623d098d000 iface=0x5623d09e1460 id=0
[1718732226.689052] [n123-017-156:3656205:0]       wireup_ep.c:483  UCX  DEBUG   ep 0x7f0083f40000: wireup_ep 0x5623d0f50930 created next_ep 0x5623d098d000 to <no debug data> using ud_mlx5/mlx5_5:1
[1718732226.689091] [n123-017-156:3656205:0]           async.c:231  UCX  DEBUG   added async handler 0x5623d0f50c30 [id=1000035 ref 1] ???() to hash
[1718732226.689106] [n123-017-156:3656205:0]           ud_ep.c:691  UCX  DEBUG   mlx5_5:1/RoCE slid 0 qpn 0x11186 epid 0 connected to ::ffff:192.168.6.226pkey 0xffff  qpn 0x10fff epid 0
[1718732226.689113] [n123-017-156:3656205:0]        ib_iface.c:796  UCX  DEBUG   iface 0x5623d09e1460: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.168.6.226 sgid_index=3 traffic_class=106
[1718732226.690859] [n123-017-156:3656205:0]           mpool.c:281  UCX  DEBUG mpool ud_recv_skb: allocated chunk 0x7efba1000018 of 20971496 bytes with 4964 elements
[1718732226.690904] [n123-017-156:3656205:0]       wireup_ep.c:415  UCX  DEBUG ep 0x7f0083f40000: destroy wireup ep 0x5623d0f89980
[1718732226.690908] [n123-017-156:3656205:0]       wireup_ep.c:415  UCX  DEBUG ep 0x7f0083f40000: destroy wireup ep 0x5623d0681c10
[1718732226.690913] [n123-017-156:3656205:0]       wireup_ep.c:415  UCX  DEBUG ep 0x7f0083f40000: destroy wireup ep 0x5623d0f50930
[1718732226.864187] [n123-017-156:3656205:0]           mpool.c:281  UCX  DEBUG mpool ucp_am_bufs: allocated chunk 0x5623d1e13bf4 of 24660 bytes with 128 elements
[1718732246.685229] [n123-017-156:3656205:0]           mpool.c:281  UCX  DEBUG mpool ud_tx_skb: allocated chunk 0x7f0072c00018 of 6291432 bytes with 1489 elements
@cctry
Copy link
Author

cctry commented Jun 18, 2024

I found the reason is that the hash() function in Python3 is not deterministic for types other than integers. I can submit a PR for it.

@pentschev
Copy link
Member

Hi @cctry , I'm glad you were able to figure out the issue. Yes, we've come across this in the past as well, I can't find where this was discussed previously though. PRs are always welcome, what do you have in mind though, clarifying the documentation or something else? Unfortunately, I don't think we can do much to "resolve" this issue given it's a Python feature, the user will be required to guarantee the value is hashable and deterministic.

It seems that you're also just starting with UCX-Py, I would recommend you have a look at UCXX which is a rewrite of UCX-Py in C++ but provides the same functionality as UCX-Py (in some cases, minor changes are required). We will later in the year deprecate and archive UCX-Py in favor of UCXX.

@cctry
Copy link
Author

cctry commented Jun 19, 2024

Thank you for letting me know UCXX!

I think we can use other hash functions like md5 to replace hash(). Python salts it to reduce the predicability which is exactly what we need in the tagged messages.

@pentschev
Copy link
Member

Can you explain why you're saying reduced predictability is exactly what we need? I'd argue the exact opposite, the tag must be deterministic and thus we don't want to reduce predictability. Tags are necessary to match the message on the receiving end, so it must be exactly the same as the sender (or must match a mask, but that is not currently exposed in UCX-Py's async API).

Depending on your needs I would also suggest looking at active message, in which case communication is entirely tracked endpoint<->endpoint and there's no need for tags. See am_send/am_recv and this test for an example on how to use it.

@wence-
Copy link
Contributor

wence- commented Jun 19, 2024

Can you explain why you're saying reduced predictability is exactly what we need?

I think ambiguity of english :).

"Python salts it to reduce the predictability which is exactly what we need in the tagged message."

Here the clarifying subclause (starting with "which") can be read ambiguously (Strunk and White have an example for effectively exactly this): does it relate to "Python salts it", or does it relate to "predictability". I believe @cctry means "Python salts it to reduce predictability, and predictability is exactly what we need", whereas @pentschev has read this as "Python salts it to reduce predictability, and reduced predictability is exactly what we need".

@wence-
Copy link
Contributor

wence- commented Jun 19, 2024

But, I would just send integers as the message tag, and perhaps we should clarify in the documentation that that is what one should do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants