TCP [RST] intermittently ignored #25314

Closed
samoconnor opened this issue Dec 29, 2017 · 5 comments
Labels
waiting for author Anybody home?

Comments

@samoconnor
Contributor

Most of the time, when a TCPSocket receives a [RST] packet, libuv calls uv_readcb() and a UVError (ECONNRESET) is thrown.
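
For comparison, here is a minimal Python sketch of the expected behaviour at the plain socket level. It is not Julia or libuv code, and it assumes a POSIX system and a loopback connection. The names `reset_peer` and `read_after_reset` are hypothetical. The peer forces a [RST] by closing with `SO_LINGER` set to a zero timeout, and the reader's pending `recv` fails with `ECONNRESET` instead of blocking.

```python
import errno
import socket
import struct
import threading


def reset_peer(listener):
    """Accept one connection and abort it so the kernel sends RST, not FIN."""
    conn, _ = listener.accept()
    # SO_LINGER enabled with a zero timeout makes close() reset the connection.
    # The "ii" struct layout is an assumption that holds on Linux and macOS.
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    conn.close()


def read_after_reset():
    """Return the errno seen by a reader whose peer resets the connection."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    peer = threading.Thread(target=reset_peer, args=(listener,))
    peer.start()
    client = socket.create_connection(listener.getsockname())
    client.settimeout(5.0)  # guard so the sketch cannot hang if no RST arrives
    try:
        client.recv(1)
        return None  # no error: the reset was not observed
    except OSError as exc:
        return exc.errno
    finally:
        client.close()
        peer.join()
        listener.close()


if __name__ == "__main__":
    print(read_after_reset() == errno.ECONNRESET)
```

If the reset were missed in the way this issue describes, the `recv` call would end with the 5 second timeout rather than `ECONNRESET`.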

I have a test case where hundreds of pipelined HTTP PUT Requests are sent to AWS S3. Typically the Requests get ahead of the Responses (e.g. when Request No. 70 is being sent, we may only be up to reading Response No. 10).
At some point the S3 server hits an internal limit on the number of Requests per connection (about 100) and stops sending Response data (e.g. we might send Request No. 120 and then, while we're reading Response No. 30, data stops arriving). Sometimes the server sends [RST] right away and a UVError, ECONNRESET is thrown as expected. Note: the S3 doc suggests not to send more than 90 requests per connection. I'm sending more than that as a way to test corner case behaviour in HTTP.jl.

However, monitoring with Wireshark shows that sometimes the [RST] is not sent for a few minutes. It seems that in this case libuv does not notice the [RST], and uv_readcb is not called. The result is that the eof() call that the reader is waiting on blocks forever. I have a separate task that periodically prints connection debug info. This shows that the LibuvStream.state remains StatusActive.

I have tried putting lots of printfs in libuv. What I see is that the uv__stream_io function is not called at all in the case where the [RST] is missed. Maybe there is a race-condition inside libuv where the [RST] is missed if kevent is not active when it arrives? Maybe for some reason libuv forgets to submit the socket to kevent, or does not indicate interest in the correct event type? (I'm not familiar with kqueue).

I have tried modifying wait_readnb so that it wakes up and does uv_read_start again every so often while waiting. This makes no difference.

As a practical solution for HTTP.jl I've implemented a Retry Layer that uses a separate task to close stuck connections. Calling close results in the blocked eof() task waking up, discovering the connection is gone, and retrying the Request.
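
The watchdog idea can be sketched in Python's asyncio as a rough analogy. This is not HTTP.jl's Retry Layer. The helper `read_with_watchdog`, the `idle_timeout` parameter, and the rule that "stuck" means "no data within a fixed time" are all assumptions made for illustration. A second task closes the connection, which wakes the blocked read with end-of-file.

```python
import asyncio


async def read_with_watchdog(host, port, idle_timeout):
    """Read one byte, letting a watchdog task close the connection if it stalls.

    Hypothetical helper for illustration only.
    """
    reader, writer = await asyncio.open_connection(host, port)

    async def watchdog():
        # Assumption: "stuck" simply means no data within idle_timeout seconds.
        await asyncio.sleep(idle_timeout)
        writer.close()  # closing the transport feeds EOF to the pending read

    dog = asyncio.create_task(watchdog())
    try:
        # Without the watchdog this would wait forever on a silent peer.
        return await reader.read(1)
    finally:
        dog.cancel()
        writer.close()


async def demo():
    async def silent_peer(reader, writer):
        # Accept the connection, never reply, and hang up when the client does.
        await reader.read()
        writer.close()

    server = await asyncio.start_server(silent_peer, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    try:
        return await read_with_watchdog("127.0.0.1", port, idle_timeout=0.2)
    finally:
        server.close()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

An empty result tells the caller that the connection was closed locally, at which point the request can be retried on a fresh connection.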

Version 0.7.0-DEV.3090 (2017-12-18 19:26 UTC)
Commit 5abe9b1382* (10 days old master)
x86_64-apple-darwin14.5.0
@samoconnor
Contributor Author

This issue (and this one: #14747) makes me wonder if libuv is the best way to implement socket IO in Julia.

Microsoft now has their own implementation of epoll, AF_INET, AF_UNIX and AF_NETLINK. I believe that this is not directly accessible from a Windows .exe or .dll, but perhaps there is some way around this.

Perhaps it would be better for Julia's network IO layer to be built on BSD sockets + epoll/kevent and use something like WSL to provide compatibility with Windows.

It is frustrating to spend time figuring out what libuv is doing when debugging Julia IO stuff. The libuv documentation is thin and often says "See the Linux man page for more". It often feels like it would be easier to work directly with the well-defined BSD/Linux APIs that I know.

Moving the main event loop from libuv to Julia might also help with other event related stuff: #22631 #13763.

@vtjnash
Member

vtjnash commented Feb 8, 2018

Microsoft now has their own implementation of epoll, AF_INET, AF_UNIX and AF_NETLINK. I believe that this is not directly accessible from a Windows .exe or .dll, but perhaps there is some way around this.

They've had it for years; it's used by libuv. WSL is entirely tangential to this; what matters is whether the underlying driver being used is WSK. The old API (which did not support epoll) was deprecated in Windows 7, although I know of at least one corporate firewall that tries to prevent user programs from accessing the new subsystem (as of a couple of years ago when I last checked).

@samoconnor
Contributor Author

Does WSK implement epoll, or do you mean that IoCompletionPort is similar to epoll?

@vtjnash
Member

vtjnash commented Feb 8, 2018

Usually it just uses IOCP, since presumably that's faster. But it looks like the original PoC repo has even been getting new updates recently: https://github.com/piscisaureus/wepoll

@vtjnash
Member

vtjnash commented Feb 5, 2021

Can you provide any updated information here?

@brenhinkeller added the "waiting for author" label on Nov 21, 2022
@vtjnash closed this as not planned on Feb 10, 2024

3 participants