feat: add active peer probing and a cached addr book #90
This adds a metric for evaluating all addr lookups: someguy_cached_router_peer_addr_lookups{cache="unused|hit|miss",origin="providers|peers"}. I've also wired up FindPeers for completeness.
Made a first pass and dropped some suggestions inline. I also pushed a new metric (details inline).
As for Open questions, my thinking is:
- The only peers we probe and cache are peers that we've successfully run identify with. Peers returned without multiaddrs from FindProviders for which we have no cached multiaddrs remain unresolved.
- Should we try to call FindPeer inside the iterator so they can be resolved? This can block the streaming of providers in the iterator.
Indeed, someguy_cached_router_peer_addr_lookups shows that we hit a cache miss quite often (the router returned 0 addrs and the cache does not have them either).
It was a bit difficult to reason about this without real-world input, so I piped root CIDs hitting our staging environment through the endpoint to populate the metric:
- with CID duplicates:
ssh [email protected] tail -F /var/log/nginx/access-json.log | stdbuf -o0 jq -r .request_uri | awk -F'[/&?]' '{print $3; fflush()}' | xargs -P 100 -I {cid} curl -s -o /dev/null -w "%{http_code} %{url.path}\n" "http://127.0.0.1:8190/routing/v1/providers/{cid}"
- only unique CIDs:
ssh [email protected] tail -F /var/log/nginx/access-json.log | stdbuf -o0 jq -r .request_uri | awk -F'[/&?]' '!seen[$3]++ {print $3; fflush()}' | xargs -P 100 -I {cid} curl -s -o /dev/null -w "%{http_code} %{url.path}\n" "http://127.0.0.1:8190/routing/v1/providers/{cid}"
A few minutes later http://127.0.0.1:8190/debug/metrics/prometheus shows:
# HELP someguy_cached_router_peer_addr_lookups Number of peer addr info lookups per origin and cache state
# TYPE someguy_cached_router_peer_addr_lookups counter
someguy_cached_router_peer_addr_lookups{cache="hit",origin="providers"} 1323
someguy_cached_router_peer_addr_lookups{cache="miss",origin="providers"} 6574
someguy_cached_router_peer_addr_lookups{cache="unused",origin="providers"} 7686
So yes, finding a way of decreasing miss feels useful, given how high it is.
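To put a number on "how high", the counter values quoted above can be turned into shares. The sketch below is only arithmetic on those three values; the type and method names are hypothetical, and it assumes that "unused" means the router already returned addrs so the cache was never consulted. On these numbers, misses come out at roughly 42% of all lookups and roughly 83% of the lookups that actually consulted the cache.

```go
package main

import "fmt"

// lookupCounts holds the three counter values for one origin label,
// as read from the Prometheus endpoint. Hypothetical helper type.
type lookupCounts struct {
	Hit, Miss, Unused uint64
}

// missShareOfAll returns misses as a fraction of every lookup.
func (c lookupCounts) missShareOfAll() float64 {
	total := c.Hit + c.Miss + c.Unused
	if total == 0 {
		return 0
	}
	return float64(c.Miss) / float64(total)
}

// missShareOfCacheLookups returns misses as a fraction of the lookups
// that consulted the cache (hit + miss). Assumes "unused" lookups
// never touched the cache.
func (c lookupCounts) missShareOfCacheLookups() float64 {
	consulted := c.Hit + c.Miss
	if consulted == 0 {
		return 0
	}
	return float64(c.Miss) / float64(consulted)
}

func main() {
	// Values copied from the metrics output quoted in this comment.
	providers := lookupCounts{Hit: 1323, Miss: 6574, Unused: 7686}
	fmt.Printf("miss / all lookups:   %.1f%%\n", 100*providers.missShareOfAll())
	fmt.Printf("miss / cache lookups: %.1f%%\n", 100*providers.missShareOfCacheLookups())
}
```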
Two ideas:
- Lazy/easy: avoid blocking the iterator by adding peers with cache misses to a queue and processing them asynchronously at some safe rate, populating the cache on a best-effort basis. This may not help the first query, but subsequent ones will see an increasing cache hit rate over time.
- Implement a custom iterator: if a peer is a cache miss, we don't return it but silently move to the next item and put the current one on a side queue, which is processed asynchronously by calling FindPeer. Once the iterator reaches the last item, we go back to the items on the side queue. This way we don't slow down results that have addrs, and we can wait and stream the rest at the end without impacting the performance of the fast ones.
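A rough sketch of the second idea, to make the ordering concrete. Everything here is a simplified stand-in rather than the real someguy or go-libp2p types: `provider`, `findPeerFunc`, and `streamDeferred` are hypothetical names, the source is a slice instead of a streaming iterator, and there is no rate limiting on the background lookups.

```go
package main

import (
	"fmt"
	"sync"
)

// provider is a hypothetical, simplified routing result.
type provider struct {
	ID    string
	Addrs []string
}

// findPeerFunc is a hypothetical stand-in for a FindPeer call.
// It returns nil when the peer cannot be resolved.
type findPeerFunc func(id string) []string

// streamDeferred emits providers that already have addrs (from the
// router or the cache) right away. Cache misses are resolved in the
// background and emitted only after the source is exhausted, so slow
// lookups never delay the fast results.
func streamDeferred(source []provider, cache map[string][]string, findPeer findPeerFunc, emit func(provider)) {
	resolved := make(chan provider)
	var wg sync.WaitGroup

	for _, p := range source {
		if len(p.Addrs) == 0 {
			p.Addrs = cache[p.ID] // cache hit fills in the addrs
		}
		if len(p.Addrs) > 0 {
			emit(p)
			continue
		}
		// Cache miss: push to the "side queue" and keep streaming.
		wg.Add(1)
		go func(p provider) {
			defer wg.Done()
			if addrs := findPeer(p.ID); len(addrs) > 0 {
				p.Addrs = addrs
				resolved <- p // blocks until the main loop is done
			}
		}(p)
	}

	// Close the side queue once every background lookup has finished.
	go func() {
		wg.Wait()
		close(resolved)
	}()
	for p := range resolved {
		emit(p)
	}
}

func main() {
	source := []provider{
		{ID: "peerA"}, // no addrs anywhere: resolved late via findPeer
		{ID: "peerB", Addrs: []string{"/ip4/192.0.2.1/tcp/4001"}},
	}
	findPeer := func(id string) []string { return []string{"/ip4/192.0.2.2/tcp/4001"} }
	streamDeferred(source, nil, findPeer, func(p provider) {
		fmt.Println(p.ID, p.Addrs)
	})
}
```

Peers that still have no addrs after the lookup are dropped here; whether to return them without addrs instead is a separate decision.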
- Should we probe the last connected addr or all addresses we have for a Peer?
See comment inline; iiuc host.Connect effectively probes all known addrs until one succeeds.
Probably good enough for now. If we need per-addr resolution, we may need to ask go-libp2p for a new API.
Note that vole libp2p identify <multiaddr> connects to a specific multiaddr because it does not run routing and spawns a new libp2p host every time.
- When should we augment results with cached addresses? Currently, it's done only when there are no results in the FindProviders from kad-dht. The presumption there is that if the results from FindProviders have multiaddrs for a peer, they are up to date.
I think the current approach of hitting the cache only if regular routing returns no addrs is sensible.
It also makes it easier to reason about metrics like someguy_cached_router_peer_addr_lookups{origin,cache}
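For reference, the decision described above maps onto the three cache label values roughly as follows. This is a minimal sketch with hypothetical names (`withCachedAddrs`, `cacheState`), not the code in this PR, and it assumes "unused" is the label recorded when the router already supplied addrs.

```go
package main

import "fmt"

// cacheState mirrors the values of the "cache" metric label.
type cacheState string

const (
	cacheUnused cacheState = "unused" // router returned addrs, cache not consulted
	cacheHit    cacheState = "hit"    // router returned none, cache had some
	cacheMiss   cacheState = "miss"   // neither router nor cache had addrs
)

// withCachedAddrs returns the addrs to hand back for a peer together
// with the label value to record for the lookup.
func withCachedAddrs(routerAddrs, cached []string) ([]string, cacheState) {
	if len(routerAddrs) > 0 {
		return routerAddrs, cacheUnused
	}
	if len(cached) > 0 {
		return cached, cacheHit
	}
	return nil, cacheMiss
}

func main() {
	addrs, state := withCachedAddrs(nil, []string{"/ip4/192.0.2.1/tcp/4001"})
	fmt.Println(state, addrs)
}
```

Because each lookup ends in exactly one of the three states, the counter values per origin add up to the total number of lookups.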
- How do we prevent excessive memory consumption by the cached address book? The memory address book already has built-in limits and cleanup. However, the peers map doesn't. Temp solution: I've added some instrumentation for this.
Cap at TTL of 48h?
Co-authored-by: Marcin Rataj <[email protected]>
I think this is mostly ready to ship / deploy early next week, but we need to address things marked with
- proactively remove peers that failed probing beyond max probing backoff
- skip peers that have no addrs and skipped probing
details and other smaller nits inline
(apologies if I misunderstood something, did this late Thursday, but wanted to give feedback before Friday)
I made a couple of additions to your suggestion:
When the max backoff duration is reached and a connection attempt fails, we clear the cached addresses and state. Since this state is useful for preventing unnecessary attempts to dispatch a find peer, we should keep it for as long as a provider record is valid.
No need to close the channel, just the source iterator.
Thank you @2color, lgtm. FYI, I added a metric for tracking offline/online probe results (increase rate over time) – c1ac41b
Feel free to merge and release and deploy to https://delegated-ipfs.dev/ 🙏
We will see if any follow-up is needed once we observe metrics in prod (fine to pipe gateway cids to smoke-test there too).
Note for self / future reference:
- The way the cache works in this PR is that we require at least one successful libp2p identify lookup for a peer to be accepted to the cache.
- Once in cache, we periodically probe to confirm if peer is online/offline, and refresh addrs
- If a cached peer goes offline, we note it in the cache and stop returning it as one of the providers, but we keep probing cached peers (incl. offline ones) with backoff in case they come back online. If they don't, we drop them once they have been offline longer than the Amino DHT expiration window.
- Basing cache add on libp2p identify success means the cache does not include peerids that were returned by (dht?) router without addrs but "always failed to identify"
- I think it is ok to leave this as-is, if we ever need to decrease cost related to those, we can start caching these dead peers somewhere in "dispatched peer lookup" code to ensure they fall under backoff logic as well.
We only cache failure count and last failure since we don't have any addresses to cache for such peers.
We already do. When we dispatch a peer lookup, we call someguy/server_cached_router.go Line 208 in 48e1943
Which records failed connections: someguy/server_cached_router.go Line 77 in 48e1943
Which ensures that the backoff logic applies to those peers too (ones that never successfully identified) for up to MaxBackoffDuration (48 hours) of failures. Only if a PeerID continues to be unreachable for 48 hours will it be removed from both the address book and the cache.
Another follow-up improvement we could look into is checking the last failed connection in the As it currently stands, even if probing a peer fails, the cached addresses will still be returned, at least until
What
This is an attempt to fix #16 by implementing #53.
Also fixes #25
How
memoryAddrBook which
New magic numbers
We have to start with some default. This PR introduces some magic numbers which will likely change as we get some operational data:
someguy/server_addr_book.go
Lines 21 to 37 in 19b15aa
Open questions
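- The only peers we probe and cache are peers that we've successfully run identify with. Peers returned without multiaddrs from FindProviders for which we have no cached multiaddrs remain unresolved.
- Should we try to call FindPeer inside the iterator so they can be resolved? This can block the streaming of providers in the iterator.
- Should we probe the last connected addr or all addresses we have for a Peer?
- When should we augment results with cached addresses? Currently, it's done only when there are no results in the FindProviders from kad-dht. The presumption there is that if the results from FindProviders have multiaddrs for a peer, they are up to date.
- How do we prevent excessive memory consumption by the cached address book? The memory address book already has built-in limits and cleanup. However, the peers map doesn't. Temp solution: I've added some instrumentation for this.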