
feat: add active peer probing and a cached addr book #90

Merged · 80 commits · Dec 18, 2024
Conversation

@2color (Member) commented Nov 27, 2024

What

This is an attempt to fix #16 by implementing #53.

Also fixes #25

How

  • New Cached Address Book
  • New Cached Router that enriches results with cached addresses when records have no addresses
    • Implements a custom iterator for FindProviders that looks up the cache and returns results with addrs on a cache HIT; on a cache miss it dispatches a FindPeer and returns the result to the user once it comes back (a minimal sketch follows this list)
  • New background goroutine
    • Subscribes to identify and connectedness events and updates the cached address book.
    • Runs a probe against all peers that meet the probe criteria:
      • Not currently connected
      • Haven't been probed within the probe threshold (1 hour)
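
A minimal sketch of that iterator flow, using simplified types and illustrative names (record, addrCache, cacheFallbackIter, findPeer) rather than the exact identifiers from this PR:

// Sketch only: simplified stand-ins for the real provider records and cache.
type record struct {
	PeerID string   // peer ID, simplified to a string
	Addrs  []string // multiaddrs, simplified to strings
}

type addrCache interface {
	Get(peerID string) []string // cached address book lookup
}

type cacheFallbackIter struct {
	source   []record // records streamed from the underlying router
	pos      int
	cache    addrCache
	findPeer func(peerID string) []string // blocking FindPeer dispatch
}

// Next returns the next provider record. If the router returned no addresses,
// it first tries the cached address book (cache HIT); on a cache miss it
// dispatches a FindPeer and returns whatever addresses that resolves.
func (it *cacheFallbackIter) Next() (record, bool) {
	if it.pos >= len(it.source) {
		return record{}, false
	}
	rec := it.source[it.pos]
	it.pos++
	if len(rec.Addrs) == 0 {
		if cached := it.cache.Get(rec.PeerID); len(cached) > 0 {
			rec.Addrs = cached // cache HIT: enrich the record with cached addrs
		} else {
			rec.Addrs = it.findPeer(rec.PeerID) // cache MISS: dispatch FindPeer
		}
	}
	return rec, true
}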

New magic numbers

We have to start with some defaults. This PR introduces some magic numbers that will likely change as we get operational data:

// The TTL to keep recently connected peers for. This should be enough time to probe
const RecentlyConnectedAddrTTL = time.Hour * 24
// Connected peers don't expire until they disconnect
const ConnectedAddrTTL = math.MaxInt64
// How long to wait since last connection before probing a peer again
const PeerProbeThreshold = time.Hour
// How often to run the probe peers function
const ProbeInterval = time.Minute * 5
// How many concurrent probes to run at once
const MaxConcurrentProbes = 20
// How many connect failures to tolerate before clearing a peer's addresses
const MaxConnectFailures = 3
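
For context, a rough sketch of how a probe loop could use these values; peersToProbe and probePeer are hypothetical helpers passed in for illustration, not functions from this PR (assumed imports: "context", "time", "github.com/libp2p/go-libp2p/core/peer"):

// Illustrative probe loop built around the constants above: every
// ProbeInterval, probe the candidate peers with at most MaxConcurrentProbes
// in flight.
func probeLoop(ctx context.Context, peersToProbe func() []peer.ID, probePeer func(context.Context, peer.ID)) {
	ticker := time.NewTicker(ProbeInterval) // every 5 minutes
	defer ticker.Stop()
	sem := make(chan struct{}, MaxConcurrentProbes) // at most 20 probes in flight
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// peersToProbe is expected to return peers that are not currently
			// connected and were last probed more than PeerProbeThreshold ago.
			for _, p := range peersToProbe() {
				sem <- struct{}{}
				go func(p peer.ID) {
					defer func() { <-sem }()
					probePeer(ctx, p) // e.g. host.Connect + record success/failure
				}(p)
			}
		}
	}
}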

Open questions

  • The only peers we probe and cache are peers that we've successfully run identify with. Peers returned without multiaddrs from FindProviders for which we have no cached multiaddrs remain unresolved.
    • Should we try to call FindPeer inside the iterator so they can be resolved? This could block the streaming of other providers in the iterator.
    • Another option might be to subscribe to kad-dht query events (not 100% sure if this is possible) and add them to the probe loop.
  • Should we probe the last connected addr or all addresses we have for a peer?
  • When should we augment results with cached addresses? Currently, it's done only when FindProviders from kad-dht returns no addresses. The presumption is that if the FindProviders results include multiaddrs for a peer, they are up to date.
  • How do we prevent excessive memory consumption by the cached address book? The memory address book already has built-in limits and cleanup, but the peers map doesn't. Temporary solution: I've added some instrumentation for this.

@2color 2color marked this pull request as ready for review November 28, 2024 15:43
@2color 2color requested review from lidel and aschmahmann November 28, 2024 15:57
This adds a metric for evaluating all addr lookups:
someguy_cached_router_peer_addr_lookups{cache="unused|hit|miss",origin="providers|peers"}

I've also wired up FindPeers for completeness.
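
For reference, a counter with those labels could be declared roughly like this with the Prometheus Go client; the variable name is illustrative, not necessarily what the PR uses (assumed imports: "github.com/prometheus/client_golang/prometheus" and "github.com/prometheus/client_golang/prometheus/promauto"):

// Sketch of the counter declaration; label values used elsewhere are
// "unused" | "hit" | "miss" for cache and "providers" | "peers" for origin.
var peerAddrLookups = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "someguy_cached_router_peer_addr_lookups",
	Help: "Number of peer addr info lookups per origin and cache state",
}, []string{"cache", "origin"})

// Example: record a cache hit while serving a providers request.
// peerAddrLookups.WithLabelValues("hit", "providers").Inc()
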
@lidel (Member) left a comment


Made a first pass and dropped some suggestions inline. I also pushed a commit with a new metric (details inline).

As for Open questions, my thinking is:

  • The only peers we probe and cache are peers that we've successfully run identify with. Peers returned without multiaddrs from FindProviders for which we have no cached multiaddrs remain unresolved.
    • Should we try to call FindPeer inside the iterator so they can be resolved? This could block the streaming of the providers in the iterator.

Indeed, looking at someguy_cached_router_peer_addr_lookups shows we get cache misses quite often (0 addrs and the cache does not have them either).

It was a bit difficult to reason about this without real-world input, so I piped root CIDs hitting our staging environment into it to populate the metric:

  • with CID duplicates: ssh [email protected] tail -F /var/log/nginx/access-json.log | stdbuf -o0 jq -r .request_uri | awk -F'[/&?]' '{print $3; fflush()}' | xargs -P 100 -I {cid} curl -s -o /dev/null -w "%{http_code} %{url.path}\n" "http://127.0.0.1:8190/routing/v1/providers/{cid}"
  • only unique CIDs: ssh [email protected] tail -F /var/log/nginx/access-json.log | stdbuf -o0 jq -r .request_uri | awk -F'[/&?]' '!seen[$3]++ {print $3; fflush()}' | xargs -P 100 -I {cid} curl -s -o /dev/null -w "%{http_code} %{url.path}\n" "http://127.0.0.1:8190/routing/v1/providers/{cid}"

A few minutes later http://127.0.0.1:8190/debug/metrics/prometheus shows:

# HELP someguy_cached_router_peer_addr_lookups Number of peer addr info lookups per origin and cache state
# TYPE someguy_cached_router_peer_addr_lookups counter
someguy_cached_router_peer_addr_lookups{cache="hit",origin="providers"} 1323
someguy_cached_router_peer_addr_lookups{cache="miss",origin="providers"} 6574
someguy_cached_router_peer_addr_lookups{cache="unused",origin="providers"} 7686

So yes, finding a way of decreasing the miss rate feels useful, given how high it is.

Two ideas:

  • Lazy/easy: avoid blocking the iterator by adding peers with cache misses to some queue, and then processing them asynchronously at some safe rate, populating the cache in a best-effort fashion. This may not help the first query, but all subsequent ones will see an increased cache hit rate over time.
  • Implement a custom iterator: if a peer hits a cache miss, we don't return the peer but silently move to the next item, and put the current one on a side queue which is processed asynchronously by calling FindPeer. Once the iterator hits the last item, we go back to the items on the side queue. This way we don't slow down results that already have addrs, and we can wait and stream the remaining ones at the end without impacting the performance of the fast ones. (A rough sketch follows below.)
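
To make the ordering of that second idea concrete, a rough sketch reusing the simplified record/addrCache types from the earlier sketch (names are illustrative, not code from this PR):

// Illustrative only: stream records that already have (or can get cached)
// addrs first, park cache misses on a side queue, and resolve those via
// FindPeer only after the fast results have been streamed.
type sideQueueIter struct {
	source    []record
	pos       int
	sideQueue []record
	cache     addrCache
	findPeer  func(peerID string) []string
}

func (it *sideQueueIter) Next() (record, bool) {
	for it.pos < len(it.source) {
		rec := it.source[it.pos]
		it.pos++
		if len(rec.Addrs) == 0 {
			if cached := it.cache.Get(rec.PeerID); len(cached) > 0 {
				rec.Addrs = cached
			} else {
				it.sideQueue = append(it.sideQueue, rec) // park the miss for later
				continue
			}
		}
		return rec, true // fast path: record already has addrs
	}
	if len(it.sideQueue) > 0 {
		rec := it.sideQueue[0]
		it.sideQueue = it.sideQueue[1:]
		rec.Addrs = it.findPeer(rec.PeerID) // slow path, resolved at the end
		return rec, true
	}
	return record{}, false
}
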
  • Should we probe the last connected addr or all addresses we have for a Peer?

See comment inline; if I understand correctly, host.Connect effectively probes all known addrs until one succeeds.
Probably good enough for now. If we need per-addr resolution, we may need to ask go-libp2p for a new API.

Note that vole libp2p identify <multiaddr> connects to a specific multiaddr because it does not run routing and spawns a new libp2p host every time.

  • When should we augment results with cached addresses? Currently, it's done only when FindProviders from kad-dht returns no addresses. The presumption is that if the FindProviders results include multiaddrs for a peer, they are up to date.

I think the current approach of hitting the cache only when regular routing returns no addrs is sensible.
It also makes it easier to reason about metrics like someguy_cached_router_peer_addr_lookups{origin,cache}.

  • How do we prevent excessive memory consumption by the cached address book? The memory address book already has built-in limits and cleanup, but the peers map doesn't. Temporary solution: I've added some instrumentation for this.

Cap at TTL of 48h?

@2color 2color requested a review from lidel December 11, 2024 11:46
@lidel (Member) left a comment


I think this is mostly ready to ship / deploy early next week, but we need to address things marked with ⚠️:

  • proactively remove peers that failed probing beyond max probing backoff
  • skip peers that have no addrs and skipped probing

Details and other smaller nits are inline.
(Apologies if I misunderstood something; I did this late on Thursday but wanted to give feedback before Friday.)

@2color (Member, Author) commented Dec 17, 2024

If I understand correctly, we don't have a good mechanism for "forgetting" the least-used peers, other than reaching the PeerCacheSize limit here.

  • we have a single cache for peerState successes (online) and failures (offline)

  • over time the cache will hold 1M peers and most of them may be offline (due to regular churn, or network attacks - we've seen sudden spikes in random peers that then disappeared forever)

    • A good safeguard would be to explicitly remove a peer from the cache if it has been failing for longer than MaxBackoffDuration (see comment below)
  • the cache is unlikely to drop items on its own with PeerCacheSize = 1_000_000, so there is no point in using 2Q; maybe switch to a simpler LRU at this point

I made a couple of additions to your suggestion:

  • When we remove the peerState for a given peer, we also remove it from the cache. This is important because otherwise the peer would be probed again (since we deleted the state but we iterate over the address book for the probe).
  • Increase MaxBackoffDuration to 48 hours (from 24 hours). This delays when a peer is removed from the address book and the peerState cache. Why? Because even if a peer announces a provider record and then goes offline, we want to know that it's offline (through the peerState cache) for as long as that provider record is valid, so that we don't dispatch FindPeer calls to that peer in the fallback router.

When the max backoff duration is reached and a connection attempt fails,
we clear the cached addresses and state. Since this state is useful to
prevent unnecessary attempts to dispatch a find peer, we should keep it
for as long as a provider record is valid.
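
A minimal sketch of that removal path; the cachedAddrBook fields, peerState shape, and helper name are assumptions for illustration, not the PR's exact code, and MaxBackoffDuration refers to the 48h constant discussed above (assumed imports: "time", "github.com/libp2p/go-libp2p/core/peer", "github.com/libp2p/go-libp2p/core/peerstore"):

// Hypothetical per-peer probing state and cache wrapper.
type peerState struct {
	connectFailures int
	firstFailureAt  time.Time
	lastFailureAt   time.Time
}

type cachedAddrBook struct {
	peerState map[peer.ID]*peerState
	addrBook  peerstore.AddrBook // go-libp2p in-memory address book
}

func (cab *cachedAddrBook) recordFailedConnection(p peer.ID) {
	st, ok := cab.peerState[p]
	if !ok {
		st = &peerState{firstFailureAt: time.Now()}
		cab.peerState[p] = st
	}
	st.connectFailures++
	st.lastFailureAt = time.Now()
	// Once the peer has kept failing for longer than a provider record stays
	// valid, drop both its probing state and its cached addresses so it is
	// neither returned to clients nor probed again.
	if time.Since(st.firstFailureAt) > MaxBackoffDuration {
		delete(cab.peerState, p)
		cab.addrBook.ClearAddrs(p)
	}
}
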
@2color 2color requested a review from lidel December 17, 2024 10:25
@lidel (Member) left a comment


Thank you @2color, lgtm. FYI, I added a metric for tracking offline/online probe results (increase rate over time) – c1ac41b

Feel free to merge, release, and deploy to https://delegated-ipfs.dev/ 🙏
We will see if any follow-up is needed once we observe metrics in prod (fine to pipe gateway CIDs to smoke-test there too).


Note for self / future reference:

  • The way the cache works in this PR is that we require at least one successful libp2p identify lookup before a peer is accepted into the cache.
    • Once in the cache, we periodically probe to confirm whether the peer is online/offline, and refresh its addrs.
    • If a cached peer goes offline, we note it in the cache and stop returning it as one of the providers, but we keep probing cached peers (incl. offline ones) with backoff in case they come online again. If not, we drop them once they have been offline longer than the Amino DHT expiration window.
  • Basing cache additions on libp2p identify success means the cache does not include peer IDs that were returned by the (DHT?) router without addrs but that "always failed to identify".
    • I think it is ok to leave this as-is; if we ever need to decrease the cost related to those, we can start caching these dead peers somewhere in the "dispatched peer lookup" code to ensure they fall under the backoff logic as well.

@2color (Member, Author) commented Dec 18, 2024

Basing cache additions on libp2p identify success means the cache does not include peer IDs that were returned by the (DHT?) router without addrs but that "always failed to identify"

We only cache the failure count and the last failure time, since we don't have any addresses to cache for such peers.

  • I think it is ok to leave this as-is; if we ever need to decrease the cost related to those, we can start caching these dead peers somewhere in the "dispatched peer lookup" code to ensure they fall under the backoff logic as well.

We already do.

When we dispatch a peer lookup, we call FindPeers on the cachedRouter:

peersIt, err := it.router.FindPeers(ctx, *record.ID, 1)

Which records failed connections:

r.cachedAddrBook.RecordFailedConnection(pid) // record the failure used for probing/backoff purposes

This ensures that the backoff logic applies to those peers too (ones that never successfully identified) for up to MaxBackoffDuration (48 hours) of failures. Only if a PeerID continues to be unreachable for 48 hours will it be removed from both the address book and the cache.

@2color 2color merged commit d117b28 into main Dec 18, 2024
7 checks passed
@2color 2color mentioned this pull request Dec 18, 2024
@2color (Member, Author) commented Dec 18, 2024

Another follow-up improvement we could look into is checking the last failed connection in the peerCache before blindly augmenting with addresses from the cachedAddrBook.

As it currently stands, even if probing a peer fails, the cached addresses will still be returned, at least until MaxBackoffDuration of 48 hours is reached and the addresses are removed from the cache.
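
A sketch of what that check could look like, reusing the hypothetical cachedAddrBook and peerState fields from the earlier sketch (illustrative names, not code from this PR; assumed import "github.com/multiformats/go-multiaddr"):

// Hypothetical guard: only augment with cached addresses if the peer is not
// currently in a failing streak according to the probing state.
func (cab *cachedAddrBook) usableCachedAddrs(p peer.ID) []multiaddr.Multiaddr {
	if st, ok := cab.peerState[p]; ok && st.connectFailures > 0 {
		return nil // recent probes failed: treat the peer as offline for now
	}
	return cab.addrBook.Addrs(p)
}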

@2color 2color mentioned this pull request Dec 18, 2024
Successfully merging this pull request may close these issues: "Local storage for local caching purposes", "No multiaddrs returned from provider record lookups".