
Faster status cache #3796

Open · alessandrod wants to merge 1 commit into master from status-cache
Conversation

@alessandrod commented Nov 26, 2024

This PR removes the global RwLock around the status cache and introduces more granular RwLocks per blockhash and per slot. Additionally, it changes the internal hash tables from std HashMap to DashMap, so that operations at the blockhash and slot level can be done while holding only read locks.

This is not the final design of A Performant Status Cache - which, I think, can make the cost of check and update go effectively to zero - but it's a good incremental improvement.
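For readers following along, a minimal sketch of the shape of the change (the type and field names here are illustrative assumptions, not the PR's actual code):

```rust
use std::sync::RwLock;

use dashmap::DashMap;

type Slot = u64;
type Hash = [u8; 32];

// Illustrative only: DashMap shards the outer lock, so blockhash- and
// slot-level operations take a read lock on a single shard, and each
// blockhash entry then guards its own data with a per-entry RwLock.
struct StatusCacheSketch<T> {
    cache: DashMap<Hash, RwLock<Vec<(Slot, T)>>>,
    slot_deltas: DashMap<Slot, Vec<(Hash, T)>>,
}
```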

Results are pretty good: check_transactions is ~6x faster, and update_transaction_statuses is ~2.5x faster.

[Screenshots: check_transactions and update_transaction_statuses benchmarks]

@alessandrod force-pushed the status-cache branch 14 times, most recently from 571fca8 to 2db0759 on November 27, 2024
@alessandrod force-pushed the status-cache branch 2 times, most recently from d41386e to fc67d2f on December 9, 2024
@alessandrod marked this pull request as ready for review on December 9, 2024
@alessandrod force-pushed the status-cache branch 2 times, most recently from 2ff1d2b to d9fdc54 on December 9, 2024
@alessandrod changed the title from "[WIP] faster status cache" to "Faster status cache" on Dec 9, 2024
@alessandrod force-pushed the status-cache branch 6 times, most recently from 1309759 to 70974b7 on December 9, 2024
Remove the global RwLock around the status cache, and introduce more
granular RwLocks per-blockhash and per-slot. Additionally, change the
internal hash tables from std HashMap to DashMap, so that operations at
the blockhash and slot level can be done while holding only read locks.
@alessandrod (Author)

bench-tps against a single node

scheduler before and after

[Screenshots: scheduler before and after]

workers

[Screenshots: worker metrics before and after]

@@ -3459,26 +3459,26 @@ impl Bank {
}

/// Forget all signatures. Useful for benchmarking.
#[cfg(feature = "dev-context-only-utils")]

❤️

// Store forks in a single chunk of memory to avoid another lookup.
pub type ForkStatus<T> = Vec<(Slot, T)>;
// The number of shards to use for ::slot_deltas. We're going to store at most MAX_CACHE_ENTRIES
// slots, MAX * 4 gives us a low load factor guaranteeing that collisions are vey rare.

Suggested change:
- // slots, MAX * 4 gives us a low load factor guaranteeing that collisions are vey rare.
+ // slots, MAX * 4 gives us a low load factor guaranteeing that collisions are very rare.
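For context on the shard-count comment above: dashmap lets callers pick the shard count explicitly, and it must be a power of two. A sketch with assumed constant values (the real ones live in the PR):

```rust
use dashmap::DashMap;

// Assumed value for illustration.
const MAX_CACHE_ENTRIES: usize = 300;
// MAX * 4 shards for at most MAX entries keeps the per-shard load low,
// so two slots rarely land in the same shard.
const SLOT_DELTAS_SHARDS: usize = (MAX_CACHE_ENTRIES * 4).next_power_of_two();

fn new_slot_deltas<V>() -> DashMap<u64, V> {
    // with_shard_amount panics unless given a power of two, hence the
    // rounding above.
    DashMap::with_shard_amount(SLOT_DELTAS_SHARDS)
}
```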

Comment on lines +396 to +421
fn push(&self, item: T) {
self.vec.push(item);
}

fn iter(&self) -> impl Iterator<Item = (usize, &T)> {
self.vec.iter()
}
}

impl<T> IntoIterator for ConcurrentVec<T> {
type Item = T;
type IntoIter = boxcar::IntoIter<T>;

fn into_iter(self) -> Self::IntoIter {
self.vec.into_iter()
}
}

impl<'a, T> IntoIterator for &'a ConcurrentVec<T> {
type Item = (usize, &'a T);
type IntoIter = boxcar::Iter<'a, T>;

fn into_iter(self) -> Self::IntoIter {
self.vec.iter()
}
}

Maybe excepting serialize and deserialize, can't we just implement Deref to get access to the inner vec for all of this other stuff?
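A minimal sketch of that suggestion, assuming the wrapper exists mainly to add serialization on top of boxcar::Vec:

```rust
use std::ops::Deref;

struct ConcurrentVec<T> {
    vec: boxcar::Vec<T>,
}

// Forward the read-side API (len, get, iter, even the &self push) to the
// inner vec; only serialize/deserialize would still need explicit impls.
impl<T> Deref for ConcurrentVec<T> {
    type Target = boxcar::Vec<T>;

    fn deref(&self) -> &Self::Target {
        &self.vec
    }
}
```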


// Store forks in a single chunk of memory to avoid another hash lookup. Avoid allocations in the
// case a tx only lands in one (the vast majority) or two forks.
pub type ForkStatus<T> = SmallVec<[(Slot, T); 2]>;

Do we have stats on how many are on only 1 fork, 2 forks, 3+ forks?

Curious what the performance difference is, if for example 95+% are on a single fork - is it worth doubling the memory usage here for SmallVec<2> instead of SmallVec<1>?

Anecdotally, more than 2 forks seems exceedingly rare.
More scientifically, I'm seeing ~11k forks over the last 7 days (a little more than 1 per minute).
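To put a number on the memory question, inline capacity is easy to measure directly; a quick check using (u64, u64) as a stand-in element type (real sizes depend on T and the smallvec version):

```rust
use smallvec::SmallVec;

type Slot = u64;

fn main() {
    // Inline capacity 1 vs 2: the larger variant reserves space for a
    // second (Slot, T) pair inside the struct itself, e.g. 24 vs 40
    // bytes with smallvec 1.x on x86_64.
    println!("{}", std::mem::size_of::<SmallVec<[(Slot, u64); 1]>>());
    println!("{}", std::mem::size_of::<SmallVec<[(Slot, u64); 2]>>());
}
```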

panic!("Blockhash must exist if it exists in self.slot_deltas, slot: {slot}")
};

// cache_txs is self.blockhash_cache[blockhash]

self.blockhash_cache would probably be a good rename 😄

Comment on lines +144 to +145
forks.retain(|(fork_slot, _key)| *fork_slot != slot);
if forks.is_empty() {

Might be over-/prematurely optimizing... but this strikes me as a bit of an odd way to do this.

AFAIK, there is no chance of duplicate fork slots in the status cache. Seems we could check whether len == 1 and remove the entry in that case; if != 1, then we can do the retain.

Actually, retaining seems a bit odd to me: say we have 4 forks and we remove the entry at index 0, we shuffle everything over by 1. They're not huge, so copying may be relatively cheap, but we could probably just do something like this:

                // if all the slots have been cleared or purged, we don't need to track this tx
                // anymore
                if forks.len() == 1 {
                    cache_tx_entry.remove();
                } else if let Some(index) =
                    forks.iter().position(|(fork_slot, _)| *fork_slot == slot)
                {
                    forks.swap_remove(index);
                }


// Get the cache entry for this blockhash.
let (max_slot, key_index, hash_map) =
self.cache.entry(*transaction_blockhash).or_insert_with(|| {
let key_index = {

If we returned (key_index, key_slice) from this block, would that perform a copy, or would the compiler optimize it to effectively what you have now?

Thinking about how we can limit the scope of MaybeUninit so it's easier to maintain. Not opposed to what you have if there's any performance cost to my suggestion.

Basically, change the inner spot where you're currently doing the ptr copy to something like this:

            // Grab the key slice.
            let key_index = (*key_index).min(max_key_index);
            let key_slice = {
                let mut key_slice = MaybeUninit::<[u8; CACHED_KEY_SIZE]>::uninit();
                unsafe {
                    ptr::copy_nonoverlapping(
                        key.as_ref()[key_index..key_index + CACHED_KEY_SIZE].as_ptr(),
                        key_slice.as_mut_ptr() as *mut u8,
                        CACHED_KEY_SIZE,
                    );
                    key_slice.assume_init()
                }
            };

            // Insert the slot and tx result into the cache entry associated with
            // this blockhash and keyslice.
            let mut forks = txs.entry(key_slice).or_default();
            forks.push((slot, res.clone()));

            (key_index, key_slice)

Now all the unsafe is in one place.
If this introduces extra copies, let's just keep it as is. I think the compiler should be smart enough to do what we want, but I'm probably putting too much faith in it 😢
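As an aside, a fully safe alternative to the copy above (an assumption, not what the PR ships) is array TryFrom on the subslice, which release builds typically compile down to the same memcpy:

```rust
// Hypothetical safe variant: <[u8; N]>::try_from length-checks the
// subslice and copies it, with no MaybeUninit involved. TryInto is in
// the Rust 2021 prelude.
let key_slice: [u8; CACHED_KEY_SIZE] = key.as_ref()
    [key_index..key_index + CACHED_KEY_SIZE]
    .try_into()
    .expect("subslice length equals CACHED_KEY_SIZE");
```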

} else {
// Only take the write lock if this is the first time we've seen
// this blockhash in this slot.
let (_key_index, txs) = &mut *fork_entry

nit: I think we don't need mut here? txs doesn't actually need to be mutable; the mutation is happening only at the fork_entry level.

@bw-solana left a comment

LGTM. One minor suggestion


 check_results.extend(sanitized_txs.iter().zip(lock_results).map(
     |(sanitized_tx, lock_result)| {
         let sanitized_tx = sanitized_tx.borrow();
         if lock_result.is_ok()
-            && self.is_transaction_already_processed(sanitized_tx, &rcache)
+            && self.is_transaction_already_processed(sanitized_tx, &self.status_cache)

probably don't even need to pass status cache anymore



-    /// Get the statuses for all the root slots
+    /// Get the statuses for all the root slots.
+    ///
+    /// This is never called concurrently with add_root(), and for a slot to be a root there must be


No concurrent calls to this right now. Not sure we can really prevent them from being added, though.

-        self.cache.retain(|_, (fork, _, _)| *fork > min);
+        self.cache.retain(|_key, value| {
+            let (max_slot, _, _) = &**value;
+            max_slot.load(Ordering::Relaxed) > min


Noting for myself:

retain will grab the write lock on the shard while checking the closure. This guarantees that we have unique access to this self.cache entry.
Concurrent access from insert will have completed, or will be held up waiting for this to complete on the shard; if the entry is removed, then insert may re-add it (that's fine).

I'm like 99% sure that Relaxed ordering here (and in insert) is fine because we have guaranteed unique access via the write lock on the shard.
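A distilled sketch of the pattern described above (illustrative names; dashmap's retain really does hold each shard's write lock while running the closure):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

use dashmap::DashMap;

type Slot = u64;

// The shard write lock held by retain already orders this load against
// any fetch_max performed by a concurrent inserter, so Relaxed suffices.
fn purge_old_entries(cache: &DashMap<[u8; 32], AtomicU64>, min: Slot) {
    cache.retain(|_key, max_slot| max_slot.load(Ordering::Relaxed) > min);
}
```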

@apfitzge left a comment

Few more comments on this round, but overall looks correct to me.

@alessandrod (Author)

Results on a Threadripper 7965WX (64 cores / 128 threads): 10x improvement, pretty crazy.

[Screenshot: benchmark results]

@bw-solana linked an issue on Jan 3, 2025 that may be closed by this pull request

Successfully merging this pull request may close these issues.

Performance: Status Cache
3 participants