
Faster status cache #3796

Open · alessandrod wants to merge 1 commit into master from status-cache
Conversation

@alessandrod commented Nov 26, 2024

This PR removes the global RwLock around the status cache and introduces more granular RwLocks per blockhash and per slot. Additionally, it changes the internal hash tables from std HashMap to DashMap, so that operations at the blockhash and slot level can be done while holding only read locks.

This is not the final design of A Performant Status Cache - which, I think, can make the cost of check and update go effectively to zero - but it's a good incremental improvement.
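For readers following along, a minimal sketch of the shape of the change (the type and field names here are illustrative assumptions, not the PR's actual code):

```rust
use std::sync::RwLock;

use dashmap::DashMap;

type Slot = u64;
type Hash = [u8; 32];

// Illustrative only: DashMap shards the outer lock, so blockhash- and
// slot-level operations take a read lock on a single shard, and each
// blockhash entry then guards its own data with a per-entry RwLock.
struct StatusCacheSketch<T> {
    cache: DashMap<Hash, RwLock<Vec<(Slot, T)>>>,
    slot_deltas: DashMap<Slot, Vec<(Hash, T)>>,
}
```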

Results are pretty good: check_transactions is ~6x faster, and update_transaction_statuses is ~2.5x faster.

[Screenshots: check_transactions and update_transaction_statuses benchmarks]

@alessandrod force-pushed the status-cache branch 14 times, most recently from 571fca8 to 2db0759 on November 27, 2024
@alessandrod force-pushed the status-cache branch 2 times, most recently from d41386e to fc67d2f on December 9, 2024
@alessandrod marked this pull request as ready for review on December 9, 2024
@alessandrod force-pushed the status-cache branch 2 times, most recently from 2ff1d2b to d9fdc54 on December 9, 2024
@alessandrod changed the title from "[WIP] faster status cache" to "Faster status cache" on Dec 9, 2024
@alessandrod force-pushed the status-cache branch 6 times, most recently from 1309759 to 70974b7 on December 9, 2024
Remove the global RwLock around the status cache, and introduce more
granular RwLocks per-blockhash and per-slot. Additionally, change the
internal hash tables from std HashMap to DashMap, so that operations at
the blockhash and slot level can be done while holding only read locks.
@alessandrod (Author)

bench-tps against a single node

scheduler before and after

[Screenshots: scheduler before and after]

workers

[Screenshots: worker metrics before and after]

@@ -3459,26 +3459,26 @@ impl Bank {
}

/// Forget all signatures. Useful for benchmarking.
#[cfg(feature = "dev-context-only-utils")]

❤️

// Store forks in a single chunk of memory to avoid another lookup.
pub type ForkStatus<T> = Vec<(Slot, T)>;
// The number of shards to use for ::slot_deltas. We're going to store at most MAX_CACHE_ENTRIES
// slots, MAX * 4 gives us a low load factor guaranteeing that collisions are vey rare.

Suggested change:
- // slots, MAX * 4 gives us a low load factor guaranteeing that collisions are vey rare.
+ // slots, MAX * 4 gives us a low load factor guaranteeing that collisions are very rare.
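For context on the shard-count comment above: dashmap lets callers pick the shard count explicitly, and it must be a power of two. A sketch with assumed constant values (the real ones live in the PR):

```rust
use dashmap::DashMap;

// Assumed value for illustration.
const MAX_CACHE_ENTRIES: usize = 300;
// MAX * 4 shards for at most MAX entries keeps the per-shard load low,
// so two slots rarely land in the same shard.
const SLOT_DELTAS_SHARDS: usize = (MAX_CACHE_ENTRIES * 4).next_power_of_two();

fn new_slot_deltas<V>() -> DashMap<u64, V> {
    // with_shard_amount panics unless given a power of two, hence the
    // rounding above.
    DashMap::with_shard_amount(SLOT_DELTAS_SHARDS)
}
```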

Comment on lines +396 to +421
fn push(&self, item: T) {
self.vec.push(item);
}

fn iter(&self) -> impl Iterator<Item = (usize, &T)> {
self.vec.iter()
}
}

impl<T> IntoIterator for ConcurrentVec<T> {
type Item = T;
type IntoIter = boxcar::IntoIter<T>;

fn into_iter(self) -> Self::IntoIter {
self.vec.into_iter()
}
}

impl<'a, T> IntoIterator for &'a ConcurrentVec<T> {
type Item = (usize, &'a T);
type IntoIter = boxcar::Iter<'a, T>;

fn into_iter(self) -> Self::IntoIter {
self.vec.iter()
}
}

Maybe excepting serialize and deserialize, can't we just implement Deref to get access to the inner vec for all of this other stuff?
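A minimal sketch of that suggestion, assuming the wrapper exists mainly to add serialization on top of boxcar::Vec:

```rust
use std::ops::Deref;

struct ConcurrentVec<T> {
    vec: boxcar::Vec<T>,
}

// Forward the read-side API (len, get, iter, even the &self push) to the
// inner vec; only serialize/deserialize would still need explicit impls.
impl<T> Deref for ConcurrentVec<T> {
    type Target = boxcar::Vec<T>;

    fn deref(&self) -> &Self::Target {
        &self.vec
    }
}
```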


// Store forks in a single chunk of memory to avoid another hash lookup. Avoid allocations in the
// case a tx only lands in one (the vast majority) or two forks.
pub type ForkStatus<T> = SmallVec<[(Slot, T); 2]>;

Do we have stats on how many are on only 1 fork, 2 forks, 3+ forks?

Curious what the performance difference is, if for example 95+% are on a single fork - is it worth doubling the memory usage here for SmallVec<2> instead of SmallVec<1>?

Anecdotally, more than 2 forks seems exceedingly rare.
More scientifically, I'm seeing ~11k forks over the last 7 days (a little more than 1 per minute).
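To put a number on the memory question, inline capacity is easy to measure directly; a quick check using (u64, u64) as a stand-in element type (real sizes depend on T and the smallvec version):

```rust
use smallvec::SmallVec;

type Slot = u64;

fn main() {
    // Inline capacity 1 vs 2: the larger variant reserves space for a
    // second (Slot, T) pair inside the struct itself, e.g. 24 vs 40
    // bytes with smallvec 1.x on x86_64.
    println!("{}", std::mem::size_of::<SmallVec<[(Slot, u64); 1]>>());
    println!("{}", std::mem::size_of::<SmallVec<[(Slot, u64); 2]>>());
}
```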

panic!("Blockhash must exist if it exists in self.slot_deltas, slot: {slot}")
};

// cache_txs is self.blockhash_cache[blockhash]

self.blockhash_cache would probably be a good rename 😄

Comment on lines +144 to +145
forks.retain(|(fork_slot, _key)| *fork_slot != slot);
if forks.is_empty() {

Might be over-/prematurely optimizing... but this strikes me as a bit of an odd way to do this.

AFAIK, there is no chance of duplicate fork slots in the status cache. Seems we could check whether len == 1 and remove the entry in that case; if != 1, then we can do the retain.

Actually, retaining seems a bit odd to me: say we have 4 forks and we remove the entry at index 0, we shuffle everything over by 1. They're not huge, so copying may be relatively cheap, but we could probably just do something like this:

                // if all the slots have been cleared or purged, we don't need to track this tx
                // anymore
                if forks.len() == 1 {
                    cache_tx_entry.remove();
                } else if let Some(index) =
                    forks.iter().position(|(fork_slot, _)| *fork_slot == slot)
                {
                    forks.swap_remove(index);
                }


// Get the cache entry for this blockhash.
let (max_slot, key_index, hash_map) =
self.cache.entry(*transaction_blockhash).or_insert_with(|| {
let key_index = {

If we returned (key_index, key_slice) from this block, would that perform a copy, or would the compiler optimize it to effectively what you have now?

Thinking about how we can limit the scope of MaybeUninit so it's easier to maintain. Not opposed to what you have if there's any performance cost to my suggestion.

Basically, change the inner spot where you're currently doing the ptr copy to something like this:

            // Grab the key slice.
            let key_index = (*key_index).min(max_key_index);
            let key_slice = {
                let mut key_slice = MaybeUninit::<[u8; CACHED_KEY_SIZE]>::uninit();
                unsafe {
                    ptr::copy_nonoverlapping(
                        key.as_ref()[key_index..key_index + CACHED_KEY_SIZE].as_ptr(),
                        key_slice.as_mut_ptr() as *mut u8,
                        CACHED_KEY_SIZE,
                    );
                    key_slice.assume_init()
                }
            };

            // Insert the slot and tx result into the cache entry associated with
            // this blockhash and keyslice.
            let mut forks = txs.entry(key_slice).or_default();
            forks.push((slot, res.clone()));

            (key_index, key_slice)

Now all the unsafe is in one place.
If this introduces extra copies, let's just keep it as is. I think the compiler should be smart enough to do what we want, but I'm probably putting too much faith in it 😢
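As an aside, a fully safe alternative to the copy above (an assumption, not what the PR ships) is array TryFrom on the subslice, which release builds typically compile down to the same memcpy:

```rust
// Hypothetical safe variant: <[u8; N]>::try_from length-checks the
// subslice and copies it, with no MaybeUninit involved. TryInto is in
// the Rust 2021 prelude.
let key_slice: [u8; CACHED_KEY_SIZE] = key.as_ref()
    [key_index..key_index + CACHED_KEY_SIZE]
    .try_into()
    .expect("subslice length equals CACHED_KEY_SIZE");
```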

} else {
// Only take the write lock if this is the first time we've seen
// this blockhash in this slot.
let (_key_index, txs) = &mut *fork_entry

nit: I think we don't need mut here? txs doesn't actually need to be mutable; the mutation is happening only at the fork_entry level.

@bw-solana left a comment

LGTM. One minor suggestion


 check_results.extend(sanitized_txs.iter().zip(lock_results).map(
     |(sanitized_tx, lock_result)| {
         let sanitized_tx = sanitized_tx.borrow();
         if lock_result.is_ok()
-            && self.is_transaction_already_processed(sanitized_tx, &rcache)
+            && self.is_transaction_already_processed(sanitized_tx, &self.status_cache)

probably don't even need to pass status cache anymore



-    /// Get the statuses for all the root slots
+    /// Get the statuses for all the root slots.
+    ///
+    /// This is never called concurrently with add_root(), and for a slot to be a root there must be


No concurrent calls to this right now. Not sure we can really prevent them from being added, though.

-        self.cache.retain(|_, (fork, _, _)| *fork > min);
+        self.cache.retain(|_key, value| {
+            let (max_slot, _, _) = &**value;
+            max_slot.load(Ordering::Relaxed) > min


Noting for myself:

retain will grab the write lock on the shard while checking the closure. This guarantees that we have unique access to this self.cache entry.
Concurrent access from insert will have completed, or will be held up waiting for this to complete on the shard; if the entry is removed, then insert may re-add it (that's fine).

I'm like 99% sure that Relaxed ordering here (and in insert) is fine because we have guaranteed unique access via the write lock on the shard.
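A distilled sketch of the pattern described above (illustrative names; dashmap's retain really does hold each shard's write lock while running the closure):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

use dashmap::DashMap;

type Slot = u64;

// The shard write lock held by retain already orders this load against
// any fetch_max performed by a concurrent inserter, so Relaxed suffices.
fn purge_old_entries(cache: &DashMap<[u8; 32], AtomicU64>, min: Slot) {
    cache.retain(|_key, max_slot| max_slot.load(Ordering::Relaxed) > min);
}
```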

@apfitzge left a comment

Few more comments on this round, but overall looks correct to me.

@alessandrod (Author)

Results on a Threadripper 7965WX (64 cores / 128 threads): 10x improvement, pretty crazy.

[Screenshot: benchmark results]

@bw-solana linked an issue on Jan 3, 2025 that may be closed by this pull request

Successfully merging this pull request may close these issues.

Performance: Status Cache
3 participants