Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: find_or_insert? #2

Open
d33tah opened this issue Aug 18, 2020 · 2 comments
Open

Feature request: find_or_insert? #2

d33tah opened this issue Aug 18, 2020 · 2 comments

Comments

@d33tah
Copy link

d33tah commented Aug 18, 2020

Hi! Thanks for the library! I played with it for a while, though unfortunately it wasn't fast enough for my use case. Does it use CPU's built-in hash capabilities to quickly create hashes?

Anyway, I might have a feature request: when counting how many entries are in a given set, I was trying to use the library but found that it's probably hashing twice the same entries. Here's the relevant code:

        if !filter.contains_hash(&record[1]) {
            filter.insert_hash(&record[1]);
            cnt += 1;
        }

Would it make sense to add insert_if_not_exists() that returns information about whether the hash was already there?

@domodwyer
Copy link
Owner

Hi @d33tah

I'm surprised it's not fast enough for your use case! Are you hashing the raw data and inserting the hash or is record[1] the raw data? What sort of numbers are you seeing?

I just ran a quick check using xxhash (though in reality the choice of hash here has no impact):

pub fn xxhash_bench(c: &mut Criterion) {
    use std::hash::BuildHasher;
    use std::hash::Hasher;
    use twox_hash::*;

    // Really any u64/hash bytes is fine, but this shows how to use xxhash.
    let mut bloom = CompressedBitmap::new(FilterSize::KeyBytes2);
    let mut hasher = RandomXxHashBuilder64::default().build_hasher();
    hasher.write(&[42]); // Hash the raw data bytes
    let hash = hasher.finish().to_le_bytes();

    c.bench_function("xxhash_insert", |b| b.iter(|| bloom.insert_hash(hash)));

    c.bench_function("xxhash_lookup hit", |b| {
        b.iter(|| black_box(bloom.contains_hash(hash)))
    });

    c.bench_function("xxhash_lookup miss, partial match", |b| {
        b.iter(|| black_box(bloom.contains_hash([1, 2, 3, 4, 5, 6, 7, 8])))
    });
}

And got:

curiosity:bloom2 % cargo bench -- xxhash
   Compiling bloom2 v0.1.0 (/Users/dom/Documents/rust/bloom2)
    Finished bench [optimized] target(s) in 12.46s
     Running target/release/deps/bench-8985f05ed1c18d86
Gnuplot not found, using plotters backend
xxhash_insert           time:   [26.184 ns 26.251 ns 26.321 ns]
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low severe
  2 (2.00%) high mild
  2 (2.00%) high severe

xxhash_lookup hit       time:   [19.978 ns 20.069 ns 20.176 ns]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

xxhash_lookup miss, partial match
                        time:   [4.6333 ns 4.6693 ns 4.7297 ns]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

I'm compiling with -Ctarget-cpu=native to enable SSE/AVX and allow rust to use the all the features of the CPU - if your aiming for flexible builds (say, packaging up the binary to be run by 3rd parties on unknown CPUs) then skip this - it will still be pretty fast! But if you're deploying binaries to known hardware, everything will be a bit faster with this enabled and is worth doing!


To answer your actual question: sounds like a great idea!

Maybe something like:

fn try_insert<T: AsRef<[u8]>>(&self, hash: T) -> bool

Where the returned bool is true/false depending on if it is inserted?

I think this crate could do with some love to be honest - adding a generic parameter for the default hasher (xxhash?) the user can override instead of requiring them to pass hashes directly, and adding more utility methods seems like an obvious improvement.

My original use case was dealing with hashes, so I didn't want the bloom filter to then _re_hash the hash I was giving it, so that's why it wound up like this, but now it's public that is likely not the most useful form this crate could take...

Dom

@d33tah
Copy link
Author

d33tah commented Aug 18, 2020

@domodwyer thanks for the RUSTFLAGS hint! It speeded up the program twice. Now I guess that the next low hanging fruit is the try_insert function you proposed. If you're interested in adding it, please ping me once it's ready for testing. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants