[FEATURE] A partitioned HIBF layout. #230

Draft. Wants to merge 30 commits into base: main


@smehringer smehringer commented Nov 9, 2023

Still a work in progress.

Design decision on how to partition

[NO] Partitioning by k-mer suffix/prefix

This means that each partition receives a subset of the k-mers of every user bin, so every partition has to be built over all user bins.

Disadvantages:

  • Partitions need to count k-mers, which is slow in the HIBF
  • Bookkeeping for query counts is needed
  • Requires layout(s) on many user bins, and the layout algorithm is slow for many user bins

Advantages:

  • Every query hash is only searched once (they are also split by suffix/prefix and only searched in the respective partition)
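To make the rejected k-mer-based variant concrete, here is a minimal sketch of how hashes could be routed to partitions by suffix. The function names and the power-of-two restriction are assumptions for illustration only; nothing like this is part of the PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: pick a partition from the low bits (the "suffix") of a k-mer hash.
// Assumes number_of_partitions is a power of two so that a bit mask can be used.
inline std::size_t partition_of(std::uint64_t const kmer_hash, std::size_t const number_of_partitions)
{
    return static_cast<std::size_t>(kmer_hash & (number_of_partitions - 1u));
}

// Split the hashes of one user bin (or one query) into per-partition buckets.
// The same routine would be applied to user bins at build time and to queries at search time,
// which is why each query hash is looked up in exactly one partition.
std::vector<std::vector<std::uint64_t>> split_by_suffix(std::vector<std::uint64_t> const & hashes,
                                                        std::size_t const number_of_partitions)
{
    std::vector<std::vector<std::uint64_t>> buckets(number_of_partitions);
    for (std::uint64_t const hash : hashes)
        buckets[partition_of(hash, number_of_partitions)].push_back(hash);
    return buckets;
}
```

Because every user bin contributes hashes to every bucket, each partition still needs a layout over all user bins, which is the main cost listed above.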

[NO] Partitioning by IBF (subtree) or The Lazy HIBF

This means that there is still one big HIBF and its individual IBFs or subtrees of IBFs are the partitions.

Disadvantages:

  • Requires one big layout on many user bins, which is slow
  • Bookkeeping for queries is needed (which query hits which bin)
  • Query hashes need to be repeatedly searched in different partitions

Advantages:

  • Not every partition has to be queried with every query, since the bookkeeping can track which partition needs to be searched for which query

[YES] Partitioning by user bin

This means each partition indexes a subset of the user bins and gets its own, independently computed layout.

Disadvantages:

  • Every query must be searched in every single partition
  • Query results from different partitions must be merged (the result for a query is the set of all matching user bins across partitions)

Advantages:

  • Only smaller layouts need to be computed, each on a subset of the data
  • No extra bookkeeping: query results can be collected per partition
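The merge step for the chosen variant can be sketched as follows. This is an illustration under assumptions, not code from the PR: `query_hits` and `merge_partition_results` are hypothetical names, and each partition is assumed to report global user bin ids.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical result type: hits[q] lists the (global) ids of user bins matched by query q.
using query_hits = std::vector<std::vector<std::size_t>>;

// Merge per-partition results: per_partition[p][q] holds the user bins of partition p hit by query q.
// Since partitions index different user bins, concatenation is enough; no deduplication is needed.
query_hits merge_partition_results(std::vector<query_hits> const & per_partition,
                                   std::size_t const number_of_queries)
{
    query_hits merged(number_of_queries);
    for (query_hits const & partition_result : per_partition)
        for (std::size_t q = 0; q < number_of_queries && q < partition_result.size(); ++q)
            merged[q].insert(merged[q].end(), partition_result[q].begin(), partition_result[q].end());

    // Sort so the output does not depend on the order in which partitions were searched.
    for (std::vector<std::size_t> & hits : merged)
        std::sort(hits.begin(), hits.end());

    return merged;
}
```

Each partition can be searched independently (for example in parallel or on different machines), and only this final concatenation needs to see all results.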

Goals for partitioning

Our goals, in descending order of priority:

  1. The partitions should have roughly the same size (reason: reduces the number of partitions needed)
  2. The partitions should have roughly the same number of user bins (reason: layout computation, which can run in parallel per partition, takes as long as the layout with the most user bins)
  3. The individual HIBFs should have reasonably good layouts, with high average load factors
  4. Searching each partition should have the same runtime on average (reason: good for offloading)

The main challenge is: how to assign user bins to partitions?

Possible solutions

  1. Assign user bins at random to partitions until each partition reaches the estimated average partition size
  2. Sort user bins by size and assign user bins of roughly equal size to partitions until each partition reaches the estimated average partition size
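Both candidate strategies can be read as "walk the user bins in some order and fill one partition after another". The sketch below follows that reading. All names are hypothetical, `sizes[i]` is assumed to be an estimated size (for example a k-mer count) of user bin `i`, and `number_of_partitions` is assumed to be at least 1; the code in this PR may work differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Walk user bins in the given order and move on to the next partition once the current one
// has reached the estimated average partition size. The last partition takes the remainder.
std::vector<std::vector<std::size_t>> fill_partitions(std::vector<std::size_t> const & order,
                                                      std::vector<std::uint64_t> const & sizes,
                                                      std::size_t const number_of_partitions)
{
    std::uint64_t const total = std::accumulate(sizes.begin(), sizes.end(), std::uint64_t{0});
    std::uint64_t const average = (total + number_of_partitions - 1) / number_of_partitions; // rounded up

    std::vector<std::vector<std::size_t>> partitions(number_of_partitions);
    std::size_t current{0};
    std::uint64_t current_size{0};
    for (std::size_t const user_bin : order)
    {
        if (current_size >= average && current + 1 < number_of_partitions)
        {
            ++current;
            current_size = 0;
        }
        partitions[current].push_back(user_bin);
        current_size += sizes[user_bin];
    }
    return partitions;
}

// Solution 1: visit user bins in random order.
std::vector<std::vector<std::size_t>> assign_random(std::vector<std::uint64_t> const & sizes,
                                                    std::size_t const number_of_partitions,
                                                    std::uint64_t const seed)
{
    std::vector<std::size_t> order(sizes.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::mt19937_64 engine(seed);
    std::shuffle(order.begin(), order.end(), engine);
    return fill_partitions(order, sizes, number_of_partitions);
}

// Solution 2: visit user bins sorted by size (largest first), so that user bins of
// similar size end up next to each other in the same partition.
std::vector<std::vector<std::size_t>> assign_sorted_by_size(std::vector<std::uint64_t> const & sizes,
                                                            std::size_t const number_of_partitions)
{
    std::vector<std::size_t> order(sizes.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&sizes](std::size_t const a, std::size_t const b) { return sizes[a] > sizes[b]; });
    return fill_partitions(order, sizes, number_of_partitions);
}
```

Note that this naive fill only closes a partition after the threshold has been reached, so a partition can overshoot the average by up to one user bin, and later partitions may end up noticeably smaller. That is at odds with goal 1 and is one reason the assignment step is the main challenge.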


codecov bot commented Nov 9, 2023

Codecov Report

Attention: 153 lines in your changes are missing coverage. Please review.

Comparison is base (bdcecb6) 95.74% compared to head (457b614) 79.42%.
Report is 7 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| src/layout/execute.cpp | 24.25% | 153 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #230       +/-   ##
===========================================
- Coverage   95.74%   79.42%   -16.33%     
===========================================
  Files          19       19               
  Lines         729      904      +175     
===========================================
+ Hits          698      718       +20     
- Misses         31      186      +155     


@seqan-actions seqan-actions added lint and removed lint labels Dec 13, 2023
@seqan-actions seqan-actions added lint and removed lint labels Jan 9, 2024
…improve user bin distribution across partitions.
@seqan-actions seqan-actions added lint and removed lint labels Feb 15, 2024