scale clip back to 5B items (#122)
* scale clip back to 5B items

* new arrow provider in clip back
* index combiner script
* parquet to arrow script

* use arrow properly and add dedup feature
rom1504 authored Mar 12, 2022
1 parent 8bdd3cd commit a206f79
Showing 9 changed files with 363 additions and 84 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -169,6 +169,8 @@ That decreases the memory usage to zero.
A `--enable_hdf5 True` option can be passed to enable hdf5 caching for the metadata.
HDF5 caching makes it possible to use the metadata with almost no memory usage.

`--use_arrow True` allows using arrow instead of hdf5. Should be used along with [clip_back_prepro](clip_back_prepro) for very large datasets (billions)

`--reorder_metadata_by_ivf_index True` option takes advantage of the data locality property of results of a knn ivf indices: it orders the metadata collection in order of the IVF clusters. That makes it possible to have much faster metadata retrieval as the reads are then accessing a few mostly sequential parts of the metadata instead of many non sequential parts. In practice that means being able to retrieve 1M items in 1s whereas only 1000 items can be retrieved in 1s without this method. This will order the metadata using the first image index.
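The data-locality idea described above can be illustrated with a minimal sketch. It is not the project's implementation, which this diff does not show. It assumes each item has a known IVF cluster id, and the names `reorder_by_cluster` and `count_runs` are hypothetical helpers introduced only for this illustration.

```python
def reorder_by_cluster(metadata, cluster_ids):
    """Group metadata rows by cluster id.

    Returns the reordered rows and a list mapping each old position
    to its new position. Hypothetical helper, for illustration only.
    """
    if len(metadata) != len(cluster_ids):
        raise ValueError("metadata and cluster_ids must have the same length")
    # sorted() is stable, so items keep their relative order inside a cluster.
    order = sorted(range(len(metadata)), key=lambda i: cluster_ids[i])
    reordered = [metadata[i] for i in order]
    old_to_new = [0] * len(order)
    for new_pos, old_pos in enumerate(order):
        old_to_new[old_pos] = new_pos
    return reordered, old_to_new


def count_runs(positions):
    """Count the contiguous ranges needed to read the given positions."""
    runs = 0
    previous = None
    for p in sorted(positions):
        if previous is None or p != previous + 1:
            runs += 1
        previous = p
    return runs


# Toy data: six metadata rows and the (assumed) IVF cluster of each row.
metadata = ["a", "b", "c", "d", "e", "f"]
cluster_ids = [2, 0, 1, 0, 2, 1]

reordered, old_to_new = reorder_by_cluster(metadata, cluster_ids)

# A query that lands in cluster 2 returns the items at old positions 0 and 4.
hits_before = [0, 4]
hits_after = [old_to_new[i] for i in hits_before]

print(count_runs(hits_before))  # separate reads needed in the original order
print(count_runs(hits_after))   # separate reads needed after reordering
```

In this toy case the two hits sit in two separate places before reordering and in one contiguous range afterwards, which is the effect the option relies on to turn scattered reads into mostly sequential ones.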

hdf5 caching is a good idea to use if:
4 changes: 4 additions & 0 deletions clip_retrieval/cli.py
@@ -1,11 +1,13 @@
"""cli entry point"""

from clip_retrieval.clip_back_prepro.parquet_to_arrow import parquet_to_arrow
from clip_retrieval.clip_back import clip_back
from clip_retrieval.clip_inference import clip_inference
from clip_retrieval.clip_filter import clip_filter
from clip_retrieval.clip_index import clip_index
from clip_retrieval.clip_end2end import clip_end2end
from clip_retrieval.clip_front import clip_front
from clip_retrieval.clip_back_prepro.index_combiner import index_combiner
import fire


@@ -19,6 +21,8 @@ def main():
"filter": clip_filter,
"end2end": clip_end2end,
"front": clip_front,
"index_combiner": index_combiner,
"parquet_to_arrow": parquet_to_arrow,
}
)
