Computing hashes from embeddings #1

Open
rom1504 opened this issue Oct 17, 2022 · 5 comments

Comments

@rom1504

rom1504 commented Oct 17, 2022

facebookresearch/faiss#2531 (comment) some thoughts here

https://docs.google.com/document/d/1AryWpV0dD_r9x82I_quUzBuRyzDotL_HHnKuNB9H3Zc/edit?usp=drivesdk more thoughts there

Also this LAION-AI/project-menu#28

@rom1504
Author

rom1504 commented Oct 17, 2022

I'm trying to find a function f such that, for embeddings e of (let's say) 1024 floats, most pairs e1, e2 in the space of my embeddings satisfy:
h1 = f(e1)
h2 = f(e2)
if sim(e1, e2) > threshold then h1 = h2, and if sim(e1, e2) < threshold then h1 != h2.
Preferably, h would be encoded in a small number of bytes.
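
To make the target property concrete, here is a minimal sketch of how a candidate f could be scored against it. It assumes sim is cosine similarity (the thread does not say which similarity is meant), and `collision_stats` is a hypothetical helper name, not something from faiss or any other library.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length float vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def collision_stats(embeddings, f, threshold):
    """Count how often hash equality agrees with the similarity threshold.

    f is any candidate hash function; sim is assumed to be cosine here.
    """
    hashes = [f(e) for e in embeddings]
    stats = {
        "true_collisions": 0,    # similar and same hash (wanted)
        "false_collisions": 0,   # dissimilar but same hash
        "missed_collisions": 0,  # similar but different hash
        "true_separations": 0,   # dissimilar and different hash (wanted)
    }
    for i, j in combinations(range(len(embeddings)), 2):
        similar = cosine(embeddings[i], embeddings[j]) > threshold
        collide = hashes[i] == hashes[j]
        if similar and collide:
            stats["true_collisions"] += 1
        elif collide:
            stats["false_collisions"] += 1
        elif similar:
            stats["missed_collisions"] += 1
        else:
            stats["true_separations"] += 1
    return stats
```

The pairwise loop is quadratic, so this is only meant for evaluating f on a sample, not on a full dataset.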

One use case of such a function f would be to perform efficient deduplication of items represented by embeddings.
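
As a rough illustration of the deduplication use case: once some f exists, deduplication reduces to grouping items by hash value. The sketch below uses a toy rounding-based hash as a stand-in for f; both `deduplicate` and `toy_hash` are hypothetical names introduced for this example.

```python
from collections import defaultdict


def toy_hash(e):
    """Placeholder for f: round every coordinate to one decimal (assumption)."""
    return tuple(round(x, 1) for x in e)


def deduplicate(embeddings, f):
    """Group item indices by hash and keep the first item of each bucket."""
    buckets = defaultdict(list)
    for idx, e in enumerate(embeddings):
        buckets[f(e)].append(idx)
    kept = sorted(ids[0] for ids in buckets.values())
    return kept, dict(buckets)


items = [[0.11, 0.92], [0.12, 0.91], [-0.5, 0.3]]
kept, buckets = deduplicate(items, toy_hash)
print(kept)  # indices of the items that survive deduplication
```

This runs in a single pass over the data, which is the main appeal compared with comparing every pair of embeddings.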

I think it would be possible to train a neural net directly to be f, but I'm wondering whether the quantization techniques implemented in faiss could also work well.

Maybe the encodings produced by IndexLSH could work. The codes produced by a PQ index might be helpful too.
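
For reference, a minimal sketch of sign-random-projection hashing, the general idea behind LSH-style binary codes. It does not call faiss, and the exact encoding used by IndexLSH may differ; `make_hyperplanes`, `lsh_hash` and `hamming` are hypothetical helper names.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian directions of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_hash(e, hyperplanes):
    """One bit per hyperplane: 1 if e lies on its non-negative side."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, e))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    n_bytes = (len(hyperplanes) + 7) // 8
    return bits.to_bytes(n_bytes, "big")


def hamming(h1, h2):
    """Number of differing bits between two equal-length hashes."""
    return sum(bin(a ^ b).count("1") for a, b in zip(h1, h2))


planes = make_hyperplanes(dim=1024, n_bits=64)
rng = random.Random(1)
e1 = [rng.gauss(0.0, 1.0) for _ in range(1024)]
e2 = [x + rng.gauss(0.0, 0.01) for x in e1]  # small perturbation of e1
print(len(lsh_hash(e1, planes)), hamming(lsh_hash(e1, planes), lsh_hash(e2, planes)))
```

Nearby embeddings tend to get hashes with a small Hamming distance rather than exactly equal hashes, so using such codes for exact-match deduplication would likely need fewer bits or an extra bucketing step.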

@rom1504
Author

rom1504 commented Oct 17, 2022

On the trained network path:

We could probably generate a bunch of positive and negative pairs using the existing faiss index,
then use them to train f with a triplet loss or a CLIP-style contrastive loss.

f would need to quantize the embedding into a small number of bytes that can be compared for equality.
We might need to use something like Gumbel softmax for that, not sure.
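
A minimal sketch of the pair-mining and triplet-loss idea. Brute-force cosine similarity stands in for the faiss index here, the loss is computed on plain float vectors (no training loop, no quantization step), and `mine_triplets` and `triplet_loss` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length float vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mine_triplets(embeddings, threshold):
    """Return (anchor, positive, negative) index triplets.

    Positives have sim > threshold with the anchor, negatives have sim < threshold.
    A real setup would query a nearest-neighbour index instead of scanning all pairs.
    """
    triplets = []
    n = len(embeddings)
    for a in range(n):
        sims = {j: cosine(embeddings[a], embeddings[j]) for j in range(n) if j != a}
        positives = [j for j, s in sims.items() if s > threshold]
        negatives = [j for j, s in sims.items() if s < threshold]
        triplets.extend((a, p, neg) for p in positives for neg in negatives)
    return triplets


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss pushing the positive closer to the anchor than the negative."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)
```

In a trained version, the three arguments of `triplet_loss` would be the outputs of f before binarization, and the margin value of 0.2 is an arbitrary choice for illustration.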

@rom1504
Author

rom1504 commented Oct 17, 2022

But it seems to me that the quantization performed in faiss is quite similar to what we want here, although it might not be fully optimizable towards the right task.
Maybe https://github.com/lucidrains/vector-quantize-pytorch
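
To illustrate why quantization looks close to the goal, here is a minimal sketch of product-quantization-style encoding used as a hash: the embedding is split into sub-vectors and each one is replaced by the index of its nearest centroid. It uses neither faiss nor vector-quantize-pytorch, the codebooks are assumed to be given (for example from k-means), and `nearest_centroid` and `pq_hash` are hypothetical names.

```python
def nearest_centroid(sub, centroids):
    """Index of the centroid closest to sub in squared Euclidean distance."""
    best_index, best_dist = 0, float("inf")
    for i, c in enumerate(centroids):
        dist = sum((x - y) ** 2 for x, y in zip(sub, c))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def pq_hash(e, codebooks):
    """Encode e as one byte per sub-vector.

    codebooks is a list of m codebooks; each holds at most 256 centroids
    of dimension len(e) // m (assumed to divide evenly).
    """
    sub_dim = len(e) // len(codebooks)
    code = bytearray()
    for k, centroids in enumerate(codebooks):
        sub = e[k * sub_dim:(k + 1) * sub_dim]
        code.append(nearest_centroid(sub, centroids))
    return bytes(code)


# Toy codebooks: 2 sub-spaces of dimension 2, with 2 centroids each.
codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 1.0]]]
print(pq_hash([0.1, -0.1, 0.9, 1.1], codebooks))
```

Two embeddings get the same hash only if every sub-vector falls in the same cell, so the number of sub-vectors and centroids controls how strict the match is. Codebooks fitted for reconstruction error are not necessarily the ones that best separate duplicates from non-duplicates, which matches the concern raised above.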

@mehdidc
Collaborator

mehdidc commented Oct 17, 2022

@rom1504 related to what you said about the trained network path: we might want to look at what people do in deep hashing. For example, DistillHash (https://openaccess.thecvf.com/content_CVPR_2019/papers/Yang_DistillHash_Unsupervised_Deep_Hashing_by_Distilling_Data_Pairs_CVPR_2019_paper.pdf) seems relevant. It specifically deals with the case where we can sample positive/negative pairs following some pre-defined criterion, and it learns a hash function that preserves the pair relationships.

@rom1504
Author

rom1504 commented Oct 24, 2022

https://www.algolia.com/blog/ai/vectors-vs-hashes/ learns binary hashes with a neural net.
It's only a speed trick and doesn't change anything compared to using embeddings here, but it's interesting.
