WSDM'22 Best Paper: Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval

RepCONC

This is the official repo for our WSDM'22 paper, Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval (Best Paper Award).

Quick Tour

In this work, we propose RepCONC, which models the quantization process as CONstrained Clustering and trains the dual-encoders and the quantization method end to end. Constrained clustering combines a clustering loss with a uniform clustering constraint. The clustering loss pulls the embeddings toward the quantization centroids, which makes end-to-end optimization possible, and the constraint forces the embeddings to be spread uniformly across all centroids, which maximizes distinguishability. The training process and the clustering constraint are visualized as follows:

[Figures: Training process; Constrained Clustering]
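The two ingredients described above can be illustrated with a small, dependency-free sketch. It is not the repository's implementation: the Sinkhorn-style rebalancing, the function names, the temperature, and the toy data are all assumptions chosen for illustration. The sketch assigns toy embeddings to two centroids, first by plain nearest-centroid lookup and then under an approximately uniform constraint, and computes the clustering loss for the constrained assignment.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_assign(embeddings, centroids):
    """Unconstrained baseline: index of the closest centroid per embedding."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(e, centroids[j]))
            for e in embeddings]


def balanced_assign(embeddings, centroids, temperature=0.1, n_iters=50):
    """Soft assignment rebalanced by Sinkhorn-style steps (an assumption here).

    Returns an n x k matrix whose rows sum to 1 and whose columns each
    approach n / k, so every centroid receives about the same mass.
    """
    n, k = len(embeddings), len(centroids)
    q = []
    for e in embeddings:
        dists = [sq_dist(e, c) for c in centroids]
        d_min = min(dists)  # shift so the largest affinity in a row is 1
        q.append([math.exp(-(d - d_min) / temperature) for d in dists])
    for _ in range(n_iters):
        # Column step: rescale each centroid's total mass to n / k.
        col = [sum(row[j] for row in q) for j in range(k)]
        q = [[row[j] * (n / k) / col[j] for j in range(k)] for row in q]
        # Row step: each embedding distributes exactly one unit of mass.
        q = [[v / sum(row) for v in row] for row in q]
    return q


def clustering_loss(embeddings, centroids, hard):
    """Mean squared distance between each embedding and its assigned centroid."""
    return sum(sq_dist(e, centroids[j])
               for e, j in zip(embeddings, hard)) / len(embeddings)


def counts(hard, k):
    """Number of embeddings assigned to each of the k centroids."""
    return [hard.count(j) for j in range(k)]


# Toy data: six points lie closer to the first centroid, two to the second.
centroids = [[0.0, 0.0], [1.0, 0.0]]
embeddings = [[x, 0.0] for x in (0.0, 0.1, 0.2, 0.3, 0.4, 0.45, 0.9, 1.0)]

plain = nearest_assign(embeddings, centroids)
soft = balanced_assign(embeddings, centroids)
balanced = [max(range(len(row)), key=row.__getitem__) for row in soft]

print("nearest-centroid counts:", counts(plain, 2))
print("balanced counts:        ", counts(balanced, 2))
print("clustering loss (balanced):",
      clustering_loss(embeddings, centroids, balanced))
```

On this toy data the plain assignment is skewed toward the first centroid, while the rebalanced assignment splits the points evenly at the cost of a somewhat larger clustering loss. That trade-off is what the constraint is meant to manage.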

RepCONC achieves compression ratios ranging from 64x to 768x. It supports fast embedding search thanks to the adoption of IVF (inverted file index). With these designs, it outperforms a wide range of first-stage retrieval methods in terms of effectiveness, memory efficiency, and time efficiency. RepCONC also substantially boosts the second-stage ranking performance, as shown below:
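To put the 64x and 768x compression ratios in context, here is a short worked calculation. It rests on assumptions this README does not state: 768-dimensional float32 embeddings, and a product-quantization code that stores one byte per sub-vector (256 centroids each). The function name and parameters are illustrative only, and codebook storage is ignored.

```python
def compression_ratio(dim, n_subvectors, bytes_per_float=4, bytes_per_code=1):
    """Ratio of raw embedding size to quantized code size, per vector."""
    raw_bytes = dim * bytes_per_float          # uncompressed float32 vector
    code_bytes = n_subvectors * bytes_per_code  # one centroid id per sub-vector
    return raw_bytes / code_bytes


# Assumed setting: 768-dim embeddings, varying number of sub-vectors.
for m in (48, 24, 16, 8, 4):
    print(f"{m:>2} sub-vectors -> {compression_ratio(768, m):.0f}x")
```

Under these assumptions, 48 sub-vectors give 64x and 4 sub-vectors give 768x, which matches the range quoted above.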

Installation

Install RepCONC from our code:

git clone https://github.com/jingtaozhan/RepCONC
cd RepCONC
pip install . --use-feature=in-tree-build # built in-place without first copying to a temporary directory.

In addition, two dependencies must be installed manually: PyTorch and Faiss. Both require platform-specific configuration, so they are not listed in the requirements and their installation is left to you.
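Since these two packages are not installed automatically, a quick check that they can be imported may save some debugging time. The helper below is a hypothetical convenience that is not part of RepCONC. It relies only on the standard library and on the usual import names `torch` and `faiss`.

```python
import importlib.util


def missing_modules(names=("torch", "faiss")):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Please install manually:", ", ".join(missing))
else:
    print("PyTorch and Faiss are importable.")
```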

How to use

RepCONC is an easy-to-use toolbox for compressing the index of any dense retrieval model. It jointly optimizes the dense encoders and the index, so that high retrieval effectiveness is obtained even with a very compact index. The code separates the design of dense retrieval models from the joint optimization process, so it supports any dense retrieval model, whether or not it is built in.

Here are several examples of how to use RepCONC to compress the index for different dense retrieval models. They are a good starting point if you want to apply RepCONC to your own dense retrieval models. Since RepCONC has several built-in dense retrieval models, it can compress the index of many dense models without any extra code. For example:

Even if a dense retrieval model is not built in, it is still easy to apply RepCONC to it. Just make the API of the model class and tokenizer consistent with the built-in ones and you are good to go. For example, ANCE and TCT-ColBERT-v2 have customized model definitions and tokenization. Here is how RepCONC compresses their indexes.

Citation

If you find this repo useful, please consider citing our work:

@inproceedings{zhan2022learning,
author = {Zhan, Jingtao and Mao, Jiaxin and Liu, Yiqun and Guo, Jiafeng and Zhang, Min and Ma, Shaoping},
title = {Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval},
year = {2022},
publisher = {Association for Computing Machinery},
url = {https://doi.org/10.1145/3488560.3498443},
doi = {10.1145/3488560.3498443},
booktitle = {Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining},
pages = {1328–1336},
numpages = {9},
location = {Virtual Event, AZ, USA},
series = {WSDM '22}
}
