Product Quantization for Nearest Neighbor Search
Signed-off-by: Zhao Junwang <[email protected]>
zhjwpku committed Mar 6, 2024
1 parent d0695bb commit 3758fdc
Showing 9 changed files with 55 additions and 0 deletions.
1 change: 1 addition & 0 deletions src/SUMMARY.md
@@ -26,6 +26,7 @@
- [greenplum](./databases/htap/greenplum-htap.md)
- [vector db](./databases/vectordb/README.md)
- [hnsw](./databases/vectordb/hnsw.md)
- [product quantization](./databases/vectordb/pq.md)
- [citus](./databases/citus.md)
- [optimizer](./databases/optimizer/README.md)
- [executor](./databases/executor/README.md)
Binary file added src/assets/images/pq_distance_computation.png
Binary file added src/assets/images/pq_equation.png
Binary file added src/assets/images/pq_ivfadc.png
Binary file added src/assets/images/pq_subvector.png
Binary file not shown.
2 changes: 2 additions & 0 deletions src/databases/README.md
@@ -16,6 +16,7 @@
- **[Greenplum: A Hybrid Database for Transactional and Analytical Workloads][greenplum]**
- **[Vector DB](vectordb/index.html)**
- **[Hierarchical NSW][hnsw]**
- **[Product Quantization][pq]**
- **[Citus: Distributed PostgreSQL for Data-Intensive Applications][citus]**
- **[Optimizer](optimizer/index.html)**
- **[Executor](executor/index.html)**
@@ -47,3 +48,4 @@
[volcano]: executor/volcano.md
[citus]: citus.md
[hnsw]: vectordb/hnsw.md
[pq]: vectordb/pq.md
2 changes: 2 additions & 0 deletions src/databases/vectordb/README.md
@@ -13,6 +13,7 @@

### Quantization

- [Product Quantization for Nearest Neighbor Search][pq], 2010
- [A Survey of Quantization Methods for Efficient Neural Network Inference](/assets/pdfs/A_Survey_of_Quantization_Methods_for_Efficient_Neural_Network_Inference.pdf), 2021

### Misc
@@ -22,3 +23,4 @@


[hnsw]: hnsw.md
[pq]: pq.md
50 changes: 50 additions & 0 deletions src/databases/vectordb/pq.md
@@ -0,0 +1,50 @@
### [Product Quantization for Nearest Neighbor Search](/assets/pdfs/Product_Quantization_for_Nearest_Neighbor_Search.pdf)

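> https://ieeexplore.ieee.org/document/5432202

Product Quantization (PQ) is a technique for compressing high-dimensional vectors and for approximate similarity search. The idea is to split each high-dimensional vector into several lower-dimensional subvectors and to quantize each subvector independently. Splitting and quantizing in this way reduces the memory footprint and the computation cost while preserving, to a degree, the similarity between the original vectors.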

<p align="center">
<img src="/assets/images/pq_subvector.png" alt="pq subvector" width="600"/>
</p>

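PQ has two parameters:

- m: the number of subvectors; D must be divisible by m (D % m = 0)
- k\*: the number of centroids produced by clustering the D\*-dimensional subvectors of each subspace (a power of 2)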

> The input vector x is split into m distinct subvectors `u_j, 1 ≤ j ≤ m` of dimension **D\* = D/m**, where D is a multiple of m. The subvectors are quantized separately using **m distinct quantizers**.
<p align="center">
<img src="/assets/images/pq_equation.png" alt="pq equations" width="600"/>
</p>
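To make the two parameters concrete, here is a small back-of-the-envelope sketch of how much storage one PQ code needs. The values of `D`, `m`, and `k_star` and the helper name `pq_code_bits` are illustrative assumptions, not figures taken from this note.

```python
def pq_code_bits(m: int, k_star: int) -> int:
    """Bits needed for one PQ code: m centroid indices of log2(k*) bits each."""
    bits_per_index = (k_star - 1).bit_length()  # equals log2(k*) when k* is a power of 2
    return m * bits_per_index

# Illustrative values only (assumed for this example).
D, m, k_star = 128, 8, 256

raw_bytes = D * 4                            # one float32 per dimension
code_bytes = pq_code_bits(m, k_star) // 8    # one byte per subvector when k* = 256

print(raw_bytes, code_bytes)
```

Under these assumed values, a 512-byte vector is stored as an 8-byte code. The codebooks themselves add m · k\* · D\* floats of shared overhead.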

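In essence, PQ decomposes the original high-dimensional space into several low-dimensional subspaces and quantizes each subspace independently. The quantizer of each subspace can be viewed as a codebook made up of a set of discrete representative vectors (centroids). **PQ can be viewed as encoding over the Cartesian product of these subspaces.**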
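The decomposition described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper names (`kmeans`, `train_pq`, `encode`, `decode`) are hypothetical, the k-means is a bare-bones Lloyd iteration with random initialization, and the data is random.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10, seed=0):
    """Bare-bones Lloyd's k-means; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def split(x, m):
    """Split x into m contiguous subvectors of dimension D* = D / m."""
    d_star = len(x) // m
    return [x[j * d_star:(j + 1) * d_star] for j in range(m)]

def train_pq(vectors, m, k_star):
    """Learn one codebook of k* centroids per subspace."""
    assert len(vectors[0]) % m == 0, "D must be a multiple of m"
    per_subspace = zip(*(split(v, m) for v in vectors))  # j-th item: all j-th subvectors
    return [kmeans(list(subs), k_star, seed=j) for j, subs in enumerate(per_subspace)]

def encode(x, codebooks):
    """PQ code of x: the index of the nearest centroid in each subspace."""
    return [min(range(len(cb)), key=lambda c: sq_dist(u, cb[c]))
            for u, cb in zip(split(x, len(codebooks)), codebooks)]

def decode(code, codebooks):
    """Reconstruct an approximation of x by concatenating the selected centroids."""
    return [value for idx, cb in zip(code, codebooks) for value in cb[idx]]

rng = random.Random(42)
data = [[rng.random() for _ in range(8)] for _ in range(200)]  # D = 8, random data
codebooks = train_pq(data, m=4, k_star=16)                     # D* = 2
code = encode(data[0], codebooks)
approx = decode(code, codebooks)
```

Each code is a tuple of m indices, so the set of reproducible vectors is the Cartesian product of the m codebooks: (k\*)^m combinations from only m · k\* stored centroids.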

PQ has two strategies for computing distances:

- Symmetric distance computation (SDC): both the vectors x and y are represented by their respective centroids q(x) and q(y).
- Asymmetric distance computation (ADC): the database vector y is represented by q(y), but the query x is not encoded.

<p align="center">
<img src="/assets/images/pq_distance_computation.png" alt="pq distance computation" width="600"/>
</p>

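SDC quantizes the query vector before searching, so distance computation reduces to table lookups; ADC does not quantize the query, but it has to compute the distances between the query and the centroids at search time.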
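The difference between the two strategies can be illustrated with a toy example. The sketch below assumes hand-written codebooks (m = 2, k\* = 2, D = 4) and works with squared Euclidean distances; all function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def split(x, m):
    """Split x into m contiguous subvectors."""
    d_star = len(x) // m
    return [x[j * d_star:(j + 1) * d_star] for j in range(m)]

def encode(x, codebooks):
    """PQ code of x: index of the nearest centroid in each subspace."""
    return [min(range(len(cb)), key=lambda c: sq_dist(u, cb[c]))
            for u, cb in zip(split(x, len(codebooks)), codebooks)]

def adc_table(query, codebooks):
    """Per-query table: table[j][c] = distance from the query's j-th subvector to centroid c."""
    return [[sq_dist(u, c) for c in cb]
            for u, cb in zip(split(query, len(codebooks)), codebooks)]

def adc_distance(table, code_y):
    """ADC: the query is not encoded; sum m lookups into the per-query table."""
    return sum(table[j][c] for j, c in enumerate(code_y))

def sdc_tables(codebooks):
    """Query-independent tables: tables[j][a][b] = distance between centroids a and b."""
    return [[[sq_dist(a, b) for b in cb] for a in cb] for cb in codebooks]

def sdc_distance(tables, code_x, code_y):
    """SDC: both vectors are encoded; sum m lookups into the precomputed tables."""
    return sum(tables[j][a][b] for j, (a, b) in enumerate(zip(code_x, code_y)))

# Hand-written toy codebooks (an assumption for illustration, not trained).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],  # sub-quantizer 1
    [[0.0, 1.0], [1.0, 0.0]],  # sub-quantizer 2
]
x = [0.2, 0.1, 0.9, 0.2]  # query
y = [0.9, 1.1, 0.1, 0.8]  # database vector

code_x, code_y = encode(x, codebooks), encode(y, codebooks)
exact = sq_dist(x, y)
adc = adc_distance(adc_table(x, codebooks), code_y)
sdc = sdc_distance(sdc_tables(codebooks), code_x, code_y)
print(exact, adc, sdc)
```

In this toy case the exact squared distance is about 2.49, ADC estimates about 2.9, and SDC estimates 4.0. ADC pays for one m × k\* table per query; SDC reuses tables that do not depend on the query.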

> The only advantage of SDC over ADC is to limit the memory usage associated with the queries, as the query vector
> is defined by a code. This is most cases not relevant and one should then use the asymmetric version, which obtains a lower
> distance distortion for a similar complexity.

PQ is a compression algorithm; to get faster retrieval it has to be combined with other techniques, such as IVFPQ.

<p align="center">
<img src="/assets/images/pq_ivfadc.png" alt="IVFADC" width="600"/>
</p>
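As a rough sketch of how an inverted file combines with ADC: a coarse quantizer assigns each vector to an inverted list, PQ encodes the residual to the coarse centroid, and a query scans only the `w` nearest lists. The coarse centroids and codebooks below are hand-written toy values, and every name (`build_index`, `search`, `w`, `topk`) is an assumption made for this example rather than an API from the paper or from any library.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def split(x, m):
    """Split x into m contiguous subvectors."""
    d_star = len(x) // m
    return [x[j * d_star:(j + 1) * d_star] for j in range(m)]

def nearest(x, centroids):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda i: sq_dist(x, centroids[i]))

def subtract(a, b):
    return [p - q for p, q in zip(a, b)]

def pq_encode(x, codebooks):
    return [nearest(u, cb) for u, cb in zip(split(x, len(codebooks)), codebooks)]

def build_index(vectors, coarse, codebooks):
    """Inverted lists: coarse cell -> [(vector id, PQ code of the residual)]."""
    lists = {i: [] for i in range(len(coarse))}
    for vid, v in enumerate(vectors):
        cell = nearest(v, coarse)
        lists[cell].append((vid, pq_encode(subtract(v, coarse[cell]), codebooks)))
    return lists

def search(query, lists, coarse, codebooks, w=1, topk=1):
    """Scan the w nearest inverted lists, ranking entries by ADC on residuals."""
    cells = sorted(range(len(coarse)), key=lambda i: sq_dist(query, coarse[i]))[:w]
    hits = []
    for cell in cells:
        residual = subtract(query, coarse[cell])
        # Per-cell ADC table: distance from each residual subvector to each centroid.
        table = [[sq_dist(u, c) for c in cb]
                 for u, cb in zip(split(residual, len(codebooks)), codebooks)]
        for vid, code in lists[cell]:
            hits.append((sum(table[j][c] for j, c in enumerate(code)), vid))
    return sorted(hits)[:topk]

# Toy values, assumed for illustration (D = 4, two coarse cells, m = 2, k* = 2).
coarse = [[0.0, 0.0, 0.0, 0.0], [10.0, 10.0, 10.0, 10.0]]
codebooks = [
    [[-1.0, -1.0], [1.0, 1.0]],
    [[-1.0, 1.0], [1.0, -1.0]],
]
vectors = [
    [1.0, 1.0, 1.0, -1.0],
    [-1.0, -1.0, -1.0, 1.0],
    [11.0, 11.0, 9.0, 11.0],
]
lists = build_index(vectors, coarse, codebooks)
result = search([0.9, 1.2, 0.8, -1.1], lists, coarse, codebooks, w=1, topk=1)
print(result)
```

With `w=1` only the list of the nearest coarse centroid is scanned, which is where the speedup over exhaustive ADC comes from; a larger `w` trades speed for recall.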

#### References:

- [Product Quantization: Compressing high-dimensional vectors by 97%](https://www.pinecone.io/learn/series/faiss/product-quantization/)
