Fix the problem of low recall rate of glass(hnsw) in multi-core environment #11

Open: wants to merge 6 commits into base: master

cwj0bzxg

This PR fixes the issue described in #10.

In the Graph structure of glass (HNSW), each neighbor list stores the neighbors' internal IDs, and a node's internal ID is used to look up its neighbor list. I noticed that the neighbor lists can store external IDs instead, with the external ID used to look up the corresponding neighbor list. The two schemes describe the same graph, but the second avoids the low recall caused by internal and external IDs getting out of sync.
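To illustrate why the two schemes are equivalent, here is a toy sketch in plain Python. It is not glass's actual data layout: `relabel_to_external`, `internal_to_external`, and the 4-node graph are hypothetical names and data chosen for illustration. It assumes internal IDs are a permutation of external IDs (for example, because nodes were inserted in a different order than they appear in the input).

```python
def relabel_to_external(internal_graph, internal_to_external):
    """Rewrite neighbor lists so that both the row index and the stored
    neighbor IDs are external IDs instead of internal IDs."""
    external_graph = [None] * len(internal_graph)
    for internal_id, neighbors in enumerate(internal_graph):
        external_id = internal_to_external[internal_id]
        external_graph[external_id] = [internal_to_external[v] for v in neighbors]
    return external_graph


def edges(graph, label=lambda v: v):
    """Set of directed edges (u, v), optionally relabeled through `label`."""
    return {(label(u), label(v)) for u, nbrs in enumerate(graph) for v in nbrs}


# Hypothetical 4-node graph: internal ID i corresponds to external ID
# internal_to_external[i].
internal_to_external = [2, 0, 3, 1]
internal_graph = [[1, 2], [0, 3], [0], [1]]

external_graph = relabel_to_external(internal_graph, internal_to_external)
print(external_graph)

# Both layouts encode the same edges once internal IDs are mapped back.
same = edges(external_graph) == edges(internal_graph, lambda v: internal_to_external[v])
print(same)
```

With the external-ID layout, the IDs returned by a search can be used directly against the original data; with the internal-ID layout, forgetting the final mapping step returns the wrong points, which shows up as low recall.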

There is an example in the following branch: https://github.com/cwj0bzxg/pyglass/blob/fix_bug_deep10M/main.cpp

@Wainberg

Wainberg commented Jul 2, 2024

Thanks for fixing this! Unfortunately, this version appears to be non-deterministic, apparently because of a concurrency issue in Build(). The following example is deterministic as written, but if you remove the "with threadpoolctl.threadpool_limits(...)" line, so that the build is no longer limited to one OpenMP thread, the output becomes non-deterministic:

import glassppy as glass
import numpy as np
import threadpoolctl

n, d = 10000, 128
np.random.seed(0)
X = np.random.randn(n, d)
Y = np.random.randn(d)

index = glass.Index(index_type="HNSW", dim=d, metric="L2", R=32, L=50)
with threadpoolctl.threadpool_limits(limits=1, user_api='openmp'):
    graph = index.build(X)

searcher = glass.Searcher(graph=graph, data=X, metric="L2", level=1)
searcher.set_ef(32)
ret = searcher.search(query=Y, k=10)
print(ret)
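One way to make this kind of check repeatable is to run the whole build-and-search several times and compare the outputs. The sketch below is only an illustration: `is_deterministic`, `stable`, and `unstable` are hypothetical names, and the two stand-in functions replace the real glass calls so that the snippet runs without glass installed.

```python
import itertools

import numpy as np


def is_deterministic(run, repeats=3):
    """Return True if run() produces identical arrays on every call."""
    baseline = np.asarray(run())
    for _ in range(repeats - 1):
        if not np.array_equal(baseline, np.asarray(run())):
            return False
    return True


# Stand-in for a single-threaded build + search: same result each time.
def stable():
    return np.arange(10)


# Stand-in for a racy build: the result changes from call to call.
_shift = itertools.count()


def unstable():
    return np.roll(np.arange(10), next(_shift))


print(is_deterministic(stable))
print(is_deterministic(unstable))
```

To apply it to the example above, wrap the build and search steps in a function that returns `ret`, and pass that function as `run`, once with the thread limit in place and once without.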

@handsomeZhuang

May I ask whether recall actually improves with this change? When I run the main.cpp example, I get a recall of only about 81%.

@cwj0bzxg
Author

cwj0bzxg commented Sep 4, 2024

Yes, recall improves compared with the original. I just tested it on the deep10m dataset: recall=97.8973% at ef=500 and recall=99.2578% at ef=1000. The deep10m dataset was downloaded with create_dataset.py from https://github.com/harsha-simhadri/big-ann-benchmarks.
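The thread does not say exactly how recall was measured. A common convention, assumed here, is recall@k: the fraction of the true k nearest neighbors that appear in the returned top-k, averaged over all queries. A minimal sketch follows; `recall_at_k` and the toy arrays are hypothetical and are not taken from main.cpp.

```python
import numpy as np


def recall_at_k(found, truth, k):
    """Average fraction of the true top-k neighbors present in the returned top-k.

    found, truth: integer arrays of shape (num_queries, >= k) holding point IDs.
    """
    found = np.asarray(found)[:, :k]
    truth = np.asarray(truth)[:, :k]
    hits = sum(len(set(f) & set(t)) for f, t in zip(found.tolist(), truth.tolist()))
    return hits / (truth.shape[0] * k)


# Toy data: two queries, k = 3. The order within a row does not matter.
found = [[0, 1, 2], [3, 4, 5]]
truth = [[0, 2, 9], [5, 4, 3]]
recall = recall_at_k(found, truth, k=3)
print(f"recall@3 = {recall:.4f}")
```

The IDs in `found` and `truth` must refer to the same numbering (external IDs) for the comparison to be meaningful.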

@handsomeZhuang

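On my side, the test dataset is SIFT1M (128 dimensions); with ef=1000 the recall is about 91%. Could you share the other parameters you use for the build? For example, X and Y in index = std::make_unique<glass::HNSW>(dim, "L2", X, Y), and the argument Z passed to Optimize(Z)?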

@cwj0bzxg cwj0bzxg closed this Oct 8, 2024
@cwj0bzxg cwj0bzxg reopened this Oct 14, 2024