
Reverse image search 2.0 #401

Draft: wants to merge 7 commits into base: master

Conversation

liamwhite
Contributor

This PR replaces the old "image intensities" reverse image search, and has come about due to the confluence of several key factors within the past year:

  • Computer vision utilities like those in PyTorch have become more accessible than ever, with native language bindings like tch-rs removing the need for a Python server
  • The self-distillation vision transformers DINOv2 and DINOv2 with registers have been released, which come with pretrained weights that extract semantic features from images without the need for a finetuned head. The authors claim that these models can extract robust features for any type of downstream task as-is. I believe they are underselling how good the features are, and found the recall to be excellent during model selection.
  • The OpenSearch project has released the k-NN plugin, which enables nearest neighbor search over dense vectors, like the kind representing the CLS token of a ViT.

Together, these factors are used to implement a reverse image search system that identifies images by their semantic content rather than by their overall appearance. To illustrate what this means, here are some examples of an original image and the matches found when running the search against Derpibooru:

Demo results (screenshots): line art, hamburger, Trixie, and scenery queries.

That DINOv2 extracts semantic features can be seen in the attention maps generated for these images. The code to generate the attention maps can be found in this repository. The maps below have been reprocessed at a higher scale for visibility:

Attention maps (images): scaled originals 442297, 1110529, 1188964, and 3515313 alongside their corresponding attention maps.

The system works as follows:

  1. Image/video is previewed into a raw RGB bitmap
  2. Bitmap is resampled to model target dimensions
  3. Classification vector is retrieved from model
  4. Classification vector is normalized to convert the k-NN search into one ordered by cosine similarity, and delivered back to the application (see the sketch after this list)
  5. For indexing, the normalized vector is stored as a nested field in the image search index; for search, the nearest neighbors are retrieved using an HNSW index
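
As a rough illustration of steps 2 through 4, here is a minimal tch-rs sketch. It assumes the DINOv2 model has been exported as a TorchScript module whose forward pass returns the CLS embedding directly; the 518x518 input size and the ImageNet mean/std constants are illustrative assumptions rather than the exact values used in this PR.

```rust
use tch::{CModule, Kind, Tensor};

/// Resample a raw CHW u8 bitmap, run the ViT, and return a unit-length
/// CLS embedding suitable for cosine-similarity k-NN search.
fn embed(rgb_bitmap: &Tensor, model: &CModule) -> Result<Tensor, tch::TchError> {
    // Step 2: resample to the model's target dimensions (assumed 518x518,
    // a multiple of DINOv2's 14-pixel patch size).
    let resized = tch::vision::image::resize(rgb_bitmap, 518, 518)?;

    // Scale to [0, 1] and apply standard ImageNet normalization.
    let mean = Tensor::from_slice(&[0.485f32, 0.456, 0.406]).view([3, 1, 1]);
    let std = Tensor::from_slice(&[0.229f32, 0.224, 0.225]).view([3, 1, 1]);
    let input = ((resized.to_kind(Kind::Float) / 255.0 - mean) / std).unsqueeze(0);

    // Step 3: forward pass; assumed to return the CLS embedding of shape [1, dim].
    let cls = model.forward_ts(&[input])?;

    // Step 4: L2-normalize so that nearest-neighbor ordering by inner product
    // or L2 distance is the same as ordering by cosine similarity.
    Ok(&cls / cls.norm())
}
```

The normalization in step 4 is what lets a plain k-NN index return cosine-similarity-ordered results without any special scoring configuration.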

Indexing the classification vector using a nested field allows for the possibility of extracting multiple vectors from each image, and the database table has been set up to allow this should it be desired in the future.
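
To make step 5 concrete, here is a sketch of the nearest-neighbor query sent to OpenSearch, built with serde_json. The field names ("features" and "features.v") and the result count are hypothetical placeholders for illustration; only the nested/knn query shape comes from the k-NN plugin.

```rust
use serde_json::{json, Value};

/// Build an OpenSearch k-NN query over a nested knn_vector field.
/// Field names are placeholders, not the actual index layout of this PR.
fn knn_query(embedding: &[f32], k: usize) -> Value {
    json!({
        "size": k,
        "query": {
            "nested": {
                "path": "features",
                "query": {
                    "knn": {
                        "features.v": {
                            // Unit-length CLS vector; with normalized vectors
                            // the neighbor ordering matches cosine similarity.
                            "vector": embedding,
                            "k": k
                        }
                    }
                }
            }
        }
    })
}
```

Because the vectors live under a nested field, the same query shape keeps working if multiple embeddings per image are indexed later.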

I have pre-computed the DINOv2 with registers features for ~3.5M images on Derpibooru, ~400K images on Furbooru, and ~35K images on Tantabus. Batch inference was run on a 3060 Ti using code from this repository, with the entire process heavily bottlenecked by memory copy bandwidth and image decode performance rather than the GPU execution itself. However, the inference code is efficient enough to run on a CPU in less than 0.5 seconds per image, and this is what is implemented in the repository (with the expectation that there will be no GPU requirement on the server).

This PR must not be merged until OpenSearch releases version 2.19, as 2.18 contains a critical bug that prevents the system from working in all cases. Other bugs relating to filtering may or may not also be fixed in the 2.19 release, but have been worked around for now.

This PR must also not be merged until its dependencies #389 and #400 are merged.

Fixes #331, "Reverse search improvement: store non-transparent intensities of transparent images" (the method described there is now outdated).

liamwhite marked this pull request as draft on January 13, 2025 00:42
@Meow
Member

Meow commented Jan 13, 2025

If this is meant to be a replacement for the deduper, I require that this be tested extensively to ensure there are no false negatives. I'd rather have 10 dupe reports, 1 of which is correct, than fewer reports while duplicate images happily live on the site.

@VcSaJen
Contributor

VcSaJen commented Jan 16, 2025

I can see that it can successfully detect mirrored images and variations of the same image. Can it detect slightly cropped images, images with different brightness levels, etc.? How does it perform on very thin and tall (webtoon/manhwa-like comic) images?

@liamwhite
Contributor Author

liamwhite commented Jan 16, 2025

@VcSaJen

slightly cropped images

Yes, it can find images which are quite a bit more than slightly cropped. In this case the overall scores are much lower (around 0.7 cosine similarity, which is not indicative of a duplicate, vs >0.9 for actual duplicates) and the original may not score the highest in the result list.
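
As a minimal sketch of how such scores could be interpreted, assuming the unit-length embeddings described above: cosine similarity reduces to a dot product, and the 0.9 cutoff below simply restates the "actual duplicates" figure quoted here, not a tuned constant from this PR.

```rust
/// Cosine similarity of two unit-length embeddings is just their dot product.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Illustrative check using the >0.9 "actual duplicate" figure quoted above.
fn looks_like_duplicate(a: &[f32], b: &[f32]) -> bool {
    cosine_similarity(a, b) > 0.9
}
```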

images with different brightness levels

Absolutely no problem handling this. Example with 0.98 cosine similarity:

Original vs. brightened comparison (images): 3518144_orig and 3518144.

very thin and tall (webtoon-manhwa-like comics) images

Performance is reasonably good if the exact same comic image is reverse searched, although the features are not terribly stable. It doesn't find individual panels or crops well, though I added enough flexibility that this could become possible in the future.
