NLLB-CLIP with SigLIP + small tokenizer fix #741
Could you share results? They would also fit well in the readme or its
referenced files
…On Sat, Nov 18, 2023, 23:48 Alexander Visheratin ***@***.***> wrote:
Hi! I trained NLLB-CLIP models with SigLIP (ViT and loss). They perform
much better than the previous version across all benchmarks.
I'm also working on integrating the multilingual benchmarks from the paper
<https://arxiv.org/abs/2309.01859> into the CLIP benchmark
<https://github.com/LAION-AI/CLIP_benchmark>. To make it work with the
NLLB tokenizer, I had to change the tokenizer method to batch_encode_plus
because the default __call__ doesn't take language-specific prefix tokens
into account.
Commit Summary
- faf7f75 Added configs.
- d657bc3 Added links to pretrained models.
- 69ddb46 Merge branch 'main' into main
- 94cbf14 Add NLLB-CLIP base/large results
- e0c8e63 Merge branch 'main' of https://github.com/visheratin/open_clip into main
- b41e962 Merge branch 'main' of https://github.com/visheratin/open_clip into main
- b14386c Added new version of NLLB-CLIP.
File Changes
(4 files <https://github.com/mlfoundations/open_clip/pull/741/files>)
- *A* src/open_clip/model_configs/nllb-clip-base-siglip.json
<https://github.com/mlfoundations/open_clip/pull/741/files#diff-80ed12b868285320e888ad515f39b6090eeb2c0a101364cb14ded2474f4a5e87>
(18)
- *A* src/open_clip/model_configs/nllb-clip-large-siglip.json
<https://github.com/mlfoundations/open_clip/pull/741/files#diff-9c68def1ad4f68ad3959fa464bdfe3c154086142f4acd39eb36cf34c88ae1501>
(18)
- *M* src/open_clip/pretrained.py
<https://github.com/mlfoundations/open_clip/pull/741/files#diff-321c5632a009a9b3bc66a731132fc10db2df82f78c6cdc84081e3cc9e10e013a>
(17)
- *M* src/open_clip/tokenizer.py
<https://github.com/mlfoundations/open_clip/pull/741/files#diff-d9902132a1bebeb30786ee67a47565c086ad5c9b639e75b89cbdefec6e081821>
(7)
Patch Links:
- https://github.com/mlfoundations/open_clip/pull/741.patch
- https://github.com/mlfoundations/open_clip/pull/741.diff
|
Here are the results for Crossmodal-3600 and XTD10 datasets. I didn't evaluate the models on English-only datasets. I think it may make sense to add a separate benchmark CSV file for multilingual models to the docs. |
I think it would be interesting to see how it compares to the multilingual ViT-H/14 in this repo.
Yes, it could indeed make sense to have a separate CSV for multilingual models.
…On Sun, Nov 19, 2023 at 1:37 AM Alexander Visheratin < ***@***.***> wrote:
Here
<https://github.com/mlfoundations/open_clip/files/13401786/nllb-clip.csv>
are the results for Crossmodal-3600 and XTD10 datasets. I didn't evaluate
the models on English-only datasets. I think it may make sense to add a
separate benchmark CSV file for multilingual models to the docs.
|
Can you tell me its model id and pretrained name? I have the testbed set up and running right now. Regarding other benchmarks, NLLB-CLIP base and large outperform SigLIP ViT-G (page 16) on text-to-image. |
@gabrielilharco can you share the script you use to create benchmark CSV files for the repo (like this) from CLIP benchmark outputs? |
xlm-roberta-large-ViT-H-14 |
Here is the file. Very impressive results! NLLB-CLIP large is a bit better on text-to-image. I'm wondering why there is such a discrepancy between t2i and i2t results for my models. Maybe undertrained text encoder. |
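For readers comparing the t2i and i2t numbers, here is a minimal sketch of how the two directions can give different recall@k from the same similarity matrix. It assumes a simple hit-rate definition of recall@k (a query counts as a hit if any correct item is in its top k); the CLIP benchmark's exact definition may differ. The helper names `recall_at_k` and `transpose` and the toy scores are illustrative placeholders, not taken from this PR or from real results.

```python
from typing import List, Set


def recall_at_k(similarity: List[List[float]], positives: List[Set[int]], k: int) -> float:
    """Fraction of queries with at least one correct candidate in the top k.

    similarity[q][c] is the score between query q and candidate c;
    positives[q] is the set of candidate indices that are correct for query q.
    """
    hits = 0
    for scores, correct in zip(similarity, positives):
        # Candidate indices sorted by descending score, truncated to k.
        top_k = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:k]
        if correct.intersection(top_k):
            hits += 1
    return hits / len(similarity)


def transpose(matrix: List[List[float]]) -> List[List[float]]:
    """Swap queries and candidates (captions-by-images -> images-by-captions)."""
    return [list(column) for column in zip(*matrix)]


# Toy similarity matrix: 3 captions (rows) x 3 images (columns).
# Caption i describes image i. The scores are made up for illustration.
text_to_image = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.6],
    [0.1, 0.2, 0.7],
]
matches = [{0}, {1}, {2}]

t2i_r1 = recall_at_k(text_to_image, matches, k=1)             # captions as queries
i2t_r1 = recall_at_k(transpose(text_to_image), matches, k=1)  # images as queries
print(f"t2i R@1 = {t2i_r1:.4f}, i2t R@1 = {i2t_r1:.4f}")
```

In this toy case the second caption ranks the wrong image first, while every image still ranks its own caption first, so the two directions disagree even though they share one matrix.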
Thanks, interesting indeed! The PR LGTM. I think adding a mention of this new model in this section https://github.com/mlfoundations/open_clip/blob/main/docs/PRETRAINED.md#nllb would be good, so people have a chance to discover the model (by looking at the open_clip docs). |
The training setup for this xlm-roberta-large-ViT-H-14 model is described here: https://github.com/mlfoundations/open_clip/blob/main/docs/PRETRAINED.md#vit-h14-xlm-roberta-large
Looking at your paper (https://arxiv.org/pdf/2309.01859.pdf), the freezing method seems a bit similar; I froze the image encoder but not the text encoder.
I see you evaluate only on retrieval. I also evaluated on ImageNet with translated class names and found that the model performs better than previous ones, but still poorly in absolute terms (for example, 56% in Italian, while the model gets 78% on the same task in English). I am not sure what the cause is, but it may be of interest to you (just FYI). |
Thanks! I added a bit more info on NLLB-CLIP to the doc. I'll add more info about evals when I figure out how to make the eval CSV file readable - it has too many dimensions (language, i2t/t2i, recall@k). |
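One possible way to tame the extra dimensions is to keep one row per (model, language) and fold the task and k into the column names. The sketch below assumes a long-format input with the hypothetical columns `model`, `language`, `task`, `k`, and `recall`; the actual CLIP benchmark output layout may differ, and the values are placeholders rather than real results.

```python
import csv
import io
from collections import defaultdict

# Hypothetical long-format results: one row per (model, language, task, k).
# The numbers are placeholders, not measured results.
long_csv = """model,language,task,k,recall
model-a,en,t2i,1,0.5
model-a,en,i2t,1,0.4
model-a,de,t2i,1,0.3
model-a,de,i2t,1,0.2
"""


def pivot(long_text: str) -> str:
    """Fold (task, k) into column names such as 't2i_R@1'."""
    rows = defaultdict(dict)  # (model, language) -> {column name: value}
    columns = []              # column order as first seen in the input
    for record in csv.DictReader(io.StringIO(long_text)):
        column = f"{record['task']}_R@{record['k']}"
        if column not in columns:
            columns.append(column)
        rows[(record["model"], record["language"])][column] = record["recall"]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["model", "language", *columns])
    for (model, language), values in rows.items():
        # Missing combinations are left as empty cells.
        writer.writerow([model, language, *(values.get(c, "") for c in columns)])
    return out.getvalue()


wide_csv = pivot(long_csv)
print(wide_csv)
```

With many languages the table gets long rather than wide, which tends to be easier to scan in rendered docs.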
Regarding tasks, my original interest when starting the project was in multilingual retrieval. Because of that, I evaluated the model only on this task. I'll work on compiling something like ImageNet-200 when I have time. |
@visheratin looks good from my end too. My scripts for running evals on the 38 datasets still need some cleaning up; I plan to do that in the future and push them so everyone can run them easily. Meanwhile, I'm running evals for the 2 new models and will update here with the results once they're done. |
@gabrielilharco thank you! In the meantime, I will compile a CSV for multilingual retrieval for NLLB-CLIP and XLM-RoBERTa. |
@visheratin I added eval results and profiling numbers for the new models |
Thanks! The models are still far from the top of the dashboard but they are 10% better than the first version =) @gabrielilharco I just added a CSV with the benchmark results for NLLB-CLIP and XLM RoBERTa. Can you please take a look? |
Thanks @visheratin! Can you make the numbers in the new CSV have fewer significant digits? E.g. 0.8569999933242798 becomes 0.8570. I think it's a bit easier to read this way. It would be nice to add the other NLLB models to that table as well. Ideally all models, actually; would this be too expensive for you to run? If so, I can try running on my end if you share the scripts. |
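A minimal sketch of that rounding step, using only the standard library. The function names `format_cell` and `round_csv` and the sample column names are hypothetical; integer cells (such as a k column) are left untouched, and anything that does not parse as a number is passed through unchanged.

```python
import csv
import io


def format_cell(cell: str, digits: int) -> str:
    """Format float-like cells to a fixed number of decimals; keep everything else."""
    try:
        int(cell)
        return cell  # integer columns (e.g. k) stay as they are
    except ValueError:
        pass
    try:
        return f"{float(cell):.{digits}f}"
    except ValueError:
        return cell  # headers, model names, language codes


def round_csv(text: str, digits: int = 4) -> str:
    """Return the CSV text with every float-like cell rounded to `digits` decimals."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        writer.writerow([format_cell(cell, digits) for cell in row])
    return out.getvalue()


# Hypothetical sample row using the value quoted above.
sample = "model,k,recall\nmodel-a,1,0.8569999933242798\n"
print(round_csv(sample))
```

Fixed-width formatting keeps the trailing zero (0.8570 rather than 0.857), so the columns line up in the rendered table.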
@gabrielilharco I updated the CSV file with the fixed numbers. Regarding benchmarking all models, I've reached my quota on GCP, where the test bench is deployed, so I'll be able to run full tests only in December, when the quota resets. To run the tests, you'd need the CLIP benchmark version from that PR, which is dependent on this PR. I propose to wait with the multilingual benchmark CSV until we have the results for all models. I'll remove the CSV from this PR and will create a separate PR when I have all the results. What do you think about it? |
Sounds good to me. Thanks @visheratin! |
Hi @visheratin, thanks for your great work! Is there any plan to add the NLLB-CLIP models (with SigLIP) to timm? |
As far as I remember, timm is a pure CV library. The image encoder used in NLLB-CLIP with SigLIP exists in timm, if you want to use it. The best way to use NLLB-CLIP models is via OpenCLIP (once the next version is released). |
@visheratin yea, thanks, looking forward to the next version of open_clip. |
You can make a PR like this if you want a release
#679
That creates a PyPI release on merge
…On Fri, Nov 24, 2023, 10:50 WILL LEE ***@***.***> wrote:
@visheratin <https://github.com/visheratin> yea, thanks, looking forward
to the next version of open_clip.
|
@rom1504 good to know, thank you! I'll wait until I benchmark all models on multilingual retrieval datasets and then create a release PR. |
* Added configs.
* Added links to pretrained models.
* Add NLLB-CLIP base/large results
* Added new version of NLLB-CLIP.
* Added more info on NLLB-CLIP.
* add eval results and profiling
* Added file with benchmarks.
* Fixed CSV file.
* Updated CSV file.
---------
Co-authored-by: Gabriel Ilharco Magalhães <[email protected]>
Co-authored-by: Gabriel Ilharco <[email protected]>