v3.1.1 - Patch hard negative mining & remove `numpy<2` restriction
This patch release fixes hard negative mining for models that do not automatically normalize their embeddings, and it lifts the numpy<2 restriction that the previous release required.
Install this version with:
# Full installation:
pip install sentence-transformers[train]==3.1.1
# Inference only:
pip install sentence-transformers==3.1.1
Hard Negatives Mining Patch (#2944)
The mine_hard_negatives utility introduced in the previous release would fail if use_faiss=True and the model did not automatically normalize its embeddings. This release patches that, allowing the utility to work with all Sentence Transformer models:
from sentence_transformers.util import mine_hard_negatives
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
# Load a Sentence Transformer model
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1").bfloat16()
# Load a dataset to mine hard negatives from
dataset = load_dataset("sentence-transformers/natural-questions", split="train[:10000]")
print(dataset)
"""
Dataset({
    features: ['query', 'answer'],
    num_rows: 10000
})
"""
# Mine hard negatives
dataset = mine_hard_negatives(
    dataset=dataset,
    model=model,
    range_min=10,
    range_max=50,
    max_score=0.8,
    margin=0.1,
    num_negatives=5,
    sampling_strategy="random",
    batch_size=128,
    use_faiss=True,
)
'''
Batches: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 75/75 [00:21<00:00, 3.51it/s]
Batches: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [00:03<00:00, 25.77it/s]
Querying FAISS index: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 3.98it/s]
Metric       Positive       Negative     Difference
Count          10,000         47,711
Mean           0.7600         0.5376         0.2299
Median         0.7673         0.5379         0.2274
Std            0.0658         0.0387         0.0629
Min            0.3858         0.3732         0.1044
25%            0.7219         0.5129         0.1833
50%            0.7673         0.5379         0.2274
75%            0.8058         0.5617         0.2724
Max            0.9341         0.7024         0.4780
Skipped 48770 potential negatives (9.56%) due to the margin of 0.1.
Could not find enough negatives for 2289 samples (4.58%). Consider adjusting the range_max, range_min, margin and max_score parameters if you'd like to find more valid negatives.
'''
print(dataset)
'''
Dataset({
    features: ['query', 'answer', 'negative'],
    num_rows: 47711
})
'''
print(dataset[0])
'''
{
'query': 'where is the us navy base in japan located',
'answer': 'United States Fleet Activities Yokosuka The United States Fleet Activities Yokosuka (横須賀海 軍施設, Yokosuka kaigunshisetsu) or Commander Fleet Activities Yokosuka (司令官艦隊活動横須賀, Shirei-kan kantai katsudō Yokosuka) is a United States Navy base in Yokosuka, Japan. Its mission is to maintain and operate base facilities for the logistic, recreational, administrative support and service of the U.S. Naval Forces Japan, Seventh Fleet and other operating forces assigned in the Western Pacific. CFAY is the largest strategically important U.S. naval installation in the western Pacific.[1] As of August 2013[update], it was commanded by Captain David Glenister.',
'negative': "2011 Tōhoku earthquake and tsunami The earthquake took place at 14:46 JST (UTC 05:46) around 67\xa0km (42\xa0mi) from the nearest point on Japan's coastline, and initial estimates indicated the tsunami would have taken 10 to 30\xa0minutes to reach the areas first affected, and then areas farther north and south based on the geography of the coastline.[127][128] Just over an hour after the earthquake at 15:55 JST, a tsunami was observed flooding Sendai Airport, which is located near the coast of Miyagi Prefecture,[129][130] with waves sweeping away cars and planes and flooding various buildings as they traveled inland.[131][132] The impact of the tsunami in and around Sendai Airport was filmed by an NHK News helicopter, showing a number of vehicles on local roads trying to escape the approaching wave and being engulfed by it.[133] A 4-metre-high (13\xa0ft) tsunami hit Iwate Prefecture.[134] Wakabayashi Ward in Sendai was also particularly hard hit.[135] At least 101 designated tsunami evacuation sites were hit by the wave.[136]"
}
'''
dataset.push_to_hub("natural-questions-hard-negatives", "triplet")
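The range_min, range_max, max_score, margin and num_negatives arguments used above each narrow the pool of candidate negatives. The sketch below is one plausible reading of how such filters combine, written in plain Python. select_negatives is a hypothetical helper, not the library's implementation. The boundary handling (inclusive vs. exclusive) and the choice of taking the top-ranked survivors instead of a random sample are assumptions made to keep the example deterministic.

```python
def select_negatives(positive_score, ranked_scores, range_min, range_max,
                     max_score, margin, num_negatives):
    """Return the ranks of candidates that survive every filter.

    ranked_scores holds candidate similarities sorted from most to least
    similar to the query. Hypothetical helper for illustration only.
    """
    # Keep only the candidates ranked in [range_min, range_max).
    window = list(enumerate(ranked_scores))[range_min:range_max]
    kept = [
        rank
        for rank, score in window
        # Drop candidates that are too similar in absolute terms ...
        if score <= max_score
        # ... or too close to the positive's own score.
        and score <= positive_score - margin
    ]
    # Assumption: take the highest-ranked survivors ("top" style selection).
    return kept[:num_negatives]


ranked_scores = [0.95, 0.85, 0.74, 0.70, 0.60, 0.55, 0.50]
picked = select_negatives(
    positive_score=0.78,
    ranked_scores=ranked_scores,
    range_min=1,
    range_max=6,
    max_score=0.8,
    margin=0.1,
    num_negatives=2,
)
print(picked)
```

In this toy input, rank 0 falls outside the rank window, rank 1 exceeds max_score, and ranks 2 and 3 sit within the margin of the positive, so only ranks 4 and 5 remain.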
Thanks to @omarnj-lab for pointing out the bug to me.
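The changelog entry for #2944 says the fix ensures that embeddings are normalized. A minimal sketch of why that can matter, assuming the similarity search ranks candidates by inner product (an assumption about the internals, not something stated in these notes): without normalization, a long vector can outrank a better-aligned one, and the scores no longer fit a cosine-style threshold such as max_score=0.8. The vectors and names below are made up for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(vector):
    """Scale a vector to unit length so that dot() becomes cosine similarity."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]


query = [1.0, 0.0]
candidates = {
    "long_but_off_topic": [10.0, 10.0],   # large norm, 45 degrees off
    "short_but_on_topic": [1.0, 0.1],     # small norm, nearly aligned
}

# Raw inner product rewards vector length.
raw = {name: dot(query, vec) for name, vec in candidates.items()}
# After normalization the same operation measures only the angle.
cosine = {
    name: dot(normalize(query), normalize(vec))
    for name, vec in candidates.items()
}

raw_best = max(raw, key=raw.get)
cosine_best = max(cosine, key=cosine.get)
print("raw inner product picks:", raw_best)
print("normalized inner product picks:", cosine_best)
```

The raw score of the off-topic vector here is 10.0, which shows how unnormalized scores can leave the [-1, 1] range that the margin and max_score values are expressed in.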
Numpy restriction lifted (#2937)
The v3.1.0 Sentence Transformers release required numpy<2 to prevent crashes on Windows. However, various third-party packages (e.g. scipy) have since been recompiled and released, allowing the Windows tests to pass again.
If you encounter the following error:
A module that was compiled using NumPy 1.x cannot be run in NumPy 2.0.0 as it may crash. To support both 1.x and 2.x versions of NumPy, modules must be compiled with NumPy 2.0. Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.
If you are a user of the module, the easiest solution will be to downgrade to 'numpy<2' or try to upgrade the affected module. We expect that some modules will need time to support NumPy 2.
Then consider 1) upgrading the dependency that raised the error or 2) downgrading numpy to below v2:
pip install -U "numpy<2"
Thanks to @kozlek for pointing this out to me and helping to get it resolved.
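If it is unclear which NumPy major version an environment resolved to, a quick check can help when choosing between the two options above. This is a minimal sketch using only the standard library; parse_major and installed_major are illustrative helper names, not part of Sentence Transformers or NumPy.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_major(version_string):
    """Extract the major version number from a string like '2.0.1'."""
    return int(version_string.split(".")[0])


def installed_major(package):
    """Return the installed major version of a package, or None if absent."""
    try:
        return parse_major(version(package))
    except PackageNotFoundError:
        return None


numpy_major = installed_major("numpy")
if numpy_major is None:
    print("numpy is not installed in this environment")
elif numpy_major >= 2:
    print("NumPy 2.x detected: extensions compiled against 1.x may need upgrading")
else:
    print("NumPy 1.x detected")
```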
All changes
- [deps] Attempt to remove numpy restrictions by @tomaarsen in #2937
- [metadata] Extend pyproject.toml metadata by @tomaarsen in #2943
- [fix] Ensure that the embeddings from hard negative mining are normalized by @tomaarsen in #2944
Full Changelog: v3.1.0...v3.1.1