FaithfulnesswithHHEM doesn't match original prompt, leading to inconsistent scores #1648

Open
AshishSardana opened this issue Nov 9, 2024 · 0 comments
Labels
bug (Something isn't working) · module-metrics (this is part of metrics module)

Comments


AshishSardana commented Nov 9, 2024

[x] I have checked the documentation and related resources and couldn't resolve my bug.

Describe the bug
The FaithfulnesswithHHEM class for the faithfulness metric (using Vectara's model) doesn't produce the same scores as running the original implementation (on HF).

I've identified two reasons for this mismatch (plus a related observation):

  1. HHEM doesn't expect the response/answer to be simplified into statements (as is being done here).
  2. HHEM expects the input prompt to conform to this template, which the RAGAS implementation does not apply (see the sketch after this list).
  3. RAGAS also expects the response/answer to consist of sentences (i.e. ending with ".", etc.; here). This isn't directly relevant to this issue, but some datasets (e.g. the popular HaluBench) don't fulfill this requirement, so the answer becomes "" and the score becomes 0.
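
For reference, a minimal sketch of what applying HHEM's prompt template to a (premise, hypothesis) pair could look like before scoring. The template string below is my reading of the HHEM model card and is an assumption, not RAGAS code; verify it against the model card:

# Sketch only: the exact template string should be checked against the HHEM model card.
prompt_template = (
    "<pad> Determine if the hypothesis is true given the premise?\n\n"
    "Premise: {premise}\n\nHypothesis: {hypothesis}"
)

premise = "current president of united states -- joe biden, president for the next term -- donald trump"
hypothesis = "donald trump"
print(prompt_template.format(premise=premise, hypothesis=hypothesis))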

Ragas version: 0.2.4 (latest)
Python version: 3.10

Code to Reproduce

# with RAGAS
from datasets import Dataset
from ragas.metrics import FaithfulnesswithHHEM
from ragas import evaluate

data = {
    'user_input': ["president of united states"],
    'response': ["donald trump"],
    'retrieved_contexts': [["current president of united states -- joe biden, president for the next term -- donald trump"]]
}
ragas_dataset = Dataset.from_dict(data)
default_ragas_scores = evaluate(
    ragas_dataset,
    metrics=[FaithfulnesswithHHEM()]
)
print(default_ragas_scores['faithfulness_with_hhem'])

# with HF
from transformers import AutoModelForSequenceClassification
hhem = AutoModelForSequenceClassification.from_pretrained('vectara/hallucination_evaluation_model', trust_remote_code=True)

# HHEM expects (premise, hypothesis) string pairs; use the joined retrieved context as the premise
test_data = [
    (" ".join(ctxs), resp)
    for ctxs, resp in zip(data['retrieved_contexts'], data['response'])
]
hhem_predictions = hhem.predict(test_data).tolist()

print(hhem_predictions[0])
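
To illustrate point 1 above, here is a rough sketch of why simplifying the response changes the score: scoring the full response once (as in the HF snippet) vs. scoring decomposed statements and aggregating (roughly what RAGAS does). The statements and the averaging below are hand-written stand-ins, not RAGAS's LLM-based simplification or its exact aggregation:

# Sketch only: hypothetical decomposition; RAGAS's LLM-generated statements and its
# aggregation will differ, which is part of the mismatch reported here.
premise = data['retrieved_contexts'][0][0]
full_response = data['response'][0]
statements = ["Donald Trump is the president for the next term."]  # hand-written stand-in

full_score = hhem.predict([(premise, full_response)]).tolist()[0]
statement_scores = hhem.predict([(premise, s) for s in statements]).tolist()
print(full_score, sum(statement_scores) / len(statement_scores))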

Error trace
No error

Expected behavior
I expect both scores to match, i.e. the RAGAS implementation of HHEM should produce the same score as HHEM's original implementation on HF.

@AshishSardana added the bug (Something isn't working) label Nov 9, 2024
@dosubot bot added the module-metrics (this is part of metrics module) label Nov 9, 2024