FaithfulnesswithHHEM doesn't match original prompt, leading to inconsistent scores #1648

Open
AshishSardana opened this issue Nov 9, 2024 · 0 comments
Labels
bug (Something isn't working) · module-metrics (this is part of metrics module)

Comments


AshishSardana commented Nov 9, 2024

[x] I have checked the documentation and related resources and couldn't resolve my bug.

Describe the bug
The FaithfulnesswithHHEM class for the faithfulness metric (using Vectara's model) doesn't produce the same scores as running the original implementation (on HF).

I've identified two reasons for this mismatch (plus a related observation):

  1. HHEM doesn't expect the response/answer to be simplified into statements (as is being done here).
  2. HHEM expects the input prompt to conform to this template, which the RAGAS implementation does not apply (see the sketch after this list).
  3. RAGAS also expects the response/answer to consist of sentences (i.e. ending with ".", etc.; here). This isn't directly relevant to this issue, but some datasets (e.g. the popular HaluBench) don't fulfill this requirement, so the answer becomes "" and the score becomes 0.
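
For reference, a minimal sketch of what applying HHEM's prompt template to a (premise, hypothesis) pair could look like before scoring. The template string below is my reading of the HHEM model card and is an assumption, not RAGAS code; verify it against the model card:

# Sketch only: the exact template string should be checked against the HHEM model card.
prompt_template = (
    "<pad> Determine if the hypothesis is true given the premise?\n\n"
    "Premise: {premise}\n\nHypothesis: {hypothesis}"
)

premise = "current president of united states -- joe biden, president for the next term -- donald trump"
hypothesis = "donald trump"
print(prompt_template.format(premise=premise, hypothesis=hypothesis))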

Ragas version: 0.2.4 (latest)
Python version: 3.10

Code to Reproduce

# with RAGAS
from datasets import Dataset
from ragas.metrics import FaithfulnesswithHHEM
from ragas import evaluate

data = {
    'user_input': ["president of united states"],
    'response': ["donald trump"],
    'retrieved_contexts': [["current president of united states -- joe biden, president for the next term -- donald trump"]]
}
ragas_dataset = Dataset.from_dict(data)
default_ragas_scores = evaluate(
    ragas_dataset,
    metrics=[FaithfulnesswithHHEM()]
)
print(default_ragas_scores['faithfulness_with_hhem'])

# with HF
from transformers import AutoModelForSequenceClassification
hhem = AutoModelForSequenceClassification.from_pretrained('vectara/hallucination_evaluation_model', trust_remote_code=True)

# HHEM expects (premise, hypothesis) string pairs; use the joined retrieved context as the premise
test_data = [
    (" ".join(ctxs), resp)
    for ctxs, resp in zip(data['retrieved_contexts'], data['response'])
]
hhem_predictions = hhem.predict(test_data).tolist()

print(hhem_predictions[0])
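
To illustrate point 1 above, here is a rough sketch of why simplifying the response changes the score: scoring the full response once (as in the HF snippet) vs. scoring decomposed statements and aggregating (roughly what RAGAS does). The statements and the averaging below are hand-written stand-ins, not RAGAS's LLM-based simplification or its exact aggregation:

# Sketch only: hypothetical decomposition; RAGAS's LLM-generated statements and its
# aggregation will differ, which is part of the mismatch reported here.
premise = data['retrieved_contexts'][0][0]
full_response = data['response'][0]
statements = ["Donald Trump is the president for the next term."]  # hand-written stand-in

full_score = hhem.predict([(premise, full_response)]).tolist()[0]
statement_scores = hhem.predict([(premise, s) for s in statements]).tolist()
print(full_score, sum(statement_scores) / len(statement_scores))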

Error trace
No error

Expected behavior
I expect both scores to match, i.e. the RAGAS implementation of HHEM should produce the same score as HHEM's original implementation on HF.

@AshishSardana added the bug (Something isn't working) label Nov 9, 2024
@dosubot bot added the module-metrics (this is part of metrics module) label Nov 9, 2024