❓ Question

I am trying to benchmark llama-2-7b on the GLUE benchmark for in-context learning, but the accuracy I get for MNLI (mismatched validation) is 35.22% for both zero-shot and 8-shot, which is close to the 33.3% chance level for a three-way task. My questions are:

1. During your benchmarking, did you run the models on any classification tasks? Any experimental results you can share would be great.
2. Currently, I am using the format prescribed by InContextLearningMultipleChoiceTaskDataset. Is there another recommended way to implement this?

PS: I also ran the evaluation for the qqp task: 36.82% for 0-shot and 63.09% for 8-shot.

Any help would be greatly appreciated.

Thank you,

Additional context

Here is the eval config I am using:
max_seq_len: 4096
seed: 28
model_name_or_path: ~/huggingface_cache/Llama-2-7b-hf

# Tokenizer
tokenizer:
  name: ${model_name_or_path}
  kwargs:
    model_max_length: ${max_seq_len}

models:
-
  model_name: ${model_name_or_path}
  model:
    name: hf_causal_lm
    pretrained_model_name_or_path: ${model_name_or_path}
    init_device: mixed
    pretrained: true
    token: <HF Token>
  tokenizer:
    name: ${model_name_or_path}
    kwargs:
      model_max_length: ${max_seq_len}

load_path: # Add your (optional) Composer checkpoint path here!

device_eval_batch_size: 4
precision: fp32

# FSDP config for model sharding
# either use multiple GPUs, or comment FSDP out
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: FULL

icl_tasks:
-
  label: mnli_mismatched
  dataset_uri: scripts/eval/local_data/mnli_mismatched.jsonl # ADD YOUR OWN DATASET URI
  num_fewshot: [8]
  icl_task_type: multiple_choice
  metric_names:
  - InContextLearningMultipleChoiceAccuracy
  prompt_string: ''            # this goes at the beginning of each input
  example_delimiter: "\n"      # this goes between fewshot examples
  continuation_delimiter: ''   # this separates questions from answers
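
For context, the dataset_uri above points to a local jsonl built from the GLUE MNLI mismatched validation split. A minimal sketch of that conversion (assuming the standard HuggingFace datasets glue/mnli splits; field names match the sample record shown below):

# Minimal sketch (assumed, not the exact script used) of building
# mnli_mismatched.jsonl from the GLUE MNLI mismatched validation split.
import json
from datasets import load_dataset

LABELS = ["entailment", "neutral", "contradiction"]  # GLUE MNLI label order

ds = load_dataset("glue", "mnli", split="validation_mismatched")

with open("scripts/eval/local_data/mnli_mismatched.jsonl", "w") as f:
    for idx, ex in enumerate(ds):
        query = (f"Premise:\n{ex['premise']}\n\n"
                 f"Hypothesis:\n{ex['hypothesis']}\n\nLabel:")
        record = {
            "premise": ex["premise"],
            "hypothesis": ex["hypothesis"],
            "label": ex["label"],
            "idx": idx,
            "query": query,
            "choices": LABELS,
            "gold": ex["label"],       # index into choices
            "context": query + "\n",   # what the eval conditions on
        }
        f.write(json.dumps(record) + "\n")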
And here is a sample from the jsonl file:
{
  "premise": "Your contribution helped make it possible for us to provide our students with a quality education.",
  "hypothesis": "Your contributions were of no help with our students' education.",
  "label": 2,
  "idx": 0,
  "query": "Premise:\nYour contribution helped make it possible for us to provide our students with a quality education.\n\nHypothesis:\nYour contributions were of no help with our students' education.\n\nLabel:",
  "choices": ["entailment", "neutral", "contradiction"],
  "gold": 2,
  "context": "Premise:\nYour contribution helped make it possible for us to provide our students with a quality education.\n\nHypothesis:\nYour contributions were of no help with our students' education.\n\nLabel:\n"
}
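
For what it's worth, my understanding of how prompt_string, example_delimiter, and continuation_delimiter combine these fields is sketched below; this is an approximation, not the actual InContextLearningMultipleChoiceTaskDataset implementation:

# Approximation (my understanding only) of how the delimiters combine the
# context/choices/gold fields into the strings that get scored.
def build_choice_prompts(fewshot_examples, test_example,
                         prompt_string='', example_delimiter='\n',
                         continuation_delimiter=''):
    # Each few-shot example contributes its context followed by its gold answer.
    shots = example_delimiter.join(
        ex['context'] + continuation_delimiter + ex['choices'][ex['gold']]
        for ex in fewshot_examples
    )
    prefix = prompt_string + shots + (example_delimiter if fewshot_examples else '')
    # One candidate string per choice; the evaluator picks the choice whose
    # continuation the model assigns the highest likelihood.
    return [prefix + test_example['context'] + continuation_delimiter + choice
            for choice in test_example['choices']]

With continuation_delimiter set to '' and a context that already ends in "Label:\n", each candidate label is appended directly after that newline, so it may be worth printing one assembled prompt to confirm it looks the way the model expects.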