
Add validation set to EvalAI #30

Open
dchichkov opened this issue Aug 14, 2024 · 3 comments

Comments

@dchichkov

Would it be possible to add the MMMU validation set to EvalAI?

It'd be great to be able to compare the numbers we calculate ourselves on the validation set with the ones produced by EvalAI.

@xiangyue9607
Contributor

Thank you! That is a good suggestion. We will consider it and post an update here later!

@dchichkov
Author

Thanks! The issue is that we see a consistent gap between validation and test set results, even though the models were not tuned on the validation set. Multiple teams have resorted to reporting validation rather than test results in their papers, I'm guessing because they don't trust the test results (which they can't reproduce or validate). It'd be good to triage and rectify that, at least by making the validation numbers reproducible against the EvalAI measurement.

MMMU is a great benchmark that measures overall LLM/VLM performance. But these test/validation discrepancies (and the misunderstanding that it's not just the visual part that matters) cast it in a bad light.

I'd also suggest considering releasing the test set, perhaps under a separate non-commercial license with token/password protection to avoid accidental contamination. The benefits of the test set being widely usable, and of cleaning up and resolving this test/validation gap, could outweigh the benefits of keeping the test set in a more controlled environment.

@xiangyue9607
Contributor

Thank you for your feedback. The discrepancy between the validation and test sets arises from the slight differences in their distributions. In the validation set, each subject has an equal number of samples, whereas in the test set, the number of samples per subject varies.
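
To illustrate the distribution point with a minimal sketch (the subject names, accuracies, and sample counts below are hypothetical, and this is not the MMMU or EvalAI scoring code): even if per-subject accuracy were identical on both splits, the overall micro-averaged score shifts when subjects contribute different numbers of samples.

```python
# Minimal sketch: identical per-subject accuracies, different sample counts.
# Subject names, accuracies, and counts are hypothetical; this is not the
# official MMMU/EvalAI scoring code.

def overall_accuracy(per_subject):
    """Micro-average: total correct answers / total samples."""
    correct = sum(acc * n for acc, n in per_subject.values())
    total = sum(n for _, n in per_subject.values())
    return correct / total

# Validation-like split: equal samples per subject.
validation_like = {"Art": (0.60, 30), "Math": (0.40, 30), "Physics": (0.50, 30)}
# Test-like split: same per-subject accuracies, unequal samples per subject.
test_like = {"Art": (0.60, 120), "Math": (0.40, 400), "Physics": (0.50, 80)}

print(overall_accuracy(validation_like))  # 0.500
print(overall_accuracy(test_like))        # ~0.453
```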

We are also considering releasing a portion of the test set while retaining a small part to prevent contamination or overfitting. We appreciate your valuable comments and encourage you to stay tuned for further updates!
