Running on Human-Eval #6
Hi, thanks for your interest. I'm working on cleaning up our human-eval code a bit -- will check in soon. But in the meantime:
This should be fine - I used fp16 in verification experiments.
The HuggingFace tokenizer should prepend the BOS token automatically when encoding text. You can verify this by calling tokenizer.encode(doc) and checking that the first ID in the sequence is 2.
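For anyone else checking this, a minimal sketch of that verification (the checkpoint id here is an assumption; substitute whichever InCoder checkpoint you are evaluating):

```python
from transformers import AutoTokenizer

# Assumed checkpoint id; replace with the checkpoint you are actually using.
tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-6B")

doc = "def add(a, b):\n    return a + b\n"
ids = tokenizer.encode(doc)

# If the tokenizer prepends BOS automatically, the first id should be 2.
print(ids[:5])
assert ids[0] == 2, "BOS (id 2) was not prepended; add it manually before generation"
```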
This could make a difference -- our pass@1 scores reported in the paper used temp=0.2, while pass@10 and pass@100 used temp=0.8 (following what Chen et al. did in the Codex paper).
Our stop tokens are
I should note too that all the experiments in the paper used a fairseq implementation of our model, but I checked that the scores are similar with the HF-converted version of InCoder 6B on HumanEval left-to-right generation (I haven't integrated the HF infilling into our eval harness yet): 15 pass@1 for both the fairseq and HF versions of the 6B model. Code for this is coming soon!
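As a rough sketch of that left-to-right sampling setup with the HF-converted model (this is not the evaluation code referenced in this thread; the checkpoint id, top-p value, and generation length are assumptions, and the temperature follows the pass@10/pass@100 setting mentioned above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "facebook/incoder-6B"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).cuda()

def sample_completions(prompt, n=20, temperature=0.8, max_new_tokens=256):
    """Draw n sampled completions for one HumanEval prompt."""
    # temperature=0.2 for pass@1, temperature=0.8 for pass@10 / pass@100 (see above)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
    outputs = model.generate(
        input_ids,
        do_sample=True,
        temperature=temperature,
        top_p=0.95,                      # assumption: nucleus sampling as in the Codex setup
        max_new_tokens=max_new_tokens,
        num_return_sequences=n,
    )
    # Decode only the generated continuation; completions should then be truncated
    # at the stop sequences used by the eval harness.
    return [tokenizer.decode(o[input_ids.shape[1]:], skip_special_tokens=True) for o in outputs]
```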
I've now checked in the code for HumanEval at https://github.com/dpfried/incoder/tree/main/evaluation. Please let me know if you try it and run into issues or are still unable to replicate!
Thanks for the quick response! I was referring to the pass@100 metric in my experiments. It can be broken down as follows:
Have you seen behavior like this in your experiments? How did you handle random seeds in your experiments, since you run 20 generations 10 times for a total of 200 generations per problem?
Thanks for the info! I'm running replication experiments now with temperature 0.8 to get the pass@100 scores for the HF version of the model. How big a gap are you seeing between our reported pass@100 scores and yours? We didn't set the random seed in our experiments, so every sampled generation (of the 200) should be generated independently. There may be some differences between the inference procedure I used (in the code I checked in) and the text generation pipeline. The one I'm aware of is that the generation pipeline doesn't prepend BOS (#3 (comment)), but it sounds like you're accounting for that already.
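For context on how pass@100 is computed from the 200 samples, the numbers being compared here come from the unbiased estimator in Chen et al.'s Codex paper; a sketch of it (mirroring the reference human-eval implementation, with made-up example numbers) is:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples chosen from
    n total samples (of which c passed the unit tests) is correct,
    i.e. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example with made-up numbers: 200 samples for a problem, 3 of which pass the tests.
print(pass_at_k(n=200, c=3, k=100))   # ~0.88
print(pass_at_k(n=200, c=3, k=1))     # 0.015
```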
With a fixed seed (seed = 1 for both the torch and random libraries), and using num_generations=10 (repeated 20 times in a loop), I seem to get around 35% pass@100. I think the "fixing the seed" part might be limiting the expressiveness. Let me know if you find any discrepancies in the HF version, and thanks for taking the time!
Thanks, yeah that does seem plausible - you may be getting only 10 distinct candidates. I'll report back once I have results!
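To make the seeding issue concrete, here is a toy sketch (using a stand-in random sampler rather than the actual generation call) of why reseeding inside the loop collapses the candidate pool:

```python
import torch

def sample_n(n: int):
    # Stand-in for a temperature-sampled generation call (e.g. model.generate with do_sample=True).
    return torch.randint(0, 1_000_000, (n,)).tolist()

# Pitfall: resetting the seed inside the loop reproduces the same 10 draws each
# iteration, so 20 x 10 calls yield only 10 distinct candidates per problem.
reseeded = set()
for _ in range(20):
    torch.manual_seed(1)
    reseeded.update(sample_n(10))
print(len(reseeded))      # 10

# Seeding once (or not at all) before the loop keeps all 200 draws independent.
torch.manual_seed(1)
seeded_once = set()
for _ in range(20):
    seeded_once.update(sample_n(10))
print(len(seeded_once))   # ~200
```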
Hello, I am trying to reproduce the results of your model on the Human-Eval Dataset and so far I am getting a lower-than-expected performance.
To make everything more clear:
Is there a different procedure the InCoder model uses to solve the Human-Eval dataset? Were the published results obtained with the full 32-bit weights or a different input format?