
Running on Human-Eval #6

Open
SpyrosMouselinos opened this issue Jun 13, 2022 · 6 comments

@SpyrosMouselinos

Hello, I am trying to reproduce the results of your model on the HumanEval dataset, and so far I am getting lower-than-expected performance.
To make everything clearer:

  • I load the large 6B model from Hugging Face in its fp16 version.
  • I load the HumanEval dataset and add a BOS token ("<|endoftext|>") at the beginning of each code example.
  • I use the text-generation pipeline with top_p=0.95 and temperature=0.8, creating 100 completions (see the sketch after this list).
  • The post-processing code is relatively simple: I just look for typical \ndef or \n\n tokens to stop generation, so that a clean piece of code is passed to the HumanEval evaluation.
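
Roughly, my setup looks like the following (a minimal sketch rather than my exact script; the checkpoint name, max_new_tokens value, and placeholder prompt are assumptions):

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

    tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-6B")
    model = AutoModelForCausalLM.from_pretrained(
        "facebook/incoder-6B", torch_dtype=torch.float16  # fp16 weights
    )
    generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)

    BOS = "<|endoftext|>"
    problem_prompt = 'def add(a: int, b: int) -> int:\n'  # placeholder for one HumanEval prompt

    outputs = generator(
        BOS + problem_prompt,       # BOS prepended manually to the prompt string
        do_sample=True,
        top_p=0.95,
        temperature=0.8,
        max_new_tokens=256,
        num_return_sequences=100,   # 100 completions
        return_full_text=False,     # keep only the sampled continuation
    )
    completions = [o["generated_text"] for o in outputs]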

Is there a different procedure that the InCoder model uses to solve the HumanEval dataset? Were the published results obtained with the full 32-bit weights or a different input format?

@dpfried
Owner

dpfried commented Jun 13, 2022

Hi, thanks for your interest. I'm working on cleaning up our human-eval code a bit -- will check in soon. But in the meantime:

> I load the large 6B model from Hugging Face in its fp16 version.

This should be fine - I used fp16 in verification experiments.

> I load the HumanEval dataset and add a BOS token ("<|endoftext|>") at the beginning of each code example.

The HuggingFace tokenizer should prepend this automatically when encoding text. You can verify this by calling tokenizer.encode(doc) and checking to see that the first ID in the sequence is 2.
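
For example, something like this (a minimal check; the checkpoint name is assumed to be the 6B HF model you mentioned):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-6B")
    ids = tokenizer.encode("def foo():\n    return 1\n")
    print(ids[0])  # should print 2, i.e. BOS (<|endoftext|>) was prepended automatically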

> I use the text-generation pipeline with top_p=0.95 and temperature=0.8, creating 100 completions.

This could make a difference -- our pass@1 scores reported in the paper used temp=0.2, while pass@10 and pass@100 used temp=0.8 (following what Chen et al. did in the Codex paper).
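
For context, pass@k itself is the unbiased estimator from Chen et al.; a minimal sketch of that standard estimator (not code from our repo):

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        # n: samples drawn per problem, c: samples that pass the tests, k: budget
        if n - c < k:
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))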

> The post-processing code is relatively simple: I just look for typical \ndef or \n\n tokens to stop generation, so that a clean piece of code is passed to the HumanEval evaluation.

Our stop tokens are

HUMAN_EVAL_STOP_WORDS = ["\nclass", "\ndef", "\n#", "\nif"]

i.e., class, def, a comment, or if at the beginning of a line (we would have used print as well, following Chen et al., but the Codex API can only handle 4 stop words, so we used the same set across all experiments for compatibility).
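
A post-processing helper along these lines would look roughly like this (illustrative only, not the exact code in our evaluation scripts):

    HUMAN_EVAL_STOP_WORDS = ["\nclass", "\ndef", "\n#", "\nif"]

    def truncate_at_stop_words(completion, stop_words=HUMAN_EVAL_STOP_WORDS):
        # Cut the sampled completion at the earliest occurrence of any stop word.
        cut = len(completion)
        for stop in stop_words:
            idx = completion.find(stop)
            if idx != -1:
                cut = min(cut, idx)
        return completion[:cut]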

I should note too that all the experiments in the paper used a fairseq implementation of our model, but I checked that the scores are similar with the HF-converted version of InCoder 6B on HumanEval left-to-right generation (I haven't gotten the infilling with HF integrated into our eval harness yet): 15 pass@1 for both the fairseq and HF versions of the 6B model. Code for this coming soon!

@dpfried
Owner

dpfried commented Jun 14, 2022

I've now checked in the code for HumanEval at https://github.com/dpfried/incoder/tree/main/evaluation. Please let me know if you try it and run into issues or are still unable to replicate!

@SpyrosMouselinos
Author

Thanks for the quick response! I was referring to the pass@100 metric in my experiments.
I use the text-generation pipeline from the transformers library, which behaves slightly differently from the typical
encode --> model.generate() --> decode procedure you use.

The differences can be broken down as follows (a sketch of the manual procedure I mean follows this list):

  • The inputs to the InCoder model do not simply start with the BOS token; they also include the prefix "<| file ext=.py|>" before the actual code.
  • The text-generation pipeline seems to use the same random seed throughout the generation process, giving diverse samples within a batch but lower variety across batches.
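
For comparison, the manual procedure I was referring to is roughly the following (a minimal sketch on my side; the checkpoint name, max_new_tokens value, and placeholder prompt are assumptions):

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-6B")
    model = AutoModelForCausalLM.from_pretrained(
        "facebook/incoder-6B", torch_dtype=torch.float16
    ).to("cuda")

    prompt = 'def add(a: int, b: int) -> int:\n'  # placeholder for one HumanEval prompt
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")  # BOS is prepended here

    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            do_sample=True,
            top_p=0.95,
            temperature=0.8,
            max_new_tokens=256,
            num_return_sequences=10,
        )

    # Keep only the sampled continuation, dropping the prompt tokens.
    completions = [
        tokenizer.decode(ids[input_ids.shape[1]:], skip_special_tokens=True)
        for ids in output_ids
    ]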

Have you seen behavior like this in your experiments? How did you handle random seeds, since you run 10 batches of 20 generations for a total of 200 generations per problem?

@dpfried
Owner

dpfried commented Jun 16, 2022

Thanks for the info! I'm running replication experiments now with temperature 0.8 to get the pass@100 scores for the HF version of the model. How big of a gap between our reported pass@100 scores and yours are you seeing?

We didn't set the random seed in our experiments, so every sampled generation (of the 200) should be generated independently.

There may be some differences between the inference procedure I used (in the code I checked in) and the text generation pipeline. The one that I'm aware of is that the generation pipeline doesn't prepend BOS (#3 (comment)), but it sounds like you're accounting for that already.

@SpyrosMouselinos
Author

In a fixed-seed setup (seed = 1 for both the torch and random libraries), and using num_generations=10 (repeated 20 times in a loop), I seem to get around 35% pass@100. I think fixing the seed might be limiting the diversity of the generations. Let me know if you find any discrepancies in the HF version, and thanks for taking the time!
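
Concretely, my loop looks roughly like the sketch below, with the same seed applied around each batch; I suspect a per-batch seed (or no seeding at all) would restore the cross-batch diversity. This reuses the generator, BOS, and problem_prompt names from my sketch further up, which are my own placeholders:

    import random
    import torch

    all_completions = []
    for batch_idx in range(20):
        # Re-seeding with the same value makes every batch draw the same 10 samples;
        # using e.g. seed = 1 + batch_idx (or no seeding) avoids that collapse.
        seed = 1
        random.seed(seed)
        torch.manual_seed(seed)

        outputs = generator(
            BOS + problem_prompt,
            do_sample=True,
            top_p=0.95,
            temperature=0.8,
            max_new_tokens=256,
            num_return_sequences=10,
            return_full_text=False,
        )
        all_completions.extend(o["generated_text"] for o in outputs)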

@dpfried
Owner

dpfried commented Jun 16, 2022 via email
