
Running on Human-Eval #6

Open
SpyrosMouselinos opened this issue Jun 13, 2022 · 6 comments

@SpyrosMouselinos

Hello, I am trying to reproduce the results of your model on the HumanEval dataset, and so far I am getting lower-than-expected performance.
To make everything clearer:

  • I load the large 6B model from Hugging Face in its fp16 version.
  • I load the HumanEval dataset and add a BOS token ("<|endoftext|>") at the beginning of each code example.
  • I use the text-generation pipeline with top_p=0.95 and temperature=0.8, creating 100 completions (see the sketch after this list).
  • The post-processing code is relatively simple: I just look for typical \ndef or \n\n tokens to stop generation, so that a clean piece of code is passed to the HumanEval evaluation.
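
Roughly, my setup looks like the following (a minimal sketch rather than my exact script; the checkpoint name, max_new_tokens value, and placeholder prompt are assumptions):

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

    tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-6B")
    model = AutoModelForCausalLM.from_pretrained(
        "facebook/incoder-6B", torch_dtype=torch.float16  # fp16 weights
    )
    generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)

    BOS = "<|endoftext|>"
    problem_prompt = 'def add(a: int, b: int) -> int:\n'  # placeholder for one HumanEval prompt

    outputs = generator(
        BOS + problem_prompt,       # BOS prepended manually to the prompt string
        do_sample=True,
        top_p=0.95,
        temperature=0.8,
        max_new_tokens=256,
        num_return_sequences=100,   # 100 completions
        return_full_text=False,     # keep only the sampled continuation
    )
    completions = [o["generated_text"] for o in outputs]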

Is there a different procedure that the InCoder model uses to solve the HumanEval dataset? Were the published results obtained with the full 32-bit weights or a different input format?

@dpfried
Owner

dpfried commented Jun 13, 2022

Hi, thanks for your interest. I'm working on cleaning up our human-eval code a bit -- will check in soon. But in the meantime:

> I load the large 6B model from Hugging Face in its fp16 version.

This should be fine - I used fp16 in verification experiments.

> I load the HumanEval dataset and add a BOS token ("<|endoftext|>") at the beginning of each code example.

The HuggingFace tokenizer should prepend this automatically when encoding text. You can verify this by calling tokenizer.encode(doc) and checking to see that the first ID in the sequence is 2.
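
For example, something like this (a minimal check; the checkpoint name is assumed to be the 6B HF model you mentioned):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-6B")
    ids = tokenizer.encode("def foo():\n    return 1\n")
    print(ids[0])  # should print 2, i.e. BOS (<|endoftext|>) was prepended automatically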

> I use the text-generation pipeline with top_p=0.95 and temperature=0.8, creating 100 completions.

This could make a difference -- our pass@1 scores reported in the paper used temp=0.2, while pass@10 and pass@100 used temp=0.8 (following what Chen et al. did in the Codex paper).
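
For context, pass@k itself is the unbiased estimator from Chen et al.; a minimal sketch of that standard estimator (not code from our repo):

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        # n: samples drawn per problem, c: samples that pass the tests, k: budget
        if n - c < k:
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))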

> The post-processing code is relatively simple: I just look for typical \ndef or \n\n tokens to stop generation, so that a clean piece of code is passed to the HumanEval evaluation.

Our stop tokens are

HUMAN_EVAL_STOP_WORDS = ["\nclass", "\ndef", "\n#", "\nif"]

i.e., class, def, a comment, or if at the beginning of a line (we would have used print as well, following Chen et al., but the Codex API can only handle 4 stop words, so we used the same set across all experiments for compatibility).
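
A post-processing helper along these lines would look roughly like this (illustrative only, not the exact code in our evaluation scripts):

    HUMAN_EVAL_STOP_WORDS = ["\nclass", "\ndef", "\n#", "\nif"]

    def truncate_at_stop_words(completion, stop_words=HUMAN_EVAL_STOP_WORDS):
        # Cut the sampled completion at the earliest occurrence of any stop word.
        cut = len(completion)
        for stop in stop_words:
            idx = completion.find(stop)
            if idx != -1:
                cut = min(cut, idx)
        return completion[:cut]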

I should note too that all the experiments in the paper used a fairseq implementation of our model, but I checked that the scores are similar with the HF-converted version of InCoder 6B on HumanEval left-to-right generation (I haven't gotten the infilling with HF integrated into our eval harness yet): 15 pass@1 for both the fairseq and HF versions of the 6B model. Code for this coming soon!

@dpfried
Owner

dpfried commented Jun 14, 2022

I've now checked in the code for HumanEval at https://github.com/dpfried/incoder/tree/main/evaluation. Please let me know if you try it and run into issues or are still unable to replicate!

@SpyrosMouselinos
Author

Thanks for the quick response! I was referring to the pass@100 metric in my experiments.
I use the text-generation pipeline from the transformers library, which behaves slightly differently from the typical
encode --> model.generate() --> decode procedure you use.

The differences can be broken down as follows (a sketch of the manual procedure I mean follows this list):

  • The inputs to the InCoder model do not simply start with the BOS token; they also include the prefix "<| file ext=.py|>" before the actual code.
  • The text-generation pipeline seems to use the same random seed throughout the generation process, giving diverse samples within a batch but lower variety across batches.
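
For comparison, the manual procedure I was referring to is roughly the following (a minimal sketch on my side; the checkpoint name, max_new_tokens value, and placeholder prompt are assumptions):

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-6B")
    model = AutoModelForCausalLM.from_pretrained(
        "facebook/incoder-6B", torch_dtype=torch.float16
    ).to("cuda")

    prompt = 'def add(a: int, b: int) -> int:\n'  # placeholder for one HumanEval prompt
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")  # BOS is prepended here

    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            do_sample=True,
            top_p=0.95,
            temperature=0.8,
            max_new_tokens=256,
            num_return_sequences=10,
        )

    # Keep only the sampled continuation, dropping the prompt tokens.
    completions = [
        tokenizer.decode(ids[input_ids.shape[1]:], skip_special_tokens=True)
        for ids in output_ids
    ]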

Have you seen behavior like this in your experiments? How did you handle random seeds, since you run 10 batches of 20 generations for a total of 200 generations per problem?

@dpfried
Owner

dpfried commented Jun 16, 2022

Thanks for the info! I'm running replication experiments now with temperature 0.8 to get the pass@100 scores for the HF version of the model. How big of a gap between our reported pass@100 scores and yours are you seeing?

We didn't set the random seed in our experiments, so every sampled generation (of the 200) should be generated independently.

There may be some differences between the inference procedure I used (in the code I checked in) and the text generation pipeline. The one that I'm aware of is that the generation pipeline doesn't prepend BOS (#3 (comment)), but it sounds like you're accounting for that already.

@SpyrosMouselinos
Author

In a fixed-seed setup (seed = 1 for both the torch and random libraries), and using num_generations=10 (repeated 20 times in a loop), I seem to get around 35% pass@100. I think fixing the seed might be limiting the diversity of the generations. Let me know if you find any discrepancies in the HF version, and thanks for taking the time!
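
Concretely, my loop looks roughly like the sketch below, with the same seed applied around each batch; I suspect a per-batch seed (or no seeding at all) would restore the cross-batch diversity. This reuses the generator, BOS, and problem_prompt names from my sketch further up, which are my own placeholders:

    import random
    import torch

    all_completions = []
    for batch_idx in range(20):
        # Re-seeding with the same value makes every batch draw the same 10 samples;
        # using e.g. seed = 1 + batch_idx (or no seeding) avoids that collapse.
        seed = 1
        random.seed(seed)
        torch.manual_seed(seed)

        outputs = generator(
            BOS + problem_prompt,
            do_sample=True,
            top_p=0.95,
            temperature=0.8,
            max_new_tokens=256,
            num_return_sequences=10,
            return_full_text=False,
        )
        all_completions.extend(o["generated_text"] for o in outputs)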

@dpfried
Owner

dpfried commented Jun 16, 2022 via email
