A codebase for studying the correlation between pre-training loss and downstream performance on synthetic and real datasets. The training code is adapted from Academic Budget BERT, and the data generation code is based on NLTK.
First, install the Python packages with pip:
pip install -r requirements.txt
If you want to use the apex LayerNorm and attention kernels, please install apex from source for the corresponding CUDA toolkit version.
We provide two datasets generated by generative models and two real datasets. For PCFG, we provide both the generation code and the processed datasets. The OPT dataset is provided on Google Drive, and its data generation code will be released soon.
| Name | Generation | Download |
|---|---|---|
| PCFG | generation | download |
| OPT | TODO | download |
| OpenWebText | --- | download |
| BookCorpus | --- | download |
The data for pre-training is generated into text files. We then preprocess the data (tokenization and adding masks) into HDF5 files. We always use the bert-large-uncased tokenizer and vocabulary.
For text file generation, make sure that two different sentences are separated by an empty line; a quick sanity check is shown after the command below.
python generate_pretraining.py \
--rule-path rule_deep_mlm.txt \
--save-path pcfg_dataset
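To verify the output format, you can inspect a generated text file with a short script like the following (the file name is hypothetical; blocks separated by an empty line correspond to different sentences):

```python
# Sanity check for a generated text file: consecutive non-empty lines belong to
# one sentence, and an empty line separates two different sentences.
# The file name below is hypothetical.
with open("pcfg_dataset/train_0.txt") as f:
    text = f.read()

sentences = [block.strip() for block in text.split("\n\n") if block.strip()]
print(f"read {len(sentences)} sentences; first one:\n{sentences[0]}")
```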
Each text file is then preprocessed into an HDF5 file with the code provided in Academic Budget BERT.
python generate_samples.py --dir pcfg_dataset \
-o pcfg_dataset/pretraining \
--dup_factor 5 \
--vocab_file vocab.txt \
--do_lower_case 1 \
--masked_lm_prob 0.15 --max_seq_length 32 \
--model_name bert-large-uncased \
--max_predictions_per_seq 5 \
--n_processes 32
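To check the preprocessing, you can peek at one of the resulting shards with h5py. The shard name below is hypothetical, and the keys are whatever the Academic Budget BERT preprocessing writes (typically input ids, masks, and masked LM targets):

```python
# Peek at one preprocessed HDF5 shard (the file name is hypothetical; the key
# names are whatever generate_samples.py writes).
import h5py

with h5py.File("pcfg_dataset/pretraining/train_shard_0.hdf5", "r") as f:
    for key in f.keys():
        print(key, f[key].shape, f[key].dtype)
```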
The downstream task for PCFG is defined with the CYK algorithm: predict the maximum-likelihood internal node given a sentence. The data is generated into CSV files containing the input and the label, so it fits the standard Hugging Face fine-tuning pipeline.
python generate_downstream.py \
--rule-path rule_deep_mlm.txt \
--save-path pcfg_dataset \
--sentence-num 10000 \
--task-type B
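The resulting CSV files can be loaded with the standard Hugging Face datasets API, just as the fine-tuning scripts below do. The path and column names here are illustrative; use whatever generate_downstream.py actually writes:

```python
# Load a generated downstream CSV the same way run_ft.py / run_lp.py would.
# The file path is hypothetical; column names come from generate_downstream.py.
from datasets import load_dataset

ds = load_dataset("csv", data_files={"train": "pcfg_dataset/task_B_train.csv"})
print(ds["train"].column_names)  # expect a text column and a label column
print(ds["train"][0])
```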
After generating the synthetic data, we can start pre-training. Our experiments vary (1) the number of pre-training steps, (2) the model size, and (3) the pre-training algorithm. For (1) and (2) we use the standard DeepSpeed training code; an example is provided below and the scripts are included in scripts. For (3) (the lookup table), we can view the model as a black box that takes masked sentences as input and outputs conditional probabilities. In this sense we do not need to implement a model representing the conditional probabilities; instead, we compute them directly from the generative models (a minimal sketch of this view follows the command below). See pcfg for details.
deepspeed --master_port 32056 run_pretraining.py \
--model_type bert-mlm --tokenizer_name bert-large-uncased \
--hidden_act gelu \
--hidden_size 1024 \
--num_hidden_layers 16 \
--num_attention_heads 16 \
--intermediate_size 4096 \
--hidden_dropout_prob 0.1 \
--attention_probs_dropout_prob 0.1 \
--encoder_ln_mode pre-ln \
--lr 5e-4 \
--train_batch_size 4096 \
--train_micro_batch_size_per_gpu 1024 \
--lr_schedule step \
--num_warmup_steps 3333 \
--warmup_proportion 0.13333 \
--max_steps 25000 \
--curve linear \
--gradient_clipping 0.0 \
--optimizer_type adamw \
--weight_decay 0.01 \
--adam_beta1 0.9 \
--adam_beta2 0.98 \
--adam_eps 1e-6 \
--dataset_path pcfg_dataset/pretraining \
--output_dir <output directory> \
--print_steps 100 \
--num_epochs_between_checkpoints 100000 \
--job_name <job name of wandb> \
--project_name <project name of wandb> \
--validation_epochs 3 \
--validation_epochs_begin 1 \
--validation_epochs_end 1 \
--validation_begin_proportion 0.05 \
--validation_end_proportion 0.01 \
--validation_micro_batch 16 \
--deepspeed \
--data_loader_type dist \
--do_validation \
--fp16
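For point (3), the lookup-table "model" never needs a forward pass: given a masked sentence, the conditional distribution of the masked token is read off the generative model. A minimal sketch of the resulting MLM loss computation, assuming a dictionary `cond_prob` that maps a masked context to a distribution over tokens (both names are hypothetical):

```python
import math

def lookup_table_mlm_loss(masked_contexts, targets, cond_prob):
    """MLM cross-entropy when predictions come directly from the generative
    model's conditional probabilities instead of a trained network.
    cond_prob: dict mapping a masked context to {token: P(token | context)}."""
    total = 0.0
    for context, target in zip(masked_contexts, targets):
        p = cond_prob[context].get(target, 1e-12)  # probability of the true token
        total += -math.log(p)
    return total / len(targets)
```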
We evaluate the models based on (1) pre-training loss, (2) trace of the Hessian, (3) linear probe accuracy, and (4) fine-tuning accuracy. These can be done as follows.
The pre-training loss logged during pre-training can be inaccurate, so we provide a more accurate evaluation script based on the generative models. Note that we generate the conditional probability file for the lookup tables. The code searches for every model (pytorch_model.bin) under the given path and saves the losses to a text file.
python calculate_loss.py \
--model-path <pretrained model path (include output_dir above)> \
--datafile <validation text path (generated by generate_samples.py)> \
--vocabfile vocab.txt \
--condprobfile <conditional prob path>
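The checkpoint search described above amounts to walking the given directory for saved models; roughly (the path is illustrative):

```python
# Find every checkpoint under the output directory; calculate_loss.py evaluates
# each of these and writes the losses to a text file. The path is illustrative.
from pathlib import Path

checkpoints = sorted(Path("output_dir").rglob("pytorch_model.bin"))
print(f"found {len(checkpoints)} checkpoints")
```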
We evaluate the trace of the Hessian with the following code.
python calculate_loss.py \
--loadpath <pretrained model path (include output_dir above)> \
--datafile <validation text path (generated by generate_samples.py)> \
--vocabfile vocab.txt \
--nsamples 50000 \
--bs 128
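The trace of the Hessian is typically estimated stochastically with Hutchinson's method, averaging v^T H v over random Rademacher vectors v computed with Hessian-vector products. A generic PyTorch sketch of that estimator (not the repository's exact implementation):

```python
import torch

def hessian_trace_hutchinson(loss_fn, params, n_samples=50):
    """Estimate tr(H) of the loss w.r.t. params via Hutchinson's method:
    E[v^T H v] = tr(H) for Rademacher vectors v, using Hessian-vector products."""
    estimates = []
    for _ in range(n_samples):
        loss = loss_fn()
        grads = torch.autograd.grad(loss, params, create_graph=True)
        vs = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]  # entries in {+1, -1}
        grad_v = sum((g * v).sum() for g, v in zip(grads, vs))
        hvs = torch.autograd.grad(grad_v, params)  # Hessian-vector product H v
        estimates.append(sum((hv * v).sum() for hv, v in zip(hvs, vs)).item())
    return sum(estimates) / len(estimates)
```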
As mentioned in the paper, to fairly compare the trace of the Hessian across transformers of different sizes, we need to embed a small transformer into a large transformer without changing its functionality. This is implemented with the following code.
python enlarge.py \
--checkpoint-path <small model output dir> \
--large-model-path <large model output dir> \
--enlarged-model-path <enlarged model output dir> \
--time-enlarged <the ratio of hidden size>
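The idea behind enlarge.py is a function-preserving embedding: the small model's weights occupy a block of the large model's weight matrices and the extra coordinates stay inert, so the enlarged network computes the same function. A toy illustration for a single linear layer (the repository's construction for full transformers is more involved):

```python
import torch

def embed_linear(small_weight, small_bias, large_out, large_in):
    """Zero-pad a small linear layer into a larger one without changing what it
    computes on inputs whose extra coordinates are zero. Toy example only."""
    out_s, in_s = small_weight.shape
    W = torch.zeros(large_out, large_in)
    b = torch.zeros(large_out)
    W[:out_s, :in_s] = small_weight  # small weights sit in the top-left block
    b[:out_s] = small_bias
    return W, b
```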
The fine-tuning code is adapted from the Hugging Face examples. The model is loaded automatically from the output_dir used in pre-training above. We need to specify the train and validation CSV files for the downstream tasks; they are generated by generate_downstream.py.
python run_ft.py \
--model_name_or_path <pretrained model path (output_dir above)> \
--tokenizer_name bert-large-uncased \
--train_file <csv file generated by generate_downstream.py> \
--validation_file <csv file generated by generate_downstream.py> \
--max_seq_length 32 \
--do_train --do_eval \
--evaluation_strategy steps \
--per_device_train_batch_size 128 --gradient_accumulation_steps 1 \
--per_device_eval_batch_size 128 \
--learning_rate 1e-4 \
--weight_decay 0.01 \
--eval_steps 200 --evaluation_strategy steps \
--max_grad_norm 1.0 \
--num_train_epochs 50 \
--lr_scheduler_type polynomial \
--warmup_steps 50
For the linear probe, we concatenate the embeddings of all tokens as the representation and fix the BERT encoder (see the sketch after the command below). Apart from that, it is the same as the fine-tuning code.
python run_lp.py \
--model_name_or_path <pretrained model path (output_dir above)> \
--tokenizer_name bert-large-uncased \
--train_file <csv file generated by generate_downstream.py> \
--validation_file <csv file generated by generate_downstream.py> \
--max_seq_length 32 \
--do_train --do_eval \
--evaluation_strategy steps \
--per_device_train_batch_size 512 --gradient_accumulation_steps 1 \
--per_device_eval_batch_size 512 \
--learning_rate 1e-3 \
--weight_decay 0.01 \
--eval_steps 200 --evaluation_strategy steps \
--max_grad_norm 1.0 \
--num_train_epochs 50 \
--lr_scheduler_type polynomial \
--warmup_steps 50
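The linear-probe setup described above (frozen encoder, concatenated token representations, linear head) looks roughly like this; run_lp.py may differ in details:

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Freeze the BERT encoder, concatenate the representations of all tokens in a
    length-L sequence, and train only a linear classifier on top. Illustrative sketch."""
    def __init__(self, encoder, hidden_size, max_seq_length, num_labels):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # the encoder stays fixed
        self.classifier = nn.Linear(hidden_size * max_seq_length, num_labels)

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():
            # assumes a Hugging Face-style encoder that returns last_hidden_state
            hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        features = hidden.flatten(start_dim=1)  # concatenate all token embeddings
        return self.classifier(features)
```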