Summary: This PR implements SimpleRMSNorm, as proposed in *Scaling TransNormer to 175 Billion Parameters*:

> In TransNormerLLM, we replace the RMSNorm with a new simple normalization function called SimpleRMSNorm, abbreviated as SRMSNorm:
>
> SRMSNorm(x) = x / (‖x‖₂ / √d)
>
> We empirically find that using SRMSNorm does not lead to any performance loss, as demonstrated in the ablation study [below]:

| Norm Type | Params | Updates | Loss  | PPL   |
|-----------|--------|---------|-------|-------|
| SRMSNorm  | 385M   | 100K    | 2.247 | 4.765 |
| RMSNorm   | 385M   | 100K    | 2.247 | 4.766 |
| LayerNorm | 385M   | 100K    | 2.247 | 4.765 |

Note that their architecture is a TransNormer, not a Transformer. I therefore tested this on a GPT-2 Transformer and saw equivalent results between LayerNorm and SimpleRMSNorm, as shown below:

<img width="494" alt="simpleRMS_gpt2" src="https://github.com/facebookresearch/multimodal/assets/46302957/7239ed80-60c8-4dec-ad89-62c180bb6b2a">

In addition, SimpleRMSNorm is ~34% faster than regular RMSNorm (eager-mode comparison).

Pull Request resolved: #465

Test Plan: Tested on GPT-2 training as shown above, and added 4 unit tests (2 for BF16 and 2 for FP32 dtypes).

Reviewed By: ebsmothers

Differential Revision: D49638459

Pulled By: pbontrager

fbshipit-source-id: 203b2bdd95dd79a5817060d85fc5920c6523733a
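For reference, here is a minimal PyTorch sketch of the formula above. This is an illustration of the math, not necessarily the exact code merged in this PR; the `eps` guard against division by zero and the module signature are assumptions.

```python
import torch
from torch import nn


class SimpleRMSNorm(nn.Module):
    """SRMSNorm(x) = x / (||x||_2 / sqrt(d)), applied over the last dim.

    Unlike RMSNorm, there is no learnable elementwise scale parameter.
    """

    def __init__(self, dim: int, eps: float = 1e-12) -> None:
        super().__init__()
        self.scaling = dim**0.5  # sqrt(d)
        self.eps = eps  # assumed numerical-stability guard

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize along the feature dimension, then rescale by sqrt(d).
        denom = x.norm(2, dim=-1, keepdim=True).clamp_min(self.eps)
        return (x / denom) * self.scaling


# Example usage: GPT-2-sized hidden states (batch, seq, dim).
norm = SimpleRMSNorm(dim=768)
y = norm(torch.randn(2, 16, 768))  # output has the same shape as the input
```

Because the formula carries no learnable weight, the module is parameter-free, which is consistent with the eager-mode speedup over regular RMSNorm noted above.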