This notebook demonstrates how to leverage Llama 3.1 405B Instruct and Nemotron-4 340B Reward through build.nvidia.com.
It walks through the following pipeline:
The pipeline is designed to create a preference dataset suitable for training a custom reward model using the SteerLM method. Because consecutive responses (e.g., samples 1 and 2, samples 3 and 4, etc.) share the same prompt, the dataset can also be used to build preference pairs, using the helpfulness score, for training an RLHF reward model or for DPO.
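The consecutive-pair structure described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the field names (`prompt`, `response`, `helpfulness`) and the sample values are assumptions chosen for the example.

```python
# Minimal sketch: turn consecutive same-prompt samples into chosen/rejected
# preference pairs, ranking by helpfulness score. Field names are assumed.

def to_preference_pairs(samples):
    """Pair samples (0,1), (2,3), ...; the higher-helpfulness response is 'chosen'."""
    pairs = []
    for a, b in zip(samples[::2], samples[1::2]):
        # Consecutive samples are expected to share the same prompt.
        assert a["prompt"] == b["prompt"], "consecutive samples must share a prompt"
        chosen, rejected = (a, b) if a["helpfulness"] >= b["helpfulness"] else (b, a)
        pairs.append({
            "prompt": a["prompt"],
            "chosen": chosen["response"],
            "rejected": rejected["response"],
        })
    return pairs

# Hypothetical example data with two responses to one prompt.
samples = [
    {"prompt": "Explain DPO.", "response": "Answer A", "helpfulness": 3.1},
    {"prompt": "Explain DPO.", "response": "Answer B", "helpfulness": 4.0},
]
print(to_preference_pairs(samples))
```

The resulting `prompt`/`chosen`/`rejected` records are the shape commonly expected by DPO training pipelines.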