initial config for partial conv in JET of esm2 in bionemo2 #428

Open · wants to merge 1 commit into main
Conversation

@dorotat-nv (Collaborator) commented Nov 13, 2024

Adding the first config for partial-convergence training in JET for the ESM2 model in BioNeMo2.

The JET pipeline can be submitted for this config by running

python jet/cli.py submit --root-dir ci/benchmarks/partial-conv/ --manifest-template-config-file <PATH_TO_CI_REPO>/jet/manifest_template_config.yaml

where PATH_TO_CI_REPO is an absolute path to the CI repository storing JET CLI tools: https://gitlab-master.nvidia.com/clara-discovery/bionemo-github-ci/-/blob/master/jet/cli.py?ref_type=heads#L78
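
For orientation, here is a minimal sketch of what one such config could look like. Only the max_steps and script keys mirror the fragment quoted in the review below; everything else would come from the JET config schema and is an illustrative assumption, not the final config:

    # Hypothetical sketch only: max_steps and script mirror the fragment
    # reviewed below; surrounding structure is an assumption.
    max_steps:
      value: 10000
    script: |-
      python scripts/${variant}/${model}/${model}_${variant}.py  # training args elided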

This PR is still a draft: I need to adjust the number of steps (due to the 4 h time limit and the reduction from 32 nodes to 4), and to add a test section.

To make config management more robust, I will add a unit test in BioNeMo2 that checks whether the training command can be executed (a sketch follows).
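
A minimal sketch of what such a test could look like, assuming the train_esm2 console script discussed later in this thread is installed on PATH; the test body is illustrative, not the actual test:

    import subprocess

    def test_esm2_training_command_is_executable():
        # Smoke test: the entrypoint should at least parse --help successfully.
        # 'train_esm2' is the console script declared in bionemo-esm2's pyproject.toml.
        result = subprocess.run(
            ["train_esm2", "--help"],
            capture_output=True,
            text=True,
            timeout=120,
        )
        assert result.returncode == 0, result.stderr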

@dorotat-nv changed the title from "added initial config for partial conv in JET of esm2 in bionemo2" to "initial config for partial conv in JET of esm2 in bionemo2" on Nov 13, 2024
@skothenhill-nv (Collaborator) commented:
Did we decide to do this with the CLI/argparse entrypoints? This is doable with the pydantic interface as well.

max_steps:
  value: 10000
script: |-
  export NVTE_FUSED_ATTN=1; export NVTE_FLASH_ATTN=0; python scripts/${variant}/${model}/${model}_${variant}.py \
Collaborator commented:
Suggested change:
- export NVTE_FUSED_ATTN=1; export NVTE_FLASH_ATTN=0; python scripts/${variant}/${model}/${model}_${variant}.py \
+ python scripts/${variant}/${model}/${model}_${variant}.py \

@jstjohn was seeing issues with these environment variables (at least with our version of cuDNN), so we've removed them for now.

Collaborator commented:

@dorotat-nv yes, also note that the script paths are pretty different now. I believe there's a CLI entrypoint called train_esm2: https://github.com/NVIDIA/bionemo-framework/blob/main/sub-packages/bionemo-esm2/pyproject.toml#L24

We also have the new CLI, which creates a config and then runs it in two steps. Perhaps it would be good to save our JET config settings in a .json file and just run that (sketched below): https://github.com/NVIDIA/bionemo-framework/blob/main/sub-packages/bionemo-esm2/pyproject.toml#L21 cc @skothenhill-nv
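
A sketch of that two-step flow, in the style of the submit command above; the recipe entrypoint and the exact flag names are assumptions to be checked against the linked pyproject.toml:

    # Hypothetical two-step flow; entrypoint and flag names are assumptions.
    bionemo-esm2-recipe --dest esm2_partial_conv.json   # step 1: write the pydantic config to JSON
    bionemo-esm2-train --config esm2_partial_conv.json  # step 2: run training from that config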

Collaborator commented:

What's the difference between bionemo-esm2-train and train_esm2?

Collaborator commented:

bionemo-esm2-train goes through the pydantic interface and shares a train method with Geneformer. We can do the same thing for inference.

In any case, I'd definitely prefer that we prioritize using the pydantic interface.
