
JSON -> YAML for CLI #436

Open · wants to merge 10 commits into main
Conversation

skothenhill-nv (Collaborator):

Replaces JSON usage with YAML for the pydantic based CLI.

To do this, we had to add a few custom serializers for types not natively supported by YAML:

  • `pathlib.Path` has no native YAML representation, so we go to/from string at serialization/validation time.
  • Enums aren't handled correctly either, so similarly we do the same thing: they become strings on serialization and enums again on validation (a minimal sketch of this pattern follows below).
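A minimal sketch of the enum round-trip described above, assuming pydantic v2; the `Precision` enum and `ExampleConfig` model are hypothetical, for illustration only:

```python
import enum

import yaml
from pydantic import BaseModel, field_serializer, field_validator


class Precision(str, enum.Enum):  # hypothetical enum, not from the PR
    FP32 = "fp32"
    BF16 = "bf16"


class ExampleConfig(BaseModel):  # hypothetical model, not from the PR
    precision: Precision

    @field_serializer("precision")
    def serialize_precision(self, value: Precision) -> str:
        # Plain strings are natively representable by yaml.safe_dump.
        return value.value

    @field_validator("precision", mode="before")
    @classmethod
    def validate_precision(cls, value):
        # Accept the string form coming back from YAML.
        return Precision(value) if isinstance(value, str) else value


cfg = ExampleConfig(precision="bf16")
print(yaml.safe_dump(cfg.model_dump()))  # precision: bf16
```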

@malcolmgreaves (Collaborator) left a comment:
LGTM. Minor non-blocking suggestions.

Comment on lines +49 to +62

```python
def deserialize_str_to_path(path: str) -> pathlib.Path:
    """General purpose deserialize for string/path objects. Since YAML has no native representation for pathlib.Path, we serialize to strings. Import this method as a @field_validator."""
    return pathlib.Path(path)


def serialize_path_or_str(path: str | pathlib.Path) -> str:
    """General purpose serialization for string/path objects. Since YAML has no native representation for pathlib.Path, we serialize to strings. Import this method as a @field_serializer."""
    if isinstance(path, pathlib.Path):
        return str(path)
    elif isinstance(path, str):
        return path
    else:
        raise ValueError(f"Expected str or pathlib.Path, got {type(path)}")
```
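A hedged usage sketch of how these helpers might be attached to a model, per the docstrings above; the `DataConfig` model and `train_path` field are illustrative, not from the PR:

```python
import pathlib

from pydantic import BaseModel, field_serializer, field_validator


class DataConfig(BaseModel):  # hypothetical model, not from the PR
    train_path: pathlib.Path

    @field_validator("train_path", mode="before")
    @classmethod
    def _coerce_path(cls, value):
        # Reuse the module-level helper on the string form read from YAML.
        return deserialize_str_to_path(value) if isinstance(value, str) else value

    @field_serializer("train_path")
    def _dump_path(self, value):
        # Reuse the module-level helper so YAML sees a plain string.
        return serialize_path_or_str(value)
```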
Collaborator comment:
Minor suggestion: these don't necessarily need to be in bionemo-llm. They're good candidates to go into bionemo-core IMO. Not review blocking, but something to consider.

Comment on lines 96 to +97

```diff
 with open(config_path, "r") as f:
-    config_dict = json.load(f)
+    config_dict = yaml.safe_load(f)
```
Collaborator comment:
Not review-blocking: you could consider adding YAML support instead of replacing the existing JSON support w/ YAML support only. For example, something like:

```python
try:
    with open(config_path, "rt") as rt:
        config_dict = yaml.safe_load(rt)
except yaml.YAMLError as error_yaml:
    try:
        with open(config_path, "rt") as rt:
            config_dict = json.load(rt)
    except json.JSONDecodeError as error_json:
        # error_yaml survives as the implicit context of error_json,
        # so the traceback reports both parse failures.
        raise ValueError(f"YAML parsing and JSON-parsing fallback failed for {config_path=}") from error_json
```

Collaborator comment:
imho, supporting many config formats might introduce some ambiguity, as it would require us to maintain two different configuration formats while our codebase and CLI are still evolving significantly. However, it could be worth considering as a potential feature in the future.

@dorotat-nv (Collaborator) left a comment:
thanks @skothenhill-nv! I left some comments and spotted a few places that still reference json

```diff
@@ -158,7 +158,7 @@ TWINE_PASSWORD="<pypi pass>" TWINE_USERNAME="<pypi user>" uvx twine upload /sub-
 ## Pydantic Configuration

 BioNeMo 2 provides two entrypoints for models with both argparse and pydantic. Both documented in the `Models` section below.
-Pydantic based configuration is designed to accept a configuration json file as input, along with context specific arguments (e.g., should we resume from existing checkpoints?). These JSON configs go through a Pydantic Validator, in this case referred to as `MainConfig`. This Config is composed of several other Pydantic models, see the class definition for details. To pre-populate a config with reasonable defaults for various standard models, we provide 'recipes.' These are simple methods that instantiate the config object and then serialize it to a JSON configuration file. From this file, you may either submit it directly, or modify the various parameters to meet your usecase. For example, Weights and biases, devices, precision, and dataset options are all extremely useful to modify. Then, you would submit this config for training.
+Pydantic based configuration is designed to accept a configuration json file as input, along with context specific arguments (e.g., should we resume from existing checkpoints?). These YAML configs go through a Pydantic Validator, in this case referred to as `MainConfig`. This Config is composed of several other Pydantic models, see the class definition for details. To pre-populate a config with reasonable defaults for various standard models, we provide 'recipes.' These are simple methods that instantiate the config object and then serialize it to a YAML configuration file. From this file, you may either submit it directly, or modify the various parameters to meet your usecase. For example, Weights and biases, devices, precision, and dataset options are all extremely useful to modify. Then, you would submit this config for training.
```
Collaborator comment:
yaml file?
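For context, a recipe-generated YAML config of the kind described above might look roughly like the following; all field names here are hypothetical illustrations, not taken from the PR:

```yaml
# Illustrative sketch only; the real MainConfig fields may differ.
experiment_config:
  result_dir: ./results
  experiment_name: esm2-pretrain
parallel_config:
  num_devices: 1
training_config:
  precision: bf16
  max_steps: 100
data_config:
  train_cluster_path: /data/train_clusters.parquet
```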

```diff
@@ -227,22 +227,22 @@ bionemo-esm2-recipe \

 > ⚠️ **IMPORTANT:** Inspect and edit the contents of the outputted my_config.json as you see fit
```
Collaborator comment:
yaml?

```diff
@@ -227,22 +227,22 @@ bionemo-esm2-recipe \
```
Collaborator comment:
`--dest my_config.yaml` — @skothenhill-nv, I would suggest searching for all occurrences of .json in the codebase

Collaborator comment:
Imho it is a bit unclear what those recipes include. I.e.:

> Alternatively, we provide a validated and serialized configuration file entrypoint for executing the same workflow. Recipes are available for 8m, 650m, and 3b ESM2 models. You may select which preset config to use by setting the --recipe parameter.

Does it mean that each config includes multiple recipes? From the command line it seems so. What would happen if --recipe is not provided? Would all the recipes in the config be launched?

```diff
@@ -227,22 +227,22 @@ bionemo-esm2-recipe \

 > ⚠️ **IMPORTANT:** Inspect and edit the contents of the outputted my_config.json as you see fit

-> NOTE: To pretrain from an existing checkpoint, simply pass in the path --initial-ckpt-path to the recipe command. This will populate the JSON with the correct field to ensure pretraining is initialized from an existing checkpoint.
+> NOTE: To pretrain from an existing checkpoint, simply pass in the path --initial-ckpt-path to the recipe command. This will populate the YAML with the correct field to ensure pretraining is initialized from an existing checkpoint.

 To submit a training job with the passed config, first update the json file with any additional execution parameters
```
Collaborator comment:
json -> yaml

````diff
   those required for pretraining. Alternatively, things like fine-tuning with custom task heads may be specified here.
   This allows for mixing/matching Data Modules with various tasks.
 - Data Config type, this specifies how to parse, validate, and prepare the DataModule. This may change depending on task,
   for example, pretraining ESM2 uses a protein cluster oriented sampling method. In the case of inference or fine-tuning
   a pretrained model, a simple fasta file may be sufficient. There is a one-to-one relationship between DataConfig types
   and DataModule types.

-> ⚠️ **Warning:** This setup does NO configuration of Weights and Biases. Edit your config JSON and populate it with your WandB details.
+> ⚠️ **Warning:** This setup does NO configuration of Weights and Biases. Edit your config YAML and populate it with your WandB details.

 ```
 bionemo-esm2-train \
````
Collaborator comment:
my_config.json -> yaml

Collaborator comment:
what does the "t" in data-config-t mean?

Collaborator comment:
does the t refer to the class? Maybe instead of t we could use cls?


```diff
-> NOTE: To pretrain from an existing checkpoint, simply pass in the path --initial-ckpt-path to the recipe command. This will populate the JSON with the correct field to ensure pretraining is initialized from an existing checkpoint.
+> NOTE: To pretrain from an existing checkpoint, simply pass in the path --initial-ckpt-path to the recipe command. This will populate the YAML with the correct field to ensure pretraining is initialized from an existing checkpoint.
```
Collaborator comment:
what does it mean to pretrain from an existing ckpt? I.e., to resume pretraining, or to pretrain from 0 using the other checkpoint's config?


```diff
-> NOTE: To pretrain from an existing checkpoint, simply pass in the path --initial-ckpt-path to the recipe command. This will populate the JSON with the correct field to ensure pretraining is initialized from an existing checkpoint.
+> NOTE: To pretrain from an existing checkpoint, simply pass in the path --initial-ckpt-path to the recipe command. This will populate the YAML with the correct field to ensure pretraining is initialized from an existing checkpoint.

 To submit a training job with the passed config, first update the json file with any additional execution parameters
```
Collaborator comment:
json -> yaml

```diff
@@ -370,7 +371,7 @@ def esm2_tiny_test_recipe(args):

 def main():  # noqa: D103
     def parse_args():
-        parser = argparse.ArgumentParser(description="Create ESM2 configuration JSON.")
+        parser = argparse.ArgumentParser(description="Create ESM2 configuration YAML.")
         parser.add_argument(
```
Collaborator comment:
instead of listing recipes, we could use some pydantic class RecipeType for ESM2, choices=RecipeTypePretrain[ESM2].fields.keys(), to have pydantic as the validator (see the sketch below)
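A rough sketch of that idea under stated assumptions: no `RecipeTypePretrain` registry exists in the PR, so a hypothetical `EsmRecipe` enum stands in as the single source of truth for recipe names (the 8m/650m/3b names are assumed from the docs above):

```python
import argparse
import enum


class EsmRecipe(str, enum.Enum):  # hypothetical; not from the PR
    esm2_8m = "esm2_8m"
    esm2_650m = "esm2_650m"
    esm2_3b = "esm2_3b"


parser = argparse.ArgumentParser(description="Create ESM2 configuration YAML.")
parser.add_argument(
    "--recipe",
    type=EsmRecipe,  # argparse converts the raw string; invalid names fail fast
    choices=list(EsmRecipe),
    required=True,
)
```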


```diff
 # Save to file
 with open(
     args.dest,
     "w",
 ) as f:
-    f.write(json_str)
+    f.write(yaml_str)
```
Collaborator comment:
why not use

```python
with open("config.yaml", "w") as file:
    yaml.dump(config, file, ...)
```
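One possible reason for the `f.write(yaml_str)` form (an assumption, not stated in the PR): PyYAML cannot represent pydantic model objects directly, so the model has to be converted to plain builtins before dumping, e.g.:

```python
# Sketch assuming pydantic v2: mode="json" coerces Path/Enum fields to
# YAML-friendly builtins before PyYAML ever sees them.
import yaml

with open(args.dest, "w") as f:
    yaml.safe_dump(config.model_dump(mode="json"), f, sort_keys=False)
```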
