JSON -> YAML for CLI #436
Conversation
LGTM. Minor non-blocking suggestions.
```python
def deserialize_str_to_path(path: str) -> pathlib.Path:
    """General purpose deserialize for string/path objects. Since YAML has no native representation for pathlib.Path, we serialize to strings. Import this method as a @field_validator."""
    return pathlib.Path(path)


def serialize_path_or_str(path: str | pathlib.Path) -> str:
    """General purpose serialization for string/path objects. Since YAML has no native representation for pathlib.Path, we serialize to strings. Import this method as a @field_serializer."""
    if isinstance(path, pathlib.Path):
        return str(path)
    elif isinstance(path, str):
        return path
    else:
        raise ValueError(f"Expected str or pathlib.Path, got {type(path)}")
```
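For context, a rough usage sketch of how these helpers could be wired into a pydantic model (not part of the diff; the model and field names below are made up for illustration, and it assumes pydantic v2):

```python
import pathlib

import yaml
from pydantic import BaseModel, field_serializer, field_validator


class ExampleDataConfig(BaseModel):
    """Illustrative config: a Path field that round-trips through YAML as a plain string."""

    train_data: pathlib.Path

    @field_validator("train_data", mode="before")
    @classmethod
    def _validate_train_data(cls, value: str | pathlib.Path) -> pathlib.Path:
        # Delegate to the shared helper from the diff above.
        return deserialize_str_to_path(str(value))

    @field_serializer("train_data")
    def _serialize_train_data(self, value: pathlib.Path) -> str:
        # Delegate to the shared helper from the diff above.
        return serialize_path_or_str(value)


cfg = ExampleDataConfig(train_data="/data/train.fasta")
print(yaml.safe_dump(cfg.model_dump()))  # -> train_data: /data/train.fasta
```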
Minor suggestion: these don't necessarily need to be in `bionemo-llm`. They're good candidates to go into `bionemo-core` IMO. Not review blocking, but something to consider.
```diff
 with open(config_path, "r") as f:
-    config_dict = json.load(f)
+    config_dict = yaml.safe_load(f)
```
Not review-blocking: you could consider adding YAML support instead of replacing the existing JSON support with YAML-only support. For example, something like:

```python
import json
import yaml

try:
    with open(config_path, "rt") as rt:
        config_dict = yaml.safe_load(rt)
except yaml.YAMLError as error_yaml:
    try:
        with open(config_path, "rt") as rt:
            config_dict = json.load(rt)
    except json.JSONDecodeError as error_json:
        try:
            # Chain both failures so neither the YAML nor the JSON error is lost.
            raise ValueError(f"YAML parsing and JSON-parsing fallback failed for {config_path=}") from error_json
        except Exception as e:
            raise e from error_yaml
```
imho, supporting many config formats might introduce some ambiguity, as it would require us to maintain two different configuration formats while our codebase and CLI are still evolving significantly. However, it could be worth considering as a potential feature in the future.
thanks @skothenhill-nv ! I left some comments and spotted a few places with references to json
```diff
@@ -158,7 +158,7 @@ TWINE_PASSWORD="<pypi pass>" TWINE_USERNAME="<pypi user>" uvx twine upload /sub-

 ## Pydantic Configuration

 BioNeMo 2 provides two entrypoints for models with both argparse and pydantic. Both documented in the `Models` section below.
-Pydantic based configuration is designed to accept a configuration json file as input, along with context specific arguments (e.g., should we resume from existing checkpoints?). These JSON configs go through a Pydantic Validator, in this case referred to as `MainConfig`. This Config is composed of several other Pydantic models, see the class definition for details. To pre-populate a config with reasonable defaults for various standard models, we provide 'recipes.' These are simple methods that instantiate the config object and then serialize it to a JSON configuration file. From this file, you may either submit it directly, or modify the various parameters to meet your usecase. For example, Weights and biases, devices, precision, and dataset options are all extremely useful to modify. Then, you would submit this config for training.
+Pydantic based configuration is designed to accept a configuration json file as input, along with context specific arguments (e.g., should we resume from existing checkpoints?). These YAML configs go through a Pydantic Validator, in this case referred to as `MainConfig`. This Config is composed of several other Pydantic models, see the class definition for details. To pre-populate a config with reasonable defaults for various standard models, we provide 'recipes.' These are simple methods that instantiate the config object and then serialize it to a YAML configuration file. From this file, you may either submit it directly, or modify the various parameters to meet your usecase. For example, Weights and biases, devices, precision, and dataset options are all extremely useful to modify. Then, you would submit this config for training.
```
yaml file?
```diff
@@ -227,22 +227,22 @@ bionemo-esm2-recipe \

 > ⚠️ **IMPORTANT:** Inspect and edit the contents of the outputted my_config.json as you see fit
```
yaml?
```diff
@@ -227,22 +227,22 @@ bionemo-esm2-recipe \
```
Re `--dest my_config.yaml`: @skothenhill-nv, I would suggest searching for all occurrences of `.json` in the codebase.
Imho it is a bit unclear what those recipes include. I.e.:

> Alternatively, we provide a validated and serialized configuration file entrypoint for executing the same workflow. Recipes are available for 8m, 650m, and 3b ESM2 models. You may select which preset config to use by setting the `--recipe` parameter.

Does it mean that each config includes multiple recipes? From the command line it seems so. What would happen if `--recipe` is not provided? Are all the recipes in the config launched?
```diff
@@ -227,22 +227,22 @@ bionemo-esm2-recipe \

 > ⚠️ **IMPORTANT:** Inspect and edit the contents of the outputted my_config.json as you see fit

-> NOTE: To pretrain from an existing checkpoint, simply pass in the path --initial-ckpt-path to the recipe command. This will populate the JSON with the correct field to ensure pretraining is initialized from an existing checkpoint.
+> NOTE: To pretrain from an existing checkpoint, simply pass in the path --initial-ckpt-path to the recipe command. This will populate the YAML with the correct field to ensure pretraining is initialized from an existing checkpoint.

 To submit a training job with the passed config, first update the json file with any additional execution parameters
```
json -> yaml
````diff
 those required for pretraining. Alternatively, things like fine-tuning with custom task heads may be specified here.
 This allows for mixing/matching Data Modules with various tasks.
 - Data Config type, this specifies how to parse, validate, and prepare the DataModule. This may change depending on task,
 for example, pretraining ESM2 uses a protein cluster oriented sampling method. In the case of inference or fine-tuning
 a pretrained model, a simple fasta file may be sufficient. There is a one-to-one relationship between DataConfig types
 and DataModule types.

-> ⚠️ **Warning:** This setup does NO configuration of Weights and Biases. Edit your config JSON and populate it with your WandB details.
+> ⚠️ **Warning:** This setup does NO configuration of Weights and Biases. Edit your config YAML and populate it with your WandB details.

 ```
 bionemo-esm2-train \
````
my_config.json -> yaml
what does the "t" in `data-config-t` mean?
does `t` refer to the class? maybe instead of `t` we can have `cls`?
```diff
-> NOTE: To pretrain from an existing checkpoint, simply pass in the path --initial-ckpt-path to the recipe command. This will populate the JSON with the correct field to ensure pretraining is initialized from an existing checkpoint.
+> NOTE: To pretrain from an existing checkpoint, simply pass in the path --initial-ckpt-path to the recipe command. This will populate the YAML with the correct field to ensure pretraining is initialized from an existing checkpoint.
```
what does it mean to pretrain from an existing ckpt? I.e., to resume pretraining, or to pretrain from scratch using the other checkpoint's config?
```diff
-> NOTE: To pretrain from an existing checkpoint, simply pass in the path --initial-ckpt-path to the recipe command. This will populate the JSON with the correct field to ensure pretraining is initialized from an existing checkpoint.
+> NOTE: To pretrain from an existing checkpoint, simply pass in the path --initial-ckpt-path to the recipe command. This will populate the YAML with the correct field to ensure pretraining is initialized from an existing checkpoint.

 To submit a training job with the passed config, first update the json file with any additional execution parameters
```
json -> yaml
```diff
@@ -370,7 +371,7 @@ def esm2_tiny_test_recipe(args):

 def main():  # noqa: D103
     def parse_args():
-        parser = argparse.ArgumentParser(description="Create ESM2 configuration JSON.")
+        parser = argparse.ArgumentParser(description="Create ESM2 configuration YAML.")
         parser.add_argument(
```
instead of listing the recipes by hand, we could use some pydantic class `RecipeType` for ESM2, e.g. `choices=RecipeTypePretrain[ESM2].fields.keys()`, to have pydantic as the validator
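A hedged sketch of that idea (the `Esm2PretrainRecipes` class and its recipe keys are purely illustrative, and `model_fields` assumes pydantic v2): keep the presets in one pydantic registry and derive the `--recipe` choices from its fields, so the CLI and the validator cannot drift apart.

```python
import argparse

from pydantic import BaseModel


class Esm2PretrainRecipes(BaseModel):
    """Hypothetical registry: each field names one preset recipe."""

    esm2_8m: str = "ESM2 8M pretraining preset"
    esm2_650m: str = "ESM2 650M pretraining preset"
    esm2_3b: str = "ESM2 3B pretraining preset"


parser = argparse.ArgumentParser(description="Create ESM2 configuration YAML.")
parser.add_argument(
    "--recipe",
    required=True,
    # The pydantic field registry doubles as the list of valid CLI choices.
    choices=list(Esm2PretrainRecipes.model_fields.keys()),
)

args = parser.parse_args()
print(f"Selected recipe preset: {args.recipe}")
```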
```diff
 # Save to file
 with open(
     args.dest,
     "w",
 ) as f:
-    f.write(json_str)
+    f.write(yaml_str)
```
why not use

```python
with open("config.yaml", "w") as file:
    yaml.dump(config, file, ...)
```
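If `config` here is the pydantic `MainConfig` instance, one way that suggestion could look (a sketch only, assuming pydantic v2's `model_dump`; `config` and `args.dest` are taken from the surrounding discussion) is to dump the model to plain Python types first and hand those to PyYAML:

```python
import yaml

# Convert the pydantic model to JSON-compatible builtins, then let PyYAML
# write it; sort_keys=False preserves the field order of the config model.
with open(args.dest, "w") as f:
    yaml.safe_dump(config.model_dump(mode="json"), f, sort_keys=False)
```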
Replaces JSON usage with YAML for the pydantic-based CLI.
To do this, we had to add a few custom serializers for types not natively supported by YAML: