JSON -> YAML for CLI #436
Conversation
LGTM. Minor non-blocking suggestions.
```python
def deserialize_str_to_path(path: str) -> pathlib.Path:
    """General purpose deserialize for string/path objects. Since YAML has no native representation for pathlib.Path, we serialize to strings. Import this method as a @field_validator."""
    return pathlib.Path(path)


def serialize_path_or_str(path: str | pathlib.Path) -> str:
    """General purpose serialization for string/path objects. Since YAML has no native representation for pathlib.Path, we serialize to strings. Import this method as a @field_serializer."""
    if isinstance(path, pathlib.Path):
        return str(path)
    elif isinstance(path, str):
        return path
    else:
        raise ValueError(f"Expected str or pathlib.Path, got {type(path)}")
```
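For context, a rough usage sketch of how these helpers could be wired into a pydantic model (not part of the diff; the model and field names below are made up for illustration, and it assumes pydantic v2):

```python
import pathlib

import yaml
from pydantic import BaseModel, field_serializer, field_validator


class ExampleDataConfig(BaseModel):
    """Illustrative config: a Path field that round-trips through YAML as a plain string."""

    train_data: pathlib.Path

    @field_validator("train_data", mode="before")
    @classmethod
    def _validate_train_data(cls, value: str | pathlib.Path) -> pathlib.Path:
        # Delegate to the shared helper from the diff above.
        return deserialize_str_to_path(str(value))

    @field_serializer("train_data")
    def _serialize_train_data(self, value: pathlib.Path) -> str:
        # Delegate to the shared helper from the diff above.
        return serialize_path_or_str(value)


cfg = ExampleDataConfig(train_data="/data/train.fasta")
print(yaml.safe_dump(cfg.model_dump()))  # -> train_data: /data/train.fasta
```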
Minor suggestion: these don't necessarily need to be in `bionemo-llm`. They're good candidates to go into `bionemo-core` IMO. Not review blocking, but something to consider.
```diff
 with open(config_path, "r") as f:
-    config_dict = json.load(f)
+    config_dict = yaml.safe_load(f)
```
Not review-blocking: you could consider adding YAML support instead of replacing the existing JSON support with YAML-only support. For example, something like:

```python
import json
import yaml

try:
    with open(config_path, "rt") as rt:
        config_dict = yaml.safe_load(rt)
except yaml.YAMLError as error_yaml:
    try:
        with open(config_path, "rt") as rt:
            config_dict = json.load(rt)
    except json.JSONDecodeError as error_json:
        try:
            # Chain both failures so neither the YAML nor the JSON error is lost.
            raise ValueError(f"YAML parsing and JSON-parsing fallback failed for {config_path=}") from error_json
        except Exception as e:
            raise e from error_yaml
```
imho, supporting many config formats might introduce some ambiguity, as it would require us to maintain two different configuration formats while our codebase and CLI are still evolving significantly. However, it could be worth considering as a potential feature in the future.
thanks @skothenhill-nv ! I left some comments and spotted a few places with references to json
```diff
@@ -158,7 +158,7 @@ TWINE_PASSWORD="<pypi pass>" TWINE_USERNAME="<pypi user>" uvx twine upload /sub-

 ## Pydantic Configuration

 BioNeMo 2 provides two entrypoints for models with both argparse and pydantic. Both documented in the `Models` section below.
-Pydantic based configuration is designed to accept a configuration json file as input, along with context specific arguments (e.g., should we resume from existing checkpoints?). These JSON configs go through a Pydantic Validator, in this case referred to as `MainConfig`. This Config is composed of several other Pydantic models, see the class definition for details. To pre-populate a config with reasonable defaults for various standard models, we provide 'recipes.' These are simple methods that instantiate the config object and then serialize it to a JSON configuration file. From this file, you may either submit it directly, or modify the various parameters to meet your usecase. For example, Weights and biases, devices, precision, and dataset options are all extremely useful to modify. Then, you would submit this config for training.
+Pydantic based configuration is designed to accept a configuration json file as input, along with context specific arguments (e.g., should we resume from existing checkpoints?). These YAML configs go through a Pydantic Validator, in this case referred to as `MainConfig`. This Config is composed of several other Pydantic models, see the class definition for details. To pre-populate a config with reasonable defaults for various standard models, we provide 'recipes.' These are simple methods that instantiate the config object and then serialize it to a YAML configuration file. From this file, you may either submit it directly, or modify the various parameters to meet your usecase. For example, Weights and biases, devices, precision, and dataset options are all extremely useful to modify. Then, you would submit this config for training.
```
yaml file?
```diff
@@ -227,22 +227,22 @@ bionemo-esm2-recipe \

 > ⚠️ **IMPORTANT:** Inspect and edit the contents of the outputted my_config.json as you see fit
```
yaml?
```diff
@@ -227,22 +227,22 @@ bionemo-esm2-recipe \
```
Re `--dest my_config.yaml`: @skothenhill-nv, I would suggest searching for all occurrences of `.json` in the codebase.
Imho it is a bit unclear what those recipes include. I.e.:

> Alternatively, we provide a validated and serialized configuration file entrypoint for executing the same workflow. Recipes are available for 8m, 650m, and 3b ESM2 models. You may select which preset config to use by setting the `--recipe` parameter.

Does it mean that each config includes multiple recipes? From the command line it seems so. What would happen if `--recipe` is not provided? Are all the recipes in the config launched?
```diff
@@ -227,22 +227,22 @@ bionemo-esm2-recipe \

 > ⚠️ **IMPORTANT:** Inspect and edit the contents of the outputted my_config.json as you see fit

-> NOTE: To pretrain from an existing checkpoint, simply pass in the path --initial-ckpt-path to the recipe command. This will populate the JSON with the correct field to ensure pretraining is initialized from an existing checkpoint.
+> NOTE: To pretrain from an existing checkpoint, simply pass in the path --initial-ckpt-path to the recipe command. This will populate the YAML with the correct field to ensure pretraining is initialized from an existing checkpoint.

 To submit a training job with the passed config, first update the json file with any additional execution parameters
```
json -> yaml
````diff
 those required for pretraining. Alternatively, things like fine-tuning with custom task heads may be specified here.
 This allows for mixing/matching Data Modules with various tasks.
 - Data Config type, this specifies how to parse, validate, and prepare the DataModule. This may change depending on task,
 for example, pretraining ESM2 uses a protein cluster oriented sampling method. In the case of inference or fine-tuning
 a pretrained model, a simple fasta file may be sufficient. There is a one-to-one relationship between DataConfig types
 and DataModule types.

-> ⚠️ **Warning:** This setup does NO configuration of Weights and Biases. Edit your config JSON and populate it with your WandB details.
+> ⚠️ **Warning:** This setup does NO configuration of Weights and Biases. Edit your config YAML and populate it with your WandB details.

 ```
 bionemo-esm2-train \
````
my_config.json -> yaml
what does the "t" in `data-config-t` mean?
does `t` refer to the class? maybe instead of `t` we can have `cls`?
```diff
-> NOTE: To pretrain from an existing checkpoint, simply pass in the path --initial-ckpt-path to the recipe command. This will populate the JSON with the correct field to ensure pretraining is initialized from an existing checkpoint.
+> NOTE: To pretrain from an existing checkpoint, simply pass in the path --initial-ckpt-path to the recipe command. This will populate the YAML with the correct field to ensure pretraining is initialized from an existing checkpoint.
```
what does it mean to pretrain from an existing ckpt? I.e., to resume pretraining, or to pretrain from scratch using the other checkpoint's config?
```diff
-> NOTE: To pretrain from an existing checkpoint, simply pass in the path --initial-ckpt-path to the recipe command. This will populate the JSON with the correct field to ensure pretraining is initialized from an existing checkpoint.
+> NOTE: To pretrain from an existing checkpoint, simply pass in the path --initial-ckpt-path to the recipe command. This will populate the YAML with the correct field to ensure pretraining is initialized from an existing checkpoint.

 To submit a training job with the passed config, first update the json file with any additional execution parameters
```
json -> yaml
```diff
@@ -370,7 +371,7 @@ def esm2_tiny_test_recipe(args):

 def main():  # noqa: D103
     def parse_args():
-        parser = argparse.ArgumentParser(description="Create ESM2 configuration JSON.")
+        parser = argparse.ArgumentParser(description="Create ESM2 configuration YAML.")
         parser.add_argument(
```
instead of listing the recipes by hand, we could use some pydantic class `RecipeType` for ESM2, e.g. `choices=RecipeTypePretrain[ESM2].fields.keys()`, to have pydantic as the validator
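A hedged sketch of that idea (the `Esm2PretrainRecipes` class and its recipe keys are purely illustrative, and `model_fields` assumes pydantic v2): keep the presets in one pydantic registry and derive the `--recipe` choices from its fields, so the CLI and the validator cannot drift apart.

```python
import argparse

from pydantic import BaseModel


class Esm2PretrainRecipes(BaseModel):
    """Hypothetical registry: each field names one preset recipe."""

    esm2_8m: str = "ESM2 8M pretraining preset"
    esm2_650m: str = "ESM2 650M pretraining preset"
    esm2_3b: str = "ESM2 3B pretraining preset"


parser = argparse.ArgumentParser(description="Create ESM2 configuration YAML.")
parser.add_argument(
    "--recipe",
    required=True,
    # The pydantic field registry doubles as the list of valid CLI choices.
    choices=list(Esm2PretrainRecipes.model_fields.keys()),
)

args = parser.parse_args()
print(f"Selected recipe preset: {args.recipe}")
```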
```diff
 # Save to file
 with open(
     args.dest,
     "w",
 ) as f:
-    f.write(json_str)
+    f.write(yaml_str)
```
why not use

```python
with open("config.yaml", "w") as file:
    yaml.dump(config, file, ...)
```
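If `config` here is the pydantic `MainConfig` instance, one way that suggestion could look (a sketch only, assuming pydantic v2's `model_dump`; `config` and `args.dest` are taken from the surrounding discussion) is to dump the model to plain Python types first and hand those to PyYAML:

```python
import yaml

# Convert the pydantic model to JSON-compatible builtins, then let PyYAML
# write it; sort_keys=False preserves the field order of the config model.
with open(args.dest, "w") as f:
    yaml.safe_dump(config.model_dump(mode="json"), f, sort_keys=False)
```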
Replaces JSON usage with YAML for the pydantic-based CLI.
To do this, we had to add a few custom serializers for types not natively supported by YAML: