Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New biobb_pytorch Molecular dynamics autoencoder wrapper #173

Merged
merged 20 commits into from
Dec 5, 2024

Conversation

PauAndrio
Copy link
Contributor

Wrappers for the new collection of blocks to train and apply autoencoder models to molecular dynamics data.

tools/biobb_pytorch/biobb_apply_mdae.xml Outdated Show resolved Hide resolved
tools/biobb_pytorch/.shed.yml Show resolved Hide resolved
tools/biobb_pytorch/biobb_apply_mdae.xml Outdated Show resolved Hide resolved
tools/biobb_pytorch/biobb_apply_mdae.xml Outdated Show resolved Hide resolved
tools/biobb_pytorch/biobb_apply_mdae.xml Outdated Show resolved Hide resolved
tools/biobb_pytorch/biobb_apply_mdae.xml Outdated Show resolved Hide resolved
tools/biobb_pytorch/biobb_apply_mdae.xml Outdated Show resolved Hide resolved
tools/biobb_pytorch/biobb_apply_mdae.xml Outdated Show resolved Hide resolved
tools/biobb_pytorch/biobb_apply_mdae.xml Outdated Show resolved Hide resolved
@PauAndrio PauAndrio requested a review from bgruening June 28, 2024 09:15
tools/biobb_pytorch/.shed.yml Outdated Show resolved Hide resolved
tools/biobb_pytorch/biobb_apply_mdae.xml Outdated Show resolved Hide resolved
tools/biobb_pytorch/biobb_apply_mdae.xml Outdated Show resolved Hide resolved
tools/biobb_pytorch/biobb_train_mdae.xml Outdated Show resolved Hide resolved
tools/biobb_pytorch/biobb_train_mdae.xml Outdated Show resolved Hide resolved
tools/biobb_pytorch/biobb_train_mdae.xml Outdated Show resolved Hide resolved
@bgruening
Copy link
Contributor

Please have a look at the linting logs: https://github.com/galaxycomputationalchemistry/galaxy-tools-compchem/actions/runs/9906049158/job/27368207101?pr=173

planemo can also test this with planemo lint


<outputs>
<data format="npy" name="output_reconstructed_data_npy_path" label="output_reconstructed_data_npy_path" />
<data format="npy" name="output_latent_space_npy_path" label="output_latent_space_npy_path" />
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this output is optional you can use a filter here to only define this output when your boolean is true.

Comment on lines 22 to 30
`#Required Outputs using file extension`
touch '$output_reconstructed_data_npy_path' &&
ln -s '$output_reconstructed_data_npy_path' ./output_reconstructed_data_npy_path.$output_reconstructed_data_npy_path.ext &&

`#Optional Outputs using file extension`
#if $output_latent_space_npy_path:
touch '$output_latent_space_npy_path' &&
ln -s '$output_latent_space_npy_path' ./output_latent_space_npy_path.$output_latent_space_npy_path.ext &&
#end if
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
`#Required Outputs using file extension`
touch '$output_reconstructed_data_npy_path' &&
ln -s '$output_reconstructed_data_npy_path' ./output_reconstructed_data_npy_path.$output_reconstructed_data_npy_path.ext &&
`#Optional Outputs using file extension`
#if $output_latent_space_npy_path:
touch '$output_latent_space_npy_path' &&
ln -s '$output_latent_space_npy_path' ./output_latent_space_npy_path.$output_latent_space_npy_path.ext &&
#end if

This looks very complicated, I guess you can remove this and try something like below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your feedback. I removed the part you mentioned, used a literal in the command, and utilized the "from_work_dir" attribute. This has significantly simplified things.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool glad you found that useful. And sorry this PR takes so long, my hope is that if we do this one properly the rest will be a piece of cake :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. I have also removed the comments inside the command section to avoid potential issues with quotes or other special characters in the future. Is there anything else that needs to be changed before we merge?

tools/biobb_pytorch/biobb_apply_mdae.xml Show resolved Hide resolved
tools/biobb_pytorch/biobb_train_mdae.xml Outdated Show resolved Hide resolved
@bgruening
Copy link
Contributor

You could also, and that is the simplest way possible, just write in the command section: --output ./foo.npz.

And than in your outputs do <output name="blabla" from_work_dir="foo.npz"/>. The from_work_dir will then pick up the file and you don't need a mv or ln

@bgruening
Copy link
Contributor

Copy link
Contributor

@bgruening bgruening left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an idea. Is that something that is interesting?


<inputs>
<param name="input_data_npy_path" type="data" format="npy" optional="False" label="Input NPY file" help="Input data file"/>
<param name="input_model_pth_path" type="data" format="pth" optional="False" label="Input PTH file" help="Input model file"/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pickled objects are a bit dangerous, would it be possible to use ONNX https://pytorch.org/docs/stable/onnx.html

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping, any comments here?
Pickles contain arbitrary code and are a bad exchange format, as you need to trust them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While ONNX is a great choice for model export, as you suggested, and exporting models to ONNX is straightforward, there is currently no built-in functionality in PyTorch for importing ONNX models back into PyTorch. This has been a long-standing request from the PyTorch community since 2019, but as of now, there's no reliable or officially supported method for importing ONNX models back into PyTorch for further training or fine-tuning.

Given this limitation, switching entirely to ONNX would mean losing the ability to easily retrain, modify, or extend our models within the PyTorch ecosystem. For these reasons, despite the risks associated with pickling, we still favor the use of PyTorch’s torch.save() and torch.load() methods for model serialization. We're aware that these methods are not ideal from a security perspective, but they provide the necessary flexibility for ongoing model development.

Our intent developing this tools is to provide a way to create or improve very specialized models not to deploy them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, its a bit of a pain, I had hoped that https://github.com/Talmaj/onnx2pytorch is working for you.

<inputs>
<param name="input_data_npy_path" type="data" format="npy" optional="False" label="Input NPY file" help="Input data file"/>
<param name="input_model_pth_path" type="data" format="pth" optional="False" label="Input PTH file" help="Input model file"/>
<param name="config_json" type="data" format="json" optional="True" label="Configuration file" help="File containing tool settings"/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are users supposed to create such JSON files?

How many parameters do such files have in average? Is that 1-2 or more like 20-30?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The biobb_object has 38 configurable properties, and in addition, train_mdae introduces 16 specific properties, bringing the total to 54 potential parameters that users could modify in the JSON files. However, all of these properties come with default values, so users only need to tweak those that are relevant to their specific use case. We’ve put considerable effort into documenting these parameters and guiding users to focus on the most impactful ones, ensuring they can make adjustments as needed without being overwhelmed.

tools/biobb_pytorch/biobb_train_mdae.xml Outdated Show resolved Hide resolved
;
]]>
</command>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<configfiles>
<configfile name="train_config">
{
"properties": {
"num_epochs": $num_epoch,
"seed": $seed
}
}
</configfile>
</configfiles>

This way you can create those configfiles on the fly and ask your users for the inputs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That’s something we’re planning for a future release, where the most relevant properties will be integrated into Galaxy's UI through sliders, multi-select options, number validators, filters, etc., making the configuration process more user-friendly. However, for now, I’d like to keep things as simple as possible and focus on getting my first tool published in the Galaxy Toolshed.

Comment on lines +18 to +20
#if $config_json:
ln -s '$config_json' ./config_json.$config_json.ext &&
#end if
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#if $config_json:
ln -s '$config_json' ./config_json.$config_json.ext &&
#end if

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That’s something we’re planning for a future release, where the most relevant properties will be integrated into Galaxy's UI through sliders, multi-select options, number validators, filters, etc., making the configuration process more user-friendly. However, for now, I’d like to keep things as simple as possible and focus on getting my first tool published in the Galaxy Toolshed.

train_mdae

#if $config_json:
--config ./config_json.$config_json.ext
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
--config ./config_json.$config_json.ext
--config ./$train_config

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That’s something we’re planning for a future release, where the most relevant properties will be integrated into Galaxy's UI through sliders, multi-select options, number validators, filters, etc., making the configuration process more user-friendly. However, for now, I’d like to keep things as simple as possible and focus on getting my first tool published in the Galaxy Toolshed.

Copy link
Contributor

@bgruening bgruening left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PauAndrio looks good to me. Just this one security issue with the python pickels.

@PauAndrio
Copy link
Contributor Author

@PauAndrio looks good to me. Just this one security issue with the python pickels.

Sorry I thought this was already answered:

While ONNX is a great choice for model export, as you suggested, and exporting models to ONNX is straightforward, there is currently no built-in functionality in PyTorch for importing ONNX models back into PyTorch. This has been a long-standing request from the PyTorch community since 2019, but as of now, there's no reliable or officially supported method for importing ONNX models back into PyTorch for further training or fine-tuning.

Given this limitation, switching entirely to ONNX would mean losing the ability to easily retrain, modify, or extend our models within the PyTorch ecosystem. For these reasons, despite the risks associated with pickling, we still favor the use of PyTorch’s torch.save() and torch.load() methods for model serialization. We're aware that these methods are not ideal from a security perspective, but they provide the necessary flexibility for ongoing model development.

Our intent developing this tools is to provide a way to create or improve very specialized models not to deploy them.

@PauAndrio PauAndrio requested a review from bgruening October 7, 2024 13:27
Copy link
Contributor

@bgruening bgruening left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @PauAndrio and sorry for the long turn-around.

Please change the shed.yml and then we merge.


<inputs>
<param name="input_data_npy_path" type="data" format="npy" optional="False" label="Input NPY file" help="Input data file"/>
<param name="input_model_pth_path" type="data" format="pth" optional="False" label="Input PTH file" help="Input model file"/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, its a bit of a pain, I had hoped that https://github.com/Talmaj/onnx2pytorch is working for you.

owner: chemteam
description: "biobb_pytorch is the Biobb module collection to create and train ML & DL models using the popular [PyTorch](https://pytorch.org/) Python library."
homepage_url: https://github.com/bioexcel/biobb_pytorch
long_description: |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you convert this file into a suite-based file? https://github.com/galaxyproject/tools-iuc/blob/main/tools/semibin/.shed.yml

#if $output_performance_npz_path:
--output_performance_npz_path ./output_performance_npz_path.npz
#end if
;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
;

@PauAndrio
Copy link
Contributor Author

PauAndrio commented Dec 2, 2024

Hi, @bgruening,

Please, let me know if there is anything else that I have to do to resolve and merge this pull request?

Regards,
Pau

@bgruening
Copy link
Contributor

Let's get this in. Thanks a lot @PauAndrio!

@bgruening bgruening merged commit 891dd7d into galaxycomputationalchemistry:master Dec 5, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants