Addition of loggers and dist strategy #180
base: main
Conversation
Nice integration, just a few comments in addition to the ones left inline:
- `train.py` is not needed and should be replaced with a command like: `itwinai exec-pipeline --config pipeline.yaml --pipe-key pipeline`
- There are some new changes in `main`; in particular, the `TrainingConfiguration` has been improved to provide more values by default (although it can still support any new user-defined fields).
BCE = bce_loss
KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
return BCE + beta*KLD

@monitor_exec
def execute(self):
it would be nice to separate the boilerplate code (e.g., distributed strategy setup, loggers setup) from the actual training code, if possible. Ideally, the use cases should not worry about the setup and should only override the `train` method from the `TorchTrainer` class. In other words, use cases get the best out of the `TorchTrainer` when reusing its `execute` method and overriding only the `train` method.
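To illustrate the idea, a minimal sketch of that pattern (the `XTClimTrainer` name and the empty body are hypothetical; only `TorchTrainer`, `train`, and `execute` come from the comment above):

```python
from itwinai.torch.trainer import TorchTrainer


class XTClimTrainer(TorchTrainer):
    """Use-case trainer: all setup stays in the parent class."""

    def train(self):
        # Only the use-case training loop goes here; the distributed
        # strategy and the loggers are already set up by execute().
        ...
```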
As we had discussed, in this case the preprocessing part first writes the files to disk and then the trainer reads the data from disk. Hence `execute` is not used; rather, `@monitor_exec` is employed.
I am not sure I fully understood this.
I have similar comments as for the `execute` method in the `XTClimPredictor`.
use-cases/xtclim/src/trainer.py
Outdated
Parameters:
bce_loss: recontruction loss
mu: the mean from the latent vector
logvar: log variance from the latent vector
beta: weight over the KL-Divergence
this part of the docstring is not compliant with the Google docstring format
Modified
It should look more like this: https://itwinai.readthedocs.io/latest/_modules/itwinai/torch/trainer.html#TorchTrainer
- `Parameters` -> `Args`
- indent argument descriptions
- provide arg types in parentheses
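For reference, a sketch of the snippet above rewritten in Google style (the function name and the type of `beta` are assumptions; the formula comes from the quoted code):

```python
import torch


def final_loss(bce_loss, mu, logvar, beta):
    """Computes the total VAE loss.

    Args:
        bce_loss (torch.Tensor): reconstruction loss.
        mu (torch.Tensor): the mean from the latent vector.
        logvar (torch.Tensor): log variance from the latent vector.
        beta (float): weight over the KL-Divergence.

    Returns:
        torch.Tensor: total loss.
    """
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce_loss + beta * kld
```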
  batch_size: int,
- lr: float
+ lr: float,
`lr` and `batch_size` should be embedded in `config` (needs to be added to the constructor). The training config contains all the hyperparams and user-defined params.
Done
I mean that in the trainer constructor there should be an argument called `config` which accepts a dictionary object, which in turn contains a set of hyperparameters (lr, batch_size, etc.). This way we simplify the trainer constructor, grouping all the hyperparams under `config`. See this: https://itwinai.readthedocs.io/latest/python-api/itwinai.torch.modules.html#module-itwinai.torch.trainer
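A sketch of what is meant, under the assumption that the trainer forwards `config` to the parent constructor (values are illustrative; `XTClimTrainer` and `net` are as in the example later in this thread):

```python
from itwinai.torch.config import TrainingConfiguration

# All hyperparameters grouped in one object (field names as in the linked docs)
config = TrainingConfiguration(batch_size=32, optim_lr=1e-3)

# `net` is the torch.nn.Module instance to train
trainer = XTClimTrainer(config=config, model=net)
# Inside the trainer they are then read as self.config.batch_size, self.config.optim_lr, ...
```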
this still seems to be unresolved. Also, try as much as possible to reuse the existing field names from the training configuration: https://itwinai.readthedocs.io/latest/python-api/itwinai.torch.modules.html#itwinai.torch.config.TrainingConfiguration
For example, in this case `lr` would become `optim_lr`.
It is also possible to create a custom configuration for the use case to include use case hyperparams, as we did for Virgo:
itwinai/use-cases/virgo/trainer.py
Line 36 in af68d90
class VirgoTrainingConfiguration(TrainingConfiguration):
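Analogously to Virgo, a hypothetical use-case configuration could look like this (the class name, fields, and defaults are assumptions based on names appearing in this PR's code):

```python
from itwinai.torch.config import TrainingConfiguration


class XTClimTrainingConfiguration(TrainingConfiguration):
    """Use-case hyperparameters on top of the itwinai defaults."""

    #: Season tag used to select the preprocessed files (hypothetical default)
    season: str = "winter_"
    #: Number of ensemble members (hypothetical default)
    n_memb: int = 1
```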
use-cases/xtclim/src/trainer.py
Outdated
  # initialize the model
- cvae_model = model.ConvVAE().to(device)
+ cvae_model = model.ConvVAE()
  optimizer = optim.Adam(cvae_model.parameters(), lr=self.lr)
lr should be retrieved from self.config
Done
it should be `self.config.optim_lr`

I've noticed an interesting pattern: inside the trainer there is a loop, training a new model for each season. Perhaps it would be easier to create a new XTClimTrainer and train it for each season. It could receive the season as a constructor argument. This would simplify the training code inside the trainer. Although this is not mandatory, it would make the trainer more modular, allowing the instantiation of model, optimizer, and dataloader to be separated into the dedicated methods:

# Do some pre-processing
...

# Training
for season in seasons:
    net = MyModel()
    config = TrainingConfiguration(season=season, batch_size=32, optimizer='sgd', optim_lr=0.001, ...)
    trainer = XTClimTrainer(config=config, model=net)
    trainer.execute()

# Do some prediction
...

In this case, we would not use the itwinai Pipeline. Each trainer instance would create a new logger context (thus a new logging run), and a new distributed ML context (if distributed ML is needed)...
`train.py` now takes care of launching each season separately. In the current implementation, we can still keep the pipelines while running each season separately.
This has now been implemented in a slightly different manner.
I would suggest rebasing onto main before proceeding, given the latest changes.
Force-pushed from 2c26e1c to afa2765.
Branch has been rebased onto main.
def read_config(file_path):
    with open(file_path, 'r') as f:
        config = yaml.safe_load(f)
    return config
this could be replaced by the `itwinai.utils.load_yaml` function
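i.e., something like the following (assuming `load_yaml` takes the path to the YAML file; check the signature in `itwinai.utils`):

```python
from itwinai.utils import load_yaml

config = load_yaml(args.config)
```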
config['pipeline']['init_args']['steps']['training-step']['init_args']['seasons'] = season
model_uri = f"outputs/cvae_model_{season}1d_1memb.pth"
config['pipeline']['init_args']['steps']['evaluation-step']['init_args']['model_uri'] = model_uri
config['pipeline']['init_args']['steps']['evaluation-step']['init_args']['seasons'] = season
See comments in hpo.py on dynamic override of config fields
)
pipeline = pipe_parser.parse_pipeline()
print(f"Running pipeline for season: {season}")
pipeline.execute()
See the comment in hpo.py on breaking down the pipeline into preprocessing steps (potentially always the same across seasons) and training steps, to avoid re-doing the preprocessing. However, I am not an expert on this use case and this may not be feasible for some reason.
self.logger.create_logger_context()

for scenario in ['585', '245']:
These are magic "numbers". Make them into constants or at least some variable that tells us what they mean.
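For example (the constant name and the expansion of the codes are assumptions; '585' and '245' look like CMIP6 SSP scenario identifiers):

```python
# Presumably the SSP5-8.5 and SSP2-4.5 emission scenarios
SSP_SCENARIOS = ("585", "245")

for scenario in SSP_SCENARIOS:
    ...
```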
proj_data = np.load(f"input/preprocessed_1d_proj{scenario}_{season}data_1memb.npy")
pathlib
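e.g., a sketch of the `pathlib` version of the quoted line (the `INPUT_DIR` constant is hypothetical):

```python
from pathlib import Path

import numpy as np

INPUT_DIR = Path("input")
# scenario and season come from the surrounding loop
proj_data = np.load(INPUT_DIR / f"preprocessed_1d_proj{scenario}_{season}data_1memb.npy")
```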
proj_time['0'][i] ) for i in range(n_proj) ]
newlines (see earlier discussion)
projset, self.device, criterion,
pixel_wise_criterion)
newlines (see earlier discussions) and number of parameters per line
projset, self.device, criterion,
pixel_wise_criterion)
Same as above
projset, self.device, criterion,
pixel_wise_criterion)
tot_proj_losses = list(map(add, tot_proj_losses, proj_losses))
use of "map" (see earlier discussions)
proj_avg_losses = proj_avg_losses/n_avg
Space between operators
# save the losses time series
pd.DataFrame(tot_proj_losses).to_csv(f"outputs/proj{scenario}_losses_{season}1d_allssp.csv")
pathlib (see earlier discussions)
# save the losses time series
pd.DataFrame(tot_proj_losses).to_csv(f"outputs/proj{scenario}_losses_{season}1d_allssp.csv")
print(f'SSP{scenario} Projection average loss:', proj_avg_losses, 'for', season[:-1])
Why do you use an f-string but then also use commas to separate the arguments? You should stick to one: either use "{}" for all, or commas for all.
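e.g., the all-f-string version of the quoted print:

```python
print(f"SSP{scenario} Projection average loss: {proj_avg_losses} for {season[:-1]}")
```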
'Average projection loss',
kind='metric')
This probably fits on one line; plus the newline issues discussed earlier.
Returns:
- total loss (torch.Tensor)
"""
BCE = bce_loss
This can be removed?
- bce_loss (torch.Tensor): recontruction loss
- mu (torch.Tensor): the mean from the latent vector
- logvar (torch.Tensor): log variance from the latent vector
- beta (torch.Tensor): weight over the KL-Divergence

Returns:
- total loss (torch.Tensor)
Make sure to follow Google style guide for Python docstrings.
train_time = pd.read_csv(f"input/dates_train_{season}data_{self.config.n_memb}memb.csv")
train_data = np.load(
    f"input/preprocessed_1d_train_{season}data_{self.config.n_memb}memb.npy"
)
pathlib (see earlier discussion)
trainloader = self.strategy.create_dataloader(trainset, batch_size=self.config.batch_size,
                                              shuffle=True, pin_memory=True)
newlines with respect to parentheses (see earlier discussions)
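i.e., with a newline after the opening parenthesis and one argument per line:

```python
trainloader = self.strategy.create_dataloader(
    trainset,
    batch_size=self.config.batch_size,
    shuffle=True,
    pin_memory=True,
)
```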
test_time = pd.read_csv(f"input/dates_test_{season}data_{self.config.n_memb}memb.csv")
test_data = np.load(f"input/preprocessed_1d_test_{season}data_{self.config.n_memb}memb.npy")
pathlib (see earlier discussions)
testloader = self.strategy.create_dataloader(testset, batch_size=self.config.batch_size,
                                             shuffle=False, pin_memory=True)
newlines and parentheses
self.log(train_epoch_loss,
         'epoch_train_loss',
         kind='metric'
         )
self.log(valid_epoch_loss,
         'epoch_valid_loss',
         kind='metric'
         )
newline after opening parenthesis (might fit on one line?)
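e.g., both calls fit on one line each:

```python
self.log(train_epoch_loss, 'epoch_train_loss', kind='metric')
self.log(valid_epoch_loss, 'epoch_valid_loss', kind='metric')
```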
# save_ex(recon_images[0], epoch, season)

# decreasing learning rate
if (epoch + 1) % 20 == 0:
20 is a magic number. Put it in a constant or some variable
# decreasing learning rate
if (epoch + 1) % 20 == 0:
    self.config.lr = self.config.lr / 5
5 is a magic number. Put it in a constant or some variable
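e.g., covering both magic numbers at once (constant names are hypothetical):

```python
LR_DECAY_INTERVAL = 20  # epochs between learning-rate decays
LR_DECAY_FACTOR = 5     # divide the learning rate by this factor

# decreasing learning rate
if (epoch + 1) % LR_DECAY_INTERVAL == 0:
    self.config.lr = self.config.lr / LR_DECAY_FACTOR
```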
epoch > 1
and (self.config.old_valid_loss - valid_epoch_loss) / self.config.old_valid_loss < self.config.stop_delta
This can probably be extracted to two variables for readability
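e.g. (variable names are hypothetical):

```python
past_first_epoch = epoch > 1
relative_improvement = (
    self.config.old_valid_loss - valid_epoch_loss
) / self.config.old_valid_loss
if past_first_epoch and relative_improvement < self.config.stop_delta:
    ...
```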
}
# save checkpoint only if it is better than
# the previous ones
checkpoint_filename = f"outputs/cvae_model_{season}1d_{self.config.n_memb}memb.pth"
pathlib
self.log(checkpoint_filename,
         os.path.basename(checkpoint_filename),
         kind='artifact')
parentheses and pathlib
pd.DataFrame(train_loss).to_csv(
    f"outputs/train_loss_indiv_{season}1d_{self.config.n_memb}memb.csv"
)
pd.DataFrame(valid_loss).to_csv(
    f"outputs/test_loss_indiv_{season}1d_{self.config.n_memb}memb.csv"
)
pathlib
{"loss": train_epoch_loss, | ||
"valid_loss": valid_epoch_loss} |
newlines and parentheses
echo "DEBUG: TIME: $(date)" | ||
sysN="$(uname -n | cut -f2- -d.)" | ||
sysN="${sysN%%[0-9]*}" | ||
echo "Running on system: $sysN" | ||
echo "DEBUG: EXECUTE: $EXEC" | ||
echo "DEBUG: SLURM_SUBMIT_DIR: $SLURM_SUBMIT_DIR" | ||
echo "DEBUG: SLURM_JOB_ID: $SLURM_JOB_ID" | ||
echo "DEBUG: SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST" | ||
echo "DEBUG: SLURM_NNODES: $SLURM_NNODES" | ||
echo "DEBUG: SLURM_NTASKS: $SLURM_NTASKS" | ||
echo "DEBUG: SLURM_TASKS_PER_NODE: $SLURM_TASKS_PER_NODE" | ||
echo "DEBUG: SLURM_SUBMIT_HOST: $SLURM_SUBMIT_HOST" | ||
echo "DEBUG: SLURMD_NODENAME: $SLURMD_NODENAME" | ||
echo "DEBUG: CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES" | ||
if [ "$DEBUG" = true ] ; then | ||
echo "DEBUG: NCCL_DEBUG=INFO" | ||
export NCCL_DEBUG=INFO | ||
fi | ||
echo |
There is a variable called "DEBUG", yet the debugging statements are printed regardless of its value. Seems counterintuitive to me.
COMMAND="train.py" | ||
EXEC="$COMMAND" |
Why do the same thing twice, here?
if [ "$SLURM_CPUS_PER_TASK" > 0 ] ; then | ||
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK | ||
fi | ||
export CUDA_VISIBLE_DEVICES="0,1,2,3" |
I believe this is only used for Horovod
elif [ "$DIST_MODE" == "horovod" ] ; then | ||
echo "HOROVOD training" | ||
srun --cpu-bind=none python -u $EXEC |
Do you not also have to set "ntasks" for this to properly distribute?
help='Configuration file to the pipeline to execute.'
)
args = parser.parse_args()
def read_config(file_path):
Add type hints
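e.g., for the quoted `read_config`:

```python
from typing import Any

import yaml


def read_config(file_path: str) -> dict[str, Any]:
    with open(file_path, 'r') as f:
        return yaml.safe_load(f)
```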
pipe_parser = ConfigParser(
    config=config,
)
This can be one line
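i.e., simply:

```python
pipe_parser = ConfigParser(config=config)
```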
Summary
Distributed strategy and loggers have been added.
Related issue: #160