
Addition of loggers and dist strategy #180

Open · r-sarma wants to merge 18 commits into main from cerfacs-integration

Conversation

Collaborator

@r-sarma commented Jul 2, 2024

Summary

Distributed Strategy and loggers are added

Related issue: #160

@r-sarma requested a review from matbun on July 2, 2024 at 10:36
@r-sarma linked an issue on Jul 2, 2024 that may be closed by this pull request
Collaborator

@matbun left a comment

Nice integration, just a few comments in addition to the ones left inline:

train.py is not needed and should be replaced with a command like:

itwinai exec-pipeline --config pipeline.yaml --pipe-key pipeline

There are some new changes in main; in particular, the TrainingConfiguration has been improved to provide more values by default (although it can support any new user-defined fields).

BCE = bce_loss
KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
return BCE + beta*KLD


@monitor_exec
def execute(self):
Collaborator

It would be nice to separate the boilerplate code (e.g., distributed strategy setup, logger setup) from the actual training code, if possible. Ideally, the use cases should not have to worry about the setup and should only override the train method of the TorchTrainer class. In other words, use cases get the most out of the TorchTrainer when they reuse its execute method and override only the train method.
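
For illustration, a minimal sketch of what this separation could look like (assuming the TorchTrainer interface from the itwinai docs; the config field name epochs and the loop body are illustrative):

from itwinai.torch.trainer import TorchTrainer

class XTClimTrainer(TorchTrainer):
    """Use-case trainer that inherits strategy and logger setup from TorchTrainer."""

    def train(self):
        # Only the training loop lives here; distributed strategy and logger
        # setup are handled by the parent class's execute() method.
        for epoch in range(self.config.epochs):
            ...  # forward pass, loss, backward pass, self.log(...) calls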

Collaborator Author

As we had discussed, in this case the preprocessing part first writes the files to disk and then the trainer reads the data from disk. Hence execute is not used; rather, @monitor_exec is employed.

Collaborator

I am not sure I fully understood this

Collaborator

I have similar comments as for the execute method in the XTClimPredictor

use-cases/xtclim/src/trainer.py (Outdated)
Comment on lines 75 to 79
Parameters:
bce_loss: recontruction loss
mu: the mean from the latent vector
logvar: log variance from the latent vector
beta: weight over the KL-Divergence
Collaborator

this part of the docstring is not compliant with the Google docstring format

Collaborator Author

Modified

Collaborator

It should look more like this: https://itwinai.readthedocs.io/latest/_modules/itwinai/torch/trainer.html#TorchTrainer

  • Parameters -> Args
  • indent argument descriptions
  • provide arg types in parentheses
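
For example, a Google-style version of the docstring quoted above could look like this (the function name is illustrative, since the hunk does not show it):

import torch

def beta_vae_loss(bce_loss, mu, logvar, beta):
    """Compute the total beta-VAE loss.

    Args:
        bce_loss (torch.Tensor): reconstruction loss.
        mu (torch.Tensor): mean of the latent vector.
        logvar (torch.Tensor): log variance of the latent vector.
        beta (float): weight of the KL-divergence term.

    Returns:
        torch.Tensor: total loss.
    """
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce_loss + beta * kld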

Comment on lines 49 to +51
batch_size: int,
lr: float
lr: float,
Collaborator

lr and batch_size should be embedded in config (which needs to be added to the constructor). The training config contains all the hyperparams and user-defined params.

Collaborator Author

Done

Collaborator

I mean that in the trainer constructor there should be an argument called config which accepts a dictionary object, which in turn contains a set of hyperparameters (lr, batch_size, etc.).
This way we simplify the trainer constructor, grouping all the hyperparams under config. See this: https://itwinai.readthedocs.io/latest/python-api/itwinai.torch.modules.html#module-itwinai.torch.trainer
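
For instance (a sketch; the values are illustrative and the XTClimTrainer name follows the suggestion made elsewhere in this thread):

from itwinai.torch.config import TrainingConfiguration

# Group all hyperparameters under a single config object...
config = TrainingConfiguration(batch_size=32, optim_lr=0.001)

# ...and pass it to the trainer instead of individual lr/batch_size arguments:
# trainer = XTClimTrainer(config=config, model=net)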

Collaborator

This still seems to be unresolved. Also, try as much as possible to reuse the existing field names from the training configuration: https://itwinai.readthedocs.io/latest/python-api/itwinai.torch.modules.html#itwinai.torch.config.TrainingConfiguration
For example, in this case lr would become optim_lr.

It is also possible to create a custom configuration for the use case to include use case hyperparams, as we did for Virgo:

class VirgoTrainingConfiguration(TrainingConfiguration):
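
A corresponding sketch for this use case might look like the following (field names and defaults are assumptions, mirroring the Virgo example above):

from itwinai.torch.config import TrainingConfiguration

class XTClimTrainingConfiguration(TrainingConfiguration):
    """Use-case-specific hyperparameters on top of the itwinai defaults."""
    # Illustrative fields only; actual names and defaults depend on the use case.
    season: str = "winter_"
    n_memb: int = 1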

use-cases/xtclim/src/trainer.py (Outdated)
# initialize the model
cvae_model = model.ConvVAE().to(device)
cvae_model = model.ConvVAE()
optimizer = optim.Adam(cvae_model.parameters(), lr=self.lr)
Collaborator

lr should be retrieved from self.config
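
For example, the quoted line would then become something like this (assuming the optim_lr field name from the default TrainingConfiguration):

optimizer = optim.Adam(cvae_model.parameters(), lr=self.config.optim_lr)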

Collaborator Author

Done

Collaborator

@matbun commented Jul 2, 2024

I've noticed an interesting pattern: Inside the trainer there is a loop, training a new model for each season.
I wonder how we could represent it in the pipeline... Maybe it is not possible to use the pipeline in this case.

Perhaps it would be easier to create a new XTClimTrainer and train it for each season. It could receive the season as a constructor argument. This would simplify the training code inside the trainer. Although this is not mandatory, it would make the trainer more modular, allowing the instantiation of the model, optimizer, and dataloaders to be separated into the dedicated methods create_model_loss_optimizer and create_dataloaders, respectively.
I mean something like this, in the train.py file:

# Do some pre-processing
...

# Training
for season in seasons:
  net = MyModel()
  config = TrainingConfiguration(season=season, batch_size=32, optimizer='sgd', optim_lr=0.001, ...)
  trainer = XTClimTrainer(config=config, model=net)
  trainer.execute()

  # Do some prediction
  ...

In this case, we would not use the itwinai Pipeline.

Each trainer instance would create a new logger context (thus a new logging run), and a new distributed ML context (if distributed ML is needed)...

Collaborator Author

@r-sarma commented Jul 29, 2024

> Nice integration, just a few comments in addition to the ones left inline:
>
> train.py is not needed and should be replaced with a command like:
>
> itwinai exec-pipeline --config pipeline.yaml --pipe-key pipeline
>
> There are some new changes in main; in particular, the TrainingConfiguration has been improved to provide more values by default (although it can support any new user-defined fields).

@r-sarma closed this Jul 29, 2024
@r-sarma reopened this Jul 29, 2024
Collaborator Author

@r-sarma commented Jul 29, 2024

> Nice integration, just a few comments in addition to the ones left inline:
>
> train.py is not needed and should be replaced with a command like:
>
> itwinai exec-pipeline --config pipeline.yaml --pipe-key pipeline
>
> There are some new changes in main; in particular, the TrainingConfiguration has been improved to provide more values by default (although it can support any new user-defined fields).

train.py now takes care of launching each season separately. In the current implementation, we can still keep the pipelines while running each season separately.

Collaborator Author

@r-sarma commented Jul 29, 2024

> I've noticed an interesting pattern: Inside the trainer there is a loop, training a new model for each season. I wonder how we could represent it in the pipeline... Maybe it is not possible to use the pipeline in this case.
>
> Perhaps it would be easier to create a new XTClimTrainer and train it for each season. It could receive the season as a constructor argument. This would simplify the training code inside the trainer. Although this is not mandatory, it would make the trainer more modular, allowing the instantiation of the model, optimizer, and dataloaders to be separated into the dedicated methods create_model_loss_optimizer and create_dataloaders, respectively. I mean something like this, in the train.py file:
>
> # Do some pre-processing
> ...
>
> # Training
> for season in seasons:
>   net = MyModel()
>   config = TrainingConfiguration(season=season, batch_size=32, optimizer='sgd', optim_lr=0.001, ...)
>   trainer = XTClimTrainer(config=config, model=net)
>   trainer.execute()
>
>   # Do some prediction
>   ...
>
> In this case, we would not use the itwinai Pipeline.
>
> Each trainer instance would create a new logger context (thus a new logging run), and a new distributed ML context (if distributed ML is needed)...

This has now been implemented in a slightly different manner. train.py now reads the configuration file, which contains the list of seasons. It then loops over the seasons and dynamically adjusts the season value to launch the pipeline iteratively for each season.
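
A simplified sketch of that flow, based on the snippets reviewed further down (the seasons key and the ConfigParser import path are assumptions):

import yaml
from itwinai.parser import ConfigParser  # import path assumed

with open("pipeline.yaml") as f:
    config = yaml.safe_load(f)

for season in config["seasons"]:  # assumed key holding the list of seasons
    steps = config["pipeline"]["init_args"]["steps"]
    steps["training-step"]["init_args"]["seasons"] = season
    steps["evaluation-step"]["init_args"]["seasons"] = season
    steps["evaluation-step"]["init_args"]["model_uri"] = f"outputs/cvae_model_{season}1d_1memb.pth"

    pipeline = ConfigParser(config=config).parse_pipeline()
    print(f"Running pipeline for season: {season}")
    pipeline.execute()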

Collaborator

@matbun commented Oct 16, 2024

I would suggest rebasing onto main before proceeding, given the latest changes.

@r-sarma force-pushed the cerfacs-integration branch from 2c26e1c to afa2765 on October 16, 2024 at 09:33
Collaborator Author

@r-sarma commented Oct 16, 2024

Branch has been rebased onto main.

@r-sarma linked an issue on Dec 3, 2024 that may be closed by this pull request
@matbun self-requested a review on December 4, 2024 at 12:48
Comment on lines 49 to +51
batch_size: int,
lr: float
lr: float,
Collaborator

I mean that in the trainer constructor there should be an argument called config which accepts a dictionary object, which in turn contains a set of hyperparameters (lr, batch_size, etc.).
This way we simplify the trainer constructor, grouping all the hyperparams under config. See this: https://itwinai.readthedocs.io/latest/python-api/itwinai.torch.modules.html#module-itwinai.torch.trainer

use-cases/xtclim/src/trainer.py (Outdated)
Comment on lines 75 to 79
Parameters:
bce_loss: recontruction loss
mu: the mean from the latent vector
logvar: log variance from the latent vector
beta: weight over the KL-Divergence
Collaborator

It should look more like this: https://itwinai.readthedocs.io/latest/_modules/itwinai/torch/trainer.html#TorchTrainer

  • Parameters -> Args
  • indent argument descriptions
  • provide arg types in parentheses

use-cases/xtclim/src/trainer.py (Outdated)

Comment on lines 49 to +51
batch_size: int,
lr: float
lr: float,
Collaborator

This still seems to be unresolved. Also, try as much as possible to reuse the existing field names from the training configuration: https://itwinai.readthedocs.io/latest/python-api/itwinai.torch.modules.html#itwinai.torch.config.TrainingConfiguration
For example, in this case lr would become optim_lr.

It is also possible to create a custom configuration for the use case to include use case hyperparams, as we did for Virgo:

class VirgoTrainingConfiguration(TrainingConfiguration):

BCE = bce_loss
KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
return BCE + beta*KLD


@monitor_exec
def execute(self):
Collaborator

I have similar comments as for the execute method in the XTClimPredictor

Comment on lines +17 to +20
def read_config(file_path):
with open(file_path, 'r') as f:
config = yaml.safe_load(f)
return config
Collaborator

this could be replaced by itwinai.utils.load_yaml function
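
For example (assuming load_yaml takes a file path and returns the parsed dictionary, as the name suggests):

from itwinai.utils import load_yaml

config = load_yaml("pipeline.yaml")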

Comment on lines +27 to +30
config['pipeline']['init_args']['steps']['training-step']['init_args']['seasons'] = season
model_uri = f"outputs/cvae_model_{season}1d_1memb.pth"
config['pipeline']['init_args']['steps']['evaluation-step']['init_args']['model_uri'] = model_uri
config['pipeline']['init_args']['steps']['evaluation-step']['init_args']['seasons'] = season
Collaborator

See comments in hpo.py on dynamic override of config fields

)
pipeline = pipe_parser.parse_pipeline()
print(f"Running pipeline for season: {season}")
pipeline.execute()
Collaborator

See the comment in hpo.py on breaking down the pipeline into preprocessing steps (potentially the same across all seasons) and training steps, to avoid re-doing the preprocessing. However, I am not an expert on this use case and this may not be feasible for some reason.

@matbun requested a review from jarlsondre on December 4, 2024 at 14:03
self.logger.create_logger_context()

for scenario in ['585', '245']:
Collaborator

These are magic "numbers". Make them into constants or at least some variable that tells us what they mean.
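
For instance (a sketch; the constant name is an assumption based on the SSP scenarios referenced later in this file):

# SSP emission scenarios evaluated by the predictor
SSP_SCENARIOS = ("585", "245")

for scenario in SSP_SCENARIOS:
    ...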

Comment on lines +145 to +146
proj_data = np.load(f"input/preprocessed_1d_proj{scenario}_{season}data_1memb.npy")
Collaborator

pathlib
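
For example, using pathlib (a sketch; INPUT_DIR is an illustrative constant and scenario/season come from the surrounding loop):

from pathlib import Path
import numpy as np

INPUT_DIR = Path("input")
proj_data = np.load(INPUT_DIR / f"preprocessed_1d_proj{scenario}_{season}data_1memb.npy")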

Comment on lines +148 to +149
proj_time['0'][i] ) for i in range(n_proj) ]
Collaborator

newlines (see earlier discussion)

Comment on lines +156 to +158
projset, self.device, criterion,
pixel_wise_criterion)
Collaborator

newlines (see earlier discussions) and number of parameters per line

Comment on lines +161 to +163
projset, self.device, criterion,
pixel_wise_criterion)
Collaborator

Same as above

projset, self.device, criterion,
pixel_wise_criterion)
tot_proj_losses = list(map(add, tot_proj_losses, proj_losses))
Collaborator

use of "map" (see earlier discussions)

Comment on lines +167 to +168
proj_avg_losses = proj_avg_losses/n_avg
Collaborator

Space between operators


# save the losses time series
pd.DataFrame(tot_proj_losses).to_csv(f"outputs/proj{scenario}_losses_{season}1d_allssp.csv")
Collaborator

pathlib (see earlier discussions)

# save the losses time series
pd.DataFrame(tot_proj_losses).to_csv(f"outputs/proj{scenario}_losses_{season}1d_allssp.csv")
print(f'SSP{scenario} Projection average loss:', proj_avg_losses, 'for', season[:-1])
Collaborator

Why do you use an f-string but then also use commas to separate the arguments? You should stick to one: either use "{}" for all of them, or commas for all of them.
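
For instance, the whole message could go into a single f-string:

print(f"SSP{scenario} Projection average loss: {proj_avg_losses} for {season[:-1]}")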

Comment on lines +174 to +176
'Average projection loss',
kind='metric')
Collaborator

This probably fits on one line + newline stuff discussed earlier

Returns:
- total loss (torch.Tensor)
"""
BCE = bce_loss
Collaborator

This can be removed?

Comment on lines +83 to +89
- bce_loss (torch.Tensor): recontruction loss
- mu (torch.Tensor): the mean from the latent vector
- logvar (torch.Tensor): log variance from the latent vector
- beta (torch.Tensor): weight over the KL-Divergence

Returns:
- total loss (torch.Tensor)
Collaborator

Make sure to follow the Google style guide for Python docstrings.

Comment on lines +124 to +127
train_time = pd.read_csv(f"input/dates_train_{season}data_{self.config.n_memb}memb.csv")
train_data = np.load(
f"input/preprocessed_1d_train_{season}data_{self.config.n_memb}memb.npy"
)
Collaborator

pathlib (see earlier discussion)

Comment on lines +134 to +135
trainloader = self.strategy.create_dataloader(trainset, batch_size=self.config.batch_size,
shuffle=True, pin_memory=True)
Collaborator

newlines with respect to parentheses (see earlier discussions)

Comment on lines +138 to +139
test_time = pd.read_csv(f"input/dates_test_{season}data_{self.config.n_memb}memb.csv")
test_data = np.load(f"input/preprocessed_1d_test_{season}data_{self.config.n_memb}memb.npy")
Collaborator

pathlib (see earlier discussions)

Comment on lines +145 to +146
testloader = self.strategy.create_dataloader(testset, batch_size=self.config.batch_size,
shuffle=False, pin_memory=True)
Collaborator

newlines and parentheses

Comment on lines +178 to +185
self.log(train_epoch_loss,
'epoch_train_loss',
kind='metric'
)
self.log(valid_epoch_loss,
'epoch_valid_loss',
kind='metric'
)
Collaborator

newline after opening parenthesis (might fit on one line?)
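
For example, each call likely fits on a single line:

self.log(train_epoch_loss, 'epoch_train_loss', kind='metric')
self.log(valid_epoch_loss, 'epoch_valid_loss', kind='metric')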

# save_ex(recon_images[0], epoch, season)

# decreasing learning rate
if (epoch + 1) % 20 == 0:
Collaborator

20 is a magic number. Put it in a constant or some variable


# decreasing learning rate
if (epoch + 1) % 20 == 0:
self.config.lr = self.config.lr / 5
Collaborator

5 is a magic number. Put it in a constant or some variable
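
For example (constant names are illustrative):

LR_DECAY_EVERY_N_EPOCHS = 20
LR_DECAY_FACTOR = 5

# decrease the learning rate every LR_DECAY_EVERY_N_EPOCHS epochs
if (epoch + 1) % LR_DECAY_EVERY_N_EPOCHS == 0:
    self.config.lr = self.config.lr / LR_DECAY_FACTOR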

Comment on lines +208 to +209
epoch > 1
and (self.config.old_valid_loss - valid_epoch_loss) / self.config.old_valid_loss < self.config.stop_delta
Collaborator

This can probably be extracted to two variables for readability
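
Something like this, for instance (variable names are illustrative):

relative_improvement = (self.config.old_valid_loss - valid_epoch_loss) / self.config.old_valid_loss
early_stop = epoch > 1 and relative_improvement < self.config.stop_delta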

}
# save checkpoint only if it is better than
# the previous ones
checkpoint_filename = f"outputs/cvae_model_{season}1d_{self.config.n_memb}memb.pth"
Collaborator

pathlib

Comment on lines +242 to +244
self.log(checkpoint_filename,
os.path.basename(checkpoint_filename),
kind='artifact')
Collaborator

parentheses and pathlib

Comment on lines +257 to +262
pd.DataFrame(train_loss).to_csv(
f"outputs/train_loss_indiv_{season}1d_{self.config.n_memb}memb.csv"
)
pd.DataFrame(valid_loss).to_csv(
f"outputs/test_loss_indiv_{season}1d_{self.config.n_memb}memb.csv"
)
Collaborator

pathlib

Comment on lines +251 to +252
{"loss": train_epoch_loss,
"valid_loss": valid_epoch_loss}
Collaborator

newlines and parentheses

Comment on lines +25 to +43
echo "DEBUG: TIME: $(date)"
sysN="$(uname -n | cut -f2- -d.)"
sysN="${sysN%%[0-9]*}"
echo "Running on system: $sysN"
echo "DEBUG: EXECUTE: $EXEC"
echo "DEBUG: SLURM_SUBMIT_DIR: $SLURM_SUBMIT_DIR"
echo "DEBUG: SLURM_JOB_ID: $SLURM_JOB_ID"
echo "DEBUG: SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST"
echo "DEBUG: SLURM_NNODES: $SLURM_NNODES"
echo "DEBUG: SLURM_NTASKS: $SLURM_NTASKS"
echo "DEBUG: SLURM_TASKS_PER_NODE: $SLURM_TASKS_PER_NODE"
echo "DEBUG: SLURM_SUBMIT_HOST: $SLURM_SUBMIT_HOST"
echo "DEBUG: SLURMD_NODENAME: $SLURMD_NODENAME"
echo "DEBUG: CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
if [ "$DEBUG" = true ] ; then
echo "DEBUG: NCCL_DEBUG=INFO"
export NCCL_DEBUG=INFO
fi
echo
Collaborator

There is a variable called "DEBUG", yet the debugging statements are printed regardless of its value. Seems counterintuitive to me.

Comment on lines +17 to +18
COMMAND="train.py"
EXEC="$COMMAND"
Collaborator

Why do the same thing twice, here?

if [ "$SLURM_CPUS_PER_TASK" > 0 ] ; then
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
fi
export CUDA_VISIBLE_DEVICES="0,1,2,3"
Collaborator

I believe this is only used for Horovod

Comment on lines +91 to +93
elif [ "$DIST_MODE" == "horovod" ] ; then
echo "HOROVOD training"
srun --cpu-bind=none python -u $EXEC
Collaborator

Do you not also have to set "ntasks" for this to properly distribute?

help='Configuration file to the pipeline to execute.'
)
args = parser.parse_args()
def read_config(file_path):
Collaborator

Add type hints
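
For instance, the read_config helper quoted above could be annotated like this (assuming it always returns a dict):

import yaml

def read_config(file_path: str) -> dict:
    with open(file_path, 'r') as f:
        config = yaml.safe_load(f)
    return config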

Comment on lines +31 to +33
pipe_parser = ConfigParser(
config=config,
)
Collaborator

This can be one line

Development

Successfully merging this pull request may close these issues.

  • Distributed Torch Predictor
  • Distributed ML for CERFACS
3 participants