`model.fit` deadlocks when training on SageMaker with PipeModeDataset. CPUUtilization, MemoryUtilization, and DiskUtilization all drop to 0 on the training instance. The model works fine when you replace PipeModeDataset with tf.data.TFRecordDataset. The for loop proves that the dataset batch has been downloaded.
@murphycrosby Have you tried swapping out the TCN for another model to see if the issue persists? I'd be curious to see whether it is actually related to the TCN training in particular or is an oddity of PipeModeDataset. I would also like to see how you are setting up model training in SageMaker and which instance type, options, etc. you are using.
It sounds somewhat similar to this issue here, which is external to the TCN portion of the code. I don't know that it matters much for a normal TFRecordDataset, but the order of operations might matter for PipeModeDataset. You could try rearranging the parse/prefetch/batch ops to match the AWS example here.
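To make the reordering suggestion concrete, here is a minimal sketch that fixes the op order in one place. The order used (prefetch, then parse via `map`, then batch) is an assumption on my part, so check it against the linked AWS example before relying on it. `build_input_pipeline`, `parse_fn`, and the default sizes are hypothetical names and values, not anything from the original code.

```python
def build_input_pipeline(dataset, parse_fn, batch_size=64,
                         prefetch_records=10, parallel_calls=4):
    """Chain dataset ops in a fixed order: prefetch -> map(parse) -> batch.

    `dataset` can be any object exposing the tf.data.Dataset methods used
    here, e.g. a PipeModeDataset or a tf.data.TFRecordDataset.
    The order is an assumption; adjust it to match the AWS example.
    """
    # Buffer raw records first so reading overlaps with parsing.
    dataset = dataset.prefetch(prefetch_records)
    # Parse each serialized record into tensors.
    dataset = dataset.map(parse_fn, num_parallel_calls=parallel_calls)
    # Batch last, after every record has been parsed.
    dataset = dataset.batch(batch_size, drop_remainder=True)
    return dataset


# Hypothetical usage inside the training script (channel name assumed):
#   from sagemaker_tensorflow import PipeModeDataset
#   ds = PipeModeDataset(channel="training", record_format="TFRecord")
#   ds = build_input_pipeline(ds, parse_fn=my_parse_fn)
#   model.fit(ds, ...)
```

Keeping the chain in one function makes it easy to run the exact same pipeline on PipeModeDataset and on tf.data.TFRecordDataset, which should help tell an ordering problem apart from a model problem.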
Lastly, you could swap to the new FastFile Mode with something like this example here.
Example TCN Model
tensorflow==2.3.1
Output