[DOCS] Troubleshooting Steps for MPI Operator #1756
Merged
Related issue: flyteorg/flyte#4464
Reference PR: flyteorg/flyte#5857
This PR adds troubleshooting steps for the MPI operator, as suggested in the review feedback on my previous PR in the flyteorg/flyte repository (the reference PR). Following that feedback, I moved the common MPI issues and their troubleshooting steps into the MPI guide itself.
This is a documentation update based on the issue in flyteorg/flyte and a Flyte Slack conversation about MPI operator installation issues.
Why are the changes needed?
These changes provide important troubleshooting steps for users setting up the MPI operator for distributed training jobs on Flyte. The steps cover common issues like insufficient resource allocation for worker pods and clarify the correct workflow registration method, especially when using custom images. This will help users avoid common pitfalls and streamline the MPI operator setup process.
What changes were proposed in this pull request?
This pull request adds troubleshooting steps to the MPI operator installation documentation. Specifically, the following issues are addressed:
• Insufficient Resources for Worker Pods: Steps to ensure sufficient CPU and memory requests are set.
• Workflow Registration Method Errors: Clarification on using the correct registration method when working with custom images, with a link to the relevant Flyte documentation.
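As an illustration of the first item, the sketch below shows one way to sanity-check worker resource requests before launching a job. It is a minimal, standalone example, not part of flytekit or the MPI operator: the function names, the `requests` dictionary shape, and the minimum values are all hypothetical. Only the quantity formats (`500m` for CPU, `2Gi` for memory) follow the usual Kubernetes conventions.

```python
# Hypothetical pre-flight check; these helpers are illustrative and are
# not provided by flytekit or the MPI operator.

# Common Kubernetes memory suffixes (binary and decimal) mapped to bytes.
_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity ("500m", "2") to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a Kubernetes memory quantity ("2Gi", "500M") to bytes."""
    # Check two-letter suffixes before one-letter ones.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _SUFFIXES[suffix])
    return int(quantity)


def missing_or_low_requests(requests: dict, min_cpu: str, min_mem: str) -> list:
    """Return a list of problems; an empty list means the requests look sufficient."""
    problems = []
    cpu = requests.get("cpu")
    mem = requests.get("mem")
    if cpu is None or parse_cpu(cpu) < parse_cpu(min_cpu):
        problems.append(f"cpu request {cpu!r} is below {min_cpu}")
    if mem is None or parse_memory(mem) < parse_memory(min_mem):
        problems.append(f"mem request {mem!r} is below {min_mem}")
    return problems


# Example with assumed minimums of 1 core and 2Gi per worker.
for problem in missing_or_low_requests({"cpu": "500m", "mem": "1Gi"}, "1", "2Gi"):
    print(problem)
```

The actual minimums depend on the cluster and the training workload, so the thresholds here are placeholders to be replaced with values that fit the deployment.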
Check all the applicable boxes
I updated the documentation accordingly.
All new and existing tests passed.
All commits are signed-off.