[DOCS] Troubleshooting Steps for MPI Operator #1756

Merged
merged 1 commit into from
Oct 21, 2024

10sharmashivam
Contributor

References issue: flyteorg/flyte#4464
Reference PR: flyteorg/flyte#5857

This PR adds troubleshooting steps for the MPI operator, as suggested in the review feedback on [Reference PR]. I moved the common MPI issues and their troubleshooting steps into the MPI guide itself.

This is a documentation update based on [Issue in flyteorg/flyte] and [Flyte Slack's conversation] about MPI operator installation issues.

The suggestion to move the troubleshooting content came from the review of my previous PR in the flyteorg/flyte repository, linked here: [PR MPI Operator].

Why are the changes needed?
These changes provide important troubleshooting steps for users setting up the MPI operator for distributed training jobs on Flyte. The steps cover common issues like insufficient resource allocation for worker pods and clarify the correct workflow registration method, especially when using custom images. This will help users avoid common pitfalls and streamline the MPI operator setup process.

What changes were proposed in this pull request?
This pull request adds troubleshooting steps to the MPI operator installation documentation. Specifically, the following issues are addressed:

• Insufficient Resources for Worker Pods: Steps to ensure sufficient CPU and memory requests are set.
• Workflow Registration Method Errors: Clarification on using the correct registration method when working with custom images, with a link to the relevant Flyte documentation.

Check all the applicable boxes
I updated the documentation accordingly.
All new and existing tests passed.
All commits are signed-off.

Signed-off-by: 10sharmashivam <[email protected]>
@10sharmashivam
Contributor Author

@davidmirror-ops, I’ve submitted a new PR that addresses the suggestions from the review of the previous PR. Waiting for your thoughts on it!

@davidmirror-ops enabled auto-merge (squash) October 21, 2024 18:30
Contributor

@davidmirror-ops left a comment

Thank you!

@davidmirror-ops merged commit c602b8e into flyteorg:master Oct 21, 2024
6 of 7 checks passed
@10sharmashivam deleted the doc_mpi branch October 22, 2024 01:26
2 participants