DOC MPI Operator troubleshooting steps added #5857
Signed-off-by: 10sharmashivam <[email protected]>
5150c99 to 84de4b1
@@ -1035,3 +1035,39 @@ Wait for the upgrade to complete. You can check the status of the deployment pod
kubectl get pods -n flyte

Once all the components are up and running, go to the `examples section <https://docs.flyte.org/en/latest/flytesnacks/integrations.html#native-backend-plugins>`__ to learn more about how to use Flyte backend plugins.

Troubleshooting MPI Operator Installation
I'd prefer this to be a heading under a new section called "Common issues" or something along those lines.
Then use the same structure as the section above, with one tab for each plugin. For now it would only be "MPI", but that would leave the structure in place for others to add troubleshooting steps for other plugins as well. Let me know if you have any questions.
Thank you for the feedback! I’ve updated the documentation by moving the troubleshooting steps under a new section titled "Troubleshooting Plugin Deployments" and placing the MPI steps under a dedicated group tab, so the same structure can later be used for each plugin. Please let me know if any further adjustments are needed.
Super sorry about this, but looking closer, I think putting this section in the MPI guide itself would make it easier to discover and use, especially because the issues in the Slack thread are not about deploying the plugin but about using it. Let me know if that works for you and how I can help.
Thank you for your feedback. I’ve made the suggested changes and submitted a new PR that includes the troubleshooting steps for the MPI operator, which have been shifted to the MPI Guide itself. You can find the new PR here: [Link to my new PR].
I appreciate your guidance, and I’m looking forward to your thoughts on the updates!
One question: the Slack conversation also mentions Horovod installation issues. Should I address those as well, along with the two issues I have already addressed? If so, where should I add them: in the MPI guide or in "Distributed training using Horovod"? Or should readers be referred to Horovod's own documentation instead?
Signed-off-by: 10sharmashivam <[email protected]>
a201025 to ecf4594
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5857      +/-   ##
==========================================
+ Coverage   36.49%   36.72%   +0.22%
==========================================
  Files        1296     1304       +8
  Lines      109571   130081   +20510
==========================================
+ Hits        39988    47768    +7780
- Misses      65426    78143   +12717
- Partials     4157     4170      +13

Flags with carried forward coverage won't be shown.
Tracking issue
References #4464
Why are the changes needed?
These changes provide important troubleshooting steps for users setting up the MPI operator for distributed training jobs on Flyte. The steps cover common issues like insufficient resource allocation for worker pods and clarify the correct workflow registration method, especially when using custom images. This will help users avoid common pitfalls and streamline the MPI operator setup process.
What changes were proposed in this pull request?
This pull request adds troubleshooting steps to the MPI operator installation documentation. Specifically, the following issues are addressed:
How was this patch tested?
This is a documentation-only update based on a Flyte Slack conversation about MPI operator installation issues.