
DOC MPI Operator troubleshooting steps added #5857

Closed
wants to merge 2 commits

Conversation

10sharmashivam
Contributor

Tracking issue

References #4464

Why are the changes needed?

These changes provide important troubleshooting steps for users setting up the MPI operator for distributed training jobs on Flyte. The steps cover common issues like insufficient resource allocation for worker pods and clarify the correct workflow registration method, especially when using custom images. This will help users avoid common pitfalls and streamline the MPI operator setup process.

What changes were proposed in this pull request?

This pull request adds troubleshooting steps to the MPI operator installation documentation. Specifically, the following issues are addressed:

  • Insufficient resources for worker pods: steps to ensure that sufficient CPU and memory requests are set (see the sketch after this list).
  • Workflow registration method errors: clarification on using the correct registration method when working with custom images, with a link to the relevant Flyte documentation.
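As an illustration of the resource issue, here is a minimal sketch of setting explicit resource requests on a Flyte MPI task. The MPIJob import path and constructor parameters are assumptions based on the flytekitplugins-kfmpi plugin and may differ across plugin versions; the task name and values are placeholders.

from flytekit import Resources, task
from flytekitplugins.kfmpi import MPIJob  # assumption: flytekitplugins-kfmpi is installed

# Give each worker pod explicit CPU/memory requests so the MPIJob's
# workers can be scheduled instead of sitting in Pending.
@task(
    task_config=MPIJob(num_workers=2, num_launcher_replicas=1, slots=1),
    requests=Resources(cpu="1", mem="2Gi"),
    limits=Resources(cpu="2", mem="4Gi"),
)
def train() -> None:
    ...

For the registration issue, one common approach is to pass the custom image explicitly at registration time, e.g. `pyflyte register --image <your-image>`, instead of relying on the default image; the Flyte documentation linked from the new docs covers the details.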

How was this patch tested?

This is a documentation-only update, based on a conversation in the Flyte Slack about MPI operator installation issues.

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.


welcome bot commented Oct 18, 2024

Thank you for opening this pull request! 🙌

These tips will help get your PR across the finish line:

  • Most of the repos have a PR template; if so, fill it out to the best of your knowledge.
  • Sign off your commits (Reference: DCO Guide).

@@ -1035,3 +1035,39 @@ Wait for the upgrade to complete. You can check the status of the deployment pod
kubectl get pods -n flyte

Once all the components are up and running, go to the `examples section <https://docs.flyte.org/en/latest/flytesnacks/integrations.html#native-backend-plugins>`__ to learn more about how to use Flyte backend plugins.

Troubleshooting MPI Operator Installation
Contributor


I'd prefer this to be a heading under a new section called "Common issues" or something along those lines.
Then use the same structure the section above follows, with one tab for each plugin. For now it would only be "MPI", but that would leave the structure in place for others to add troubleshooting steps for other plugins as well. Let me know if there are questions.

Contributor Author


Thank you for the feedback! I've updated the documentation by moving the troubleshooting steps under a new section titled "Troubleshooting Plugin Deployments" and organized the MPI steps under a dedicated group tab, a structure that other plugins can reuse later. Please let me know if any further adjustments are needed.

Contributor


Super sorry about this, but looking closer, I think putting this section in the MPI guide itself would make it easier to discover and use, especially because the issues in the Slack thread are not about deploying the plugin itself but about using it. Let me know if that works for you and how I can help.

Contributor Author

10sharmashivam commented Oct 19, 2024


Thank you for your feedback. I've made the suggested changes and submitted a new PR that moves the troubleshooting steps for the MPI operator into the MPI guide itself. You can find the new PR here: [Link to my new PR].

I appreciate your guidance, and I’m looking forward to your thoughts on the updates!

One more question: the Slack conversation also mentions Horovod installation issues. Should I address those as well, along with the two issues I've already covered? If so, where should they go: the MPI guide, the "Distributed training using Horovod" page, or should readers simply be referred to Horovod's own documentation?

Signed-off-by: 10sharmashivam <[email protected]>

codecov bot commented Oct 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 36.72%. Comparing base (6c4f8db) to head (ecf4594).
Report is 36 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5857      +/-   ##
==========================================
+ Coverage   36.49%   36.72%   +0.22%     
==========================================
  Files        1296     1304       +8     
  Lines      109571   130081   +20510     
==========================================
+ Hits        39988    47768    +7780     
- Misses      65426    78143   +12717     
- Partials     4157     4170      +13     
Flag                      Coverage Δ
unittests-datacatalog     51.58% <ø> (+0.21%) ⬆️
unittests-flyteadmin      54.41% <ø> (-1.16%) ⬇️
unittests-flytecopilot    11.73% <ø> (?)
unittests-flytectl        62.45% <ø> (+0.19%) ⬆️
unittests-flyteidl         6.89% <ø> (-0.26%) ⬇️
unittests-flyteplugins    53.62% <ø> (+0.15%) ⬆️
unittests-flytepropeller  42.84% <ø> (+0.80%) ⬆️
unittests-flytestdlib     54.78% <ø> (-0.57%) ⬇️

Flags with carried forward coverage won't be shown.

