-
Notifications
You must be signed in to change notification settings - Fork 486
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Refactor to Introduce Backend Abstraction #2011
base: main
Are you sure you want to change the base?
[WIP] Refactor to Introduce Backend Abstraction #2011
Conversation
…parallelization_strategy
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice first draft.
I was wondering: do you think we should keep the 2 methods for row and column separated? Just opening the question, I do not have a strong opinion on that.
Mark tie information right before we run passes because dynamo tracing will alter the parameter name while our | ||
passes don't. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mark tie information right before we run passes because dynamo tracing will alter the parameter name while our | |
passes don't. | |
Mark information about tied parameters right before running passes because dynamo tracing alters the names of the parameters while our passes do not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure naming this pre_process
is the most adapted considered it just marks for tied weights?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think generally it means something that needs done before we run passes, and weights tying info marking happens to be one of them
) -> nn.Module: | ||
raise NotImplementedError | ||
|
||
def pre_process(self, graph_module: GraphModule, ctx: "ParallelExecutionCtx", config: "Config") -> GraphModule: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's a config
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a Config
is a data class which records static configurations during the whole process
raise ValueError( | ||
"`sequence_parallel` can not be activated when `tp_mode` is not set to `REDUCE_SCATTER` in nanotron backend" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it best to fail like that or change the setting in Nanotron ourselves?
@zhenglongjiepheonix what's the status? |
the nanotron backend is not tested, for the default backend, everything works fine, it contains the newest update and has addressed comments |
Ok so ready for final review? |
Basically It's for reference, if someone is working on the support for nanotron indeed, then the correctness of nanotronbackend needs to be verified and additional tests are needed, but in my opinion this PR marks the boundary of optimum and nanotron, the rest of work should be implemented inside nanotron and optimum just expose |
This PR has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs. |
What does this PR do?
The
NanotronBackend
is still WIP and untested, but it would be nice to get some feedbacks first.