GPU-flavor-naming-refinement #546
Conversation
Adds a flavor naming version 4 document.
After some research (see here and below), I came to the following conclusions.
Derivation: current standard:
Problem 1 - transparency: What GPU exactly is a GNa-14h?
Solution of Problem 1
But there is more...
Problem 2 - inconsistency
for Nvidia examples, see here:
for AMD examples, see here:
Different Performance Benchmarks (for more details see here and here):
Conclusion to inconsistency: we can see that the H100 has significantly fewer SMs than the MI250(x) has CUs, yet it outperforms the AMD counterparts several times over. Therefore it is not really consistent to assume a somewhat linear or otherwise intuitive relation between SMs, CUs, and performance.
Problem 3 - other factors
Core Counts and Types:
Specialized Units:
Memory Bandwidth and Cache:
Software and Optimization:
Problem 4 - high performance indicator
The problem with this is that "high performance" should indicate just what it says, but the H100 and A100 are a lot faster than the V100 or P100. The same applies to the MI100 vs. the MI50 & MI60. That can lead to confusion about what "high performance" really means. The lower-end GPUs mentioned are not really comparable to the newer ones if a single "h" indicates high performance for all of them. But where do we draw the line? Also, what happens when new generations are released whose performance is a multiple of the older generation's? Imo this approach is inconsistent as well, and at the very least it can be confusing for the user and/or the ones responsible for billing those flavors.
Proposals: get rid of SMs, CUs etc. and include the GPU model in the flavor name
High performance indicators
Virtualized GPUs
vRAM:
For the record, since I presented the current state in today's (10.07.24) IaaS call:
That means we would get something like: SCS-16V-64-500s_GNa-A100-40g for pass-through GPUs or SCS-16V-64-500s_3gNa-A100-40g for virtualized GPUs.
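Purely as an illustration of the shape these proposed names would take, here is a sketch that pulls the GPU extension apart. The field layout (optional fraction digit, G/g mode, vendor and generation letters, model, vRAM in GiB) is my reading of the two examples above, not a settled grammar:

```python
import re

# Hypothetical decomposition of the proposed v4 GPU extension; the
# structure is inferred only from "GNa-A100-40g" and "3gNa-A100-40g".
PROPOSED = re.compile(
    r"^(?P<fraction>\d+)?"       # optional fraction of a virtualized GPU
    r"(?P<mode>[Gg])"            # G = pass-through, g = virtualized
    r"(?P<vendor>[A-Z])"         # vendor letter, e.g. N for nVidia
    r"(?P<gen>[a-z])"            # generation letter, e.g. a for Ampere
    r"-(?P<model>[A-Za-z0-9]+)"  # GPU model, e.g. A100
    r"-(?P<vram>\d+)g$"          # vRAM in GiB
)

for ext in ("GNa-A100-40g", "3gNa-A100-40g"):
    m = PROPOSED.match(ext)
    print(ext, "->", m.group("mode"), m.group("model"), m.group("vram"))
```

Whether the fraction belongs in front of the `g` or elsewhere is exactly the kind of detail this refinement would have to pin down.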
I know this is just a draft, but for the sake of completeness, let me say it nonetheless:
- the new version would start out with `status: Proposal` and without the fields `stabilized_at` and `replaces` (I think; you can verify using scs-0001-v1)
- naturally, the introduction needs to be updated as well
It would be good to see a tentative list of GPU models. Speaking of a decision record: frankly, your large comment above (phrased a bit more soberly) should probably be turned into one, in order to document the development process for posterity, but maybe also to structure the process even better.
- number (M) of processing units that are exposed (for pass-through) or assigned; see table below for vendor-specific terminology
- high-performance indicator (`h`)

Note that the vendor letter X is mandatory; generation and processing units are optional.
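As a non-normative aid, one possible reading of this format can be sketched as a regular expression. The letter sets are my assumption, and this reading treats `G`/`g` as mandatory while keeping generation and unit count optional, as stated above:

```python
import re

# Sketch (my assumption, not normative) of the v3 GPU extension grammar:
# G/g mode, mandatory vendor letter X, optional generation letter,
# optional "-M" processing-unit count, optional "h" high-perf indicator.
V3_GPU = re.compile(
    r"^(?P<mode>[Gg])(?P<vendor>[A-Z])(?P<gen>[a-z])?"
    r"(?:-(?P<units>\d+))?(?P<hperf>h)?$"
)

m = V3_GPU.match("GNa-14h")  # the "GNa-14h" example discussed in this PR
print(m.group("vendor"), m.group("gen"), m.group("units"), m.group("hperf"))
```

Under this reading, `GN` (vendor only) would be valid, while a bare `N` without the `G`/`g` marker would not.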
Just for my understanding: judging from the format above, the `G/g` marker and the high-perf indicator `h` are also optional, right? So `SCS-16V-64-500s_N` would be a valid minimal example, just stating the existence of an nVidia GPU, correct?
I think even that section is so far unchanged compared to v3. As Patrick stated:

> Note: The initial commit just added the flavor naming document in version 4.

(Which admits multiple readings, but my interpretation was that he merely copied the file with an increased version number.)
I am aware. As stated, this question is just for my understanding. The format syntax of the original standard suggests that the `G/g` might be optional, but I can't find any examples or statements confirming this. I wanted to make sure that, just in case this is unintended, we could address this potential inaccuracy when @cah-patrickthiem edits this section anyway.
Then I think the usage of square brackets in "[`G/g`]" of the format specification is wrong and the brackets should be removed. We could adjust this in the course of this refinement PR while we're at it, since the syntax and explanations will most likely need to be updated anyway.
I don't think this format specification can be "wrong" in that sense, because it's not that precise. We would have to use some more precise syntax (such as EBNF or something). We might do that, but then we should change the standard throughout to use this syntax.
> Just for my understanding: judging from the format above, the `G/g` marker and the high-perf indicator `h` are also optional, right? So `SCS-16V-64-500s_N` would be a valid minimal example, just stating the existence of an nVidia GPU, correct?
So we break the compatibility with the old naming and make life harder for the parser?
> Then I think the usage of square brackets in "[`G/g`]" of the format specification is wrong and the brackets should be removed.
True. Either `G` or `g` is needed to indicate a GPU, and it indicates PT or Virt.
Not a mistake: nVidia allows one physical GPU to be partitioned and then be exposed directly via pass-through as several PCIe devices for direct access by several VMs.
Within one vendor and one generation of GPUs (e.g. nVidia Ampere), this number (SMs in case of nVidia) does have a well-defined meaning and allows you to see how much GPU compute performance you get.
That was exactly the way it was meant to be used. Within one vendor and one generation, there may be variants with a specially high frequency or specially high-bandwidth memory (HBM). This indicator was meant to allow a provider that uses both variants to indicate this in the flavor name. That said, I agree that this is not specific enough; maybe we should just say that vendors can add these modifiers in case they have several variants of GPUs from the same vendor and same generation with significantly different performance, and in general discourage its use otherwise.
General comments:
As a user, I know what I need:
If I look at the original spec, we have a few shortcomings and a few things done well.
Here would be my suggestion:
Users and vendors may want marketing names. We could always encode them in extra specs, such as `scs:gpu-vendor="nVidia"`, `scs:gpu-generation="AdaLovelace"`, `scs:gpu-model="L40"`, `scs:gpu-fraction="1/4"`, `scs:gpu-vram="20"`. Just my 0.02€.
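As a small illustration (assuming OpenStack-style extra specs as plain key/value strings; the keys are exactly those proposed above and are hypothetical, not standardized):

```python
# The extra-specs idea from the comment above, written out as a dict;
# keys and values are those proposed there, not part of any merged standard.
extra_specs = {
    "scs:gpu-vendor": "nVidia",
    "scs:gpu-generation": "AdaLovelace",
    "scs:gpu-model": "L40",
    "scs:gpu-fraction": "1/4",
    "scs:gpu-vram": "20",
}

# A tool could then select flavors by property instead of parsing names:
wanted = {"scs:gpu-model": "L40"}
matches = all(extra_specs.get(k) == v for k, v in wanted.items())
print(matches)  # True
```

The appeal of this route is that the flavor name can stay terse while the marketing-relevant details remain machine-queryable.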
I took some time thinking about your comment. I am not strictly against the SM/CU/EU thing, but let me say that we need to really think about the specifics of different GPUs if we go that way. The connection between the performance of a GPU and its processing-unit count is NOT linear. One can assume or build a rough estimate of the performance, yes, but twice the unit count does not make it twice as fast. Other aspects, such as memory (bandwidth), clock, drivers, or even the core type (CUDA, ray tracing etc.) can massively impact the computing power of a GPU.
Example:
FP16 Tensor Core Performance:
INT8 Tensor Core Performance:
FP64 CUDA Core Performance:
Streaming Multiprocessors (SMs):
--> neither card has ray tracing cores
Another example:
FP16 Tensor Core Performance:
INT8 Tensor Core Performance:
FP64 CUDA Core Performance:
Streaming Multiprocessors (SMs):
Ray Tracing and 3D Rendering:
Last example: FP16 Tensor Core Performance:
INT8 Tensor Core Performance:
FP64 CUDA Core Performance:
SM Count:
Ray Tracing and 3D Rendering:
As you can see, the key thing here is the non-linearity, plus some GPUs have capabilities which others do not have, e.g. A40 vs. A100 & A10. These things make it hard for me to be purely positive about the SM/CU/EU topic. Imo we need to come to a conclusion here soon; maybe I forgot something to take into consideration, but from my side right now, I find it rather difficult to work with these numbers. Please tell me what you think about these new statements from my side.
Hi Patrick, thanks for the very detailed analysis. No, performance is not linear, as it depends on the bottleneck. I probably triggered another wrong assumption when saying that you got 4x as much compute power.

Nevertheless, I stand by the design idea that we should tell a user how much resources she gets. In one case it's 14 SMs (896 CUDA cores for Ampere) for a quarter A30 and in the other it's 56 SMs (3584 CUDA cores) for a full one. It's like vCPUs in instances -- 2 vCPUs do not necessarily make your workload twice as fast, but you know roughly how much CPU power you get ...
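The SM-to-core arithmetic quoted here can be checked directly; the 64-cores-per-SM ratio for Ampere follows from the figures in this comment (14 SMs = 896 cores):

```python
# Arithmetic check of the figures above: on Ampere, each SM carries
# 64 FP32 CUDA cores, so an SM count in a flavor name maps directly
# to a core count.
CORES_PER_SM = 64  # nVidia Ampere

quarter_a30 = 14 * CORES_PER_SM  # quarter A30 slice
full_a30 = 56 * CORES_PER_SM     # full A30
print(quarter_a30, full_a30, full_a30 // quarter_a30)  # 896 3584 4
```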
Another note:
That means you would still be in favor of the SM/CU etc. way, right?
I am closing this PR because we finally came to a common conclusion; see PR #780, which by now has already been merged.
This PR handles the refinement of GPU flavor naming. It clarifies things and overhauls some possible inconsistencies in the current naming convention as well as the description. Therefore, this PR introduces an update to the document: scs-0100-v3-flavor-naming.md.
For reference, see issue 366 "GPU naming convention needs further refinements".
Note: The initial commit just added the flavor naming document in version 4.