
GPU-flavor-naming-refinement #546

Closed · wants to merge 2 commits

Conversation


@cah-patrickthiem cah-patrickthiem commented Apr 3, 2024

This PR handles the refinement of GPU flavor naming. It clarifies things and overhauls some possible inconsistencies in the current naming convention as well as in its description. To that end, this PR updates the document scs-0100-v3-flavor-naming.md.
For reference, see issue #366 (GPU naming convention needs further refinements).

Note: The initial commit just added the flavor naming document in version 4.

@cah-patrickthiem cah-patrickthiem added standards Issues / ADR / pull requests relevant for standardization & certification SCS-VP10 Related to tender lot SCS-VP10 labels Apr 3, 2024
@cah-patrickthiem cah-patrickthiem self-assigned this Apr 3, 2024

cah-patrickthiem commented May 21, 2024

After some research (see here and below), I came to the following conclusions:
The GPU flavor naming should be changed so that it is clearer what to expect, and, most importantly, we unfortunately cannot use a general performance indicator as introduced in the prior flavor naming standard. The current standard has four major problems that I want to tackle here.

Derivation:

current standard:

  • right now the standard suggests something like this: SCS-16V-64-500s_GNa-14h
  • let's translate it: this flavor indicates 16 vCPUs, 64 GiB RAM, a 500 GB SSD, and a passthrough Nvidia GPU (N) of the "Ampere" (a) generation with 14 "streaming multiprocessors" (SMs) plus an "h" for "high performance" (a decoding sketch follows below)
  • so far so good
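For illustration, here is a minimal sketch of how this v3-style GPU suffix could be decoded. The regex and the lookup tables are my own reading of the examples above, not the normative syntax from the standard:

```python
import re

# Illustrative decoder for a v3-style GPU suffix such as "_GNa-14h".
# Vendor and generation tables are assumptions for this sketch.
VENDORS = {"N": "Nvidia", "A": "AMD", "I": "Intel"}
NVIDIA_GENERATIONS = {"a": "Ampere", "l": "AdaLovelace"}

GPU_RE = re.compile(
    r"_(?P<pt>[Gg])(?P<vendor>[NAI])(?P<gen>[a-z])?"
    r"(?:-(?P<units>\d+)(?P<high>h*))?"
)

def decode_gpu_suffix(flavor_name: str) -> dict:
    m = GPU_RE.search(flavor_name)
    if not m:
        raise ValueError(f"no GPU part found in {flavor_name!r}")
    return {
        "passthrough": m.group("pt") == "G",   # G = pass-through, g = vGPU
        "vendor": VENDORS[m.group("vendor")],
        "generation": NVIDIA_GENERATIONS.get(m.group("gen") or ""),
        "units": int(m.group("units")) if m.group("units") else None,  # SM/CU count
        "high_perf": len(m.group("high") or ""),  # number of trailing "h"
    }

print(decode_gpu_suffix("SCS-16V-64-500s_GNa-14h"))
# {'passthrough': True, 'vendor': 'Nvidia', 'generation': 'Ampere',
#  'units': 14, 'high_perf': 1}
```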

Problem 1 - transparency

What GPU exactly is a GNa-14h?

  • according to the corresponding comment in the standard, this flavor should imply 1/4 of an Nvidia A30 GPU with an SM number of "14" (besides that, the capital "G" implies that it is a passthrough GPU rather than a virtual GPU, so the "1/4" indicator does not really make sense; just a mistake in the standard)
  • the user would simply not know that it is an A30 GPU; even if we compute 14*4=56 to get the real number of SMs of this GPU, the user would still not know what they get

Solution of Problem 1

  • make a GPU list for SCS with all corresponding specs and maybe even extend it to vGPUs, where fractions of the real SM numbers are given
  • "There we go, fixed!" - you might think

but there is more...

Problem 2 - inconsistency

  • a "streaming multiprocessor" for Nvidia and a "computing unit" for AMD CANNOT be compared, at least not in a clear and logical way...
    Why?

  • there are discrepancies between the performance of a GPU and the number of SMs or CUs or whatever

  • this applies for GPUs from the same vendor as well as between different vendors

for Nvidia examples, see here:

  • the Nvidia A100 has 108 SMs
  • the Nvidia H100 has 114 SMs
  • the Nvidia L40 has 142 SMs
  • ranked by performance: H100 > A100 > L40
  • Note: the H100 is up to 4x faster than the A100 even though it has just 6 more SMs; the L40 has more SMs than the A100 but is significantly slower

for AMD examples, see here:

  • AMD MI100 has 120 CUs
  • AMD MI250 has 208 CUs
  • AMD MI250x has 220 CUs

Different Performance Benchmarks (for more details see here and here):

  • FP64 Performance: H100 > MI250(x) > A100 > MI100
  • Memory Bandwidth: H100 > MI250(x) > A100 > MI100
  • Tensor Performance (FP16): H100 > MI250(x) > A100 > MI100

Conclusion on inconsistency: we can see that the H100 has significantly fewer SMs than the MI250(x) has CUs, yet outperforms its AMD counterparts several times over. Therefore it is not consistent to assume a roughly linear or even intelligible relation between SMs, CUs, and performance.

Problem 3 - other factors
Architectural Differences:

  • Nvidia SMs: Each streaming multiprocessor (SM) in an Nvidia GPU contains multiple CUDA cores, Tensor cores, memory caches, and other components that handle different types of workloads. The exact configuration and capabilities of an SM can vary significantly between different Nvidia architectures (e.g., Ampere vs. Hopper).
  • AMD CUs: Each compute unit (CU) in an AMD GPU contains multiple stream processors (SPs), along with texture units, memory caches, and other components. The design and capabilities of CUs also vary between AMD architectures (e.g., CDNA vs. RDNA).

Core Counts and Types:

  • Nvidia: the number of CUDA cores per SM can vary between architectures. For example, the Ampere architecture has 64 CUDA cores per SM and Turing likewise has 64, but with different performance characteristics
  • AMD: the number of stream processors per CU can also vary. For instance, the RDNA2 architecture has 64 stream processors per CU, but the performance per stream processor can differ based on architectural enhancements

Specialized Units:

  • both Nvidia and AMD include specialized units in their architectures such as Tensor Cores in Nvidia GPUs for AI tasks or Ray Accelerators in AMD GPUs for ray tracing.
  • the presence and performance of these units can significantly affect overall GPU performance in specific workloads

Memory Bandwidth and Cache:

  • the memory architecture, including the type and amount of memory (HBM2, GDDR6, etc.), memory bandwidth, and cache sizes, can greatly influence performance
  • high memory bandwidth and large caches can improve performance for memory-intensive tasks

Software and Optimization:

  • the performance also depends on software, drivers, and how well applications are optimized for a specific GPU architecture
  • certain workloads may run more efficiently on one architecture due to better optimization and support in the software stack
  • AMD ROCm vs. NVIDIA Data Center GPU Driver & CUDA Toolkit

Problem 4 - high performance indicator
As indicated in the current standard, the "h" is a "high performance indicator", quote: "The optional h suffix to the compute unit count indicates high-performance (e.g. high freq or special high bandwidth gfx memory such as HBM);".
This sounds reasonable but has some flaws.
For example: What GPUs can come with HBM memory?
To name some:

  • AMD MI100
  • AMD MI50
  • AMD MI60
  • Nvidia P100
  • Nvidia V100
  • Nvidia A100
  • Nvidia H100

The problem with this is that "high performance" should indicate just what it says, but the H100 and A100 are a lot faster than the V100 or P100. The same applies to the MI100 vs. the MI50 & MI60. This can lead to confusion about what "high performance" really means; a single "h" cannot meaningfully cover the lower-end GPUs mentioned above.
It could maybe help to grade the "h" indicator up to three levels, meaning something like: P100 and V100 get no "h", the A40 would get one "h", the A100 two "hh" and the H100 three "hhh".

But where to draw the line here? Also, what happens when new generations are released whose performance is a multiple of the older generation's?
Another idea could be to use the "h", "hh" and "hhh" indicators always within the same GPU generation. For Nvidia Ampere, that would look something like this: A10 no "h", A14 "h", A30 and A40 "hh", A100 "hhh".

This approach is imo inconsistent as well, and at the least it can be confusing for the user and/or those responsible for billing these flavors.

Proposals:

get rid of SMs, CUs etc. and include the GPU model in the flavor name

  • only accept living GPUs with ongoing support for those flavors: https://endoflife.date/nvidia-gpu
    • include a list in the flavor naming standard (see the sketch after this list)
    • problem: currently there is no available list for AMD or Intel GPUs
      • maybe we can assume that their server GPUs are all not end-of-life yet, since they entered the market much later than Nvidia
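To make this concrete, here is a minimal sketch of such a machine-readable list with an end-of-life check; all entries and dates below are illustrative placeholders, not an authoritative list:

```python
from datetime import date

# Illustrative allow-list; models, generations, and EOL dates are placeholders.
SUPPORTED_GPUS = {
    "A100":  {"vendor": "Nvidia", "generation": "Ampere", "eol": date(2026, 1, 1)},
    "H100":  {"vendor": "Nvidia", "generation": "Hopper", "eol": None},  # None: no EOL announced
    "MI250": {"vendor": "AMD",    "generation": "CDNA2",  "eol": None},
}

def is_accepted(model: str, today: date = date(2024, 10, 1)) -> bool:
    """A model is accepted for flavor naming if listed and not past end of life."""
    entry = SUPPORTED_GPUS.get(model)
    return entry is not None and (entry["eol"] is None or entry["eol"] > today)

print(is_accepted("A100"))  # True (placeholder EOL date not yet reached)
print(is_accepted("K80"))   # False: not in the list
```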

High performance indicators

  • not sure how to handle "h", "hh" or "hhh" indicators for high performance
    • proposals:
      • 1. exclude entirely
      • 2. only mark "h" indicators in one GPU generation
        • e.g. for Nvidia Ampere generation:
          • A10 gets one "h"
          • A30 and A40 get two "hh"
          • A100 gets three "hhh"
            • SCS-16V-64-500s_GN-A100-hhh
      • 3. mark "h" across generations
        • A10 gets no "h"
        • A30 and A40 get one "h"
        • A100 gets two "hh"
        • H100 gets three "hhh"
          • SCS-16V-64-500s_GN-A30-h
    • I have no strong opinion here, but tending to exclude it entirely

Virtualized GPUs

  • not sure how to handle vGPUs in the flavor naming, since there can be up to 7 fractions for vGPUs, meaning you can slice e.g. an Nvidia A100 into 5, 6, or 7 parts
    • proposal (a parsing sketch follows after this list):
      • SCS-16V-64-500s_7gN-A100 would mean this flavor is one part out of 7 of an A100
      • SCS-16V-64-500s_5gN-A100 would mean this flavor is one part out of 5 of an A100
      • concern: for virtualizing an A100 GPU, we would need 6-7 flavors:
        • with 2gN-A100, 3gN-A100, 4gN-A100, 5gN-A100, 6gN-A100, 7gN-A100
        • _1gN-A100 could also be needed because of virtualized passthrough
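A small parsing sketch for this partitioning syntax; the regex encodes my reading of the examples above (fraction prefix 1-7, capital G for the full pass-through GPU) and is an assumption, not a fixed grammar:

```python
import re

# Hypothetical parser: "_7gN-A100" = 1 of 7 partitions of an Nvidia A100,
# "_GN-A100" = the full GPU passed through.
PART_RE = re.compile(
    r"_(?:(?P<parts>[1-7])g|(?P<full>G))(?P<vendor>[NAI])-(?P<model>[A-Z]+\d+\w*)"
)

def gpu_partition(flavor_name: str) -> tuple[str, int]:
    """Return (model, number of parts the card is split into)."""
    m = PART_RE.search(flavor_name)
    if not m:
        raise ValueError("no GPU part found")
    return m.group("model"), int(m.group("parts") or 1)

print(gpu_partition("SCS-16V-64-500s_7gN-A100"))  # ('A100', 7)
print(gpu_partition("SCS-16V-64-500s_GN-A100"))   # ('A100', 1)
```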

vRAM:

  • not sure how to handle vRAM; at least there are some models with two configurations, like the A100, which comes with 40 GB as well as 80 GB of vRAM
    • proposals:
      • 1. always include vRAM, also for vGPUs
        • SCS-16V-64-500s_GN-A100-40g
        • SCS-16V-64-500s_GN-A100-80g
        • SCS-16V-64-500s_2gN-A100-20g
        • SCS-16V-64-500s_3gN-A100-26,7g <-- not sure here, since e.g. 1/3 of 80 GB vRAM is ugly; maybe just always round down (see the sketch after this list)?
          • --> SCS-16V-64-500s_3gN-A100-26g
        • advantage: you see what you get
      • 2. always include vRAM, don't split for vGPUs:
        • SCS-16V-64-500s_GN-A100-40g
        • SCS-16V-64-500s_GN-A100-80g
        • SCS-16V-64-500s_2gN-A100-80g
        • SCS-16V-64-500s_2gN-A100-40g
        • SCS-16V-64-500s_GN-A10-24g
      • 3. only include vRAM in non-base models, don't split for vGPUs:
        • SCS-16V-64-500s_GN-A100-80g
        • SCS-16V-64-500s_2gN-A100
        • SCS-16V-64-500s_2gN-A100-80g
        • advantage: less probability of error in flavor definitions
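For the rounding question in the first proposal, a tiny sketch of the round-down rule (assuming vRAM is given in whole GB and fractional values are always floored):

```python
def vram_per_slice(total_gb: int, parts: int) -> int:
    """Round the per-partition vRAM down to whole GB,
    e.g. 80 GB split 3 ways -> 26g rather than 26,7g."""
    return total_gb // parts

print(vram_per_slice(80, 3))  # 26
print(vram_per_slice(40, 2))  # 20
```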

@cah-patrickthiem (Author)

For the record, since I presented the current state in today's (2024-07-10) IaaS call:
My favorite way of GPU flavor naming looks like this:

  1. do not use SMs, CUs, etc., and also do not use "h" indicators for high performance
  2. handle vGPUs as described above, meaning: ...-5gNa-A100 translates to: you get 1 part out of 5 of an Nvidia A100
  3. for vRAM I would go with the second proposal ("always include vRAM, don't split for vGPUs"), but I think the third proposal would work as well

That means we would get something like SCS-16V-64-500s_GNa-A100-40g for passthrough GPUs or SCS-16V-64-500s_3gNa-A100-40g for virtualized GPUs.
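A quick builder sketch for this favorite variant; the function and parameter names are mine, the output format follows the two examples above:

```python
def gpu_name_part(vendor: str, gen: str, model: str, vram_gb: int, parts: int = 1) -> str:
    """Build the GPU part of a flavor name in the favored scheme:
    parts=1 -> pass-through ('G...'), parts>1 -> partitioned ('<n>g...')."""
    prefix = "G" if parts == 1 else f"{parts}g"
    return f"_{prefix}{vendor}{gen}-{model}-{vram_gb}g"

base = "SCS-16V-64-500s"
print(base + gpu_name_part("N", "a", "A100", 40))           # SCS-16V-64-500s_GNa-A100-40g
print(base + gpu_name_part("N", "a", "A100", 40, parts=3))  # SCS-16V-64-500s_3gNa-A100-40g
```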

@mbuechse (Contributor) left a comment

I know this is just a draft, but for the sake of completeness, let me say it nonetheless:

  • the new version would start out with status Proposal and without the fields stabilized_at and replaces (I think; you can verify using scs-0001-v1)
  • naturally, the introduction needs to be updated as well

@mbuechse (Contributor)

It would be good to see a tentative list of GPU models.
I think the vendor and generation letters would become superfluous, depending on whether the model identifier is required or not (but the generation is most definitely redundant). So this part should probably be adjusted.
It would also be good to see a completely spelled out syntax proposal, maybe for your "favorite" way that you described in your most recent comment above.
Finally, it would be good to have some background info regarding virtualization vs. passthrough. How does virtualization work (or rather, what is possible and what isn't), and why is passthrough limited to the whole GPU (I could imagine that the GPU might have independent units?), stuff like that. This explanation could go into a non-authoritative section of the standard, or it could go into a decision record.

Speaking of a decision record: frankly, your large comment above (phrased a bit more soberly) should probably be turned into one, in order to document the development process for posterity, but maybe also to structure the process even better.

> - number (M) of processing units that are exposed (for pass-through) or assigned; see table below for vendor-specific terminology
> - high-performance indicator (`h`)
>
> Note that the vendor letter X is mandatory, generation and processing units are optional.
Contributor:

Just for my understanding: judging from the format above, the G/g marker and the high-perf indicator h are also optional, right?

So SCS-16V-64-500s_N would be a valid minimal example just stating the existence of an nVidia GPU, correct?

Contributor:

I think even that section is so far unchanged compared to v3. As Patrick stated:

> Note: The initial commit just added the flavor naming document in version 4.

(Which admits multiple readings, but my interpretation was that he merely copied the file with an increased version number.)

Contributor:

I am aware. As stated, this question is just for my understanding. The format syntax of the original standard suggests that the G/g might be optional, but I can't find any examples or statements confirming this.
I wanted to make sure that, just in case this is unintended, we can address this potential inaccuracy when @cah-patrickthiem edits this section anyway.


Contributor:

Then I think the usage of square brackets in "[G/g]" in the format specification is wrong and the brackets should be removed.

We could adjust this in the course of this refinement PR while we're at it, since the syntax and explanations will most likely need to be updated anyway.

Contributor:

I don't think this format specification can be "wrong" in that sense, because it's not that precise. We would have to use a more precise syntax (such as EBNF or something). We might do that, but then we should change the standard throughout to use this syntax.

Member:

> Just for my understanding: judging from the format above, the G/g marker and the high-perf indicator h are also optional, right?
>
> So SCS-16V-64-500s_N would be a valid minimal example just stating the existence of an nVidia GPU, correct?

So we break the compatibility with the old naming and make life harder for the parser?

Member:

> Then I think the usage of square brackets in "[G/g]" in the format specification is wrong and the brackets should be removed.

True. Either G or g is needed to indicate a GPU, and it distinguishes PT from Virt.


garloff commented Aug 23, 2024

> • according to the corresponding comment in the standard, this flavor should imply 1/4 of an Nvidia A30 GPU with an SM number of "14" (besides that, the capital "G" implies that it is a passthrough GPU rather than a virtual GPU, so the "1/4" indicator does not really make sense; just a mistake in the standard)

Not a mistake: nVidia allows one physical GPU to be partitioned and then exposed directly via pass-through as several PCIe devices for direct access by several VMs.


garloff commented Aug 23, 2024

> Conclusion on inconsistency: we can see that the H100 has significantly fewer SMs than the MI250(x) has CUs, yet outperforms its AMD counterparts several times over. Therefore it is not consistent to assume a roughly linear or even intelligible relation between SMs, CUs, and performance.

Within one vendor and one generation of GPUs (e.g. nVidia Ampere), this number (SMs in the case of nVidia) does have a well-defined meaning and lets you see how much GPU compute performance you get. A _GNa-56 is roughly 4x faster than a _GNa-14.
This number is in no way meant to compare GPUs from different generations, let alone different vendors.


garloff commented Aug 23, 2024

> Another idea could be to use the "h", "hh" and "hhh" indicators always within the same GPU generation. For Nvidia Ampere, that would look something like this: A10 no "h", A14 "h", A30 and A40 "hh", A100 "hhh".

That was exactly the way it was meant to be used. Within one vendor and one generation, there may be variants with especially high frequency or especially high-bandwidth memory (HBM). This indicator was meant to allow a provider that uses both variants to also indicate this in the flavor name.

That said, I agree that this is not specific enough; maybe we should just say that providers can add these modifiers in case they have several variants of GPUs from the same vendor and generation with significantly different performance, and in general discourage its use otherwise.


garloff commented Aug 23, 2024

General comments:

  • I think it's an excellent idea to include the amount of VRAM that the VM will have access to.
  • To get an indication of how much work you can do and whether your workload even functions, you need
    • Vendor and generation
    • Amount of compute power (SMs/CUs/EUs/ ... - YES, this is vendor and generation dependent) exposed to the VM
    • Amount of VRAM exposed to the VM

As a user, I know what I need:

  1. I have a model that e.g. runs only on nVidia as it uses CUDA (this is the case for quite some GPU workloads), so I need _GN or _gN
  2. I'm looking for a recent generation, ideally AdaLovelace, so I'm looking for _GNl
  3. Like most things done on GPUs, my model is massively parallel, so I'm looking for a flavor with lots of SMs; the more the better.
  4. I know my workload needs at least 16 GiB VRAM to perform well, so I need a flavor that has at least that

If I look at the original spec, we have a few shortcomings and a few things done well.

  • There seems to be a misunderstanding that SMs/CUs/EUs (nVidia/AMD/intel) can be compared across vendors or even across generations within a vendor. That is not the case, was never intended, and should be clarified.
  • I would argue that this number is absolutely important. 1/4 of a GPU inside one generation has significantly less compute power than the full one. Inside a generation, the hardware vendors have several cards, ranging from small to medium to large, so even if the full card is always exposed, the performance can vary by a factor of 10 or so (!). IMVHO, we definitely need a size indicator.
  • Note that at least nVidia has the capability of partitioning their cards, so they can be pass-through exposed to several VMs. Some allow quarters, some allow for 1/7 ... This is not the same as exposing a virtualized GPU, where the hypervisor would do it arbitrarily and not just expose a PCIe device.
  • We had completely missed the aspect of VRAM. This is significant and knowing whether your LLM fits into VRAM decides whether or not your workload even works. So this absolutely is needed, IMVHO.
  • The high-performance indicators h are indeed fuzzy. I'm unsure what to do with them. Sometimes we have several variants of a card, one with significantly higher frequencies, which we somehow would want to differentiate.
  • The memory could also be high-performance. If we have an HBM variant and a GDDR variant, this does make a significant difference ... (that would be an h qualifier for the VRAM).
  • I prefer NOT using model names from GPUs -- the GPU vendors may or may not make a mess out of it (as the CPU vendors have done, where you can't easily deduce the generation/microarchitecture from the name any more).

Here would be my suggestion:

  • We keep the # of SM/CU/EU, with clarifying words
  • We add VRAM
  • We create a table of existing GPUs and their names
    • The table should also contain the commonly used hardware partitioned options
    • Names for new hardware could still be derived systematically, so if we don't update this table every month, it still makes sense ...
  • If there are variants with significantly different frequencies within one generation, we can have well-defined h or even hh modifiers
  • Same for VRAM -- if there are GDDR and HBM variants within one generation, we can have a well-defined h modifier for the HBM variant
  • If we keep VRAM optional, this would even be backwards compatible (though in the long run, we'd want to mandate VRAM)

`_<G/g><Vendor>[<Gen>-<SM/CU/EU>[h[h]][-<VRAM>[h[h]]]]` would be my choice ...
This has the advantage of being rather straightforward to implement in the code (flavor name generator, parser) and is also backwards compatible, i.e. old names would still be compliant. We would of course encourage the indication of VRAM and possibly mandate it in a later version of the spec.
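To make that grammar concrete, a sketch of a matching regex (the vendor and generation letter sets are assumptions for this example; note that old v3-style names like _GNa-14h also match, illustrating the backwards compatibility):

```python
import re

# Sketch of _<G/g><Vendor>[<Gen>-<SM/CU/EU>[h[h]][-<VRAM>[h[h]]]].
NAME_RE = re.compile(
    r"_(?P<pt>[Gg])(?P<vendor>[NAI])"
    r"(?:(?P<gen>[a-z])-(?P<units>\d+)(?P<unit_h>h{0,2})"
    r"(?:-(?P<vram>\d+)(?P<vram_h>h{0,2}))?)?"
)

for name in ["_GN", "_GNa-14h", "_gNa-14-40", "_GNa-56h-24h"]:
    m = NAME_RE.fullmatch(name)
    print(name, "->", m.groupdict() if m else "no match")
```

With fullmatch, _GN parses as vendor-only, while the longer forms pick up the processing-unit count, the VRAM, and the optional h/hh qualifiers on each.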

Users and vendors may want marketing names. We could always encode them in extra specs, such as scs:gpu-vendor="nVidia", scs:gpu-generation="AdaLovelace", scs:gpu-model="L40", scs:gpu-fraction="1/4", scs:gpu-vram="20".
Long-term, this may be all we need, and we could stop compressing this information into a name. As long as we do, I would refrain from reflecting vendor-chosen names in our flavor names. Short-term, I would point them to the tables. These IMVHO should be annexes to the naming spec, so we can have a different update schedule and don't need to revise the standard just because a new piece of hardware becomes available.
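As an illustration of the extra-specs idea, a minimal sketch that renders the suggested keys into an OpenStack CLI call (the flavor name and the spec values are examples, not a fixed scheme):

```python
# Extra specs as suggested above, for a hypothetical quarter L40 flavor.
extra_specs = {
    "scs:gpu-vendor": "nVidia",
    "scs:gpu-generation": "AdaLovelace",
    "scs:gpu-model": "L40",
    "scs:gpu-fraction": "1/4",
    "scs:gpu-vram": "20",
}

# Equivalent CLI invocation (flavor name is illustrative):
props = " ".join(f"--property {k}={v}" for k, v in extra_specs.items())
print(f"openstack flavor set {props} SCS-16V-64-500s_gNl-20")
```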

Just my 0.02€.


cah-patrickthiem commented Sep 4, 2024

> (garloff's "General comments" of Aug 23, 2024, quoted here in full)

I took some time thinking about your comment.
First of all, I absolutely agree on the VRAM aspect. I also agree on the memory type with the "h" marker.

I am not strictly against the SM/CU/EU thing, but let me say that we need to really think about the specifics of different GPUs if we go that way. The connection between the performance of a GPU and its processing-unit count is NOT linear. One can build a rough estimate of the performance, yes, but twice the unit count does not make it twice as fast. Other aspects, such as memory (bandwidth), clock, drivers, or even the core types (CUDA, ray tracing, etc.) can massively impact the computing power of a GPU.

Example:
A100 vs. A10:
The Nvidia A100 is the top-performing all-round GPU of the Ampere generation, especially when it comes to high-performance computing/AI workloads etc.
The A100 has 108 SMs, the A10 has 72 SMs.

FP16 Tensor Core Performance:

  • A100: Up to 312 teraFLOPS
  • A10: Up to 148 teraFLOPS
  • Factor: 312/148 ≈ 2.11

INT8 Tensor Core Performance:

  • A100: Up to 624 TOPS
  • A10: Up to 148 TOPS
  • Factor: 624/148 ≈ 4.22

FP64 CUDA Core Performance:

  • A100: Up to 19.5 teraFLOPS
  • A10: Up to 7.4 teraFLOPS
  • Factor: 19.5/7.4 ≈ 2.63

Streaming Multiprocessors (SMs):

  • A100 SMs: 108
  • A10 SMs: 72
  • Factor: 108/72 = 1.5

--> both cards do not have ray tracing cores

Another example:
A100 vs A40:

FP16 Tensor Core Performance:

  • A100: Up to 312 teraFLOPS
  • A40: Up to 156 teraFLOPS
  • Factor: 312/156 = 2.0

INT8 Tensor Core Performance:

  • Factor: 624/312 = 2.0

FP64 CUDA Core Performance:

  • Factor: 19.5/9.7 ≈ 2.01

Streaming Multiprocessors (SMs):

  • A100 SMs: 108
  • A40 SMs: 84
  • Factor: 108/84 ≈ 1.29

Ray Tracing and 3D Rendering:
Ray Tracing Cores:

  • A100: No ray tracing cores
  • A40: Includes ray tracing cores
  • Note: the A100 lacks ray tracing cores, whereas the A40 is designed with them, making the A40 better suited for ray tracing and 3D rendering tasks

3D Rendering Performance:

  • the A40, with its ray tracing cores and architecture, is significantly better suited for real-time 3D rendering tasks than the A100, which is more focused on computational and AI workloads

Last example:
NVIDIA A40 vs. A10:

FP16 Tensor Core Performance:

  • A40: 156 teraFLOPS
  • A10: 148 teraFLOPS
  • Factor: 156/148 ≈ 1.05

INT8 Tensor Core Performance:

  • Factor: 312/148 ≈ 2.11

FP64 CUDA Core Performance:

  • Factor: 9.7/7.4 ≈ 1.31

SM Count:

  • A40 SMs: 84
  • A10 SMs: 72
  • Factor: 84/72 ≈ 1.17

Ray Tracing and 3D Rendering:
Ray Tracing Cores:

  • A40: Includes ray tracing cores
  • A10: No ray tracing cores

As you can see, the key thing here is the non-linearity, plus the fact that some GPUs have capabilities which others do not have, e.g. A40 vs. A100 & A10.
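The comparison can be condensed into a few lines of code; the spec numbers below are simply the ones quoted above:

```python
# Spec values as quoted above (teraFLOPS, TOPS, SM counts).
specs = {
    "A100": {"fp16_tc": 312, "int8_tc": 624, "fp64": 19.5, "sms": 108},
    "A40":  {"fp16_tc": 156, "int8_tc": 312, "fp64": 9.7,  "sms": 84},
    "A10":  {"fp16_tc": 148, "int8_tc": 148, "fp64": 7.4,  "sms": 72},
}

def ratios(a: str, b: str) -> dict:
    """Per-metric ratio a/b, to contrast with the bare SM ratio."""
    return {k: round(specs[a][k] / specs[b][k], 2) for k in specs[a]}

print(ratios("A100", "A10"))  # SM ratio 1.5, but FP16 ~2.11 and INT8 ~4.22
print(ratios("A100", "A40"))  # SM ratio ~1.29, but throughput roughly 2x
```

The spread between the SM ratio and the throughput ratios is exactly the non-linearity described above.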

These things make it hard for me to be entirely positive about the SM/CU/EU topic.
And so far I did not do research or calculations for virtual GPUs; I can imagine there will also be some conflicts there, or at least more confusion regarding the assumed performance. Besides that, I only researched this in detail for Nvidia, not Intel or AMD.

Imo we need to conclude here soon; maybe I forgot to take something into consideration, but from where I stand right now, I find it rather difficult to work with these numbers.

Please tell me what you think about these new statements from my side.


garloff commented Oct 9, 2024

Hi Patrick,

thanks for the very detailed analysis.

No, performance is not linear, as it depends on the bottleneck.
Sometimes memory bandwidth is the limiting factor, sometimes the amount of compute resources (SMs/CudaCores), sometimes your code has synchronization mechanisms that limit the parallelism, ...

I probably triggered another wrong assumption when saying that you get 4x as much compute power from a GNa-56h (A30) compared to a GNa-14h (a quarter A30). This is very much dependent on your workload, as you correctly say. You may even create workloads where GNa-56h is not much faster than GNa-14h.

Nevertheless, I stand by the design idea that we should tell a user how many resources she gets. In one case it's 14 SMs (896 CUDA cores for Ampere) for a quarter A30, and in the other it's 56 SMs (3584 CUDA cores) for a full one.
For an H100, this could be 132 SMs (16896 CUDA cores, a full H100) vs. 18 SMs (2304 CUDA cores, 1/7 of an H100) or 2/7 or ...

It's like vCPUs in instances -- 2 vCPUs do not necessarily make your workload twice as fast, but you know roughly how much CPU power you get ...


garloff commented Oct 9, 2024

Another note:
Let's please call the Multi-Instance-GPU thing from nVidia partitioning and NOT virtualization.

@cah-patrickthiem (Author)

> (garloff's reply of Oct 9, 2024, quoted here in full)

That means you would still be in favor of the SM/CU etc. way, right?

@cah-patrickthiem (Author)

I am closing this PR because we finally came to a common conclusion; see PR #780, which has already been merged.
