
GPU-flavor-naming-refinement #546

Closed · wants to merge 2 commits

Conversation


@cah-patrickthiem cah-patrickthiem commented Apr 3, 2024

This PR handles the refinement of GPU flavor naming. It clarifies things and overhauls some possible inconsistencies in the current naming convention as well as in its description. To that end, this PR updates the document scs-0100-v3-flavor-naming.md.
For reference, see issue #366 (GPU naming convention needs further refinements).

Note: The initial commit just added the flavor naming document in version 4.

@cah-patrickthiem cah-patrickthiem added standards Issues / ADR / pull requests relevant for standardization & certification SCS-VP10 Related to tender lot SCS-VP10 labels Apr 3, 2024
@cah-patrickthiem cah-patrickthiem self-assigned this Apr 3, 2024

cah-patrickthiem commented May 21, 2024

After some research (see here and below), I came to the following conclusions:
The GPU flavor naming should be changed so that it is clearer what to expect, and, most importantly, we unfortunately cannot use a general performance indicator as introduced in the prior flavor naming standard. The current standard has four major problems that I want to tackle here.

Derivation:

current standard:

  • right now the standard suggests something like this: SCS-16V-64-500s_GNa-14h
  • let's translate it: this flavor indicates 16 vCPUs, 64 GiB RAM, a 500 GB SSD, and a passthrough Nvidia GPU (N) of the "Ampere" (a) generation with 14 "streaming multiprocessors" (SMs) plus an "h" for "high performance" (a decoding sketch follows below)
  • so far so good
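For illustration, here is a minimal sketch of how this v3-style GPU suffix could be decoded. The regex and the lookup tables are my own reading of the examples above, not the normative syntax from the standard:

```python
import re

# Illustrative decoder for a v3-style GPU suffix such as "_GNa-14h".
# Vendor and generation tables are assumptions for this sketch.
VENDORS = {"N": "Nvidia", "A": "AMD", "I": "Intel"}
NVIDIA_GENERATIONS = {"a": "Ampere", "l": "AdaLovelace"}

GPU_RE = re.compile(
    r"_(?P<pt>[Gg])(?P<vendor>[NAI])(?P<gen>[a-z])?"
    r"(?:-(?P<units>\d+)(?P<high>h*))?"
)

def decode_gpu_suffix(flavor_name: str) -> dict:
    m = GPU_RE.search(flavor_name)
    if not m:
        raise ValueError(f"no GPU part found in {flavor_name!r}")
    return {
        "passthrough": m.group("pt") == "G",   # G = pass-through, g = vGPU
        "vendor": VENDORS[m.group("vendor")],
        "generation": NVIDIA_GENERATIONS.get(m.group("gen") or ""),
        "units": int(m.group("units")) if m.group("units") else None,  # SM/CU count
        "high_perf": len(m.group("high") or ""),  # number of trailing "h"
    }

print(decode_gpu_suffix("SCS-16V-64-500s_GNa-14h"))
# {'passthrough': True, 'vendor': 'Nvidia', 'generation': 'Ampere',
#  'units': 14, 'high_perf': 1}
```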

Problem 1 - transparency

What GPU exactly is a GNa-14h?

  • according to the corresponding comment in the standard, this flavor should imply 1/4 of an Nvidia A30 GPU with an SM number of "14" (besides that, the capital "G" implies that it is a passthrough GPU rather than a virtual GPU, so the "1/4" indicator does not really make sense; just a mistake in the standard)
  • the user would simply not know that it is an A30 GPU; even if we compute 14*4=56 to get the real number of SMs of this GPU, the user would still not know what they get

Solution of Problem 1

  • make a GPU list for SCS with all corresponding specs and maybe even extend it to vGPUs, where fractions of the real SM numbers are given
  • "There we go, fixed!" - you might think

but there is more...

Problem 2 - inconsistency

  • a "streaming multiprocessor" for Nvidia and a "computing unit" for AMD CANNOT be compared, at least not in a clear and logical way...
    Why?

  • there are discrepancies between the performance of a GPU and the number of SMs or CUs or whatever

  • this applies for GPUs from the same vendor as well as between different vendors

for Nvidia examples, see here:

  • the Nvidia A100 has 108 SMs
  • the Nvidia H100 has 114 SMs
  • the Nvidia L40 has 142 SMs
  • ranked by performance: H100 > A100 > L40
  • Note: the H100 is up to 4x faster than the A100 even though it has just 6 more SMs; the L40 has more SMs than the A100 but is significantly slower

for AMD examples, see here:

  • AMD MI100 has 120 CUs
  • AMD MI250 has 208 CUs
  • AMD MI250x has 220 CUs

Different Performance Benchmarks (for more details see here and here):

  • FP64 Performance: H100 > MI250(x) > A100 > MI100
  • Memory Bandwidth: H100 > MI250(x) > A100 > MI100
  • Tensor Performance (FP16): H100 > MI250(x) > A100 > MI100

Conclusion on inconsistency: we can see that the H100 has significantly fewer SMs than the MI250(x) has CUs, yet outperforms its AMD counterparts several times over. Therefore it is not consistent to assume a roughly linear or even intelligible relation between SMs, CUs, and performance.

Problem 3 - other factors
Architectural Differences:

  • Nvidia SMs: Each streaming multiprocessor (SM) in an Nvidia GPU contains multiple CUDA cores, Tensor cores, memory caches, and other components that handle different types of workloads. The exact configuration and capabilities of an SM can vary significantly between different Nvidia architectures (e.g., Ampere vs. Hopper).
  • AMD CUs: Each compute unit (CU) in an AMD GPU contains multiple stream processors (SPs), along with texture units, memory caches, and other components. The design and capabilities of CUs also vary between AMD architectures (e.g., CDNA vs. RDNA).

Core Counts and Types:

  • Nvidia: the number of CUDA cores per SM can vary between architectures. For example, the Ampere architecture has 64 CUDA cores per SM and Turing likewise has 64, but with different performance characteristics
  • AMD: the number of stream processors per CU can also vary. For instance, the RDNA2 architecture has 64 stream processors per CU, but the performance per stream processor can differ based on architectural enhancements

Specialized Units:

  • both Nvidia and AMD include specialized units in their architectures such as Tensor Cores in Nvidia GPUs for AI tasks or Ray Accelerators in AMD GPUs for ray tracing.
  • the presence and performance of these units can significantly affect overall GPU performance in specific workloads

Memory Bandwidth and Cache:

  • the memory architecture, including the type and amount of memory (HBM2, GDDR6, etc.), memory bandwidth, and cache sizes, can greatly influence performance
  • high memory bandwidth and large caches can improve performance for memory-intensive tasks

Software and Optimization:

  • the performance also depends on software, drivers, and how well applications are optimized for a specific GPU architecture
  • certain workloads may run more efficiently on one architecture due to better optimization and support in the software stack
  • AMD ROCm vs. NVIDIA Data Center GPU Driver & CUDA Toolkit

Problem 4 - high performance indicator
As indicated in the current standard, the "h" is a "high performance indicator", quote: "The optional h suffix to the compute unit count indicates high-performance (e.g. high freq or special high bandwidth gfx memory such as HBM);".
This sounds reasonable but has some flaws.
For example: What GPUs can come with HBM memory?
To name some:

  • AMD MI100
  • AMD MI50
  • AMD MI60
  • Nvidia P100
  • Nvidia V100
  • Nvidia A100
  • Nvidia H100

The problem with this is that "high performance" should indicate just what it says, but the H100 and A100 are a lot faster than the V100 or P100. The same applies to the MI100 vs. the MI50 & MI60. This can lead to confusion about what "high performance" really means; a single "h" cannot meaningfully cover the lower-end GPUs mentioned above.
It could maybe help to grade the "h" indicator up to three levels, meaning something like: P100 and V100 get no "h", the A40 would get one "h", the A100 two "hh" and the H100 three "hhh".

But where to draw the line here? Also, what happens when new generations are released whose performance is a multiple of the older generation's?
Another idea could be to use the "h", "hh" and "hhh" indicators always within the same GPU generation. For Nvidia Ampere, that would look something like this: A10 no "h", A14 "h", A30 and A40 "hh", A100 "hhh".

This approach is imo inconsistent as well, and at the least it can be confusing for the user and/or those responsible for billing these flavors.

Proposals:

get rid of SMs, CUs etc. and include the GPU model in the flavor name

  • only accept living GPUs with ongoing support for those flavors: https://endoflife.date/nvidia-gpu
    • include a list in the flavor naming standard (see the sketch after this list)
    • problem: currently there is no available list for AMD or Intel GPUs
      • maybe we can assume that their server GPUs are all not end-of-life yet, since they entered the market much later than Nvidia
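To make this concrete, here is a minimal sketch of such a machine-readable list with an end-of-life check; all entries and dates below are illustrative placeholders, not an authoritative list:

```python
from datetime import date

# Illustrative allow-list; models, generations, and EOL dates are placeholders.
SUPPORTED_GPUS = {
    "A100":  {"vendor": "Nvidia", "generation": "Ampere", "eol": date(2026, 1, 1)},
    "H100":  {"vendor": "Nvidia", "generation": "Hopper", "eol": None},  # None: no EOL announced
    "MI250": {"vendor": "AMD",    "generation": "CDNA2",  "eol": None},
}

def is_accepted(model: str, today: date = date(2024, 10, 1)) -> bool:
    """A model is accepted for flavor naming if listed and not past end of life."""
    entry = SUPPORTED_GPUS.get(model)
    return entry is not None and (entry["eol"] is None or entry["eol"] > today)

print(is_accepted("A100"))  # True (placeholder EOL date not yet reached)
print(is_accepted("K80"))   # False: not in the list
```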

High performance indicators

  • not sure how to handle "h", "hh" or "hhh" indicators for high performance
    • proposals:
      • 1. exclude entirely
      • 2. only mark "h" indicators in one GPU generation
        • e.g. for Nvidia Ampere generation:
          • A10 gets one "h"
          • A30 and A40 get two "hh"
          • A100 gets three "hhh"
            • SCS-16V-64-500s_GN-A100-hhh
      • 3. mark "h" across generations
        • A10 gets no "h"
        • A30 and A40 get one "h"
        • A100 gets two "hh"
        • H100 gets three "hhh"
          • SCS-16V-64-500s_GN-A30-h
    • I have no strong opinion here, but tending to exclude it entirely

Virtualized GPUs

  • not sure how to handle vGPUs in the flavor naming, since there can be up to 7 fractions for vGPUs, meaning you can slice e.g. an Nvidia A100 into 5, 6, or 7 parts
    • proposal (a parsing sketch follows after this list):
      • SCS-16V-64-500s_7gN-A100 would mean this flavor is one part out of 7 of an A100
      • SCS-16V-64-500s_5gN-A100 would mean this flavor is one part out of 5 of an A100
      • concern: for virtualizing an A100 GPU, we would need 6-7 flavors:
        • with 2gN-A100, 3gN-A100, 4gN-A100, 5gN-A100, 6gN-A100, 7gN-A100
        • _1gN-A100 could also be needed because of virtualized passthrough
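A small parsing sketch for this partitioning syntax; the regex encodes my reading of the examples above (fraction prefix 1-7, capital G for the full pass-through GPU) and is an assumption, not a fixed grammar:

```python
import re

# Hypothetical parser: "_7gN-A100" = 1 of 7 partitions of an Nvidia A100,
# "_GN-A100" = the full GPU passed through.
PART_RE = re.compile(
    r"_(?:(?P<parts>[1-7])g|(?P<full>G))(?P<vendor>[NAI])-(?P<model>[A-Z]+\d+\w*)"
)

def gpu_partition(flavor_name: str) -> tuple[str, int]:
    """Return (model, number of parts the card is split into)."""
    m = PART_RE.search(flavor_name)
    if not m:
        raise ValueError("no GPU part found")
    return m.group("model"), int(m.group("parts") or 1)

print(gpu_partition("SCS-16V-64-500s_7gN-A100"))  # ('A100', 7)
print(gpu_partition("SCS-16V-64-500s_GN-A100"))   # ('A100', 1)
```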

vRAM:

  • not sure how to handle vRAM; at least there are some models with two configurations, like the A100, which comes with 40 GB as well as 80 GB of vRAM
    • proposals:
      • 1. always include vRAM, also for vGPUs
        • SCS-16V-64-500s_GN-A100-40g
        • SCS-16V-64-500s_GN-A100-80g
        • SCS-16V-64-500s_2gN-A100-20g
        • SCS-16V-64-500s_3gN-A100-26,7g <-- not sure here, since e.g. 1/3 of 80 GB vRAM is ugly; maybe just always round down (see the sketch after this list)?
          • --> SCS-16V-64-500s_3gN-A100-26g
        • advantage: you see what you get
      • 2. always include vRAM, don't split for vGPUs:
        • SCS-16V-64-500s_GN-A100-40g
        • SCS-16V-64-500s_GN-A100-80g
        • SCS-16V-64-500s_2gN-A100-80g
        • SCS-16V-64-500s_2gN-A100-40g
        • SCS-16V-64-500s_GN-A10-24g
      • 3. only include vRAM in non-base models, don't split for vGPUs:
        • SCS-16V-64-500s_GN-A100-80g
        • SCS-16V-64-500s_2gN-A100
        • SCS-16V-64-500s_2gN-A100-80g
        • advantage: less probability of error in flavor definitions
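For the rounding question in the first proposal, a tiny sketch of the round-down rule (assuming vRAM is given in whole GB and fractional values are always floored):

```python
def vram_per_slice(total_gb: int, parts: int) -> int:
    """Round the per-partition vRAM down to whole GB,
    e.g. 80 GB split 3 ways -> 26g rather than 26,7g."""
    return total_gb // parts

print(vram_per_slice(80, 3))  # 26
print(vram_per_slice(40, 2))  # 20
```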

@cah-patrickthiem (Author)

For the record, since I presented the current state in today's (2024-07-10) IaaS call:
My favorite way of GPU flavor naming looks like this:

  1. do not use SMs, CUs, etc., and also do not use "h" indicators for high performance
  2. handle vGPUs as described above, meaning: ...-5gNa-A100 translates to: you get 1 part out of 5 of an Nvidia A100
  3. for vRAM I would go with the second proposal ("always include vRAM, don't split for vGPUs"), but I think the third proposal would work as well

That means we would get something like SCS-16V-64-500s_GNa-A100-40g for passthrough GPUs or SCS-16V-64-500s_3gNa-A100-40g for virtualized GPUs.
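A quick builder sketch for this favorite variant; the function and parameter names are mine, the output format follows the two examples above:

```python
def gpu_name_part(vendor: str, gen: str, model: str, vram_gb: int, parts: int = 1) -> str:
    """Build the GPU part of a flavor name in the favored scheme:
    parts=1 -> pass-through ('G...'), parts>1 -> partitioned ('<n>g...')."""
    prefix = "G" if parts == 1 else f"{parts}g"
    return f"_{prefix}{vendor}{gen}-{model}-{vram_gb}g"

base = "SCS-16V-64-500s"
print(base + gpu_name_part("N", "a", "A100", 40))           # SCS-16V-64-500s_GNa-A100-40g
print(base + gpu_name_part("N", "a", "A100", 40, parts=3))  # SCS-16V-64-500s_3gNa-A100-40g
```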

@mbuechse (Contributor) left a comment

I know this is just a draft, but for the sake of completeness, let me say it nonetheless:

  • the new version would start out with status Proposal and without the fields stabilized_at and replaces (I think; you can verify using scs-0001-v1)
  • naturally, the introduction needs to be updated as well

@mbuechse (Contributor)

It would be good to see a tentative list of GPU models.
I think the vendor and generation letters would become superfluous, depending on whether the model identifier is required or not (but the generation is most definitely redundant). So this part should probably be adjusted.
It would also be good to see a completely spelled out syntax proposal, maybe for your "favorite" way that you described in your most recent comment above.
Finally, it would be good to have some background info regarding virtualization vs. passthrough. How does virtualization work (or rather, what is possible and what isn't), and why is passthrough limited to the whole GPU (I could imagine that the GPU might have independent units?), stuff like that. This explanation could go into a non-authoritative section of the standard, or it could go into a decision record.

Speaking of a decision record: frankly, your large comment above (phrased a bit more soberly) should probably be turned into one, in order to document the development process for posterity, but maybe also to structure the process even better.

> - number (M) of processing units that are exposed (for pass-through) or assigned; see table below for vendor-specific terminology
> - high-performance indicator (`h`)
>
> Note that the vendor letter X is mandatory, generation and processing units are optional.
Contributor:

Just for my understanding: judging from the format above, the G/g marker and the high-perf indicator h are also optional, right?

So SCS-16V-64-500s_N would be a valid minimal example just stating the existence of an nVidia GPU, correct?

Contributor:

I think even that section is so far unchanged compared to v3. As Patrick stated:

> Note: The initial commit just added the flavor naming document in version 4.

(Which admits multiple readings, but my interpretation was that he merely copied the file with an increased version number.)

Contributor:

I am aware. As stated, this question is just for my understanding. The format syntax of the original standard suggests that the G/g might be optional, but I can't find any examples or statements confirming this.
I wanted to make sure that, just in case this is unintended, we can address this potential inaccuracy when @cah-patrickthiem edits this section anyway.


Contributor:

Then I think the usage of square brackets in "[G/g]" in the format specification is wrong and the brackets should be removed.

We could adjust this in the course of this refinement PR while we're at it, since the syntax and explanations will most likely need to be updated anyway.

Contributor:

I don't think this format specification can be "wrong" in that sense, because it's not that precise. We would have to use a more precise syntax (such as EBNF or something). We might do that, but then we should change the standard throughout to use this syntax.

Member:

> Just for my understanding: judging from the format above, the G/g marker and the high-perf indicator h are also optional, right?
>
> So SCS-16V-64-500s_N would be a valid minimal example just stating the existence of an nVidia GPU, correct?

So we break the compatibility with the old naming and make life harder for the parser?

Member:

> Then I think the usage of square brackets in "[G/g]" in the format specification is wrong and the brackets should be removed.

True. Either G or g is needed to indicate a GPU, and it distinguishes PT from Virt.


garloff commented Aug 23, 2024

> • according to the corresponding comment in the standard, this flavor should imply 1/4 of an Nvidia A30 GPU with an SM number of "14" (besides that, the capital "G" implies that it is a passthrough GPU rather than a virtual GPU, so the "1/4" indicator does not really make sense; just a mistake in the standard)

Not a mistake: nVidia allows one physical GPU to be partitioned and then exposed directly via pass-through as several PCIe devices for direct access by several VMs.


garloff commented Aug 23, 2024

> Conclusion on inconsistency: we can see that the H100 has significantly fewer SMs than the MI250(x) has CUs, yet outperforms its AMD counterparts several times over. Therefore it is not consistent to assume a roughly linear or even intelligible relation between SMs, CUs, and performance.

Within one vendor and one generation of GPUs (e.g. nVidia Ampere), this number (SMs in the case of nVidia) does have a well-defined meaning and lets you see how much GPU compute performance you get. A _GNa-56 is roughly 4x faster than a _GNa-14.
This number is in no way meant to compare GPUs from different generations, let alone different vendors.


garloff commented Aug 23, 2024

> Another idea could be to use the "h", "hh" and "hhh" indicators always within the same GPU generation. For Nvidia Ampere, that would look something like this: A10 no "h", A14 "h", A30 and A40 "hh", A100 "hhh".

That was exactly the way it was meant to be used. Within one vendor and one generation, there may be variants with especially high frequency or especially high-bandwidth memory (HBM). This indicator was meant to allow a provider that uses both variants to also indicate this in the flavor name.

That said, I agree that this is not specific enough; maybe we should just say that providers can add these modifiers in case they have several variants of GPUs from the same vendor and generation with significantly different performance, and in general discourage its use otherwise.


garloff commented Aug 23, 2024

General comments:

  • I think it's an excellent idea to include the amount of VRAM that the VM will have access to.
  • To get an indication of how much work you can do and whether your workload even functions, you need
    • Vendor and generation
    • Amount of compute power (SMs/CUs/EUs/ ... - YES, this is vendor and generation dependent) exposed to the VM
    • Amount of VRAM exposed to the VM

As a user, I know what I need:

  1. I have a model that e.g. runs only on nVidia as it uses CUDA (this is the case for quite some GPU workloads), so I need _GN or _gN
  2. I'm looking for a recent generation, ideally AdaLovelace, so I'm looking for _GNl
  3. Like most things done on GPUs, my model is massively parallel, so I'm looking for a flavor with lots of SMs; the more the better.
  4. I know my workload needs at least 16 GiB VRAM to perform well, so I need a flavor that has at least that

If I look at the original spec, we have a few shortcomings and a few things done well.

  • There seems to be a misunderstanding that SMs/CUs/EUs (nVidia/AMD/intel) can be compared across vendors or even across generations within a vendor. That is not the case, was never intended, and should be clarified.
  • I would argue that this number is absolutely important. 1/4 of a GPU inside one generation has significantly less compute power than the full one. Inside a generation, the hardware vendors have several cards, ranging from small to medium to large, so even if the full card is always exposed, the performance can vary by a factor of 10 or so (!). IMVHO, we definitely need a size indicator.
  • Note that at least nVidia has the capability of partitioning their cards, so they can be pass-through exposed to several VMs. Some allow quarters, some allow for 1/7 ... This is not the same as exposing a virtualized GPU, where the hypervisor would do it arbitrarily and not just expose a PCIe device.
  • We had completely missed the aspect of VRAM. This is significant and knowing whether your LLM fits into VRAM decides whether or not your workload even works. So this absolutely is needed, IMVHO.
  • The high-performance indicators h are indeed fuzzy. I'm unsure what to do with them. Sometimes we have several variants of a card, one with significantly higher frequencies, which we somehow would want to differentiate.
  • The memory could also be high-performance. If we have an HBM variant and a GDDR variant, this does make a significant difference ... (that would be an h qualifier for the VRAM).
  • I prefer NOT using model names from GPUs -- the GPU vendors may or may not make a mess out of it (as the CPU vendors have done, where you can't easily deduce the generation/microarchitecture from the name any more).

Here would be my suggestion:

  • We keep the # of SM/CU/EU, with clarifying words
  • We add VRAM
  • We create a table of existing GPUs and their names
    • The table should also contain the commonly used hardware partitioned options
    • Names for new hardware could still be derived systematically, so if we don't update this table every month, it still makes sense ...
  • If there are variants with significantly different frequencies within one generation, we can have well-defined h or even hh modifiers
  • Same for VRAM -- if there are GDDR and HBM variants within one generation, we can have a well-defined h modifier for the HBM variant
  • If we keep VRAM optional, this would even be backwards compatible (though in the long run, we'd want to mandate VRAM)

`_<G/g><Vendor>[<Gen>-<SM/CU/EU>[h[h]][-<VRAM>[h[h]]]]` would be my choice ...
This has the advantage of being rather straightforward to implement in the code (flavor name generator, parser) and is also backwards compatible, i.e. old names would still be compliant. We would of course encourage the indication of VRAM and possibly mandate it in a later version of the spec.
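To make that grammar concrete, a sketch of a matching regex (the vendor and generation letter sets are assumptions for this example; note that old v3-style names like _GNa-14h also match, illustrating the backwards compatibility):

```python
import re

# Sketch of _<G/g><Vendor>[<Gen>-<SM/CU/EU>[h[h]][-<VRAM>[h[h]]]].
NAME_RE = re.compile(
    r"_(?P<pt>[Gg])(?P<vendor>[NAI])"
    r"(?:(?P<gen>[a-z])-(?P<units>\d+)(?P<unit_h>h{0,2})"
    r"(?:-(?P<vram>\d+)(?P<vram_h>h{0,2}))?)?"
)

for name in ["_GN", "_GNa-14h", "_gNa-14-40", "_GNa-56h-24h"]:
    m = NAME_RE.fullmatch(name)
    print(name, "->", m.groupdict() if m else "no match")
```

With fullmatch, _GN parses as vendor-only, while the longer forms pick up the processing-unit count, the VRAM, and the optional h/hh qualifiers on each.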

Users and vendors may want marketing names. We could always encode them in extra specs, such as scs:gpu-vendor="nVidia", scs:gpu-generation="AdaLovelace", scs:gpu-model="L40", scs:gpu-fraction="1/4", scs:gpu-vram="20".
Long-term, this may be all we need, and we could stop compressing this information into a name. As long as we do, I would refrain from reflecting vendor-chosen names in our flavor names. Short-term, I would point them to the tables. These IMVHO should be annexes to the naming spec, so we can have a different update schedule and don't need to revise the standard just because a new piece of hardware becomes available.
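As an illustration of the extra-specs idea, a minimal sketch that renders the suggested keys into an OpenStack CLI call (the flavor name and the spec values are examples, not a fixed scheme):

```python
# Extra specs as suggested above, for a hypothetical quarter L40 flavor.
extra_specs = {
    "scs:gpu-vendor": "nVidia",
    "scs:gpu-generation": "AdaLovelace",
    "scs:gpu-model": "L40",
    "scs:gpu-fraction": "1/4",
    "scs:gpu-vram": "20",
}

# Equivalent CLI invocation (flavor name is illustrative):
props = " ".join(f"--property {k}={v}" for k, v in extra_specs.items())
print(f"openstack flavor set {props} SCS-16V-64-500s_gNl-20")
```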

Just my 0.02€.


cah-patrickthiem commented Sep 4, 2024

> (garloff's "General comments" of Aug 23, 2024, quoted here in full)

I took some time thinking about your comment.
First of all, I absolutely agree on the VRAM aspect. I also agree on the memory type with the "h" marker.

I am not strictly against the SM/CU/EU thing, but let me say that we need to really think about the specifics of different GPUs if we go that way. The connection between the performance of a GPU and its processing-unit count is NOT linear. One can build a rough estimate of the performance, yes, but twice the unit count does not make it twice as fast. Other aspects, such as memory (bandwidth), clock, drivers, or even the core types (CUDA, ray tracing, etc.) can massively impact the computing power of a GPU.

Example:
A100 vs. A10:
The Nvidia A100 is the top-performing all-round GPU of the Ampere generation, especially when it comes to high-performance computing/AI workloads etc.
The A100 has 108 SMs, the A10 has 72 SMs.

FP16 Tensor Core Performance:

  • A100: Up to 312 teraFLOPS
  • A10: Up to 148 teraFLOPS
  • Factor: 312/148 ≈ 2.11

INT8 Tensor Core Performance:

  • A100: Up to 624 TOPS
  • A10: Up to 148 TOPS
  • Factor: 624/148 ≈ 4.22

FP64 CUDA Core Performance:

  • A100: Up to 19.5 teraFLOPS
  • A10: Up to 7.4 teraFLOPS
  • Factor: 19.5/7.4 ≈ 2.63

Streaming Multiprocessors (SMs):

  • A100 SMs: 108
  • A10 SMs: 72
  • Factor: 108/72 = 1.5

--> both cards do not have ray tracing cores

Another example:
A100 vs A40:

FP16 Tensor Core Performance:

  • A100: Up to 312 teraFLOPS
  • A40: Up to 156 teraFLOPS
  • Factor: 312/156 = 2.0

INT8 Tensor Core Performance:

  • Factor: 624/312 = 2.0

FP64 CUDA Core Performance:

  • Factor: 19.5/9.7 ≈ 2.01

Streaming Multiprocessors (SMs):

  • A100 SMs: 108
  • A40 SMs: 84
  • Factor: 108/84 ≈ 1.29

Ray Tracing and 3D Rendering:
Ray Tracing Cores:

  • A100: No ray tracing cores
  • A40: Includes ray tracing cores
  • Note: the A100 lacks ray tracing cores, whereas the A40 is designed with them, making the A40 better suited for ray tracing and 3D rendering tasks

3D Rendering Performance:

  • the A40, with its ray tracing cores and architecture, is significantly better suited for real-time 3D rendering tasks than the A100, which is more focused on computational and AI workloads

Last example:
NVIDIA A40 vs. A10:

FP16 Tensor Core Performance:

  • A40: 156 teraFLOPS
  • A10: 148 teraFLOPS
  • Factor: 156/148 ≈ 1.05

INT8 Tensor Core Performance:

  • Factor: 312/148 ≈ 2.11

FP64 CUDA Core Performance:

  • Factor: 9.7/7.4 ≈ 1.31

SM Count:

  • A40 SMs: 84
  • A10 SMs: 72
  • Factor: 84/72 ≈ 1.17

Ray Tracing and 3D Rendering:
Ray Tracing Cores:

  • A40: Includes ray tracing cores
  • A10: No ray tracing cores

As you can see, the key thing here is the non-linearity, plus the fact that some GPUs have capabilities which others do not have, e.g. A40 vs. A100 & A10.
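The comparison can be condensed into a few lines of code; the spec numbers below are simply the ones quoted above:

```python
# Spec values as quoted above (teraFLOPS, TOPS, SM counts).
specs = {
    "A100": {"fp16_tc": 312, "int8_tc": 624, "fp64": 19.5, "sms": 108},
    "A40":  {"fp16_tc": 156, "int8_tc": 312, "fp64": 9.7,  "sms": 84},
    "A10":  {"fp16_tc": 148, "int8_tc": 148, "fp64": 7.4,  "sms": 72},
}

def ratios(a: str, b: str) -> dict:
    """Per-metric ratio a/b, to contrast with the bare SM ratio."""
    return {k: round(specs[a][k] / specs[b][k], 2) for k in specs[a]}

print(ratios("A100", "A10"))  # SM ratio 1.5, but FP16 ~2.11 and INT8 ~4.22
print(ratios("A100", "A40"))  # SM ratio ~1.29, but throughput roughly 2x
```

The spread between the SM ratio and the throughput ratios is exactly the non-linearity described above.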

These things make it hard for me to be entirely positive about the SM/CU/EU topic.
And so far I did not do research or calculations for virtual GPUs; I can imagine there will also be some conflicts there, or at least more confusion regarding the assumed performance. Besides that, I only researched this in detail for Nvidia, not Intel or AMD.

Imo we need to conclude here soon; maybe I forgot to take something into consideration, but from where I stand right now, I find it rather difficult to work with these numbers.

Please tell me what you think about these new statements from my side.


garloff commented Oct 9, 2024

Hi Patrick,

thanks for the very detailed analysis.

No, performance is not linear, as it depends on the bottleneck.
Sometimes memory bandwidth is the limiting factor, sometimes the amount of compute resources (SMs/CudaCores), sometimes your code has synchronization mechanisms that limit the parallelism, ...

I probably triggered another wrong assumption when saying that you get 4x as much compute power from a GNa-56h (A30) compared to a GNa-14h (a quarter A30). This is very much dependent on your workload, as you correctly say. You may even create workloads where GNa-56h is not much faster than GNa-14h.

Nevertheless, I stand by the design idea that we should tell a user how many resources she gets. In one case it's 14 SMs (896 CUDA cores for Ampere) for a quarter A30, and in the other it's 56 SMs (3584 CUDA cores) for a full one.
For an H100, this could be 132 SMs (16896 CUDA cores, a full H100) vs. 18 SMs (2304 CUDA cores, 1/7 of an H100) or 2/7 or ...

It's like vCPUs in instances -- 2 vCPUs do not necessarily make your workload twice as fast, but you know roughly how much CPU power you get ...


garloff commented Oct 9, 2024

Another note:
Let's please call the Multi-Instance-GPU thing from nVidia partitioning and NOT virtualization.

@cah-patrickthiem (Author)

> (garloff's reply of Oct 9, 2024, quoted here in full)

That means you would still be in favor of the SM/CU etc. way, right?

@cah-patrickthiem (Author)

I am closing this PR because we finally came to a common conclusion; see PR #780, which has already been merged.
