PAPI Flops
- Counting Floating Point Operations on Intel Sandy Bridge and Ivy Bridge
- Counting Floating Point Operations on Intel Haswell
Intel's Sandy Bridge and Ivy Bridge CPU architectures provide a rich computing environment and a comprehensive performance monitoring unit with which to measure performance. These processors support 11 hardware performance counters per core: 3 fixed counters for core cycles, reference cycles, and core instructions executed, in addition to 8 programmable counters with minimal restrictions. That's the good news. The bad news starts to show up when you actually use these counters in real situations. Most environments run with hyperthreading enabled, which allows each core to run two simultaneous interleaved threads, hopefully keeping the functional units filled to higher capacity. Those 8 programmable counters suddenly turn into 4, since each thread must maintain its own hardware counters. Further, most environments also run with a non-maskable interrupt (NMI) timer active. This can be implemented in a variety of ways, but cannot be guaranteed NOT to use one of the four remaining counters. That leaves 3 per thread. So PAPI is only guaranteed 3 programmable counters at any given time, in addition to the 3 fixed counters mentioned earlier. The corollary is that any single PAPI derived event can consist of at most 3 programmable terms if it is to be counted reliably. This is generally enough for most, but not all, situations.
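As a quick sanity check on any given system, PAPI's high-level API can report how many counters it actually detects. A minimal sketch (exactly what the returned number includes depends on the PAPI version and component; treat it as indicative):

```c
#include <stdio.h>
#include <papi.h>

int main(void)
{
    /* PAPI_num_counters() initializes the library if necessary and
       returns the number of hardware counters PAPI can use. */
    int num = PAPI_num_counters();
    if (num < 0) {
        fprintf(stderr, "PAPI initialization failed\n");
        return 1;
    }

    /* Expect a smaller number with hyperthreading enabled, and one
       fewer programmable counter if an NMI watchdog has claimed one. */
    printf("Hardware counters available to PAPI: %d\n", num);
    return 0;
}
```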
Sandy Bridge and Ivy Bridge introduce a set of more powerful AVX assembly instructions. These are vector floating point instructions that operate on up to 256 bits of information at a time: 4 simultaneous double precision operations, or 8 parallel single precision operations. You can't guarantee all 256 bits are always in use, so counting floating point operations can be a bit tricky. Because of this and the need for backwards compatibility, these chips continue to support earlier floating point instructions and hardware as well, including 128 bit SSE instructions, MMX instructions, and even the venerable x87 instructions, in both single and double precision versions. That makes 8 different flavors of floating point, and raises the potential need for as many as 8 events to count them all.
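To make the width arithmetic concrete, the illustrative snippet below uses compiler intrinsics (compile with AVX enabled, e.g. -mavx): one AVX instruction carries four double precision operations, one SSE instruction carries two, and a scalar instruction carries one.

```c
#include <immintrin.h>

/* One 256-bit AVX add: 4 double precision operations per instruction. */
__m256d avx_add(__m256d a, __m256d b) { return _mm256_add_pd(a, b); }

/* One 128-bit SSE add: 2 double precision operations per instruction. */
__m128d sse_add(__m128d a, __m128d b) { return _mm_add_pd(a, b); }

/* A scalar add: 1 operation. On 64-bit builds compilers typically emit
   scalar SSE here; x87 code shows up mainly in legacy 32-bit builds or
   long double arithmetic. */
double scalar_add(double a, double b) { return a + b; }
```

The single precision variants (_mm256_add_ps, _mm_add_ps) double each of these widths.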
For the last several generations, one of the performance events provided by Intel to count floating point instructions has been called FP_COMP_OPS_EXE. This event name is generally associated with one or more umasks, or attributes, that further define what kinds of floating point instructions are being counted. For Sandy Bridge, the available attributes include the following:
Attribute | Description |
---|---|
X87 | Number of X87 uops executed |
SSE_FP_PACKED_DOUBLE | Number of SSE double precision FP packed uops executed |
SSE_FP_SCALAR_SINGLE | Number of SSE single precision FP scalar uops executed |
SSE_PACKED_SINGLE | Number of SSE single precision FP packed uops executed |
SSE_SCALAR_DOUBLE | Number of SSE double precision FP scalar uops executed |
Although in theory it should be possible to combine all five of these attributes in a single event to count all variations of x87 and SSE floating point instructions, in practice these attributes are found to interact with each other in non-linear ways and must be empirically tested before they can be combined in a single counter. Further, the PACKED versions of these instructions represent more than one floating point operation each, and so can't simply be added to produce a meaningful result.
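In PAPI these native events are named with colon-separated attributes, one umask per counter. A sketch of a single-umask measurement (assuming a PAPI 5.x release, where PAPI_add_named_event is available; error handling abbreviated):

```c
#include <stdio.h>
#include <papi.h>

int main(void)
{
    int evset = PAPI_NULL;
    long long count;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&evset);

    /* One umask per counter; as noted above, combining several umasks
       in a single counter yields cycle-like rather than flop-like counts. */
    if (PAPI_add_named_event(evset, "FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE") != PAPI_OK) {
        fprintf(stderr, "event not available on this CPU\n");
        return 1;
    }

    PAPI_start(evset);
    /* ... floating point kernel under test ... */
    PAPI_stop(evset, &count);
    printf("SSE scalar double uops executed: %lld\n", count);
    return 0;
}
```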
Intel engineers have verified that variations of this event count speculatively, leading to variable amounts of overcounting, depending on the algorithm. Further, as is discussed later in this article, speculative retries during resource stalls are also counted. Knowing this, it may be possible to use the excess counts as a way to monitor resource inefficiency.
To make matters more confusing, it appears that combining multiple attributes in a single counter produces a result that resembles total cycles more than combined floating point operations.
You may have noticed that the event attributes shown above don't reference AVX instructions. That requires a separate event in another counter. The name of this event is SIMD_FP_256, and it supports two attributes: PACKED_SINGLE and PACKED_DOUBLE. As in the case of FP_COMP_OPS_EXE, these two attributes cannot be combined in practice without silently producing anomalous results.
In contrast to FP_COMP_OPS_EXE, SIMD_FP_256 counts instructions retired rather than speculative instructions executed. That's a good thing, but overcounts are still observed, because this event also counts AVX operations that are not floating point, such as register loads and stores, and various logical operations. Since such data movement operations will generally be proportional to actual work for a given algorithm, these counts, while theoretically inaccurate, should still prove useful as a measure of relative code performance.
The above discussion also does not mention MMX. There are no events available on Sandy Bridge that reference MMX. One can assume that MMX operations are being processed as SSE instructions and counted as such.
Neither the FP_COMP_OPS_EXE nor the SIMD_FP_256 events were originally documented on Ivy Bridge. Although rumor held that the events still existed, they were not exposed through the documentation. Due to user demand (thank you), as of late 2013 Intel has exposed these events in its documentation. We support them beginning with PAPI version 5.3, released December 2013. All experimentation for this white paper was done on Sandy Bridge; we expect similar results to hold for Ivy Bridge as well.
In order to develop a feel for counting floating point events on the Sandy and Ivy Bridge architectures, we present a series of tables below that collect a number of different events from several different computational kernels, including a multiply-add, a simple matrix multiply, and optimized GEMMs for both single and double precision. We also show results from several events with multiple attributes. Results with an error of < 5% are shown in green; errors between 5% and 15% are in orange; errors > 15% are in red. Results that look suspiciously similar to PAPI_TOT_CYC are shown in blue. All these results were collected on Sandy Bridge; similar results should be expected on Ivy Bridge.
The table above illustrates unoptimized arithmetic operations. There is apparently no use of packed SSE instructions, and no evidence of x87 or AVX instructions. All the operations counted here are scalar. The double precision counts are within 15% of the theoretically expected value, while one single precision count deviates by almost 35% and the other is high by about 3.5%. All attempts at combining more than one unit mask, or attribute, resulted in counts that look surprisingly similar to cycle counts. This was also true for unreported attribute combinations, suggesting that attribute bits cannot be combined.
This table shows a pattern similar to the one in the table above. Packed single and double precision counts show up in the right places and quantities for both the SSE optimized and AVX optimized GEMMs. There are a small number of scalar and packed SSE operations that show up in the SGEMM case, possibly a result of incomplete AVX packing. There are also a very small number of x87 instructions that are counted in each case. Since these are negligible, they are ignored. As in the previous table, events with multiple attributes produce counts that are surprisingly similar to the equivalent cycle count.
From the observations in the previous two tables, it becomes clear that no single definition can encompass all variations of floating point operations on Sandy and Ivy Bridge. The table below defines PAPI preset events that cover a range of cases with reasonable predictability while remaining within the constraint of using three counters or fewer. PAPI_FP_INS and _OPS are defined identically to include scalar operations only. This is a significant deviation from traditional definitions of these events, because all packed instructions are ignored. PAPI_SP_OPS and _DP_OPS count single and double precision events respectively. Each consists of three terms covering scalar SSE, packed SSE, and packed AVX, with the packed terms scaled appropriately to represent operations rather than instructions. PAPI_VEC_SP and _DP count vector operations in single and double precision, again using appropriately scaled SSE and AVX instruction counts.
PRESET Event | Definition |
---|---|
PAPI_FP_INS | FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE + FP_COMP_OPS_EXE:SSE_FP_SCALAR_SINGLE |
PAPI_FP_OPS | same as above |
PAPI_SP_OPS | FP_COMP_OPS_EXE:SSE_FP_SCALAR_SINGLE + 4*(FP_COMP_OPS_EXE:SSE_PACKED_SINGLE) + 8*(SIMD_FP_256:PACKED_SINGLE) |
PAPI_DP_OPS | FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE + 2*(FP_COMP_OPS_EXE:SSE_FP_PACKED_DOUBLE) + 4*(SIMD_FP_256:PACKED_DOUBLE) |
PAPI_VEC_SP | 4*(FP_COMP_OPS_EXE:SSE_PACKED_SINGLE) + 8*(SIMD_FP_256:PACKED_SINGLE) |
PAPI_VEC_DP | 2*(FP_COMP_OPS_EXE:SSE_FP_PACKED_DOUBLE) + 4*(SIMD_FP_256:PACKED_DOUBLE) |
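As an example of how one of these presets might be exercised, the sketch below compares a measured PAPI_DP_OPS count against the theoretical count for a simple multiply-add loop. Per the discussion that follows, expect the measured value to come in somewhat high:

```c
#include <stdio.h>
#include <papi.h>

#define N 1000000

static double a[N], b[N], c[N];

int main(void)
{
    int evset = PAPI_NULL;
    long long dp_ops;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&evset);
    /* PAPI_DP_OPS expands to the three scaled terms defined above. */
    PAPI_add_event(evset, PAPI_DP_OPS);

    PAPI_start(evset);
    for (int i = 0; i < N; i++)
        c[i] = c[i] + a[i] * b[i];      /* one add + one multiply */
    PAPI_stop(evset, &dp_ops);

    /* The theoretical count is 2*N operations regardless of how the
       compiler vectorizes the loop, since the preset scales packed
       counts back to individual operations. */
    printf("measured %lld vs theoretical %lld DP ops (c[0]=%g)\n",
           dp_ops, 2LL * N, c[0]);
    return 0;
}
```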
The table below shows measurements taken on a Sandy Bridge processor; similar results should be expected on Ivy Bridge. In all cases where values are reported, the numbers deviate from theoretical in the positive direction by varying magnitudes. The majority of counts are high by < 15%, which could be attributable to speculative execution. The deviations between measured FP_INS and FP_OPS offer an indication of run-to-run variability, ranging from 0.2% to 8 or 9%. Highly optimized operations such as the GEMMs actually show the best accuracy for both SSE and AVX versions, with deviations from theoretical on the order of 1 to 2%.
John McCalpin at TACC has observed that, in general, Intel performance counters increment at instruction issue unless the event name specifies "retired". This can lead to overcounting if an instruction is reissued, for example while waiting for a cache miss to be satisfied. Further experiments appear to verify this hypothesis, with overcount rates directly correlated to cache miss rates: for the STREAM benchmark, measured counts ranged anywhere from 2.8 to 6.5 times the theoretical flop count, depending on the operation measured. Specifically in the case of AVX floating point instructions, it appears that overcounts can be explained by this instruction re-issue phenomenon. John's tests with the STREAM benchmark also suggest a strong correlation between overcounting and average cache latency. This offers an explanation for the relatively small error in the AVX DGEMM and SGEMM results, since these algorithms have been optimized to minimize cache misses, and thus retries. Once again the user caveat is that while flop counts and rates on Sandy and Ivy Bridge may be valuable as a relative proxy for code and cache efficiency, they should not be assumed to be an absolute measure of the amount of work done.
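One rough way to probe this correlation is to run the same kernel twice, once counting floating point operations and once counting cache misses (the two may not fit together within the three guaranteed counters), and watch whether the overcount ratio tracks the misses as the working set grows. A sketch only; the choice of PAPI_L3_TCM as the miss event is illustrative:

```c
#include <stdio.h>
#include <papi.h>

#define N 1000000

static double a[N], b[N], c[N];

/* STREAM-style triad: 2*N double precision operations per pass. */
static void triad(void)
{
    for (int i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];
}

/* Count one PAPI event around a single run of the kernel. */
static long long measure(int event)
{
    int evset = PAPI_NULL;
    long long count = 0;

    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, event);
    PAPI_start(evset);
    triad();
    PAPI_stop(evset, &count);
    PAPI_cleanup_eventset(evset);
    PAPI_destroy_eventset(&evset);
    return count;
}

int main(void)
{
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;

    long long flops  = measure(PAPI_DP_OPS);
    long long misses = measure(PAPI_L3_TCM);

    /* If the re-issue hypothesis holds, the overcount ratio should rise
       and fall with the miss count as N (and the working set) changes. */
    printf("overcount ratio: %.2f, L3 misses: %lld (a[0]=%g)\n",
           (double)flops / (2.0 * N), misses, a[0]);
    return 0;
}
```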
Sandy Bridge and Ivy Bridge are powerful processors in the Intel lineage, and both offer a wealth of opportunities for performance measurement. However, measuring the traditional standby floating point metric must be done with care. Be forewarned that although accurate measurements can be made, particularly for highly optimized code, no single PAPI metric is likely to capture all floating point operations. Remember the error bars. Some measurements will be less accurate than others, and the errors will almost always be positive (overcounting) due to speculative execution. Since speculation is likely to be proportional to the amount of floating point work done, even these inaccurate measurements should provide insight when used to compare runs of the same code.
If these numbers inspire or challenge you to make more detailed observations with this hardware, please share your conclusions with us. We'd be happy to add further insight into the above report.
Counting Floating Point Operations on Intel Haswell
As pointed out by John McCalpin at TACC, the floating point counters have been disabled in the Intel Haswell CPU architecture. For Sandy Bridge and Ivy Bridge products, these counters were mainly useful for determining what kinds of floating-point instructions were being executed (scalar, 128-bit vector, 256-bit vector) and in what precision (single, double) by different jobs on the system. We are waiting on Intel to provide accurate floating-point counters, preferably counted at retirement, to eliminate the over-counting problem that makes the counters less useful (quantitatively) on Sandy Bridge and Ivy Bridge.