
Severe performance degradation in 4.9.3-rc2 #3067

Closed
xavierabellan opened this issue Dec 20, 2024 · 12 comments · Fixed by #3077

@xavierabellan

This issue is related to Unidata/netcdf4-python#1393 but narrowed down to the netcdf-c library in its 4.9.3-rc2 version.

We have noticed a very significant degradation in read performance when accessing a variable in a loop.

As a reproducer, the International Best Track Archive for Climate Stewardship (IBTrACS) data can be used:

wget https://www.ncei.noaa.gov/data/international-best-track-archive-for-climate-stewardship-ibtracs/v04r01/access/netcdf/IBTrACS.ALL.v04r01.nc

With the following minimal C code:

#include <netcdf.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define FILE_NAME "IBTrACS.ALL.v04r01.nc"
#define DIM_STORM "storm"
#define VAR_LAT "lat"

int main() {
    int ncid;                 // File ID
    int storm_dimid;          // ID for the "storm" dimension
    int lat_varid;            // ID for the "lat" variable
    size_t nstorms;           // Total number of storms
    size_t start[2], count[2]; // Start and count arrays for reading lat data
    float lat_value;          // Latitude value to read
    double total_time = 0.0;  // Total time of getting lat value
    int iterations = 0;       // Number of iterations

    // Open the NetCDF file
    nc_open(FILE_NAME, NC_NOWRITE, &ncid);

    // Get the "storm" dimension ID and size
    nc_inq_dimid(ncid, DIM_STORM, &storm_dimid);
    nc_inq_dimlen(ncid, storm_dimid, &nstorms);

    // Get the "lat" variable ID
    nc_inq_varid(ncid, VAR_LAT, &lat_varid);

    // Iterate over the last 10 storms
    for (size_t ns = nstorms - 10; ns < nstorms; ns++) {
        // Set start and count for reading a single value
        start[0] = ns;  // Index of the storm
        start[1] = 0;   // First time step
        count[0] = 1;   // Read one storm
        count[1] = 1;   // Read one time step

        clock_t cstart, cend;
        cstart = clock();

        // Read the latitude value
        nc_get_vara_float(ncid, lat_varid, start, count, &lat_value);

        cend = clock();
        double time_taken = ((double)(cend - cstart)) * 1000 / CLOCKS_PER_SEC; // Convert to ms
        total_time += time_taken;
        iterations++;

        // Print the latitude value
        printf("Storm %zu: Latitude is %f, Time taken = %.3f ms\n", ns, lat_value, time_taken);

    }

    double average_time = total_time / iterations;
    printf("\nAverage runtime per iteration: %.3f ms\n", average_time);

    // Close the NetCDF file
    nc_close(ncid);

    return 0;
}
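
For reference, the reproducer can be built with something along the lines of the following (assuming the netcdf-c nc-config helper is on the PATH and the source is saved as reproducer.c):

cc reproducer.c -o reproducer $(nc-config --cflags --libs)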

The program iterates over the last 10 storms in the input file and reads the corresponding lat value. With version 4.9.2 and below, only the first read takes a significant amount of time, while the rest take virtually nothing:

Storm 13801: Latitude is -12.000000, Time taken = 80.000 ms
Storm 13802: Latitude is -13.400000, Time taken = 0.000 ms
Storm 13803: Latitude is -12.700001, Time taken = 0.000 ms
Storm 13804: Latitude is -13.100001, Time taken = 0.000 ms
Storm 13805: Latitude is -13.800000, Time taken = 0.000 ms
Storm 13806: Latitude is -8.600000, Time taken = 0.000 ms
Storm 13807: Latitude is -9.100000, Time taken = 0.000 ms
Storm 13808: Latitude is -2.300000, Time taken = 0.000 ms
Storm 13809: Latitude is 11.199999, Time taken = 0.000 ms
Storm 13810: Latitude is 18.800001, Time taken = 0.000 ms

Average runtime per iteration: 8.000 ms

Whereas with 4.9.3-rc2, every iteration takes a similar amount of time:

Storm 13801: Latitude is -12.000000, Time taken = 80.000 ms
Storm 13802: Latitude is -13.400000, Time taken = 90.000 ms
Storm 13803: Latitude is -12.700001, Time taken = 50.000 ms
Storm 13804: Latitude is -13.100001, Time taken = 60.000 ms
Storm 13805: Latitude is -13.800000, Time taken = 40.000 ms
Storm 13806: Latitude is -8.600000, Time taken = 50.000 ms
Storm 13807: Latitude is -9.100000, Time taken = 50.000 ms
Storm 13808: Latitude is -2.300000, Time taken = 50.000 ms
Storm 13809: Latitude is 11.199999, Time taken = 50.000 ms
Storm 13810: Latitude is 18.800001, Time taken = 50.000 ms

Average runtime per iteration: 57.000 ms

For certain programs, this may result in a very noticeable slowdown.

This has been reproduced on Linux and macOS.

@WardF
Member

WardF commented Dec 21, 2024

Thank you for reporting this, and thank you for providing the test program. I'll take a look at this and see if I can sort it out. Out of curiosity, do you know which version of libhdf5 is being linked against?

@xavierabellan
Author

The results pasted above were using HDF5 1.14.5, but I could also observe the same behaviour with 1.12.x.

@edwardhartnett
Contributor

OK, this is very significant and I need to take a look at this. @WardF I don't think you should release 4.9.3 until we get to the bottom of this problem.

The access pattern in use here is very common indeed, so this is not an edge case. We can't just slow down everyone's supercomputer programs for more I/O! There will be a crowd of angry scientists outside Unidata offices with pitchforks and torches...

@edwardhartnett
Contributor

Also it would be great if we could work out some basic performance testing in CI. It's a difficult problem, but surely we can try to catch such an obvious slowdown somehow...

@DennisHeimbigner
Collaborator

My first thought was that the implementation of get_vara/get_vars had changed. So I took the get_vars code from 4.9.2 and inserted it into the main branch. It still shows the same performance problem. So with high probability, it is not get_vars (or at least not get_vars alone). BTW, I was using HDF5 1.12.3.

@WardF
Member

WardF commented Jan 13, 2025

Using git bisect, initial indicators are that the issue was introduced in f1a3a64. The commit description is as follows:

    Cleanup the handling of cache parameters.
    
    re: https://github.com/Unidata/netcdf-c/issues/2733
    
    When addressing the above issue, I noticed that there was a disconnect
    in NCZarr between nc_set_chunk_cache and nc_set_var_chunk_cache.
    Specifically, setting nc_set_chunk_cache had no impact on the per-variable cache parameters when nc_set_var_chunk_cache was not used.
    
    So, modified the NCZarr code so that the per-variable cache parameters are set in this order (#1 is first choice):
    1. The values set by nc_set_var_chunk_cache
    2. The values set by nc_set_chunk_cache
    3. The defaults set by configure.ac

This seems like it could be related (or at least is not obviously unrelated); I will dig into it.
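
For readers following along, the two calls named in the commit description are the global and the per-variable chunk-cache controls in netcdf-c. A minimal sketch of the precedence described above, with illustrative (not default) size, nelems, and preemption values:

/* Choice 2: global default; applies to files opened after this call,
   for variables that get no explicit per-variable setting. */
nc_set_chunk_cache(64 * 1024 * 1024, 4096, 0.75f);

/* Choice 1: per-variable override; takes precedence over the global default. */
nc_set_var_chunk_cache(ncid, lat_varid, 64 * 1024 * 1024, 4096, 0.75f);

/* Choice 3: if neither call is made, the configure-time defaults apply. */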

@WardF
Member

WardF commented Jan 13, 2025

The issue appears to be related to the following change in logic in libhdf5/nc4hdf.c (around line 1147) and to the chunk-cache-related default values:

Prior to the first bad commit, chunkcache.size was checked against (and capped at, if exceeding) MAX_DEFAULT_CACHE_SIZE, which was defined as 67108864 (64 MiB). In the first bad commit, this was changed to check against (and cap at) DEFAULT_CHUNK_CACHE_SIZE, which is 16777216 (16 MiB) by default, with MAX_DEFAULT_CACHE_SIZE having been removed altogether. This quartering of the default cache size is why we are seeing the roughly tenfold (on my system, at least) increase in access time.

The naive fix on our end is to quadruple the DEFAULT_CHUNK_CACHE_SIZE, but I want to wrap my head around why we removed MAX_DEFAULT_CACHE_SIZE in the first place. @DennisHeimbigner do you have any recollection? I'll work through the code change on my end.
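
Paraphrasing the clamp described above (not the literal source), the change in libhdf5/nc4hdf.c amounts to something like:

/* 4.9.2 and earlier (paraphrased): cap the cache at the old 64 MiB ceiling. */
if (chunkcache.size > MAX_DEFAULT_CACHE_SIZE)      /* 67108864 */
    chunkcache.size = MAX_DEFAULT_CACHE_SIZE;

/* 4.9.3-rc2 (paraphrased): cap at the 16 MiB default instead. */
if (chunkcache.size > DEFAULT_CHUNK_CACHE_SIZE)    /* 16777216 */
    chunkcache.size = DEFAULT_CHUNK_CACHE_SIZE;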

@WardF
Member

WardF commented Jan 13, 2025

At first read-through, it appears that prior to this change, MAX_DEFAULT_CACHE_SIZE was used as the per-variable chunk cache size, and CHUNK_CACHE_SIZE was stored as part of the global state but wasn't referenced(?). I need to double-check that last part, since I'm only looking at the diff logs right now. But it appears that perhaps in cleaning up this discrepancy, we inadvertently quartered the size of the per-variable chunk cache, which would explain what we're observing.

@DennisHeimbigner
Collaborator

If memory serves, this change was in response to a user-posted issue about cache size.
Possibilities:

@WardF
Member

WardF commented Jan 13, 2025

Thanks for reminding me, @DennisHeimbigner. That makes sense. The question now is how to proceed, and I think the answer may be as much about education as it is about setting default cache size values. As @edwardhartnett points out, this is a pretty common use case (as reported in this issue), so I am inclined to lean towards default values that ameliorate the observed jump in access time. This might result in issues being opened about increased memory usage, but ultimately these are parameters that the end user has control over, provided we remember to point users to them.
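
As a stop-gap for the reproducer above, an application can restore the earlier behaviour by raising the cache for the affected variable explicitly, for example back to the old 64 MiB ceiling (the nelems and preemption values below are illustrative, not library defaults):

/* After nc_inq_varid() and before the read loop: */
size_t cache_size       = 64 * 1024 * 1024; /* matches the old MAX_DEFAULT_CACHE_SIZE */
size_t cache_nelems     = 4096;             /* illustrative */
float  cache_preemption = 0.75f;            /* illustrative */
nc_set_var_chunk_cache(ncid, lat_varid, cache_size, cache_nelems, cache_preemption);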

@DennisHeimbigner
Collaborator

I found the original pull request: #2734
It refers to issue #2733.

@edwardhartnett
Contributor

I commented in #2734
