Below is a broad overview of the languages and tools used in the benchmarks, and some possibly contentious opinions on how these languages compare.
Note that all benchmarks described here involve reading and writing NetCDF files, a format used for storing multi-dimensional scientific datasets. There are two major versions of this file format: version 3 and version 4. The performance differences between them were always minor, and since many general circulation models still write the older NetCDF3 format, this type is used for all tests.
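For a concrete illustration (not part of the benchmarks themselves), here is a minimal sketch of writing and then reading a NetCDF3 file with the Python netCDF4 package; the file name, dimensions, and variable name are hypothetical.

```python
# Minimal sketch: write and read a NetCDF3 ("classic") file with the
# netCDF4 Python package. File and variable names are hypothetical.
import numpy as np
from netCDF4 import Dataset

# Write: 'NETCDF3_CLASSIC' selects the older version 3 format.
with Dataset('example.nc', 'w', format='NETCDF3_CLASSIC') as nc:
    nc.createDimension('time', None)  # unlimited record dimension
    nc.createDimension('lat', 2)
    temp = nc.createVariable('temp', 'f4', ('time', 'lat'))
    temp[:2, :] = np.random.rand(2, 2)

# Read: the same package reads both version 3 and version 4 files.
with Dataset('example.nc') as nc:
    print(nc.file_format)       # -> NETCDF3_CLASSIC
    print(nc['temp'][:].shape)  # -> (2, 2)
```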
Fortran is the only low-level, high-performance language tested in these benchmarks. Fortran may seem like an anachronism to outsiders, but there are perfectly valid reasons scientists still prefer it. The most important of these is that the dominant parallelization tool in high-performance computing, MPI, provides official bindings only for C and Fortran. Of the two, Fortran is the more array-friendly, easier-to-learn language, and it is at least as fast as C++. While C++ is certainly the right tool for the type of object-oriented programming required for software engineering and professional applications, Fortran is a scientific computing workhorse, used for high-performance numerical algorithms and geophysical modeling.
The NetCDF operators (NCO) are a suite of command-line tools for working with NetCDF files. The command names are `ncks` (NetCDF kitchen sink), `ncbo` (NetCDF binary operator), `ncwa` (NetCDF weighted averager), `ncrcat` (NetCDF record concatenator), `ncecat` (NetCDF ensemble concatenator), `ncra` (NetCDF record averager), `nces` (NetCDF ensemble statistics), `ncremap` (NetCDF remapper), `ncflint` (NetCDF file interpolator), `ncclimo` (NetCDF climatology generator), and `ncap2` (NetCDF arithmetic processor). The documentation can be found here.
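For illustration, here is a minimal sketch of driving one of these operators from Python; the file names are hypothetical, and `ncra` is assumed to be installed and on the PATH.

```python
# Minimal sketch: call the NCO record averager (ncra) from Python.
# Assumes the NCO tools are on the PATH; file names are hypothetical.
import subprocess

# Average over the record (time) dimension across the input files.
subprocess.run(
    ['ncra', 'jan.nc', 'feb.nc', 'mar.nc', 'seasonal_mean.nc'],
    check=True,  # raise if ncra exits with an error
)
```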
Since the NetCDF operators are designed exclusively for manipulating NetCDF data, one might think they would always be the fastest tool for the job. But as it turns out, other tools are often much faster.
The climate data operators (CDO) are another series of command-line tools for manipulating NetCDF files. CDO is invoked with any of several hundred "subcommands" -- e.g. `cdo timmean file.nc out.nc`. "Operator chaining" is a notable improvement over NCO -- e.g. `cdo -timmean -zonmean file.nc out.nc`. The documentation can be found here.
The functionality of CDO overlaps somewhat with that of NCO, and CDO places restrictions on the dataset format: all variables must have two horizontal "spatial" dimensions, an optional height dimension, and an optional time dimension. This can be frustrating (but is probably necessary), and at first glance CDO may seem redundant. However, CDO can be much, much faster than NCO, is more flexible, and is generally easier and more intuitive to use.
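As a sketch of what operator chaining buys you, here is the chained command from above scripted from Python, alongside its unchained equivalent; the file names are hypothetical and `cdo` is assumed to be installed.

```python
# Minimal sketch: chained CDO call vs. two separate calls.
# Assumes CDO is on the PATH; file names are hypothetical.
import subprocess

# Chained: zonal mean, then time mean, with no intermediate file
# written to disk (operators apply right-to-left).
subprocess.run(
    ['cdo', '-timmean', '-zonmean', 'file.nc', 'out.nc'],
    check=True,
)

# Equivalent unchained version, which needs an intermediate file:
subprocess.run(['cdo', 'zonmean', 'file.nc', 'tmp.nc'], check=True)
subprocess.run(['cdo', 'timmean', 'tmp.nc', 'out.nc'], check=True)
```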
The NCAR Command Language (NCL) is a favorite among many atmospheric scientists, myself included. It is certainly not the fastest language -- in fact, it is usually the slowest after the native NetCDF operators -- but it is relatively easy to use, concise, and provides specialized tools for atmospheric scientists. Just as everything in MATLAB is an array and everything in Python is an "object", everything in NCL is a dataset with named dimensions. This is very handy for us geophysical scientists! The NCL documentation can be found here.
Unfortunately, with the recent end-of-life announcement, it may be necessary to move away from NCL over the coming years.
MATLAB (MATrix LABoratory) is a tried-and-tested, proprietary, high-level data science language -- the language of choice for engineers and scientists over the last few decades. But with the emergence of the free, open-source Python programming language as a scientific computing workhorse, scientists have been slowly making the switch. And with the massive amount of collaborative work put into scientific computing Python packages, Python has become (for the most part) a superset of MATLAB, and seems to have overtaken it in performance.
MATLAB has one major sticking point (well, it actually has a bunch, but this is the one that bothers me the most). Even when the Java Virtual Machine and GUI display are disabled (the `-nojvm -nodisplay` flags), MATLAB scripts run from the command line are delayed by several seconds of startup time! Thus, running a series of MATLAB commands on small files for small tasks quickly becomes impractical. To give MATLAB the best chance, the times shown in the benchmarks below omit this startup time.
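For reference, here is a minimal sketch of how such a startup delay might be measured from Python; it assumes `matlab` is on the PATH and simply times a run that exits immediately (the exact flags may vary between MATLAB versions).

```python
# Minimal sketch: measure MATLAB's command-line startup overhead by
# timing a run that does nothing but exit. Assumes 'matlab' is on
# the PATH; flag behavior may differ across MATLAB versions.
import subprocess
import time

start = time.perf_counter()
subprocess.run(
    ['matlab', '-nojvm', '-nodisplay', '-r', 'exit'],
    check=True,
)
elapsed = time.perf_counter() - start
print(f'MATLAB startup plus exit took {elapsed:.1f} s')
```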
Python is the high-level, expressive, object-oriented programming language that is quickly becoming the favorite of academics and data scientists everywhere. Almost all scientific computing Python tools are built on the array manipulation package "numpy". There are (at least) two well-known packages for reading NetCDF files: netCDF4 (which, confusingly, also works with version 3 NetCDF files) and xarray. The former is rather low-level and fast; the latter is high-level, powerful, and very flexible. xarray is also closely integrated with Dask, which supports extremely high-performance array computations with hidden parallelization and super fancy algorithms designed by some super smart people. Dask is truly a game-changer, and with the proper "chunking" it can yield code as fast as compiled, serially executed Fortran code.
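As a minimal sketch of the xarray-plus-Dask workflow (the file name, variable name, and chunk size here are hypothetical): passing a `chunks` argument makes downstream operations lazy and parallel.

```python
# Minimal sketch: lazy, parallel reduction with xarray + Dask.
# The file name, variable name, and chunk size are hypothetical.
import xarray as xr

# Passing `chunks` loads the data as Dask arrays rather than numpy
# arrays, so operations build a task graph instead of computing
# eagerly.
ds = xr.open_dataset('file.nc', chunks={'time': 100})

# The time mean is not computed yet; .compute() triggers the
# (possibly multi-threaded) evaluation.
mean = ds['temp'].mean(dim='time').compute()
print(mean)
```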
Julia is the new kid on the block, and tries to combine the best of both worlds from MATLAB (e.g. the everything-is-an-array syntax) and Python. The Julia workflow is quite different -- you cannot simply make repeated calls to some script on the command line, because the JIT compilation kicks in on every call and becomes a huge bottleneck. Instead you have two options:
- Run things from a persistent notebook or REPL, making repeated calls to some function so that the JIT compilation can speed things up -- for example, a simple Julia script that iterates over 1000 NetCDF files and performs the same operation on each.
- Compile to a machine executable with `PackageCompiler` and a "snoop" script, which lets you call a binary executable directly. This is presumably how numerical models written in Julia can be used. Since it is faster, this is the approach used for the benchmarks unless stated otherwise.
Julia is advertised for its raw speed in numerical computations. But for simple data analysis tasks, and especially when working with large arrays, Julia seems to perform no better than MATLAB or Python. With the `NetCDF.jl` package (which mimics MATLAB's NetCDF utilities), Julia is somewhat slower than MATLAB. With the `NCDatasets.jl` package (which mimics the Python xarray package), Julia is somewhat slower than Python. For me, there doesn't seem to be a compelling reason to switch from either of those tools to Julia yet. Perhaps there will be one day, as these packages are updated and I/O performance improves.