Below is a broad overview of the languages and tools used in the benchmarks, and some possibly contentious opinions on how these languages compare.
Note that all benchmarks described here involve reading and writing NetCDF files, a format used for storing multi-dimensional scientific datasets. There are two major versions of this file format: version 3 and version 4. The performance differences between them were always minor, and since many general circulation models still write the older NetCDF3 format, this type is used for all tests.
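For a concrete illustration (not part of the benchmarks themselves), here is a minimal sketch of writing and then reading a NetCDF3 file with the Python netCDF4 package; the file name, dimensions, and variable name are hypothetical.

```python
# Minimal sketch: write and read a NetCDF3 ("classic") file with the
# netCDF4 Python package. File and variable names are hypothetical.
import numpy as np
from netCDF4 import Dataset

# Write: 'NETCDF3_CLASSIC' selects the older version 3 format.
with Dataset('example.nc', 'w', format='NETCDF3_CLASSIC') as nc:
    nc.createDimension('time', None)  # unlimited record dimension
    nc.createDimension('lat', 2)
    temp = nc.createVariable('temp', 'f4', ('time', 'lat'))
    temp[:2, :] = np.random.rand(2, 2)

# Read: the same package reads both version 3 and version 4 files.
with Dataset('example.nc') as nc:
    print(nc.file_format)       # -> NETCDF3_CLASSIC
    print(nc['temp'][:].shape)  # -> (2, 2)
```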
Fortran is the only low-level, high-performance language tested in these benchmarks. Fortran may seem like an anachronism to outsiders, but there are perfectly valid reasons scientists still prefer it. The most important of these is that the dominant parallelization tool in high-performance computing, MPI, provides official bindings only for C and Fortran. Of the two, Fortran is the more array-friendly, easier-to-learn language, and it is at least as fast as C++. While C++ is certainly the right tool for the type of object-oriented programming required for software engineering and professional applications, Fortran is a scientific computing workhorse, used for high-performance numerical algorithms and geophysical modeling.
The NetCDF operators (NCO) are a suite of command-line tools for working with NetCDF files. The command names are `ncks` (NetCDF kitchen sink), `ncbo` (NetCDF binary operator), `ncwa` (NetCDF weighted averager), `ncrcat` (NetCDF record concatenator), `ncecat` (NetCDF ensemble concatenator), `ncra` (NetCDF record averager), `nces` (NetCDF ensemble statistics), `ncremap` (NetCDF remapper), `ncflint` (NetCDF file interpolator), `ncclimo` (NetCDF climatology generator), and `ncap2` (NetCDF arithmetic processor). The documentation can be found here.
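For illustration, here is a minimal sketch of driving one of these operators from Python; the file names are hypothetical, and `ncra` is assumed to be installed and on the PATH.

```python
# Minimal sketch: call the NCO record averager (ncra) from Python.
# Assumes the NCO tools are on the PATH; file names are hypothetical.
import subprocess

# Average over the record (time) dimension across the input files.
subprocess.run(
    ['ncra', 'jan.nc', 'feb.nc', 'mar.nc', 'seasonal_mean.nc'],
    check=True,  # raise if ncra exits with an error
)
```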
Since the NetCDF operators are designed exclusively for manipulating NetCDF data, one might think they would always be the fastest tool for the job. But as it turns out, other tools are often much faster.
The climate data operators (CDO) are another series of command-line tools for manipulating NetCDF files. CDO is invoked with any of several hundred "subcommands" -- e.g. `cdo timmean file.nc out.nc`. "Operator chaining" is a notable improvement over NCO -- e.g. `cdo -timmean -zonmean file.nc out.nc`. The documentation can be found here.
The functionality of CDO overlaps somewhat with that of NCO, and CDO places restrictions on the dataset format: all variables must have two horizontal "spatial" dimensions, an optional height dimension, and an optional time dimension. This can be frustrating (but is probably necessary), and at first glance CDO may seem redundant. However, CDO can be much, much faster than NCO, is more flexible, and is generally easier and more intuitive to use.
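As a sketch of what operator chaining buys you, here is the chained command from above scripted from Python, alongside its unchained equivalent; the file names are hypothetical and `cdo` is assumed to be installed.

```python
# Minimal sketch: chained CDO call vs. two separate calls.
# Assumes CDO is on the PATH; file names are hypothetical.
import subprocess

# Chained: zonal mean, then time mean, with no intermediate file
# written to disk (operators apply right-to-left).
subprocess.run(
    ['cdo', '-timmean', '-zonmean', 'file.nc', 'out.nc'],
    check=True,
)

# Equivalent unchained version, which needs an intermediate file:
subprocess.run(['cdo', 'zonmean', 'file.nc', 'tmp.nc'], check=True)
subprocess.run(['cdo', 'timmean', 'tmp.nc', 'out.nc'], check=True)
```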
The NCAR Command Language (NCL) is a favorite among many atmospheric scientists, myself included. It is certainly not the fastest language -- in fact, it is usually the slowest after the native NetCDF operators -- but it is relatively easy to use, concise, and provides specialized tools for atmospheric scientists. Just as everything in MATLAB is an array and everything in Python is an "object", everything in NCL is a dataset with named dimensions. This is very handy for us geophysical scientists! The NCL documentation can be found here.
Unfortunately, with the recent end-of-life announcement, it may be necessary to move away from NCL over the coming years.
MATLAB (MATrix LABoratory) is a tried-and-tested, proprietary, high-level data science language -- the language of choice for engineers and scientists over the last few decades. But with the emergence of the free, open-source Python programming language as a scientific computing workhorse, scientists have been slowly making the switch. And with the massive amount of collaborative work put into scientific computing Python packages, Python has become (for the most part) a superset of MATLAB, and seems to have overtaken it in performance.
MATLAB has one major sticking point (well, it actually has a bunch, but this is the one that bothers me the most). Even when the Java Virtual Machine and GUI display are disabled (the `-nojvm -nodisplay` flags), MATLAB scripts run from the command line are delayed by several seconds of startup time! Thus, running a series of MATLAB commands on small files for small tasks quickly becomes impractical. To give MATLAB the best chance, the times shown in the benchmarks below omit this startup time.
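For reference, here is a minimal sketch of how such a startup delay might be measured from Python; it assumes `matlab` is on the PATH and simply times a run that exits immediately (the exact flags may vary between MATLAB versions).

```python
# Minimal sketch: measure MATLAB's command-line startup overhead by
# timing a run that does nothing but exit. Assumes 'matlab' is on
# the PATH; flag behavior may differ across MATLAB versions.
import subprocess
import time

start = time.perf_counter()
subprocess.run(
    ['matlab', '-nojvm', '-nodisplay', '-r', 'exit'],
    check=True,
)
elapsed = time.perf_counter() - start
print(f'MATLAB startup plus exit took {elapsed:.1f} s')
```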
Python is the high-level, expressive, object-oriented programming language that is quickly becoming the favorite of academics and data scientists everywhere. Almost all scientific computing Python tools are built on the array manipulation package "numpy". There are (at least) two well-known packages for reading NetCDF files: netCDF4 (which, confusingly, also works with version 3 NetCDF files) and xarray. The former is rather low-level and fast; the latter is high-level, powerful, and very flexible. xarray is also closely integrated with Dask, which supports extremely high-performance array computations with hidden parallelization and super fancy algorithms designed by some super smart people. Dask is truly a game-changer, and with the proper "chunking" it can yield code as fast as compiled, serially executed Fortran code.
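As a minimal sketch of the xarray-plus-Dask workflow (the file name, variable name, and chunk size here are hypothetical): passing a `chunks` argument makes downstream operations lazy and parallel.

```python
# Minimal sketch: lazy, parallel reduction with xarray + Dask.
# The file name, variable name, and chunk size are hypothetical.
import xarray as xr

# Passing `chunks` loads the data as Dask arrays rather than numpy
# arrays, so operations build a task graph instead of computing
# eagerly.
ds = xr.open_dataset('file.nc', chunks={'time': 100})

# The time mean is not computed yet; .compute() triggers the
# (possibly multi-threaded) evaluation.
mean = ds['temp'].mean(dim='time').compute()
print(mean)
```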
Julia is the new kid on the block, and tries to combine the best of both worlds from MATLAB (e.g. the everything-is-an-array syntax) and Python. The Julia workflow is quite different -- you cannot simply make repeated calls to some script on the command line, because the JIT compilation kicks in on every call and becomes a huge bottleneck. Instead you have two options:
- Run things from a persistent notebook or REPL, making repeated calls to some function so that the JIT compilation can speed things up -- for example, a simple Julia script that iterates over 1000 NetCDF files and performs the same operation on each.
- Compile to a machine executable with `PackageCompiler` and a "snoop" script, which lets you call a binary executable directly. This is presumably how numerical models written in Julia can be used. Since it is faster, this is the approach used for the benchmarks unless stated otherwise.
Julia is advertised for its raw speed in numerical computations. But for simple data analysis tasks, and especially when working with large arrays, Julia seems to perform no better than MATLAB or Python. With the `NetCDF.jl` package (which mimics MATLAB's NetCDF utilities), Julia is somewhat slower than MATLAB. With the `NCDatasets.jl` package (which mimics the Python xarray package), Julia is somewhat slower than Python. For me, there doesn't seem to be a compelling reason to switch from either of those tools to Julia yet. Perhaps there will be one day, as these packages are updated and I/O performance improves.