Merge pull request #189 from NREL/gb/s3docs
Gb/s3docs
grantbuster authored Dec 26, 2024
2 parents 5b3eca0 + 17820a2 commit c05f5d9
Showing 15 changed files with 479 additions and 156 deletions.
38 changes: 38 additions & 0 deletions .github/workflows/s3_tests.yml
@@ -0,0 +1,38 @@
name: s3 fsspec tests

on: pull_request

jobs:
  build:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
        python-version: ["3.10"]
        include:
          - os: ubuntu-latest
            python-version: 3.9
          - os: ubuntu-latest
            python-version: 3.8

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        shell: bash
        run: |
          python -m pip install --upgrade pip
          pip install pytest
          pip install pytest-cov
          pip install -e .[s3]

      - name: Run tests
        shell: bash
        run: |
          pytest -v tests/s3_tests.py
76 changes: 36 additions & 40 deletions README.rst
@@ -33,34 +33,25 @@ Welcome to The REsource eXtraction (rex) tool!

.. inclusion-intro
rex command line tools
======================

- `rex <https://nrel.github.io/rex/_cli/rex.html#rex>`_
- `NSRDBX <https://nrel.github.io/rex/_cli/NSRDBX.html#NSRDBX>`_
- `WINDX <https://nrel.github.io/rex/_cli/WINDX.html#WINDX>`_
- `US-wave <https://nrel.github.io/rex/_cli/US-wave.html#US-wave>`_
- `WaveX <https://nrel.github.io/rex/_cli/WaveX.html#Wavex>`_
- `MultiYearX <https://nrel.github.io/rex/_cli/MultiYearX.html#MultiYearX>`_
- `rechunk <https://nrel.github.io/rex/_cli/rechunk.html#rechunk>`_
- `temporal-stats <https://nrel.github.io/rex/_cli/temporal-stats.html#temporal-stats>`_
- `wind-rose <https://nrel.github.io/rex/_cli/wind-rose.html#wind-rose>`_

Using Eagle Env
===============

If you would like to run `rex` on Eagle (NREL's HPC) you can use a pre-compiled
conda env:

.. code-block:: bash
conda activate /shared-projects/rev/modulefiles/conda/envs/rev/
or

.. code-block:: bash
source activate /shared-projects/rev/modulefiles/conda/envs/rev/
What is rex?
=============
``rex`` stands for the **REsource eXtraction** tool.

``rex`` enables efficient and scalable extraction, manipulation, and
computation with NREL's flagship renewable resource datasets, such as the Wind
Integration National Dataset (WIND Toolkit), the National Solar Radiation
Database (NSRDB), the Ocean Surface Wave Hindcast (US Wave) data, and the
high-resolution downscaled climate change data (Sup3rCC).

To get started accessing NREL's datasets, see the primer on `NREL Renewable
Energy Resource Data
<https://nrel.github.io/rex/misc/examples.nrel_data.html>`_ or the
`installation instructions <https://nrel.github.io/rex/#installing-rex>`_.

You might also want to check out the basic `Resource Class
<https://nrel.github.io/rex/_autosummary/rex.resource.Resource.html>`_ that
can be used to efficiently query NREL data, or our various `example use cases
<https://nrel.github.io/rex/misc/examples.html>`_.

Installing rex
==============
@@ -78,12 +69,13 @@ Option 1: Install from PIP or Conda (recommended for analysts):
2. Activate the new environment:
``conda activate rex``

3. Install rex:
1) ``pip install NREL-rex`` or
2) ``conda install nrel-rex --channel=nrel``
3. Basic ``rex`` install:
1) ``pip install NREL-rex``
2) or ``conda install nrel-rex --channel=nrel``

- NOTE: If you install using conda and want to use `HSDS <https://github.com/NREL/hsds-examples>`_,
you will also need to install h5pyd manually: ``pip install h5pyd``
4. Install for users outside of NREL who want to access data via HSDS or S3 as per the instructions `here <https://nrel.github.io/rex/misc/examples.nrel_data.html#data-location-external-users>`_:
1) ``pip install NREL-rex[s3]`` for easy no-setup direct access of the data on S3 via ``fsspec`` as per `this example <https://nrel.github.io/rex/misc/examples.fsspec.html>`_
2) or ``pip install NREL-rex[hsds]`` for more performant access of the data on HSDS with slightly more setup as per `this example <https://nrel.github.io/rex/misc/examples.hsds.html>`_

Option 2: Clone repo (recommended for developers)
-------------------------------------------------
@@ -109,11 +101,15 @@ Option 2: Clone repo (recommended for developers)
- ``WINDX``
- ``US-wave``

Recommended Citation
====================

Update with current version and DOI:
rex command line tools
======================

Michael Rossol, Grant Buster. The REsource Extraction Tool (rex).
https://github.com/NREL/rex (version v0.2.43), 2021.
https://doi.org/10.5281/zenodo.4499033.
- `rex <https://nrel.github.io/rex/_cli/rex.html#rex>`_
- `NSRDBX <https://nrel.github.io/rex/_cli/NSRDBX.html#NSRDBX>`_
- `WINDX <https://nrel.github.io/rex/_cli/WINDX.html#WINDX>`_
- `US-wave <https://nrel.github.io/rex/_cli/US-wave.html#US-wave>`_
- `WaveX <https://nrel.github.io/rex/_cli/WaveX.html#Wavex>`_
- `MultiYearX <https://nrel.github.io/rex/_cli/MultiYearX.html#MultiYearX>`_
- `rechunk <https://nrel.github.io/rex/_cli/rechunk.html#rechunk>`_
- `temporal-stats <https://nrel.github.io/rex/_cli/temporal-stats.html#temporal-stats>`_
- `wind-rose <https://nrel.github.io/rex/_cli/wind-rose.html#wind-rose>`_
10 changes: 0 additions & 10 deletions docs/source/index.rst
@@ -10,14 +10,4 @@
rex documentation
*******************

What is rex?
=============
rex stands for **REsource eXtraciton** tool.

rex enables the efficient and scalable extraction, manipulation, and
computation with NRELs flagship renewable resource datasets:
the Wind Integration National Dataset (WIND Toolkit), and the National Solar
Radiation Database (NSRDB)

.. include:: ../../README.rst
:start-after: inclusion-intro
7 changes: 0 additions & 7 deletions docs/source/misc/installation.rst
@@ -5,13 +5,6 @@ Installation
:start-after: Installing rex
:end-before: Recommended Citation

Usage on Eagle
==============

.. include:: ../../../README.rst
:start-after: Using Eagle Env
:end-before: Installing rex

Command Line Tools
==================

21 changes: 20 additions & 1 deletion examples/HSDS/README.rst
@@ -1,7 +1,26 @@
Highly Scalable Data Service (HSDS)
===================================

`The Highly Scalable Data Service (HSDS) <https://www.hdfgroup.org/solutions/highly-scalable-data-service-hsds/>`_ is a cloud-optimized solution for storing and accessing HDF5 files, e.g. the NREL wind and solar datasets. You can access NREL data via HSDS in a few ways. Read below to find out more.
`The Highly Scalable Data Service (HSDS)
<https://www.hdfgroup.org/solutions/highly-scalable-data-service-hsds/>`_ is a
cloud-optimized solution for storing and accessing HDF5 files, e.g. the NREL
wind and solar datasets. You can access NREL data via HSDS in a few ways. Read
below to find out more.

Note that raw NREL .h5 data files are hosted on AWS S3. In contrast, the files
on HSDS are not real "files". They are just domains that you can access with
h5pyd or rex tools to stream small chunks of the files stored on S3. The
multi-terabyte .h5 files on S3 would be incredibly cumbersome to access
otherwise.

Extra Requirements
------------------

You may need some additional software beyond the basic ``rex`` install to run this example:

.. code-block:: bash

    pip install NREL-rex[hsds]

NREL Developer API
------------------
64 changes: 37 additions & 27 deletions examples/NREL_Data/README.rst
@@ -26,7 +26,7 @@ Definitions
- ``h5pyd`` - The python library that provides the HDF REST interface to NREL data hosted on the cloud. This allows for the public to access small parts of large cloud-hosted datasets. See the `h5pyd <https://github.com/HDFGroup/h5pyd>`_ library for more details.
- ``hsds`` - The highly scalable data service (HSDS) that we recommend to access small chunks of very large cloud-hosted NREL datasets. See the `hsds <https://github.com/HDFGroup/hsds>`_ library for more details.
- ``meta`` - The ``dataset`` in an NREL h5 file that contains information about the spatial axis. This is typically a `pandas DataFrame <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_ with columns such as "latitude", "longitude", "state", etc... The DataFrame is typically converted to a records array for storage in an h5 ``dataset``. The length of the meta data should match the length of axis 1 of a 2D spatiotemporal ``dataset``.
- ``S3`` - Amazon Simple Storage Service (S3) is a basic cloud file storage system we use to store raw .h5 files in their full volume. Downloading files directly from S3 may not be the easiest way to access the data because each file tends to be multiple terabytes. Instead, you can stream small chunks of the files via HSDS.
- ``S3`` - Amazon Simple Storage Service (S3) is a basic cloud file storage system we use to store raw .h5 files in their full volume. Downloading files directly from S3 may not be the easiest way to access the data because each file tends to be multiple terabytes. Instead, you can stream small chunks of the files via HSDS.
- ``scale_factor`` - We frequently scale data by a multiplicative factor, round the data to integer precision, and store the data in integer arrays. The ``scale_factor`` is an attribute associated with the relevant h5 ``dataset`` that defines the multiplicative factor required to unscale the data from integer storage to the original physical units.
- ``time_index`` - The ``dataset`` in an NREL h5 file that contains information about the temporal axis. This is typically a `pandas DatetimeIndex <https://pandas.pydata.org/docs/reference/api/pandas.DatetimeIndex.html>`_ that has been converted to a string array for storage in an h5 ``dataset``. The length of this ``dataset`` should match the length of axis 0 of a 2D spatiotemporal ``dataset``.
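
The ``scale_factor`` convention above can be illustrated with a minimal sketch
(made-up numbers, not actual NREL data; the variable names are hypothetical):

```python
import numpy as np

# Hypothetical h5 dataset attributes: data stored as int16 with scale_factor=100
scale_factor = 100
stored = np.array([2512, 2498, 2533], dtype=np.int16)  # e.g. 100 * degrees C

# Unscale from integer storage back to the original physical units
physical = stored / scale_factor
print(physical)  # [25.12 24.98 25.33]
```

The ``rex`` resource handlers apply this unscaling for you automatically.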

@@ -56,44 +56,54 @@ This datasets directory should not be confused with a ``dataset`` from an h5
file.

When using the ``rex`` examples below, update the file paths with the relevant
NREL HPC file paths in ``/datasets/`` and set ``hsds=False``.
NREL HPC file paths in ``/datasets/``.

Data Location - External Users
------------------------------

If you are not at NREL, the easiest way to access this data is via HSDS. These
files are massive and downloading the full files would crash your computer.
HSDS provides a solution to stream small chunks of the data to your laptop or
server for just the time or space domain you're interested in.
If you are not at NREL, you can't simply download these files: they are
massive, and downloading them in full would overwhelm most local storage. The
easiest way to access this data is probably with ``fsspec``, which allows you
to access files directly on S3 with only one additional installation and no
server setup. However, this method is slow. The most performant method is via
``HSDS``, which streams small chunks of the data to your laptop or server for
just the time or space domain you're interested in.

See `this docs page <https://nrel.github.io/rex/misc/examples.fsspec.html>`_
for easy (but slow) access to the source .h5 files on S3 with ``fsspec``,
which requires essentially zero setup. To find relevant S3 files, you can
explore the S3 directory structure on `OEDI
<https://openei.org/wiki/Main_Page>`_ or with the `AWS CLI
<https://aws.amazon.com/cli/>`_.
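
As a sketch of what those S3 paths look like, the annual NSRDB files follow a
simple naming pattern (the bucket and prefix below are taken from the fsspec
example in these docs; the helper function is hypothetical, and available
years should be verified on OEDI before use):

```python
def nsrdb_s3_path(year):
    """Compose the S3 path for an annual NSRDB .h5 file (hypothetical helper)."""
    return f"s3://nrel-pds-nsrdb/current/nsrdb_{year}.h5"

# The 1998 file is the one used in the fsspec docs example
print(nsrdb_s3_path(1998))  # s3://nrel-pds-nsrdb/current/nsrdb_1998.h5
```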

See `this docs page <https://nrel.github.io/rex/misc/examples.hsds.html>`_ for
instructions on how to set up HSDS and then continue on to the Data Access
Examples section below.

To find relevant HSDS files, you can use HSDS and h5pyd to explore the NREL
public data directory listings. For example, if you are running an HSDS local
server, you can use the CLI utility ``hsls``, for example, run: ``$ hsls
/nrel/`` or ``$ hsls /nrel/nsrdb/v3/``. You can also use h5pyd to do the same
thing. In a python kernel, ``import h5pyd`` and then run:
instructions on how to set up HSDS for more performant data access that
requires a bit of setup. To find relevant HSDS files, you can use HSDS and
h5pyd to explore the NREL public data directory listings. For example, if you
are running an HSDS local server, you can use the CLI utility ``hsls``: run
``$ hsls /nrel/`` or ``$ hsls /nrel/nsrdb/v3/``. You can also use h5pyd to do
the same thing. In a python kernel, ``import h5pyd`` and then run
``print(list(h5pyd.Folder('/nrel/')))`` to list the ``/nrel/`` directory.

The `Open Energy Data Initiative (OEDI) <https://openei.org/wiki/Main_Page>`_
is also invaluable in finding energy-relevant public datasets that are not
necessarily spatiotemporal h5 data.
There is also an experimental `zarr
<https://nrel.github.io/rex/misc/examples.zarr.html>`_ interface, but the
examples below may not work with it, and the zarr example is not regularly
tested.

Note that raw NREL .h5 data files are hosted on AWS S3. In contrast, the files on HSDS are not real "files". They are just domains that you can access with h5pyd or rex tools to stream small chunks of the files stored on S3. The multi-terabyte .h5 files on S3 would be incredibly cumbersome to access otherwise.
The `Open Energy Data Initiative (OEDI) <https://openei.org/wiki/Main_Page>`_
is also invaluable for finding the source S3 file paths and for finding
energy-relevant public datasets that are not necessarily spatiotemporal h5
data.

We have also experimented with external data access using `fsspec <https://nrel.github.io/rex/misc/examples.fsspec.html>`_ and `zarr <https://nrel.github.io/rex/misc/examples.zarr.html>`_, but the examples below may not work with these utilities.

Data Access Examples
--------------------

If you are on the NREL HPC, update the file paths in the examples below and set
``hsds=False``.
If you are on the NREL HPC, update the file paths with the relevant NREL HPC
file paths in ``/datasets/``.

If you are not at NREL, see the "Data Location - External Users" section above
for how to setup HSDS and how to find the files that you're interested in. Then
update the file paths to the files you want and keep ``hsds=True``.
for S3 instructions or for how to set up HSDS and how to find the files
you're interested in. Then update the file paths to the files you want on
either HSDS or S3.

The rex Resource Class
++++++++++++++++++++++
@@ -105,7 +115,7 @@ retrieve ``time_index`` and ``meta`` datasets in their native pandas datatypes.
.. code-block:: python
from rex import Resource
with Resource('/nrel/nsrdb/current/nsrdb_2020.h5', hsds=True) as res:
with Resource('/nrel/nsrdb/current/nsrdb_2020.h5') as res:
ghi = res['ghi', :, 500]
print(res.dsets)
print(res.attrs['ghi'])
@@ -131,7 +141,7 @@ windspeed is not available as a ``dataset``:
.. code-block:: python
from rex import WindResource
with WindResource('/nrel/wtk/conus/wtk_conus_2007.h5', hsds=True) as res:
with WindResource('/nrel/wtk/conus/wtk_conus_2007.h5') as res:
ws88 = res['windspeed_88m', :, 1000]
print(res.dsets)
print(ws88)
@@ -150,7 +160,7 @@ for a requested coordinate:
.. code-block:: python
from rex import ResourceX
with ResourceX('/nrel/wtk/conus/wtk_conus_2007.h5', hsds=True) as res:
with ResourceX('/nrel/wtk/conus/wtk_conus_2007.h5') as res:
df = res.get_lat_lon_df('temperature_2m', (39.7407, -105.1686))
print(df)
@@ -170,7 +180,7 @@ the System Advisor Model (SAM). For example, try:
.. code-block:: python
from rex import SolarX
with SolarX('/nrel/nsrdb/current/nsrdb_2020.h5', hsds=True) as res:
with SolarX('/nrel/nsrdb/current/nsrdb_2020.h5') as res:
df = res.get_SAM_lat_lon((39.7407, -105.1686))
print(df)
15 changes: 6 additions & 9 deletions examples/fsspec/README.rst
@@ -1,37 +1,34 @@
fsspec
======

You can use ``fsspec`` to open NREL h5 resource files hosted on AWS S3 on your local computer. In our internal tests, this is slower than the `HSDS <https://nrel.github.io/rex/misc/examples.hsds.html>`_ and `zarr <https://nrel.github.io/rex/misc/examples.zarr.html>`_ examples, but is much easier to set up. This may be a good option for people outside of NREL trying to access small to medium amounts of NREL .h5 data in applications that are not sensitive to IO performance.
Filesystem utilities from ``fsspec`` enable users outside of NREL to open h5 resource files hosted on AWS S3 on their local computers. In our internal tests, this is slower than the `HSDS <https://nrel.github.io/rex/misc/examples.hsds.html>`_ and `zarr <https://nrel.github.io/rex/misc/examples.zarr.html>`_ examples, but as of ``rex`` version v0.2.92 it requires zero setup beyond installing ``rex`` and ``fsspec`` as described below. This may be a good option for people outside of NREL trying to access small to medium amounts of NREL .h5 data in applications that are not sensitive to IO performance.

For more info on ``fsspec``, read the docs `here <https://filesystem-spec.readthedocs.io/en/latest/>`_.

Extra Requirements
------------------

You may need some additional software beyond the rex requirements to run this example:
You may need some additional software beyond the basic ``rex`` install to run this example:

.. code-block:: bash
pip install fsspec
pip install NREL-rex[s3]
Code Example
------------

To open an .h5 file hosted on AWS S3, follow the code example below. Here are some caveats to this approach:
To open an .h5 file hosted on AWS S3, simply use a path to an S3 resource with any of the ``rex`` file handlers:

- Change ``fp`` to your desired AWS .h5 resource paths.
- Change ``fp`` to your desired AWS .h5 resource path (find the S3 paths on `OEDI <https://openei.org/wiki/Main_Page>`_ or with the `AWS CLI <https://aws.amazon.com/cli/>`_).
- On a laptop, this example takes ~14 seconds to read the metadata and another ~14 seconds to read the GHI timeseries. It may be faster when running on AWS services in the same region hosting the .h5 file, and it is much slower when running on the NREL VPN.
- The ``s3f`` object works like a local .h5 filepath and can be passed to any of the ``rex`` resource handlers, which will handle all of the data scaling and formatting.

.. code-block:: python
import time
import fsspec
from rex import Resource
fp = "s3://nrel-pds-nsrdb/current/nsrdb_1998.h5"
s3f = fsspec.open(fp, mode='rb', anon=True, default_fill_cache=False)
res = Resource(s3f.open())
res = Resource(fp)
t0 = time.time()
meta = res.meta
Expand Down