Cloud Processing Setup
The WIBL Multitool, or any S3 client, can upload files to Amazon Web Services for processing. The sequence of operations is:
- Upload the raw logger file to an appropriate Simple Storage Service bucket.
- Read the raw logger file, timestamp the data, and convert it to GeoJSON using an AWS Lambda function, writing the result into a second S3 bucket.
- Transfer from the staging S3 bucket to DCDB using their API.
Since each Trusted Node needs to have its own permissions to send data to DCDB, including identifying marks on the data, it is impossible to provide a complete implementation of this processing chain. Instead, this documentation describes how to use the components provided to configure your own processing system. The best and simplest way to do this is to use the automated setup scripts described immediately following; if you have to do something custom, however, full setup instructions follow.
The distribution provides scripts in `wibl-python/scripts` to assist in setting up the AWS portion of the processing chain. Specifically, to set up a new instantiation of the cloud segment, it should be sufficient to edit `wibl-python/scripts/cloud/AWS/configuration-parameters.sh` to reflect your configuration (e.g., your AWS account number, DCDB provider ID, and the location of your DCDB upload authorisation token), and then run first `wibl-python/scripts/cloud/AWS/configure-buckets.sh` to generate the S3 buckets, and then `wibl-python/scripts/cloud/AWS/configure-lambdas.sh` to package, upload, configure, and authorise the Lambdas for processing and submission to DCDB. Given the complexity of doing this by hand, it is strongly recommended that you use the scripts (we do!).
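In outline, the automated route is just an edit followed by two script runs; a minimal sketch (the script paths are those named above, and `$EDITOR` is whatever editor you prefer):

```bash
# Adjust account number, provider ID, token location, region, etc.
"${EDITOR:-vi}" wibl-python/scripts/cloud/AWS/configuration-parameters.sh

# Create the S3 buckets used by the processing chain
bash wibl-python/scripts/cloud/AWS/configure-buckets.sh

# Package, upload, configure, and authorise the Lambdas
bash wibl-python/scripts/cloud/AWS/configure-lambdas.sh
```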
The packaging script uses Docker to package the code using a runtime that you can also use in AWS, and assumes that you have access to an implementation of the `bash` shell. This is native on macOS and most Unix-alike systems, and is available through Windows Subsystem for Linux on Windows. You will also have to install and configure the AWS command line tool.
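A quick way to confirm that the prerequisites are in place before running the scripts (exact versions are not critical; these commands are just a suggested sanity check):

```bash
bash --version      # the setup scripts are bash scripts
docker --version    # Docker is used to build the Lambda packages
aws --version       # the AWS CLI must be installed and configured
```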
The configuration parameters in `wibl-python/scripts/cloud/AWS/configuration-parameters.sh` are:
- `ACCOUNT_NUMBER`. Set to your 12-digit AWS account number. This must match the credentials you have configured into the AWS CLI as your default credentials.
- `DCDB_PROVIDER_ID`. Your five- or six-letter identifier that DCDB provides to all Trusted Nodes. For example, CCOM/JHC's ID is "UNHJHC".
The script assumes that the authorisation key for DCDB uploads (provided by DCDB for Trusted Nodes) will be in a simple ASCII text file at `../DCDB-Upload-Tokens/ingest-external-UNHJHC.txt` (or the equivalent for your DCDB provider ID). You may have to modify this if you prefer to keep your credentials elsewhere.
The configuration script also makes some assumptions about where you want to deploy the Lambdas (by default, `us-east-2`) and the runtime architecture (by default, `x86_64`). You can adjust either or both of these, but be aware that not all AWS services are available in all regions, and you can't always get a specific runtime and SciPy support for Lambdas on a specific architecture. (For example, it would be nicer to use an ARM64 runtime, since it's cheaper, but as of 2023-03 the support for runtimes and SciPy isn't there yet.)
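For orientation, an edited parameter file might look something like the sketch below; `ACCOUNT_NUMBER` and `DCDB_PROVIDER_ID` are the variables documented above, while the remaining variable names and values are illustrative assumptions rather than the script's actual contents:

```bash
# Sketch of wibl-python/scripts/cloud/AWS/configuration-parameters.sh (illustrative only)
ACCOUNT_NUMBER="123456789012"        # your 12-digit AWS account number
DCDB_PROVIDER_ID="UNHJHC"            # your DCDB Trusted Node identifier

# The names below are assumptions for illustration; check the distributed
# parameter file for the variables it actually defines.
AUTH_TOKEN_FILE="../DCDB-Upload-Tokens/ingest-external-${DCDB_PROVIDER_ID}.txt"
AWS_REGION="us-east-2"               # deployment region
ARCHITECTURE="x86_64"                # Lambda runtime architecture
```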
After signing up for AWS services, each implementation will need an Identity and Access Management (IAM) user to allow upload to S3, and a Role to allow the processing to occur. Details of how these get set up may depend on whether your organisation uses Single Sign-On, but the setup should be:
- Go to the IAM console, and select "Users", then "Add User".
- Provide an appropriate user name (e.g., CSBUpload).
- Set "Access Type" to "Programmatic access"
- Click "Next: Permissions"
- Click on "Attach existing policies directly"
- In the policy search box, search for and check the box next to "AmazonS3FullAccess"
- Click "Next: Tags"
- Add any tags that you prefer (not important for setup)
- Click "Next: Review"
- Click "Create User"
- Ensure that you download the CSV file with the user's credentials in it, and store it somewhere secure locally.
- At the IAM console, click "Roles" and then "Create Role".
- Click on "AWS service" box at the top of the page (usually the default).
- Click on "Lambda" as the service (usually at the top in "Common use cases").
- Click "Next: Permissions".
- In the policy search box, search for and check the box next to "AmazonS3FullAccess", "AmazonAPIGatewayAdministrator", and "AWSLambdaBasicExecutionRole".
- Click "Next: Tags"
- Add any tags that you prefer (not important for setup)
- Click "Next: Review"
- Give the role an appropriate name, e.g., "CSBProcessing".
- Click "Create role".
The CSV file generated when you make a user contains the user name, an Access Key ID, and a Secret Key, among other things. In order to make your upload scripts use these credentials by default:
- Create `~/.aws/credentials` and add:

      [default]
      aws_access_key_id = ACCESS-KEY-ID
      aws_secret_access_key = SECRET-KEY

- Create `~/.aws/config` and add:

      [default]
      region = us-east-2

  (changing the default region to something appropriate to your location around the world).
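If you prefer not to edit these files by hand, the AWS CLI can write the same entries interactively, and can confirm that the credentials resolve to the expected account:

```bash
# Prompts for the Access Key ID, Secret Key, default region, and output format,
# then writes ~/.aws/credentials and ~/.aws/config
aws configure

# Sanity check: should report the 12-digit account number used above
aws sts get-caller-identity
```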
You will need three S3 buckets to hold the code images, the incoming data, and the GeoJSON files prior to submission to DCDB. At the AWS S3 console, create three buckets:
- `csb-code-images` to hold Lambda run-time ZIP files.
- `csb-upload-ingest-bucket` to hold files being transferred in from the outside.
- `csb-upload-dcdb-staging` to hold files while they are being transferred to DCDB.
In each case, the only requirement is to give the buckets names and to ensure that they are created in the appropriate AWS region; otherwise, the default configuration settings apply. Note that S3 bucket names have to be globally unique, so it's likely that you won't be able to make buckets with these specific names; build something that's unique to you (e.g., "PREFIX-csb-code-images") and then modify the configuration JSON for the Lambdas to match.
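Creating the buckets from the command line is equally valid; a sketch, assuming the region is `us-east-2` and a hypothetical `PREFIX` to make the names globally unique:

```bash
REGION="us-east-2"
PREFIX="myorg"    # hypothetical prefix; choose something unique to you

for BUCKET in csb-code-images csb-upload-ingest-bucket csb-upload-dcdb-staging; do
    # Outside us-east-1, S3 requires an explicit LocationConstraint
    aws s3api create-bucket \
        --bucket "${PREFIX}-${BUCKET}" \
        --region "${REGION}" \
        --create-bucket-configuration LocationConstraint="${REGION}"
done
```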
AWS Lambda is a serverless function execution environment (i.e., AWS manages the servers and the user just provides the code to be executed as a suitable function, typically in JavaScript, Java, Ruby, or Python). Although it is possible to edit code directly in the web interface, there is a limit to the support code that is available in this environment. Due to the support libraries required for the processing and submission scripts, custom environments must therefore be prepared as follows, and then uploaded.
Python supports stand-alone virtual environments that encapsulate the code and libraries required for a particular application. In order to prepare the virtual environments required:
- Make a new directory on a local hard disc, and change into it.
- Make sure that you're using Python 3.7 or later (e.g., `python --version`).
- Create the environment: `python -m virtualenv .`
- If you're using `conda`, you can also generate an environment with `conda create --name foo`.
You can then install any packages required using `pip install`. Depending on your host OS, you may need to activate the virtual environment first so that the packages are installed in the environment, rather than in your local Python setup. Usually, there's a `source bin/activate` command for this in the virtual environment.
To package up the code for upload, ZIP up the contents of the `site-packages` directory; usually `${BASE}/lib/python3.7/site-packages` (mutatis mutandis for the Python version). Note that it is essential that you package the contents of the directory, not the directory itself!
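As a sketch of the packaging step (assuming the virtual environment lives at `${BASE}` and uses Python 3.7; the ZIP file name is arbitrary), the important detail is zipping from inside `site-packages` so the archive holds its contents rather than the directory:

```bash
BASE="$(pwd)"    # root of the virtual environment

# Zip the *contents* of site-packages, not the directory itself
cd "${BASE}/lib/python3.7/site-packages"
zip -r "${BASE}/lambda-package.zip" .
```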
The processing lambda custom code is provided in the repository as `wibl-python/wibl/processing/cloud/aws/lambda_function.py`. Make a virtual environment to support this code, add the `pynmea2` library, and then copy the custom code into the `site-packages` directory before creating the ZIP file for upload. Upload the ZIP file to the `csb-code-images` bucket with an appropriately identifying name. Copy the S3 URI of the ZIP file once uploaded.
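Putting the pieces together for the processing Lambda might look like the sketch below; the build directory, ZIP file name, and relative path to the repository checkout are illustrative assumptions:

```bash
# Build a package for the processing Lambda (names are illustrative)
mkdir csb-stage1-build && cd csb-stage1-build
python -m virtualenv .
source bin/activate
pip install pynmea2

# Place the custom code alongside its dependencies in site-packages
cp ../wibl-python/wibl/processing/cloud/aws/lambda_function.py lib/python3.7/site-packages/

# Package the contents of site-packages and push the result to the code bucket
cd lib/python3.7/site-packages
zip -r ../../../csb-stage1-processing.zip .
cd ../../..
aws s3 cp csb-stage1-processing.zip s3://csb-code-images/csb-stage1-processing.zip
```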
To make the lambda, go to the Lambda console, click "Create Function", and:
- Select the "Author from Scratch" box (the default)
- Name the function "csb-stage1-processing"
- Select the Python 3.7 (or later) run-time.
- Open the "Change default execution role" drop-down.
- Click the "Use an existing role" and then select the role that you created previously from the drop-down list.
- Click "Create function"
Once the function has been created, click on it in the console, and then configure:
- In "Code source", click "Upload from" and select "Amazon S3 location". Paste the ZIP file's S3 URI, and save.
- In "Layers", click "Edit" and then "AWSLambda-Python37-SciPy1x" at least version 35 to add NumPy support. Note that you may have to change this configuration slightly depending on the version of Python you're using for the runtime.
- Click "+ Add trigger", select "S3" and then specify the
csb-upload-ingest-bucket
under "Bucket" and "PUT" under "Event type"; acknowledge the check box on recursive invocation, and then click "Add".
The submission lambda custom code is provided in the repository as `wibl-python/wibl/submission/cloud/aws/lambda_function.py`. The authentication information for talking to DCDB and your DCDB provider ID (e.g., "UNHJHC") can be provided either in a `configure.json` file, or through environment variables associated with the Lambda (strongly preferred). The automated setup scripts (see previously) configure this for you, and are strongly recommended over setting it up by hand.
After these changes are made, make a virtual environment to support this code, add the `requests` library, and then copy the custom code into the `site-packages` directory before creating the ZIP file for upload. Upload the ZIP file to the `csb-code-images` bucket with an appropriately identifying name. Copy the S3 URI of the ZIP file once uploaded.
To make the lambda, go to the Lambda console, click "Create Function", and:
- Select the "Author from Scratch" box (the default)
- Name the function "csb-stage2-submission"
- Select the Python 3.7 (or later) run-time.
- Open the "Change default execution role" drop-down.
- Click the "Use an existing role" and then select the role that you created previously from the drop-down list.
- Click "Create function"
Once the function has been created, click on it in the console, and then configure:
- In "Code source", click "Upload from" and select "Amazon S3 location". Paste the ZIP file's S3 URI, and save.
- Click "+ Add trigger", select "S3" and then specify the
csb-upload-dcdb-staging
under "Bucket" and "PUT" under "Event type"; acknowledge the check box on recursive invocation, and then click "Add".
As of 2023-03, the configuration details for the Lambdas have significantly changed. There are now default parameter configurations in `wibl-python/wibl/defaults/processing/cloud/aws/configure.json` and `wibl-python/wibl/defaults/submission/cloud/aws/configure.json`, which set the basic parameters for the Lambdas.
The `verbose` binary flag can be used to turn on more detailed debugging information for the code, but should remain off for production work. The `local` binary flag can be used for local (i.e., on a laptop or desktop) testing if required (although the WIBL Multitool is a better method).
The authorisation token (to talk to DCDB) and provider ID (the tag that represents you to DCDB) now have to be set, by preference, as environment variables in the Lambda. The simplest and most reliable method for this is to use the automated setup script (see previously).
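For reference, environment variables can be attached to a Lambda from the command line as sketched below; the variable names `PROVIDER_ID` and `PROVIDER_AUTH` are purely illustrative, since the names the submission code actually reads are set by the automated configuration script:

```bash
# Illustrative only: the real variable names are whatever the submission code expects,
# and the automated setup script sets them for you
TOKEN="$(cat ../DCDB-Upload-Tokens/ingest-external-UNHJHC.txt)"

aws lambda update-function-configuration \
    --function-name csb-stage2-submission \
    --environment "Variables={PROVIDER_ID=UNHJHC,PROVIDER_AUTH=${TOKEN}}"
```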
At this stage, the Lambda processing chain will accept S3 uploads, process the files if possible, write them into the staging S3 bucket, and then attempt to upload them to the DCDB test service. In order to make the code actually send the data (ideally after testing), remove the `test` component from the submission script's URL for the HTTP request during configuration.
The configurations of these scripts are designed to be minimal --- a framework on which to build further processing and more complex configurations.
For example, the code does not currently clean up after itself, and leaves the uploaded files in both `csb-upload-ingest-bucket` and `csb-upload-dcdb-staging` when the Lambdas complete so that further processing could be done on them. This will quickly start to cost serious amounts for S3 storage, however, and if further processing is not considered, it would be a good idea to modify the scripts to clean up the `csb-upload-ingest-bucket` after the data is successfully converted for staging, and the `csb-upload-dcdb-staging` after the file is successfully transferred to DCDB.
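If modifying the Lambdas is not attractive, one alternative (a suggestion, not part of the distribution) is an S3 lifecycle rule that expires uploaded objects after a retention period of your choosing, for example:

```bash
# Expire objects in the ingest bucket 30 days after upload (the retention period is a choice)
aws s3api put-bucket-lifecycle-configuration \
    --bucket csb-upload-ingest-bucket \
    --lifecycle-configuration '{
      "Rules": [{
        "ID": "expire-raw-uploads",
        "Status": "Enabled",
        "Filter": {},
        "Expiration": {"Days": 30}
      }]
    }'
```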
For Trusted Nodes that are interested in doing further processing, however, keeping the intermediate files might be useful. For example, adding water level adjustments, correcting for vertical offsets, or estimating observer reputation could all be coded in a similar manner to run on one or more files on a regular basis (although use of an alternative mechanism might be better if the processing is not applied to single files). Given the verbosity of the GeoJSON format, however, it might be more useful to cache copies of the timestamped data before it is converted into GeoJSON, which could be done in the first-stage processing script.
(In the development stream, a limited capacity to include sub-algorithms has been added. These algorithms are applied in the `wibl-python/wibl/processing/cloud/aws/lambda_function.py` script if the WIBL file contains an "algorithms" packet and the algorithm requested is recognised; format conversion, timestamping, GeoJSON conversion, and upload always happen. This facility is expected to improve in the future as new algorithms are added.)