
DO NOT USE - MIGRATED TO GITLAB

dataworks-aws-data-egress

A repo for Dataworks AWS Data Egress service infrastructure

This repo contains the Makefile, base Terraform folders and Jinja2 files that fit the standard pattern. It was created from the base template for new Terraform repos, with the template files renamed and the githooks submodule added to make the repo ready for use.

Running aviator will create the pipeline required on the AWS-Concourse instance, in order to pass a mandatory CI status check. This will likely require you to log in to Concourse, if you haven't already.

After cloning this repo, please generate terraform.tf and terraform.tfvars files:
make bootstrap

In addition, you may want to do the following:

  1. Create non-default Terraform workspaces as and if required:
    make terraform-workspace-new workspace=<workspace_name> e.g.
    make terraform-workspace-new workspace=qa

  2. Configure Concourse CI pipeline:

    1. Add/remove jobs in ./ci/jobs as required
    2. Create CI pipeline:
      aviator

Data egress

The data egress task is responsible for receiving messages from an SQS queue, retrieving the configuration DynamoDB item for each message and then sending files to a destination location (another S3 bucket or to disk).

Data Egress Diagram

  1. Data is uploaded to the source S3 bucket
  2. A pipeline_success.flag file is uploaded to the same path as the data
  3. A new SQS message is created when the pipeline_success.flag file is uploaded, with the path to the file as the data source
  4. The egress service picks up jobs from the SQS queue (a sketch of this loop follows the list)
  5. The egress service queries DynamoDB to determine what action needs to be taken with the data (set in data-egress.tf)
  6. If transfer_type is SFT, the data is copied to a local directory and picked up by the SFT Service
    1. Prod environment: the data is sent to the corresponding data warehouse location
    2. Non-prod environment: the data is sent to the stub-hdfs-*** bucket
  7. If transfer_type is S3, the data is sent to the corresponding S3 location
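
The sketch below is a minimal, illustrative version of that loop using boto3. The queue URL, table name, message shape and directory layout are assumptions for illustration only, not the actual service code, and decryption is omitted.

```python
# Illustrative sketch only - not the actual data egress service code.
# Queue URL, table name, message shape and paths are hypothetical placeholders.
import json

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("data-egress")  # placeholder table name

QUEUE_URL = "https://sqs.eu-west-2.amazonaws.com/123456789012/data-egress"  # placeholder
SFT_DIR = "/data-egress"  # volume shared with the SFT agent sidecar


def list_keys(bucket, prefix):
    """List object keys under the source prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]


def process_messages():
    # 4. Pick up jobs from the SQS queue
    response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
    for message in response.get("Messages", []):
        # Assumed message shape; the real payload is derived from the
        # pipeline_success.flag S3 event
        body = json.loads(message["Body"])
        source_prefix = body["source_prefix"]
        pipeline_name = body["pipeline_name"]

        # 5. Query DynamoDB for the action to take (items set in data-egress.tf)
        item = table.get_item(
            Key={"source_prefix": source_prefix, "pipeline_name": pipeline_name}
        )["Item"]

        for key in list_keys(item["source_bucket"], source_prefix):
            data = s3.get_object(Bucket=item["source_bucket"], Key=key)["Body"].read()
            filename = key.rsplit("/", 1)[-1]
            if item["transfer_type"] == "SFT":
                # 6. Write to the shared volume for the SFT agent to pick up
                with open(f"{SFT_DIR}/{item['destination_prefix']}{filename}", "wb") as f:
                    f.write(data)
            else:
                # 7. Copy straight to the destination S3 location
                s3.put_object(
                    Bucket=item["destination_bucket"],
                    Key=f"{item['destination_prefix']}{filename}",
                    Body=data,
                )

        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```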

Database items

| Attribute | Description |
| --- | --- |
| source_prefix | Partition key. The S3 prefix to retrieve files from |
| pipeline_name | Sort key. The pipeline which sent the files |
| decrypt | Whether the files need to be decrypted |
| destination_bucket | The S3 bucket to send files to. Blank for SFT |
| destination_prefix | The folder path to save files to |
| recipient_name | Name of the team receiving the files |
| source_bucket | S3 bucket location of the files to send |
| transfer_type | How to send the files, S3 or SFT |

If source data is required to be sent via S3 and SFT, append the transfer type to pipeline_name

pipeline_name#sft
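
For reference, a hypothetical item with placeholder values might look like the following (in the real repo the items are defined in data-egress.tf; this boto3 call is only to illustrate the attribute shape):

```python
# Hypothetical example item - all values are placeholders.
import boto3

table = boto3.resource("dynamodb").Table("data-egress")  # placeholder table name
table.put_item(
    Item={
        "source_prefix": "example/dataset/2023-01-01/",  # partition key
        "pipeline_name": "example_pipeline#sft",         # sort key, '#sft' appended for the SFT copy
        "decrypt": True,
        "destination_bucket": "",                        # blank for SFT
        "destination_prefix": "example-team/2023-01-01/",
        "recipient_name": "Example Team",
        "source_bucket": "example-source-bucket",
        "transfer_type": "SFT",
    }
)
```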

Note

Ensure the source prefix is in data-egress_iam.tf

SFT Agent

The SFT Agent reads files written to disk by the data egress service and sends these to configured destinations via HTTPS. It is deployed as a sidecar to the data egress service, and a volume mounted to /data-egress is shared between the containers.
Which files are read, and where they are sent, is determined by the agent config.
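
Conceptually, the agent's behaviour amounts to something like the sketch below (purely illustrative; the real agent is a separate application driven by its own config, and the endpoint and certificate paths here are placeholders):

```python
# Conceptual illustration only - not the SFT agent's implementation.
import os

import requests

WATCH_DIR = "/data-egress"                          # volume shared with the data egress container
DESTINATION = "https://example-destination/DA"      # placeholder HTTPS endpoint
CLIENT_CERT = ("/certs/sft-agent.crt", "/certs/sft-agent.key")  # placeholder TLS client cert/key

for name in os.listdir(WATCH_DIR):
    path = os.path.join(WATCH_DIR, name)
    if os.path.isfile(path):
        with open(path, "rb") as f:
            # Send each file over HTTPS, authenticating with the client certificate
            requests.post(DESTINATION, data=f, cert=CLIENT_CERT, timeout=60)
```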

Testing

In non-production environments, files are sent to stub NiFi, a container running NiFi deployed by snapshot sender. It listens for files on port 8091 at path /DA. The files it receives are saved to the S3 bucket

stub-hdfs-****
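
A quick way to confirm delivery in a test is to list the stub bucket after a run. The bucket name and prefix below are placeholders, as the real names are environment-specific:

```python
# Hypothetical check - bucket name and prefix are placeholders.
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="stub-hdfs-example", Prefix="example/dataset/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```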

Production Running

SFT sends files to an SDX F5 VIP, which receives the files and forwards them on. Authentication is established via TLS; the certificates required by SFT are defined here.

Java keystore and truststore

These are created within the sft-agent entrypoint. The agent config is updated with the keystore/truststore passwords and paths. Importantly, the private key password and the keystore password have to be the same, because SFT runs on a Tomcat version with this 'requirement'.

BYOK Encryption

Due to the nature of the data being transferred, there is a requirement to have a specific type of encryption on the EBS volumes where we store some of the DWX data.
Our EBS volumes must be encrypted with an external CMK generated by the Security Operations team.

For this implementation we have created an external KMS key. We manually download the wrappers and tokens for each key and send the wrappers to the relevant people in the Security Operations team, who use the wrapper to wrap a key generated via their HSM.
The wrapped key material is then manually uploaded, alongside the token, via breakglass to each environment.

All the manual steps are done in the AWS console in the KMS section.
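
For reference, the console steps correspond roughly to the following KMS API flow, sketched here with boto3. The key ARN and file name are placeholders, and the wrapping itself is performed offline by the Security Operations team's HSM.

```python
# Rough sketch of what the manual console steps correspond to in the KMS API.
import boto3

kms = boto3.client("kms")
KEY_ID = "arn:aws:kms:eu-west-2:123456789012:key/placeholder"  # CMK created with EXTERNAL origin

# Download the wrapping public key ("wrapper") and import token for the key
params = kms.get_parameters_for_import(
    KeyId=KEY_ID,
    WrappingAlgorithm="RSAES_OAEP_SHA_256",
    WrappingKeySpec="RSA_2048",
)
public_key = params["PublicKey"]     # sent to Security Operations to wrap their HSM-generated key
import_token = params["ImportToken"]

# Once the wrapped key material comes back, import it alongside the token
with open("wrapped_key_material.bin", "rb") as f:  # placeholder file name
    kms.import_key_material(
        KeyId=KEY_ID,
        ImportToken=import_token,
        EncryptedKeyMaterial=f.read(),
        ExpirationModel="KEY_MATERIAL_DOES_NOT_EXPIRE",
    )
```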

Now that the key is uploaded, we can use this external KMS key to encrypt the EBS volumes.

Some more information is available in our common wiki