
DO NOT USE - MIGRATED TO GITLAB

dataworks-aws-data-egress

A repo for Dataworks AWS Data Egress service infrastructure

This repo contains the Makefile, base Terraform folders and Jinja2 files that fit the standard pattern. It was created from the base template for new Terraform repos, with the template files renamed and the githooks submodule added to make the repo ready for use.

Running aviator will create the pipeline required on the AWS-Concourse instance, in order to pass a mandatory CI status check. This will likely require you to log in to Concourse, if you haven't already.

After cloning this repo, please generate terraform.tf and terraform.tfvars files:
make bootstrap

In addition, you may want to do the following:

  1. Create non-default Terraform workspaces as and if required:
    make terraform-workspace-new workspace=<workspace_name> e.g.
    make terraform-workspace-new workspace=qa

  2. Configure Concourse CI pipeline:

    1. Add/remove jobs in ./ci/jobs as required
    2. Create CI pipeline:
      aviator

Data egress

The data egress task is responsible for receiving messages from an SQS queue, retrieving the configuration DynamoDB item for each message and then sending files to a destination location (another S3 bucket or to disk).

Data Egress Diagram

  1. Data is uploaded to the source S3 bucket
  2. A pipeline_success.flag file is uploaded to the same path as the data
  3. A new SQS message is created when the pipeline_success.flag file is uploaded, with the path to the file as the data source
  4. The egress service picks up jobs from the SQS queue (a sketch of this loop follows the list)
  5. The egress service queries DynamoDB to determine what action needs to be taken with the data (set in data-egress.tf)
  6. If transfer_type is SFT, the data is copied to a local directory and picked up by the SFT Service
    1. Prod environment: the data is sent to the corresponding data warehouse location
    2. Non-prod environment: the data is sent to the stub-hdfs-*** bucket
  7. If transfer_type is S3, the data is sent to the corresponding S3 location
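
The sketch below is a minimal, illustrative version of that loop using boto3. The queue URL, table name, message shape and directory layout are assumptions for illustration only, not the actual service code, and decryption is omitted.

```python
# Illustrative sketch only - not the actual data egress service code.
# Queue URL, table name, message shape and paths are hypothetical placeholders.
import json

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("data-egress")  # placeholder table name

QUEUE_URL = "https://sqs.eu-west-2.amazonaws.com/123456789012/data-egress"  # placeholder
SFT_DIR = "/data-egress"  # volume shared with the SFT agent sidecar


def list_keys(bucket, prefix):
    """List object keys under the source prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]


def process_messages():
    # 4. Pick up jobs from the SQS queue
    response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
    for message in response.get("Messages", []):
        # Assumed message shape; the real payload is derived from the
        # pipeline_success.flag S3 event
        body = json.loads(message["Body"])
        source_prefix = body["source_prefix"]
        pipeline_name = body["pipeline_name"]

        # 5. Query DynamoDB for the action to take (items set in data-egress.tf)
        item = table.get_item(
            Key={"source_prefix": source_prefix, "pipeline_name": pipeline_name}
        )["Item"]

        for key in list_keys(item["source_bucket"], source_prefix):
            data = s3.get_object(Bucket=item["source_bucket"], Key=key)["Body"].read()
            filename = key.rsplit("/", 1)[-1]
            if item["transfer_type"] == "SFT":
                # 6. Write to the shared volume for the SFT agent to pick up
                with open(f"{SFT_DIR}/{item['destination_prefix']}{filename}", "wb") as f:
                    f.write(data)
            else:
                # 7. Copy straight to the destination S3 location
                s3.put_object(
                    Bucket=item["destination_bucket"],
                    Key=f"{item['destination_prefix']}{filename}",
                    Body=data,
                )

        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```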

Database items

| Attribute | Description |
| --- | --- |
| source_prefix | Partition key. The S3 prefix to retrieve files from |
| pipeline_name | Sort key. The pipeline which sent the files |
| decrypt | Whether the files need to be decrypted |
| destination_bucket | The S3 bucket to send files to. Blank for SFT |
| destination_prefix | The folder path to save files to |
| recipient_name | Name of the team receiving the files |
| source_bucket | S3 bucket location of the files to send |
| transfer_type | How to send the files, S3 or SFT |

If source data is required to be sent via S3 and SFT, append the transfer type to pipeline_name

pipeline_name#sft
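
For reference, a hypothetical item with placeholder values might look like the following (in the real repo the items are defined in data-egress.tf; this boto3 call is only to illustrate the attribute shape):

```python
# Hypothetical example item - all values are placeholders.
import boto3

table = boto3.resource("dynamodb").Table("data-egress")  # placeholder table name
table.put_item(
    Item={
        "source_prefix": "example/dataset/2023-01-01/",  # partition key
        "pipeline_name": "example_pipeline#sft",         # sort key, '#sft' appended for the SFT copy
        "decrypt": True,
        "destination_bucket": "",                        # blank for SFT
        "destination_prefix": "example-team/2023-01-01/",
        "recipient_name": "Example Team",
        "source_bucket": "example-source-bucket",
        "transfer_type": "SFT",
    }
)
```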

Note

Ensure the source prefix is in data-egress_iam.tf

SFT Agent

The SFT Agent reads files written to disk by the data egress service and sends these to configured destinations via HTTPS. It is deployed as a sidecar to the data egress service, and a volume mounted to /data-egress is shared between the containers.
Which files are read, and where they are sent, is determined by the agent config.
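
Conceptually, the agent's behaviour amounts to something like the sketch below (purely illustrative; the real agent is a separate application driven by its own config, and the endpoint and certificate paths here are placeholders):

```python
# Conceptual illustration only - not the SFT agent's implementation.
import os

import requests

WATCH_DIR = "/data-egress"                          # volume shared with the data egress container
DESTINATION = "https://example-destination/DA"      # placeholder HTTPS endpoint
CLIENT_CERT = ("/certs/sft-agent.crt", "/certs/sft-agent.key")  # placeholder TLS client cert/key

for name in os.listdir(WATCH_DIR):
    path = os.path.join(WATCH_DIR, name)
    if os.path.isfile(path):
        with open(path, "rb") as f:
            # Send each file over HTTPS, authenticating with the client certificate
            requests.post(DESTINATION, data=f, cert=CLIENT_CERT, timeout=60)
```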

Testing

In non-production environments, files are sent to stub NiFi, a container running NiFi deployed by snapshot sender. It listens for files on port 8091 at path /DA. The files it receives are saved to the S3 bucket

stub-hdfs-****
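
A quick way to confirm delivery in a test is to list the stub bucket after a run. The bucket name and prefix below are placeholders, as the real names are environment-specific:

```python
# Hypothetical check - bucket name and prefix are placeholders.
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="stub-hdfs-example", Prefix="example/dataset/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```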

Production Running

SFT sends files to an SDX F5 VIP, which receives the files and forwards them on. Authentication is established via TLS; the certificates required by SFT are defined here.

Java keystore and truststore

These are created within the sft-agent entrypoint. The agent config is updated with the keystore/truststore passwords and paths. Importantly, the private key password and the keystore password have to be the same, because SFT runs on a Tomcat version with this 'requirement'.

BYOK Encryption

Due to the nature of the data being transferred, there is a requirement to have a specific type of encryption on the EBS volumes where we store some of the DWX data.
Our EBS volumes must be encrypted with an external CMK generated by the Security Operations team.

For this implementation we have created an external KMS key. We manually download the wrappers and tokens for each key and send the wrappers to the relevant people in the Security Operations team, who use the wrapper to wrap a key generated via their HSM.
The wrapped key material is then manually uploaded, alongside the token, via breakglass to each environment.

All the manual steps are done in the AWS console in the KMS section.
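
For reference, the console steps correspond roughly to the following KMS API flow, sketched here with boto3. The key ARN and file name are placeholders, and the wrapping itself is performed offline by the Security Operations team's HSM.

```python
# Rough sketch of what the manual console steps correspond to in the KMS API.
import boto3

kms = boto3.client("kms")
KEY_ID = "arn:aws:kms:eu-west-2:123456789012:key/placeholder"  # CMK created with EXTERNAL origin

# Download the wrapping public key ("wrapper") and import token for the key
params = kms.get_parameters_for_import(
    KeyId=KEY_ID,
    WrappingAlgorithm="RSAES_OAEP_SHA_256",
    WrappingKeySpec="RSA_2048",
)
public_key = params["PublicKey"]     # sent to Security Operations to wrap their HSM-generated key
import_token = params["ImportToken"]

# Once the wrapped key material comes back, import it alongside the token
with open("wrapped_key_material.bin", "rb") as f:  # placeholder file name
    kms.import_key_material(
        KeyId=KEY_ID,
        ImportToken=import_token,
        EncryptedKeyMaterial=f.read(),
        ExpirationModel="KEY_MATERIAL_DOES_NOT_EXPIRE",
    )
```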

Now that the key is uploaded, we can use this external KMS key to encrypt the EBS volumes.

Some more information is available in our common wiki