Tanagra has two sets of GCP requirements, for the indexer environment and for the service deployment. Indexer environments and service deployments are not one-to-one. You can have multiple indexer environments for a single service deployment and vice versa.
Once you have an environment or deployment set up, you need to set the relevant properties in the config files (e.g. GCP project id, BigQuery dataset id, GCS bucket name). Pointers to the relevant config properties are included in each section below.
An indexer environment is a GCP project configured with the items below.
- These APIs enabled. Some of these may be enabled by default in your project.
    - BigQuery `bigquery.googleapis.com`
    - Cloud Storage `storage.googleapis.com`
    - Dataflow `dataflow.googleapis.com`
- GCS bucket in the same location as the Dataflow locations. Multi-region `US` is not supported for Dataflow; pick `us-central1` instead. More details at supported locations. Create a folder within this bucket.
- "VM" service account with the below permissions to attach to the Dataflow worker VMs.
    - Read the source BigQuery dataset. `roles/bigquery.dataViewer` granted at the dataset-level (on the source dataset) includes the required permissions.
    - Create BigQuery jobs. `roles/bigquery.jobUser` granted at the project-level (on the indexer GCP project) includes the required permissions.
    - Write to the index BigQuery dataset. `roles/bigquery.dataOwner` granted at the dataset-level (on the index dataset) includes the required permissions.
    - Execute Dataflow work units. `roles/dataflow.worker` granted at the project-level (on the indexer GCP project) includes the required permissions.
- "Runner" end-user or service account with the below permissions to run indexing.
    - Read the source BigQuery dataset. `roles/bigquery.dataViewer` granted at the dataset-level (on the source dataset) includes the required permissions.
    - Create BigQuery jobs. `roles/bigquery.jobUser` granted at the project-level (on the indexer GCP project) includes the required permissions.
    - Create/delete and write to the index BigQuery dataset. `roles/bigquery.dataOwner` granted at the project-level (on the indexer GCP project) includes the required permissions.
    - Kick off Dataflow jobs. `roles/dataflow.admin` granted at the project-level (on the indexer GCP project) includes the required permissions.
    - Attach the "VM" service account credentials to the Dataflow worker VMs. `roles/iam.serviceAccountUser` granted at the service account-level (on the "VM" service account) includes the required permissions.
    - Rename files in the GCS bucket. `roles/storage.admin` granted at the GCS bucket level includes the required permissions.
- VPC network and subnet may be needed in your project.
    - Unless specified, Dataflow jobs use a VPC network named `default`. Create a VPC network by that name if not present.
    - Use auto creation mode to automatically create subnets in the network.
    - Enable Private Google Access in the subnet used by the Dataflow region.
You can use a single service account for both the "VM" and "runner" use cases, as long as it has all the permissions.
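As a rough illustration of the single-service-account case, the role lists above can be written down as data and compared against what an account has been granted. This is only a sketch: the scope labels are informal strings chosen for this example, and `missing_bindings` is a hypothetical helper, not part of Tanagra or of any GCP client library. It compares pairs literally and does not model IAM inheritance (for example, a project-level grant also applying to datasets in that project).

```python
# Role requirements transcribed from the lists above as (role, scope) pairs.
# The scope strings are informal labels used only in this sketch.
VM_ROLES = {
    ("roles/bigquery.dataViewer", "source dataset"),
    ("roles/bigquery.jobUser", "indexer project"),
    ("roles/bigquery.dataOwner", "index dataset"),
    ("roles/dataflow.worker", "indexer project"),
}

RUNNER_ROLES = {
    ("roles/bigquery.dataViewer", "source dataset"),
    ("roles/bigquery.jobUser", "indexer project"),
    ("roles/bigquery.dataOwner", "indexer project"),
    ("roles/dataflow.admin", "indexer project"),
    ("roles/iam.serviceAccountUser", "VM service account"),
    ("roles/storage.admin", "GCS bucket"),
}


def missing_bindings(granted, *use_cases):
    """Return the required (role, scope) pairs that are absent from `granted`.

    Hypothetical helper: a literal set comparison, with no IAM inheritance.
    """
    required = set().union(*use_cases)
    return sorted(required - set(granted))
```

Calling `missing_bindings(granted, VM_ROLES, RUNNER_ROLES)` with the bindings held by one account returns an empty list when that account can serve both use cases.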
All indexer configuration lives in the indexer config file, the properties of which are all documented here. In particular:
- Set the pointer to the index BigQuery dataset that you create with the "runner" credentials above.
- Set all the Dataflow properties.
    - A directory in the GCS bucket above as the Dataflow temp location.
    - The "VM" service account as the Dataflow service account email.
    - (Optional) The Dataflow VPC sub-network if you have a custom-mode network.
    - (Optional) The Dataflow use public IPs flag if you want to use Private Google Access.
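Before filling in the Dataflow properties, it can help to sanity-check the values against the bucket requirements stated earlier (same location as Dataflow, no multi-region `US`, temp location is a folder inside the bucket). The function below is a minimal sketch of such a check. Its name and parameters are assumptions made for this example and do not correspond to actual Tanagra config property names.

```python
def check_dataflow_settings(bucket_name, bucket_location, dataflow_region, temp_location):
    """Return a list of problems; an empty list means these checks passed.

    Hypothetical helper for illustration only.
    """
    problems = []
    # Bucket locations may be reported in upper case, so compare case-insensitively.
    bucket_loc = bucket_location.strip().lower()
    region = dataflow_region.strip().lower()
    if bucket_loc == "us":
        problems.append("multi-region US bucket is not supported for Dataflow")
    elif bucket_loc != region:
        problems.append(
            f"bucket location {bucket_loc!r} differs from Dataflow region {region!r}"
        )
    # The temp location should point at a folder inside the bucket, not the bucket root.
    prefix = f"gs://{bucket_name}/"
    folder = ""
    if temp_location.startswith(prefix):
        folder = temp_location[len(prefix):].strip("/")
    if not folder:
        problems.append(f"temp location must be a folder under {prefix}")
    return problems
```

For example, `check_dataflow_settings("my-bucket", "US-CENTRAL1", "us-central1", "gs://my-bucket/dataflow-temp/")` reports no problems, where `my-bucket` and `dataflow-temp` are placeholder names.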
A service deployment lives in a GCP project configured with the items below.
- Java service packaged as a runnable JAR file, deployed either in GKE or AppEngine.
- CloudSQL database, either PostgreSQL (recommended) or MySQL.
- BigQuery dataset for temporary tables, one per index dataset location.
- GCS bucket for export files, one per index dataset location.
    - Update the CORS configuration with any URLs that will need to read exported files. For example, if there is an export model that writes a file and redirects to another URL that will read the file, you will likely need to grant that URL permission to make `GET` requests for objects in the bucket. Example CORS configuration file: `[ { "origin": ["https://workbench.verily.com"], "method": ["GET"], "responseHeader": ["Content-Type"], "maxAgeSeconds": 3600 } ]`
    - The files stored in this bucket are available for either download to the user's computer or export to another configurable URL. It is recommended to configure the bucket to automatically delete objects after some expiration time. See lifecycle configuration.
- "Application" service account with the below permissions.
    - Read the source BigQuery dataset. `roles/bigquery.dataViewer` granted at the dataset-level (on the source dataset) includes the required permissions.
    - Read the index BigQuery dataset. `roles/bigquery.dataViewer` granted at the dataset-level (on the index dataset) includes the required permissions.
    - Create BigQuery jobs. `roles/bigquery.jobUser` granted at the project-level (on the service GCP project) includes the required permissions.
    - Create tables in the temporary tables BigQuery dataset. `roles/bigquery.dataEditor` granted at the dataset-level (on the temporary tables dataset) includes the required permissions.
    - Read and write files to the export bucket(s). `roles/storage.objectAdmin` granted at the bucket-level includes the required permissions.
    - Generate signed URLs for export files. `roles/iam.serviceAccountTokenCreator` granted at the service account-level (on itself) includes the required permissions.
    - Talk to the CloudSQL database. `roles/cloudsql.client` granted at the project-level includes the required permissions.
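For the export bucket described above, the CORS and lifecycle settings are both small JSON documents. The sketch below generates them so they can be kept under version control. The function names and the idea of a fixed expiration in days are assumptions for illustration; pick an expiration that matches how long users need their export files.

```python
import json


def cors_config(origins, max_age_seconds=3600):
    """Build a CORS configuration that lets the given origins GET objects."""
    return [
        {
            "origin": list(origins),
            "method": ["GET"],
            "responseHeader": ["Content-Type"],
            "maxAgeSeconds": max_age_seconds,
        }
    ]


def lifecycle_config(delete_after_days):
    """Build a lifecycle configuration that deletes objects older than N days."""
    return {
        "rule": [
            {"action": {"type": "Delete"}, "condition": {"age": delete_after_days}}
        ]
    }


def write_json(path, config):
    """Write a configuration to disk as indented JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
        f.write("\n")
```

After writing the two files, for instance with `write_json("cors.json", cors_config(["https://workbench.verily.com"]))` and `write_json("lifecycle.json", lifecycle_config(7))`, they can typically be applied with `gcloud storage buckets update gs://BUCKET --cors-file=cors.json --lifecycle-file=lifecycle.json`, where `BUCKET` is a placeholder and the 7-day expiration is only an example value.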
Service configuration lives in two places, depending on whether a setting applies to a single underlay or to the entire deployment.
Each underlay hosted by a service deployment has its own service config file. All service config file properties are documented here. In particular:
- Set the pointer to the index BigQuery dataset that you create with the indexer "runner" credentials above.
Each service deployment is a single Java application. You can configure this Java application with a custom `application.yaml` file, or (more commonly) override the default application properties with environment variables. All application properties are documented here. In particular:
- Set the temporary tables BigQuery dataset(s) above as the shared export datasets.
- Set the GCS bucket(s) above as the shared export buckets.
- Set the application DB properties with the CloudSQL information.
You can use the same GCP project for both the indexer environment and the service deployment. This is probably the default for getting up and running with Tanagra, and it may also be useful for automated testing or dev environments. For production services though, we recommend separating the indexer and service projects.
A single indexer environment can support multiple service deployments. There is one GCP project that is configured correctly for indexing (i.e. Dataflow is enabled, there are one or more service accounts with permissions to write to BigQuery and kick off Dataflow jobs, etc.). Multiple underlays or multiple versions of a single underlay are indexed in this project, each into its own index BigQuery dataset. You could use different service accounts for each source dataset.
Service deployments may then read/query these index datasets directly from the indexer environment project. Or you can "publish" (i.e. copy) the index datasets to another project (e.g. the service deployment project). The Verily test and dev service deployments both read directly from the indexer environment project. The AoU production service deployment will add the "publish" step.
Separating the indexer environment from the service deployment means you can avoid increasing permissions in the service deployment project (e.g. you don't need to enable Dataflow in your audited production project).
A single service deployment can host one or more underlays. Each underlay should have its own service configuration file. The service deployment configuration allows specifying multiple service configuration files.
Keep in mind that the access control implementation is per deployment, not per underlay. So if you want underlay-specific access control, then you should modify your access control implementation to have different behavior depending on which underlay is being used.
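One way to picture underlay-specific behavior inside a single, deployment-wide access control implementation is a dispatcher keyed by underlay name. The sketch below shows only the shape of that idea. The Tanagra service itself is a Java application, and every name here (`PerUnderlayAccessControl`, `is_authorized`, the underlay names) is hypothetical rather than Tanagra's actual access control interface.

```python
from typing import Callable, Dict

# Hypothetical check signature: (user_email, underlay_name) -> allowed?
AccessCheck = Callable[[str, str], bool]


class PerUnderlayAccessControl:
    """One deployment-wide object that delegates to a per-underlay check."""

    def __init__(self, checks: Dict[str, AccessCheck], default: AccessCheck):
        self._checks = dict(checks)
        self._default = default

    def is_authorized(self, user_email: str, underlay: str) -> bool:
        # Fall back to the default check for underlays without a specific rule.
        check = self._checks.get(underlay, self._default)
        return check(user_email, underlay)


def deny_all(user_email: str, underlay: str) -> bool:
    return False


def allow_all(user_email: str, underlay: str) -> bool:
    return True


def allow_domain(domain: str) -> AccessCheck:
    """Return a check that admits only users whose email is in `domain`."""
    suffix = "@" + domain
    return lambda user_email, underlay: user_email.endswith(suffix)
```

In this sketch a deployment hosting two underlays could be wired as `PerUnderlayAccessControl({"public_underlay": allow_all, "restricted_underlay": allow_domain("example.org")}, default=deny_all)`, so that unknown underlays are denied by default.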
The Verily test and dev service deployments both host multiple underlays.