
Developer Best Practices

Sarah French edited this page Nov 27, 2024 · 8 revisions


Authentication

It's recommended that you configure your setup to run with "user credentials" or "application default credentials" (ADCs) by default, running as your GCP user account. You can do that by omitting any other form of credential and by setting the GOOGLE_USE_DEFAULT_CREDENTIALS=TRUE environment variable, so that you can run tests locally with that account.

This means that during development you'll discover which APIs do not allow ADCs, and you'll be able to document as much on the relevant resources. User project overrides will handle most APIs correctly, and you can run as a service account again by configuring impersonation: specify GOOGLE_IMPERSONATE_SERVICE_ACCOUNT={{email}} for a service account on which you have granted yourself roles/iam.serviceAccountTokenCreator.

Note: This must be granted on the service account itself, and will not be inherited from roles/owner on the project. You can grant it with the following snippet:

variable "service_account_email" {
  type = string
}

variable "service_account_project" {
  type = string
}

variable "user_email" {
  type = string
}

resource "google_service_account_iam_member" "add-impersonation-permissions" {
  service_account_id = "projects/${var.service_account_project}/serviceAccounts/${var.service_account_email}"

  member             = "user:${var.user_email}"
  role               = "roles/iam.serviceAccountTokenCreator"
}

You can verify you've authenticated as your user credentials through gcloud by finding Authenticating using DefaultClient... in your provider logs with TF_LOG=DEBUG.
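As a sketch, the environment for this workflow might look like the following (the service account email is illustrative):

```shell
# Run the provider and its tests as your user account via ADCs.
export GOOGLE_USE_DEFAULT_CREDENTIALS=TRUE

# Optional: switch back to running as a service account via impersonation
# (assumes you hold roles/iam.serviceAccountTokenCreator on it).
export GOOGLE_IMPERSONATE_SERVICE_ACCOUNT="test-runner@my-project.iam.gserviceaccount.com"

# Then confirm the auth path in the debug logs, e.g.:
#   TF_LOG=DEBUG terraform plan 2>&1 | grep "Authenticating using DefaultClient"
```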

Self Links, Names, and IDs

Resources in GCP often reference other resources. For example, creating a subnetwork requires specifying a network that it belongs to. There are three ways that this reference may occur:

  • ID: A resource's ID should be something that can be used to reference this resource from a different one. For example, a subnetwork should be able to reference a network by specifying network = google_compute_network.foo.id. In past versions of the provider, we treated the ID as an implementation detail and preferred our users not depend on it. Starting in version 3.0.0, we have reversed that stance and now prefer that the ID field be in a format such that other resources can reference it. This also corresponds to the Resource names AIP.
  • name: Newer GCP APIs tend to use the full Resource name in the name field. To make it easier on our users, TPG resources that correspond to these APIs should accept a short name (the resource ID as defined in the AIP) in the name field if that field is settable. Examples and tests should not have resources referencing the name field of a different resource; they should use the id instead (unless a resource only accepts a reference to the short name, which is rare but does occur).
  • self_link: Many older GCP APIs only accept and return a short name in their name field, and return the full resource path (the name as defined in the AIP) in the self_link field. Referencing these resources either by id or self_link is acceptable, and should be chosen in examples/tests based on consistency with the rest of the example or test. For new resources, referencing the id field is preferred. Some resources that do not have a self_link field in their GCP API have one in their Terraform representation, such as google_kms_crypto_key. This is legacy behavior and should be avoided for new resources.
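For illustration, the preferred id-based reference pattern looks like this (resource names and values are made up):

```hcl
resource "google_compute_network" "foo" {
  name                    = "my-network"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "bar" {
  name          = "my-subnetwork"
  region        = "us-central1"
  ip_cidr_range = "10.2.0.0/16"

  # Preferred: reference the network by id. For older resources,
  # self_link is also acceptable if it keeps the example consistent.
  network = google_compute_network.foo.id
}
```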

Default values

We have three options for how to handle fields that have default values that are returned by the API:

  • Set a default value in the schema: This allows users to see in their terraform plan exactly what will be created, and thus is the preferred option when possible. As of now, lists, sets, and nested objects cannot use schema-side defaults (though primitive fields within nested objects can). Using this does mean that our resource code gets out-of-date when API-side defaults change, but it also means that we get to control when they change on the resource.
  • Set the field as Optional+Computed: This is handy when a field, if unset, will have a value returned by the API, but we don't know what that value is. This option is also necessary for data types that can't use schema-side defaults.
  • Make the field required: If setting a default for a field would mean that it is possible for a user to set an empty block, we should instead make the field required, even if it isn't required by the API.

Defaults that aren't returned by the API

Some resources have fields that when set to their default value are not returned by the API. If a field has a default that is set in the schema, but the API does not return it when set to that value, make sure it gets set in state by using the default_if_empty flattener for generated resources (and something similar for hand-written ones).

Optional+Computed lists of objects

Setting a list of objects as Optional+Computed means that if the object is not set in a user's config, it will be seen as "Computed", and not show a diff. This means that for an Optional+Computed list of objects, it is impossible for a user to explicitly state that the list should be empty. Setting schema.SchemaConfigModeAttr on the list makes this possible, but is considered legacy behavior that should be avoided when possible.

Tests

Acceptance Tests

All resources must have acceptance tests. One common pattern is as follows:

  • TestAcc[ResourceName]_basic: This test should be the simplest possible test for this resource, and will generally set only required fields on the resource.
  • TestAcc[ResourceName]_full: This test should set as many fields as possible on the resource.
  • TestAcc[ResourceName]_update: This test will have a TestStep to create the resource and another to update it.

Every field that can be set on the resource should be set in at least one test, and every field that can be updated on the resource should be updated in at least one test. A single test may test setting/updating many fields; this approach is recommended to cut down on total test time. Because APIs have different behavior for setting empty values (whether it means to not update the field or to update it to empty), it is helpful to test that fields can be updated to their zero value.

Resources that are generated by Magic Modules will have tests that match the examples in the documentation. Because of this, it is also useful to have tests that correspond to common usage patterns of the resource.

Test Checks

Nearly every resource in the provider supports import, and so we can check that a resource was successfully created using ImportState and ImportStateVerify checks. These tests work in the following manner:

  • In the Config step, a resource is created and its representation is written to state.
  • In the ImportState step, terraform import is run (using the resource's id attribute as the import id), and that result is written to a separate state.
  • ImportStateVerify checks that the two states are exactly the same.

Import-style tests mean that we don't have to check attributes individually on the resource, nor do we need custom logic to call an API and determine whether a resource exists. This manner of testing is expected for all resources except in exceptional circumstances.

ImportStateVerifyIgnore

Not all attributes on a resource are returned from a resource's GET call, and thus can't be imported. For example, secrets are often returned as empty, and we sometimes add virtual fields to a resource that are not part of its API to control behavior. In these circumstances, the field can be ignored in the test using ImportStateVerifyIgnore. This is also useful for fields that have DiffSuppressFuncs that can't be worked around by providing a different value in the test.

ImportStateId / ImportStateIdPrefix

Not all resources have an id that can be used as the import id. In these cases, it is best to attempt to make it possible to import the resource using its id, but if that doesn't make sense or makes for a bad UX, the import id can be specified in the test using ImportStateId or ImportStateIdPrefix. These fields are also useful to set to test that a resource can be imported with multiple different formats.

Computed Fields

Computed fields will always match between the two states that are compared, so import tests don't check that the value is set to an expected value (or set, period). To test computed fields, use the resource.TestCheckResourceAttr family of checks.

Sweepers

Certain circumstances (panics, failed deletes, etc.) will leave behind dangling resources because the post-test destroy action doesn't run successfully or is interrupted.

Running provider tests frequently leaves dangling test resources behind after such failures. Terraform has the capability to run Sweepers, which go through and delete these leftover resources. In TPG, sweepers mainly:

  1. List every resource in a project of a specific kind
  2. Iterate through the list and determine if a resource is sweepable
  3. If sweepable, delete the resource

Sweepers are run using the -sweep and -sweep-run TESTARGS flags:

make testacc TEST=./google/sweeper TESTARGS='-sweep=us-central1 -sweep-run=<sweeper-name-here>'

Sweepers can be added in the tests for a given resource by adding an init() function that contains a call to resource.AddTestSweepers(...). Sweepers should be added when possible.

Sweepers in CI

Sweepers run after the nightly tests finish. There are 'Service Sweeper' builds that sweep away all 'sweepable' resources in a given test project in GCP. There are sweepers for the GA nightly test project, the Beta nightly test project, and the VCR project (this similar-looking build is used to test changes to sweeper code from the upstream repo). Service sweepers sweep everything EXCEPT projects, which is achieved by setting the environment variable SKIP_PROJECT_SWEEPER=1.

Projects are swept by a special project sweeper build that only runs once all other builds have completed. This ensures that no running acceptance tests are affected by the sweeper.

Unit Tests

For the majority of our resources, the acceptance tests are sufficient for testing. However, some resources use utility functions or have more complicated business logic (such as CustomizeDiffs). These functions should be unit tested.

Fine-grained resources

Fine-grained resources are Terraform resources that are nested objects in other resources in their API representation. Generally, fine-grained resources are created because of user demand, though there are some patterns that make it likely that a resource should be fine-grained:

  • add[Field] / delete[Field] methods on the parent resource
  • Lists of objects that should not be authoritative (for example, having multiple google_project_service resources instead of a single google_project_services resource because entries can be added to the list of enabled APIs out-of-band)
  • Scenarios where a part of a resource might be modified by a different team than the team that created the resource

In general, whether or not to make a separate fine-grained resource for a nested object is subjective.

Validation

There are two places where a user's config might be validated by the provider: plan-time and apply-time.

Plan-time validation

Plan-time validation is accomplished through various fields on the field schema, such as ValidateFunc, [Min|Max]Items, ConflictsWith, AtLeastOneOf, and ExactlyOneOf, as well as CustomizeDiff on the resource as a whole. These functions give our users confidence that their plans will succeed, and thus are generally good to have when possible.

However, plan-time validation can sometimes bite our users, such as in cases where the API accepts new values that it didn't before (increasing the max length of a string, adding a new enum value, increasing the number of elements in a list, etc.) In these cases, users have to wait for new versions of the provider to be released that relax these validations, even though their configs should succeed. When adding validations, some judgment should be applied to decide whether that validation is likely to need to change in the future.
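A sketch of a plan-time validator; the signature matches the SDK's schema.SchemaValidateFunc, and the 63-character limit here is illustrative, exactly the kind of constraint that may need relaxing later if the API changes:

```go
package main

import "fmt"

// Returns a ValidateFunc-shaped closure enforcing a maximum string length.
func validateMaxLength(max int) func(v interface{}, k string) ([]string, []error) {
	return func(v interface{}, k string) (warnings []string, errs []error) {
		s, ok := v.(string)
		if !ok {
			return nil, []error{fmt.Errorf("expected %q to be a string", k)}
		}
		if len(s) > max {
			errs = append(errs, fmt.Errorf("%q may be at most %d characters, got %d", k, max, len(s)))
		}
		return warnings, errs
	}
}

func main() {
	_, errs := validateMaxLength(63)("my-instance", "name")
	fmt.Println(len(errs)) // 0
}
```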

Apply-time validation

Apply-time validation happens as business logic inside of a resource's CRUD methods. Because a bad config value would likely cause a 400 level error from the API anyway, these rarely add value and should generally be avoided. They should only be added when a plan-time validation is somehow impossible and the API provides an error message that is confusing or nonexistent.

Failed Concurrent Updates and Eventual Consistency

Terraform uses a dependency graph to avoid creating parent and child resources at the same time. However, sibling relationships or parent resources with transient states aren't typically covered by this. In addition, some resources are eventually consistent, where creation/update succeeds but the resource isn't ready, causing subsequent updates or dependent resource creation to fail.

This tends to appear for customers as some combination of the following:

  • An initial apply fails, but a second apply succeeds immediately after
  • Getting 404 or 403 errors immediately after creation (and, less commonly, update) succeeds
  • Errors with "operation in progress" messages involving the parent resource

Mutexes

Mutexes should be used when a parent resource does not support concurrent edits or several sibling resources cannot be added at the same time.

We use a mutex that is created globally across the Terraform instance. The developer for a resource specifies a key format that is shared across connected resources to prevent them from being edited (create/update/delete) at the same time.

Usage

In MMv1, this is done by adding a mutex property to the Terraform resource override:

NetworkEndpoint: !ruby/object:Overrides::Terraform::ResourceOverride
    mutex: 'networkEndpoint/{{project}}/{{zone}}/{{network_endpoint_group}}'

In Terraform (either generated or handwritten), this appears as a block in the Create/Update/Delete functions before any transaction of API requests:

    lockName, err := replaceVars(d, config, "networkEndpoint/{{project}}/{{zone}}/{{network_endpoint_group}}")
    if err != nil {
        return err
    }
    mutexKV.Lock(lockName)
    defer mutexKV.Unlock(lockName)

Example Resources
  • Container cluster and child node pool resources
  • Networks and children routes, network peerings
  • BigQuery datasets and accesses
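The global mutexKV referenced above can be sketched with the standard library as follows (simplified relative to the provider's implementation):

```go
package main

import (
	"fmt"
	"sync"
)

// A key-based mutex store: resources that render the same lock name (e.g. the
// mutex template above) serialize their create/update/delete calls.
type MutexKV struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func NewMutexKV() *MutexKV {
	return &MutexKV{locks: map[string]*sync.Mutex{}}
}

// get lazily creates the per-key mutex under the store-level lock.
func (m *MutexKV) get(key string) *sync.Mutex {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.locks[key]; !ok {
		m.locks[key] = &sync.Mutex{}
	}
	return m.locks[key]
}

func (m *MutexKV) Lock(key string)   { m.get(key).Lock() }
func (m *MutexKV) Unlock(key string) { m.get(key).Unlock() }

func main() {
	kv := NewMutexKV()
	lockName := "networkEndpoint/my-project/us-central1-a/my-neg"
	kv.Lock(lockName)
	defer kv.Unlock(lockName)
	fmt.Println("holding", lockName)
}
```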

Polling

Polling should be used when a read API state does not match the expected state right after an API call succeeds. For example:

  • Edit succeeded but resource is not ready, typically shown through a status like "Ready:False" or "UPDATING"
  • Reading a resource returns a 404 right after creation returns, due to cached or eventually consistent reads

Usage

In Terraform, we add a call to the PollingWaitTime util after the initial API call success. This takes as arguments:

  • A PollReadFunc, a function that handles reading the resource state. This should essentially perform the same API call as Read(), but skip setting state.
  • A PollCheckResponseFunc, a function that checks the resource state (obtained via the PollReadFunc) and determines whether we should stop polling with a success, stop polling with an error, or continue polling.

In MMv1, using a Provider::Terraform::PollAsync async block manages adding the necessary code (autogenerating the PollReadFunc and adding the call to PollingWaitTime in the appropriate place). Most of the remaining work is defining the PollCheckResponseFunc to use. For example, for a PubSub topic:

Topic: !ruby/object:Overrides::Terraform::ResourceOverride
  async: !ruby/object:Provider::Terraform::PollAsync
      check_response_func: PollCheckForExistence
      actions: ['create']
      operation: !ruby/object:Api::Async::Operation
        timeouts: !ruby/object:Api::Timeouts
          insert_minutes: 6
          update_minutes: 6
          delete_minutes: 4

This generates the appropriate code in Terraform, using the common util PollCheckForExistence to check whether the response is a 404.

Polling utils live in common_polling.go.


Error Retry Predicates

A more generic approach to transient errors in Terraform is request retries, used both to handle and retry temporary errors and to wait out unpredictable states. retry_utils.go contains the various retry handler utils. Typically, for eventually consistent resources, we create a custom error retry predicate function that determines whether a given error is retryable, and pass it as an additional retry predicate to retryTimeDuration(...) or sendRequest(...).

Error retry predicates can be added less manually via:

  • error_retry_predicates in Terraform resource overrides for MMv1
  • the default predicates for all requests (less likely for eventually consistent resources)

If the resource/API is mostly handwritten, you can also add it to the retry transport used in the API client as defined in config.go.


Adding Sleeps

The final non-ideal way of handling eventual consistency is by simply adding sleeps to specific resources. In some cases, we don't want to poll or can't actually poll for the appropriate state. Some examples:

  • Attempting to create a child resource right after the project was created - this manifests as several types of errors, some of which cannot be retried, and we can't poll on the expected child resource behavior from the parent (project) resource.
  • An IAM binding hasn't propagated, causing 403s that are impossible to differentiate from real 403 permission errors. We don't want to wait out the whole create/update timeout on 403s, both because of quota and because we should return early if the user needs to add a permission.