Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Logical Destruction from Drift Attempts to "Physically" Destroy Machine Components #203

Open
sempervictus opened this issue Jul 2, 2024 · 2 comments

Comments

@sempervictus
Copy link

sempervictus commented Jul 2, 2024

Dynamic dispatch of resource such as NICs can result in destruction of NICs when MaaSMAAS re-commissions a machine drifting state such as to have the resource fall out of scope:

data "maas_network_interface_physical" "metal-1" {
  for_each = maas_machine.metal-1
  machine = each.value.id
  name    = "enp3s0"
}
resource "maas_network_interface_link" "metal-1" {
  for_each          = data.maas_network_interface_physical.metal-1
  machine           = each.value.machine
  network_interface = each.value.id
  subnet            = maas_subnet.prov_subnet_1.id
  mode              = "STATIC"
  ip_address        = cidrhost(maas_subnet.prov_subnet_1.cidr, 3 + index(keys(data.maas_network_interface_physical.metal-1), each.key))
  default_gateway   = true
}

but if the firmware update just completed or a k-commandline option changes that NIC name, it will cause the resource to be destroyed. This creates a death spiral because the destruction happens not just in tf state, but in MaaSMAAS itself - the nic is deleted from the machine requiring a re-commission to regain awareness of it. This seems like terminally incorrect behavior as TF cannot delete disks or NICs from a physical server or VM given to MaaS at this level of interface (in our case, since we're using a TF openstack provider it could trim them form the underlying instance, but the maas provider cannot). Other resources then depending on this resource get deleted and the provider loses state sync causing this disk dispatch block with the network one above:

data "maas_block_device" "metal-1" {
  for_each =  { for entry in local.disk_map_1: "${entry.machine}.${entry.disk}" => entry }
  name = each.value.disk
  machine = each.value.machine
}
resource "maas_block_device" "metal-1-os" {
  for_each       = { 
    for k,v in data.maas_block_device.metal-1: k => v if v.size_gigabytes != 128 && contains(flatten([
      for p in v.partitions: [
        p.mount_point
      ]
    ]),"/")
  }
  machine        = each.value.machine
  name           = each.value.name
  id_path        = each.value.id_path
  size_gigabytes = each.value.size_gigabytes
  is_boot_device = true
  tags = [
    "os_drive",
  ]
  partitions {
    size_gigabytes = 32
    fs_type        = "ext4"
    label          = "os"
    mount_point    = "/"
  }
  partitions {
    size_gigabytes = 1
    fs_type        = "fat32"
    mount_point    = "/boot"
    bootable       = true
  }
}

to correctly identify OS drives after a commissioning pass but lose track of other disks and crash the action:

OpenTofu used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create
  ~ update in-place

OpenTofu planned the following actions, but then encountered a problem:

  # maas_block_device.metal-1-os["test-metal1-00.sdg"] will be created
  + resource "maas_block_device" "metal-1-os" {
      + block_size     = 512
      + id             = (known after apply)
      + id_path        = "/dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_f150670d-9df4-440e-855f-07eb00906be1"
      + is_boot_device = true
      + machine        = "fwtq6w"
      + model          = (known after apply)
      + name           = "sdg"
      + path           = (known after apply)
      + serial         = (known after apply)
      + size_gigabytes = 64
      + tags           = [
          + "os_drive",
        ]
      + uuid           = (known after apply)

      + partitions {
          + fs_type        = "ext4"
          + label          = "os"
          + mount_point    = "/"
          + path           = (known after apply)
          + size_gigabytes = 32
        }
      + partitions {
          + bootable       = true
          + fs_type        = "fat32"
          + mount_point    = "/boot"
          + path           = (known after apply)
          + size_gigabytes = 1
        }
    }

  # maas_block_device.metal-1-os["test-metal1-01.sdg"] will be updated in-place
  ~ resource "maas_block_device" "metal-1-os" {
        id             = "20"
        name           = "sdg"
      ~ tags           = [
          - "1rpm",
          - "rotary",
          + "os_drive",
        ]
        # (8 unchanged attributes hidden)

      ~ partitions {
          ~ label          = "root" -> "os"
          ~ size_gigabytes = 63 -> 32
            tags           = []
            # (4 unchanged attributes hidden)
        }
      + partitions {
          + bootable       = true
          + fs_type        = "fat32"
          + mount_point    = "/boot"
          + size_gigabytes = 1
        }
    }

  # maas_block_device.metal-1-os["test-metal1-02.sdg"] will be created
  + resource "maas_block_device" "metal-1-os" {
      + block_size     = 512
      + id             = (known after apply)
      + id_path        = "/dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_4a7e675b-d4e8-4f9e-9a4c-7c1b8902c3c8"
      + is_boot_device = true
      + machine        = "kherdk"
      + model          = (known after apply)
      + name           = "sdg"
      + path           = (known after apply)
      + serial         = (known after apply)
      + size_gigabytes = 64
      + tags           = [
          + "os_drive",
        ]
      + uuid           = (known after apply)

      + partitions {
          + fs_type        = "ext4"
          + label          = "os"
          + mount_point    = "/"
          + path           = (known after apply)
          + size_gigabytes = 32
        }
      + partitions {
          + bootable       = true
          + fs_type        = "fat32"
          + mount_point    = "/boot"
          + path           = (known after apply)
          + size_gigabytes = 1
        }
    }

  # maas_block_device.metal-1-os["test-metal1-03.sdg"] will be created
  + resource "maas_block_device" "metal-1-os" {
      + block_size     = 512
      + id             = (known after apply)
      + id_path        = "/dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_d144595b-1881-416f-8797-63f102c731e4"
      + is_boot_device = true
      + machine        = "yxsy7e"
      + model          = (known after apply)
      + name           = "sdg"
      + path           = (known after apply)
      + serial         = (known after apply)
      + size_gigabytes = 64
      + tags           = [
          + "os_drive",
        ]
      + uuid           = (known after apply)

      + partitions {
          + fs_type        = "ext4"
          + label          = "os"
          + mount_point    = "/"
          + path           = (known after apply)
          + size_gigabytes = 32
        }
      + partitions {
          + bootable       = true
          + fs_type        = "fat32"
          + mount_point    = "/boot"
          + path           = (known after apply)
          + size_gigabytes = 1
        }
    }

Plan: 3 to add, 1 to change, 0 to destroy.

....
│ Error: ServerError: 404 Not Found (No BlockDevice matches the given query.)
│ 
│   with maas_block_device.metal-1-data["test-metal1-02.sde"],
│   on maas-blockdev.tf line 54, in resource "maas_block_device" "metal-1-data":
│   54: resource "maas_block_device" "metal-1-data" {
│ 
╵
╷
│ Error: cannot find link (32) on the network interface (12) from machine (kherdk)
│ 
│   with maas_network_interface_link.metal-1["test-metal1-02"],
│   on maas-machines.tf line 16, in resource "maas_network_interface_link" "metal-1":
│   16: resource "maas_network_interface_link" "metal-1" {
│ 
╵
...

bottom line being that physical components should not be capable of being destroyed - disks, NICs, BMCs. Logical partitions, VLANs, etc ... of course; but this logic bug messes up reference tracking and brings down the house of cards pretty hard requiring state rm and other bad behaviors to remedy.

Copy link

github-actions bot commented Aug 4, 2024

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Aug 4, 2024
@troyanov troyanov removed the stale label Aug 4, 2024
Copy link

github-actions bot commented Sep 4, 2024

This issue is stale because it has been open for 30 days with no activity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants