Updating Deployments

Once HPCBOX has been deployed from Terraform templates through Google Cloud's Infrastructure Manager, any of the following changes must be made by carefully editing the existing deployment files and redeploying, so that resources previously created by the deployment are not accidentally deleted.

tip

If you have HPCBOX Premium Support, our support team will be happy to assist with these changes.

Adding new compute worker types

The initial deployment of HPCBOX includes one compute worker type called compute-worker. Generally, all compute workers of a type should use the same image SKU for best performance. To add a compute worker type with a different image SKU, proceed as follows:

  • Assume we want to create a new compute worker of type compute-worker-2 and we want to use a SKU of type h3-standard-88 for this worker class.
  • Locate your existing HPCBOX cluster deployment within Google Cloud Infrastructure Manager
  • The Overview section of the deployment will have a link to a Google Cloud Storage Bucket which has the current state of the Terraform deployment.
  • Download all three files (main.tf, variables.tf and setup_gcp.sh) from the bucket.
info

Variables associated with the compute worker type compute-worker are prefixed with compute-worker. The new variables for the compute worker type compute-worker-2 are prefixed with compute-worker-2.

  • Add the following variables to variables.tf for the new compute worker class compute-worker-2.

    • compute-worker-2-imageSku
    variable "compute-worker-2-imageSku" {
    description = "The instance type SKU for the compute worker-2"
    type = string
    default = "h3-standard-88"
    }
    • compute-worker-2-numLocalSSDs - Set the default value to 0 if the SKU does not support local disks.
    variable "compute-worker-2-numLocalSSDs" {
    description = <<EOF
    "Number of local SSD disks that should be combined in RAID 0 for local scratch. Note that not all computeWorkerImageSku values support local SSDs.
    Please refer to the Google Cloud docs for supported VMs at https://cloud.google.com/compute/docs/disks/local-ssd#machine-types to
    find the machines which accept local SSDs and the minimum number for each."
    EOF
    type = number
    default = 0
    }
    • compute-worker-2-placementMaxDistance
    variable "compute-worker-2-placementMaxDistance"{
    description = <<EOF
    Max distance to be considered when creating compute workers for low latency between them. Please refer to
    https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_resource_policy#nested_group_placement_policy
    and https://cloud.google.com/compute/docs/instances/placement-policies-overview
    EOF
    type = number
    default = 1
    }
    • compute-worker-2-set
    variable "compute-worker-2-set"{
    description = "Set of Compute Workers of type compute-worker-2 to deploy"
    type = set(string)
    default = []
    }

  • Modify the value of the variable workerTypes to include an entry for the new compute worker type compute-worker-2. As an example, the value would be something like the following:

variable "workerTypes" {
description = <<EOF
"Worker types available on this HPCBOX Cluster. Format is <worker-role>//<worker-sku>//<suffix to use for naming>"
EOF
type = string
default = "login-worker//n1-standard-4//ln,compute-worker//c3d-standard-30-lssd//cn,compute-worker-2//h3-standard-88//cn-2,gpu-worker//n1-standard-4//gn,cuda-worker//a2-highgpu-1g//ccn"
}
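
The entry format can also be checked mechanically before editing variables.tf. The following is a minimal shell sketch and not part of HPCBOX: append_worker_type is a hypothetical helper that appends a <worker-role>//<worker-sku>//<suffix> entry to an existing workerTypes value and refuses to add a role that is already listed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with HPCBOX): build a new workerTypes value.
# Usage: append_worker_type "<current value>" <role> <sku> <suffix>
append_worker_type() {
  local current="$1" role="$2" sku="$3" suffix="$4"
  local entry="${role}//${sku}//${suffix}"
  # Wrap the value in commas so every entry can be matched as ",<role>//".
  case ",${current}," in
    *",${role}//"*)
      echo "worker role '${role}' is already listed" >&2
      return 1
      ;;
  esac
  if [ -z "$current" ]; then
    printf '%s\n' "$entry"
  else
    printf '%s,%s\n' "$current" "$entry"
  fi
}

# Example with two of the worker types from the default value above.
append_worker_type "login-worker//n1-standard-4//ln,compute-worker//c3d-standard-30-lssd//cn" \
  compute-worker-2 h3-standard-88 cn-2
```

The printed string can then be pasted into the default of workerTypes.
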
  • Edit the template file main.tf and create a new section for compute-worker-2 by duplicating the existing one for compute-worker. Make sure you name the resources correctly. As an example, a section is reproduced below for your convenience.
// Start of COMPUTE-WORKER-2 TF CODE

resource "google_compute_resource_policy" "compute-worker-2-placementPolicy" {
provider = google-beta
project = var.project_id
name = "${var.goog_cm_deployment_name}-compute-worker-2-placement"
region = var.region
group_placement_policy {
max_distance = var.compute-worker-2-placementMaxDistance
// Since we want low latency for the compute workers
collocation = "COLLOCATED"
}

}

resource "google_compute_instance" "compute_workers_2" {
provider = google-beta
for_each = var.compute-worker-2-set
name = each.value
zone = var.zone
machine_type = var.compute-worker-2-imageSku
tags = [
format("%s-hpcbox-worker-no-ip",var.goog_cm_deployment_name)
]
boot_disk {
device_name = "boot"
auto_delete = true
initialize_params {
image = data.google_compute_image.compute_node_image.self_link
}
}
dynamic "scratch_disk" {
for_each = range(var.compute-worker-2-numLocalSSDs)
content {
interface = "NVME"
}
}
resource_policies = google_compute_resource_policy.compute-worker-2-placementPolicy[*].self_link

scheduling {
automatic_restart = true
preemptible = false
on_host_maintenance = "TERMINATE"
}

advanced_machine_features {
threads_per_core = 1
}
network_interface {
nic_type = "GVNIC"
//subnetwork = "regions/${var.subnetRegion}/subnetworks/${var.subnetName}"
//subnetwork = data.google_compute_subnetwork.hpcbox-subnet.self_link
subnetwork = local.subnetSelfLink
}
metadata = {
startup-script = <<-EOT
#!/bin/bash
echo "USERNAME=${var.adminUserName};CLUSTER=${var.goog_cm_deployment_name};EXTERNAL_FS=${var.externalFS} ;USE_NIS=1;FLAVOR=${var.flavor}" > /root/drz/artifacts/hpcbox-config.sh && rm -f /etc/sudoers.d/google_sudoers && touch /root/drz/artifacts/hpcbox-config-start
EOT
displayName = "WorkerVirtualMachines"
hpcbox-cluster = var.goog_cm_deployment_name
hpcbox-cluster-tag = var.clusterTag
VM_ROLE = "compute-worker-2//${var.compute-worker-2-imageSku}//cn-2"
enable-oslogin = var.enableCloudLogin
enableCloudLogin = var.enableCloudLogin
#We will use NIS from the master or os-login in GCP
block-project-ssh-keys = "TRUE"
}

depends_on = [
google_compute_route.hpcbox-head-node-compute-route,
google_compute_resource_policy.compute-worker-2-placementPolicy
]
}

// END OF COMPUTE-WORKER-2 TF CODE
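
When duplicating the compute-worker section, it is easy to leave a reference to the original resources behind. One rough way to catch this, assuming the new section is delimited by the Start/END marker comments as in the example above, is to scan it for compute-worker or compute_workers names that lack the 2 suffix. The function name find_unrenamed is hypothetical, and the check is only a heuristic; it does not replace reviewing the resulting plan.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read Terraform text on stdin and print every line that
# still mentions compute-worker / compute_workers without a -2 / _2 suffix.
# Line numbers are relative to the text that was piped in.
find_unrenamed() {
  awk '{
    line = $0
    # Drop the correctly renamed identifiers, then look for leftovers.
    gsub(/compute[-_]workers?[-_]2/, "", line)
    if (line ~ /compute[-_]workers?/) print NR ": " $0
  }'
}

# Example: extract the new section from main.tf and scan it.
# sed -n '/Start of COMPUTE-WORKER-2/,/END OF COMPUTE-WORKER-2/p' main.tf | find_unrenamed
```

An empty result means no unsuffixed names were found in the section.
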

  • Once you've edited the templates, use gcloud to apply the changes and update the existing HPCBOX cluster deployment. For example, the following commands update the deployment created during the initial deployment:
gcloud config set project hpcbox-003
gcloud auth login
gcloud infra-manager deployments apply projects/hpcbox-003/locations/us-central1/deployments/hpcbox-cls-007 --service-account projects/hpcbox-003/serviceAccounts/hpcbox-001-infra-manager@hpcbox-003.iam.gserviceaccount.com --local-source="." --inputs-file=.\variables.tf --location us-central1
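
The resource names passed to gcloud follow a fixed pattern, so the command can be assembled from a few values. The sketch below only prints the command for review and does not run it. The function name is hypothetical, and the project, location, deployment, and service account values are the example ones used above. It writes the inputs file as ./variables.tf, because in a POSIX shell the backslash in an unquoted .\variables.tf is consumed as an escape character.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) the apply command for a deployment.
# Usage: print_apply_cmd <project> <location> <deployment> <service-account-email>
print_apply_cmd() {
  local project="$1" location="$2" deployment="$3" sa_email="$4"
  printf 'gcloud infra-manager deployments apply %s --service-account %s --local-source="." --inputs-file=./variables.tf --location %s\n' \
    "projects/${project}/locations/${location}/deployments/${deployment}" \
    "projects/${project}/serviceAccounts/${sa_email}" \
    "$location"
}

print_apply_cmd hpcbox-003 us-central1 hpcbox-cls-007 \
  hpcbox-001-infra-manager@hpcbox-003.iam.gserviceaccount.com
```
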

Updating disk sizes

The initial deployment of HPCBOX includes one data disk and one apps disk. To review the current size of these disks, execute df -Th. The output shows the devices associated with the /data and /opt/drz mount points. As an example, the output below shows a data disk of size 1000GB and an apps disk of size 512GB.

df -Th

/dev/sdb1 xfs 1000G 7.1G 993G 1% /data
/dev/sdc1 xfs 512G 6.2G 506G 2% /opt/drz

info

Note that the disks can only be increased in size; they cannot be shrunk.

Increase size of the data and/or apps disk

The data disk hosts the HOME directories of the users on your HPCBOX cluster. The apps disk is where all the applications you use on the HPCBOX cluster are installed. Follow the instructions below to increase the size of the data and/or apps disk.

warning

Make sure there are no running jobs and all worker nodes are in a powered-off state before proceeding. You may also choose to snapshot the disks as a precaution.

  • Locate your existing HPCBOX cluster deployment within Google Cloud Infrastructure Manager
  • The Overview section of the deployment will have a link to a Google Cloud Storage Bucket which has the current state of the Terraform deployment.
  • Download all three files (main.tf, variables.tf and setup_gcp.sh) from the bucket.
  • Change the value of the variables appsDiskSize and/or dataDiskSize to the new desired size.
  • Once you've edited the templates, use gcloud to apply the changes and update the existing HPCBOX cluster deployment. For example, the following commands update the deployment created during the initial deployment:

gcloud config set project hpcbox-003
gcloud auth login
gcloud infra-manager deployments apply projects/hpcbox-003/locations/us-central1/deployments/hpcbox-cls-007 --service-account projects/hpcbox-003/serviceAccounts/hpcbox-001-infra-manager@hpcbox-003.iam.gserviceaccount.com --local-source="." --inputs-file=.\variables.tf --location us-central1
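
Because the disks can only grow, it can be worth checking the new value before redeploying. The following is a minimal sketch, assuming both sizes are whole numbers in the same unit; check_new_size is a hypothetical helper and does not read variables.tf itself.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: succeed only if the new size is larger.
# Usage: check_new_size <current_size> <new_size>   (integers, same unit)
check_new_size() {
  local current="$1" new="$2"
  if [ "$new" -gt "$current" ]; then
    echo "ok: ${current} -> ${new}"
  else
    echo "refusing: new size ${new} is not larger than current size ${current}" >&2
    return 1
  fi
}

# Example: growing the 512GB apps disk from the df output above to 1024GB.
check_new_size 512 1024
```
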

Extend the file system

Once the deployment is complete, the data and/or apps disk is extended to the new size. However, the file system on the device still needs to be expanded.

  • Use the command df -Th to identify the devices associated with the /data and the /opt/drz mount points. As an example, the output below shows the data disk is on device /dev/sdb and the apps disk on the device /dev/sdc.
df -Th

/dev/sdb1 xfs 1000G 7.1G 993G 1% /data
/dev/sdc1 xfs 512G 6.2G 506G 2% /opt/drz
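
The device lookup can be scripted. The following sketch reads df -Th output on standard input and prints the partition for a mount point, then derives the parent disk. The function names are hypothetical, and stripping trailing digits to get the disk assumes sdX-style device names as in the example above; it would not work for NVMe-style names such as nvme0n1p1.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for locating the device behind a mount point.

# Print the partition (first df column) whose mount point (last column) is $1.
partition_for_mount() {
  awk -v mnt="$1" '$NF == mnt { print $1 }'
}

# Strip the trailing partition number: /dev/sdc1 -> /dev/sdc (sdX-style names only).
parent_disk() {
  printf '%s\n' "$1" | sed 's/[0-9]*$//'
}

# Example on a live system:
# part="$(df -Th | partition_for_mount /opt/drz)"
# parent_disk "$part"
```
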

  • To expand the selected partition, use the parted command after executing sudo -i as the Administrator user on the head/management node. The example below shows how to expand the partition on the device /dev/sdc.
parted /dev/sdc

(parted) resizepart

Partition number? 1

Warning: Partition /dev/sdc1 is being used. Are you sure you want to continue?

Yes/No? Yes

End? [550GB]? 100%

(parted) quit
Information: You may need to update /etc/fstab.

partprobe /dev/sdc

  • To grow the file system on the expanded partition, run xfs_growfs -d /data for /data, or xfs_growfs -d /opt/drz for /opt/drz.
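
The final step can be collected into a small script. This is a sketch under the assumption that the partitions have already been resized as described above and that it is run as root on the head/management node. The run and grow_xfs_mounts names and the DRY_RUN variable are hypothetical. With DRY_RUN left at its default of 1, the commands are only printed.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: print the command when DRY_RUN=1 (default), else run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Grow the XFS file system on each mount point, then show the new size.
grow_xfs_mounts() {
  local mnt
  for mnt in "$@"; do
    run xfs_growfs -d "$mnt"
    run df -Th "$mnt"
  done
}

# Prints the four commands; set DRY_RUN=0 to execute them.
grow_xfs_mounts /data /opt/drz
```
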

Destroying the HPCBOX cluster

danger

Destroying the HPCBOX cluster will delete all your files on the cluster. Make sure you have transferred them before proceeding.

tip

The HPCBOX cluster is always deployed with Terraform settings:

 lifecycle {
prevent_destroy = true
}

for the apps disk, the data disk, and the head node.

To destroy the entire HPCBOX cluster, first delete all the worker, login, and head/management nodes using the Google Cloud console, then delete the disks associated with the head/management node. Once this is complete, delete the entire Terraform deployment associated with the HPCBOX cluster using Google Cloud's Infrastructure Manager.