Updating Deployments

Once HPCBOX has been deployed with Terraform templates through Google Cloud's Infrastructure Manager, each of the changes described below must be made by carefully editing the existing deployment files and redeploying, so that resources previously created by the deployment are not accidentally deleted.

tip

If you have HPCBOX Premium Support, our support team will be happy to assist with these changes.

Adding new compute worker types

The initial deployment of HPCBOX includes one compute worker type called compute-worker. Generally, all compute workers of a type should use the same image SKU for best performance. To add a compute worker type with a different image SKU, proceed as follows:

Assume we want to create a new compute worker type called compute-worker-2 that uses the SKU h3-standard-88.
Locate your existing HPCBOX cluster deployment within Google Cloud Infrastructure Manager.
The Overview section of the deployment has a link to the Google Cloud Storage bucket that holds the current state of the Terraform deployment.
Download all three files (main.tf, variables.tf and setup_gcp.sh) from the bucket.
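
The download step can also be done from the command line. The following is a minimal sketch, assuming the three files sit directly under one path in the bucket. The function name fetch_deployment_files, the DRY_RUN switch and the example bucket path are hypothetical; copy the real gs:// URI from the Overview section of your deployment.

```shell
#!/bin/sh
# Sketch: copy the three deployment files from the state bucket to a local folder.
# Assumption: the files live directly under the given gs:// path.
# With DRY_RUN=1 the commands are only printed, so they can be reviewed first.
fetch_deployment_files() {
  bucket_uri="$1"      # e.g. gs://example-bucket/some/path (hypothetical)
  dest="${2:-.}"       # local target directory, defaults to the current one
  for f in main.tf variables.tf setup_gcp.sh; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "gcloud storage cp ${bucket_uri}/${f} ${dest}/${f}"
    else
      gcloud storage cp "${bucket_uri}/${f}" "${dest}/${f}"
    fi
  done
}

# Example (not executed here):
#   DRY_RUN=1 fetch_deployment_files gs://example-bucket/some/path .
```

Keeping an untouched copy of the downloaded files makes it easier to compare your edits against the deployed state before you reapply.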

info

Variables associated with the compute worker type compute-worker are prefixed with compute-worker. The new variables for the compute worker type compute-worker-2 are prefixed with compute-worker-2.

Add the following variables to variables.tf for the new compute worker type compute-worker-2.

compute-worker-2-imageSku

variable "compute-worker-2-imageSku" {
   description = "The instance type SKU for compute-worker-2"
   type = string
   default = "h3-standard-88"       
}

compute-worker-2-numLocalSSDs - Set the default value to 0 if the SKU does not support local disks.

variable "compute-worker-2-numLocalSSDs"{
    description = <<EOF
    Number of local SSD disks to combine into RAID 0 for local scratch. Note that not all image SKUs support local SSDs.
    Please refer to the Google Cloud docs at https://cloud.google.com/compute/docs/disks/local-ssd#machine-types to
    find the machine types that accept local SSDs and the minimum number for each.
    EOF
    type = number
    default = 0
}

compute-worker-2-placementMaxDistance

variable "compute-worker-2-placementMaxDistance"{
    description = <<EOF
    Max distance to be considered when creating compute workers for low latency between them. Please refer to
    https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_resource_policy#nested_group_placement_policy
    and https://cloud.google.com/compute/docs/instances/placement-policies-overview
    EOF
    type = number
    default = 1
}

compute-worker-2-set

variable "compute-worker-2-set"{
    description = "Set of Compute Workers of type compute-worker-2 to deploy"
    type = set(string)
    default = []
}

Modify the value of the variable workerTypes to include an entry for the new compute worker type compute-worker-2. As an example, the value would look like the following:

variable "workerTypes" {
    description = <<EOF
    "Worker types available on this HPCBOX Cluster. Format is <worker-role>//<worker-sku>//<suffix to use for naming>"
    EOF
    type = string
    default = "login-worker//n1-standard-4//ln,compute-worker//c3d-standard-30-lssd//cn,compute-worker-2//h3-standard-88//cn-2,gpu-worker//n1-standard-4//gn,cuda-worker//a2-highgpu-1g//ccn"
}
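
Because workerTypes is a single comma-separated string, a misspelled role or a missing // separator is easy to overlook. The following is a minimal sketch of a sanity check that splits the string according to the format given in the variable description. The function names list_worker_types and has_worker_type are hypothetical helpers and are not part of HPCBOX.

```shell
#!/bin/sh
# Sketch: split a workerTypes string into one "role sku suffix" line per entry.
# Format assumed from the variable description:
#   <worker-role>//<worker-sku>//<suffix>, with entries separated by commas.
list_worker_types() {
  printf '%s\n' "$1" | tr ',' '\n' | awk -F'//' '
    NF == 3 { print $1, $2, $3; next }
    { print "malformed entry: " $0 > "/dev/stderr"; bad = 1 }
    END { exit bad }
  '
}

# Succeeds only if the given role appears as the first field of some entry.
has_worker_type() {
  list_worker_types "$1" | awk -v r="$2" '$1 == r { found = 1 } END { exit !found }'
}

# Example (not executed here):
#   has_worker_type "$WORKER_TYPES" compute-worker-2 && echo "entry present"
```

Running such a check on the edited value before redeploying catches naming mistakes without touching the cluster.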

Edit the template file main.tf and create a new section for compute-worker-2 by duplicating the existing one for compute-worker. Make sure you name the resources correctly. As an example, a section is reproduced below for your convenience.

// Start of COMPUTE-WORKER-2 TF CODE

resource "google_compute_resource_policy" "compute-worker-2-placementPolicy" {
  provider = google-beta
  project  = var.project_id  
  name = "${var.goog_cm_deployment_name}-compute-worker-2-placement"
  region = var.region
  group_placement_policy {
    max_distance = var.compute-worker-2-placementMaxDistance
    // Since we want low latency for the compute workers
    collocation = "COLLOCATED" 
  }

}

resource "google_compute_instance" "compute_workers_2" {
    provider = google-beta    
    for_each = var.compute-worker-2-set    
    name = each.value
    zone = var.zone
    machine_type = var.compute-worker-2-imageSku
    tags = [
      format("%s-hpcbox-worker-no-ip",var.goog_cm_deployment_name)
   ]
  boot_disk {
    device_name = "boot"
    auto_delete = true
    initialize_params {
      image = data.google_compute_image.compute_node_image.self_link
    }
  }
   dynamic "scratch_disk" {
    for_each = range(var.compute-worker-2-numLocalSSDs)
    content {
      interface = "NVME"
    }
  }
  resource_policies = google_compute_resource_policy.compute-worker-2-placementPolicy[*].self_link
  
  scheduling {
    automatic_restart = true
    preemptible = false
    on_host_maintenance = "TERMINATE"    
  }

  advanced_machine_features {
    threads_per_core = 1    
  }
  network_interface {
    nic_type = "GVNIC"
    //subnetwork = "regions/${var.subnetRegion}/subnetworks/${var.subnetName}"
    //subnetwork = data.google_compute_subnetwork.hpcbox-subnet.self_link
    subnetwork = local.subnetSelfLink
  }
  metadata = {
    startup-script = <<-EOT
#!/bin/bash
echo  "USERNAME=${var.adminUserName};CLUSTER=${var.goog_cm_deployment_name};EXTERNAL_FS=${var.externalFS} ;USE_NIS=1;FLAVOR=${var.flavor}" > /root/drz/artifacts/hpcbox-config.sh && rm -f /etc/sudoers.d/google_sudoers && touch /root/drz/artifacts/hpcbox-config-start 
EOT
    displayName = "WorkerVirtualMachines"
    hpcbox-cluster = var.goog_cm_deployment_name
    hpcbox-cluster-tag = var.clusterTag
    VM_ROLE = "compute-worker-2//${var.compute-worker-2-imageSku}//cn-2"
    enable-oslogin = var.enableCloudLogin
    enableCloudLogin = var.enableCloudLogin
    #We will use NIS from the master or os-login in GCP
    block-project-ssh-keys = "TRUE"
  }

  depends_on = [
    google_compute_route.hpcbox-head-node-compute-route,
    google_compute_resource_policy.compute-worker-2-placementPolicy
  ]
}

// END OF COMPUTE-WORKER-2 TF CODE

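Once you've edited the templates, use gcloud to apply the changes and update the existing HPCBOX cluster deployment. As an example, the following commands update the deployment we created during the initial deployment. The .\variables.tf path is written for a Windows shell; on Linux or macOS, use ./variables.tf instead.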

gcloud config set project hpcbox-003
gcloud auth login
gcloud infra-manager deployments apply projects/hpcbox-003/locations/us-central1/deployments/hpcbox-cls-007 --service-account projects/hpcbox-003/serviceAccounts/hpcbox-001-infra-manager@hpcbox-003.iam.gserviceaccount.com --local-source="." --inputs-file=.\variables.tf --location us-central1
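
The apply command is long, and the project, location and deployment name each appear more than once in it. The following is a minimal sketch that assembles the same command from its parts so the pieces stay consistent. The function name apply_cmd is hypothetical, the sketch assumes a POSIX shell (hence ./variables.tf), and it uses only the flags shown in the example above. It prints the command instead of running it.

```shell
#!/bin/sh
# Sketch: print the deployment update command built from its parts.
# Arguments: project ID, location, deployment name, service account email.
apply_cmd() {
  project="$1"; location="$2"; deployment="$3"; sa_email="$4"
  printf 'gcloud infra-manager deployments apply projects/%s/locations/%s/deployments/%s' \
    "$project" "$location" "$deployment"
  printf ' --service-account projects/%s/serviceAccounts/%s' "$project" "$sa_email"
  printf ' --local-source="." --inputs-file=./variables.tf --location %s\n' "$location"
}

# Example (not executed here), using the values from the command above:
#   apply_cmd hpcbox-003 us-central1 hpcbox-cls-007 \
#     hpcbox-001-infra-manager@hpcbox-003.iam.gserviceaccount.com
```

Review the printed command, then run it from the directory that contains the edited main.tf, variables.tf and setup_gcp.sh.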

Updating disk sizes

The initial deployment of HPCBOX includes one data disk and one apps disk. To review the current size of the data and apps disks, execute df -Th. The output shows the devices associated with the /data and /opt/drz mount points. As an example, the output below shows a data disk of size 1000GB and an apps disk of size 512GB.

df -Th

/dev/sdb1      xfs      1000G  7.1G  993G   1% /data
/dev/sdc1      xfs       512G  6.2G  506G   2% /opt/drz

info

Note that the disks can only be increased in size; they cannot be shrunk.

Increase size of the data and/or apps disk

The data disk hosts the HOME directories of the users on your HPCBOX cluster. The apps disk is where all the applications you use on the HPCBOX cluster are installed. Follow the instructions below to increase the size of the data and/or apps disk.

warning

Make sure there are no running jobs and all worker nodes are in a powered-off state before proceeding. You may also choose to snapshot the disks as a safety measure.

Locate your existing HPCBOX cluster deployment within Google Cloud Infrastructure Manager.
The Overview section of the deployment has a link to the Google Cloud Storage bucket that holds the current state of the Terraform deployment.
Download all three files (main.tf, variables.tf and setup_gcp.sh) from the bucket.
Change the value of the variables appsDiskSize and/or dataDiskSize to the new desired size.
Once you've edited the templates, use gcloud to apply the changes and update the existing HPCBOX cluster deployment. As an example, the following commands update the deployment we created during the initial deployment. The .\variables.tf path is written for a Windows shell; on Linux or macOS, use ./variables.tf instead.

gcloud config set project hpcbox-003
gcloud auth login
gcloud infra-manager deployments apply projects/hpcbox-003/locations/us-central1/deployments/hpcbox-cls-007 --service-account projects/hpcbox-003/serviceAccounts/hpcbox-001-infra-manager@hpcbox-003.iam.gserviceaccount.com --local-source="." --inputs-file=.\variables.tf --location us-central1

Extend the file system

Once the deployment is complete, the data and/or apps disk will have been extended to the new size. However, the partition and the file system on each device still need to be expanded.

Use the command df -Th to identify the devices associated with the /data and the /opt/drz mount points. As an example, the output below shows the data disk is on device /dev/sdb and the apps disk on the device /dev/sdc.

df -Th

/dev/sdb1      xfs      1000G  7.1G  993G   1% /data
/dev/sdc1      xfs       512G  6.2G  506G   2% /opt/drz
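
The device lookup can be scripted as well. The following is a minimal sketch that reads df -Th output, returns the partition behind a mount point, and derives the disk name that parted expects. The function names are hypothetical, and the digit stripping assumes /dev/sdX style names as in the output above; it would not be correct for NVMe style names such as /dev/nvme0n1p1.

```shell
#!/bin/sh
# Sketch: find the partition for a mount point from `df -Th` output on stdin.
# The mount point is the last column and the partition is the first.
partition_for_mount() {
  awk -v m="$1" '$NF == m { print $1 }'
}

# Derive the disk from a partition name by dropping the trailing digits.
# Assumption: /dev/sdX naming, e.g. /dev/sdc1 -> /dev/sdc.
disk_of_partition() {
  printf '%s\n' "$1" | sed 's/[0-9]*$//'
}

# Example on the head/management node (not executed here):
#   part="$(df -Th | partition_for_mount /opt/drz)"
#   disk="$(disk_of_partition "$part")"
```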

To expand the selected partition, execute sudo -i as the Administrator user on the head/management node and then use the parted command. The example below shows how to expand the partition on the device /dev/sdc.

parted /dev/sdc

(parted) resizepart

Partition number? 1

Warning: Partition /dev/sdc1 is being used. Are you sure you want to continue?

Yes/No? Yes

End?  [550GB]? 100%

(parted) quit
Information: You may need to update /etc/fstab.

partprobe /dev/sdc

To grow the file system on the expanded partition, use the command xfs_growfs -d /data for /data and xfs_growfs -d /opt/drz for /opt/drz.
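
To confirm that the file system actually grew, compare its size before and after running xfs_growfs. The following is a minimal sketch. The helper names fs_size_kb and grew are hypothetical, and the size is read from the portable df -Pk output in 1K blocks.

```shell
#!/bin/sh
# Sketch: report the size of the file system at a mount point in 1K blocks.
fs_size_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $2 }'
}

# Succeeds only if the second size is larger than the first.
grew() {
  [ "$2" -gt "$1" ]
}

# Example on the head/management node, as root (not executed here):
#   before="$(fs_size_kb /opt/drz)"
#   xfs_growfs -d /opt/drz
#   after="$(fs_size_kb /opt/drz)"
#   grew "$before" "$after" && echo "/opt/drz grew from ${before}K to ${after}K"
```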

Destroying the HPCBOX cluster

danger

Destroying the HPCBOX cluster will delete all your files on the cluster. Make sure you have transferred them before proceeding.

tip

The HPCBOX cluster is always deployed with the following Terraform setting on the apps disk, the data disk and the head node, so Terraform will refuse to destroy these resources:

 lifecycle {
         prevent_destroy = true
  }

To destroy the entire HPCBOX cluster, first use the Google Cloud console to delete all the workers, the login node and the head/management node, and then delete the disks associated with the head/management node. Once this is complete, delete the entire Terraform deployment associated with the HPCBOX cluster using Google Cloud's Infrastructure Manager.
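
If you prefer the command line to the console, the same order of operations can be written as gcloud commands. The following is a minimal sketch that only prints the commands for review and deletes nothing itself. The function name print_destroy_plan, the instance names and the zone in the example are hypothetical. The names of the head node disks are not known here, so that step is left as a placeholder comment.

```shell
#!/bin/sh
# Sketch: print (never run) the teardown commands in the order described above.
# Arguments: project, location, zone, deployment, head node name, then worker
# and login node names. All names must come from your own cluster.
print_destroy_plan() {
  project="$1"; location="$2"; zone="$3"; deployment="$4"; head="$5"
  shift 5
  # Workers and login nodes first, head/management node last.
  for vm in "$@" "$head"; do
    printf 'gcloud compute instances delete %s --project=%s --zone=%s\n' \
      "$vm" "$project" "$zone"
  done
  printf '# next: gcloud compute disks delete DISK_NAME --project=%s --zone=%s (for each head node disk)\n' \
    "$project" "$zone"
  printf 'gcloud infra-manager deployments delete projects/%s/locations/%s/deployments/%s\n' \
    "$project" "$location" "$deployment"
}

# Example (not executed here), with hypothetical node names and zone:
#   print_destroy_plan hpcbox-003 us-central1 us-central1-a hpcbox-cls-007 \
#     head-node worker-1 login-1
```

Printing the plan first gives you a chance to check every resource name before anything irreversible is run.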
