Updating Deployments
Once HPCBOX has been deployed from Terraform templates through Google Cloud's Infrastructure Manager, the changes described below must be made by carefully editing the existing deployment files and redeploying, so that resources previously created by the deployment are not accidentally deleted.
If you have HPCBOX Premium Support, our support team will be happy to assist with these changes.
Adding new compute worker types
The initial deployment of HPCBOX includes one compute worker type called compute-worker. Generally, all compute workers of a type should use the same Image SKU for best performance. If you would like to include an additional compute worker with a different image SKU, this can be done as follows:
- Assume we want to create a new compute worker of type compute-worker-2 and we want to use a SKU of type h3-standard-88 for this worker class.
- Locate your existing HPCBOX cluster deployment within Google Cloud Infrastructure Manager
- The Overview section of the deployment will have a link to a Google Cloud Storage Bucket which has the current state of the Terraform deployment.
- Download all three files (main.tf, variables.tf and setup_gcp.sh) from the bucket.
Variables associated with the compute worker type compute-worker are prefixed with compute-worker. The new variables for the compute worker type compute-worker-2 are prefixed with compute-worker-2.
- Add the following variables to variables.tf for the new compute worker class compute-worker-2.
- compute-worker-2-imageSku
variable "compute-worker-2-imageSku" {
description = "The instance type SKU for the compute worker-2"
type = string
default = "h3-standard-88"
}
- compute-worker-2-numLocalSSDs - Set the default value to 0 if the SKU does not support local disks.
variable "compute-worker-2-numLocalSSDs"{
description = <<EOF
"Number of local ssd disks that should be raided 0 for local scratch. Note that not all computeWorkerImageSku support local ssd.
Please refer the Google cloud docs for supported VMs at https://cloud.google.com/compute/docs/disks/local-ssd#machine-types to
find the machines which accept local ssds and the minimum number for each."
EOF
default = 0
}
- compute-worker-2-placementMaxDistance
variable "compute-worker-2-placementMaxDistance"{
description = <<EOF
Max distance to be considered when creating compute workers for low latency between them. Please refer
https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_resource_policy#nested_group_placement_policy
and https://cloud.google.com/compute/docs/instances/placement-policies-overview
EOF
type = number
default = 1
}
- compute-worker-2-set
variable "compute-worker-2-set"{
description = "Set of Compute Workers of type compute-worker-2 to deploy"
type = set(string)
default = []
}
- Modify the value of the variable workerTypes to include an entry for the new compute worker type compute-worker-2. As an example, the value would be something like the following:
variable "workerTypes" {
description = <<EOF
"Worker types available on this HPCBOX Cluster. Format is <worker-role>//<worker-sku>//<suffix to use for naming>"
EOF
type = string
default = "login-worker//n1-standard-4//ln,compute-worker//c3d-standard-30-lssd//cn,compute-worker-2//h3-standard-88//cn-2,gpu-worker//n1-standard-4//gn,cuda-worker//a2-highgpu-1g//ccn"
}
- Edit the template file main.tf and create a new section for compute-worker-2 by duplicating the existing one for compute-worker. Make sure you name the resources correctly. As an example, a section is reproduced below for your convenience.
// Start of COMPUTE-WORKER-2 TF CODE
resource "google_compute_resource_policy" "compute-worker-2-placementPolicy" {
provider = google-beta
project = var.project_id
name = "${var.goog_cm_deployment_name}-compute-worker-2-placement"
region = var.region
group_placement_policy {
max_distance = var.compute-worker-2-placementMaxDistance
// Since we want low latency for the compute workers
collocation = "COLLOCATED"
}
}
resource "google_compute_instance" "compute_workers_2" {
provider = google-beta
for_each = var.compute-worker-2-set
name = each.value
zone = var.zone
machine_type = var.compute-worker-2-imageSku
tags = [
format("%s-hpcbox-worker-no-ip",var.goog_cm_deployment_name)
]
boot_disk {
device_name = "boot"
auto_delete = true
initialize_params {
image = data.google_compute_image.compute_node_image.self_link
}
}
dynamic "scratch_disk" {
for_each = range(var.compute-worker-2-numLocalSSDs)
content {
interface = "NVME"
}
}
resource_policies = google_compute_resource_policy.compute-worker-2-placementPolicy[*].self_link
scheduling {
automatic_restart = true
preemptible = false
on_host_maintenance = "TERMINATE"
}
advanced_machine_features {
threads_per_core = 1
}
network_interface {
nic_type = "GVNIC"
//subnetwork = "regions/${var.subnetRegion}/subnetworks/${var.subnetName}"
//subnetwork = data.google_compute_subnetwork.hpcbox-subnet.self_link
subnetwork = local.subnetSelfLink
}
metadata = {
startup-script = <<-EOT
#!/bin/bash
echo "USERNAME=${var.adminUserName};CLUSTER=${var.goog_cm_deployment_name};EXTERNAL_FS=${var.externalFS} ;USE_NIS=1;FLAVOR=${var.flavor}" > /root/drz/artifacts/hpcbox-config.sh && rm -f /etc/sudoers.d/google_sudoers && touch /root/drz/artifacts/hpcbox-config-start
EOT
displayName = "WorkerVirtualMachines"
hpcbox-cluster = var.goog_cm_deployment_name
hpcbox-cluster-tag = var.clusterTag
VM_ROLE = "compute-worker-2//${var.compute-worker-2-imageSku}//cn-2"
enable-oslogin = var.enableCloudLogin
enableCloudLogin = var.enableCloudLogin
#We will use NIS from the master or os-login in GCP
block-project-ssh-keys = "TRUE"
}
depends_on = [
google_compute_route.hpcbox-head-node-compute-route,
google_compute_resource_policy.compute-worker-2-placementPolicy
]
}
// END OF COMPUTE-WORKER-2 TF CODE
- Once you've edited the templates, use gcloud to apply and update the existing HPCBOX cluster deployment. As an example, the following commands update the deployment we created during the initial deployment:
gcloud config set project hpcbox-003
gcloud auth login
gcloud infra-manager deployments apply projects/hpcbox-003/locations/us-central1/deployments/hpcbox-cls-007 --service-account projects/hpcbox-003/serviceAccounts/hpcbox-001-infra-manager@hpcbox-003.iam.gserviceaccount.com --local-source="." --inputs-file=.\variables.tf --location us-central1
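Before running the apply command above, the edited files can be sanity-checked locally. The following is a minimal sketch, not part of HPCBOX or gcloud: the function names `check_worker_types` and `undeclared_vars` are hypothetical. The first checks that every workerTypes entry has the three `//`-separated fields described in the variable's description. The second lists `var.<name>` references in main.tf that have no matching `variable "<name>"` block in variables.tf, which would catch a compute-worker-2 variable that was referenced but never declared.

```shell
#!/usr/bin/env bash
# Hypothetical pre-apply checks; neither function is part of HPCBOX or gcloud.

# Validate a workerTypes value: comma-separated entries, each of the form
# <worker-role>//<worker-sku>//<suffix>. Prints one tab-separated line per entry.
check_worker_types() {
  local value="$1" entry role rest sku suffix old_ifs
  old_ifs="$IFS"; IFS=','
  set -f                 # no glob expansion while splitting on commas
  set -- $value
  set +f; IFS="$old_ifs"
  [ "$#" -gt 0 ] || { echo "workerTypes is empty" >&2; return 1; }
  for entry in "$@"; do
    role="${entry%%//*}"   # text before the first "//"
    rest="${entry#*//}"    # text after the first "//"
    sku="${rest%%//*}"
    suffix="${rest#*//}"
    case "$suffix" in *//*) suffix="" ;; esac   # more than three fields
    if [ "$rest" = "$entry" ] || [ "$suffix" = "$rest" ] ||
       [ -z "$role" ] || [ -z "$sku" ] || [ -z "$suffix" ]; then
      echo "malformed workerTypes entry: $entry" >&2
      return 1
    fi
    printf '%s\t%s\t%s\n' "$role" "$sku" "$suffix"
  done
  return 0
}

# Print each var.<name> used in main.tf that is not declared in variables.tf.
# Lines that start with // or # are skipped. Returns 1 if anything is missing.
undeclared_vars() {
  local main_tf="$1" variables_tf="$2" names name status=0
  names="$(grep -Ev '^[[:space:]]*(//|#)' "$main_tf" |
    grep -Eo 'var\.[A-Za-z_][A-Za-z0-9_-]*' | sed 's/^var\.//' | sort -u)"
  for name in $names; do
    if ! grep -Eq "^[[:space:]]*variable[[:space:]]+\"${name}\"" "$variables_tf"; then
      echo "$name"
      status=1
    fi
  done
  return "$status"
}

# Example (run in the directory holding the downloaded files):
#   undeclared_vars main.tf variables.tf && echo "all variables declared"
```

Both checks only read the files; they do not replace reviewing the planned changes in Infrastructure Manager before resources are modified.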
Updating disk sizes
The initial deployment of HPCBOX includes one data disk and one apps disk. To review the existing size of the data and apps disks, execute df -Th.
The output shows the devices associated with the /data and /opt/drz mount points. In the example below, the data disk is 1000GB and the apps disk is 512GB.
df -Th
/dev/sdb1 xfs 1000G 7.1G 993G 1% /data
/dev/sdc1 xfs 512G 6.2G 506G 2% /opt/drz
Note that the disks can only be increased in size; they cannot be shrunk.
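Because the disks can only grow, a requested size can be checked against the current one before appsDiskSize or dataDiskSize is edited. The following is a minimal sketch; the function name `can_resize` is hypothetical, and both sizes are assumed to be whole numbers of GB.

```shell
#!/usr/bin/env bash
# Hypothetical guard: succeed only when the requested size is strictly larger
# than the current size. Both arguments are whole numbers of GB.
can_resize() {
  local current_gb="$1" requested_gb="$2" v
  for v in "$current_gb" "$requested_gb"; do
    case "$v" in
      ''|*[!0-9]*) echo "sizes must be whole numbers of GB" >&2; return 2 ;;
    esac
  done
  [ "$requested_gb" -gt "$current_gb" ]
}

# Example: the apps disk in the output above is 512GB.
#   can_resize 512 1024 && echo "ok to set appsDiskSize to 1024"
```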
Increase size of the data and/or apps disk
The data disk hosts the HOME directories of the users on your HPCBOX cluster. The apps disk is where all the applications you use on the HPCBOX cluster are installed. Follow the instructions below to increase the size of the data and/or apps disk.
Make sure there are no running jobs and that all worker nodes are powered off before proceeding. You may also want to snapshot the disks as a precaution.
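If you choose to take snapshots, this can be done from the command line with `gcloud compute disks snapshot`. The following is a minimal sketch: the helper names and the example disk name and zone are placeholders, not names that HPCBOX creates, so look up the actual disk names and zone of your cluster first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a snapshot name from a disk name and a timestamp.
snapshot_name() {
  printf '%s-pre-resize-%s\n' "$1" "$2"
}

# Snapshot one disk (not invoked here; requires gcloud and credentials).
snapshot_disk() {
  local disk="$1" zone="$2" stamp
  stamp="$(date -u +%Y%m%d-%H%M%S)"
  gcloud compute disks snapshot "$disk" --zone="$zone" \
    --snapshot-names="$(snapshot_name "$disk" "$stamp")"
}

# Example with placeholder values:
#   snapshot_disk my-cluster-data-disk us-central1-a
```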
- Locate your existing HPCBOX cluster deployment within Google Cloud Infrastructure Manager
- The Overview section of the deployment will have a link to a Google Cloud Storage Bucket which has the current state of the Terraform deployment.
- Download all three files (main.tf, variables.tf and setup_gcp.sh) from the bucket.
- Change the value of the variables appsDiskSize and/or dataDiskSize to the new desired size.
- Once you've edited the templates, use gcloud to apply and update the existing HPCBOX cluster deployment. As an example, the following commands update the deployment we created during the initial deployment:
gcloud config set project hpcbox-003
gcloud auth login
gcloud infra-manager deployments apply projects/hpcbox-003/locations/us-central1/deployments/hpcbox-cls-007 --service-account projects/hpcbox-003/serviceAccounts/hpcbox-001-infra-manager@hpcbox-003.iam.gserviceaccount.com --local-source="." --inputs-file=.\variables.tf --location us-central1
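After the apply command has been submitted, the state of the deployment can also be read from the command line instead of the console. The following is a minimal sketch: the function names are hypothetical, and it assumes that the output of `gcloud infra-manager deployments describe` exposes a `state` field.

```shell
#!/usr/bin/env bash
# Build the full Infrastructure Manager resource name used by the apply command.
deployment_resource() {
  local project="$1" location="$2" deployment="$3"
  printf 'projects/%s/locations/%s/deployments/%s\n' "$project" "$location" "$deployment"
}

# Print the deployment state (not invoked here; requires gcloud and credentials).
# Assumption: the describe output contains a "state" field.
deployment_state() {
  gcloud infra-manager deployments describe "$(deployment_resource "$@")" \
    --format='value(state)'
}

# Example with the values from the commands above:
#   deployment_state hpcbox-003 us-central1 hpcbox-cls-007
```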
Extend the file system
Once the deployment is complete, the data and/or apps disk should be extended to the new size. However, the file system on the devices still needs to be expanded.
- Use the command df -Th to identify the devices associated with the /data and the /opt/drz mount points. As an example, the output below shows the data disk is on device /dev/sdb and the apps disk on the device /dev/sdc.
df -Th
/dev/sdb1 xfs 1000G 7.1G 993G 1% /data
/dev/sdc1 xfs 512G 6.2G 506G 2% /opt/drz
- To expand the selected partition, use the parted command after executing sudo -i as the Administrator user on the head/management node. The example below shows how to expand the partition on the device /dev/sdc.
parted /dev/sdc
(parted) resizepart
Partition number? 1
Warning: Partition /dev/sdc1 is being used. Are you sure you want to continue?
Yes/No? Yes
End? [550GB]? 100%
(parted) quit
Information: You may need to update /etc/fstab.
partprobe /dev/sdc
- To grow the file system on the expanded partition, use the command xfs_growfs -d /data to grow /data and xfs_growfs -d /opt/drz to grow /opt/drz.
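The device and partition number that parted asks for can be derived from the partition shown by df -Th. The following is a minimal sketch with hypothetical helper names. It assumes conventional Linux device naming (for example /dev/sdc1, or /dev/nvme0n1p1 for NVMe devices) and expects a partition device as input, not a whole disk.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; input must be a partition device such as /dev/sdc1.

# Print the partition number, e.g. /dev/sdc1 -> 1.
partition_number() {
  local part="$1"
  echo "${part##*[!0-9]}"   # keep only the trailing digits
}

# Print the whole-disk device, e.g. /dev/sdc1 -> /dev/sdc.
parent_disk() {
  local part="$1" num disk
  num="${part##*[!0-9]}"
  disk="${part%"$num"}"
  case "$disk" in
    # NVMe and MMC partitions carry a "p" before the partition number.
    /dev/nvme*p|/dev/mmcblk*p) disk="${disk%p}" ;;
  esac
  echo "$disk"
}

# Example, run as root on the head/management node (not invoked here):
#   part="$(findmnt -n -o SOURCE /opt/drz)"    # e.g. /dev/sdc1
#   echo "parted device: $(parent_disk "$part"), partition: $(partition_number "$part")"
#   xfs_growfs -d /opt/drz                     # after the partition has been resized
```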
Destroying the HPCBOX cluster
Destroying the HPCBOX cluster will delete all your files on the cluster. Make sure you have transferred them before proceeding.
The HPCBOX cluster is always deployed with Terraform settings:
lifecycle {
prevent_destroy = true
}
for the apps disk, the data disk and the head node.
To destroy the entire HPCBOX cluster, first delete all the workers, the login node and the head/management node using the Google Cloud console, then delete the disks associated with the head/management node. Once this is complete, delete the entire Terraform deployment associated with the HPCBOX cluster using Google Cloud's Infrastructure Manager.