
Monitoring

HPCBOX provides both job-level and cluster-level monitoring information.

Job Monitoring

Every workflow pipeline on HPCBOX is a collection of steps, and each step is submitted as a job to the underlying scheduling system.

Every job has an ID, Name, State, Owner (the cluster user), and the number of slots/cores it is using, as reported by the underlying scheduling system.

Job Information

A job can be in one of several states; run man qstat for the full list. The most common states are:

  • qw: Queued and awaiting execution.
  • r: Running.
  • d: Awaiting deletion.
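
As a rough illustration of how these state codes can combine (for example the dr state mentioned later on this page), the following Python sketch decodes a state string into descriptions. It assumes a Grid Engine style scheduler in which qw is a single code and other codes are single-letter flags that may be combined. The describe_state helper and its lookup table are hypothetical and cover only the states listed above; consult man qstat for the authoritative list.

```python
from typing import List

# Hypothetical lookup table: only the states named on this page.
STATE_DESCRIPTIONS = {
    "qw": "queued and awaiting execution",
    "r": "running",
    "d": "awaiting deletion",
}


def describe_state(state: str) -> List[str]:
    """Return one description per flag in a state string such as 'qw' or 'dr'."""
    if state in STATE_DESCRIPTIONS:
        return [STATE_DESCRIPTIONS[state]]
    # Assumption: any other string is a combination of single-letter flags.
    return [
        STATE_DESCRIPTIONS.get(flag, "unknown flag '%s'" % flag)
        for flag in state
    ]


if __name__ == "__main__":
    for code in ("qw", "r", "dr"):
        print(code, "->", ", ".join(describe_state(code)))
```

Under these assumptions, a job in the dr state is read as a running job that has been marked for deletion.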

Job Actions

The user can perform the following actions on a job. Each action is invoked using its icon in the Actions column. Multiple jobs can be selected when performing any of the actions below.

Show Logs

The current job log, meaning the stdout and stderr of a job, can be viewed using this option. It opens an xterm window and displays the log file.

Follow Logs

The Follow Logs option continuously follows the job log. This is useful for jobs that write a lot of output directly to the console.
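
Conceptually, following a log resembles running tail -f on the log file. The Python sketch below shows one way this could work outside the HPCBOX interface; it is not how HPCBOX implements the feature. The follow function, its parameters, and the log path in the usage note are all hypothetical, since this page does not state where job logs are stored.

```python
import time
from typing import Iterator, Optional


def follow(path: str, poll_seconds: float = 1.0,
           max_idle_polls: Optional[int] = None) -> Iterator[str]:
    """Yield lines from `path` as they are appended, similar to `tail -f`.

    Starts at the beginning of the file. Stops after `max_idle_polls`
    consecutive polls with no new data, or never if it is None.
    """
    idle = 0
    with open(path, "r") as log:
        while max_idle_polls is None or idle < max_idle_polls:
            line = log.readline()
            if line:
                idle = 0
                yield line.rstrip("\n")
            else:
                # No new data yet: wait before polling again.
                idle += 1
                time.sleep(poll_seconds)
```

A call such as `for line in follow("/path/to/job.log"): print(line)` would print existing output and then keep printing new lines as the job writes them. This simple version may yield a partial line if the job is in the middle of writing it.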

Kill

Jobs can be killed using this option.

tip

Some jobs can take a few minutes to clean up before exiting the queue and will remain in the dr state while cleanup is ongoing. This applies in particular to tightly integrated distributed MPI jobs.

Hard Kill

A hard kill forces deletion of a job. Use this option if a job has been stuck in the dr state for a long time without leaving the queue.
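
For readers working from a terminal, the difference between the two kill actions resembles a normal versus a forced deletion in a Grid Engine style scheduler. The sketch below assumes the underlying scheduler provides the qdel command and that qdel -f forces deletion, as in Grid Engine. Whether HPCBOX runs these exact commands is not stated on this page, and build_kill_command is a hypothetical helper that only constructs the command line without executing it.

```python
from typing import List


def build_kill_command(job_id: int, hard: bool = False) -> List[str]:
    """Build a qdel argument list for a normal or forced job deletion.

    Assumption: a Grid Engine style `qdel`, where `-f` forces deletion.
    """
    command = ["qdel"]
    if hard:
        # Forced deletion, intended for jobs stuck in the dr state.
        command.append("-f")
    command.append(str(job_id))
    return command


if __name__ == "__main__":
    # The job ID 42 is a made-up example value.
    print(" ".join(build_kill_command(42)))
    print(" ".join(build_kill_command(42, hard=True)))
```

The resulting list could be passed to subprocess.run to execute the command, ideally trying the normal deletion first and forcing it only if the job stays in the dr state.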

tip

When following multiple job logs simultaneously, you can use the window tiler tool to arrange the windows in a grid. This tool appears on the right side of your desktop panel.

X-Tile

Cluster Monitoring

The monitoring section displays aggregated charts of worker utilization, including load, GPU, and memory use, along with events generated by the AutoScaler. The graphs can take a few minutes to appear.


Comprehensive Monitoring

To view more detailed monitoring, please open a browser window within your HPCBOX desktop and visit http://localhost/ganglia

Cluster Monitoring Ganglia

AutoScaler Monitoring

All activities performed by the HPCBOX AutoScaler are logged in the Events section of the Monitoring panel.

Cluster Monitoring AutoScaler