Node properties and metaclusters

Property (feature) names that identify an individual subcluster are listed on the subcluster pages under the Node Types and Subclusters page. For the most up-to-date list of the properties assigned to each node, query TORQUE directly with its pbsnodes command.
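As a rough sketch of how pbsnodes output might be summarized, the function below counts how many nodes carry each property. It assumes that each node record printed by pbsnodes -a contains a line of the form "properties = a,b,c"; the function name count_properties is made up for this example and is not part of TORQUE.

```shell
# Count how many nodes carry each property.
# Reads `pbsnodes -a` style text on stdin. Assumption: every node record
# contains a line of the form "properties = a,b,c".
count_properties() {
    awk -F' = ' '
        $1 ~ /^[[:space:]]*properties$/ {
            gsub(/[[:space:]]/, "", $2)          # drop stray whitespace
            n = split($2, props, ",")
            for (i = 1; i <= n; i++) count[props[i]]++
        }
        END { for (p in count) print count[p], p }
    ' | sort -k2
}

# Typical use on a login node (requires the TORQUE client tools):
#   pbsnodes -a | count_properties
```

Each output line gives a node count followed by a property name, which makes it easy to see how large a pool a given property would select.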

Generally, nodes are classified with properties in at least the following ways:

  • subcluster: usually just the subcluster name in lowercase, e.g. bora for Bora, and vortexa for Vortex-α;
  • operating system: el6 or el7 for Enterprise Linux (Red Hat or a derivative) 6.x or 7.x, respectively, or rhel6 / rhel7 for Red Hat Enterprise Linux specifically;
  • processor type: xeon for Intel Xeon processors or opteron for AMD Opteron processors, and more specifically broadwell, knl, or x5672 for Xeon processors, or abu_dhabi, seoul, shanghai, magny_cours, or santa_rosa for Opteron processors;
  • hardware multithreading: noht if Intel's hyper-threading is disabled or nonexistent, and/or nocmt if AMD's clustered multi-threading is disabled or nonexistent; and
  • switch/network connectivity: ib01, ib02, ib03, ib04, ib05, or ib06 depending on which InfiniBand switch the node is connected to, and/or opa if the node is connected to SciClone's Omni-Path network.

When subclusters share a common processor technology and communication fabric, the distinctions between them can be ignored for certain applications, and they can be treated as a larger, unified "metacluster." For example, the Hurricane and Whirlwind subclusters employ the exact same number and model of Xeon processors and share the same InfiniBand switch. Overlooking the differences in memory capacities, local scratch disks, and the presence of GPUs in Hurricane, these two subclusters could be treated as a single 64-node system, rather than distinct 12-node and 52-node systems.
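One way to look for a property that could define such a metacluster is to list the properties carried by every node of the subclusters in question. The sketch below does this for pbsnodes -a style input, under the assumptions that each node record has a "properties = a,b,c" line and that the subcluster name itself appears among the properties. The function name common_properties is hypothetical.

```shell
# List properties shared by every node that belongs to one of the named
# subclusters. Reads `pbsnodes -a` style text on stdin.
# Usage: pbsnodes -a | common_properties hurricane whirlwind
common_properties() {
    awk -F' = ' -v wanted="$*" '
        BEGIN {
            n = split(wanted, w, " ")
            for (i = 1; i <= n; i++) want[w[i]] = 1
        }
        $1 ~ /^[[:space:]]*properties$/ {
            gsub(/[[:space:]]/, "", $2)
            m = split($2, props, ",")
            hit = 0
            for (i = 1; i <= m; i++) if (props[i] in want) hit = 1
            if (!hit) next                       # node is in neither subcluster
            nodes++
            for (i = 1; i <= m; i++) seen[props[i]]++
        }
        END {
            # keep only properties present on all selected nodes,
            # leaving out the subcluster names themselves
            for (p in seen) if (seen[p] == nodes && !(p in want)) print p
        }
    ' | sort
}
```

A property printed by this function is common to the chosen subclusters, but it may also appear on nodes elsewhere, so it is still worth checking that it selects only the nodes you intend.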

For job scheduling purposes, choose a node property specification which is common to all of the nodes of interest and uniquely identifies that set of resources. For example, to combine nodes from the Hurricane and Whirlwind subclusters into a 60-node job, you could use something like

qsub -l nodes=60:x5672:ppn=8 ...

which specifies that you want 8 cores on each of 60 compute nodes equipped with Xeon X5672 processors. Even a job needing 52 or fewer nodes, which could be satisfied by just Whirlwind, will be easier to schedule and therefore will likely run sooner with a more inclusive node specification. Alternatively, if you had a multi-threaded single-node job but wanted to allow it to run on any free node with a Xeon "Broadwell" processor, you might say:

qsub -l nodes=1:broadwell:ppn=20 ...
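Both requests above follow the same pattern: a node count, one or more properties, and a per-node core count, joined by colons. A small helper can assemble that string, as sketched below. The name node_spec is invented for this example, and the el7/noht combination in the last call is only an illustration of chaining several properties, not a recommendation.

```shell
# Build a TORQUE node specification of the form nodes=N[:property...]:ppn=P.
# Usage: node_spec NODES PPN [PROPERTY...]
node_spec() {
    nodes=$1
    ppn=$2
    shift 2
    spec="nodes=$nodes"
    for prop in "$@"; do
        spec="$spec:$prop"
    done
    printf '%s:ppn=%s\n' "$spec" "$ppn"
}

node_spec 60 8 x5672        # prints nodes=60:x5672:ppn=8
node_spec 1 20 broadwell    # prints nodes=1:broadwell:ppn=20
node_spec 1 1 el7 noht      # prints nodes=1:el7:noht:ppn=1
```

The result can be passed straight to the scheduler, for example qsub -l "$(node_spec 60 8 x5672)" followed by the job script.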

Some applications, especially pre-compiled ones, will only work on a particular operating system release. To run only on RHEL/CentOS 7 nodes, for example, you could specify:

qsub -l nodes=1:el7:ppn=1 ...

Nodes that match this specification (e.g. both Hima, from 2017, and Rain, from 2007) vary considerably in processor model, processor speed, number of CPU cores per node, etc., but a generic resource request can still treat these subclusters as one large pool, particularly for serial jobs.

Finally, if an application is completely agnostic with respect to processor type, number of cores, memory capacity, operating system, etc., you could use a very generic node specification to treat the entire SciClone complex as one big cluster, allowing the job to run anywhere:

qsub -l nodes=1:ppn=1 ...

This strategy works particularly well when you are submitting large numbers of serial jobs using a software package which is supported across all of SciClone's computing platforms (e.g., MATLAB, Sage, Octave, GRASS, or NumPy/SciPy). By putting all of the available computing resources at your disposal, you can reduce total turnaround time and maximize throughput for the entire collection of jobs, especially when the system is busy.
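A batch of serial jobs like this might be submitted with a short loop. The sketch below is illustrative only: the job script run_one.sh, the INPUT environment variable, and the DRY_RUN switch are all hypothetical names. By default it only prints the qsub commands, so they can be reviewed before anything is submitted.

```shell
# Submit one serial job per input file using the generic node specification.
# Assumptions: run_one.sh is a job script that reads its input file name
# from the INPUT environment variable (both names are hypothetical).
# With DRY_RUN unset or set to 1, the commands are printed, not executed.
submit_all() {
    for input in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo qsub -l nodes=1:ppn=1 -v "INPUT=$input" run_one.sh
        else
            qsub -l nodes=1:ppn=1 -v "INPUT=$input" run_one.sh
        fi
    done
}

# Example: preview the commands for every .dat file in the current directory
#   submit_all *.dat
# and, once they look right, submit for real:
#   DRY_RUN=0 submit_all *.dat
```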