Hurricane / Whirlwind

Hurricane and Whirlwind are the subclusters of SciClone with Intel Xeon "Westmere-EP" processors. Their front-end is hurricane.sciclone.wm.edu and they share the same startup modules file .cshrc.rhel6-xeon.

Hardware

Node classes: front-end (hurricane / hu00); GPU nodes (hu01-hu12); non-GPU nodes (wh01-wh44); large-memory nodes (wh45-wh52)

Model: HP ProLiant DL180 G6 (front-end); HP ProLiant SL390s G7 2U (GPU nodes); Dell PowerEdge C6100 (non-GPU and large-memory nodes)

Processor(s): 2×4-core Intel Xeon E5620 (front-end); 2×4-core Intel Xeon X5672 (all compute nodes)

Clock speed: 2.4 GHz (front-end); 3.2 GHz (all compute nodes)

L3 cache: 12 MB

Memory: 16 GB (front-end); 48 GB, 1333 MHz (GPU nodes); 64 GB, 1333 MHz (non-GPU nodes); 192 GB, 1066 MHz (large-memory nodes)

GPUs: 2 × NVIDIA Tesla M2075, 1.15 GHz, 448 CUDA cores (GPU nodes only)

Application network interface: QDR IB (hu??-ib, wh??-ib)

System network interface: 10 GbE (hu00); 1 GbE (hu??, wh??)

OS: RHEL 6.8, CentOS 6.8

The Hurricane and Whirlwind subclusters share a single QDR InfiniBand switch. They can also communicate with the Rain subcluster via a single DDR (20 Gb/s) switch-to-switch link, with Hima via a single QDR (40 Gb/s) switch-to-switch link, and with Bora and Vortex via two QDR switch-to-switch links.

Main memory size works out to a generous 6-24 GB/core, meeting or exceeding that of SciClone's large-memory Rain nodes. InfiniBand bandwidth works out to 5 Gb/s per core (a 40 Gb/s QDR link shared by 8 cores, less 20% protocol overhead), also matching that of the existing Rain compute nodes. However, when the higher speed of the Xeon processors is taken into account, the bandwidth per FLOP is lower; when GPU acceleration is factored in, the communication-to-computation ratio could drop by a couple of orders of magnitude. Communication performance is therefore an important concern when designing multi-node parallel algorithms for this architecture.
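The per-core figures above can be checked with a few lines of shell arithmetic. This is only a sketch: it assumes 8 cores per node (2×4-core, from the hardware table), reads the bandwidth figure as the 40 Gb/s QDR link rate divided evenly among the cores, and uses the hypothetical helper names mem_per_core and ib_per_core_mbps.

```shell
#!/bin/sh
# Cores per compute node: 2 sockets x 4 cores (from the hardware table).
CORES=8

# Memory per core in GB for a node with $1 GB of RAM.
mem_per_core() {
    echo $(( $1 / $CORES ))
}

# Per-core share, in Mb/s, of a link with a rate of $1 Gb/s after
# removing $2 percent protocol overhead (integer arithmetic only).
ib_per_core_mbps() {
    echo $(( $1 * 1000 * (100 - $2) / 100 / $CORES ))
}

for gb in 48 64 192; do
    echo "${gb} GB node: $(mem_per_core "$gb") GB/core"
done
echo "QDR, raw:               $(ib_per_core_mbps 40 0) Mb/s per core"
echo "QDR, less 20% overhead: $(ib_per_core_mbps 40 20) Mb/s per core"
```

The three memory sizes give 6, 8, and 24 GB/core, and the QDR link gives 5000 Mb/s per core before overhead and 4000 Mb/s after it.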

TORQUE node specifiers

All access to compute nodes (for either interactive or batch work) is via the TORQUE resource manager, as described elsewhere. TORQUE assigns each job to a particular set of processors so that jobs do not interfere with each other. The TORQUE properties for the Hurricane and Whirlwind nodes are:

hu01-hu08: c10, c10x, x5672, el6, compute, hurricane

hu09-hu12: c10a, c10x, x5672, el6, compute, hurricane

wh01-wh44: c11, c11x, x5672, el6, compute, whirlwind

wh45-wh52: c11a, c11x, x5672, el6, compute, whirlwind

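This set of properties allows you to select different subsets of Hurricane and Whirlwind nodes. For example, c10x matches every Hurricane compute node, while c11a matches only the large-memory Whirlwind nodes.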

Considerations for CPU jobs

Although Hurricane nodes additionally have GPUs, Hurricane and Whirlwind have the same CPU configuration and share an InfiniBand switch. Non-GPU parallel jobs can therefore use them together as a "metacluster" by requesting the TORQUE property named for their processor model, e.g.

qsub -l nodes=16:x5672:ppn=8 ...

To use only Whirlwind nodes, you would instead specify

qsub -l nodes=16:whirlwind:ppn=8 ...

If you have memory requirements exceeding the 8000 MB/core available on every Whirlwind node, ask for only the large-memory nodes, like so:

qsub -l nodes=4:whirlwind:ppn=8,pmem=24000mb ...

Alternatively, you can request the large-memory Whirlwind nodes explicitly by property:

qsub -l nodes=4:c11a:ppn=8 ...
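The resource requests in this section all follow the same pattern: a node count, a TORQUE property, a per-node processor count, and optionally a per-process memory limit. As a rough sketch, the hypothetical helper node_spec below (not part of TORQUE or SciClone) assembles such a string so that it can be passed to qsub -l.

```shell
#!/bin/sh
# node_spec COUNT PROPERTY PPN [PMEM_MB]
# Hypothetical helper: prints a TORQUE "-l" resource string in the form
# used by the examples in this section.
node_spec() {
    spec="nodes=$1:$2:ppn=$3"
    if [ -n "${4:-}" ]; then
        spec="$spec,pmem=${4}mb"
    fi
    echo "$spec"
}

node_spec 16 x5672 8            # Hurricane and Whirlwind together
node_spec 16 whirlwind 8        # Whirlwind only
node_spec 4 whirlwind 8 24000   # large-memory nodes, selected via pmem
node_spec 4 c11a 8              # large-memory nodes, selected by property
```

A submission would then look like qsub -l "$(node_spec 16 x5672 8)" myjob.sh, where myjob.sh stands in for your own job script.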

Considerations for GPU jobs

Until we are able to install a GPU-aware job scheduler, there is no simple way to allocate GPU devices among multiple jobs running on the same node. To prevent device conflicts (which would result in either runtime errors or degraded performance), we recommend that GPU applications request an entire node, and then use as many of the GPU devices and Xeon cores as possible to avoid wasting resources. For example,

qsub -n -l nodes=1:hurricane ...

In some cases it may be necessary to obtain an interactive shell on a GPU-enabled compute node in order to successfully compile GPU applications. This can be done with qsub -I. This limitation arises because the CUDA device driver and associated libraries cannot be installed on systems (such as the hurricane front-end node) which do not have resident GPU devices.
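Once an interactive shell is open on a Hurricane node (for example via qsub -I -n -l nodes=1:hurricane), a quick check like the following sketch can confirm that the CUDA tools are visible before you start a build. It assumes that nvcc is provided by a loaded cuda module and that nvidia-smi is installed with the GPU driver; the helper name have_cmd is hypothetical.

```shell
#!/bin/sh
# Return 0 if the named command is on PATH, 1 otherwise.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Report which CUDA-related tools are visible on the current host.
for tool in nvcc nvidia-smi; do
    if have_cmd "$tool"; then
        echo "$tool: found"
    else
        echo "$tool: not found on $(uname -n)"
    fi
done
```

If nvcc is reported as missing on a GPU node, loading a CUDA module (such as the cuda/7.0 module listed under Compilers) is the first thing to try.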

Compilers

Several compiler suites are available in SciClone's RHEL 6 / Xeon environment, including PGI 11.10, several versions of the GNU Compiler Collection (GCC), and Solaris Studio 12.3. NVIDIA's nvcc compiler for CUDA runs on top of GCC 4.4.7.

In some cases libraries and applications are supported only for a particular compiler suite; in other cases they may be supported across multiple compiler suites. For a complete list of available compilers, use the "module avail" command.

In most cases code generated by the commercial compiler suites (PGI and Solaris Studio) will outperform that generated by the open-source GNU compilers, sometimes by a wide margin. There are exceptions, however, so we strongly encourage you to experiment with different compiler suites to determine which yields the best performance for a given task. When a GNU compiler is required, we recommend GCC 4.7.0, since it has better support than earlier versions for the Nehalem architecture, from which Westmere is derived.

Note that well-crafted GPU programs written with CUDA, OpenCL, or compiler directives can vastly exceed the performance achievable with conventional code running on the Xeon processors, but the inverse is also true: problems that are ill-suited to the GPU architecture or which require a lot of data movement between main memory and GPU memory can run much more slowly than CPU-based code.

You can switch between alternative compilers by modifying the appropriate module load command in your .cshrc.rhel6-xeon file. The default configuration loads pgi/11.10. Because of conflicts with command names, environment variables, libraries, etc., attempts to load multiple compiler modules into your environment simultaneously may result in an error.

For details about compiler installation paths, environment variables, etc., use the "module show" command for the compiler of interest, e.g.,

module show pgi/11.10
module show gcc/4.7.0
module show solstudio/12.3
module show cuda/7.0

etc.

For proper operation and best performance, it's important to choose compiler options that match the target architecture and enable the most profitable code optimizations. The options listed below are suggested as starting points. Note that for some codes, these optimizations may be too aggressive and may need to be scaled back. Consult the appropriate compiler manuals for full details.

Intel

C: icc -O3 -xSSE4.2 -align -finline-functions
C++: icpc -std=c++11 -O3 -xSSE4.2 -align -finline-functions
Fortran: ifort -O3 -xSSE4.2 -align array64byte -finline-functions

GCC

C: gcc -march=westmere -O3 -finline-functions
C++: g++ -std=c++11 -march=westmere -O3 -finline-functions
Fortran: gfortran -march=westmere -O3 -finline-functions

GCC 4.4.6: -O3 -march=core2 -m64

GCC 4.7.0: -O3 -march=corei7 -m64

PGI: -fast -tp nehalem [-ta=nvidia,cc13,cuda4.0] -m64 -Mipa=fast [-Minfo=all]

Solaris Studio: -fast -xchip=westmere -xarch=sse4_2 -xcache=32/64/8:256/64/8:12288/64/16 -m64
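Because the suggested -march value differs between the GCC 4.4.6 and GCC 4.7.0 entries above, a build script can pick the flags from the compiler version. The sketch below uses a hypothetical helper, gcc_flags, and encodes only those two entries. The fallback for any other version is an assumption rather than a site recommendation.

```shell
#!/bin/sh
# gcc_flags VERSION
# Hypothetical helper: prints the suggested starting flags for the two
# GCC versions listed above.
gcc_flags() {
    case "$1" in
        4.4.*) echo "-O3 -march=core2 -m64" ;;
        4.7.*) echo "-O3 -march=corei7 -m64" ;;
        *)     echo "-O3 -m64" ;;   # assumption: conservative fallback
    esac
}

# Show the compile line each listed version would use
# (prog.c is a placeholder source file).
for v in 4.4.6 4.7.0; do
    echo "GCC $v: gcc $(gcc_flags "$v") -o prog prog.c"
done
```

In a real build the version could come from gcc -dumpversion, e.g. gcc $(gcc_flags "$(gcc -dumpversion)") -o prog prog.c.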