Bora / Hima

Bora and Hima are subclusters of SciClone with Intel Xeon "Broadwell" processors, the former intended for multi-node parallel jobs, and the latter intended for serial and shared-memory jobs. Their front-end is bora.sciclone.wm.edu and they share the same startup modules file .cshrc.el7-xeon. They are also the first subclusters to utilize the new parallel file-system mounted at /sciclone/pscr/$USER.

Hardware

                        Front-end              Parallel nodes         Serial/shared-memory nodes
                        (bora / bo00)          (bo01-bo55)            (hi01-hi07)
Model                   Dell PowerEdge R630    Dell PowerEdge R630    Dell PowerEdge R730
Processor(s)            2×10-core Intel Xeon   2×10-core Intel Xeon   2×16-core Intel Xeon
                        E5-2640 v4             E5-2640 v4             E5-2683 v4
Clock speed             2.4 GHz                2.4 GHz                2.1 GHz
Memory                  64 GB                  128 GB                 256 GB
Network interfaces
  Application           FDR IB (bo00-ib)       FDR IB (bo??-ib)       QDR IB (hi??-ib)
  System                10 GbE (bo00)          1 GbE (bo??)           1 GbE (hi??)
OS                      CentOS 7.3             CentOS 7.3             CentOS 7.3

Torque Node Specifiers:

All access to compute nodes (for either interactive or batch work) is via the TORQUE resource manager, as described elsewhere. TORQUE assigns jobs to a particular set of processors so that jobs do not interfere with each other.

All Bora and Hima nodes have Intel's Hyper-Threading enabled. However, since Bora is intended for MPI parallel jobs, the TORQUE parameter np is set to 20, the total number of physical cores, and the nodes are configured to run only one job at a time. Users should therefore request all 20 cores per node. Hyper-Threading can still be used by hybrid MPI/OpenMP jobs by requesting the node as exclusive (#PBS -W x=\"NACCESSPOLICY:SINGLEJOB\") and setting the number of threads to 2 per MPI process (20 MPI processes each using 2 threads = 40 threads per node); see running parallel jobs with mvp2run for more information. Beware that Hyper-Threading often slows down individual jobs, so please test your code well before doing production runs.

For Hima, the TORQUE parameter np is set to 64, one per logical processor. This allows users to run up to 64 processors' worth of serial or shared-memory jobs on one node in order to maximize throughput (not necessarily individual job speed). If you want neither to use Hyper-Threading on Hima nor to share the node with other users, request ppn=32 and request the node as exclusive (#PBS -W x=\"NACCESSPOLICY:SINGLEJOB\").
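As a rough illustration of the exclusive, non-hyperthreaded Hima case described above, the following shell snippet writes a tcsh job script to a file so that it can be inspected before submission. The job name, the walltime value, the file name hima_exclusive.pbs, the executable ./a.out, and the use of OMP_NUM_THREADS to set the thread count to 32 are assumptions made for this sketch; the -l and -W lines are copied verbatim from the text above.

```shell
#!/bin/sh
# Sketch: write a Hima job script that takes a whole node without using
# hyperthreads. hima_exclusive.pbs and ./a.out are placeholder names.
cat > hima_exclusive.pbs <<'EOF'
#!/bin/tcsh
#PBS -N hima_excl
#PBS -l nodes=1:hima:ppn=32
#PBS -l walltime=12:00:00
#PBS -W x=\"NACCESSPOLICY:SINGLEJOB\"
#PBS -j oe

cd $PBS_O_WORKDIR

# Assumption: a.out is an OpenMP program; use one thread per physical core.
setenv OMP_NUM_THREADS 32
./a.out >& LOG
EOF

# Submit from the front-end with: qsub hima_exclusive.pbs
```

Writing the script to a file first keeps the resource request easy to review, since a mistaken ppn value on Hima changes whether hyperthreads are used.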

Again, all Bora nodes, which are intended for multi-node parallel jobs, are configured to run at most one job, and users should take all 20 physical cores per node for all jobs. The nodes have the following TORQUE properties:

  bora, broadwell, c21, el7, compute

Since only one job can occupy a single Bora node at one time, the following node specs are sufficient for a pure MPI job:

  #PBS -n -l nodes=1:bora:ppn=20 (20 cores on a single node)
  #PBS -n -l nodes=4:bora:ppn=20 (80 cores across four nodes)

These specs also give correct processor placement, since cores 0-19 are the physical cores.

Jobs that use fewer than 20 cores on Bora would still occupy the whole node and make the remaining cores and threads inaccessible to other users, so such jobs should use Hima nodes instead, which allow multiple simultaneous jobs.

Hima nodes have the TORQUE properties hima, broadwell, and c22. Specify, for example, in your job script:

  #PBS -l nodes=1:hima:ppn=1

Specifying Hima nodes without GPUs

Of the 7 Hima nodes that are currently available, 2 have GPUs installed.  In order to keep these 2 nodes available for GPU use, the 5 Hima nodes without GPUs have an additional TORQUE property nogpu. For example:

  #PBS -l nodes=1:hima:nogpu:ppn=1

This node spec will only run on the Hima nodes without GPUs. Specifying the nogpu keyword is much fairer to GPU users. We reserve the right to add this keyword to any Hima job that does not require a GPU.

Torque time limit

The maximum walltime for jobs on Bora and Hima is 72 hours.  Please be careful about this limit since currently jobs that request more than 72 hours on Bora or Hima nodes will simply remain queued with no other information provided to the user. We will try to modify this behavior in the future.
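Because an over-limit request simply stays queued, it can help to check the requested walltime before submitting. The helper below is a hypothetical sketch, not a SciClone tool: the function and variable names are made up for illustration, and it assumes the walltime is written as HH:MM:SS (TORQUE accepts other forms that this sketch does not handle).

```shell
#!/bin/sh
# Hypothetical helper: check a walltime request against the 72-hour limit.
MAX_WALLTIME_SECONDS=$((72 * 3600))

# Print the number of seconds in an HH:MM:SS string
# (prints nothing unless there are three colon-separated fields).
walltime_seconds() {
    echo "$1" | awk -F: 'NF == 3 { print $1 * 3600 + $2 * 60 + $3 }'
}

# Succeed only if the request parses and is within the limit.
walltime_ok() {
    secs=$(walltime_seconds "$1")
    [ -n "$secs" ] && [ "$secs" -le "$MAX_WALLTIME_SECONDS" ]
}

# Example: warn before submitting a job that asks for too much time.
if ! walltime_ok "96:00:00"; then
    echo "walltime exceeds 72 hours; the job would remain queued"
fi
```

The conversion is done in awk rather than shell arithmetic so that zero-padded fields such as 08 are read as decimal numbers.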


User Environment

To log in, use SSH from any host on the William & Mary or VIMS networks and connect to bora.sciclone.wm.edu with your HPC username (usually the same as your WMuserid) and W&M password.

Your home directory on Bora and Hima is the same as everywhere else on SciClone, and all of the usual filesystems (/sciclone/homeXX, /sciclone/dataXX, /sciclone/scrXX, /local/scr, etc.) are available throughout the Bora and Hima subclusters. Additionally, the parallel filesystem /sciclone/pscr is available.

SciClone uses Environment Modules (a.k.a. Modules) to automatically configure the user's shell environment across multiple computing platforms, as well as to organize the dozens of different software packages which are available on the system. We support tcsh as the primary shell environment for user accounts and applications.

The file which controls startup modules for Bora and Hima is .cshrc.el7-xeon. The most recent version of this file can be found in /usr/local/etc/templates on any of the front-end servers (including bora.sciclone.wm.edu).


Preferred filesystems

The preferred file system for all work on Bora is the parallel scratch file system available at /sciclone/pscr/$USER on the front-end and compute nodes. /sciclone/scr10/$USER is a good alternative (NFS, but connected to the same InfiniBand switch).
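One way to follow this recommendation is to run each Bora job from its own directory under /sciclone/pscr/$USER. The snippet below is a sketch that writes such a tcsh job script to a file. The file name bora_pscr.pbs, the job name, the node count, the walltime, the per-job directory layout based on $PBS_JOBID, and the executable a.out are all assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch: write a Bora job script that runs out of the parallel scratch area.
# bora_pscr.pbs and a.out are placeholder names.
cat > bora_pscr.pbs <<'EOF'
#!/bin/tcsh
#PBS -N pscr_run
#PBS -l nodes=2:bora:ppn=20
#PBS -l walltime=12:00:00
#PBS -j oe

# Assumption: one run directory per job, named after the TORQUE job ID.
set rundir = /sciclone/pscr/$USER/$PBS_JOBID
mkdir -p $rundir

# Stage the executable from the submission directory, then run from scratch.
cp $PBS_O_WORKDIR/a.out $rundir
cd $rundir
mvp2run ./a.out >& LOG
EOF

# Submit from the front-end with: qsub bora_pscr.pbs
```

Keeping one directory per job ID avoids output from separate runs overwriting each other; results that need to be kept would still have to be copied out of scratch afterwards.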

On Hima, the preferred filesystem is /local/scr, which on Hima nodes is much larger and faster than the /local/scr on most other nodes. Hima presently shares its link to the FDR-hosted global filesystems (/sciclone/pscr, scr10, data10, aiddata10, and baby10) with Whirlwind and Hurricane, so you may get more consistent performance from /sciclone/scr20. We intend to rectify this in the future.


Compiler flags

Bora and Hima have the Intel Parallel Studio XE 2017 compiler suite as well as version 4.9.4 of the GNU compiler suite. Here are suggested compiler flags which should result in fairly optimized code on their Broadwell architecture (taken from http://www.prace-ri.eu/IMG/pdf/Best-Practice-Guide-Haswell.pdf):

Intel  C        icc -O3 -xCORE-AVX2 -fma -align -finline-functions
       C++      icpc -std=c++11 -O3 -xCORE-AVX2 -fma -align -finline-functions
       Fortran  ifort -O3 -xCORE-AVX2 -fma -align array64byte -finline-functions
GNU    C        gcc -march=broadwell -O3 -mfma -malign-data=cacheline -finline-functions
       C++      g++ -std=c++11 -march=broadwell -O3 -mfma -malign-data=cacheline -finline-functions
       Fortran  gfortran -march=broadwell -O3 -mfma -malign-data=cacheline -finline-functions

MPI

Currently there are three MPI implementations available on the Bora subcluster: openmpi (v2.1.1), intel-mpi (v2016 and v2017), and mvapich2-ib (v2.2). OpenMPI and MVAPICH2 should both be used through the mvp2run wrapper script. OpenMPI can launch both pure MPI and hybrid MPI/OpenMP jobs, while MVAPICH2 can only launch pure MPI jobs. For pure MPI jobs, the syntax is the same for both:


#!/bin/tcsh 
#PBS -N MPI 
#PBS -l nodes=5:bora:ppn=20 
#PBS -l walltime=12:00:00 
#PBS -j oe 

cd $PBS_O_WORKDIR 

mvp2run ./a.out >& LOG

Hima GPUs

Hima nodes hi04-hi07 are now each equipped with one Tesla-class GPU. Nodes hi04 and hi05 have an Nvidia Tesla P100, while hi06 and hi07 have a V100. All GPUs have 16 GB of memory. All nodes have CUDA v9.1 installed.

Hima nodes with a GPU can be specified in the Torque batch system by using the following node specs:

-l nodes=1:hima:v100:ppn=64   # to select a hima node with a v100 GPU 

-l nodes=1:hima:p100:ppn=64   # to select a hima node with a p100 GPU 

-l nodes=1:hima:gpu:ppn=64   # to select a hima node with either a p100 or v100 GPU 

Since only one user at a time can access the GPU, we suggest that users take the whole Hima node (i.e. ppn=64) if they plan to use the GPU.
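Putting the node spec and the whole-node suggestion together, a GPU job might look like the sketch below, which writes a tcsh job script to a file. The file name hima_gpu.pbs, the job name, the walltime, and the executable ./gpu_app are placeholders, and the script assumes that nvidia-smi is on the default path of the GPU nodes.

```shell
#!/bin/sh
# Sketch: write a job script that takes a whole Hima GPU node.
# hima_gpu.pbs and ./gpu_app are placeholder names.
cat > hima_gpu.pbs <<'EOF'
#!/bin/tcsh
#PBS -N gpu_job
#PBS -l nodes=1:hima:gpu:ppn=64
#PBS -l walltime=12:00:00
#PBS -j oe

cd $PBS_O_WORKDIR

# Assumption: nvidia-smi is available; record which GPU the job landed on.
nvidia-smi >& gpu_info.log

# Assumption: gpu_app is a CUDA executable built by the user.
./gpu_app >& LOG
EOF

# Submit from the front-end with: qsub hima_gpu.pbs
```

Replacing the gpu property with p100 or v100 in the -l line pins the job to one GPU model, at the cost of a smaller pool of eligible nodes.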

Please send email to hpc-help@wm.edu if you have questions about setting up jobs or installing software to take advantage of the Hima GPUs.