Running jobs (TORQUE/Maui)

To ensure that users' calculations do not interfere with each other and that computational resources are allocated fairly and efficiently, W&M HPC systems employ the TORQUE resource manager in conjunction with the Maui cluster scheduler. With few exceptions, any computation on W&M HPC systems must be submitted as a job and run through TORQUE/Maui -- collectively the "job scheduler" or "batch system". To schedule your job and assign resources to it, the system needs to know what your job requires, so you must answer the following questions and supply those answers to TORQUE's qsub command.
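In practice, each answer becomes a resource request, given either on the qsub command line or as a #PBS directive at the top of the job script. A minimal sketch of the command-line form, with purely illustrative values and a placeholder script name:

    qsub -l nodes=2:ppn=8,walltime=12:00:00 my_job.sh

The pieces of that request (node type, number of nodes, processors per node, and walltime) are explained below.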

What type of computer?

W&M's HPC systems are composed of several different types of computers. If your job can run on any computer ("node") in the cluster, that's excellent (your job is easier to fit in and will start sooner)! However, if it needs a specific type of computer, you must select which ones are acceptable using node properties/features.
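For example, to restrict a job to nodes that carry a particular property, append the property name to the node specification. The property name bigmem below is just a placeholder; the pbsnodes command lists the properties actually assigned to each node:

    qsub -l nodes=1:ppn=1:bigmem my_job.sh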

How many of them?

Merely allocating extra nodes will not increase performance by itself, but if you know that your application can use multiple nodes simultaneously (distributed-memory parallelism, e.g. with MPI), you can request that a particular number of nodes be allocated to your job.
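For instance, a request for four nodes might look like the sketch below (the count is illustrative; processors per node are covered in the next question):

    qsub -l nodes=4 my_mpi_job.sh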

How many processors per node?

Every node in the cluster has more than one processor. Again, merely allocating extra processors will not increase performance, but if you know that your application can use multiple processors simultaneously (shared-memory parallelism, e.g. with OpenMP or threads), you can request a particular number of processors per node.
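For example, a threaded program that can keep eight processors on a single node busy might be submitted as in the sketch below (counts illustrative). Inside the script you would typically set OMP_NUM_THREADS, or your threading library's equivalent, to match the ppn value:

    qsub -l nodes=1:ppn=8 my_threaded_job.sh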

How long?

You must give the job scheduler an upper bound on how long your job will run, called walltime ("wall" as in the real time you would see on a clock on the wall, as distinguished from "CPU" time, which counts only the time spent actively using a processor).
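Walltime is specified as hours:minutes:seconds; for example, to set a 12-hour limit (the value is illustrative):

    qsub -l walltime=12:00:00 my_job.sh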

The maximum is usually either 180 hours (for older nodes) or 72 hours (for newer nodes). Computations that cannot complete within this limit should be broken up into multiple jobs by incorporating a checkpoint/restart capability (which is advisable in any case, as protection against equipment failures).

Because actual runtimes are not known until a job completes, the job scheduler can only schedule jobs based on their declared walltime limits. Both turnaround time for individual jobs and overall system utilization improve when walltime limits are reasonably accurate. Excessive time limits lower your job's priority and reduce opportunities to fit your job into holes in the schedule, so you may have to wait longer for your results when the system is busy. On the other hand, be sure that walltime limits provide enough cushion that your jobs will not be terminated prematurely if they take a little longer than expected.
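Putting the pieces together, a complete job script might look like the sketch below. All names and values here are placeholders to adapt to your own work, and mpirun is shown only as a generic MPI launcher; the correct launch command depends on which MPI stack you use:

    #!/bin/bash
    #PBS -N my_job
    #PBS -l nodes=2:ppn=8
    #PBS -l walltime=12:00:00
    #PBS -j oe

    # The directives above name the job, request two nodes with eight
    # processors each and up to 12 hours of walltime, and merge stdout
    # and stderr into a single output file.

    # Jobs start in the home directory, so change to the directory from
    # which the job was submitted.
    cd $PBS_O_WORKDIR

    # Launch 16 MPI processes (2 nodes x 8 processors per node).
    mpirun -np 16 ./my_program

Since the #PBS lines carry the resource requests, this script can be submitted with a bare qsub my_job.sh.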