Managing disk space

A primary goal of the HPC clusters is to enable researchers to tackle problems that are too big to address using personal computers or departmental servers. Many of these problems involve large datasets as input, output, or both. In addition, storage requirements can vary by several orders of magnitude from one research project to another. We believe that establishing disk quotas in this type of environment is both problematic and counter-productive; consequently, the only firm limit on disk space utilization within the clusters is the capacity of the filesystems.

This open-ended policy does not, however, mean that individual users can disregard their storage requirements. In fact, quite the opposite is true. If a user fills up a filesystem, it affects not only their own work but also that of everyone else who shares that filesystem. Jobs that are using that filesystem will likely fail, resulting in potentially thousands of CPU-hours of wasted computation. It is therefore every user's responsibility to ensure that filesystems have more than enough space to accommodate whatever data they intend to place there. If you are unsure how much output a computation will generate, run a small test case first and use that to extrapolate to a full run. Similar considerations apply to large file transfers, whether within the clusters or from external hosts.
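The extrapolation from a small test case can be sketched in a few lines of shell. This is only an illustration: the function name estimate_full_kb and the figures are hypothetical, and the sketch assumes that output size grows linearly with the number of steps, which may not hold for every application.

```shell
#!/bin/sh
# Sketch: extrapolate full-run output from a small test case.
# Assumes output size scales linearly with the number of steps, which may
# not hold for every application.

# estimate_full_kb TEST_KB TEST_STEPS FULL_STEPS
# Prints the extrapolated output size of the full run, in kilobytes.
estimate_full_kb() {
    echo $(( $1 * $3 / $2 ))
}

# Hypothetical figures: a 100-step test run wrote 2500000 KB, and the
# full run will take 10000 steps. On a real test run, the first figure
# could be measured with: du -sk test_output | cut -f1
estimate_full_kb 2500000 100 10000
```

Compare the estimate against the "Avail" figure reported by df for the target filesystem, and leave a generous margin for other users' jobs.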

Use the df command to determine how much space is available on a given filesystem. For example, to check the status of your home directory filesystem, you could run

   df -h $HOME

which might produce output similar to the following:

   Filesystem      Size  Used Avail Use% Mounted on
   /dev/sdb1        11T  2.2T  8.8T  20% /sciclone/home20

The most important figure is "Avail", which indicates how much free space is left on the filesystem. With the -h option, df reports sizes in units of kilobytes (K), megabytes (M), gigabytes (G), or terabytes (T). As another example, to query the available space on SciClone's global scratch filesystems,

   df -h /sciclone/*scr*

would produce output of the form

   Filesystem                 Size  Used Avail Use% Mounted on
   192.168.56.208@o2ib:/pscr  147T  3.4T  136T   3% /sciclone/pscr
   mst00:/sciclone/scr-mlt     73T  6.0T   67T   9% /sciclone/scr-mlt
   bz00-i8:/sciclone/scr10    101T   49T   53T  49% /sciclone/scr10
   tw00-i8:/sciclone/scr20     73T   14T   60T  19% /sciclone/scr20
   tn00:/sciclone/scr30        17T   15T  2.4T  86% /sciclone/scr30

This shows that available space on the scratch filesystems at that particular moment varies between 2.4 TB and 136 TB. Pick one that matches your needs. If a job needs only a few hundred gigabytes of storage, it could easily be directed (in this example) to /sciclone/scr30. On the other hand, if it will be generating a couple of terabytes of output, the other four would be the only feasible choices. Bear in mind that multiple users and/or multiple jobs may be writing to a filesystem concurrently, and that the amount of space available when a job is submitted may be reduced by the time it begins execution or is ready to write out results.
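Because free space can shrink between submission and execution, a job script might re-check the target filesystem just before writing. The following is a minimal sketch, not a site-provided tool: the function names avail_kb and has_room, the default safety factor of 2, and the 500 GB requirement are all assumptions made for illustration.

```shell
#!/bin/sh
# avail_kb PATH
# Prints the free space, in 1024-byte blocks, on the filesystem holding PATH.
# -P forces the POSIX one-line-per-filesystem format; -k reports kilobytes.
avail_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# has_room PATH NEEDED_KB [FACTOR]
# Succeeds if the free space is at least NEEDED_KB * FACTOR.
# FACTOR defaults to 2, an arbitrary margin for concurrent writers.
has_room() {
    free=$(avail_kb "$1")
    [ -n "$free" ] && [ "$free" -ge $(( $2 * ${3:-2} )) ]
}

# Placeholder target; substitute the scratch directory the job will use.
target=${TMPDIR:-/tmp}

# Hypothetical requirement: 500 GB of output, expressed in kilobytes.
if has_room "$target" $(( 500 * 1024 * 1024 )); then
    echo "enough space on $target"
else
    echo "not enough space on $target" >&2
fi
```

A check like this narrows the window in which a filesystem can fill up unnoticed, but it cannot close it, since other jobs may still be writing at the same time.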

Besides matching output to available space, users have an individual responsibility to use disk space efficiently. Although several of the filesystems are quite large, their capacities are not unlimited. To help ensure that space is available when it is needed, there are several steps you can take:

  • Delete files which are no longer needed or which have been replicated on another system.
  • Don't use long-term storage (home or data directories) when short-term scratch storage will suffice.
  • If the output doesn't have to be human-readable, use binary files instead of ASCII text; the former is usually more compact and also more efficient to read and write.
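To find candidates for deletion, it can help to list files that have not been modified for some time. The sketch below only prints file names and deletes nothing; the function name list_stale and the 90-day threshold are arbitrary choices for illustration.

```shell
#!/bin/sh
# list_stale DIR DAYS
# Lists regular files under DIR that were last modified more than DAYS
# days ago. Nothing is deleted; review the list before removing anything.
list_stale() {
    find "$1" -type f -mtime +"$2" -print
}
```

For example, list_stale /sciclone/scr10/$USER 90 would list files in that scratch directory (assuming it exists for your account) that have gone untouched for more than 90 days.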

You can monitor your own disk usage with the du command. For example, to see how much space you are using in each of SciClone's global filesystems:

   du -hsc /sciclone/*/$USER
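When one of those totals is larger than expected, a per-entry breakdown shows where the space is going. This is a minimal sketch using only standard du, sort, and head options; the function name largest_entries is hypothetical, and hidden files (names beginning with a dot) are skipped because the shell's * pattern does not match them.

```shell
#!/bin/sh
# largest_entries DIR [COUNT]
# Prints the COUNT (default 10) largest files and subdirectories directly
# under DIR, largest first, with sizes in kilobytes.
largest_entries() {
    du -sk "$1"/* 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# Example: the five largest entries in your home directory.
largest_entries "${HOME:-.}" 5
```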

If everyone remains vigilant in the management of their personal files, the system as a whole will operate more smoothly and the clusters can fulfill their mission of providing exceptional computing capabilities for demanding problems.