Managing disk space

A primary goal of the HPC clusters is to enable researchers to tackle problems that are too big to address using personal computers or departmental servers. Many of these problems involve large datasets, either as input, output, or both. In addition, storage requirements can vary by several orders of magnitude from one research project to another. We believe that establishing disk quotas in this type of environment is both problematic and counter-productive; consequently, the only firm limit on disk space utilization within the clusters is the capacity of the filesystems.

This open-ended policy does not, however, mean that individual users can disregard their storage requirements. In fact, quite the opposite is true. If a user fills up a filesystem, it not only affects their work, but also that of everyone else who shares that filesystem. Jobs which are using that filesystem will likely fail, resulting in potentially thousands of CPU-hours of wasted computation. It is therefore every user's responsibility to ensure that filesystems have more than enough space to accommodate whatever data they intend to place there. If you are unsure how much output a computation will generate, run a small test case first and use that to extrapolate to a full run. Similar considerations apply to large file transfers, whether within the clusters or from external hosts.
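
For instance, assuming a small test case has already written its output to a hypothetical directory named test_out, you could measure that output with du and scale it by the ratio of problem sizes:

   # Hypothetical example: measure the output of a small test run
   du -sh test_out
   # If the full run is roughly 100x larger (more time steps, a finer grid, etc.),
   # budget about 100x this figure, plus a generous safety margin.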

Use the df command to determine how much space is available on a given filesystem. For example, to check the status of the filesystem holding your home directory, you could run

   df -h $HOME

which might produce output similar to the following:

   Filesystem      Size  Used Avail Use% Mounted on
   /dev/sdb1       5.5T  3.9T  1.6T  72% /sciclone/home2

The most important figure is "Avail", which shows how much free space remains on the filesystem, in units of kilobytes (K), megabytes (M), gigabytes (G), or terabytes (T). Also of interest is the "Use%" figure; if it is in the 90% range or above, you should probably direct your output somewhere else. Bear in mind that multiple users and/or multiple jobs may be writing to a filesystem concurrently, and that the space available when a job is submitted may have shrunk by the time the job begins execution or is ready to write out results.
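
If you want a job script to verify that enough space is available before it starts writing, one possible sketch (assuming GNU coreutils df and a hypothetical requirement of 500 GB on /sciclone/scr10) is:

   # Sketch: abort unless at least 500 GB is free on the chosen filesystem
   NEEDED_KB=$((500 * 1024 * 1024))     # 500 GB expressed in 1K blocks
   AVAIL_KB=$(df -k --output=avail /sciclone/scr10 | tail -1 | tr -d ' ')
   if [ "$AVAIL_KB" -lt "$NEEDED_KB" ]; then
       echo "Not enough free space on /sciclone/scr10; pick another filesystem." >&2
       exit 1
   fi

Such a check is only a snapshot; other users and jobs may consume the remaining space after it runs, so leave yourself a comfortable margin.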

As another example, to query the available space on SciClone's global scratch filesystems,

   df -h /sciclone/*scr*

would produce output of the form

   Filesystem                  Size  Used Avail Use% Mounted on
   192.168.56.208@o2ib:/pscr   147T  5.2T  134T   4% /sciclone/pscr
   mst00:/sciclone/scr-mlt      73T   45T   29T  62% /sciclone/scr-mlt
   bz00-i8:/sciclone/scr10     101T   64T   38T  63% /sciclone/scr10
   tn00:/sciclone/scr30         17T   14T  3.0T  83% /sciclone/scr30

This shows that available space on the scratch filesystems at that particular moment varies between 3.0 TB and 134 TB. Pick one that matches your needs. If a job needs only a few hundred gigabytes of storage, it could easily be directed (in this example) to /sciclone/scr30. On the other hand, if it will be generating a couple of terabytes of output, the other three would be the only feasible choices.
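
For example, a batch job could create its own subdirectory on the chosen scratch filesystem and write all of its results there (the run-directory naming below is purely illustrative):

   # Illustrative: stage job output on a chosen scratch filesystem
   SCRATCH=/sciclone/scr30/$USER/run_$(date +%Y%m%d_%H%M%S)
   mkdir -p "$SCRATCH"
   cd "$SCRATCH" || exit 1
   # ... run the application here, writing its output under $SCRATCH ...

Results worth keeping can then be copied back to home or data storage once the job completes.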

Besides matching output to available space, users have an individual responsibility to use disk space efficiently. Although several of the filesystems are quite large, their capacities are not unlimited. To help ensure that space is available when it is needed, there are several steps you can take. Delete files which are no longer needed or which have been replicated on another system. Don't use long-term storage (home or data directories) when short-term scratch storage will suffice. If the output doesn't have to be human-readable, use binary files instead of ASCII text; the former is usually more compact and also more efficient to read and write.
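
For example, the following housekeeping commands may help (the paths and the 90-day threshold are only illustrative; adjust them to your own situation):

   # List your scratch files that have not been modified in the last 90 days
   find /sciclone/scr30/$USER -type f -mtime +90 -ls
   # Compress a large ASCII output file that must be kept but is rarely read
   gzip large_output.txt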

You can monitor your own disk usage with the du command. To see how much space is being taken up by your home directory,

   du -hs $HOME

To see how much space you are using on the big-data filesystems and the global scratch filesystems,

   du -hsc /sciclone/{data,*scr}*/$USER
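
To see which subdirectories account for most of that usage, du can be combined with sort (the -h option to sort, which understands human-readable size suffixes, assumes the GNU version):

   # Largest subdirectories of your home directory, biggest last
   du -h --max-depth=1 $HOME | sort -h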

If everyone remains vigilant in the management of their personal files, the system as a whole will operate more smoothly and the clusters can fulfill their mission of providing exceptional computing capabilities for demanding problems.