Usage Guidelines
General guidelines
The head node, della, should be used for interactive work only, such as compiling programs and submitting jobs as described below. No jobs should be run on the head node other than brief tests that last no more than a few minutes. Where practical, we ask that you entirely fill the 12-core nodes so that CPU core fragmentation is minimized.
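As a rough illustration of filling whole nodes, the sketch below rounds a core count up to complete 12-core nodes and prints the matching resource request. The helper name full_node_request is hypothetical; the nodes=N:ppn=12 form follows the PBS request syntax used elsewhere in this guide.

```shell
#!/bin/bash
# Hypothetical helper: round a core count up to whole 12-core nodes.
CORES_PER_NODE=12   # every Della node has 12 cores

full_node_request() {
    cores=$1
    # Integer ceiling division: nodes needed to hold the requested cores
    nodes=$(( (cores + CORES_PER_NODE - 1) / CORES_PER_NODE ))
    echo "nodes=${nodes}:ppn=${CORES_PER_NODE}"
}

full_node_request 30   # prints nodes=3:ppn=12
```

A job that needs 30 cores would therefore ask for three full nodes (36 cores) rather than leaving a partially used node behind.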
Job Scheduling
All jobs must be run through the scheduler on Della. If a job would exceed any of the limits below, it will be held until it is eligible to run. Regular jobs should not specify a queue; leaving the queue unset lets the default routing queue distribute them accordingly. Users belonging to groups with an associated queue (chan, hep, prism, and vlong) should specify the queue in the job submission script (#PBS -q <queue_name>).
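A minimal sketch of a job submission script for a member of a group queue might look like the following. The queue name hep comes from the list above; the node count, wall time, and program name are placeholders chosen for illustration, not recommended values. Regular jobs would simply omit the -q line.

```shell
#!/bin/bash
# Group-queue users only; regular jobs leave this line out so the
# routing queue can place the job.
#PBS -q hep
# Two full 12-core nodes (placeholder size)
#PBS -l nodes=2:ppn=12
# Requested wall time (placeholder value)
#PBS -l walltime=01:00:00

# PBS_O_WORKDIR is set by the scheduler to the directory the job was
# submitted from; fall back to the current directory outside a job.
cd "${PBS_O_WORKDIR:-.}" || exit 1

echo "Job started on $(uname -n) in $(pwd)"
# Replace the line below with the real workload, e.g.:
#   ./my_program        (hypothetical executable name)
```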
Jobs will move to the test, short, or medium queue. The queues are differentiated by the amount of time requested, as follows:
test queue
1 hour limit
30 node maximum
360 core maximum allocation
2 job maximum per user
NOT for production mode
short queue
24 hour limit
40 job maximum
128 processor maximum allocation
medium queue
72 hour limit
16 jobs maximum per user
128 processor maximum per user
432 total cores
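The time limits above can be summarized in a short sketch. It assumes that routing depends only on the requested wall time in whole hours, which is a simplification; the real scheduler also enforces the node, core, and job-count limits listed for each queue. The function name expected_queue is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: which queue a job would be expected to land in,
# judged only by its requested wall time in whole hours.
expected_queue() {
    hours=$1
    if [ "$hours" -le 1 ]; then
        echo "test"      # 1 hour limit, not for production runs
    elif [ "$hours" -le 24 ]; then
        echo "short"     # 24 hour limit
    elif [ "$hours" -le 72 ]; then
        echo "medium"    # 72 hour limit
    else
        echo "none"      # longer than any of the three queues allows
    fi
}

expected_queue 48   # prints medium
```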
Jobs are further prioritized by the Moab scheduler based on a number of factors: job size, run times, node availability, wait times, and percentage of usage over a 7-day period.
Distribution of CPU and memory
There are 1344 processors available: della-001 through della-064 have 48 GB (4 GB memory per core), and della-065 through della-112 have 96 GB (8 GB memory per core). All nodes have 12 cores.
The 96 GB nodes can be requested by adding the large memory node flag (mem96) to your PBS allocation request, e.g., nodes=2:ppn=12:mem96.
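To show how the per-core memory figures translate into a request, the sketch below picks between the standard and mem96 forms. It assumes one process per core and whole-GB estimates, and it treats the 4 GB and 8 GB figures as nominal, since some memory on each node is used by the system. The helper name node_request is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: build a node request from the number of nodes and
# the memory (in whole GB) each process is expected to need.
node_request() {
    nodes=$1
    gb_per_process=$2
    if [ "$gb_per_process" -le 4 ]; then
        # Fits on any node: 48 GB / 12 cores = 4 GB per core
        echo "nodes=${nodes}:ppn=12"
    elif [ "$gb_per_process" -le 8 ]; then
        # Needs the large memory nodes: 96 GB / 12 cores = 8 GB per core
        echo "nodes=${nodes}:ppn=12:mem96"
    else
        # More than 8 GB per process cannot fill all 12 cores of a node
        echo "more than 8 GB per process: use fewer processes per node" >&2
        return 1
    fi
}

node_request 2 6   # prints nodes=2:ppn=12:mem96
```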
The nodes are all connected with QDR Infiniband.
Appropriate File System Usage
/home (shared via NFS to all the compute nodes, 280 GB) is intended for scripts, source code, executables and small static data sets that may be needed as standard input/configuration for codes.
/scratch/network (shared via NFS to all the compute nodes, 2.9 TB) is intended for dynamic data that doesn't require high bandwidth i/o such as storing final output for a compute job. You may create a directory /scratch/network/myusername, and use this to place your temporary files. Files are NOT backed up so this data should be moved to persistent storage once it is no longer needed for continued computation.
/scratch/pvfs2 (shared via PVFS2 to all the compute nodes, 4.1 TB) is intended for dynamic data that requires higher bandwidth i/o. Files are NOT backed up so this data should be moved to persistent storage as soon as it is no longer needed for computations. This filesystem is now removing files over 60 days old.
/tigress-hsm (shared via GPFS to all TIGRESS resources, 270 TB) is intended for more persistent storage and should provide high bandwidth i/o (400 MB/s aggregate bandwidth for jobs across 16 or more nodes). Users are provided with a default quota of 512 GB when they request a directory in this storage, and that default can be increased by requesting more. We do ask people to consider what they really need, and to make sure they regularly clean out data that is no longer needed since this filesystem is shared by the users of all our systems.
/scratch (local to each compute node) is intended for data local to each task of a job, and it should be cleaned out at the end of each job. Nodes 001 through 064 have about 130 GB available, while the others, 065 through 112, have about 1400 GB.
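One possible way to make sure node-local scratch is cleaned out at the end of a job is sketched below. The directory layout under /scratch, the SCRATCH_ROOT variable, and the helper name run_in_local_scratch are all assumptions made for illustration; PBS_JOBID is the job identifier the scheduler sets inside a running job.

```shell
#!/bin/bash
# Sketch: run a command in a per-job directory on node-local scratch and
# remove that directory afterwards, whether or not the command succeeds.
SCRATCH_ROOT=${SCRATCH_ROOT:-/scratch}   # node-local file system (overridable)
JOB_TAG=${PBS_JOBID:-manual.$$}          # fall back to the shell PID outside a job

run_in_local_scratch() {
    workdir="$SCRATCH_ROOT/$(id -un)/$JOB_TAG"
    mkdir -p "$workdir" || return 1
    # Run the given command inside the scratch directory, in a subshell
    ( cd "$workdir" && "$@" )
    status=$?
    # Anything worth keeping must be copied elsewhere by the command itself
    # (for example to /scratch/network/myusername) before this cleanup.
    rm -rf "$workdir"
    return $status
}

# Example (hypothetical program path):
#   run_in_local_scratch "$PBS_O_WORKDIR/my_program" input.dat
```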
Running 3rd-Party Software
If you are running 3rd-party software whose characteristics (e.g., memory usage) you are unfamiliar with, please check your job after 5-15 minutes using 'top' or 'ps -ef' on the compute nodes being used. If the memory usage is growing rapidly, or is close to exceeding the per-processor memory limit, you should terminate your job before it causes the system to hang or crash. You can determine which node(s) your job is running on using the "checkjob <jobnumber>" command.
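As a complement to 'top', the following sketch totals the resident memory of your own processes on the node where it is run, so the figure can be compared with the node's 48 GB or 96 GB. The helper names sum_rss_mb and my_memory_mb are hypothetical, and the sketch assumes a 'ps' that reports the rss column in KB, as the common Linux implementation does.

```shell
#!/bin/bash
# Add up resident set sizes given in KB, one per line, and print whole MB.
sum_rss_mb() {
    awk '{ total += $1 } END { printf "%d\n", total / 1024 }'
}

# Resident memory in MB used by all of your processes on this node.
my_memory_mb() {
    ps -u "$(id -un)" -o rss= | sum_rss_mb
}

# Typical use, on a compute node reported by checkjob <jobnumber>:
#   my_memory_mb
```

Running my_memory_mb a few minutes apart gives a quick indication of whether memory usage is still growing.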
Please remember that these are shared resources for all users.
