Usage Guidelines
Artemis is a powerful cluster that enables a wide variety of research. We hope that everyone who uses it will find it makes their work more productive. However, there are a few simple guidelines that are necessary to make sure the machine is productive for everyone.
Architecture
Artemis is an SGI ICE cluster with 192 nodes (8 cores per node), for a total of 1536 cores. Each node has 32 GB of RAM (4 GB/core). It has a fast interconnect between the nodes for efficient message passing, generally using MPT (although OpenMP can be used between cores on a single node).
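As a quick sanity check, the totals follow from simple arithmetic on the per-node figures. The variable names below are illustrative only.

```shell
# Per-node figures quoted in the Architecture section.
nodes=192
cores_per_node=8
ram_per_node_gb=32

# Derived totals.
total_cores=$((nodes * cores_per_node))
ram_per_core_gb=$((ram_per_node_gb / cores_per_node))

echo "total cores: ${total_cores}"
echo "RAM per core: ${ram_per_core_gb} GB"
```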
Fair Share Usage
It is in large part a user's own responsibility to make sure that they do not exceed their fair share portion of Artemis resources, and that they use the machine in a manner that does not inhibit or interfere with the running of jobs by others.
The machine has been optimized for users requiring a large number of processors that can exploit its fast interconnect capability. If your project consists of a large number of single-processor jobs, we suggest that you use another machine, such as "della", which has been optimized for that type of computation.
Job Queue
The overwhelming advantage of Artemis compared to national resources (such as the NSF TeraGrid) is the quick turn-around and throughput of jobs, and it is critical that users submit jobs to the queue in a way that preserves this quality. Thus, we ask that users:
- Submit jobs that use all 8 processors per node, so that jobs can be packed on the nodes in the most effective fashion (use the line:
#PBS -l nodes=##:ppn=8
in your PBS script, where ## is the number of nodes needed). If your code cannot use all 8 cores per node, we suggest using another PICSciE machine.
- While long-running (up to 100 hours) small jobs are welcome, please do not submit large jobs (those using more than 16 nodes and 128 processors) for more than 48 hours, unless machine use is very light (less than 50% of the nodes in active use). A few such long jobs can quickly lock out all other users from running jobs for days at a time. If users abuse the privilege of running long jobs, run time limits will have to be imposed on all jobs.
- No one user should run jobs on more than their fair share of nodes (based on the fraction of nodes owned by that user's group, e.g. astrophysics or engineering), unless usage is very light (less than 50% of the nodes in use). If you are using more than your group's fair share of nodes, please submit only short-running (less than 48 hours) jobs to prevent locking out other users for long periods.
- Single processor jobs should be run on other PICSciE machines, as such jobs do not take advantage of the expensive, high-speed network on Artemis.
- Please do not submit more than 2-3 jobs to the queue at any one time.
- Please try to keep the last 32-64 processors on the machine either empty or running small/short jobs (<16 processors, <5 hours), especially between 10am and 10pm. This allows debugging and development and keeps the machine responsive.
- To monitor jobs, use showq and pbstop. A good setup is:
alias ptop='pbstop -c 10 -01234567'
With the "ptop" command, one can see how many nodes/cpus are available and in what configuration.
- Most importantly, please monitor the queue and watch the status of your own jobs. If you see other jobs backing up in the queue because your own jobs are taking up more than your fair share, please cut back. Nothing does more to ensure smooth and rapid turn-around of jobs than the attentiveness and courtesy of all users toward one another.
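The whole-node and run-time requests in this list can be sketched as a few shell helpers. This is only an illustration using the numbers quoted above (8 cores per node, 48 hours for jobs larger than 16 nodes, 100 hours otherwise). The function names nodes_for_cores, pbs_nodes_line and walltime_ok are hypothetical and are not part of PBS or any Artemis tooling, and the exception for a lightly loaded machine is ignored.

```shell
# Sketch only: these helpers are hypothetical, not part of PBS.
CORES_PER_NODE=8         # cores per node on Artemis
LARGE_JOB_NODES=16       # jobs above this node count count as "large"
LARGE_JOB_MAX_HOURS=48   # requested run-time ceiling for large jobs
SMALL_JOB_MAX_HOURS=100  # requested run-time ceiling for small jobs

# Round a core count ($1) up to whole nodes, so every node is used in full.
nodes_for_cores() {
    echo $(( ($1 + CORES_PER_NODE - 1) / CORES_PER_NODE ))
}

# Print the resource-request line for a job on $1 whole nodes.
pbs_nodes_line() {
    echo "#PBS -l nodes=$1:ppn=${CORES_PER_NODE}"
}

# Succeed if a job on $1 nodes running for $2 hours fits the guidelines.
# The "machine use is very light" exception is deliberately not modeled.
walltime_ok() {
    if [ "$1" -gt "$LARGE_JOB_NODES" ]; then
        [ "$2" -le "$LARGE_JOB_MAX_HOURS" ]
    else
        [ "$2" -le "$SMALL_JOB_MAX_HOURS" ]
    fi
}

# Example: a code that wants 60 cores should request 8 full nodes.
job_nodes=$(nodes_for_cores 60)
pbs_nodes_line "$job_nodes"

if walltime_ok "$job_nodes" 72; then
    echo "72 hours on ${job_nodes} nodes is within the guidelines"
fi
```

Rounding up to whole nodes means a 60-core code is given 64 cores, which keeps the scheduler able to pack jobs node by node. The official sample submission scripts are in the Artemis Tutorial.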
Disks
A user's home directory should be used only for source code and input files needed for production runs. All data generated by production runs should be stored on the /tigress-hsm file system, which has 100 TB of fast (parallel) disks. Filling the home directories with data will make the entire machine unusable for others!
Members of the astrophysics group can also use the Lustre filesystem /scratch/lustre.
Please do not use the filesystems on Artemis as a backup or rsync target for other disks. This defeats the purpose of buying expensive high-performance storage systems.
Use scp or sftp to move data from Artemis to your local workstation.
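A minimal sketch of such a transfer, run from your local workstation, is shown here. The user name, host name and directory layout are placeholders invented for illustration; substitute your own account, the cluster's actual login host, and the real location of your output. By default the script only prints the command it would run.

```shell
# Placeholder values (assumptions, not real Artemis settings).
remote_user="username"
remote_host="artemis.example.edu"
remote_dir="/tigress-hsm/username/run01"
local_dir="./run01"

# Source in scp's user@host:path form.
src="${remote_user}@${remote_host}:${remote_dir}"

# Print the command by default; set RUN=1 to actually copy.
# -r copies the directory recursively.
if [ "${RUN:-0}" = "1" ]; then
    scp -r "$src" "$local_dir"
else
    echo "scp -r $src $local_dir"
fi
```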
Head Node
The head node is primarily for compiling code, submitting jobs, and simple analysis operations. The head node has only 2 GB of memory (less than the memory per core on the nodes!). Thus, the head node can quickly get swamped if users run large analysis programs on it. OIT owns a 16-processor workstation with 128 GB of memory called 'tigressdata'; please log in to this machine and use it for analysis of data. Both the /scratch/lustre and /tigress-hsm disks are mounted on tigressdata.
In rare circumstances, large-scale post-processing involving large amounts of data can be done by submitting a dedicated job to the batch queue. Single-processor jobs involving analysis of data on /tigress-hsm generated by previous multi-processor runs on Artemis are welcome; indeed, these are the only acceptable form of single-processor jobs on the machine. Note that parallelized visualization software such as VisIt has been installed and can be run using the batch queues. In this way, the analysis can use as many processors, and as much memory, as needed. However, in general, running analysis software on Artemis itself should only be done if more than 16 processors or 128 GB of memory (the limits of tigressdata) are needed.
We hope you have a productive time on Artemis!
Useful environment flags and sample batch submission scripts can be found in the Artemis Tutorial.
