Data Storage

IMPORTANT: The /tigress storage system is approaching end-of-life. Quotas have been reduced for most users. The plan is to make both /tigress and /projects read-only on the cluster nodes within the coming months, in preparation for the new data management service, TigerData, which will replace /projects and /tigress.

We are not able to grant quota increases on /tigress or /projects. Users will need to use the /scratch/gpfs filesystems, which are not backed up.

 

Overview

The schematic diagram below shows Research Computing's clusters (as rectangles) and the filesystems that are available to each cluster (as ovals and a diamond). The available visualization nodes (grey machines) are also pictured.

HPC clusters and the filesystems that are available to each. Users should write job output to /scratch/gpfs.

The dotted line stemming from Traverse's and Stellar's /scratch/gpfs indicates that the /tigress and /projects filesystems are only available to users who have collaborations with researchers on Princeton's main campus.

Here are the most important takeaways from the diagram above:

  • /home/<YourNetID>
    • For - source code and executables
    • Details - This folder is small and it is intended for storing source code, executables, Conda environments, R or Julia packages and small data sets.
  • /scratch/gpfs/<YourNetID> (or /scratch/network/<YourNetID>, if on Adroit)
    • For - job output, intermediate results
    • Details - This is a fast, parallel filesystem that is local to each cluster, which makes it ideal for storing job input and output files. However, because /scratch/gpfs is not backed up, you will need to transfer your completed job files to /tigress or /projects when a backup is desired.
  • /tigress/<YourNetID> and /projects/<YourAdvisorsLastName>/
    • *NOTE* Access to these directories is not granted automatically. Your advisor must request that you get access to their /projects directory.
    • *NOTE* Quotas are being reduced on these systems. Increase requests are likely to be denied.
    • For - final results (/projects)
    • Details - These are the long-term storage systems. They are shared by the large clusters via a single, slow connection and they are designed for non-volatile files only (i.e., files that do not change over time). For these reasons one should never write the output of actively running jobs to /tigress or /projects. Doing so may adversely affect the work of other users and it may cause your job to run inefficiently. Instead, write your output to /scratch/gpfs/<YourNetID> and then, after the job completes, copy or move the output to /projects if a backup is desired. Some users will have access to /projects and not /tigress and vice versa.

Two additional important points:

  • /tmp (not shown in the figure) is local scratch space that exists on each compute node for high-speed reads and writes. If file I/O is a bottleneck in your code or if you need to store temporary data then you should consider using this.
  • Use the checkquota command to check your storage limits and to request quota increases (see the example below).
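For example, checkquota can be run from a login node; the checkquota page explains how to interpret the report and how to request an increase:

$ checkquota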

 

Using the Filesystems

Let's say that you just got an account on one of the HPC clusters and you want to start running jobs. Usually the first step is to install software in your /home directory. Most users begin by installing various packages for Python, R or Julia. By default these packages will be installed in /home/<YourNetID>. If you need to build your code from source, transfer the source code to your /home directory and compile it there. If the software you need is pre-installed, like MATLAB or Stata, then you are ready to proceed to the next step.
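For instance, a Python package can be installed into your /home directory with pip's --user flag, which places it under ~/.local. This is only a minimal sketch (the package name is just an example, and it assumes the login node's default python3 provides pip); many users create Conda environments instead:

$ ssh <YourNetID>@della.princeton.edu
$ python3 -m pip install --user numpy   # installs under /home/<YourNetID>/.local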

With your software ready to be used, the next step is to run a job. The /scratch/gpfs filesystem on each cluster is the right place for storing job files. Create a directory in /scratch/gpfs/<YourNetID> (or /scratch/network/<YourNetID> on Adroit) and put the necessary input files and Slurm script in that directory. Then submit the job to the scheduler. If the run produces output that you want to back up, transfer the files to /projects. The commands below illustrate these steps:

$ ssh <YourNetID>@della.princeton.edu
$ cd /scratch/gpfs/<YourNetID>
$ mkdir myjob
$ cd myjob
# put necessary files and Slurm script in myjob
$ sbatch job.slurm
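The contents of job.slurm depend on your code. Below is only a minimal sketch for a serial job, with a hypothetical script name (myscript.py) and resource requests that you should adjust, adding any module loads your software needs:

#!/bin/bash
#SBATCH --job-name=myjob         # short name for the job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks
#SBATCH --cpus-per-task=1        # cpu-cores per task
#SBATCH --mem-per-cpu=4G         # memory per cpu-core
#SBATCH --time=01:00:00          # total run time limit (HH:MM:SS)

python myscript.py               # replace with the command that runs your code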

Your files in /scratch/gpfs are not backed up, so if you want to back up the output after a job finishes, copy or move the files to /tigress or /projects using a command like the following:

$ cp -r /scratch/gpfs/<YourNetID>/myjob /projects/<YourAdvisorsLastName>/<YourDirectory>
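rsync is an alternative to cp that copies only files that are new or have changed, so it can be re-run safely if a transfer is interrupted. A sketch using the same hypothetical paths:

$ rsync -av /scratch/gpfs/<YourNetID>/myjob /projects/<YourAdvisorsLastName>/<YourDirectory>/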

In summary, install your software in /home, run jobs on /scratch/gpfs and transfer final job output to /projects for long-term storage and backup.

 

Additional Details

Given the small size of /home, users often run out of space, which can lead to many issues. If you need to request more space, see the checkquota page. There are also directions on that page for finding and removing large files, as well as for dealing with large Conda environments.
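For a quick look at what is using the space, standard tools are often enough. A sketch (the second command only applies if you use Conda, and it removes cached packages rather than environments):

$ du -h --max-depth=1 /home/<YourNetID> | sort -h   # size of each top-level directory
$ conda clean --all                                 # clear the Conda package cache (optional)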

The importance of not writing the output of actively running jobs to /tigress or /projects was emphasized above. Reading files or calling executables from these filesystems is allowed. However, in general, one will get better performance when using /scratch/gpfs or /home so those filesystems should be preferred.

A volatile file is one that changes over time. Actively running jobs tend to create volatile files, such as a log file that records the progress of the run. Avoid copying volatile files to /tigress or /projects, since any subsequent change will cause a new backup of the modified version of the file to be made. The long-term storage systems are for non-volatile files only. Only after a job has completed should its output files be transferred from /scratch/gpfs to /projects.

There have been multiple failures of /scratch/gpfs in the past. In some cases data was lost. It is your responsibility to copy important files to /projects for backup. Note that once you have copied the files to the backup system you can continue using them on /scratch/gpfs where the I/O performance is optimal.

Data that has been classified as level 0 (public) or level 1 (unrestricted) is allowed on the RC clusters. For level 2 (confidential) and level 3 (restricted) one must use the Secure Research Infrastructure. Learn more about data classification.

Each compute node on each cluster has a local scratch disk at /tmp. Most users will never need to use the local scratch space. If your workflow is I/O bound or if you need to write large temporary files then it may be useful to you.
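A common pattern is to stage input data into /tmp at the start of a job, compute against the local disk, and copy the results back to /scratch/gpfs before the job ends. The excerpt below is only a sketch with hypothetical file and program names:

# excerpt from a Slurm script
cp /scratch/gpfs/<YourNetID>/myjob/input.dat /tmp
cd /tmp
/home/<YourNetID>/software/myprogram input.dat > results.dat
cp /tmp/results.dat /scratch/gpfs/<YourNetID>/myjob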

Tigressdata is a machine for visualization and data analysis. As indicated in the diagram above, it mounts the /scratch/gpfs filesystems of Della and Tiger as well as /tigress and /projects. After a job completes you can SSH to tigressdata to start working with the new output. This keeps the login nodes of the large HPC clusters free from this type of work.
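For example (a sketch, assuming the machine answers at the hostname tigressdata.princeton.edu):

$ ssh <YourNetID>@tigressdata.princeton.edu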

 

Why is my /tigress directory missing?

Previously, users were given access to /tigress/<YourNetID> for storage. As of Fall 2020, the new approach is to create a single directory or fileset on /projects named after the PI of the user's research group. Individual group members can create directories for their work within this directory. Like /tigress, /projects is also backed up, so users should be responsible about the number and size of the files they store. The new approach allows group members to more easily share files. Also, when a group member leaves the university, their files on /projects will still be available to the group. If you need access to the /projects directory of your research group, write to [email protected] and CC the PI of the group, who will need to give their approval.

 

Data Management

Imagine that two researchers had to leave their positions in a hurry. Based on their files, whose work would you rather continue?

Researcher 1:
.
├── figure1.pdf
├── figure2.pdf
├── file1.py
├── file2.py
├── file3.py
├── main.tex
├── out_a_10
├── out_a_20
├── out_a_30
├── out_b_1
├── out_b_2
├── out_b_3
├── refs.bib
├── output1.log
└── output2.log

Researcher 2:

.
├── code
│   ├── analysis.py
│   ├── main.py
│   ├── README
│   └── tests
│       └── test_analysis.py
├── data
│   ├── effect_of_length
│   │   ├── length.log
│   │   ├── system1_length_10.csv
│   │   ├── system1_length_20.csv
│   │   └── system1_length_30.csv
│   ├── effect_of_width
│   │   ├── width.log
│   │   ├── system1_width_1.csv
│   │   ├── system1_width_2.csv
│   │   └── system1_width_3.csv
│   └── README
├── manuscript
│   ├── figures
│   │   ├── length.pdf
│   │   └── width.pdf
│   └── text
│       ├── main.tex
│       └── refs.bib
└── README

Of course, Researcher 2 has done a much better job of arranging and documenting their files. They are using multiple directories, descriptive file names and README files. The README file in the code directory may contain notes such as "The software was written by Alan Turing ([email protected]). Contact Alan if there are issues."

Researcher 1 is storing all of their files in one directory, the names of their files are not descriptive, and there are no README files. It would not be easy for a second person to continue with their work.

Good data management means:

  • Create a logical directory structure
  • Create README files liberally (see the example after this list)
  • Don't worry about writing polished text inside the README files (be practical)
  • It is perfectly fine to create a README file in every directory, although for most projects this is unnecessary
  • Think of these files as notes to your future self
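As a sketch, a README can be created directly from the shell; the contents below are purely illustrative notes for the hypothetical data directory shown above:

$ cat > data/effect_of_length/README << 'EOF'
Runs varying the system length (10, 20, 30) with all other parameters fixed.
Produced by code/main.py; see length.log for the exact settings used.
EOF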

There is much more to data management than using a proper directory structure and README files. See Princeton Research Data Services for more.

 

FAQ