Queuing system on Cyrus1 and Quantum2
How it works for users
Users should edit their shell scripts to add special directives for the queue system. These directives begin with "#PBS" and request resources, declare a required walltime, and direct standard output and error.
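A minimal job script might look like the following sketch; the job name, queue, resource request, and executable shown here are illustrative, not group defaults.

    #!/bin/bash
    #PBS -N md_test                # job name (illustrative)
    #PBS -q short                  # queue to submit to
    #PBS -l nodes=1:ppn=8          # request one node with eight processors
    #PBS -l walltime=04:00:00      # declare the walltime actually needed
    #PBS -o md_test.out            # file for standard output
    #PBS -e md_test.err            # file for standard error

    cd $PBS_O_WORKDIR              # return to the directory the job was submitted from
    ./run_simulation               # placeholder for the actual executable

The script is then submitted to the queue with qsub (for example, qsub md_test.sh) rather than executed directly.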
Some notes and suggestions for users:
- Users should request the lowest walltime that will safely cover their jobs; when no walltime is specified, the queue system must "block out" the entire 24-hour maximum. This is analogous to a customer arriving at a busy barbershop and explaining that he only needs a "very quick trim."
- Input and output to files do not seem to be as immediate as when running directly from the command line. Users should not count on immediate access to program output.
- Users should test system scaling before expanding beyond one node; for systems of 10 katoms, poor scaling has been observed beyond 8 ppn, while the 92 katom ApoA1 benchmark case scales well to 2 nodes.
Technical Approach used by Torque
The PBS queue system allocates a set of nodes and processors to an individual job, either for the walltime specified in the job or, if none is given, for the maximum walltime of the queue. It then provides a set of environment variables to the shell in which the script runs, such as PBS_NODEFILE, which holds the path of a temporary node file listing the allocated CPUs.
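As a rough sketch of how a job script might use this file (the mpirun options shown are common but depend on the local MPI installation, and run_simulation is a placeholder):

    # Count the CPUs allocated by the queue and pass the node list to MPI.
    NPROCS=$(wc -l < $PBS_NODEFILE)
    mpirun -np $NPROCS -machinefile $PBS_NODEFILE ./run_simulation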
Table of Queues
Queue settings on Cyrus1 and Quantum2
|  | debug | short | long |
|---|---|---|---|
| max walltime | 20 min | 24 hr | 6 days |
| max nodes per job | 1 | 2 | 1 |
| priority | 100 | 80 | 60 |
Queue settings on Darius
|  | debug | short | long |
|---|---|---|---|
| max walltime | 20 min | 24 hr | 12 days |
| max nodes per job | 1 | 4 | 8 |
| priority | 100 | 80 | 60 |
Old Policies
In order to efficiently use our computational resources, we ask all group members to follow the guidelines below when planning and running simulations:
- Please run jobs on the computational nodes ("slave nodes"), rather than the head node, of each cluster. In the past, head node crashes (a great inconvenience to everyone) have occurred when all eight of its processors were engaged in computationally intensive work.
- Please do not run jobs "on top of" those of other users. If a node is fully occupied and you need to run something, please contact the other user, rather than simply starting your job there.
- For the fastest disk read/write speeds, write to the local /scratch/username/ directory on each node rather than to your home directory (see the sketch after this list). Home directories, which are accessible from every node, are physically located on the head node, so reading and writing to them may be limited by network transmission rates.
- The new cluster, with its fast interconnection hardware, is very well suited for large-scale simulations that benefit from the use of many processors. A queuing system will be used to manage jobs on this cluster, and no jobs should run outside this queue system.
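As a sketch of the /scratch workflow mentioned above (the directory layout, file names, and executable are illustrative):

    # Stage inputs to node-local scratch, run there, then copy results back.
    RUNDIR=$HOME/project/run1            # directory in the home file system (illustrative)
    SCRATCH=/scratch/$USER/run1          # node-local scratch directory (illustrative)
    mkdir -p $SCRATCH
    cp $RUNDIR/input.* $SCRATCH/
    cd $SCRATCH
    ./run_simulation > sim.log           # placeholder for the actual executable
    cp sim.log output.* $RUNDIR/         # copy results back to the home directory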
Please note that we attempted to implement the OpenPBS queue system on Cyrus1 and Quantum2 in December 2009; these systems appeared to be working in testing, but did not perform as desired when multiple jobs were submitted. The use of these queuing systems on those clusters has been suspended until further notice.
Old Allocation
Effective Monday, April 12, at noon, we will move to a fixed allocation of nodes to users. To promote efficient usage, some nodes are shared between two users. We hope that sharing with a single other user will be easy to coordinate; please try to share equitably.
These allocations in no way preclude flexibility: users should simply e-mail the owner and ask permission to use idle nodes.
Cyrus1
| node | user(s) |
|---|---|
| n001 | Erik |
| n002 | Erik |
| n003 | Erik |
| n004 | Erik |
| n005 | Erik |
| n006 | Erik/Jie |
| n007 | Jie |
| n008 | Jie |
| n009 | Jie |
| n010 | down |
| n011 | Jie |
| n012 | Manas |
| n013 | Manas |
| n014 | Manas |
| n015 | Jie until Apr 22, Diwakar thereafter |
| n016 | Manas |
| n017 | Manas/Fa |
| n018 | Fa/Diwakar |
| n019 | Fa |
| n020 | Fa |
| n021 | Fa |
| n022 | Fa |
| n023 | Nicholas/Li Xi |
| n024 | Nicholas |
Quantum2
| node | user(s) |
|---|---|
| n001 | Neeraj |
| n002 | Neeraj |
| n003 | Neeraj |
| n004 | down |
| n005 | Neeraj |
| n006 | Neeraj/Geoff |
| n007 | Geoff |
| n008 | Geoff |
| n009 | Geoff |
| n010 | Geoff |
| n011 | Diwakar |
| n012 | Diwakar |
| n013 | Diwakar |
| n014 | down |
| n015 | Diwakar/Taosong |
| n016 | Tao |
| n017 | Tao |
| n018 | Tao |
| n019 | Tao |
Daedalus1
all available nodes are public
Cyrus2
all available nodes are public