Queuing system on Cyrus1 and Quantum2
How it works for users
Users should edit their shell scripts to add special directives for the queue system. These directives begin with "#PBS" and request resources, declare a required walltime, and direct standard output and error.
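A minimal job script might look like the following sketch; the job name, queue, resource request, and executable shown here are illustrative, not group defaults.

    #!/bin/bash
    #PBS -N md_test                # job name (illustrative)
    #PBS -q short                  # queue to submit to
    #PBS -l nodes=1:ppn=8          # request one node with eight processors
    #PBS -l walltime=04:00:00      # declare the walltime actually needed
    #PBS -o md_test.out            # file for standard output
    #PBS -e md_test.err            # file for standard error

    cd $PBS_O_WORKDIR              # return to the directory the job was submitted from
    ./run_simulation               # placeholder for the actual executable

The script is then submitted to the queue with qsub (for example, qsub md_test.sh) rather than executed directly.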
Some notes and suggestions for users:
- Users should request the lowest walltime that will safely cover their jobs; when no walltime is specified, the queue system must "block out" the entire 24-hour maximum. This is analogous to a customer arriving at a busy barbershop and explaining that he only needs a "very quick trim."
- Input and output to files do not seem to be as immediate as when running directly from the command line. Users should not count on immediate access to program output.
- Users should test system scaling before expanding beyond one node; for systems of 10 katoms, poor scaling has been observed beyond 8 ppn, while the 92 katom ApoA1 benchmark case scales well to 2 nodes.
Technical Approach used by Torque
The PBS queue system allocates a set of nodes and processors to an individual job, either for the walltime specified in the job or, if none is given, for the maximum walltime of the queue. It then provides a set of environment variables to the shell in which the script runs, such as PBS_NODEFILE, which holds the path of a temporary node file listing the allocated CPUs.
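As a rough sketch of how a job script might use this file (the mpirun options shown are common but depend on the local MPI installation, and run_simulation is a placeholder):

    # Count the CPUs allocated by the queue and pass the node list to MPI.
    NPROCS=$(wc -l < $PBS_NODEFILE)
    mpirun -np $NPROCS -machinefile $PBS_NODEFILE ./run_simulation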
Table of Queues
Queue settings on Cyrus1 and Quantum2
|  | debug | short | long |
|---|---|---|---|
| max walltime | 20 min | 24 hr | 6 days |
| max nodes per job | 1 | 2 | 1 |
| priority | 100 | 80 | 60 |
Queue settings on Darius
|  | debug | short | long |
|---|---|---|---|
| max walltime | 20 min | 24 hr | 12 days |
| max nodes per job | 1 | 4 | 8 |
| priority | 100 | 80 | 60 |
Old Policies
In order to efficiently use our computational resources, we ask all group members to follow the guidelines below when planning and running simulations:
- Please run jobs on the computational nodes ("slave nodes"), rather than the head node, of each cluster. In the past, head node crashes (a great inconvenience to everyone) have occurred when all eight of its processors were engaged in computationally intensive work.
- Please do not run jobs "on top of" those of other users. If a node is fully occupied and you need to run something, please contact the other user, rather than simply starting your job there.
- For the fastest disk read/write speeds, write to the local /scratch/username/ directory on each node rather than to your home directory (see the sketch after this list). Home directories, which are accessible from every node, are physically located on the head node, so reading and writing to them may be limited by network transmission rates.
- The new cluster, with its fast interconnection hardware, is very well suited for large-scale simulations that benefit from the use of many processors. A queuing system will be used to manage jobs on this cluster, and no jobs should run outside this queue system.
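As a sketch of the /scratch workflow mentioned above (the directory layout, file names, and executable are illustrative):

    # Stage inputs to node-local scratch, run there, then copy results back.
    RUNDIR=$HOME/project/run1            # directory in the home file system (illustrative)
    SCRATCH=/scratch/$USER/run1          # node-local scratch directory (illustrative)
    mkdir -p $SCRATCH
    cp $RUNDIR/input.* $SCRATCH/
    cd $SCRATCH
    ./run_simulation > sim.log           # placeholder for the actual executable
    cp sim.log output.* $RUNDIR/         # copy results back to the home directory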
Please note that we attempted to implement the OpenPBS queue system on Cyrus1 and Quantum2 in December 2009; these systems appeared to be working in testing, but did not perform as desired when multiple jobs were submitted. The use of these queuing systems on those clusters has been suspended until further notice.
Old Allocation
Effective Monday, April 12, at noon, we will move to a fixed allocation of nodes to users. To promote efficient usage, some nodes are shared between two users. We hope that sharing with a single other user will be easy to coordinate; please try to share equitably.
These allocations in no way preclude flexibility: users should simply e-mail the owner and ask permission to use idle nodes.
Cyrus1
| node | user(s) |
|---|---|
| n001 | Erik |
| n002 | Erik |
| n003 | Erik |
| n004 | Erik |
| n005 | Erik |
| n006 | Erik/Jie |
| n007 | Jie |
| n008 | Jie |
| n009 | Jie |
| n010 | down |
| n011 | Jie |
| n012 | Manas |
| n013 | Manas |
| n014 | Manas |
| n015 | Jie until Apr 22, Diwakar thereafter |
| n016 | Manas |
| n017 | Manas/Fa |
| n018 | Fa/Diwakar |
| n019 | Fa |
| n020 | Fa |
| n021 | Fa |
| n022 | Fa |
| n023 | Nicholas/Li Xi |
| n024 | Nicholas |
Quantum2
| node | user(s) |
|---|---|
| n001 | Neeraj |
| n002 | Neeraj |
| n003 | Neeraj |
| n004 | down |
| n005 | Neeraj |
| n006 | Neeraj/Geoff |
| n007 | Geoff |
| n008 | Geoff |
| n009 | Geoff |
| n010 | Geoff |
| n011 | Diwakar |
| n012 | Diwakar |
| n013 | Diwakar |
| n014 | down |
| n015 | Diwakar/Taosong |
| n016 | Tao |
| n017 | Tao |
| n018 | Tao |
| n019 | Tao |
Daedalus1
all available nodes are public
Cyrus2
all available nodes are public