
Queuing system on Cyrus1 and Quantum2

How it works for users

Users should edit their shell scripts to add special directives for the queue system. These directives begin with "#PBS" and can request resources, declare a required walltime, and redirect standard output and error. For example:

#PBS -l walltime=14:30:00

#PBS -l nodes=1:ppn=4
     OR
#PBS -l nodes=1:ppn=8
     OR
#PBS -l nodes=n024:ppn=8
     OR
#PBS -l nodes=1:ppn=8+nodes=1:ppn=4
     OR
#PBS -l nodes=n024:ppn=8+nodes=1:ppn=8

The same resource requests, along with redirection of standard output and error, can also be given as options to qsub on the command line:

[bashprompt]$ qsub -l walltime=5:00:00 -l nodes=2:ppn=8 -e test_stderr.txt ./test_simulation.sh
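
For reference, here is a minimal sketch of a complete job script; the executable, input, and output file names are placeholders to be adapted to your own simulation:

#!/bin/bash
#PBS -l walltime=14:30:00
#PBS -l nodes=1:ppn=8
#PBS -o test_stdout.txt
#PBS -e test_stderr.txt

# Torque starts the job in your home directory; change to the directory
# the job was submitted from.
cd $PBS_O_WORKDIR

# Run the simulation (placeholder executable and input file).
./my_simulation input.conf > output.log

Submit it with "qsub ./test_simulation.sh" and check its status with "qstat".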

Some notes and suggestions for users:

  1. Users should request the shortest walltime that will safely cover their jobs: when no walltime is specified, the queue system assumes the queue maximum and must "block out" the entire 24-hour period, which delays scheduling. This is analogous to a customer arriving at a busy barbershop and explaining that he only needs a "very quick trim."
  2. File input and output do not appear to be as immediate as when running directly from the command line, so users should not count on immediate access to program output.
  3. Users should test parallel scaling before expanding beyond one node: for systems of roughly 10 katoms, poor scaling has been observed beyond 8 ppn, while the 92 katom ApoA1 benchmark case scales well to 2 nodes.

Technical Approach used by Torque

The PBS queue system allocates a set of nodes and processors to an individual job, either for the walltime specified in the job or for the maximum walltime of the queue. It then provides a set of environment variables to the shell in which the script runs, such as PBS_NODEFILE, which holds the path of a temporary node file listing the CPUs allocated to the job.
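
As an illustration, a job script can read PBS_NODEFILE to find out how many CPU slots it was given and pass them to an MPI launcher. The mpirun options below are only a sketch and assume an MPI build whose launcher accepts -np and -machinefile; check the flags of the MPI installation on the cluster, and note that the executable name is a placeholder.

#!/bin/bash
#PBS -l walltime=5:00:00
#PBS -l nodes=2:ppn=8

cd $PBS_O_WORKDIR

# PBS_NODEFILE lists one line per allocated CPU slot.
NPROCS=$(wc -l < $PBS_NODEFILE)
echo "Running on $NPROCS processors:"
cat $PBS_NODEFILE

# Launch across the allocated nodes (flags depend on the local MPI build).
mpirun -np $NPROCS -machinefile $PBS_NODEFILE ./my_parallel_simulation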

Table of Queues

Queue settings on Cyrus1 and Quantum2

                       debug     short     long
  max walltime         20 min    24 hr     6 days
  max nodes per job    1         2         1
  priority             100       80        60

Queue settings on Darius

                       debug     short     long
  max walltime         20 min    24 hr     12 days
  max nodes per job    1         4         8
  priority             100       80        60
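
One way to target a specific queue is to name it with the -q option of qsub; the script name below is a placeholder, and qstat -q summarizes the queues and their current job counts:

[bashprompt]$ qsub -q debug -l walltime=0:15:00 -l nodes=1:ppn=8 ./test_simulation.sh
[bashprompt]$ qstat -q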

Old Policies

In order to efficiently use our computational resources, we ask all group members to follow the guidelines below when planning and running simulations:

  • Please run jobs on the computational nodes ("slave nodes"), rather than the head node, of each cluster. In the past, head node crashes, a great inconvenience to everyone, have occurred when all eight of its processors were engaged in computationally intensive work.
  • Please do not run jobs "on top of" those of other users. If a node is fully occupied and you need to run something, please contact the other user, rather than simply starting your job there.
  • For the fastest disk read/write speeds, write to the local /scratch/username/ directory on each node rather than to your home directory (see the sketch after this list). Home directories, which are accessible from every node, are physically located on the head node, so reading from and writing to them may be limited by network transmission rates.
  • The new cluster, with its fast interconnection hardware, is very well suited for large-scale simulations which benefit from the use of many processors. A queuing system will be used to manage jobs on this cluster, and no jobs should run outside this queue system.
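
As a sketch of the scratch-directory pattern above (the directory layout, executable, and file names are placeholders, and $PBS_JOBID is only set when the script runs under the queue system):

#!/bin/bash
#PBS -l walltime=24:00:00
#PBS -l nodes=1:ppn=8

# Stage input to node-local scratch for fast disk I/O.
SCRATCH=/scratch/$USER/$PBS_JOBID
mkdir -p $SCRATCH
cp $PBS_O_WORKDIR/input.conf $SCRATCH/
cd $SCRATCH

./my_simulation input.conf > output.log

# Copy results back to the network-mounted home directory and clean up.
cp output.log $PBS_O_WORKDIR/
rm -rf $SCRATCH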

Please note that we attempted to implement the OpenPBS queue system on Cyrus1 and Quantum2 in December 2009; it appeared to work in testing but did not perform as desired when multiple jobs were submitted. Use of the queuing system on those clusters has been suspended until further notice.

Old Allocation

Effective Monday, April 12, at noon, we will move to a fixed allocation of nodes to users. To promote efficient usage, some nodes are shared between two users. We hope that sharing with a single other user will be easy to coordinate; please try to share equitably.

These allocations in no way preclude flexibility: users should simply e-mail the owner and ask permission to use idle nodes.

Cyrus1

  node    user(s)
  n001    Erik
  n002    Erik
  n003    Erik
  n004    Erik
  n005    Erik
  n006    Erik/Jie
  n007    Jie
  n008    Jie
  n009    Jie
  n010    down
  n011    Jie
  n012    Manas
  n013    Manas
  n014    Manas
  n015    Jie until Apr 22, Diwakar thereafter
  n016    Manas
  n017    Manas/Fa
  n018    Fa/Diwakar
  n019    Fa
  n020    Fa
  n021    Fa
  n022    Fa
  n023    Nicholas/Li Xi
  n024    Nicholas

Quantum2

  node    user(s)
  n001    Neeraj
  n002    Neeraj
  n003    Neeraj
  n004    down
  n005    Neeraj
  n006    Neeraj/Geoff
  n007    Geoff
  n008    Geoff
  n009    Geoff
  n010    Geoff
  n011    Diwakar
  n012    Diwakar
  n013    Diwakar
  n014    down
  n015    Diwakar/Taosong
  n016    Tao
  n017    Tao
  n018    Tao
  n019    Tao

Daedalus1
all available nodes are public

Cyrus2
all available nodes are public
