Introduction
There are fundamentally two services that can be provided by a large cluster such as this:
This service is optimised as a Capacity Computing resource.
Overview of the scheduling system
Jobs on the cluster are under the control of a scheduling system to optimise the throughput of the service. This means that there are some classes of work, for example work that requires repeated access to specified nodes, that are not appropriate for this service. IT Services may be able to offer alternative facilities for this type of work; contact the IT Service Desk in the first instance to discuss any such requirements. The scheduling system is configured to offer an equitable distribution of resources over time to all users. The key means by which this is achieved are:
jobs are scheduled according to the queue, if any, and the resources that are requested. See the Job Resources section of this help page for more details.
jobs are not necessarily run in the order in which they are submitted. The scheduling system runs a fair-share policy, whereby users associated with accounts (projects) that have not used the system for some time will be given a higher priority than accounts that have recently made heavy use of the system. This means that it is not possible for any one account to block the system just by submitting multiple jobs.
jobs requiring a large number of cores will have to queue until these become available. The system will run smaller jobs in the meantime, a mechanism known as backfill, and hence it is beneficial to specify a realistic wall time for a job.
when short jobs are held in the job queue alongside long jobs the priority of these short jobs increases relative to the long jobs over time. This, along with the backfill mechanism described above, means that short jobs cannot be blocked by a large number of long jobs waiting to run.
Interactive work requiring significant resources, especially Graphical User Interfaces for commercial applications, should not be carried out on the logon nodes but run on the worker nodes under the control of the scheduler by running interactive jobs.
There are limits on the resources available to individual jobs and to the overall resources in use at any time by a single user. See the Resource Limits section of this page for details.
The command to submit a job is qsub. This reads its input from a file or from its standard input, the latter usually using the echo command. The job is submitted to the scheduling system, using the requested resources, and will run on the first available node(s) that are able to provide the resources requested. For example, to submit a job to return the hostname of the node on which the job is run using the default set of resources, use the command
echo 'echo $HOSTNAME' | qsub
To submit the set of commands contained in the file myscript, use the command
qsub myscript
The batch system will return a job number, for example
35929.qmaster.cvos.cluster
When this job completes there will be two files in the home directory (the one that you are in when logging on) named STDIN.e35929 and STDIN.o35929. STDIN.e35929 contains any error output and also the output from any module commands. STDIN.o35929 contains the job output, which for the first example is the name of the node that the job ran on, for example u2n127, possibly a system message relating to the processes used in the job and a list of the resources used in the job.
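The relationship between the job number returned by qsub and the names of the output files can be sketched as below. The helper names job_number and job_output_files are hypothetical, introduced only for illustration; the job name STDIN and the identifier are taken from the example above.

```shell
# Strip the server suffix from a job identifier such as
# 35929.qmaster.cvos.cluster, leaving only the numeric part.
job_number() {
    printf '%s\n' "${1%%.*}"
}

# Print the error and output file names for a job.
# $1 is the job name (STDIN when the script is read from standard input),
# $2 is the full job identifier returned by qsub.
job_output_files() {
    number=$(job_number "$2")
    printf '%s.e%s\n%s.o%s\n' "$1" "$number" "$1" "$number"
}

# Prints STDIN.e35929 and STDIN.o35929, one per line
job_output_files STDIN 35929.qmaster.cvos.cluster
```

In a submission script the identifier could be captured with something like jobid=$(qsub myscript) and passed to these helpers.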
There is a one-second delay before the qsub command is executed. Experience has shown that scripts which submit many - hundreds or more - jobs can cause problems if there is no delay between consecutive qsub commands.
To kill a queued or running job use the qdel command. For example, to kill the previous job:
qdel 35929
A job will start in the home directory (the one that you are in when logging on) by default, not the directory that the job was submitted from. For example, the job
echo 'ls' | qsub
will give a listing of the home directory, regardless of the directory that the job was submitted from. This might not be the desired action, for example if an input file is specified in the job. To ensure that the job runs from the submitted directory, add -d $PWD to the qsub command. To submit a job to list the files in the directory that the job was submitted from:
echo 'ls' | qsub -d $PWD
(-d is the option to specify the initial directory for a job and $PWD always refers to the current directory). Similarly,
qsub -d$PWD myscript
will run the script file myscript that is in the current directory.
The output and error files described in the previous section can be merged into a single output file by the -j oe option. The following command runs the script file myscript that is in the current directory and produces a single output file:
qsub -d$PWD -j oe myscript
There are many other options; use the command
man qsub
for a fuller description of these options.
Jobs can be run interactively as well as in the batch system; for example when running an interactive graphical environment for one of the commercial applications. These should not be run on the logon nodes but on one of the worker nodes. To gain access to a worker node use the command
qsub -I
where the I in -I is an upper-case letter i (short for Interactive). This submits a job to give interactive access to a worker node and so will return a job number similar to:
qsub: waiting for job 4377688.qmaster.cvos.cluster to start
The job will then queue like any other job, and when it runs it will give output similar to:
qsub: job 4377688.qmaster.cvos.cluster ready
+--------------------------------------------------------------------------+
| Job starting at 2011-07-19 14:41:35 for hattonps on the BlueBEAR Cluster
| Job identity jobid 4377688 jobname STDIN
| Job requests nodes=1:ppn=1,pmem=1996mb,walltime=02:00:00
| Job assigned to node u4n061
+--------------------------------------------------------------------------+
Please modify .bashrc to tailor your needs
[username@u4n061 ~]$
where the node that is being used is shown in the prompt.
This will transfer your session to a worker node, on which you can use any of the commands that are available on the logon nodes. If the -X option is also specified graphical traffic is returned to your desktop from this node as long as Exceed and putty have been configured as recommended. As an example, the following set of commands will transfer a session to a worker node and run the Abaqus CAE graphical environment:
qsub -X -I
(wait for worker node, as described above)
module load apps/abaqus
abaqus cae
Such jobs are subject to the usual default job resources, described with examples in a following section. In particular, a longer walltime may be specified if required.
To use all 4 cores interactively on a single node, for example when developing parallel codes:
qsub -I -l 'nodes=1:ppn=4'
where the first I is an upper-case i and the second l is a lower-case L.
Short test jobs can be run in the test queue by adding
-q bbtest
to the qsub command; for example
echo 'ls' | qsub -q bbtest
Since this queue is intended for short test jobs it allows jobs from different users to run on a node, and for more than one job to be running on a core. This means that the elapsed (wall) time may vary between runs, since a job can be affected by other jobs on the same node, but the overall throughput is higher than if the same jobs were run in the default queue. Note that
-q bbtest
has to be specified, along with any other options, to run in this queue. Jobs that do not specify the queue name but ask for resources that are low enough to run in this queue will be scheduled in the standard job queue; this means that small jobs can run in the standard queue and hence be isolated from other users' jobs.
Job scripts can contain batch system options as well as Unix commands. For example, consider a file myscript containing the following:
#!/bin/bash
#PBS -l walltime=4:0:0
#PBS -j oe
cd "$PBS_O_WORKDIR"
module load apps/intel/2011.0.084
icc -o mybin mysource.c
./mybin
The first line runs the job in the GNU Bourne Again Shell (the same shell as is used on the logon node) and the next two lines set job options, which could alternatively have been supplied as options on the qsub command. The second line sets a resource limit (with the -l option, lowercase el) of 4 hours of wall (elapsed) time. The third line requests that command errors are merged with the standard output in a single file. The fourth line changes the current directory in a job from the home directory to the one current when the job was submitted. The next line makes the Intel C compiler available to the job, which is then used to compile a C program and produce a binary file called mybin. The last line runs that binary file. This whole script is submitted with the command
qsub myscript
The qstat command gives the status of the jobs submitted by the logged-on user.
The output from qstat consists of one row for each job with several columns, including the job Status which will typically be Queued if the job is queuing or Active if the job is Running. The Elapsed Time is also shown. If no job is found, no output is returned - in particular, using qstat with a jobid that has completed will not return any output. There are many options to qstat; use the command
man qstat
for more details. The -f option gives a full output of the job status; to get the full status of job 35929 use the command
qstat -f 35929
The output from qstat -f returns information about the resources used by a job. To show the cpu time, elapsed time, real and virtual memory used by the job 35929 use the command
qstat -f 35929|grep resources
To just show the cpu and elapsed (wall) time, which can be useful to see how efficiently a parallel job is running:
qstat -f 35929|grep resources|grep -v mem
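As a rough sketch of how the cpu and wall times can indicate parallel efficiency, the helper below divides the cpu time by the wall time multiplied by the number of cores. It assumes that both times are reported in HH:MM:SS form; the function names hms_to_seconds and efficiency_percent are hypothetical and are not part of the batch system.

```shell
# Convert a time in HH:MM:SS form to seconds.
hms_to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Percentage of the requested cores kept busy:
#   100 * cpu time / (wall time * number of cores)
# $1 = cpu time, $2 = wall time, $3 = number of cores requested
efficiency_percent() {
    cput=$(hms_to_seconds "$1")
    wall=$(hms_to_seconds "$2")
    awk -v c="$cput" -v w="$wall" -v n="$3" 'BEGIN {
        if (w * n == 0) { print "n/a" } else { printf "%.0f\n", 100 * c / (w * n) }
    }'
}

# 3 hours of cpu time in 1 hour of wall time on 4 cores: prints 75
efficiency_percent 03:00:00 01:00:00 4
```

A value well below 100 for a parallel job suggests that some of the requested cores are idle for much of the run.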
The checkjob command can be used to view the detailed status of a job; the command is followed by the job id:
checkjob 35929
This command shows all job attributes and state information and also provides an analysis of whether or not the job can run. If the job is unable to run, this command will provide a breakdown of the reasons why. The section on associating a job with a project describes how to check if a job is being queued because of an invalid project code being specified.
The showstart command provides an estimate of job start time.
showstart 35929
The output from showstart is based on the load on the system, including the specified walltime for running and queued jobs, at the time that the command is issued. The estimated start time may be inaccurate due to jobs not requiring their stated wall time, which would lead to the job starting sooner than indicated, or higher-priority jobs being submitted and run, which would lead to the job starting later than indicated. In most cases the start time from showstart is pessimistic.
It can be useful to monitor the progress of the job as it runs by looking at the standard error and output files. By default these are written to local disk on the compute nodes and only copied to the user filestore at the end of the job. To monitor the output of the job myjob.sh as it runs there are 3 options, all of which have limitations:
The -k option to qsub puts all the output of your job into a fixed place in your home directory, named in the default way only (name.oNNNNN). This is a little inflexible, since only the home directory can be used.
Adding a line such as the following near the start of the job script
exec > /home/my/own/output 2>&1
redirects the output from the job script commands only, not the system messages at the start and end of the job, which go to the normal file.
Redirecting the output of an individual command, for example
mycalcprog > /home/my/own/output 2>&1
In this case it is only the output of the executable that is redirected.
With the second and third method care is needed to name the output uniquely, and ideally the name would include the job number.
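One way to include the job number in the output name is sketched below. It relies on the PBS_JOBID environment variable that the batch system sets within a job; the helper name unique_log_name, the .log suffix and the fallback to the process id outside a job are assumptions made for this illustration.

```shell
# Build an output file name that embeds the numeric part of the job id.
# PBS_JOBID is set by the batch system inside a job; outside a job the
# process id is used instead (an assumption, to keep the sketch runnable).
unique_log_name() {
    id=${PBS_JOBID:-$$}
    printf '%s.%s.log\n' "$1" "${id%%.*}"
}

# In a job script, all later output could then be redirected with:
#   exec > "$(unique_log_name /home/my/own/output)" 2>&1
unique_log_name /home/my/own/output
```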
There are 4 classes of nodes in the BlueBEAR cluster that are available for general use:
Jobs are scheduled onto these nodes in the following order:
There are limits on the resources available to individual jobs and to the overall resources in use at any time by a single user. See the Resource Limits section of this page for details.
The following resources can be specified:
There is also an overall limit on the total number of cores that any one user may be occupying at any time, to prevent any one user having a disproportionate impact on the service. Queued jobs do not have their requested resources counted against this limit. The limit will be at least 128 and IT Services will review this periodically to ensure reasonable usage of the cluster whilst maintaining a responsive service for all users. The current value is in the table of Resource Limits in this help page.
Resources are specified using the -l (lower-case el) option to qsub. For example, to submit a script that will require 6 hours walltime on a single core:
qsub -l 'walltime=6:0:0' myscript
To request 6 hours wall time, 4 nodes and 4 cores per node with the job constrained to only use nodes in silo 2:
qsub -l 'walltime=6:0:0,nodes=4:ppn=4,feature=U2' myscript
Note that in the example above a colon is used to separate the nodes and ppn values.
To request 4 hours wall time, 4 nodes and 2 cores per node and an anticipated 3 GBytes of memory per process:
qsub -l 'walltime=4:0:0,nodes=4:ppn=2,pmem=3gb' myscript
This job requests a total of 6 GB of memory (2 cores with 3 GB per process) per node. Since this would leave just under 2 GB of memory that has not been requested on a standard node (8 GB, less the 6 GB requested in this job less a small system overhead) only jobs requesting less than 2 GB of memory submitted by the same user would be scheduled to run on the same node.
To submit a job to use 4 cores and 16 GB of memory on one of the 16 GB 4 core nodes:
qsub -l 'walltime=2:0:0,nodes=1:ppn=4,pmem=3gb' myscript
In the above example, the total requested memory per node is 12 GB (4 cores and 3 GB per process), so the job will not run on one of the 8 GB nodes. Since pmem is advisory, not a hard limit, the job will be able to use all of the available memory - just under 16 GB - on the node.
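The per-node memory arithmetic used in these examples can be sketched as a small check to run before submitting. The thresholds of 7 GB, 15 GB and 30 GB are the usable memory figures quoted in the notes to the Resource Limits table on this page; the function name node_class_for is hypothetical, and the sketch is simplified to whole numbers of GB and ignores which queue the job is submitted to.

```shell
# Report the memory requested per node (ppn * pmem) and the smallest
# class of node that could hold it.
# $1 = cores per node (ppn), $2 = memory per process in whole GB (pmem)
node_class_for() {
    ppn=$1
    pmem_gb=$2
    total=$((ppn * pmem_gb))
    if [ "$ppn" -le 4 ] && [ "$total" -le 7 ]; then
        class="8 GB node"
    elif [ "$ppn" -le 4 ] && [ "$total" -le 15 ]; then
        class="16 GB node"
    elif [ "$ppn" -le 8 ] && [ "$total" -le 30 ]; then
        class="32 GB node (bbquad queue)"
    else
        class="no node class fits"
    fi
    printf '%s GB requested per node: %s\n' "$total" "$class"
}

node_class_for 2 3   # ppn=2,pmem=3gb: 6 GB, fits a standard 8 GB node
node_class_for 4 3   # ppn=4,pmem=3gb: 12 GB, needs a 16 GB node
```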
To submit a job to use all 8 cores and all of the memory on an 8 core node:
qsub -q bbquad myscript
To submit a short test job:
qsub -q bbtest myscript
To run an interactive job that will return graphical output to the desktop with a walltime of 12 hours:
qsub -IX -l 'walltime=12:0:0'
Requesting more resources than are needed will usually increase the queuing time for the job; in particular a job asking for multiple nodes and/or cores will have to queue until all of them are available. Requesting more resources than are available may give an error similar or identical to:
qsub: Job rejected by all possible destinations
This message is given if the walltime or memory request cannot be met, but requesting more cores per node (ppn) than are available does not give an error message; the job will queue but never run. This is unlikely to be resolved in the near future.
There are many other options that may be specified; some of the more commonly-used ones are as follows (in all cases, replace words in italics with a suitable value):
All of these can be specified in a script or on the qsub command. For example, if a script myjob contains
#!/bin/bash
#PBS -l "nodes=1:ppn=4,walltime=2:0:0,pmem=3gb"
#PBS -j oe
#PBS -N MathJob
cd "$PBS_O_WORKDIR"
module load apps/mathematica
math < math.in > math.out
either of the following commands will submit the job:
qsub myjob
or
echo "module load apps/mathematica;math < math.in > math.out" | qsub -l "nodes=1:ppn=4,walltime=2:0:0,pmem=3gb" -j oe -N MathJob -d $PWD
where the second command is all on one line. Note that in this case the job will be scheduled to run on one of the 4 core 16 GB nodes, since the memory requested per node is 12 GB.
If a job uses significant I/O (Input/Output) then files should be created using the local disk space and only written back to the final directory when the job is completed. This is particularly important if a job makes heavy use of disk for scratch or work files. Heavy I/O to GPFS filestore such as the personal filestore can cause poor interactive response on the cluster for all users.
When a job is submitted an environment variable TMPDIR is set on each of the nodes that the job is running on to a unique directory for that job on the local filestore. This environment variable is available to any prologue and epilogue scripts as well as all processes of the running job - that is, for an MPI job a directory with that name will be available on all of the nodes that the job uses. This directory is local to each node, not shared across all of the nodes associated with the job.
A typical sequence of commands to use this local disk space would be:
cd $TMPDIR
cp $PBS_O_WORKDIR/my_input .
mpiexec my_program
cp my_output $PBS_O_WORKDIR
which changes directory to the local temporary directory, copies the input file my_input from the job's working directory (note the final dot in this command to refer to the current directory), runs the parallel program my_program which creates an output file my_output in the program's working directory and copies the output back to the job's working directory so that it is available on the logon nodes. The temporary directory, and any sub-directories, are deleted when the job completes. If this is a single-core job the mpiexec command is not needed, so the commands are:
cd $TMPDIR
cp $PBS_O_WORKDIR/my_input .
my_program
cp my_output $PBS_O_WORKDIR
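Because the temporary directory is deleted when the job completes, it can be worth copying the output back even if the program fails. The following is a minimal sketch of that idea, not a recommended standard: the function name run_staged is hypothetical, and the fallback to /tmp when TMPDIR is not set is an assumption that only matters outside the batch system.

```shell
# Run a command in the node-local scratch directory and copy the output
# file back to the working directory, preserving the command's exit status.
# $1 = working directory, $2 = input file, $3 = output file, rest = command
run_staged() {
    workdir=$1
    input=$2
    output=$3
    shift 3
    scratch=${TMPDIR:-/tmp}      # TMPDIR is set by the batch system in a job
    (
        cd "$scratch" || exit 1
        cp "$workdir/$input" . || exit 1
        "$@"                     # run the program in the scratch directory
        status=$?
        if [ -f "$output" ]; then
            cp "$output" "$workdir"/
        fi
        exit "$status"
    )
}

# Possible use in a job script:
#   run_staged "$PBS_O_WORKDIR" my_input my_output mpiexec my_program
```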
Many applications can specify a directory for scratch or temporary files which should also be assigned to $TMPDIR. Where appropriate, this is documented in the help for the application that is linked from the list of applications on this service.
There is a limited amount of space, typically 100 GB, available locally on the nodes. If this proves inadequate, especially for application workspace, please open a call with the IT Service Desk so that we can advise on the best way of running such jobs whilst minimising the impact on the overall service.
Associating a job with a project
From 22 March 2010 every job has to be associated with a project to ensure the equitable distribution of the limited resources on the cluster. Project owners and members will have been issued a project code for each registered project which may have to be quoted when a job is submitted, and only usernames authorised by the project owner will be able to run jobs using that project code.
In most cases a username on the cluster is only associated with one project, and any jobs submitted by that user will be associated with their sole project and no project code needs to be specified in the job script. If a user is associated with more than one project, their default project is the first one that was registered with that username associated with it.
To see the projects that are associated with the logged-on user issue the command
mdiag -u $USER|grep ADEF
which will give a result similar to
ADEF=defproj01 ALIST=defproj01,otherproj02
where ADEF is the default project and ALIST lists all of the projects that the user is authorised to use.
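If the project codes are needed in a script, the ADEF and ALIST fields can be picked out of that line. The sketch below assumes the fields are separated by spaces as in the sample output above; the function names default_project and project_list are hypothetical.

```shell
# Print the default project from a line containing ADEF=...
default_project() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^ADEF=//p'
}

# Print every authorised project from ALIST=..., one per line.
project_list() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^ALIST=//p' | tr ',' '\n'
}

line='ADEF=defproj01 ALIST=defproj01,otherproj02'
default_project "$line"   # prints defproj01
project_list "$line"      # prints defproj01 and otherproj02, one per line
```

On the cluster the line itself could be obtained with line=$(mdiag -u $USER|grep ADEF).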
If a user is registered on more than one project a specific project code is specified by the -A option followed by the project code. For example the following command would submit a job to run under account project01:
qsub -j oe -A project01 myscript
Similarly, if a job script myjob contains:
#!/bin/bash
#PBS -j oe
#PBS -A project01
cd "$PBS_O_WORKDIR"
myscript
and is then submitted using the command
qsub myjob
it will run under the account project01. If the -A option is specified then a project code should also be specified; using -A without a following project code can have an undefined action, which will depend on any subsequent arguments, if any, on the qsub command or in the job script.
If a job is submitted using an invalid project, either because the project does not exist or the username is not authorised to use that project, no warning will be given when the job is submitted but it will just queue and not run. To see the project associated with a job, and any associated messages, use the command
checkjob -v jobname|grep account
where jobname is the job name as shown by the qstat command, for example 2372790.qmaster. A typical output if an invalid project was specified will be
Message[0] job requested invalid account 'nonsuch01'
and if a project was specified that is not authorised for the user:
Message[0] job not authorized to access requested account 'notmine01'
Certain environment variables (see the table below) are available within the Torque/PBS environment and may be used or tested in a job script. Additional variables can be exported using the -v option of qsub.
| Env Variable | Value | For example |
|---|---|---|
| PBS_O_HOME | HOME value at qsub time | /bb/ccc/hattonps |
| PBS_O_HOST | Hostname at qsub time | bluebear3.cvos.cluster |
| PBS_O_INITDIR | set to the value of the qsub -d option directory | (undefined by default) |
| PBS_O_LOGNAME | LOGNAME value at qsub time | hattonps |
| PBS_O_MAIL | MAIL value at qsub time | /var/spool/mail/hattonps |
| PBS_O_PATH | PATH value at qsub time | /usr/lib64/qt-3.3/bin:/usr/....... |
| PBS_O_SHELL | SHELL value at qsub time | /bin/bash |
| PBS_O_QUEUE | the name of queue to which the job was submitted | default |
| PBS_O_WORKDIR | the absolute current directory at qsub time | /bb/ccc/hattonps/ |
| PBS_ENVIRONMENT | set to PBS_BATCH or to PBS_INTERACTIVE | PBS_BATCH |
| PBS_JOBCOOKIE | a hexadecimal string unique to the job | (unpredictable) |
| PBS_JOBID | the job identifier assigned by the batch system | 515122.qmaster.cvos.cluster |
| PBS_JOBNAME | the job name defaulted or given by the user (-N) | mytest |
| PBS_NODEFILE | the filename of nodelist file (main node script only) | /cvos/local/config/torque/aux//515122.qmaster.cvos.cluster |
| PBS_NODENUM | 0 for main or only node, 1:n-1 for others | 0 |
| PBS_VNODENUM | Under pbsdsh: 0 to c-1 where c is the number of cores | 0 |
| PBS_TASKNUM | Under pbsdsh: a unique task-instance counter | 2 |
| PBS_QUEUE | name of the queue from which the job is executed | bball |
| TMPDIR | unique directory local to the node | /tmp/515122.qmaster.cvos.cluster |
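
As an illustration of testing one of these variables in a script, the sketch below uses PBS_ENVIRONMENT to tell a batch job from an interactive one. The function name job_mode and the text it prints are hypothetical.

```shell
# Report how the current shell was started, based on PBS_ENVIRONMENT.
job_mode() {
    case "${PBS_ENVIRONMENT:-}" in
        PBS_BATCH)       echo "batch" ;;
        PBS_INTERACTIVE) echo "interactive" ;;
        *)               echo "not under PBS" ;;   # variable unset, e.g. on a logon node
    esac
}

job_mode
```

A script could use this, for example, to skip commands that need a terminal when it is run as a batch job.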
Multiple Threaded Applications
This section is only relevant to expert users who are familiar with the difference between threads and processes. In the vast majority of cases memory requests should be made as described in the section on Job Resources.
The pmem parameter specifies the total memory that any process is expected to use; if the process is multi-threaded this specifies the expected total memory per process, not per thread.
Consider the case of a 4-way threaded process requiring a total of 7 GB of memory. This memory could be requested in several ways, only one of which is valid:
We recommend that programs which use posix-threads should always ask for the full node in this way, because otherwise their threads will compete with other users' ordinary processes for the 4 cores, which will increase all the programs' wall-time unpredictably.
Note that this is all entirely different to an MPI program which might create a number of different processes, not threads.
Parallel jobs can either consist of a single program distributed across several cores with communication between the distributed components or the same program running in isolation on several cores. The former will usually consist of a program using a library such as MPI to provide the distributed computing facilities, whilst the latter will make use of a facility such as pbsdsh to spawn multiple copies of the same program on different nodes. Both methods are supported on this cluster. Examples of MPI programs are found on the help pages for the mpi compilers mpicc and mpif90, which are linked from the applications help page.
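For the second kind of parallel job, each copy started by pbsdsh can use the PBS_VNODENUM variable (0 to c-1, where c is the number of cores) to choose its own piece of work. The sketch below is only an illustration: the function name task_input and the input.N file naming are hypothetical, and the default of 0 applies when the script is run outside pbsdsh.

```shell
# Choose the input file for this copy of the program from its task index.
# PBS_VNODENUM is set under pbsdsh; 0 is assumed when it is not set.
task_input() {
    printf 'input.%s\n' "${PBS_VNODENUM:-0}"
}

# Each copy would then run the same program on its own file, for example:
#   my_program < "$(task_input)"
task_input
```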
There are limits on the resources available to individual jobs and to the overall resources in use at any time by a single user. The limits at the time of writing this page are:
| Resource | default queue: default | default queue: maximum | bbquad queue(1): default | bbquad queue(1): maximum | bbtest queue(2): default | bbtest queue(2): maximum |
|---|---|---|---|---|---|---|
| walltime per job | 2 hours | 240 hours | 2 hours | 240 hours | 15 minutes | 15 minutes |
| number of nodes per job | 1 | limited by the maximum simultaneous cores(5) | 1 | 2 | 1 | 2 |
| number of cores per node | 1 | 4 | 8 | 8 | 1 | 4 |
| anticipated per-process memory per core | 1996mb | 16gb(3)(4) | 4032mb | 32gb(4) | 1996mb | 1996mb(4) |

| Resource | default queue | bbquad queue(1) | bbtest queue(2) |
|---|---|---|---|
| maximum number of queued and running jobs per user(5) | 5000 | 10 | 1000 |

| Resource | all queues |
|---|---|
| simultaneous cores in use per user (may be across more than 1 job and 1 queue)(6) | soft limit 256, hard limit 256 |
Notes:
(1) To run on the 4-processor (8 core) 32 GB memory nodes specify -q bbquad on the qsub line, as well as any other resource requests.
(2) To run in the test queue specify -q bbtest on the qsub line, as well as any other resource requests.
(3) The majority of nodes have 8 GB of memory with 18 having 16 GB. Specifying pmem greater than 8gb will schedule jobs to run on these large-memory nodes, but since these are a scarce resource they may queue for longer than jobs scheduled on the 8 GB nodes.
(4) In all cases, the product (number of cores specified by ppn) * (memory per process specified by pmem) must not exceed 7 GB for one of the standard 4 core, 8 GB memory nodes, 15 GB for one of the 4 core, 16 GB memory nodes or 30 GB for one of the 8 core, 32 GB memory nodes. If this limit is exceeded the job will queue without being scheduled. Note that, at present, no warning is given if a job will not run due to the memory requested.
(5) This limit is necessary to preserve the responsiveness of MOAB to commands such as qsub and qstat. An error is given if a user attempts to submit a job when they already have the maximum number of queued and running jobs.
(6) The hard limit is the maximum number of cores that can be in use by a single user at any time across all jobs, which may be running in different queues and includes any interactive jobs. If the system is busy then requests for more than the soft limit on the number of cores will be given a lower priority than requests for the soft limit or fewer cores; if the system is quiet then all requests are treated equally. This helps to prevent large jobs from blocking the cluster when it is busy whilst allowing such jobs to run when the cluster is quiet.
Last updated: 5 September 2011