There is one significant difference between running applications on an HPC cluster and on a desktop workstation: on a cluster, the computational work must be packaged into a job, a script specifying what resources the work will need and the commands necessary to perform it. The job is then submitted to the cluster through a piece of software called a job scheduler. BIRUNI Grid uses the TORQUE scheduler to run jobs on a dedicated portion of the cluster. This tutorial explains how to prepare a job script, submit the job, and retrieve the results.
Submitting a Job
Jobs are submitted with the qsub command:
$ qsub options job-script
The options tell Torque about the job, such as what resources it will need. They can be specified in the job script as PBS directives, on the command line as options, or both (in which case the command-line options take precedence should the two contradict each other). For each option there is a corresponding PBS directive with the syntax:

#PBS option
For example, you can specify that a job needs 2 nodes and 8 cores on each node by adding this directive to the script:

#PBS -l nodes=2:ppn=8
or as a command-line option to qsub when you submit the job:
$ qsub -l nodes=2:ppn=8 my_script.q
Options to manage job output:
-N jobname
Give the job a name. The default is the filename of the job script. Within the job, $PBS_JOBNAME expands to the job name.

-o path/for/stdout
Path for the job's stdout. Can be a filename or an existing directory. The default filename is myjob.o12345 (the job name, then ".o" and the job ID), in the directory from which the job was submitted.

-e path/for/stderr
Path for the job's stderr. Same usage as for -o.
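As a sketch, the output-management options above could appear as directives in a job script like this (the job name and log paths are illustrative, not required values):

```shell
#!/bin/bash
#PBS -N my_analysis
#PBS -o logs/my_analysis.out
#PBS -e logs/my_analysis.err

# Inside a Torque job, $PBS_JOBNAME holds the value given to -N;
# the fallback here just keeps the script runnable outside the cluster.
jobname=${PBS_JOBNAME:-my_analysis}
echo "running job $jobname"
```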
Options to request compute resources:
-l walltime=hh:mm:ss
Maximum wallclock time the job will need. The default depends on the queue, mostly 1 hour. Walltime is specified in seconds or as hh:mm:ss.
-l mem=size
Maximum memory per node the job will need. The default depends on the queue, normally 2GB for serial jobs and the full node for parallel jobs. Memory should be specified with units, e.g. -l mem=4gb.
-l procs=num
Total number of CPUs required. Use this if it does not matter how the CPUs are grouped onto nodes - e.g., for a purely-MPI job. Don't combine this with -l nodes=num or odd behavior will ensue.
-l nodes=num:ppn=num
Number of nodes and number of processors per node required. Use this if you need processes to be grouped onto nodes - e.g., for an MPI/OpenMP hybrid job with 4 MPI processes and 8 OpenMP threads each, use -l nodes=4:ppn=8. Don't combine this with -l procs=num or odd behavior will ensue. The default is 1 node and 1 processor per node. When multiple nodes are requested, the job script is executed on the first allocated node.
Torque will set the environment variable PBS_NUM_NODES to the number of nodes requested, PBS_NUM_PPN to the value of ppn, and PBS_NP to the total number of processes available to the job.
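A minimal sketch of a hybrid job script using these variables, based on the 4-node, 8-cores-per-node example above (the fallback values only exist so the script can also run outside the batch system):

```shell
#!/bin/bash
#PBS -l nodes=4:ppn=8
#PBS -l walltime=01:00:00

# Torque sets these inside a running job; the fallbacks mirror
# the requested resources for testing outside Torque.
nodes=${PBS_NUM_NODES:-4}
ppn=${PBS_NUM_PPN:-8}
np=${PBS_NP:-32}

# A common pattern: one OpenMP thread per core on each node.
export OMP_NUM_THREADS=$ppn
echo "$nodes nodes x $ppn cores = $np processes"
```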
Monitoring a Job
To see the status of a single job - or a list of specific jobs - pass the job IDs to qstat, as in the following example:
$ qstat 3593014 3593016
Job id Name User Time Use S Queue
------------- ---------------- --------------- -------- - -----
3593014 model_scen_1 ab123 7:23:47 R s48
3593016 model_scen_1 ab123 7:23:26 R s48
Most of the fields in the output are self-explanatory. The second-to-last column, "S", is the job status, which can be:
- Q meaning "Queued"
- H meaning "Held" - this may be the result of a manual hold or of a job dependency
- R meaning "Running"
- C meaning "Completed". After the job finishes, it will remain with "completed" status for a short time before being removed from the batch system.
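The status column can be filtered with ordinary shell tools. As a sketch, using the sample listing above as canned input (in real use you would pipe live qstat output instead):

```shell
# Canned qstat output taken from the listing above.
qstat_output='3593014 model_scen_1 ab123 7:23:47 R s48
3593016 model_scen_1 ab123 7:23:26 R s48'

# Column 5 is the "S" status flag; count jobs in the Running state.
running=$(printf '%s\n' "$qstat_output" | awk '$5 == "R"' | wc -l)
echo "$running running job(s)"
```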
Other, less common job status flags are described in the qstat manual page (man qstat).
pbstop, available on the login nodes, shows which jobs are currently running on which nodes and cores of a cluster.
Jobs belonging to a single user can be highlighted by launching pbstop with the -u option:

pbstop -u <username>

(replace <username> with your username). Or, you can use the alias "me":

pbstop -u me
When you start pbstop you will see a full-screen overview of the cluster's nodes and the jobs running on them. You might need to resize your terminal to make it all fit.
Canceling a Job
To kill a running job, or to remove a queued job from the queue, use qdel with the job ID:
$ qdel jobid
To cancel ALL of your jobs:
$ qdel all
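qdel also accepts a list of job IDs, so you can cancel a chosen subset of your jobs. A sketch of extracting IDs from qstat-style output (canned here from the listing above; in real use, pipe qstat -u <username> instead):

```shell
# Canned listing; substitute live qstat output in practice.
listing='3593014 model_scen_1 ab123 7:23:47 R s48
3593016 model_scen_1 ab123 7:23:26 R s48'

# The first column holds the job IDs; collect them on one line.
job_ids=$(printf '%s\n' "$listing" | awk '{print $1}' | xargs echo)
echo "would run: qdel $job_ids"
```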