There is one significant difference between running an application on an HPC cluster and on a desktop workstation: on an HPC cluster, the computational work must be packaged into a job, consisting of a script that specifies what resources the job needs and the commands necessary to perform the work. The job is then submitted to the cluster through a piece of software called a job manager or scheduler. BIRUNI Grid uses TORQUE to schedule jobs and run them on a dedicated portion of the cluster. This tutorial explains how to prepare a job script, submit the job, and retrieve the results.
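
As a concrete starting point, here is a minimal job script (the job name, resource values, and program name are placeholders; adapt them to your own application):

#!/bin/bash
#PBS -N my_first_job
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:10:00

# Torque starts the job in your home directory, so change to the
# directory the job was submitted from.
cd $PBS_O_WORKDIR

# Run the program (placeholder command).
./my_program

Save this as, say, my_script.q, and submit it with qsub as described below.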

Submitting a Job

Jobs are submitted with the qsub command:

$ qsub options job-script

The options give Torque information about the job, such as what resources it will need. They can be specified in the job script as PBS directives, on the command line as options to qsub, or both (in which case the command-line options take precedence should the two contradict each other). For each option there is a corresponding PBS directive with the syntax:

#PBS option

For example, you can specify that a job needs 2 nodes and 8 cores on each node by adding the following directive to the script:


#PBS -l nodes=2:ppn=8


or as a command-line option to qsub when you submit the job: 

$ qsub -l nodes=2:ppn=8 my_script.q
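
As noted above, the command-line option wins when the two disagree. For example, if my_script.q contains the directive #PBS -l walltime=01:00:00 but you submit it with (the walltime values here are purely illustrative):

$ qsub -l walltime=02:00:00 my_script.q

the job is given two hours of walltime, not one.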


Options to manage job output:

  • -N jobname
    Give the job a name. The default is the filename of the job script. Within the job, $PBS_JOBNAME expands to the job name.
  • -o path/for/stdout
    Send stdout to path/for/stdout. Can be a filename or an existing directory. The default filename is $PBS_JOBNAME.o${PBS_JOBID/.*}, eg myjob.o12345, in the directory from which the job was submitted.
  • -e path/for/stderr
    Send stderr to path/for/stderr. Same usage as for stdout.
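
A job script header combining these options might look like the following sketch (the job name and paths are illustrative, and the logs directory must already exist):

#PBS -N model_run
#PBS -o logs/model_run.out
#PBS -e logs/model_run.err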


Options to request compute resources:

  • -l walltime=walltime
    Maximum wallclock time the job will need. Default depends on queue, mostly 1 hour. Walltime is specified in seconds or as hh:mm:ss or mm:ss.
  • -l mem=memory
    Maximum memory per node the job will need. Default depends on queue, normally 2GB for serial jobs and the full node for parallel jobs. Memory should be specified with units, eg 500MB or 8GB.
  • -l procs=num
    Total number of CPUs required. Use this if it does not matter how CPUs are grouped onto nodes - eg, for a purely-MPI job. Don't combine this with -l nodes=num or odd behavior will ensue.
  • -l nodes=num:ppn=num
    Number of nodes and number of processors per node required. Use this if you need processes to be grouped onto nodes - eg, for an MPI/OpenMP hybrid job with 4 MPI processes and 8 OpenMP threads each, use -l nodes=4:ppn=8. Don't combine this with -l procs=num or odd behavior will ensue. Default is 1 node and 1 processor per node. When using multiple nodes the job script will be executed on the first allocated node.
    Torque will set the environment variables PBS_NUM_NODES to the number of nodes requested, PBS_NUM_PPN to the value of ppn and PBS_NP to the total number of processes available to the job.
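
Putting these together, the resource request for the hybrid example above could be written in a job script as follows (the walltime and memory values are illustrative, and the launcher line is a common pattern rather than a BIRUNI-specific recipe; exact mpirun flags vary by MPI library):

#PBS -l nodes=4:ppn=8
#PBS -l walltime=12:00:00
#PBS -l mem=16GB

# With this request Torque sets PBS_NUM_NODES=4, PBS_NUM_PPN=8
# and PBS_NP=32 inside the job.

# A common pattern for the hybrid case: one MPI process per node,
# each running PBS_NUM_PPN OpenMP threads.
export OMP_NUM_THREADS=$PBS_NUM_PPN
mpirun -np $PBS_NUM_NODES ./my_hybrid_program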


Monitoring Jobs

To see the status of a single job - or a list of specific jobs - pass the Job IDs to qstat, as in the following example: 

$ qstat 3593014 3593016
Job id        Name             User            Time Use S Queue
------------- ---------------- --------------- -------- - -----
3593014       model_scen_1     ab123           7:23:47  R s48
3593016       model_scen_1     ab123           7:23:26  R s48
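
To list every job belonging to a user rather than specific Job IDs, qstat also accepts a user filter (replace ab123 with your own username):

$ qstat -u ab123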

Most of the fields in the output are self-explanatory. The second-last column "S" is the job status, which can be:

  • Q meaning "Queued"
  • H meaning "Held" - this may be the result of a manual hold or of a job dependency
  • R meaning "Running"
  • C meaning "Completed". After the job finishes, it will remain with "completed" status for a short time before being removed from the batch system.

Other, less common job status flags are described in the manual (man qstat).

The program pbstop, available on the login nodes, shows which jobs are currently running on which nodes and cores of a cluster.

Jobs belonging to a single user can be highlighted by launching pbstop with the -u switch:

pbstop -u <username>

(replace <username> with your username). Or, you can use the alias "me":

pbstop -u me 

When you start pbstop you see a character-based overview of the cluster's nodes and cores and the jobs occupying them. You might need to resize your terminal to make it all fit.


Canceling a Job

To kill a running job, or remove a queued job from the queue, use qdel:

$ qdel jobid
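
For example, to cancel the first job from the qstat listing above:

$ qdel 3593014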

To cancel ALL of your jobs:

$ qdel all
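
To cancel only a subset of your jobs, a common Torque idiom feeds a qselect query to qdel; for example, to remove all of your jobs that are still queued (replace <username> with your username; this assumes qselect is available on the login nodes):

$ qdel $(qselect -u <username> -s Q)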
