There is one significant difference between running applications on an HPC cluster and running them on a desktop workstation: on an HPC cluster, the computational work must be packaged into a job, which contains a script specifying the resources the job needs and the commands necessary to perform the work. The job must then be submitted to the cluster through software called a job manager or scheduler. BIRUNI Grid uses the TORQUE software to schedule and run jobs on a dedicated portion of the cluster. This tutorial explains how to prepare a job script, submit the job, and retrieve the results.
Submitting a Job
Jobs are submitted with the
Maximum wallclock time the job will need. The default depends on the queue and is 1 hour for most queues. Walltime is specified in seconds or as
Maximum memory per node the job will need. The default depends on the queue: normally 2 GB for serial jobs and the full node for parallel jobs. Memory should be specified with units, e.g.
Total number of CPUs required. Use this if it does not matter how CPUs are grouped onto nodes, e.g. for a purely MPI job. Don't combine this with -l nodes=num, or odd behavior will ensue.
Number of nodes and number of processors per node required. Use this if you need processes to be grouped onto nodes. For example, an MPI/OpenMP hybrid job with 4 MPI processes and 8 OpenMP threads each would use -l nodes=4:ppn=8. Don't combine this with -l procs=num, or odd behavior will ensue. The default is 1 node and 1 processor per node. When using multiple nodes, the job script is executed on the first allocated node.
Torque will set the environment variables
PBS_NUM_NODES to the number of nodes requested,
PBS_NUM_PPN to the value of
PBS_NP to the total number of processes available to the job.
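The resource requests and environment variables described above can be combined in a single job script. The following is a minimal sketch, not a site-verified recipe. It assumes that jobs are submitted with the standard TORQUE qsub command and that the script uses #PBS directive lines. The file name hybrid_job.pbs, the job name, and the one-hour walltime are hypothetical placeholders, and whether a given queue accepts these values depends on the cluster configuration.

```shell
#!/bin/sh
# Write a hypothetical TORQUE job script to hybrid_job.pbs.
# The quoted 'EOF' keeps the $PBS_* variables unexpanded until the job runs.
cat > hybrid_job.pbs <<'EOF'
#!/bin/bash
#PBS -N hybrid_example
#PBS -l walltime=01:00:00
#PBS -l nodes=4:ppn=8

# Run from the directory the job was submitted from.
cd "$PBS_O_WORKDIR"

# Report the layout Torque granted to this job.
echo "nodes:          $PBS_NUM_NODES"
echo "procs per node: $PBS_NUM_PPN"
echo "total procs:    $PBS_NP"
EOF

# Submit only where qsub is installed (assumed to be the cluster login node).
if command -v qsub >/dev/null 2>&1; then
    qsub hybrid_job.pbs
else
    echo "qsub not found; wrote hybrid_job.pbs without submitting"
fi
```

Because the script requests nodes=4:ppn=8, it does not also request procs, in line with the warning above about combining the two.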
To see the status of a single job, or a list of specific jobs, pass the job IDs to qstat, as in the following example:
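A minimal sketch of such a call, assuming the standard TORQUE qstat command accepts one or more job IDs as arguments. The IDs 12345 and 12346 are hypothetical placeholders for the IDs reported when the jobs were submitted.

```shell
#!/bin/sh
# Hypothetical job IDs; replace with the IDs reported at submission time.
job_ids="12345 12346"

# Build the command as text first so it can be inspected before it is run.
cmd="qstat $job_ids"
echo "$cmd"

# Query the scheduler only where qstat is installed.
if command -v qstat >/dev/null 2>&1; then
    # Unquoted on purpose: each ID must become a separate argument.
    qstat $job_ids
fi
```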
When you start pbstop you see something like the annotated screenshot below. You might need to resize your terminal to make it all fit:
Canceling a Job
To kill a running job, or remove a queued job from the queue, use
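In standard TORQUE installations the command for this is qdel, which takes one or more job IDs. The sketch below assumes the same holds on this cluster. The job ID 12345 is a hypothetical placeholder.

```shell
#!/bin/sh
# Hypothetical job ID; replace with the ID of the job to cancel.
job_id=12345

# Show the command that would be issued.
cmd="qdel $job_id"
echo "$cmd"

# Cancel the job only where qdel is installed.
if command -v qdel >/dev/null 2>&1; then
    qdel "$job_id"
fi
```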