Using PBS on the UFS HPC
Introduction
To allow many users to simultaneously use the resources of the UFS HPC (which is distributed across many nodes), a software framework called the Portable Batch System (PBS) is implemented. This framework is used to submit, cancel and monitor the progress of jobs on the HPC. The following figure outlines how the PBS operates:
Detailed knowledge of PBS is unnecessary for a regular HPC user; however, knowing how to use some parts of this framework helps with basic job management.
This guide therefore covers only typical use cases for the most common tasks on the HPC.
Using qsub to submit a non-interactive job
The qsub command is used to submit jobs to the UFS HPC. In most cases the user will be provided with a submission script (typically with a .pbs extension). After editing your submission script, submit your job as follows:
$ qsub mysubmitscript.pbs
The submission script will differ by software package (always consult the usage documentation for the specific package), but the header of the script will generally look like this:
#!/bin/bash -l
#################################################################
### PBS Job Parameters #####
#################################################################
#PBS -N md_NSP5_ctrl
#PBS -l nodes=4:ppn=32:prod
#PBS -l walltime=1000:00:00
#PBS -S /bin/bash
#PBS -m abe
#PBS -o md_NSP5_ctrl.out
#PBS -e md_NSP5_ctrl.err
#################################################################
The above options can be given directly to qsub as follows:
$ qsub -N md_NSP5_ctrl -l nodes=4:ppn=32:prod -l walltime=1000:00:00 \
    -S /bin/bash -m abe -o md_NSP5_ctrl.out -e md_NSP5_ctrl.err \
    mysubmitscript.pbs
However, defining the options in the submission script is more convenient and is therefore preferred.
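The header only describes the job to PBS; the commands to run follow below it. The block below is a minimal sketch of such a script body, not the template supplied for any particular software package. PBS_O_WORKDIR and PBS_NODEFILE are standard variables set by PBS inside a job; the fallback values and the final echo are assumptions added so that the sketch also runs outside a job.

```shell
#!/bin/bash
# Sketch of a script body that would follow the #PBS header lines.

# PBS sets PBS_O_WORKDIR to the directory from which qsub was invoked.
# The fallback to the current directory is an assumption so that the
# sketch also runs outside of a PBS job.
workdir="${PBS_O_WORKDIR:-$PWD}"
cd "$workdir" || exit 1

# PBS_NODEFILE names a file with one line per allocated core.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    ncores=$(wc -l < "$PBS_NODEFILE")
else
    ncores=1   # assumption: default used when not running under PBS
fi
ncores=$(( ncores ))   # normalise any padding in the wc output

echo "Running in ${workdir} on ${ncores} core(s)"
# Placeholder: replace the echo above with the launch command documented
# for your software package.
```

Changing into PBS_O_WORKDIR first means that relative paths in the script resolve against the directory the job was submitted from, rather than the home directory.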
qsub options
The following table describes the options used in the previous submission script example:
Option | Description |
---|---|
-N | Name of the submitted job |
-l | Comma separated list of requested resources |
-S | Shell to use |
-o | Output path used for standard output (STDOUT) |
-e | Output path used for standard error (STDERR) |
-m | Email the user when the job aborts (a), begins (b) or ends (e) |
How to define the resources required
In the example above, resources are requested on two separate -l lines. On the first line, the node attributes (nodes, ppn and prod) are separated by colons.
Note: Appending prod to the first -l line is necessary to ensure that the job is submitted to nodes that are in production.
Variable | Description |
---|---|
nodes | The number of nodes |
ppn | Processors/cores per node |
walltime | The time (hh:mm:ss) reserved for the job |
Thus, in the example of nodes=4 and ppn=32, the total number of cores reserved will be 32 x 4 = 128.
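The same arithmetic can be checked in the shell before submitting. This is a small sketch using the values from the example header; the variable names are illustrative only.

```shell
#!/bin/bash
# Values taken from the example header above.
nodes=4
ppn=32

# Total cores reserved = number of nodes x cores per node.
total_cores=$(( nodes * ppn ))

# Assemble the matching resource request, including the prod attribute.
resource_request="nodes=${nodes}:ppn=${ppn}:prod"

echo "Requesting ${resource_request} (${total_cores} cores in total)"
```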
Using qsub to submit an interactive job
The qsub command can also be used to request an interactive session. However, direct usage of qsub is discouraged for normal users, who should use the qwiz script instead.
Follow these steps to submit an interactive job with qwiz:
1. Invoke the qwiz command:
$ qwiz
2. Follow the prompts on screen to define the resources required for the session. Press Enter to accept the defaults shown in square brackets.
3. If the required resources are available, the job will be submitted automatically and the user will be moved from the login node to one of the compute nodes on the cluster.
4. To cancel the session early (thus releasing the resources used), use qdel as described below. The message "Terminated" should appear on screen if this was successful.
Using qstat to check the status of running jobs
To get information about the status of running jobs the qstat command can be used as follows:
$ qstat
The status of all the user's jobs in the queue will be displayed on screen. For example:
The S column shows the status of the job:
- R - The job is currently running
- Q - The job is in the queue, awaiting available resources
- C - The job has completed (this includes jobs that were cancelled)
Note that to cancel the job, the job id will be required.
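For use in scripts, the state of a single job can also be read from the full listing produced by `qstat -f <jobid>`, which contains a line of the form `job_state = R`. The helper below is a hedged sketch; the function name job_state is hypothetical, and the exact layout of the full listing should be checked against the output on the cluster.

```shell
#!/bin/bash
# job_state JOBID: print the one-letter state (e.g. R, Q or C) of a job.
# Relies on the "job_state = X" line in the output of `qstat -f JOBID`.
job_state() {
    qstat -f "$1" | awk -F' = ' '
        /job_state/ { gsub(/[[:space:]]/, "", $2); print $2; exit }'
}
```

It would be called as `job_state <jobid>`, with the job id taken from the first column of the qstat output.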
Using qdel to cancel a running job
To cancel a running job, the user can use qdel together with the job id.
Follow these steps to delete a job using qdel:
- Use qstat to obtain the job id for the job to cancel
- Use qdel with the job id:
$ qdel <jobid>
- Use qstat to confirm that the job has been cancelled. The process was successful if the job is absent from the queue or its status changed from R to C.
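The steps above can be wrapped in a small shell function. This is a sketch only: the name cancel_job is hypothetical, and it assumes that qstat returns a non-zero exit status once the job id is no longer known to the server.

```shell
#!/bin/bash
# cancel_job JOBID: delete a job with qdel, then report what qstat shows.
cancel_job() {
    local jobid="$1"
    if [ -z "$jobid" ]; then
        echo "usage: cancel_job <jobid>" >&2
        return 2
    fi

    # Step 2: request deletion of the job.
    qdel "$jobid" || return 1

    # Step 3: the job is either absent from the queue or listed with state C.
    # Assumption: qstat fails for a job id the server no longer knows.
    if qstat "$jobid" >/dev/null 2>&1; then
        echo "Job ${jobid} is still listed; check that its state is C"
    else
        echo "Job ${jobid} is no longer in the queue"
    fi
}
```

Because qsub prints the id of the new job on standard output, the id can be captured at submission time with `jobid=$(qsub mysubmitscript.pbs)` and later passed on as `cancel_job "$jobid"`.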
External Guides and Resources
- No external guides are available.
- If you know of a guide/tutorial that you have found useful, please help us share it by contacting the HPC staff at hpc@ufs.ac.za