Using PBS on the UFS HPC

Introduction

To allow many users to simultaneously use the resources of the UFS HPC (which is distributed across many nodes), a software framework called the Portable Batch System (PBS) is implemented. This framework is used to submit, cancel and monitor the progress of jobs on the HPC. The following figure outlines how the PBS operates:

[Figure: PBS workflow]

Intricate detail of PBS is unnecessary for a regular HPC user. However, knowing how to use some parts of this framework helps with rudimentary job management.

Thus, this guide only includes typical use-case examples for performing the most common tasks on the HPC.

Using qsub to submit a non-interactive job

The qsub command is used to submit jobs to the UFS HPC. In most cases the user will be provided with a submission script (typically with a .pbs extension). In these cases, after editing your submission script, submit your job as follows:

    $ qsub mysubmitscript.pbs

The submission script will differ by software package (always consult the usage documentation for the specific software package), but generally the header of the script will look like this:

    #!/bin/bash -l

    #################################################################
    ###                     PBS Job Parameters                  #####
    #################################################################
    #PBS -N md_NSP5_ctrl
    #PBS -l nodes=4:ppn=32:prod
    #PBS -l walltime=1000:00:00
    #PBS -S /bin/bash
    #PBS -m abe
    #PBS -o md_NSP5_ctrl.out
    #PBS -e md_NSP5_ctrl.err
    #################################################################

The above options can be given directly to qsub as follows:

    $ qsub -N md_NSP5_ctrl -l nodes=4:ppn=32:prod -l walltime=1000:00:00 \
           -S /bin/bash -m abe -o md_NSP5_ctrl.out -e md_NSP5_ctrl.err \
           mysubmitscript.pbs

However, defining the options in the control script is more convenient and thus preferred.
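For orientation, a complete submission script might look like the following minimal sketch. It is not a script supplied by the HPC staff: the job name, the resource values and the final placeholder command are assumptions chosen for illustration. PBS_O_WORKDIR, PBS_JOBID and PBS_NODEFILE are environment variables that PBS sets inside a running job; the fallbacks only matter if the script is run outside PBS while testing.

```shell
#!/bin/bash -l
#PBS -N example_job
#PBS -l nodes=1:ppn=4:prod
#PBS -l walltime=01:00:00
#PBS -S /bin/bash
#PBS -m abe
#PBS -o example_job.out
#PBS -e example_job.err

# A PBS job typically starts in the home directory, so move to the
# directory from which qsub was invoked. The fallback to $PWD is only
# used when the script is run outside PBS.
workdir="${PBS_O_WORKDIR:-$PWD}"
cd "$workdir" || exit 1

echo "Job ${PBS_JOBID:-not-in-pbs} started in $workdir"

# PBS_NODEFILE normally lists one line per allocated core.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    ncores=$(wc -l < "$PBS_NODEFILE")
    echo "Allocated cores: $ncores"
fi

# Placeholder (assumption): replace with the command for your software.
echo "Replace this line with the program to run"
```

Lines beginning with #PBS are read by qsub as options, while bash treats them as ordinary comments, which is why the same file works both as a job description and as a shell script.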

qsub options

The following list describes the options used in the previous control script example:

Option  Description
-N      Name of the submitted job
-l      Comma-separated list of requested resources
-S      Shell to use
-o      Output path used for standard output (STDOUT)
-e      Output path used for standard error (STDERR)
-m      Email the user when the job aborts (a), begins (b) or ends (e)

How to define the resources required

In the example above, resources are requested on two separate -l lines. On the first line, the nodes and ppn variables and the prod keyword are joined by colons; walltime is given on its own line.

Note: Appending prod to the first -l line is necessary to ensure that the job is submitted to nodes that are in production.

Variable  Description
nodes     The number of nodes
ppn       Processors/cores per node
walltime  The time (hh:mm:ss) reserved for the job

Thus, in the example of nodes=4 and ppn=32, the total number of cores reserved will be 32 x 4 = 128.
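The same arithmetic can be checked in the shell before submitting a job. The sketch below uses the values from the example header; the variable names are illustrative only, and the core-hours figure is simply cores multiplied by walltime, not a quantity reported by PBS itself. It assumes bash (as used in the submission scripts).

```shell
# Values taken from the example header above
nodes=4
ppn=32
walltime="1000:00:00"

# Total cores reserved = nodes x cores per node
total_cores=$(( nodes * ppn ))

# Split hh:mm:ss into fields; 10# forces base 10 so that a field
# such as "08" is not misread as an octal number.
IFS=: read -r hh mm ss <<< "$walltime"
walltime_seconds=$(( 10#$hh * 3600 + 10#$mm * 60 + 10#$ss ))

# Upper bound on core-hours the job can consume
core_hours=$(( total_cores * walltime_seconds / 3600 ))

echo "Total cores: $total_cores"
echo "Walltime:    $walltime_seconds s"
echo "Core-hours:  $core_hours"
```

A quick check like this makes it easier to see how large a reservation is before it enters the queue.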

Using qsub to submit an interactive job

The qsub command can also be used to request an interactive session. However, direct usage of qsub is discouraged for normal users, who should use the qwiz script instead.

Follow these steps to submit an interactive job with qwiz:

  1. Invoke the qwiz command

    $ qwiz
    
  2. Follow the prompts on screen to define the resources required for the session. The user may simply press enter to accept the defaults in square brackets.

[Figure: qwiz example]

  3. If the required resources are available, the job will be automatically submitted and the user will be moved from the login node to one of the compute nodes on the cluster.

  4. To cancel the session early (thus releasing the resources used), use qdel as described below. The message "Terminated" should appear on screen if this was successful:

[Figure: qwiz terminate]

Using qstat to check the status of running jobs

To get information about the status of running jobs the qstat command can be used as follows:

    $ qstat

The status of all the user's jobs in the queue will be displayed on screen. For example:

[Figure: qstat example]

The S column shows the status of the job:

  • R - The job is currently running
  • Q - The job is in the queue, awaiting available resources
  • C - The job has completed (this includes jobs that were cancelled)

Note that to cancel the job, the job id will be required.
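When scanning qstat output in a script, the single-letter codes can be translated into words. The helper below is a minimal sketch covering only the three codes listed above; the function name describe_state is hypothetical and not part of PBS.

```shell
# Translate a qstat status letter into a short description.
# Only the codes described in this guide are handled explicitly.
describe_state() {
    case "$1" in
        R) echo "running" ;;
        Q) echo "queued, awaiting resources" ;;
        C) echo "completed or cancelled" ;;
        *) echo "other state: $1" ;;
    esac
}

describe_state R
describe_state Q

# On the cluster, the state of a single job can be read from the full
# listing (<jobid> is a placeholder for a real job id):
#   qstat -f <jobid> | grep job_state
```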

Using qdel to cancel a running job

To cancel a running job, the user can use qdel together with the job id.

Follow these steps to delete a job using qdel:

  1. Use qstat to obtain the job id for the job to cancel

  2. Use qdel with the job id:

    $ qdel <jobid>
    
  3. Use qstat to confirm that the job has been cancelled. The process was successful if the job is absent from the queue or its status has changed from R to C.
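The steps above can be sketched as a small shell function. The name cancel_job is hypothetical; it only wraps the qdel and qstat commands described in this guide and refuses to run if no job id is given or if qdel is not available on the current machine.

```shell
# Cancel a job and then show its status (sketch, not an official tool).
cancel_job() {
    local jobid="$1"

    # A job id (obtained from qstat) is required.
    if [ -z "$jobid" ]; then
        echo "usage: cancel_job <jobid>" >&2
        return 2
    fi

    # qdel only exists on machines where PBS is installed.
    if ! command -v qdel >/dev/null 2>&1; then
        echo "qdel not found; run this on the HPC login node" >&2
        return 127
    fi

    # Request cancellation of the job.
    qdel "$jobid" || return 1

    # Confirm the result: the job should be absent or show status C.
    qstat "$jobid" || echo "Job $jobid is no longer in the queue"
}
```

It would be called as cancel_job <jobid>, with the job id taken from the qstat listing.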

External Guides and Resources

  • No external guides are available.
  • If you know of a guide/tutorial that you have found useful, please help us share it by contacting the HPC staff at hpc@ufs.ac.za