Creating a Single Batch Job
This page gives step-by-step instructions to teach you to submit a batch job on the Social Sciences Computing Cluster. Please work through this page from top to bottom in detail to create your first batch job. The following links outline the process, and also provide easy access to information when you need to review it later on.
- Create a Shell Script for Your Job
- Make the Script Executable
- Submit Your Job to the Queue
- Check the Status of Your Jobs
- Delete a Job
- Where is the Output?
- Submit Multiple Batch Jobs Using Job Arrays
- Submit Multiple Individual Jobs
Hardin and seldon (the interactive login nodes to which you connect from your own computer) are not the computational workhorses of the cluster. The main computing power of the cluster lies in the additional 392 processors that are available to programs submitted to the PBS Professional workload manager and job scheduler. Submitted jobs are placed in the job queue and put into execution when the resources they need become available. These instructions show you how to use that power.
The job queue is like a valet:
- You give it brief instructions (a shell script) telling it what program to run and how to run it.
- It waits until the required resources are free.
- It runs the program with exclusive access to those resources until it is finished.
- It writes out errors and output from your programs.
- It sends you e-mail notifications at start and finish, if you wish.
Up to 180 of your jobs, if single-CPU, can be executing at a time, when resources are available. However, if your jobs will run for more than a few hours, please limit your use to 50 jobs at once to keep the system available for other users. Each job will have exclusive use of one processor and up to 2GB of memory unless you specify otherwise. These policy restrictions may change at any time.
Design your jobs to be monitored. Write output to files in your home directory (not standard output) so that you can check your results as they are written. For example, if your program is writing to the file results.lst, then, at your command prompt, type tail -f results.lst and you'll see the lines as they are created. Type <Ctrl>-C to return to the shell prompt. If your job is not running correctly, delete it.
Design your jobs to be rerun. Do not change the files your job will be reading before the job finishes. Batch jobs may be rerun for a variety of reasons -- priority decisions, node failure, or administrative maintenance.
Jobs in execution will be preempted by the scheduler if more than 4 of your jobs are running at one time and another job becomes more deserving of execution, as determined by a fairshare algorithm. A preempted job is either suspended (temporarily stopped from execution) or terminated (ended and put back into the input queue). Of the jobs eligible to be preempted, the job that has used the least wall clock time will be selected.
Preempted jobs gain priority over time, and they will be put into execution again, possibly preempting other jobs.
You can improve the throughput of your jobs, and possibly avoid preemption, if you limit the maximum amount of wall clock time your job will use by specifying that limit when you submit the job. Be sure to set the limit high enough to allow for speed variations between nodes. The scheduler will factor this limit into the fairshare decision-making process. If you do not specify a wall clock time limit, the scheduler will assume your job will run for one month.
Create a Shell Script for Your Job
A shell script is a set of ordered instructions that the batch node uses to find and run your program. It's a simple text file (usually with a .txt extension). You should create it with the evim editor on an interactive login system such as hardin, seldon or mule2.
To create a shell script called myprog.txt, type evim myprog.txt at the $ prompt:
[abc123@seldon abc123]$ evim myprog.txt
When you are finished editing, exit evim, saving the file.
To run one program you need an eight-line shell script like this (which runs MATLAB):
#!/bin/bash
#PBS -j oe
#PBS -l walltime=08:00:00
#PBS -l mem=2gb
#PBS -l ncpus=1

cd ~/myprograms
matlab -nosplash -nodisplay -nodesktop -r 'myprog'
The first line, #!/bin/bash, specifies which shell program to use. It is mandatory and does not change. Be sure to type it correctly.
The second line, #PBS -j oe, tells PBS to join standard output and standard error together in the output file that is delivered back to the submitting directory. Note that there is a space between the join option (-j) and its values (oe).
The resource list directive "#PBS -l" is specified with the lower-case letter L, not the numeral one.
The third line, #PBS -l walltime=08:00:00, sets the wall clock time limit for your job: the maximum amount of wall clock time the job will take. Wall clock time is specified in hours:minutes:seconds, so one hour would be written as "1:00:00". Typical jobs run at 98% CPU utilization, which means that the CPU time for the job will be 98% of the wall clock time used to finish the job. Jobs that involve a lot of disk input/output will be less efficient, and you should adjust your wall clock limit accordingly.
Choosing an execution time limit is a matter of experience. Start with a high estimate. If your job exceeds the execution time limit, it will be terminated by the batch system and an error message will be written to standard error explaining the reason. See Where is the Output?
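If you have timed a test run, a small helper can turn that measurement into a walltime value with some headroom. The sketch below is only an illustration: the function name walltime_limit and the 25% default margin are assumptions, not SSCC policy.

```shell
#!/bin/bash
# Convert a run-time estimate in seconds to the hours:minutes:seconds form
# used by "#PBS -l walltime=", after adding a safety margin.
# walltime_limit and the 25% default margin are illustrative assumptions.
walltime_limit() {
    local seconds=$1
    local margin_pct=${2:-25}
    local padded=$(( seconds + seconds * margin_pct / 100 ))
    printf '%02d:%02d:%02d\n' \
        $(( padded / 3600 )) $(( padded % 3600 / 60 )) $(( padded % 60 ))
}

# A test run that took 6 hours 24 minutes (23040 seconds):
echo "#PBS -l walltime=$(walltime_limit 23040)"
```

Pass a second argument to change the margin, for example walltime_limit 23040 50 for 50% headroom.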
The fourth line, #PBS -l mem=2gb, sets the memory limit for your job: the maximum amount of memory the job can use. Units are kb, mb and gb. Most SSCC batch jobs run in 2gb or less of memory. If you specify more memory than your application actually uses, your memory utilization will be low (see pbstat below), and you will in essence be denying others the use of that memory because PBS packs multiple jobs together on any single node. The maximum you can request is 64gb (see hardware specifications).
The fifth line, #PBS -l ncpus=1, tells PBS that your job requires one CPU. All of the CPUs you request will come from the same node, so the maximum you can specify is 16 (see hardware specifications). Request more than one CPU only if your application is programmed to use multiple CPUs (e.g. in R, Stata/MP, Mplus, or MATLAB's Parallel Computing Toolbox). Otherwise, just omit the entire line.
If you specify more CPUs than your application can use, your CPU utilization will be low (see pbstat below), and you will in essence be denying others the use of those CPUs. Remember that most applications only use one CPU, unless you do something very specific to use multiple CPUs.
If you fail to specify enough CPUs, your CPU utilization will be high (see pbstat below), and you will be slowing down all jobs running on your execution node. Your job will be subject to termination in this situation. Each batch node has a job monitor that kills jobs that exceed their allocated resources (number of CPUs and memory are most commonly the issues).
The sixth line of the example script is blank. It makes the script more readable and it signals the end of the #PBS directive prologue section. Any #PBS directive that follows that blank line will be ignored by the PBS batch system.
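Because a misplaced directive is silently ignored, it can be worth checking a script before you submit it. The following is a minimal sketch, assuming the rule described above (a blank line or a command ends the prologue); check_pbs_prologue is a hypothetical helper, not a PBS command.

```shell
#!/bin/bash
# Report any #PBS line that appears after the directive prologue has ended.
# check_pbs_prologue is a hypothetical helper written for this example.
check_pbs_prologue() {
    awk '
        /^#PBS/ {
            if (ended) { printf "line %d ignored by PBS: %s\n", NR, $0; bad = 1 }
            next
        }
        /^#/ { next }        # interpreter line and ordinary comments
        { ended = 1 }        # a blank line or a command ends the prologue
        END { exit bad + 0 }
    ' "$1"
}

# Demonstration on a scratch file with one misplaced directive.
late=$(mktemp)
cat > "$late" <<'EOF'
#!/bin/bash
#PBS -j oe

cd ~/myprograms
#PBS -l mem=2gb
EOF

check_pbs_prologue "$late" || echo "move the late directive above the blank line"
```

The function returns a non-zero status when it finds a misplaced directive, so it can be used as a guard in front of a submit command.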
The seventh line, cd ~/myprograms, tells the cluster node to change your working directory to ~/myprograms (all the nodes use the same storage system for your home directory). The tilde sign ~ is a shortcut to your home directory. If you omit this line, your batch job will begin execution in the directory that it was submitted from.
The last line tells MATLAB to start in batch mode and run myprog.m located in your myprograms directory. The name myprograms is just an example, and has no special meaning.
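If you prefer not to use an editor, the lines described in the walkthrough above can also be written to a file from the shell prompt with a here-document. This is a sketch only; the scratch directory is an assumption so that the example does not overwrite anything, and on the cluster you would normally write myprog.txt in your own working directory.

```shell
#!/bin/bash
# Sketch: write the example job script to a file with a here-document.
# The scratch directory is an assumption made for illustration.
workdir=$(mktemp -d)
script="$workdir/myprog.txt"

# The quoted 'EOF' delimiter stops the shell from expanding anything in the body.
cat > "$script" <<'EOF'
#!/bin/bash
#PBS -j oe
#PBS -l walltime=08:00:00
#PBS -l mem=2gb
#PBS -l ncpus=1

cd ~/myprograms
matlab -nosplash -nodisplay -nodesktop -r 'myprog'
EOF

# Print the script with line numbers to compare against the walkthrough.
cat -n "$script"
```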
Commands for running in batch mode differ among program applications. Find yours in the list below.
Application Line: MATLAB
matlab -nosplash -nodisplay -nodesktop -r 'commands;exit'
Runs MATLAB commands or your own M-files from the working directory. Separate multiple commands with commas or semicolons (;). Do not include the pathname or a file extension (.m) to run an M-file. Put quotes around your list of commands. Be sure to end with an exit command. Do not put spaces between the commands in the list.
You can pass parameters to your M-file using this syntax, for example:
-r 'myprog(3.8, 0.2, 2.5);exit'
If you are submitting multiple jobs which execute the same MATLAB program with different parameters, you need a way to distinguish the output files. You can do this simply by printing the parameter values at the beginning of your MATLAB code. You can also redirect MATLAB output into a log file with a name that contains the parameters. This command, using the greater-than sign,
-r 'myprog(3.8, 0.2, 2.5);exit' > myprog_3.8_0.2_2.5.log
will save MATLAB output in a file called myprog_3.8_0.2_2.5.log.
An alternative MATLAB command form uses input redirection in place of the -r option:
< myprog.m
The left arrow (actually the less-than sign) feeds myprog.m into MATLAB line by line, as if you were typing it in. You cannot pass parameters to your program using this syntax.
You can also redirect the output to a special log file by adding "> filename.log" to the command:
< myprog.m > filename.log
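When you run the same M-file with many parameter sets, typing each application line and log file name by hand invites mistakes. The sketch below generates the lines instead. The function name matlab_line and the second parameter set are made up for illustration; the command form follows the -r syntax shown above.

```shell
#!/bin/bash
# Sketch: build one MATLAB application line per parameter set, each with a
# log file name that contains the parameters.
# matlab_line and the parameter values are illustrative assumptions.
matlab_line() {
    local a=$1 b=$2 c=$3
    printf "matlab -nosplash -nodisplay -nodesktop -r 'myprog(%s, %s, %s);exit' > myprog_%s_%s_%s.log\n" \
        "$a" "$b" "$c" "$a" "$b" "$c"
}

# One parameter set per line: a b c
while read -r a b c; do
    matlab_line "$a" "$b" "$c"
done <<'EOF'
3.8 0.2 2.5
4.1 0.2 2.5
EOF
```

Each printed line can then be used as the last line of its own job script.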
Application Line: Stata
stata < myprog.do > myprog.log
Stata will read its commands from myprog.do and write its output to myprog.log in the working directory.
If given ncpus=1, Stata/SE will be run. If ncpus=4, Stata/MP will be run. Do not specify more than 4 ncpus.
Application Line: SAS
SAS will write a log file named myprog.log, showing the commands executed and any errors, and a separate output file with statistical results.
Application Line: R
R CMD BATCH myprog.R myprog.log
R will write the log file myprog.log.
Application Line: Ox
oxl finance.ox > finance.lst
oxl finance.oxo > finance.lst
Ox writes its results to standard output, which in this case is redirected to the file named finance.lst.
Application Line: Mathematica
module load mathematica; math -script input.txt > output.txt
Note that the mathematica module must be loaded first in this compound application line. The semi-colon separator is necessary to put both commands on a single line. Alternatively, you could split the commands into two lines. The -script command line option specifies that the Mathematica kernel will be run in batch mode. Commands are processed in order from the file input.txt. Default line wrapping is turned off and no In and Out labels are printed. All output goes to the file specified as output.txt.
Application Line: Mplus
mplus inputfile.inp outputfile.out >> outputfile.out
Mplus reads its commands from inputfile.inp (the first filename specified) and writes results to outputfile.out (the second specified filename). Because Mplus always writes job statistics to standard output, it's good to combine those statistics with your output using the specification >> outputfile.out (with a double greater-than sign) at the end of the command.
Some (but not all) Mplus analyses will use multiple processors. Set the Mplus processors option and the PBS ncpus specification to equal values.
For other programs, see Analytical Software Manuals.
Make the Script Executable
You should make your shell script executable and test it before you submit it to the queue. At the prompt, type in:
chmod u+x myprog.txt
You can change permissions on multiple shell scripts located in the same directory at once:
chmod u+x *.txt
Test your script by running it at the prompt (you can abort it by typing <Ctrl>-C).
Remember to clean up any unwanted files your script may have created when you tested it.
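The steps above can be tried out safely with a stand-in script. This is a sketch under stated assumptions: the scratch directory and the one-line script body are placeholders so the example runs anywhere, and ./myprog.txt is the usual way to run an executable file in the current directory.

```shell
#!/bin/bash
# Sketch: make a script executable, check its permissions, and run it.
# The scratch directory and the stand-in script body are assumptions.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '#!/bin/bash\necho "script ran"\n' > myprog.txt

chmod u+x myprog.txt      # give the file's owner execute permission
ls -l myprog.txt          # the permissions column now shows an x for the owner
./myprog.txt              # run the script from the current directory
```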
Submit Your Job to the Queue
The qsub command sends your jobs for execution on cluster nodes:
qsub -m abe -N jobname myprog.txt
-N jobname specifies the name of your job that will show in the queue. You can use any name up to 15 characters long, without spaces, starting with a letter. You can omit the -N option and your job will have the same name as your shell script.
The letters following -m specify what email updates you will receive about your job:
- n: no mail.
- a: mail is sent when the job is aborted by the batch system.
- b: mail is sent when the job begins execution.
- e: mail is sent when the job ends.
If the -m option is not specified, mail will be sent only if the job is aborted (same as -m a).
The last parameter (do not omit it) is the name of your shell script file.
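If you submit jobs often, a small wrapper can check the job name length before calling qsub. The sketch below is an illustration only: submit_job and the DRY_RUN variable are hypothetical names, not part of PBS. With DRY_RUN=1 the function prints the command instead of running it, which lets you try it away from the cluster.

```shell
#!/bin/bash
# Sketch: build the qsub command line described above and either print it
# (dry run) or run it. submit_job and DRY_RUN are hypothetical names.
submit_job() {
    local jobname=$1 script=$2
    if [ "${#jobname}" -gt 15 ]; then
        echo "job name '$jobname' is longer than 15 characters" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "qsub -m abe -N $jobname $script"
    else
        qsub -m abe -N "$jobname" "$script"
    fi
}

# Print the command that would be run, without submitting anything.
DRY_RUN=1 submit_job simul007 myprog.txt
```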
Check the Status of Your Jobs
To check the status of your jobs, type qstat -u $USER at the prompt. It will display the status of your jobs in the job queue:
[abc123@seldon abc123]$ qstat -u $USER
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
5002.seldon      Job52            abc123            55:15:0 R A
5031.seldon      simul007         abc123            18:45:4 R A
5068.seldon      331              abc123                  0 Q A
The Name column lists job names assigned in the qsub command. Give different names to your jobs to distinguish them from each other in the job queue.
The S column indicates the job state:
- A - Array job has at least one subjob running.
- E - Job is exiting after having run.
- H - Job is held.
- Q - Job is queued, eligible to run.
- R - Job is running.
- S - Job is suspended.
- T - Job is being moved to new location.
- W - Job is waiting for its execution time (-a option) to be reached.
- X - Subjob has completed execution or has been deleted.
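When you have many jobs in the queue, a tally by state letter is easier to read than the full listing. The following sketch counts jobs per state with awk. It assumes the column layout of the sample listing shown earlier (two header lines, state in the second-to-last column); count_states is a hypothetical helper, and the sample text stands in for real qstat output.

```shell
#!/bin/bash
# Sketch: tally jobs by state letter from qstat-style output read on stdin.
# count_states is a hypothetical helper; the layout is assumed to match the
# sample listing (two header lines, state in the second-to-last field).
count_states() {
    awk 'NR > 2 && NF >= 2 { n[$(NF-1)]++ }
         END { for (s in n) print s, n[s] }' | sort
}

count_states <<'EOF'
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
5002.seldon      Job52            abc123            55:15:0 R A
5031.seldon      simul007         abc123            18:45:4 R A
5068.seldon      331              abc123                  0 Q A
EOF
```

For the sample listing this prints one line per state: one queued job (Q) and two running jobs (R).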
pbstat condenses and interprets the output of qstat -f to display more readable information about PBS jobs. With no arguments, pbstat will display information only about your own jobs. You can specify another username or the keyword all as a command argument to display PBS jobs owned by other users. Pay particular attention to "CPU utilization," "Elapsed CPU time" and "Memory usage."
[abc123@seldon abc123]$ pbstat
PBS Job ID number : 7364.seldon
Job owner : email@example.com
Job name : stage1.run
Job started on : Tue Oct 23 08:16:08 2006
Job status : Running
Mail Points : a
PBS queue and server : A on seldon
Job is running on : node21:mem=524228kb;ncpus=1
# of CPUs being used : 1
CPU utilization : 98% (ideal max is 100%)
Elapsed walltime : 08:13:35 (max is 672:00:00)
Elapsed CPU time : 08:13:00 (max is N/A)
Memory usage : 105.5 MB
VMemory usage : 818.1 MB
Delete a Job
If you need to remove your job from the queue before it starts, or if you want to terminate an already running batch job, use the qdel command with the job's job_id. The job_id is the number listed in the first column of the qstat output.
You can delete ALL of your jobs with a more complicated command:
qdel `qselect -u $USER`
This command runs the qselect command between the backticks (`) first, and substitutes its output into the qdel line before running qdel. The qselect command lists the job numbers of all of the jobs belonging to your username.
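A slightly more defensive version of the same idea calls qdel only when qselect actually returned some job numbers. This is a sketch, not a site-provided tool: delete_all_jobs is a hypothetical function, and QSELECT and QDEL are hypothetical override variables that exist only so the example can be tried without PBS installed. Left unset, the real qselect and qdel commands are used.

```shell
#!/bin/bash
# Sketch: delete all of your jobs, skipping qdel when there is nothing to delete.
# delete_all_jobs, QSELECT and QDEL are hypothetical names for this example.
delete_all_jobs() {
    local ids
    ids=$("${QSELECT:-qselect}" -u "${USER:-$(id -un)}") || return 1
    if [ -z "$ids" ]; then
        echo "no jobs to delete"
        return 0
    fi
    # Left unquoted on purpose so each job number becomes a separate argument.
    "${QDEL:-qdel}" $ids
}

# Stand-ins for qselect and qdel so the example can run off the cluster.
fake_qselect() { printf '5002.seldon\n5031.seldon\n'; }
fake_qdel() { echo "would run: qdel $*"; }

QSELECT=fake_qselect QDEL=fake_qdel delete_all_jobs
```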
Where is the Output?
When a job is finished you will see a new file in the directory from which you typed the qsub command. It has a .oXXXX extension (where XXXX is the job_id) and contains the standard output of the program. That file may or may not contain error messages -- it all depends on the design of the application program. It's best to join standard output and standard error as specified below.
You may also see a similar file with the .eXXXX extension, which contains the standard error output of your job.
You should use the #PBS -j oe directive to join standard output and standard error for your job. That's explained in Create a Shell Script for Your Job, above.
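To review results quickly, you can print the last few lines of every job output file in a directory. The sketch below is a minimal illustration. The function name show_job_output is hypothetical, and the demonstration file name simul007.o5031 is an assumption (job name followed by the .oXXXX extension described above).

```shell
#!/bin/bash
# Sketch: show the last lines of each job output file (names ending in .o
# followed by digits) in a directory. show_job_output is a hypothetical helper.
show_job_output() {
    local dir=${1:-.} f found=0
    for f in "$dir"/*.o[0-9]*; do
        [ -e "$f" ] || continue      # the glob matched nothing
        found=1
        echo "== $f"
        tail -n 5 "$f"
    done
    [ "$found" = 1 ] || echo "no job output files in $dir"
}

# Demonstration with a made-up output file in a scratch directory.
demo=$(mktemp -d)
printf 'starting run\nrun finished\n' > "$demo/simul007.o5031"
show_job_output "$demo"
```

Called with no argument, the function looks in the current directory, which is where the output files arrive if you run it from the directory you submitted from.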
Only text output is automatically saved in the log file. If your program produces graphs you need to add instructions to your program to save those graphs to disk in a file. In MATLAB this is done by the
Last Updated: 24 January 2017