Creating a Single Batch Job

This page gives step-by-step instructions for submitting a batch job on the Social Sciences Computing Cluster. Work through it from top to bottom to create your first batch job. The section headings outline the process and make it easy to find information when you need to review it later.

Hardin and seldon (the interactive login nodes to which you connect from your own computer) are not the computational workhorses of the cluster. The main computing power of the cluster lies in the additional 392 processors that are available to programs submitted to the PBS Professional workload manager and job scheduler. Submitted jobs are placed in the job queue and start executing when the resources they need become available. These instructions show you how to use that power.

The job queue is like a valet:

You give it brief instructions (a shell script) telling it what program to run and how to run it.
It waits until the required resources are free.
It runs the program with exclusive access to those resources until it is finished.
It writes out errors and output from your programs.
It sends you e-mail notifications at start and finish, if you wish.

Up to 180 of your single-CPU jobs can execute at one time when resources are available. However, if your jobs will run for more than a few hours, please limit your use to 50 jobs at once to keep the system available for other users. Each job will have exclusive use of one processor and up to 2GB of memory unless you specify otherwise. These policy restrictions may change at any time.

Design your jobs to be monitored. Write output to files in your home directory (not standard output) so that you can check your results as they are written. For example, if your program is writing to the file results.lst, then, at your command prompt, type tail -f results.lst and you'll see the lines as they are created. Type <Ctrl>-C to return to the shell prompt. If your job is not running correctly, delete it. 

Design your jobs to be rerun. Until a job finishes, do not change the files it reads. Batch jobs may be rerun for a variety of reasons: priority decisions, node failure, or administrative maintenance.

The scheduler will preempt running jobs if more than 4 of your jobs are running at one time and another job becomes more deserving of execution, as determined by a fairshare algorithm. A preempted job is either suspended (temporarily stopped) or terminated (ended and put back into the input queue). Of the jobs eligible to be preempted, the one that has used the least wall clock time is selected.

Preempted jobs gain priority over time, and they will be put into execution again, possibly preempting other jobs.

You can improve the throughput of your jobs, and possibly avoid preemption, if you limit the maximum amount of wall clock time your job will use by specifying that limit when you submit the job. Be sure to set the limit high enough to allow for speed variations of different nodes. The scheduler will factor this limit into the fairshare decision-making process. If you do not specify a wall clock time limit, the scheduler will assume your job will run for one month.

Create a Shell Script for Your Job

A shell script is a set of ordered instructions that the batch node uses to find and run your program. It's a simple text file (usually with a .txt extension). You should create it with the evim editor on an interactive login system such as hardin, seldon or mule2.

To create a shell script called myprog.txt type evim myprog.txt at the $ prompt:

[abc123@seldon abc123]$ evim myprog.txt

When you are finished editing, exit evim saving the file.

To run one program you need an eight-line shell script like this (which runs MATLAB):

#!/bin/bash
#PBS -j oe
#PBS -l walltime=08:00:00
#PBS -l mem=2gb
#PBS -l ncpus=1

cd ~/myprograms
matlab -nosplash -nodisplay -nodesktop -r 'myprog;exit'

The first line #!/bin/bash specifies which shell program to use. It is mandatory and does not change. Be sure to type it correctly.

The second line #PBS -j oe tells PBS to join standard output and standard error together in the output file that is delivered back to the submitting directory. Note that there is a space between the join option (-j) and its values (oe).

The resource list directive #PBS -l is specified with the lower-case letter L, not the numeral one.

The third line #PBS -l walltime=08:00:00 sets the wall clock time limit for your job, that is, the maximum amount of wall clock time the job may take. Wall clock time is specified as hours:minutes:seconds, so one hour is written as 1:00:00. Typical jobs run at 98% CPU utilization, which means that the CPU time for the job will be 98% of the wall clock time used to finish the job. Jobs that involve a lot of disk input/output will be less efficient, and you should raise your wall clock limit accordingly.

Choosing an execution time limit is a matter of experience. Start with a high estimate. If your job exceeds the execution time limit, it will be terminated by the batch system and an error message will be written to standard error explaining the reason. See Where is the Output?
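
As a rough illustration of starting with a high estimate, the sketch below pads a run-time estimate and prints it in hours:minutes:seconds form for the walltime directive. The function name walltime_limit and the default 50% safety margin are assumptions made for this example, not SSCC recommendations.

```shell
# Hypothetical helper: turn an estimated run time in minutes into a padded
# hours:minutes:seconds string for "#PBS -l walltime=".
# The default 50% margin is an assumption for illustration only.
walltime_limit() (
    est_minutes=$1
    margin_pct=${2:-50}
    # Integer arithmetic: add the margin, rounding up to a whole minute
    total=$(( (est_minutes * (100 + margin_pct) + 99) / 100 ))
    printf '%02d:%02d:00\n' $(( total / 60 )) $(( total % 60 ))
)

walltime_limit 320      # estimate of 5 hours 20 minutes, default margin
```

With an estimate of 320 minutes this prints 08:00:00, the limit used in the example script. Pass a second argument to choose a different margin, for example walltime_limit 100 25.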

The fourth line #PBS -l mem=2gb sets the memory limit for your job, that is, the maximum amount of memory the job can use. Units are kb, mb and gb. Most SSCC batch jobs run in 2gb or less of memory. If you request more memory than your application actually uses, your memory utilization will be low (see pbstat below), and you will in essence be denying others the use of that memory because PBS packs multiple jobs together on any single node. The maximum you can request is 64gb (see hardware specifications).

The fifth line #PBS -l ncpus=1 tells PBS that your job requires one CPU. All the CPUs you request will be on the same node, so the maximum you can specify is 16 (see hardware specifications). Use this directive only if your application is programmed to use multiple CPUs (e.g. in R, Stata/MP, Mplus, or MATLAB's Parallel Computing Toolbox). Otherwise, just omit the entire line.

If you specify more CPUs than your application can use, your CPU utilization will be low (see pbstat below), and you will in essence be denying others the use of those CPUs. Remember that most applications only use one CPU, unless you do something very specific to use multiple CPUs.

If you fail to specify enough CPUs, your CPU utilization will be high (see pbstat below), and you will be slowing down all jobs running on your execution node. Your job will be subject to termination in this situation. Each batch node has a job monitor that kills jobs that exceed their allocated resources (number of CPUs and memory are most commonly the issues).

The sixth line of the example script is blank. It makes the script more readable and it signals the end of the #PBS directive prologue section. Any #PBS directive that follows that blank line will be ignored by the PBS batch system.
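
Because a misplaced directive is silently ignored, it can help to check a script before submitting it. The following is a minimal sketch that applies the rule described above; the function name ignored_directives is an assumption made for this example.

```shell
# Hypothetical check: print any #PBS line that follows the first blank line,
# which (per the rule described above) the batch system would ignore.
ignored_directives() {
    awk '
        seen_blank && /^#PBS/ { print NR ": " $0 }   # directive after the prologue
        /^[ \t]*$/            { seen_blank = 1 }     # first blank line ends the prologue
    ' "$1"
}

# Demonstration on a throwaway file with one misplaced directive
demo=$(mktemp)
printf '%s\n' '#!/bin/bash' '#PBS -j oe' '' '#PBS -l mem=2gb' 'cd ~/myprograms' > "$demo"
ignored_directives "$demo"
rm -f "$demo"
```

The demonstration prints the line number and text of the misplaced memory directive. An empty result means every #PBS line sits in the prologue.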

The seventh line cd ~/myprograms tells the cluster node to change your working directory to ~/myprograms (all the nodes use the same storage system for your home directory). The tilde sign ~ is a shortcut to your home directory. If you omit this line, your batch job will begin execution in the directory that it was submitted from.

The last line tells MATLAB to start in batch mode and run myprog.m located in your myprograms directory. The name myprograms is just an example, and has no special meaning.
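
If you prepare scripts for several programs, the same layout can be produced from a small template. This is only a sketch: the function name make_job_script and its argument order are assumptions made for this example, and the R application line is used as the sample command.

```shell
# Hypothetical generator for a job script with the layout explained above.
# $1 = walltime, $2 = memory, $3 = CPUs, $4 = working directory,
# $5 = application line
make_job_script() {
    cat <<EOF
#!/bin/bash
#PBS -j oe
#PBS -l walltime=$1
#PBS -l mem=$2
#PBS -l ncpus=$3

cd $4
$5
EOF
}

# Print a script for an R job; redirect to a file to save it
make_job_script 08:00:00 2gb 1 '~/myprograms' 'R CMD BATCH myprog.R myprog.log'
```

The directives stay above the blank line and the commands below it, so the generated text follows the same structure as the hand-written example.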

Commands for running in batch mode differ among program applications. Find yours in the list below.

Application Line: MATLAB

matlab -nosplash -nodisplay -nodesktop -r 'commands;exit'

Runs MATLAB commands or your own M-files from the working directory. Separate multiple commands with commas or semicolons (;). Do not include the pathname or a file extension (.m) to run an M-file. Put quotes around your list of commands. Be sure to end with an exit command. Do not put spaces between the commands in the list.

You can pass parameters to your M-file using this syntax, for example:

matlab -nosplash -nodisplay -nodesktop -r 'myprog(3.8, 0.2, 2.5);exit'

If you are submitting multiple jobs which execute the same MATLAB program with different parameters, you need a way to distinguish the output files. You can do this simply by printing the parameter values in the beginning of your MATLAB code. You can also redirect MATLAB output into a log file with a name that contains the parameters. This command using the greater-than sign

matlab -nosplash -nodisplay -nodesktop -r 'myprog(3.8, 0.2, 2.5);exit' > myprog_3.8_0.2_2.5.log

will save MATLAB output in a file called myprog_3.8_0.2_2.5.log.
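
When you prepare several such jobs, typing the parameters twice (once in the call and once in the log name) invites mistakes. The sketch below builds both from one list of values; the function name matlab_line is an assumption made for this example.

```shell
# Hypothetical helper: build the MATLAB application line and a matching
# log file name from a program name and its parameters.
matlab_line() (
    prog=$1
    shift
    args=$(IFS=,; printf '%s' "$*")    # parameters joined with commas
    tag=$(IFS=_; printf '%s' "$*")     # parameters joined with underscores
    printf '%s\n' "matlab -nosplash -nodisplay -nodesktop -r '${prog}(${args});exit' > ${prog}_${tag}.log"
)

matlab_line myprog 3.8 0.2 2.5
```

The printed line can be pasted in as the last line of a job script. It calls myprog with the three values and writes to myprog_3.8_0.2_2.5.log.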

An alternative MATLAB command form is

matlab -nosplash -nodisplay -nodesktop < myprog.m

The left arrow (actually the less-than sign) feeds myprog.m into MATLAB line by line, as if you were typing it in. You cannot pass parameters to your program using this syntax.

You can also redirect the output to a special log file by adding "> filename.log" to the command:

matlab -nosplash -nodisplay -nodesktop < myprog.m > filename.log

Application Line: Stata

stata < myprog.do > myprog.log

Stata will read its commands from myprog.do and write its output to myprog.log in the working directory.

If given ncpus=1, Stata/SE will be run. If ncpus=4, Stata/MP will be run. Do not specify more than 4 ncpus.

Application Line: SAS

sas myprog.sas

SAS will write a log file named myprog.log showing the commands executed and any errors, and an output file named myprog.lst with the statistical results.

Application Line: R

R CMD BATCH myprog.R myprog.log

R will write the log file myprog.log.

Application Line: Ox

oxl finance.ox > finance.lst

or

oxl finance.oxo > finance.lst

Ox writes its results to standard output, which in this case is redirected to the file named finance.lst.

Application Line: Mathematica

module load mathematica; math -script input.txt > output.txt

Note that the mathematica module must be loaded first in this compound application line. The semicolon separator is needed to put both commands on a single line; alternatively, you could put the two commands on separate lines. The -script command line option specifies that the Mathematica kernel will be run in batch mode. Commands are processed in order from the file input.txt. Default line wrapping is turned off and no In[] and Out[] labels are printed. All output goes to the file specified as output.txt.

Application Line: Mplus

mplus inputfile.inp outputfile.out >> outputfile.out

Mplus reads its commands from inputfile.inp (the first filename specified) and writes results to outputfile.out (the second specified filename). Because Mplus always writes job statistics to standard output, it's good to combine those statistics with your output using the specification >> outputfile.out (with a double greater-than sign) at the end of the command.

Some (but not all) Mplus analyses will use multiple processors. Set the Mplus processors option and the PBS ncpus specification to equal values.

For other programs, see Analytical Software Manuals.

Make the Script Executable

You should make your shell script executable and test it before you submit it to the queue. At the prompt, type in:

chmod u+x myprog.txt

You can change permissions on multiple shell scripts located in the same directory at once:

chmod u+x *.txt

Test your script by running it (you can abort it by typing <Ctrl>-C):

./myprog.txt

Remember to clean up any unwanted files your script may have created when you tested it.
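
A quick check that does not run your program can catch the most common mistakes first. The sketch below is one possible check, not an SSCC tool: the function name check_script is an assumption, and it relies on bash -n, which parses a script for syntax errors without executing it.

```shell
# Hypothetical pre-submission check for a job script.
check_script() {
    # Line 1 must be the mandatory shell line
    [ "$(head -n 1 "$1")" = '#!/bin/bash' ] || { echo "missing #!/bin/bash on line 1"; return 1; }
    # The script must be executable by its owner
    [ -x "$1" ] || { echo "not executable: run chmod u+x $1"; return 1; }
    # Parse the script without running any of its commands
    bash -n "$1" || { echo "shell syntax error"; return 1; }
    echo "ok"
}
```

For example, check_script myprog.txt prints ok when all three checks pass. It does not replace a real test run, because it cannot tell whether your application line is correct.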

Submit Your Job to the Queue

The qsub command sends your jobs for execution on cluster nodes:

qsub -m abe -N jobname myprog.txt

-N jobname specifies the name of your job as it will appear in the queue. The name can be up to 15 characters long, must start with a letter, and cannot contain spaces. If you omit the -N option, your job will have the same name as your shell script.

The letters following -m specify what email updates you will receive about your job:

n: no mail.
a: mail is sent when the job is aborted by the batch system.
b: mail is sent when the job begins execution.
e: mail is sent when the job ends.

If the -m option is not specified, mail will be sent only if the job is aborted (same as -m a).

The last parameter (do not omit it) is the name of your shell script file.
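
If you build job names in a script, you may want to check them against the naming rules before calling qsub. The following is a minimal sketch; the function name valid_jobname is an assumption made for this example, and it tests only the three rules stated above.

```shell
# Hypothetical check of the job-name rules stated above: at most 15
# characters, starting with a letter, no spaces.
valid_jobname() {
    case $1 in
        [A-Za-z]*) ;;                 # starts with a letter
        *) return 1 ;;
    esac
    case $1 in
        *[[:space:]]*) return 1 ;;    # contains white space
    esac
    [ "${#1}" -le 15 ]                # length limit
}

name=simul007
if valid_jobname "$name"; then
    # Print the command; remove "echo" to actually submit the job
    echo "qsub -m abe -N $name myprog.txt"
fi
```

The example prints the qsub command rather than running it, so you can try the check on a machine without PBS.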

Check the Status of Your Jobs

To check the status of your jobs, type in qstat -u $USER at the prompt. It will display the status of your jobs in the job queue:

[abc123@seldon abc123]$ qstat -u $USER
Job id               Name              User        Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
5002.seldon          Job52            abc123        55:15:0 R A
5031.seldon          simul007         abc123        18:45:4 R A
...
5068.seldon          331              abc123        0       Q A

The Name column lists job names assigned in the qsub command. Give different names to your jobs to distinguish them from each other in the job queue.

The S column indicates the job state:

A - Array job has at least one subjob running.
E - Job is exiting after having run.
H - Job is held.
Q - Job is queued, eligible to run.
R - Job is running.
S - Job is suspended.
T - Job is being moved to new location.
W - Job is waiting for its execution time (-a option) to be reached.
X - Subjob has completed execution or has been deleted.
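
When you have many jobs, a count per state is easier to read than the full listing. The sketch below assumes the six-column layout of the sample output above, with the state in the fifth column; the function name count_states is an assumption, and the column number may need adjusting if your qstat output is laid out differently.

```shell
# Hypothetical summary: count jobs per state from qstat output on stdin.
# Only lines whose first field looks like a job id (digits, then a dot)
# are counted, so header and separator lines are skipped.
count_states() {
    awk '$1 ~ /^[0-9]+\./ { n[$5]++ }
         END { for (s in n) print s, n[s] }' | sort
}
```

You would use it as qstat -u $USER | count_states, which prints one line per state letter followed by the number of jobs in that state.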

pbstat condenses and interprets the output of qstat -f to display more readable information about PBS jobs. With no arguments, pbstat will display information only about your own jobs. You can specify another username or the keyword all as a command argument to display PBS jobs owned by other users. Pay particular attention to "CPU utilization," "Elapsed CPU time" and "Memory usage."

[abc123@seldon abc123]$ pbstat
------------------------------------------------------
PBS Job ID number      : 7364.seldon
Job owner              : abc123@seldon.it.northwestern.edu
Job name               : stage1.run
Job started on         : Tue Oct 23 08:16:08 2006
Job status             : Running
Mail Points            : a
PBS queue and server   : A on seldon
Job is running on      : node21:mem=524228kb;ncpus=1
# of CPUs being used   : 1
CPU utilization        : 98% (ideal max is 100%)
Elapsed walltime       : 08:13:35 (max is 672:00:00)
Elapsed CPU time       : 08:13:00 (max is N/A)
Memory usage           : 105.5 MB
VMemory usage          : 818.1 MB

Delete a Job

If you need to remove your job from the queue before it starts, or if you want to terminate an already running batch job, type in:

qdel job_id

job_id is the number listed in the first column of qstat output.

You can delete ALL of your jobs with a more complicated command:

qdel `qselect -u $USER`

This command runs the qselect command between the backticks (`) first, and substitutes its output into the qdel line before running qdel. The qselect command lists the job numbers of all of the jobs belonging to your username.
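
The same substitution can be written with $(...) in place of the backticks, which is easier to read. The sketch below wraps it in a function with a guard so that qdel is not called when you have no jobs; the function name delete_all_my_jobs is an assumption made for this example.

```shell
# Sketch of the same deletion using $(...) instead of backticks.
# The function is only defined here; nothing is deleted until you call it.
delete_all_my_jobs() {
    jobs_to_delete=$(qselect -u "$USER")
    if [ -z "$jobs_to_delete" ]; then
        echo "no jobs to delete"
        return 0
    fi
    # Unquoted on purpose: each job id becomes a separate qdel argument
    qdel $jobs_to_delete
}
```

Typing delete_all_my_jobs at the prompt then has the same effect as the one-line command above.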

Where is the Output?

When a job is finished you will see a new file in the directory from which you typed in the qsub command. It has a .oXXXX extension (where XXXX is the job_id) and contains the standard output of the program. That file may or may not contain error messages -- it all depends on the design of the application program. It's best to join standard output and standard error as specified below.

You may also see a similar file with the .eXXXX extension, which contains the standard error output of your job.

You should use the #PBS -j oe directive to join standard output and standard error for your job. That's explained in Create a Shell Script for Your Job, above.
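
To find the output file of a particular job, you can derive its name from the job name and the job_id. This sketch assumes the common PBS default of the job name followed by .o and the numeric part of the job_id; this page states only the .oXXXX extension, so treat the base name as an assumption. The function name output_file_for is also hypothetical.

```shell
# Hypothetical helper: predict the output file name for a job.
# $1 = job name, $2 = job_id as shown by qstat (for example 5031.seldon)
output_file_for() {
    # ${2%%.*} removes everything from the first dot onward
    printf '%s.o%s\n' "$1" "${2%%.*}"
}

output_file_for simul007 5031.seldon
```

For the job simul007 with job_id 5031.seldon this prints simul007.o5031.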

Only text output is automatically saved in the log file. If your program produces graphs you need to add instructions to your program to save those graphs to disk in a file. In MATLAB this is done by the saveas function.

Last Updated: 10 April 2017
