How to Create a Single Batch Job
- Create a Shell Script for Your Job
- Make the Script Executable
- Submit Your Job to the Queue
- Check the Status of Your Jobs
- Delete a Job
- Where is the Output?
- Submit Multiple Jobs
Hardin and seldon (the interactive computers to which you connect using SSH) are not the computational workhorses of the cluster. The main computing power of the cluster lies in the additional 100 processors which are available to programs submitted to the batch queues. These instructions will show you how your programs can use that power.
The job queue is like a valet:
- You give it brief instructions (a shell script) telling it what program to run.
- It waits until one of the batch cluster processors is free,
- runs the program on that processor until it is finished,
- writes out errors and output from your program,
- sends you email notifications at start and finish, if you wish.
Up to 20 of your jobs can be executing at a time, when resources are available. Each job will have exclusive use of one processor and up to 512MB of memory unless you specify otherwise.
Design your jobs to be rerun. Do not change the files your job will be using before the job finishes. Batch jobs may be rerun for a variety of reasons -- priority decisions, node failure, or administrative maintenance.
If more than 4 of your jobs are running at one time and another job becomes more deserving of execution, as determined by a fairshare algorithm, the scheduler will preempt one of your jobs in an execution queue. A preempted job is either suspended (temporarily stopped from execution) or terminated (ended and put back into the input queue). Of the jobs eligible to be preempted, the job that has used the least CPU time will be selected.
Preempted jobs gain priority over time, and they will be put into execution again, possibly preempting others' jobs.
You can improve the throughput of your jobs, and possibly avoid preemption, if you limit the maximum amount of CPU time your job will use by specifying that limit when you submit the job. The scheduler will factor this limit into the fairshare decision-making process. If you do not specify a CPU time limit and a wall clock time limit, the scheduler will assume your job will run for one month.
Create a Shell Script for Your Job
A shell script is a set of instructions that the cluster node needs to find and run your program. It's a simple text file (usually with a .txt extension). You can create it with the nano editor on an interactive system such as hardin, seldon or mule2.
To create a shell script called myprog.txt, type nano myprog.txt at the $ prompt:
[abc123@seldon abc123]$ nano myprog.txt
When you are finished editing, press <Ctrl>-X to exit the nano editor and Y to save the changes.
To run one program you need a nine-line shell script like this:
#!/bin/bash
#PBS -j oe
#PBS -l cput=08:00:00
#PBS -l walltime=08:00:00
#PBS -l mem=512mb
#PBS -l ncpus=1

cd ~/myprograms
matlab -r 'myprog'
The first line #!/bin/bash specifies which shell program runs the script. It is mandatory and does not change.
The second line #PBS -j oe tells PBS to join standard output and standard error together in the output file that is delivered back to the submitting directory. Note that there is a space between the join option ( -j ) and its values ( oe ).
The third line #PBS -l cput=08:00:00 tells PBS the maximum amount of CPU time the job will take. CPU time is specified in hours:minutes:seconds , so one hour would be written as "1:00:00". This PBS directive sets the CPU time limit (-l cput=) for your job. Ninety-five percent (95%) of SSCC batch jobs finish in less than 8 CPU hours.
The directive " #PBS -l " is specified with the lower-case letter L, not the numeral one.
The fourth line #PBS -l walltime=08:00:00 tells PBS the maximum amount of wall clock time the job will take. Wall clock time is specified in hours:minutes:seconds , so one hour would be written as "1:00:00". This PBS directive sets the wall clock time limit (-l walltime=) for your job. Typical jobs run at 98% CPU utilization, which means that the CPU time for the job will be 98% of the wall clock time used to finish the job. Jobs that involve a lot of disk input/output will be less efficient, and you should raise your wall clock limit accordingly.
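As a rough illustration of the relationship between the two limits, the sketch below converts seconds to the hours:minutes:seconds form and estimates a wall clock limit from a CPU time limit and a utilization percentage. The function names to_hms and walltime_for are hypothetical helpers, not part of PBS, and the 98% figure is simply the typical value quoted above; a job with heavy disk input/output would need a lower percentage.

```shell
#!/bin/sh
# Convert a number of seconds to the HH:MM:SS form used by #PBS -l.
# (to_hms is a hypothetical helper name, not a PBS command.)
to_hms() {
    printf '%02d:%02d:%02d\n' $(( $1 / 3600 )) $(( ($1 % 3600) / 60 )) $(( $1 % 60 ))
}

# Estimate the wall clock seconds needed for a given number of CPU seconds
# at a given utilization percentage, rounding up to a whole second.
walltime_for() {
    cpu_seconds=$1
    utilization=$2
    echo $(( (cpu_seconds * 100 + utilization - 1) / utilization ))
}

cput=$(( 8 * 3600 ))              # 8 CPU hours, as in the example script
wall=$(walltime_for "$cput" 98)   # assumes the typical 98% utilization

# Print the two directive lines that could be pasted into a job script.
echo "#PBS -l cput=$(to_hms "$cput")"
echo "#PBS -l walltime=$(to_hms "$wall")"
```

With these inputs the wall clock limit comes out slightly above the CPU limit (08:09:48 rather than 08:00:00), which leaves room for the 2% of time the job spends off the CPU.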
The fifth line #PBS -l mem=512mb tells PBS the maximum amount of memory the job can use. Units are kb , mb and gb . This PBS directive sets the memory limit (-l mem=) for your job. Eighty-eight percent (88%) of SSCC batch jobs run in 512mb or less of memory. Ninety-three percent (93%) of SSCC batch jobs run in 1gb or less of memory.
The sixth line #PBS -l ncpus=1 tells PBS to reserve one CPU for your job. This PBS directive sets the number of CPUs (-l ncpus=) for your job.
The seventh line of the example script is blank. It makes the script more readable and it signals the end of the #PBS directive section. Any #PBS directive that follows that blank line will be ignored by the PBS batch system.
The eighth line cd ~/myprograms tells the cluster node to change working directory to ~/myprograms (all the nodes use the same storage system for your home directory). The tilde sign ~ is a shortcut to your home directory.
The last line tells MATLAB to start in batch mode and run myprog.m located in your myprograms directory. Commands for running in batch mode differ among program applications.
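If you would rather not open an editor, one possible way to create the same nine-line script is a here-document, sketched below. It writes the example script described above to myprog.txt in the current directory and then checks the line count and the blank seventh line. This is only a convenience; the file contents are exactly the example from this page.

```shell
#!/bin/sh
# Write the example job script to myprog.txt using a here-document.
# Quoting 'EOF' stops the shell from expanding ~ or anything else in the text.
cat > myprog.txt <<'EOF'
#!/bin/bash
#PBS -j oe
#PBS -l cput=08:00:00
#PBS -l walltime=08:00:00
#PBS -l mem=512mb
#PBS -l ncpus=1

cd ~/myprograms
matlab -r 'myprog'
EOF

# Sanity checks: report the number of lines, and whether line 7 is blank.
echo "lines: $(wc -l < myprog.txt | tr -d ' ')"
if [ -z "$(sed -n '7p' myprog.txt)" ]; then
    echo "line 7 is blank, so the #PBS section ends there"
fi
```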
Application Line: MATLAB
matlab -nodisplay -r 'commands'
Runs MATLAB commands or your own M-files from the working directory. Separate multiple commands with commas or semicolons (;). Do not include the pathname or a file extension (.m) to run an M-file. Put quotes around your list of commands.
You can pass parameters to your M-file using this syntax, for example:
matlab -nodisplay -r 'myprog(3.8, 0.2, 2.5)'
If you are submitting multiple jobs which execute the same MATLAB program with different parameters, you need a way to distinguish the output files. You can do this simply by printing the parameter values in the beginning of your MATLAB code. You can also redirect MATLAB output into a log file with a name that contains the parameters. This command
matlab -r 'myprog(3.8, 0.2, 2.5)' > myprog_3.8_0.2_2.5.log
will save MATLAB output in a file called myprog_3.8_0.2_2.5.log.
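The naming scheme above can be generated rather than typed by hand. The following is a minimal sketch, assuming three parameters held in shell variables a, b and c (hypothetical names); it builds the log file name and prints the MATLAB line that would go at the end of a job script.

```shell
#!/bin/sh
# Hypothetical parameter values; replace with the ones your M-file expects.
a=3.8
b=0.2
c=2.5

# Build a log file name that embeds the parameter values.
log="myprog_${a}_${b}_${c}.log"

# Assemble the MATLAB line for the job script and print it for inspection.
line="matlab -nodisplay -r 'myprog(${a}, ${b}, ${c})' > ${log}"
echo "$line"
```

Printing the line first, instead of running it, lets you check the quoting before the text is placed in a script.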
An alternative MATLAB command is:
matlab -nodisplay < myprog.m
The < operator redirects standard input: it feeds myprog.m into MATLAB line by line, as if you were typing it in. You cannot pass parameters to your program using this syntax.
You can also redirect the output to a special log file by adding ">filename.log" to the command:
matlab -nodisplay < myprog.m > filename.log
Application Line: Stata
stata -b do myprog.do
Stata will write its output to myprog.log in the working directory.
Application Line: SAS
sas myprog.sas
SAS will write a log file named myprog.log and an output file with results named myprog.lst.
Application Line: R
R CMD BATCH myprog.R myprog.log
or
R --no-save < myprog.R > myprog.log
R will write a log file myprog.log.
Application Line: Splus
Splus BATCH input_file output_file
Splus reads the program from input_file and writes the results to output_file.
Application Line: GAUSS
gauss -b finance.e > finance.lst
GAUSS writes its results to standard output, which in this case is redirected to the file named finance.lst.
Application Line: Ox
oxl finance.ox > finance.lst
or
oxl finance.oxo > finance.lst
Ox writes its results to standard output, which in this case is redirected to the file named finance.lst.
Application Line: Singular
Singular -t -q < adjoint.sing > adjoint.lst
Singular writes its results to standard output, which in this case is redirected to the file named adjoint.lst
For other programs, see Statistical Software Manuals.
Make the Script Executable
You should make your shell script executable and test it before you submit it to the queue. At the prompt, type in:
chmod u+x myprog.txt
You can change permissions on multiple shell scripts located in the same directory at once:
chmod u+x *.txt
Test your script by running it (you can abort it by typing <Ctrl>-C):
./myprog.txt
Remember to clean up any unwanted files your script may have created when you tested it.
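To see what chmod u+x changes, the sketch below creates a throwaway script in a temporary subdirectory of the current directory, checks the executable bit before and after chmod, runs the script, and cleans up. The directory and file names are illustrative only, and the throwaway script uses #!/bin/sh purely to keep the demonstration portable.

```shell
#!/bin/sh
# Create a throwaway script in a temporary directory (names are illustrative).
workdir=$(mktemp -d ./chmod_demo.XXXXXX)
script="$workdir/hello.txt"
printf '#!/bin/sh\necho "script ran"\n' > "$script"

# Record whether the file is executable before and after chmod u+x.
if [ -x "$script" ]; then before=yes; else before=no; fi
chmod u+x "$script"
if [ -x "$script" ]; then after=yes; else after=no; fi

# Once executable, the script can be run by path, as in ./myprog.txt above.
result=$("$script")
echo "executable before chmod: $before, after chmod: $after"
echo "output: $result"

# Clean up the test files, as recommended above.
rm -r "$workdir"
```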
Submit Your Job to the Queue
The qsub command sends your jobs for execution on cluster nodes:
qsub -m abe -N jobname myprog.txt
-N jobname specifies the name of your job that will show in the queue. You can use any name up to 15 characters long that starts with a letter and contains no spaces. If you omit the -N option, your job will have the same name as your shell script.
The letters following -m specify what email updates you will receive about your job:
- n: no mail.
- a: mail is sent when the job is aborted by the batch system.
- b: mail is sent when the job begins execution.
- e: mail is sent when the job ends.
If the -m option is not specified, mail will be sent only if the job is aborted (same as -m a).
The last parameter (do not omit it) is the name of your shell script file.
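The naming rules for -N are easy to get wrong when job names are generated by a script. Below is a minimal sketch of a check for the rules stated above (at most 15 characters, starts with a letter, no spaces). The function name valid_jobname and the sample names are hypothetical. The qsub command is only printed, not executed.

```shell
#!/bin/sh
# Succeed if the name is 1-15 characters long, starts with a letter,
# and contains no whitespace. (valid_jobname is a hypothetical helper.)
valid_jobname() {
    case $1 in
        ''|[!A-Za-z]*) return 1 ;;   # empty, or does not start with a letter
        *[[:space:]]*) return 1 ;;   # contains whitespace
    esac
    [ "${#1}" -le 15 ]
}

# Try a few sample names and print the qsub command for the valid ones.
for name in sim_run_01 7thtry "my job" a_very_long_job_name; do
    if valid_jobname "$name"; then
        echo "ok:  qsub -m abe -N $name myprog.txt"
    else
        echo "bad: $name"
    fi
done
```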
Check the Status of Your Jobs
To check the status of your jobs, type in qstat at the prompt. It will display the status of the entire job queue:
[abc123@seldon abc123]$ qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
5002.seldon      Job52            abc123           55:15:0  R A
5031.seldon      simul007         def456           18:45:4  R A
...
5068.seldon      m331             xyz987           0        Q A
The Name column lists job names assigned in the qsub command. Give different names to your jobs to distinguish them from each other in the job queue.
The S column indicates the job state:
- E - Job is exiting after having run.
- H - Job is held.
- Q - Job is queued, eligible to run or be routed.
- R - Job is running.
- T - Job is being moved to a new location.
- W - Job is waiting for its execution time (-a option) to be reached.
- S - Job is suspended.
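When the queue is long, a per-state summary can be easier to read than the full listing. The sketch below counts jobs by the S column using awk. It runs on a copy of the sample listing from this page (with the "..." row left out) so that it works anywhere; on the cluster you would presumably pipe qstat into the function instead. The function name count_states is hypothetical.

```shell
#!/bin/sh
# Count jobs per state from qstat-style text on standard input.
# The first two lines (header and dashes) are skipped; column 5 is the state.
count_states() {
    awk 'NR > 2 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Sample input copied from the qstat listing above.
count_states <<'EOF'
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
5002.seldon      Job52            abc123           55:15:0  R A
5031.seldon      simul007         def456           18:45:4  R A
5068.seldon      m331             xyz987           0        Q A
EOF
```

For the sample input this prints one line for Q with a count of 1 and one line for R with a count of 2.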
pbstat condenses and interprets the output of qstat -f to display more readable information about PBS jobs. With no arguments, pbstat will display information only about your own jobs. You can specify another username or the keyword all as a command argument to display PBS jobs owned by other users.
[abc123@seldon abc123]$ pbstat
------------------------------------------------------
PBS Job ID number : 7364.seldon
Job owner : abc123@seldon.it.northwestern.edu
Job name : stage1.run
Job started on : Tue Oct 23 08:16:08 2006
Job status : Running
Mail Points : a
PBS queue and server : A on seldon
Job is running on : node21:mem=524228kb;ncpus=1
# of CPUs being used : 1
CPU utilization : 98% (ideal max is 100%)
Elapsed walltime : 08:13:35 (max is 672:00:00)
Elapsed CPU time : 08:13:00 (max is 672:00:00)
Memory usage : 105.5 MB
VMemory usage : 818.1 MB
Delete a Job
If you need to remove your job from the queue before it starts, or if you want to terminate an already running batch job, type in:
qdel job_id
job_id is the number listed in the first column of qstat output.
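If you need to remove several of your own jobs, typing each job_id by hand is error-prone. The following sketch prints, but does not run, one qdel command per job owned by a given user, reading qstat-style text from standard input. It uses the sample listing from this page as input; the function name qdel_commands is hypothetical. Review the printed commands before running any of them.

```shell
#!/bin/sh
# Print (do not run) a qdel command for every job owned by the given user.
# Column 1 of the qstat listing is the job id, column 3 is the user.
qdel_commands() {
    awk -v user="$1" 'NR > 2 && $3 == user { print "qdel " $1 }'
}

# Sample input copied from the qstat listing on this page.
qdel_commands abc123 <<'EOF'
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
5002.seldon      Job52            abc123           55:15:0  R A
5031.seldon      simul007         def456           18:45:4  R A
5068.seldon      m331             xyz987           0        Q A
EOF
```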
Where is the Output?
When a job is finished you will see a new file in the directory from which you typed in the qsub command. It has a .oXXXX extension and contains standard (text) output of the program.
You may also see a similar file with the .eXXXX extension, which contains the standard error output of your job. XXXX in both cases is the job_id.
You should use the #PBS -j oe directive to join standard output and standard error for your job. That's explained in Create a Shell Script for Your Job, above.
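To pick the output files out of a crowded directory, you can filter on the .oXXXX extension. The sketch below is one way to do that. It assumes that the part of the file name before the extension is the job name, which is the usual PBS default but is not stated on this page. The function name list_outputs and the demonstration files are hypothetical.

```shell
#!/bin/sh
# List files in a directory whose names end in .o followed by digits,
# the pattern described above for standard output files.
list_outputs() {
    ls "$1" | grep -E '\.o[0-9]+$'
}

# Demonstration with empty placeholder files in a temporary directory.
demo=$(mktemp -d ./output_demo.XXXXXX)
touch "$demo/stage1.run.o7364" "$demo/stage1.run.e7364" "$demo/notes.txt"

# Only the .oXXXX file is listed; the .eXXXX file and notes.txt are skipped.
list_outputs "$demo"

rm -r "$demo"
```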
Only text output is automatically saved in the log file. If your program produces graphs you need to add instructions to your program to save those graphs to disk in a file. In MATLAB this is done by the saveas function.
Last Updated: 09 March 2009

