
Examples of Jobs on Quest

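This page shows what several kinds of jobs look like on Quest, Northwestern’s high performance computing cluster: an interactive job, a batch job, a running job, a blocked job, an interactive job that will not start, and a job held because the allocation is nearly exhausted.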

Example: Interactive Job with Two Nodes

abc123@quser02:~ $ msub -I -l nodes=2:ppn=8
qsub: waiting for job 10000000.qsched02.quest.it.northwestern.edu to start
qsub: job 10000000.qsched02.quest.it.northwestern.edu ready
----------------------------------------
PBS: Begin PBS Prologue Wed Feb 26 12:13:42 CST 2014 1393438422
PBS: Job ID:            14162416.qsched02.quest.it.northwestern.edu
PBS: Username:          abc123
PBS: Group:       abc123
PBS: Executing queue:     short
PBS: Job name:          STDIN
PBS: Account:           p20001
----------------------------------------
   The following variables are not
   guaranteed to be the same in
   prologue and the job run script 
----------------------------------------
PBS: Temporary Dir($TMPDIR):  /tmp/14162416.qsched02.quest.it.northwestern.edu
PBS: Master Node($PBS_MSHOST):            qnode0302
PBS: node file($PBS_NODEFILE):  /hpc/opt/torque/nodes/qnode0302/aux//14162416.qsched02.quest.it.northwestern.edu
PBS: PATH (in prologue) : /bin:/usr/bin
PBS: WORKDIR ($PBS_O_WORKDIR) is:  /home/abc123
----------------------------------------


PBS: End PBS Prologue Wed Feb 26 12:13:42 CST 2014 1393438422

abc123@qnode0302:~ $


Example: Submit a Batch Job

First, look at the job script. Lines starting with #MSUB are "directives": instructions to the scheduler about how to manage your job. Lines starting with ## are ordinary comments.

[abc123@quser01 testjob]$ cat jobscript.pbs
#!/bin/bash
## The #! line above has to be the first line of the script. It says to
## interpret commands (except PBS directives) using the bash shell.
## Request two nodes using 8 cores per node with MSUB directive
#MSUB -l nodes=2:ppn=8
## Tell the scheduler how long your job will run. The scheduler will
## automatically put your job in the optimal queue.
#MSUB -l walltime=1:00:00
## Name my job "testjob"
#MSUB -N "testjob"
## Notify me if my job Aborts, when it Begins and when it Ends. Make
## sure you have your email address in your .forward file.
#MSUB -m abe

## Change the current working directory to the directory
## from which the script was submitted
cd $PBS_O_WORKDIR

## These are the commands to actually run the job
module load mpi
mpirun testjob | tee testjoboutput.txt

To submit the job, issue the command:

[abc123@quser01 testjob]$ msub jobscript.pbs
10000001
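
Job scripts pasted from a web page or word processor can pick up typographic dashes and quotes, and a #! line only takes effect when it is the first line of the file. The following is a minimal sketch of a pre-submission check for both problems. check_jobscript is a hypothetical helper written for illustration, not a Quest or Moab utility.

```shell
# check_jobscript FILE
# Hypothetical helper (not part of Moab or Quest): warns about two common
# job-script mistakes and returns non-zero if either one is found.
check_jobscript() {
    local script=$1 status=0

    # A #! interpreter line only counts if it is the very start of the file.
    if [ "$(head -c 2 "$script")" != '#!' ]; then
        echo "$script: first line is not a #! interpreter line"
        status=1
    fi

    # In the C locale, [:print:] and [:space:] cover ASCII only, so this
    # matches #MSUB lines that contain any other byte (such as an en dash).
    if LC_ALL=C grep -q '^#MSUB.*[^[:print:][:space:]]' "$script"; then
        echo "$script: non-ASCII character in an #MSUB directive"
        status=1
    fi

    return "$status"
}
```

With the function defined in your shell, run check_jobscript jobscript.pbs before msub jobscript.pbs. No output and a zero exit status mean neither problem was found.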


Example: Checking a Successfully Running Job

abc123@quser01:~ $ checkjob 10000001
job 10000001

AName: testjob
State: Running
Creds: user:abc123 group:p20001 class:short
WallTime: 0:02:39 of 1:00:00
SubmitTime: Wed Feb 26 12:14:50
(Time Queued Total: 00:00:50 Eligible: 00:00:00)

StartTime: Tue Dec 8 11:18:02
NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 16

Req[0] TaskCount: 16 Partition: quest1

Allocated Nodes:
[qnode0376:8][qnode0375:8]

Executable: /opt/moab/spool/moab.job.QCGqLA

StartCount: 1
Partition List: SHARED
Flags: GLOBALQUEUE
Attr: checkpoint
StartPriority: 1
Reservation '10000001' (-0:02:39 -> 1:00:00 Duration: 1:00:00)


Example: A Blocked Job (because it requested too many cores per node)


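This job is blocked because it requested 18 cores per node (TasksPerNode: 18), which is more than partition quest1 allows. The job stays Idle, and the final NOTE line of the checkjob output reports "partition ppn exceeded".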


abc123@quser01:$ checkjob -v Moab.1001
job Moab.1001

AName: testjob
State: Idle
Creds: user:abc123 group:abc123
WallTime: 00:00:00 of 99:23:59:59
SubmitTime: Wed Dec 9 12:13:55
(Time Queued Total: 00:01:16 Eligible: 00:00:31)

NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 36
Total Requested Nodes: 2

Req[0] TaskCount: 36 Partition: ALL
TasksPerNode: 18 NodeCount: 2

SystemID: Moab
SystemJID: Moab.1189

IWD: /home/abc123/
UMask: 0022
Executable: /home/abc123/testjob

Partition List: SHARED,quest1
SrcRM: torque
Flags: GLOBALQUEUE
StartPriority: 1
PE: 36.00
NOTE: job violates constraints for partition quest1 (partition ppn exceeded)
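
The TaskCount of 36 in the output above is nodes times cores per node (2 x 18), and the job is blocked because 18 cores per node exceeds what the partition allows. The sketch below does the same check before submitting. It rests on several assumptions: check_request is a hypothetical helper, the request string nodes=2:ppn=18 is inferred from the NodeCount and TasksPerNode fields, and the per-node limit of 8 is for illustration only (this page does not state the real limit).

```shell
# check_request REQUEST MAX_PPN
# Hypothetical helper: parses a request such as "nodes=2:ppn=18", prints the
# total task count, and fails if ppn exceeds the caller-supplied limit.
check_request() {
    local req=$1 max_ppn=$2 nodes ppn

    nodes=${req#nodes=}      # "2:ppn=18"
    nodes=${nodes%%:*}       # "2"
    ppn=${req##*ppn=}        # "18"

    echo "total tasks: $(( nodes * ppn ))"
    if [ "$ppn" -gt "$max_ppn" ]; then
        echo "ppn=$ppn exceeds the per-node limit of $max_ppn"
        return 1
    fi
}

# The limit of 8 is an assumed value, not a documented Quest setting.
check_request nodes=2:ppn=18 8 || echo "request would be blocked"
# prints:
#   total tasks: 36
#   ppn=18 exceeds the per-node limit of 8
#   request would be blocked
```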


Example: Why won't my interactive job start?

abc123@quser02:~ $ msub -I -l nodes=8:ppn=8
qsub: waiting for job 10000003.quser02 to start

Use the checkjob command to get information about your job's status. In a separate terminal:


[abc123@quser01 ~]$ checkjob 10000003
job 10000003

AName: STDIN
State: Idle
Creds:  user:abc123  group:abc123  account:p20001  class:short
WallTime:   00:00:00 of 4:00:00
SubmitTime: Wed Feb 26 13:56:29
(Time Queued  Total: 00:01:25  Eligible: 00:00:40)

TemplateSets:  DEFAULT
NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 64

Req[0]  TaskCount: 64  Partition: ALL

SystemID:   Moab
SystemJID:  Moab.1888979
Notification Events: JobFail

Partition List: [quest1][quest2][quest3][questgpu2][SHARED]
Flags:          SUSPENDABLE,INTERACTIVE
Attr:           INTERACTIVE,checkpoint
StartPriority:  256
available for 8 tasks     - qnode[0470,0452,0449,0439,0431,0397]
rejected for CPU          - (null)
rejected for State        - (null)
rejected for Reserved     - (null)
NOTE:  job req cannot run in partition quest1 (insufficient procs available: 39 < 64)

available for 8 tasks     - qnode[0515,0559,0745,0738,0727,0570,0722,0716]
rejected for CPU          - (null)
rejected for State        - (null)
rejected for Reserved     - (null)
NOTE:  job req cannot run in partition quest2 (insufficient procs available: 28 < 64)

rejected for CPU          - (null)
rejected for State        - (null)
rejected for Reserved     - (null)
NOTE:  job req cannot run in partition quest3 (available procs do not meet requirements : 0 of 64 procs found)
idle procs: 316  feasible procs:   0

Node Rejection Summary: [CPU: 59][State: 20][Reserved: 10]

NOTE:  job does not meet specific constraints for partition questgpu2 (NodeCount)

NOTE:  job violates constraints for partition quest (partition quest not in job partition mask)

NOTE:  job violates constraints for partition pim (partition pim not in job partition mask)

PROBLEM: checkjob reports that no partition has enough free cores for the request. The job asks for 64 processors, but quest1 has 39 available, quest2 has 28, and quest3 has none that meet the job's requirements. Quest is routinely a very busy cluster, so this is common for all kinds of jobs.

SOLUTION: Choose one of the following:

  1. Wait for more processors to free up.
  2. Resubmit the job with a smaller number of cores.
  3. Modify the queued job so that it requests fewer processors.

If you would rather have a job rejected than left waiting when it cannot start immediately, submit it with:

-l queuejob=false
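
The NOTE lines in the checkjob output above say how far short each partition falls. As a rough sketch, the filter below reduces them to one line per partition. summarize_shortfall is a hypothetical helper, and its pattern assumes the exact "insufficient procs available" wording shown on this page; other NOTE lines are ignored.

```shell
# summarize_shortfall
# Hypothetical helper: reads checkjob output on stdin and prints one line per
# partition that reported "insufficient procs available: N < M".
summarize_shortfall() {
    sed -n 's/^NOTE:  *job req cannot run in partition \([^ ]*\) (insufficient procs available: \([0-9]*\) < \([0-9]*\)) *$/\1: \2 available, \3 requested/p'
}

# Demonstration using two lines copied from the output above.
printf '%s\n' \
    'NOTE:  job req cannot run in partition quest1 (insufficient procs available: 39 < 64)' \
    'NOTE:  job req cannot run in partition quest2 (insufficient procs available: 28 < 64)' \
    | summarize_shortfall
# prints:
#   quest1: 39 available, 64 requested
#   quest2: 28 available, 64 requested
```

On a login node the same filter would be used as: checkjob 10000003 | summarize_shortfall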


Example: Running Out of Time

Users are likely to run into this problem as they approach the end of their granted allocation. The first clue is a job that sits in the "BatchHold" state. The scheduler places a BatchHold on a job when it determines that the job cannot run, typically because a resource is lacking.

blocked jobs 
JOBID        USERNAME   STATE      PROCS     WCLIMIT        QUEUETIME
10000004     abc123     BatchHold  2400      7:00:00:00     Sun May 11 14:52:56

As always, the first step is to use the checkjob command. The -v option requests verbose output. The output is long; the lines that matter here are Holds, BLOCK MSG, and the two Message lines at the end.

[abc123@quser02 ~]$ checkjob -v 10000004
job 10000004 (RM job '10000004.qsched01')

AName: testjob
State: Idle

Creds: user:abc123 group:abc123 account:p20001 class:long
WallTime: 00:00:00 of 7:00:00:00
SubmitTime: Tue Feb 18 17:12:04
(Time Queued Total: 22:57:48 Eligible: 00:00:00)

NodeMatchPolicy: EXACTNODE

Total Requested Tasks: 2400

Total Requested Nodes: 300

Req[0] TaskCount: 2400 Partition: users
TasksPerNode: 8

IWD: /home/abc123/testjob
UMask: 0022

Executable: /opt/moab/spool/moab.job.QweJ6F

OutputFile: qsched02:/home/abc123/testjob.stdout

ErrorFile: qsched02:/home/abc123/testjob.stderr
Partition List: SHARED,users

SrcRM: internal DstRM: torque DstRMJID: 10000004.qsched02
Flags: GLOBALQUEUE
Attr: checkpoint
StartPriority: 512
PE: 64.00
Holds:     Batch:CannotDebitAccount

NOTE: job cannot run (job has hold in place)

Node Availability for Partition users --------

NOTE: job cannot run (job has hold in place)
available for 8 tasks - qnode[list truncated for this example]
rejected for CPU - qnode[list truncated for this example]
rejected for State - qnode[list truncated for this example]
rejected for Reserved - qnode[list truncated for this example]

BLOCK MSG: job hold active - Batch (recorded at last scheduling iteration)

Message[0] server rejected request with status code 782 - Insufficient balance to reserve job
Message[1] cannot debit job account - no funds

These messages mean exactly what they say, and many users run into them as they reach the end of their allocation. As shown below, use the checkproject script to see what resources remain. The job above is an extreme example: it requires 403200 core-hours of compute time (7 days * 24 hours * 2400 cores).

[abc123@quser01 ~]$ module load utilities
[abc123@quser01 ~]$ checkproject

===================================

Reporting for project p20001

-----------------------------------

542.1GB used in 36417 files (54.2% of 1000GB quota)

Project p20001 has a balance of 25149.62 SU remaining (of 100000.00 assigned)
16107.39 SU are reserved by jobs in the queue

==============================

In this instance, user "abc123" has only 25149 hours (SU) remaining but the job needs 403200. Running the job would overdraw the account, so the scheduler holds it.
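
The comparison above is simple arithmetic: cores multiplied by wall-clock hours, checked against the remaining balance. This is a minimal sketch, assuming one SU equals one core-hour as the figures above imply. su_needed is a hypothetical helper, and the balance is typed in by hand from the checkproject output rather than queried.

```shell
# su_needed CORES DAYS
# Hypothetical helper: core-hours reserved by a job that runs CORES cores
# for DAYS full days (assumes 1 SU = 1 core-hour).
su_needed() {
    local cores=$1 days=$2
    echo $(( cores * days * 24 ))
}

needed=$(su_needed 2400 7)   # the 2400-core, 7-day job above
balance=25149                # whole SUs remaining, copied from checkproject

if [ "$needed" -gt "$balance" ]; then
    echo "job needs $needed SU; only $balance SU remain"
fi
# prints: job needs 403200 SU; only 25149 SU remain
```

Running the same numbers before submitting shows whether a request fits the allocation, and how far the core count or walltime would have to shrink if it does not.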

Last Updated: 26 July 2017
