Job Execution
Once resources have been allocated through PBS, users have the option of serially running commands on the allocated resources’ head node or across all the resources in the allocated resource pool.
The Spider center-wide file system is now the operational work file system on Jaguar’s XT5 (jaguarpf) and XT4 (jaguar) partitions, Lens, Smoky, and the dedicated data transfer nodes. It is the largest-scale Lustre file system in the world, with over 26,000 clients, and it is the fastest Lustre file system in the world, with a demonstrated bandwidth of 240 GB/s. Spider currently provides 5 PB of disk space.
More information on Spider, including best practices, can be found here.
Serial
- Batch Script
- The executable portion of batch scripts is interpreted by the shell specified on the first line of the script. If a shell is not specified, the submitting user’s default shell will be used. This portion of the script may contain comments, shell commands, executable scripts, and compiled executables. These can be used in combination to, for example, navigate file systems, set up job execution, run executables, and even submit other batch jobs.
- More information on batch scripts can be found on the batch scripts page.
- Batch Interactive
- While running in interactive mode, the submitting user’s default shell will be used.
- More information on interactive batch jobs can be found here.
Parallel
aprun
By default, commands will be executed on the job’s associated service node. The aprun command is used to execute a job on one or more compute nodes. The XT’s layout should be kept in mind when running a job using aprun. The XT5 partition currently contains two hex-core processors (a total of 12 cores) per compute node, while the XT4 partition currently contains one quad-core processor (a total of 4 cores) per compute node. The PBS size option requests compute cores.
NOTE: By default, aprun will not forward shell limits (set by ulimit for sh/ksh/bash or by limit for csh/tcsh). To pass these settings to your batch job, set the environment variable APRUN_XFER_LIMITS to 1 (export APRUN_XFER_LIMITS=1 for sh/ksh/bash, or setenv APRUN_XFER_LIMITS 1 for csh/tcsh).
Basic aprun options:
| Option | Description |
|---|---|
| -D | Debug (shows the layout aprun will use) |
| -n | Number of MPI tasks. Note: If you do not specify the number of tasks to aprun, the system will default to 1. |
| -N | Number of tasks per node. NOTE: Recall that the XT5 has two Opterons per compute node. On the XT5, to place one task per hex-core Opteron, use -S 1 (not -N 1 as on the XT4). On the XT4, because there is only one Opteron per node, -S 1 and -N 1 result in the same layout. |
| -m | Memory required per task. The maximum is 2,000 MB per core; requesting 2,100 MB, for example, will allocate two cores for the task. |
| -d | Number of threads per MPI task. Note: As of CLE 2.1, this option is very important. If you specify OMP_NUM_THREADS but do not give a -d option, aprun will allocate your threads to a single core. You must use OMP_NUM_THREADS to specify the number of threads per MPI task, and you must use -d to tell aprun how to place those threads. |
| -S | Number of PEs to allocate per NUMA node. |
| -ss | Strict memory containment per NUMA node. |
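The -m rule in the table amounts to rounding the memory request up to a whole number of cores. The following is a minimal sketch of that arithmetic, assuming the 2,000 MB per-core figure quoted above and assuming the rule generalizes beyond the single 2,100 MB example. The function name cores_for_memory is hypothetical and is not part of aprun.

```shell
# Hypothetical helper (not an aprun feature): number of cores reserved for
# one task that requests $1 MB, assuming 2,000 MB is available per core.
cores_for_memory() (
    mb=$1
    per_core=2000
    # Ceiling division: round up to a whole number of cores.
    echo $(( (mb + per_core - 1) / per_core ))
)

cores_for_memory 2000   # fits in one core
cores_for_memory 2100   # spills into a second core
```

The second call reproduces the table's example: a 2,100 MB request occupies two cores.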
Huge Pages
Huge pages (2MB) are now supported. Previous CNL versions only supported 4KB pages.
4KB pages remain the default; however, users now have the option to use 2MB pages.
To use 2MB pages:
- Link to the library libhugetlbfs.a.
  - cc code.c -lhugetlbfs
  - ftn code.f -lhugetlbfs
- Set the huge pages environment variable HUGETLB_MORECORE.
  - setenv HUGETLB_MORECORE yes (csh/tcsh)
  - export HUGETLB_MORECORE=yes (sh/ksh/bash)
- Add the huge pages suffix to the -m aprun argument.
  - -m sizeh
    - Requests size of huge pages to be allocated to each PE. All nodes use as much huge page memory as they are able to allocate and 4 KB pages thereafter.
  - -m sizehs
    - Requires size of huge pages to be allocated to each PE. If the request cannot be satisfied, an error message is issued and aprun terminates the request.
MPI Task Layout
The default MPI task layout is SMP-style. This means MPI will sequentially allocate all cores on one node before allocating tasks to another node.
- Changing/Viewing Layout
  - The layout order can be changed using the environment variable MPICH_RANK_REORDER_METHOD. See man intro_mpi for more information.
  - Task layout can be seen by setting MPICH_RANK_REORDER_DISPLAY to 1.
- XT5 Example
  - aprun -n24 ./a.out will run a.out across 24 cores. This requires two compute nodes. The MPI task layout would be as follows (table entries are MPI ranks):

    | Compute Node | Opteron | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 |
    |---|---|---|---|---|---|---|---|
    | 0 | 0 | 0 | 1 | 2 | 3 | 4 | 5 |
    | 0 | 1 | 6 | 7 | 8 | 9 | 10 | 11 |
    | 1 | 0 | 12 | 13 | 14 | 15 | 16 | 17 |
    | 1 | 1 | 18 | 19 | 20 | 21 | 22 | 23 |

  - The following will place tasks in a round robin fashion.
        > setenv MPICH_RANK_REORDER_METHOD 0
        > aprun -n 24 a.out
        Rank 0, Node 1, Opteron 0, Core 0
        Rank 1, Node 2, Opteron 0, Core 0
        Rank 2, Node 1, Opteron 0, Core 1
        Rank 3, Node 2, Opteron 0, Core 1
        Rank 4, Node 1, Opteron 0, Core 2
        Rank 5, Node 2, Opteron 0, Core 2
        Rank 6, Node 1, Opteron 0, Core 3
        Rank 7, Node 2, Opteron 0, Core 3
        Rank 8, Node 1, Opteron 0, Core 4
        Rank 9, Node 2, Opteron 0, Core 4
        Rank 10, Node 1, Opteron 0, Core 5
        Rank 11, Node 2, Opteron 0, Core 5
        Rank 12, Node 1, Opteron 1, Core 0
        Rank 13, Node 2, Opteron 1, Core 0
        Rank 14, Node 1, Opteron 1, Core 1
        Rank 15, Node 2, Opteron 1, Core 1
        Rank 16, Node 1, Opteron 1, Core 2
        Rank 17, Node 2, Opteron 1, Core 2
        Rank 18, Node 1, Opteron 1, Core 3
        Rank 19, Node 2, Opteron 1, Core 3
        Rank 20, Node 1, Opteron 1, Core 4
        Rank 21, Node 2, Opteron 1, Core 4
        Rank 22, Node 1, Opteron 1, Core 5
        Rank 23, Node 2, Opteron 1, Core 5
        >
- XT4 Example
  - aprun -n8 a.out will run the MPI executable a.out on a total of eight cores, four cores on each of two compute nodes. The MPI tasks will be allocated in the following sequential fashion (table entries are MPI ranks):

    | Compute Node | Opteron | Core 0 | Core 1 | Core 2 | Core 3 |
    |---|---|---|---|---|---|
    | 0 | 0 | 0 | 1 | 2 | 3 |
    | 1 | 0 | 4 | 5 | 6 | 7 |

  - The following will place tasks in a round robin fashion.
        > setenv MPICH_RANK_REORDER_METHOD 0
        > aprun -n 8 a.out
        Rank 0, Node 1, Opteron 0, Core 0
        Rank 1, Node 2, Opteron 0, Core 0
        Rank 2, Node 1, Opteron 0, Core 1
        Rank 3, Node 2, Opteron 0, Core 1
        Rank 4, Node 1, Opteron 0, Core 2
        Rank 5, Node 2, Opteron 0, Core 2
        Rank 6, Node 1, Opteron 0, Core 3
        Rank 7, Node 2, Opteron 0, Core 3
        >
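The two orderings described in this section can be worked out by hand with integer division and remainder. The sketch below does that arithmetic in plain shell, assuming the job occupies the smallest number of nodes that holds all tasks. The function rank_layout and its argument order are hypothetical; it does not call aprun or MPI, and it numbers nodes from 0 throughout.

```shell
# Hypothetical helper: print the node, Opteron, and core each MPI rank lands on.
# usage: rank_layout METHOD NTASKS CORES_PER_OPTERON OPTERONS_PER_NODE
#   METHOD = smp          fill every core of a node before moving on (default style)
#   METHOD = round_robin  deal ranks across the nodes in turn
rank_layout() (
    method=$1; ntasks=$2; cpo=$3; opn=$4
    cpn=$(( cpo * opn ))                      # cores per compute node
    nodes=$(( (ntasks + cpn - 1) / cpn ))     # nodes needed, rounded up
    rank=0
    while [ "$rank" -lt "$ntasks" ]; do
        if [ "$method" = round_robin ]; then
            node=$(( rank % nodes )); slot=$(( rank / nodes ))
        else
            node=$(( rank / cpn ));   slot=$(( rank % cpn ))
        fi
        # slot is the core position within the node; split it by Opteron.
        echo "Rank $rank, Node $node, Opteron $(( slot / cpo )), Core $(( slot % cpo ))"
        rank=$(( rank + 1 ))
    done
)

rank_layout smp 24 6 2          # XT5: 24 tasks, SMP-style
rank_layout round_robin 8 4 1   # XT4: 8 tasks, round robin
```

On a real job, setting MPICH_RANK_REORDER_DISPLAY to 1 remains the authoritative way to see the layout; this sketch is only a planning aid.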
Memory Affinity
Memory affinity is supported under CLE 2.1. Each Opteron on a node, together with its memory, is organized into a NUMA node.
- XT5
- Because the XT5 contains two Opterons per node, each XT5 node contains two NUMA nodes.
- XT4
- Because the XT4 contains one Opteron per node, each XT4 node contains one NUMA node.
Applications may use resources from one or both NUMA nodes. The following aprun options allow control of application NUMA node use.
- -S pes_per_numa_node
  - Number of PEs to allocate per NUMA node.
  - pes_per_numa_node can be 1, 2, 3, 4, 5, or 6.
  - XT5 Example:
    - The following will run a.out on 4 cores, one core per NUMA node.
          > aprun -n 4 -S 1 a.out
          Rank 0, Node 0, Opteron 0, Core 0
          Rank 1, Node 0, Opteron 1, Core 0
          Rank 2, Node 1, Opteron 0, Core 0
          Rank 3, Node 1, Opteron 1, Core 0
          >
  - XT4 Example:
    - The following will run a.out on 4 cores, all on one NUMA node.
          > aprun -n 4 -S 4 a.out
          Rank 0, Node 0, Opteron 0, Core 0
          Rank 1, Node 0, Opteron 0, Core 1
          Rank 2, Node 0, Opteron 0, Core 2
          Rank 3, Node 0, Opteron 0, Core 3
          >
- -ss
  - Strict memory containment per NUMA node. The default is to allow remote NUMA node memory access. This option prevents memory access of the remote NUMA node.
  - XT5 Only. Because the XT4 has only one NUMA node per node, this option does not apply to the XT4.
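The placement that -S produces in the two examples above follows a simple pattern: ranks fill pes_per_numa_node cores on one NUMA node, then move to the next NUMA node. A minimal sketch of that pattern, assuming one thread per task (no -d option); numa_layout is a hypothetical name and the function does not invoke aprun.

```shell
# Hypothetical helper: placement of single-threaded tasks under -S.
# usage: numa_layout NTASKS PES_PER_NUMA_NODE NUMA_NODES_PER_COMPUTE_NODE
numa_layout() (
    ntasks=$1; s=$2; npn=$3
    rank=0
    while [ "$rank" -lt "$ntasks" ]; do
        numa=$(( rank / s ))    # running index over all NUMA nodes in the job
        echo "Rank $rank, Node $(( numa / npn )), Opteron $(( numa % npn )), Core $(( rank % s ))"
        rank=$(( rank + 1 ))
    done
)

numa_layout 4 1 2   # XT5 (two NUMA nodes per node): aprun -n 4 -S 1
numa_layout 4 4 1   # XT4 (one NUMA node per node):  aprun -n 4 -S 4
```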
Threads
The system supports threaded programming within a compute node. On the XT5, threads may span across both Opterons within a single compute node, but cannot span compute nodes. Users have a great deal of flexibility in thread placement. Several examples are shown below.
The -d (depth) option specifies the number of threads per task. Without this option, all threads will be started on the same core. Under previous CNL versions the option was not required. The number of cores used is calculated by multiplying the value of -d by the value of -n.
XT5 Examples
These examples are written for bash. If using csh/tcsh, you should change the export OMP_NUM_THREADS=x lines to setenv OMP_NUM_THREADS x
- Launch 2 MPI tasks, each with 12 threads (this requests 2 compute nodes and requires a size request of 24):
      > export OMP_NUM_THREADS=12
      > aprun -n2 -d12 a.out
      Rank 0, Thread 0, Node 0, Opteron 0, Core 0 <-- MASTER
      Rank 0, Thread 1, Node 0, Opteron 0, Core 1 <-- slave
      Rank 0, Thread 2, Node 0, Opteron 0, Core 2 <-- slave
      Rank 0, Thread 3, Node 0, Opteron 0, Core 3 <-- slave
      Rank 0, Thread 4, Node 0, Opteron 0, Core 4 <-- slave
      Rank 0, Thread 5, Node 0, Opteron 0, Core 5 <-- slave
      Rank 0, Thread 6, Node 0, Opteron 1, Core 0 <-- slave
      Rank 0, Thread 7, Node 0, Opteron 1, Core 1 <-- slave
      Rank 0, Thread 8, Node 0, Opteron 1, Core 2 <-- slave
      Rank 0, Thread 9, Node 0, Opteron 1, Core 3 <-- slave
      Rank 0, Thread 10, Node 0, Opteron 1, Core 4 <-- slave
      Rank 0, Thread 11, Node 0, Opteron 1, Core 5 <-- slave
      Rank 1, Thread 0, Node 1, Opteron 0, Core 0 <-- MASTER
      Rank 1, Thread 1, Node 1, Opteron 0, Core 1 <-- slave
      Rank 1, Thread 2, Node 1, Opteron 0, Core 2 <-- slave
      Rank 1, Thread 3, Node 1, Opteron 0, Core 3 <-- slave
      Rank 1, Thread 4, Node 1, Opteron 0, Core 4 <-- slave
      Rank 1, Thread 5, Node 1, Opteron 0, Core 5 <-- slave
      Rank 1, Thread 6, Node 1, Opteron 1, Core 0 <-- slave
      Rank 1, Thread 7, Node 1, Opteron 1, Core 1 <-- slave
      Rank 1, Thread 8, Node 1, Opteron 1, Core 2 <-- slave
      Rank 1, Thread 9, Node 1, Opteron 1, Core 3 <-- slave
      Rank 1, Thread 10, Node 1, Opteron 1, Core 4 <-- slave
      Rank 1, Thread 11, Node 1, Opteron 1, Core 5 <-- slave
      >
- Launch 4 MPI tasks, each with 6 threads. Place 1 MPI task per Opteron (this requests 2 compute nodes and requires a size request of 24):
      > export OMP_NUM_THREADS=6
      > aprun -n4 -d6 -S1 a.out
      Rank 0, Thread 0, Node 0, Opteron 0, Core 0 <-- MASTER
      Rank 0, Thread 1, Node 0, Opteron 0, Core 1 <-- slave
      Rank 0, Thread 2, Node 0, Opteron 0, Core 2 <-- slave
      Rank 0, Thread 3, Node 0, Opteron 0, Core 3 <-- slave
      Rank 0, Thread 4, Node 0, Opteron 0, Core 4 <-- slave
      Rank 0, Thread 5, Node 0, Opteron 0, Core 5 <-- slave
      Rank 1, Thread 0, Node 0, Opteron 1, Core 0 <-- MASTER
      Rank 1, Thread 1, Node 0, Opteron 1, Core 1 <-- slave
      Rank 1, Thread 2, Node 0, Opteron 1, Core 2 <-- slave
      Rank 1, Thread 3, Node 0, Opteron 1, Core 3 <-- slave
      Rank 1, Thread 4, Node 0, Opteron 1, Core 4 <-- slave
      Rank 1, Thread 5, Node 0, Opteron 1, Core 5 <-- slave
      Rank 2, Thread 0, Node 1, Opteron 0, Core 0 <-- MASTER
      Rank 2, Thread 1, Node 1, Opteron 0, Core 1 <-- slave
      Rank 2, Thread 2, Node 1, Opteron 0, Core 2 <-- slave
      Rank 2, Thread 3, Node 1, Opteron 0, Core 3 <-- slave
      Rank 2, Thread 4, Node 1, Opteron 0, Core 4 <-- slave
      Rank 2, Thread 5, Node 1, Opteron 0, Core 5 <-- slave
      Rank 3, Thread 0, Node 1, Opteron 1, Core 0 <-- MASTER
      Rank 3, Thread 1, Node 1, Opteron 1, Core 1 <-- slave
      Rank 3, Thread 2, Node 1, Opteron 1, Core 2 <-- slave
      Rank 3, Thread 3, Node 1, Opteron 1, Core 3 <-- slave
      Rank 3, Thread 4, Node 1, Opteron 1, Core 4 <-- slave
      Rank 3, Thread 5, Node 1, Opteron 1, Core 5 <-- slave
      >
- Launch 4 MPI tasks, each with 2 threads. Only place 1 MPI task (and its two threads) on each Opteron. (This requests 2 compute nodes and requires a size request of 24 even though only 8 cores are actually being used):
      > export OMP_NUM_THREADS=2
      > aprun -n4 -d2 -S1 a.out
      Rank 0, Thread 0, Node 0, Opteron 0, Core 0 <-- MASTER
      Rank 0, Thread 1, Node 0, Opteron 0, Core 1 <-- slave
      Rank 1, Thread 0, Node 0, Opteron 1, Core 0 <-- MASTER
      Rank 1, Thread 1, Node 0, Opteron 1, Core 1 <-- slave
      Rank 2, Thread 0, Node 1, Opteron 0, Core 0 <-- MASTER
      Rank 2, Thread 1, Node 1, Opteron 0, Core 1 <-- slave
      Rank 3, Thread 0, Node 1, Opteron 1, Core 0 <-- MASTER
      Rank 3, Thread 1, Node 1, Opteron 1, Core 1 <-- slave
      >
XT4 Examples
These examples are written for bash. If using csh/tcsh, you should change the export OMP_NUM_THREADS=x lines to setenv OMP_NUM_THREADS x
- Launch 2 MPI tasks, each with 4 threads (this requests 2 compute nodes and requires a size request of 8):
      > export OMP_NUM_THREADS=4
      > aprun -n2 -d4 a.out
      Rank 0, Thread 0, Node 0, Opteron 0, Core 0 <-- MASTER
      Rank 0, Thread 1, Node 0, Opteron 0, Core 1 <-- slave
      Rank 0, Thread 2, Node 0, Opteron 0, Core 2 <-- slave
      Rank 0, Thread 3, Node 0, Opteron 0, Core 3 <-- slave
      Rank 1, Thread 0, Node 1, Opteron 0, Core 0 <-- MASTER
      Rank 1, Thread 1, Node 1, Opteron 0, Core 1 <-- slave
      Rank 1, Thread 2, Node 1, Opteron 0, Core 2 <-- slave
      Rank 1, Thread 3, Node 1, Opteron 0, Core 3 <-- slave
      >
- Launch 2 MPI tasks, each with 2 threads. Place 1 MPI task per Opteron (this requests 2 compute nodes and requires a size request of 8):
      > export OMP_NUM_THREADS=2
      > aprun -n2 -d2 -S1 a.out
      Rank 0, Thread 0, Node 0, Opteron 0, Core 0 <-- MASTER
      Rank 0, Thread 1, Node 0, Opteron 0, Core 1 <-- slave
      Rank 1, Thread 0, Node 1, Opteron 0, Core 0 <-- MASTER
      Rank 1, Thread 1, Node 1, Opteron 0, Core 1 <-- slave
      >
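The size requests quoted in the XT5 and XT4 thread examples can be reproduced with a little arithmetic: work out how many tasks fit on a node, round the node count up, and multiply by the cores per node. The sketch below assumes that size must cover whole compute nodes (as the 24-core request for an 8-core job suggests) and that a single task fits within one node. The function name size_request and its argument order are hypothetical.

```shell
# Hypothetical helper: PBS size value needed for a given aprun geometry.
# usage: size_request NTASKS DEPTH PES_PER_NUMA CORES_PER_OPTERON OPTERONS_PER_NODE
#   Pass 0 for PES_PER_NUMA when -S is not used.
size_request() (
    n=$1; d=$2; s=$3; cpo=$4; opn=$5
    cpn=$(( cpo * opn ))                  # cores per compute node
    if [ "$s" -gt 0 ]; then
        tpn=$(( s * opn ))                # -S given: S tasks on each NUMA node
    else
        tpn=$(( cpn / d ))                # otherwise pack tasks of d threads each
    fi
    if [ "$tpn" -lt 1 ]; then
        echo "a task of $d threads does not fit on one node" >&2
        exit 1
    fi
    nodes=$(( (n + tpn - 1) / tpn ))      # round up to whole nodes
    echo $(( nodes * cpn ))
)

size_request 2 12 0 6 2   # XT5: aprun -n2 -d12
size_request 4 6 1 6 2    # XT5: aprun -n4 -d6 -S1
size_request 4 2 1 6 2    # XT5: aprun -n4 -d2 -S1
size_request 2 4 0 4 1    # XT4: aprun -n2 -d4
size_request 2 2 1 4 1    # XT4: aprun -n2 -d2 -S1
```

The three XT5 calls each give 24 and the two XT4 calls each give 8, matching the size requests stated in the examples.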
Lustre
Local Lustre areas available on the XT4:
- /tmp/work/$USER
- /lustre/scr144/$USER
Lustre area available on the XT5:
- /tmp/work/$USER
Notice:
Compute nodes can see only the Lustre work space.
The NFS-mounted home, project, and software directories are not accessible to the compute nodes.
- Executables must be executed from within the Lustre work space /tmp/work/$USER (XT4 and XT5) or /lustre/scr144/$USER (XT4 only).
- Batch jobs can be submitted from the home or work space. If submitted from a user’s home area, the user should cd into the Lustre work space directory prior to running the executable through aprun. An error similar to the following may be returned if this is not done:
      aprun: [NID 94]Exec /tmp/work/userid/a.out failed: chdir /autofs/na1_home/userid No such file or directory
- Input must reside in the Lustre work space /tmp/work/$USER (XT4 and XT5) or /lustre/scr144/$USER (XT4 only).
- Output must also be sent to the Lustre file system /tmp/work/$USER (XT4 and XT5) or /lustre/scr144/$USER (XT4 only).
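One way to avoid the chdir error described above is to check the working directory near the top of a batch script before calling aprun. The following is a minimal sketch of such a check, using only the two work-space paths listed in this section. The function name in_lustre_workspace is hypothetical, and the fallback to id -un when USER is unset is an assumption made for this example.

```shell
# Hypothetical guard: succeed only if directory $1 lies inside one of the
# Lustre work areas named above. /lustre/scr144 applies to the XT4 only.
in_lustre_workspace() {
    _lw_user=${USER:-$(id -un)}
    case "$1" in
        "/tmp/work/$_lw_user"|"/tmp/work/$_lw_user"/*)           return 0 ;;
        "/lustre/scr144/$_lw_user"|"/lustre/scr144/$_lw_user"/*) return 0 ;;
        *)                                                       return 1 ;;
    esac
}

# Typical use in a batch script, before the aprun line:
if ! in_lustre_workspace "$PWD"; then
    echo "warning: $PWD is not visible to the compute nodes; cd into the Lustre work space first" >&2
fi
```

The check only inspects the path string; it does not verify that the directory exists or is actually a Lustre mount.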
