FAQ

For more general support questions see the General FAQ under the User Support tab.

How do I switch compilers or library versions?

To switch compiler or library versions, use the module command. Run module list to show the currently loaded modules. The default (production) modules use generic names with no version numbers, so module list will not show which versions are loaded. Use the pe-version command to find out which versions are currently loaded.

Then run module avail to see which other versions of the compilers and libraries are available. Typically, the default environment is one minor version behind what is available for testing. Use module swap <loadedmodule> <newermodule> to swap out a module for a newer or older one. A typical use of this command is to swap out the default PrgEnv for PrgEnv.<newer>.

For example, if the compiler dies while compiling your code, you can change versions of the programming environment (PE) with

module swap <loadedcompiler> <newercompiler>

and see if the newer compiler fixes it.

How do I use TotalView interactively to debug a parallel code?

Issue the command

totalview -app "-n 3" <program> [totalview options] [-a <program options>]

Why do I get Fortran module USE file "Modules.o" cannot be found?

The Phoenix compilers hard-code paths in their .o files, so they look for modules in the directory where the .o file containing the module was built. When the modules are not there, you get these warning messages. If you provide an alternate way of finding them, though, the code still builds.

Why does gdb fail with the error aprun: a.out: Not enough space?

If a program is compiled in command mode (with the -h command flag in C or Fortran or the -O command flag in Fortran), it will be given a 64 KB page size. When gdb tries to run the program, this limit causes gdb to fail with the Not enough space error. If the program is compiled in MSP or SSP mode instead, gdb will not give the error. If you want to debug a command mode executable, you should use gdb -launcher="" ./a.out or gdb -launcher="aprun -u" ./a.out.

What do I do when I get unable to launch application in 5 secs?

This typically happens when using pat_hwpc, for example,

phoenix% pat_hwpc aprun -n 4 hello
unable to launch application in 5 secs [No such process]
Process 121552 exiting with a non-zero status 0x9

A workaround for this is to set PAT_HWPC_APPLACE_TIME to a value larger than 5, say 20. This environment variable is the number of seconds allotted for the application to be scheduled for execution before pat_hwpc terminates.

How do I monitor memory usage?

You can use top to monitor memory usage. The “RES” column shows the memory usage for the process listed in that row. (Prior to UNICOS/mp 3.0, the value in the RES column was the sum of the memory usage of all the processes on a node [a node = 4 MSPs = 16 SSPs].)

You can also use sar, which reports memory usage per node rather than per process. Run sar -n 5 1, and the memory usage appears in the second-to-last column. You need to know which nodes you are using; you can get this from psview -m or from apstat.

The above utilities let you determine memory usage interactively. If you want a memory high-water mark, you can use, for example, pat_hwpc. Its summary report gives a memory high-water mark, but only for selected processes (SSP 0 on each node).

How are memory limits enforced?

The mem option in PBS is not currently enforced, and there is no real need for it: you can use only the memory on the nodes you have, and you cannot use memory from other nodes unless you have processes on them. In addition, aprun has default memory limits turned on. By default, the limit is 2 GB for each MSP you ask for and 512 MB for each SSP you ask for. You can ask for more with the -m option to aprun. However, if you use -m to ask for more than the default limit, you need to request more mppe in the PBS script than you ask for with the aprun command. Also, ask for a little less than the theoretically available maximum because the OS uses some of the memory; for example, instead of -m 8GB use -m 8000M.
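
The FAQ does not say how many extra MSPs to request when -m exceeds the default limit. As a rough planning aid only, the following Python sketch assumes that each process must be backed by enough MSPs to cover its memory at the default 2 GB per MSP, and that 1 GB is 1024 MB. Both assumptions, and the function name estimate_mppe, are hypothetical and not taken from site documentation.

```python
import math

DEFAULT_MB_PER_MSP = 2048  # default aprun limit of 2 GB per MSP (1 GB taken as 1024 MB)

def estimate_mppe(n_procs, mem_mb_per_proc):
    """Estimate the MSP count to request in PBS for MSP-mode processes.

    ASSUMPTION (not stated in the FAQ): each process needs enough MSPs
    to cover its -m value at the default 2 GB per MSP.
    """
    if n_procs < 1 or mem_mb_per_proc <= 0:
        raise ValueError("n_procs must be >= 1 and memory must be positive")
    # Round up: a process needing slightly more than 2 GB takes two MSPs.
    msps_per_proc = max(1, math.ceil(mem_mb_per_proc / DEFAULT_MB_PER_MSP))
    return n_procs * msps_per_proc

# Four processes at -m 8000M each: 4 MSPs per process, 16 in total.
print(estimate_mppe(4, 8000))
```

The result is only an estimate and is still subject to the module-boundary rule described under "Why does qsub give me an error?".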

If you use an option such as -C1 to spread out your processes AND you want more memory per process, then you will have to increase one or more of the following environment variables:

  X1_COMMON_STACK_SIZE=4000000000
  X1_PRIVATE_STACK_SIZE=4000000000
  X1_LOCAL_HEAP_SIZE=4000000000
  X1_SYMMETRIC_HEAP_SIZE=4000000000

Run man 7 memory for more information on these environment variables.

How do I get a trace back when my application crashes?

If you don’t automatically get a trace back when your application crashes, then set the environment variable TRACEBK to 30. The value tells how many levels in the trace back you will get, and 30 is the maximum.

What information does a trace back on Phoenix show?

A trace back will show information such as

Traceback for process 64311(ssp mode) apid 64184.229 on node 7
                    zgemv_+0x0754 (0x1CA012C8D34) at zgemv.f
                block_inv_+0x087C (0x1CA011FB71C) at block_inverse.f:118
                 gettaucl_+0x2568 (0x1CA01126B48) at tmp_gettaucl_c.f:453
                   gettau_+0x182C (0x1CA0103E52C) at gettau_c.f:311
               zplanint_d_+0x1AD4 (0x1CA01021054) at zplanint_d.f:413
Fault: unable to access memory address: 0x1CB410030F8
/var/spool/PBS/mom_priv/jobs/52744.phoen.SC[23]: 64184 Memory fault(coredump)

The first line, “Traceback for process 64311(ssp mode) apid 64184.229 on node 7,” tells the user several things. First, it indicates that if you got core files, you should start with core file 229, although the core for process 0 should also be investigated. The 229 also means that it was process 5 on node 7 that failed. The program failed with an addressing error on node 7. Here “node” means an X1E module, which contains eight MSPs (32 SSPs). Node numbering starts at 0, and this was a 256-SSP job, so node 7 was the last node in the job. If the failing SSP is always on the last node, that fact could be important. Note that blocks of 16 SSPs (by default, unless they use more than 512 MB of memory) end up sharing a local memory pool.
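
The arithmetic behind reading "229" as process 5 on node 7 can be sketched in Python. This is an illustration only: it assumes an SSP-mode job with default placement, 32 SSPs per module (8 MSPs of 4 SSPs each), and processes numbered consecutively from 0 across modules. The function name locate_process is hypothetical.

```python
SSPS_PER_MODULE = 32  # 8 MSPs per module x 4 SSPs per MSP

def locate_process(process_index, ssps_per_module=SSPS_PER_MODULE):
    """Split a job-wide process index into (module number, index within module)."""
    if process_index < 0:
        raise ValueError("process index must be non-negative")
    # Integer division gives the module; the remainder is the position in it.
    return divmod(process_index, ssps_per_module)

node, local = locate_process(229)
print(node, local)  # prints: 7 5
```

For a 256-SSP job the indices run from 0 to 255, so the last process maps to module 7, which matches the statement that node 7 was the last node in the job.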

What is the behavior of a Fortran name list?

A run-time I/O error occurs if a nonmatching name list is encountered in an input file. This is the default behavior on the Cray X1E. To avoid the error, you can do the following, or read the manual page for assign for more options:

setenv FILENV $EVAR
eval `assign -Y on u:<input file unit number>`

The ability to “skip over” nonmatching name lists in an input file is an extension of the standard. The standard says you begin input at the beginning of the next record in the file. It does not say anything about how to handle bad data. Name list existed in various compilers long before it was made part of the language standard, so it is possible that some implementations have allowed this behavior for a long time, from when the entire feature was an extension. For compatibility with their previous implementations, most vendors chose to continue with what they had done before.

Ultimately, the standards argument boils down to this: For conformance, the user is supposed to have name lists in the input file in the same order as the program reads that file. The question is how the PE reacts to nonconforming cases. Possible options are to give an error (our default) or to react in some way not specified by the standard (our optional, -Y case). The compiler/library is always allowed to do anything it wants if the program is not conforming, so the compilers/libraries in both cases are legal. It’s the program (in particular failing to read the name lists in order) that is nonconforming.

Why does qsub give me an error?

If you submit a job requesting more than four MSPs (or 16 SSPs) but not a multiple of eight (or 32 SSPs), qsub will reject the job with the following error message:

ERROR: When using more than one SMP node (4 MSPs or 16 SSPs), you must
       request a whole number of modules.  Please set "mppe" to a multiple of
       eight (or "mppssp" to a multiple of 32).

The Cray job scheduler can only start multinode jobs on a module boundary. (A multinode job is any job that requests more than four MSPs or 16 SSPs. A module contains two nodes or eight MSPs.) If a multinode job requests a number of MSPs that is not a multiple of eight, it introduces gaps in the system. PBS does not realize that these gaps are unusable by other multinode jobs, so it may schedule a job that cannot be run. That job will wait to run, but PBS will think it is running and will run the clock on its requested wall time.

To avoid this issue, we require that all multinode jobs request a number of MSPs that is a multiple of eight (or a number of SSPs that is a multiple of 32). Your aprun command is not required to use all the processors you request through PBS, but our accounting systems will record your PBS request, not your actual aprun usage. (Our accounting is not based on what resources you use but on what resources you deny to other users.)
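
The rule above can be checked before you submit a job. The following Python sketch is not a site-provided tool; the function names is_valid_mppe and round_up_mppe are hypothetical, and the constants come from the qsub error message (one node = 4 MSPs, one module = 8 MSPs).

```python
MSPS_PER_NODE = 4    # one SMP node, per the qsub error message
MSPS_PER_MODULE = 8  # one module = two nodes

def is_valid_mppe(mppe):
    """Return True if the module-boundary rule would accept this MSP count."""
    if mppe < 1:
        return False
    # Single-node jobs (up to 4 MSPs) are exempt from the rule.
    if mppe <= MSPS_PER_NODE:
        return True
    return mppe % MSPS_PER_MODULE == 0

def round_up_mppe(mppe):
    """Return the smallest acceptable request that is at least mppe."""
    if mppe < 1:
        raise ValueError("mppe must be at least 1")
    if is_valid_mppe(mppe):
        return mppe
    # Ceiling division to the next whole module.
    modules = -(-mppe // MSPS_PER_MODULE)
    return modules * MSPS_PER_MODULE

print(round_up_mppe(12))  # prints: 16
```

For example, a job that needs 12 MSPs would request mppe=16. The same check applies to mppssp with 16 and 32 in place of 4 and 8. Keep in mind that accounting records the PBS request, not the aprun usage.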

Why do I get “UNRECOVERABLE error on system request” when I use assign?

The assign command works only if FILENV points to a file in /tmp/work/<user>. In other words, assign fails with the above error if FILENV is set to a file located in your home space.

Why aren’t system service calls in Fortran code found under UNICOS/mp?

System service calls such as GETARG and SYSTEM are part of the PXF POSIX library. If your Fortran code uses system service calls, you will most likely have to modify your code to call the PXF equivalent. It is common to find these when porting code from other systems.

Below is a list of common system calls and their equivalent PXF interfaces. Other interfaces may be found through man intro_pxf. Details on each interface may be found through its man page. Please note that in most cases the PXF calls take at least one more argument than the common interfaces.

Common     PXF
ACCESS     PXFACCESS
CHDIR      PXFCHDIR
CHMOD      PXFCHMOD
CHOWN      PXFCHOWN
CLOSE      PXFCLOSE
GETARG     PXFGETARG
GETCWD     PXFGETCWD
GETENV     PXFGETENV
GETGID     PXFGETGID
GETHOST    PXFUNAME
GETPGRP    PXFGETPGRP
GETPID     PXFGETPID
GETUID     PXFGETUID
GETVARG    PXFGETARG
GETVARGC   PXFGETARG
IARGC      IPXFARGC
LSEEK      PXFLSEEK
OPEN       PXFOPEN
PIPE       PXFPIPE
SLEEP      PXFSLEEP
STAT       PXFSTAT
TIME       PXFTIME
UTIME      PXFUTIME
WAIT       PXFWAIT

Whom do I contact for help?

Contact the User Assistance Center if you have questions.