CrayPAT


Recipes for Profiling

Simple Profiling

Use the following steps to profile your code:

  1. Remove all object files, including those of any user libraries that you want profiled. (A make clean usually does this.)

  2. Issue the command module load craypat.

  3. Compile your code with -Mprof=func. This step is necessary only if your code uses Fortran modules.

  4. Rebuild your code, probably via make.

    • You must rebuild to ensure that the symbols needed for profiling are placed in the code.

  5. Run the pat_build command to build an instrumented executable. The command will be of the form pat_build [-w -u -g <group>] a.out [a.out+pat].

    • Tracing is currently the only option.

    • -u : Indicates to trace all user functions.

    • -g : Possible groups are mpi, io, heap (blas and lapack coming).

    • Original .o files must still be around.

  6. Run the instrumented executable, for example yod a.out+pat.

  7. Run pat_report on the data file, for example pat_report <datafile>.xf.

    • We highly recommend running pat_report -f ap2 <datafile>.xf to create a compressed .ap2 file, which can be used as input to pat_report or Cray Apprentice2.

      • The .ap2 file is portable: it can be moved to any other machine that has Cray Apprentice2, and reports can be run on it there. That cannot be done with .xf files.

    • See man pat_report for details.
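The recipe above can be sketched as a small shell script. This is only an illustration under stated assumptions: the run helper and the profile_steps function are hypothetical names, the mpi group is just one of the groups listed above, and <datafile>.xf is left as a placeholder because the actual data file name depends on your run. By default the script prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch of the simple-profiling recipe (hypothetical helper names).
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 to execute them for real.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

profile_steps() {
    exe="$1"        # executable produced by your build, e.g. a.out
    datafile="$2"   # .xf file written by the run (placeholder here)

    run make clean                                      # step 1: remove old object files
    run module load craypat                             # step 2: load the tools
    run make                                            # steps 3-4: rebuild (add -Mprof=func if you use Fortran modules)
    run pat_build -w -u -g mpi "${exe}" "${exe}+pat"    # step 5: instrument
    run yod "${exe}+pat"                                # step 6: run the instrumented executable
    run pat_report -f ap2 "${datafile}"                 # step 7: create the .ap2 file
}

# Replace <datafile>.xf with the file your run actually produced.
profile_steps a.out "<datafile>.xf"
```

Keeping the dry run as the default makes it easy to check the command sequence before spending time on the machine. Note that module is often a shell function, so with DRY_RUN=0 the script may need to be sourced from a shell where the module command is available.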

Simple Hardware Performance Counter Data

  1. If you don’t have an instrumented executable, complete steps 1–5 above. Otherwise, go on to the next step.

  2. Set the PAT_RT_HWPC environment variable to a value from 1 to 9. Groups 1 through 4 are as follows:

    • 1 : FP, LS, L1 Misses & TLB Misses

    • 2 : L1 & L2 Data Accesses and Misses

    • 3 : L1 Accesses, Misses, and Bandwidth

    • 4 : Floating Point Mix

  3. Run the instrumented code again.

  4. Run pat_report.
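As a rough sketch of these steps, the shell function below checks the counter-group number and prints the commands for one hardware-counter run. The function name hwpc_commands is hypothetical, and <datafile>.xf is a placeholder for whatever data file your run produces.

```shell
#!/bin/sh
# Print the commands for one hardware-counter run (hypothetical helper).
hwpc_commands() {
    group="$1"   # PAT_RT_HWPC group number, 1 to 9 per the text above
    exe="$2"     # instrumented executable, e.g. a.out+pat

    # Accept only a single digit from 1 to 9.
    case "$group" in
        [1-9]) ;;
        *) echo "PAT_RT_HWPC must be between 1 and 9, got: $group" >&2
           return 1 ;;
    esac

    echo "export PAT_RT_HWPC=${group}"
    echo "yod ${exe}"
    echo "pat_report <datafile>.xf"
}

# Example: group 2 (L1 & L2 data accesses and misses).
hwpc_commands 2 a.out+pat
```

Because the variable is read at run time, changing the group requires only a new run of the same instrumented executable, not a new pat_build.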


Overview of CrayPAT Tools

Profiling

Profiling with the Cray tools requires multiple steps. Unlike on the X1E, you must recompile your code. First, load the craypat module (module load craypat). Then recompile your code with ftn or cc (the Cray wrappers) to link in the appropriate Cray performance tools and libraries. If you compile with pgf95 or pgcc directly, the Cray performance libraries are not linked in automatically. Furthermore, if you use Fortran modules, you must compile and link your code with -Mprof=func to get a proper profile.

Two important man pages to check out are pat and pat_build. Note that the Fortran application programming interface (API) is similar to the C API, except that every Fortran routine accepts an additional argument for the status of the call (which in C is provided as the return value).

pat_build

Builds an instrumented version of an executable code.

> pat_build [options] <executable> <instrumented executable>

Supports

  • Fortran, C, C++

  • MPI, SHMEM

Performance measurements

  • Trace based

    • User functions

    • API for fine-grain instrumentation

    • Predefined function groups (mpi, shmem, io, etc.)

Source code mapping

  • Call stack

  • Line numbers

pat_run

User interface to simplify CrayPAT usage. Runs an instrumented executable and generates a report, all in one step.

The following executes a.out+pat and produces a report measuring the number of floating point operations, calculating the mflop rate, and determining the average number of results produced per vector operation for the traced functions.

pat_run -O flops,mflops,vl yod -sz 1 a.out+pat

The following produces a load-balance report showing average versus maximum time per processor (based on wall-clock time) for an MPI program:

pat_run -O balance yod -sz 4 a.out+pat

The “-O” option takes a comma-separated list of keywords that specify the following:

  • How to record it: trace

  • Show callers: callers, calltree

  • Show source/line number: source, line

  • Show load balance: balance[.$data][.$by]

    • $data can be samples or time (default), cycles, etc.

    • $by can be pe (default), thread, or ssp
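To make the balance[.$data][.$by] syntax concrete, the following sketch assembles a keyword from its optional parts and prints the resulting pat_run command. The function name balance_keyword is hypothetical; the values cycles and thread are taken from the list above, and whether a given combination is accepted should be checked against the man page.

```shell
#!/bin/sh
# Build a "balance[.$data][.$by]" keyword for the -O list (hypothetical helper).
balance_keyword() {
    data="$1"   # optional: samples, time, cycles, ...
    by="$2"     # optional: pe, thread, or ssp
    kw="balance"
    if [ -n "$data" ]; then
        kw="${kw}.${data}"
    fi
    if [ -n "$by" ]; then
        kw="${kw}.${by}"
    fi
    echo "$kw"
}

# With no arguments the keyword is plain "balance", which per the text
# above defaults to time per pe.
balance_keyword

# Print a command that requests balance by cycles per thread.
echo "pat_run -O $(balance_keyword cycles thread) yod -sz 4 a.out+pat"
```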

Examples

To get a basic profile, run the following:

pat_run -b [pe,]function:source,line [-s percent=relative] yod -sz 1 <instrumented executable>

The output file contains entries such as the following:

 100.0% |    100.0% |  965 |Total
 |-------------------------------------
 |  88.2% |     88.2% |  851 |kron_matmull@module_kron_
 ||------------------------------------
 ||  40.4% |     40.4% |  344 |line.307
 ||  37.0% |     77.4% |  315 |line.297

To get a call tree, run the following:

pat_run -b [pe,]function:source,calltree [-s percent=relative] yod -sz 1 <instrumented executable>

pat_report

You can directly run an instrumented executable with yod, which will produce a performance-data file (ending in .xf). This file can then be processed into a human-readable text profile using the pat_report command.

Experiment Types

There is only one type of performance experiment that you can run: trace.

See the pat man page for more information.

Run-Time Library

Use the PAT run-time library to get statistics on a specific region of code.

Example

  program test_module_kron
    use pat_api
    integer ierr
    …
    ! Begin region of interest
    call PAT_region_begin ( 1, 'kron_matmul_kernel', ierr ) ! # and name must be unique to each region
    call kron_matmulL(…)
    ! End region of interest
    call PAT_region_end   ( 1, ierr )
  end program

Compile

ftn *.f  -o test.exe

Relink

pat_build -w test.exe test.exe+pat

Run and produce a report

pat_run -g normal [-b function,ssp=HIDE] yod -sz 1 test.exe+pat

Apprentice2 Visualizer

Apprentice2 is designed to help identify and correct the following:

  • Excessive communication

  • Network contention

  • Load imbalance

  • Excessive serialization

Supports

  • Call graph profile

  • Communication statistics

  • Timeline view (PAT_RT_SUMMARY must be set to 0 before running the instrumented code.)

    • Communication

    • I/O

  • Activity view

  • Pairwise communication statistics

  • Text reports

  • Source code mapping

Apprentice2 (invoked with app2) takes as input an XML file. The input file is generated as follows:

module load apprentice2

pat_report -c records -f ap2 <perf.file>.xf > <perf.file>.ap2
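Putting the timeline requirement and the conversion step together, a trace intended for the timeline view might be produced roughly as sketched below. The function name timeline_commands is hypothetical, <perf.file> remains a placeholder for your data file's base name, and passing the .ap2 file as an argument to app2 is an assumption; the function only prints the commands.

```shell
#!/bin/sh
# Print the commands for producing an Apprentice2 timeline (hypothetical helper).
timeline_commands() {
    exe="$1"    # instrumented executable, e.g. a.out+pat
    perf="$2"   # base name of the performance data file (placeholder)

    echo "export PAT_RT_SUMMARY=0"     # required for the timeline view
    echo "yod ${exe}"                  # run the instrumented code
    echo "module load apprentice2"
    echo "pat_report -c records -f ap2 ${perf}.xf > ${perf}.ap2"
    echo "app2 ${perf}.ap2"            # assumption: app2 accepts the file as an argument
}

timeline_commands a.out+pat "<perf.file>"
```

PAT_RT_SUMMARY must be set before the run, not before the conversion, which is why it comes first in the sequence.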

Visualization is possible with both profiles and trace files, but Apprentice2 has less functionality with profiles. The following features are supported for profiles (run-time summaries):

  • Call graph view

  • Function statistics overview

  • Function report

  • Processing element (PE) breakdown

  • General information


Hardware Performance Counters

pat_hwpc

pat_hwpc collects hardware performance counter information for an application. No instrumentation is required. Usage is as follows:

  pat_hwpc [options] yod <executable>

pat_hwpc accepts various hardware counter groups and produces a report with raw counts and derived metrics for the whole execution. The hardware counters are summed across all threads in each process. See the pat_hwpc man page for details.