Debugging & Optimization

Cray Performance Analysis Tools on Phoenix

Profiling

Profiling with the Cray tools requires multiple steps, but it does not require you to recompile your code.

Robin is a cross-compilation system for Phoenix. Because Robin is much faster than Phoenix at scalar work, such as compiling and performance analysis, you may want to do as much of your work on Robin as possible. On Robin, you can do the following:

  • Compile for Phoenix.

  • Run pat_build to generate instrumented executables.

  • Submit batch jobs using qsub.

  • Run pat_report to generate performance reports.

  • Run app2 to visualize performance.

You cannot run aprun or pat_run directly on Robin, but you can submit jobs that in turn use these commands.
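The Robin-side steps above can be strung together roughly as follows. This is a dry-run sketch in POSIX sh that only prints the commands in order; it does not execute them. The function name profile_plan and the batch script name run_trace.pbs are hypothetical, and the batch script itself (which must call aprun or pat_run on the instrumented executable) is not shown. The ftn and pat_build -w forms are the ones used in the run-time library example later on this page.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands to run on Robin, in order.
# profile_plan and the batch script name are hypothetical placeholders.
profile_plan() {
  exe=$1   # executable to build, e.g. test.exe
  job=$2   # batch script that calls aprun or pat_run on $exe.trace
  cat <<EOF
ftn *.f -o $exe
pat_build -w $exe $exe.trace
qsub $job
pat_report <perf.file>.xf
EOF
}

profile_plan test.exe run_trace.pbs
```

The final pat_report step applies only when the batch job used aprun, which leaves a performance-data file ending in .xf; a job that used pat_run already produces its report in one step.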

pat_build

Builds an instrumented version of an executable.

> pat_build [options] <executable> <instrumented executable>

Supports

  • Fortran, C, C++, CAF, UPC

  • MPI, SHMEM, OpenMP, pThreads

Performance measurements

  • Sample-based (operating system [OS] interrupts or hardware [HW] counter overflow)

  • Trace-based

    • User functions

    • Application programming interface (API) for fine-grain instrumentation

    • Predefined function groups (mpi, shmem, io, etc.)

Source code mapping

  • Call stack

  • Line numbers

pat_run

User interface to simplify Cray PAT usage. Runs an instrumented executable and generates a report, all in one step.

The following executes a.out_instr and produces a report for the traced functions that counts floating-point operations (flops), calculates the rate in millions of floating-point operations per second (mflops), and determines the average number of results produced per vector operation (vl).

pat_run -O flops,mflops,vl a.out_instr

The following produces a load-balance report showing average versus maximum time per processor (based on wall-clock time) for an MPI program:

pat_run -O balance mpirun -np 4 a.out_instr

The -O option takes a comma-delimited list of keywords that specify the following:

  • Data to be recorded: cycles, flops, mflops, vl, etc.

  • How to record it: sample, overflow, trace.

  • Whether to show callers: callers, calltree.

  • Whether to show source file or line number: source, line.

  • Whether to show load balance: balance[.$data][.$by].

    • $data can be samples or time (default), cycles, etc.

    • $by can be pe (default), thread, or ssp.
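The balance[.$data][.$by] keyword can be composed mechanically, as the minimal sketch below shows. The helper name balance_keyword is hypothetical and is not part of Cray PAT. The sketch assumes that when only $by is given, $data must be written out explicitly, so it fills in the documented default (time). It prints the resulting pat_run command line rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: compose the load-balance keyword for pat_run -O.
# Usage: balance_keyword [data] [by]
balance_keyword() {
  bk_data=$1
  bk_by=$2
  # Assumption: $by is positional, so $data must be spelled out before it.
  if [ -n "$bk_by" ] && [ -z "$bk_data" ]; then
    bk_data=time
  fi
  bk_keyword=balance
  if [ -n "$bk_data" ]; then bk_keyword="$bk_keyword.$bk_data"; fi
  if [ -n "$bk_by" ]; then bk_keyword="$bk_keyword.$bk_by"; fi
  printf '%s\n' "$bk_keyword"
}

# Print (do not run) a command line that balances cycles by thread.
echo "pat_run -O $(balance_keyword cycles thread) mpirun -np 4 a.out_instr"
```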

Examples

To get a basic profile, run

pat_run -b [pe,]function:source,line [-s percent=relative] aprun -n1 <instrumented executable>

The output file contains lines such as the following:

 100.0% |    100.0% |  965 |Total
|-------------------------------------
|  88.2% |     88.2% |  851 |kron_matmull@module_kron_
||------------------------------------
||  40.4% |     40.4% |  344 |line.307
||  37.0% |     77.4% |  315 |line.297

To get a call tree, run

pat_run -b [pe,]function:source,calltree [-s percent=relative] aprun -n1 <instrumented executable>

pat_report

You can use aprun instead of pat_run on an instrumented executable, which will produce a performance-data file (ending in .xf). This file can then be processed into a human-readable text profile using the pat_report command.

Experiment Types

There are many types of performance experiments you can run. By default on Phoenix, the experiment type is profil, which has the lowest overhead. It samples the program counter by user and system CPU time. Another common experiment is samp_cs_time, which samples the call stack at a given time interval. This experiment returns the total program time and the absolute and relative times each call-stack counter was recorded but is otherwise identical to the samp_pc_time experiment. The samp_cs_time experiment is useful for obtaining profile reports that include call-tree or caller information.

See the pat man page for more information.

Run-Time Library

Use the PAT run-time library to get statistics on a specific region of code.

Example

program test_module_kron
use pat_api
…
! Begin region of interest
call PAT_region_begin ( 1, 'kron_matmul_kernel' ) ! ID number and name must be unique to each region
call kron_matmulL(…)
! End region of interest
call PAT_region_end   ( 1 )
end program

Compile

ftn *.f  -o test.exe

Relink

pat_build -w test.exe test.exe.trace

Run and produce a report

[setenv PAT_RT_RECORD_SSP 0-3]
pat_run -g normal [-b function,ssp=HIDE] aprun -n1 test.exe.trace

Apprentice2 Visualizer

Apprentice2 is designed to help you identify and correct the following:

  • Excessive communication

  • Network contention

  • Load imbalance

  • Excessive serialization

Supports

  • Call graph profile

  • Communication statistics

  • Timeline view

    • Communication

    • Input/output (I/O)

  • Activity view

  • Pairwise communication statistics

  • Text reports

  • Source code mapping

Apprentice2 (invoked with app2) takes as input an XML file. The input file is generated as follows:

pat_report -c records -f xml <perf.file>.xf > <perf.file>.xml
gzip <perf.file>.xml

The gzip step is optional but recommended because the XML files can get quite large.

Visualization is possible with both profiles and trace files, but Apprentice2 has less functionality with profiles. The following features are supported for profiles (run-time summaries):

  • Call graph view

  • Function statistics overview

  • Function report

  • PE breakdown

  • General information

Hardware Performance Counters

pat_hwpc

pat_hwpc collects hardware performance counter information for an application. No instrumentation is required. Usage is as follows:

  pat_hwpc [options] <executable>

pat_hwpc accepts various hardware counter groups and produces a report with raw counts and derived metrics for the whole execution. The hardware counters are summed across all threads in each process.

For example, to collect statistics on an entire execution with pat_hwpc:

  • No recompiling, relinking, or instrumentation is required.

  • You can just run the program and produce a report:

    pat_hwpc [-d P -d E] aprun -n1 test.exe

You may need to set PAT_HWPC_APPLACE_TIME to a value larger than the default of 5, such as 20 or 60, to be able to use pat_hwpc. Otherwise, you may get a message to the effect of “not able to start application.” This environment variable is the number of seconds allotted for the application to be scheduled for execution before pat_hwpc terminates.
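A minimal sketch of raising the limit before calling pat_hwpc, written in sh syntax. The value 60 is just one of the values suggested above, not a recommendation. Other examples on this page use csh-style setenv, so the csh equivalent is given in a comment.

```shell
#!/bin/sh
# Allow up to 60 seconds for the application to be scheduled.
# csh/tcsh equivalent: setenv PAT_HWPC_APPLACE_TIME 60
PAT_HWPC_APPLACE_TIME=60
export PAT_HWPC_APPLACE_TIME

# Confirm the setting that pat_hwpc will inherit from this shell.
echo "PAT_HWPC_APPLACE_TIME=$PAT_HWPC_APPLACE_TIME"
```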

A pat_hwpc report looks like the following:

Totals for program
------------------------------------------------------------------------
Cycles                      9.219 secs    3687579361 cycles
Instructions graduated    625.832M/sec    5769509199 instr
Branches & Jumps            6.543M/sec      60322302 instr
Branches mispredicted       0.292M/sec       2692264 misses   4.5%
Correctly predicted         6.251M/sec      57630038         95.5%
Vector instructions       348.059M/sec    3208733836 instr   55.6%
Scalar instructions       277.773M/sec    2560775363 instr   44.4%
Vector ops              15528.961M/sec  143160693833 ops
Vector FP adds           7088.614M/sec   65349570400 ops
Vector FP multiplies     7106.030M/sec   65510120760 ops
Vector FP divides etc       0.001M/sec         12708 ops
Vector FP misc             10.220M/sec      94213419 ops
Vector FP ops           14204.865M/sec  130953917287 ops     100.0%
Scalar FP ops               0.003M/sec         28177 ops      0.0%
Total  FP ops           14204.868M/sec  130953945464 ops
FP ops per load                                12.18
Scalar integer ops          4.053M/sec      37368261 ops
Scalar memory refs         82.084M/sec     756725277 refs     7.0%
Vector TLB misses             202 /sec          1867 misses
Scalar TLB misses              53 /sec           496 misses
Instr  TLB misses              37 /sec           347 misses
Total  TLB misses             293 /sec          2710 misses
Dcache references          81.956M/sec     755550421 refs
Dcache bypass refs          0.127M/sec       1174856 refs
Dcache misses               3.508M/sec      32337195 misses
Vector integer adds         0.074M/sec        681031 ops
Vector logical ops        172.972M/sec    1594618026 ops
Vector shifts              57.677M/sec     531720865 ops
Vector loads              966.865M/sec    8913480904 refs
Vector stores             117.717M/sec    1085230909 refs
Vector memory refs       1084.583M/sec    9998711813 refs    93.0%
Scalar memory refs         82.084M/sec     756725277 refs     7.0%
Total memory refs        1166.666M/sec   10755437090 refs
Average vector length                          44.62
A-reg Instr               173.494M/sec    1599433367 instr
Scalar FP Instr             0.003M/sec         28177 instr
Syncs Instr                 0.725M/sec       6682128 instr
Stall VLSU                  7.705 secs    3082157516 clks
Stall VU                    6.220 secs    2487845272 clks
Vector Load Alloc         522.613M/sec    4817941865 refs
Vector Load Index           8.710M/sec      80301038 refs
Vector Load Stride          0.007M/sec         61414 refs
Vector Store Alloc         65.338M/sec     602349028 refs
Vector Store Stride           798 /sec          7359 refs
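Several of the derived figures in the report above appear to be simple ratios of the raw counts. The sketch below recomputes three of them with awk, using counter values copied from the report. The function name derived_metrics is hypothetical, and the formulas are inferred from the numbers rather than taken from the pat_hwpc documentation; they do reproduce the printed values of 4.5%, 55.6%, and 44.62.

```shell
#!/bin/sh
# Recompute derived metrics from raw counts in the sample report.
# Formulas are inferred from the printed values, not from Cray documentation.
derived_metrics() {
  awk 'BEGIN {
    branches     = 60322302      # Branches & Jumps
    mispredicted = 2692264       # Branches mispredicted
    instructions = 5769509199    # Instructions graduated
    vector_instr = 3208733836    # Vector instructions
    vector_ops   = 143160693833  # Vector ops

    printf "mispredicted branches: %.1f%%\n", 100 * mispredicted / branches
    printf "vector instructions: %.1f%%\n", 100 * vector_instr / instructions
    printf "average vector length: %.2f\n", vector_ops / vector_instr
  }'
}

derived_metrics
```

Checking a report this way is a quick sanity test: a low average vector length or a low share of vector instructions suggests that the code is not vectorizing well.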