3399 lines
121 KiB
Text
3399 lines
121 KiB
Text
\input texinfo @c -*-texinfo-*-
|
|
|
|
@c ----------------------------------------------------------------------------
|
|
@c This is the Texinfo source file for the GPROFNG manual.
|
|
@c
|
|
@c Author: Ruud van der Pas
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@c %**start of header
|
|
|
|
@setfilename gprofng.info
|
|
@settitle GNU gprofng
|
|
|
|
@c -- Set the indent for the @example command to 1 space, not 5 ---------------
|
|
@exampleindent 1
|
|
|
|
@c %**end of header
|
|
|
|
@c -- Start a new chapter on a new, odd numbered, page ------------------------
|
|
@setchapternewpage odd
|
|
|
|
@c -- Merge all index entries into the Concepts Index -------------------------
|
|
@syncodeindex fn cp
|
|
@syncodeindex ky cp
|
|
@syncodeindex pg cp
|
|
@syncodeindex vr cp
|
|
|
|
@c -- Macro definitions -------------------------------------------------------
|
|
@c
|
|
@c Since only letters can be used, we use capitalization to distinguish
|
|
@c different words.
|
|
@c ----------------------------------------------------------------------------
|
|
@macro CollectApp{}
|
|
@command{gprofng collect app}
|
|
@end macro
|
|
|
|
@macro DisplayHTML{}
|
|
@command{gprofng display html}
|
|
@end macro
|
|
|
|
@macro DisplayText{}
|
|
@command{gprofng display text}
|
|
@end macro
|
|
|
|
@macro Driver{}
|
|
@command{gprofng}
|
|
@end macro
|
|
|
|
@macro ProductName{}
|
|
gprofng
|
|
@end macro
|
|
|
|
@macro ToolName{}
|
|
@command{gprofng}
|
|
@end macro
|
|
|
|
@macro IndexSubentry{label, string}
|
|
@c -- @cindex \label\ @subentry \string\
|
|
@cindex \label\, \string\
|
|
@end macro
|
|
|
|
@c -- Get the version information ---------------------------------------------
|
|
@include version.texi
|
|
|
|
@c -- Entry for the Info dir structure ----------------------------------------
|
|
@ifnottex
|
|
@dircategory Software development
|
|
@direntry
|
|
* gprofng: (gprofng). The next generation profiling tool for Linux
|
|
@end direntry
|
|
@end ifnottex
|
|
|
|
@c -- Copyright stuff ---------------------------------------------------------
|
|
@copying
|
|
This document is the manual for @ProductName{}, last updated @value{UPDATED}.
|
|
|
|
Copyright @copyright{} 2022 Free Software Foundation, Inc.
|
|
|
|
@c -- @quotation
|
|
Permission is granted to copy, distribute and/or modify this document
|
|
under the terms of the GNU Free Documentation License,
|
|
Version 1.3 or any later version published by the Free Software
|
|
Foundation; with no Invariant Sections, with no Front-Cover texts,
|
|
and with no Back-Cover Texts. A copy of the license is included in the
|
|
section entitled ``GNU Free Documentation License.''
|
|
|
|
@c -- @end quotation
|
|
@end copying
|
|
|
|
@finalout
|
|
@smallbook
|
|
|
|
@c -- Define the title page ---------------------------------------------------
|
|
@titlepage
|
|
@title GNU gprofng
|
|
@subtitle The next generation profiling tool for Linux
|
|
@subtitle version @value{VERSION} (last updated @value{UPDATED})
|
|
@author Ruud van der Pas
|
|
@page
|
|
@vskip 0pt plus 1filll
|
|
@insertcopying
|
|
@end titlepage
|
|
|
|
@c -- Generate the Table of Contents ------------------------------------------
|
|
@contents
|
|
|
|
@c -- The Top node ------------------------------------------------------------
|
|
@c Should contain a short summary, copying permissions and a master menu.
|
|
@c ----------------------------------------------------------------------------
|
|
@ifnottex
|
|
@node Top
|
|
@top GNU Gprofng
|
|
|
|
@insertcopying
|
|
@end ifnottex
|
|
|
|
@ifinfo
|
|
@c -- The menu entries --------------------------------------------------------
|
|
|
|
@menu
|
|
* Introduction:: About this manual.
|
|
* Overview:: A brief overview of @ProductName{}.
|
|
* A Mini Tutorial:: A short tutorial covering the key features.
|
|
* Terminology:: Various concepts and some terminology explained.
|
|
* Other Document Formats:: How to create this document in other formats.
|
|
* Index:: The index.
|
|
|
|
@detailmenu
|
|
|
|
--- The Detailed Node Listing ---
|
|
|
|
Introduction
|
|
|
|
Overview
|
|
|
|
* Main Features:: A high level overview.
|
|
* Sampling versus Tracing:: The pros and cons of sampling versus tracing.
|
|
* Steps Needed to Create a Profile:: How to create a profile.
|
|
|
|
A Mini Tutorial
|
|
|
|
* Getting Started:: The basics of profiling with @ProductName().
|
|
* Support for Multithreading:: Commands specific to multithreaded applications.
|
|
* Viewing Multiple Experiments:: Analyze multiple experiments.
|
|
* Profile Hardware Event Counters:: How to use hardware event counters.
|
|
* Java Profiling:: How to profile a Java application.
|
|
|
|
Terminology
|
|
|
|
* The Program Counter:: What is a Program Counter?
|
|
* Inclusive and Exclusive Metrics:: An explanation of inclusive and exclusive metrics.
|
|
* Metric Definitions:: Definitions associated with metrics.
|
|
* The Viewmode:: Select the way call stacks are presented.
|
|
* The Selection List:: How to define a selection.
|
|
* Load Objects and Functions:: The components in an application.
|
|
* The Concept of a CPU in @ProductName{}:: The definition of a CPU.
|
|
* Hardware Event Counters Explained:: What are event counters?
|
|
* apath:: Our generic definition of a path.
|
|
|
|
@c -- Index
|
|
|
|
@end detailmenu
|
|
@end menu
|
|
@end ifinfo
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Introduction
|
|
@chapter Introduction
|
|
@c ----------------------------------------------------------------------------
|
|
The @ProductName{} tool is the next generation profiler for Linux. It consists
|
|
of various commands to generate and display profile information.
|
|
|
|
This manual starts with a tutorial how to create and interpret a profile. This
|
|
part is highly practical and has the goal to get users up to speed as quickly
|
|
as possible. As soon as possible, we would like to show you how to get your
|
|
first profile on your screen.
|
|
|
|
This is followed by more examples, covering many of the features. At the
|
|
end of this tutorial, you should feel confident enough to tackle the more
|
|
complex tasks.
|
|
|
|
In a future update a more formal reference manual will be included as well.
|
|
Since even in this tutorial we use certain terminology, we have included a
|
|
chapter with descriptions at the end. In case you encounter unfamiliar
|
|
wordings or terminology, please check this chapter.
|
|
|
|
One word of caution. In several cases we had to somewhat tweak the screen
|
|
output in order to make it fit. This is why the output may look somewhat
|
|
different when you try things yourself.
|
|
|
|
For now, we wish you a smooth profiling experience with @ProductName{} and
|
|
good luck tackling performance bottlenecks.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@c cccccc @node A Brief Overview of @ProductName{}
|
|
@node Overview
|
|
@chapter A Brief Overview of @ProductName{}
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@menu
|
|
* Main Features:: A high level overview.
|
|
* Sampling versus Tracing:: The pros and cons of sampling versus tracing.
|
|
* Steps Needed to Create a Profile:: How to create a profile.
|
|
@end menu
|
|
|
|
Before we cover this tool in quite some detail, we start with a brief overview
|
|
of what it is, and the main features. Since we know that many of you would
|
|
like to get started rightaway, already in this first chapter we explain the
|
|
basics of profiling with @ToolName{}.
|
|
|
|
@c ----------------------------------------------------------------------------
|
|
@c TBD Review this text. Probably be more specific on the gcc releases and
|
|
@c processor specifics.
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Main Features
|
|
@section Main Features
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@noindent
|
|
These are the main features of the @ProductName{} tool:
|
|
|
|
@itemize @bullet
|
|
|
|
@item
|
|
Profiling is supported for an application written in C, C++, Java, or Scala.
|
|
|
|
@c TBD Java: up to 1.8 full support, support other than for modules
|
|
|
|
@item
|
|
Shared libraries are supported. The information is presented at the instruction
|
|
level.
|
|
|
|
@item
|
|
The following multithreading programming models are supported: Pthreads,
|
|
OpenMP, and Java threads.
|
|
|
|
@item
|
|
This tool works with unmodified production level executables. There is no need to
|
|
recompile the code, but if the @code{-g} option has been used when building
|
|
the application, source line level information is available.
|
|
|
|
@item
|
|
The focus is on support for code generated with the @code{gcc} compiler, but
|
|
there is some limited support for the @code{icc} compiler as well. Future
|
|
improvements and enhancements will focus on @code{gcc} though.
|
|
|
|
@item
|
|
Processors from Intel, AMD, and Arm are supported, but the level of support
|
|
depends on the architectural details. In particular, hardware event counters
|
|
may not be supported.
|
|
|
|
@item
|
|
Several views into the data are supported. For example, a function overview
|
|
where the time is spent, but also a source line, disassembly, call tree and
|
|
a caller-callees overview are available.
|
|
|
|
@item
|
|
Through filters, the user can zoom in on an area of interest.
|
|
|
|
@item
|
|
Two or more profiles can be aggregated, or used in a comparison. This comparison
|
|
can be obtained at the function, source line, and disassembly level.
|
|
|
|
@item
|
|
Through a scripting language, and customization of the metrics shown,
|
|
the generation and creation of a profile can be fully automated and provide
|
|
tailored output.
|
|
|
|
@end itemize
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Sampling versus Tracing
|
|
@section Sampling versus Tracing
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
A key difference with some other profiling tools is that the main data
|
|
collection command @CollectApp{} mostly uses
|
|
@cindex Program Counter sampling
|
|
@cindex PC sampling
|
|
Program Counter (PC) sampling
|
|
under the hood.
|
|
|
|
With @emph{sampling}, the executable is stopped at regular intervals. Each time
|
|
it is halted, key information is gathered and stored. This includes the Program
|
|
Counter that keeps track of where the execution is. Hence the name.
|
|
|
|
Together with operational
|
|
data, this information is stored in the experiment directory and can be
|
|
viewed in the second phase.
|
|
|
|
For example, the PC information is used to derive where the program was when
|
|
it was halted. Since the sampling interval is known, it is relatively easy to
|
|
derive how much time was spent in the various parts of the program.
|
|
|
|
The opposite technique is generally referred to as @emph{tracing}. With
|
|
tracing, the target is instrumented with specific calls that collect the
|
|
requested information.
|
|
|
|
These are some of the pros and cons of PC sampling verus tracing:
|
|
|
|
@itemize
|
|
|
|
@item
|
|
Since there is no need to recompile, existing executables can be used
|
|
and the profile measures the behaviour of exactly the same executable that is
|
|
used in production runs.
|
|
|
|
With sampling, one inherently profiles a different executable because
|
|
the calls to the instrumentation library may affect the compiler optimizations
|
|
and run time behaviour.
|
|
|
|
@item
|
|
With sampling, there are very few restrictions on what can be profiled and even without
|
|
access to the source code, a basic profile can be made.
|
|
|
|
@item
|
|
A downside of sampling is that, depending on the sampling frequency, small
|
|
functions may be missed or not captured accurately. Although this is rare,
|
|
this may happen and is the reason why the user has control over the sampling rate.
|
|
|
|
@item
|
|
While tracing produces precise information, sampling is statistical in nature.
|
|
As a result, small variations may occur across seemingly identical runs. We
|
|
have not observed more than a few percent deviation though. Especially if
|
|
the target job executed for a sufficiently long time.
|
|
|
|
@item
|
|
With sampling, it is not possible to get an accurate count how often
|
|
functions are called.
|
|
|
|
@end itemize
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Steps Needed to Create a Profile
|
|
@section Steps Needed to Create a Profile
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
Creating a profile takes two steps. First the profile data needs to be
|
|
generated. This is followed by a viewing step to create a report from the
|
|
information that has been gathered.
|
|
|
|
Every @ProductName{} command starts with @ToolName{}, the name of the driver. This is followed
|
|
by a keyword to define the high level functionality. Depending on this
|
|
keyword, a third qualifier may be needed to further narrow down the request.
|
|
This combination is then followed by options that are specific to the functionality
|
|
desired.
|
|
|
|
The command to gather, or ``collect'', the performance data is called
|
|
@CollectApp{}. Aside from numerous options, this command takes the name
|
|
of the target executable as an input parameter.
|
|
|
|
Upon completion of the run, the performance data can be
|
|
found in the newly created
|
|
@cindex Experiment directory
|
|
experiment directory.
|
|
|
|
Unless explicitly specified otherwise, a default
|
|
name for this directory is chosen. The name is @code{test.<n>.er} where
|
|
@code{n} is the first integer number not in use yet for such a name.
|
|
|
|
For example, the first time @CollectApp{} is invoked, an experiment
|
|
directory with the name @code{test.1.er} is created.
|
|
|
|
Upon a subsequent invocation of @CollectApp{} in the same directory,
|
|
an experiment directory with the name @code{test.2.er} will be created,
|
|
and so forth.
|
|
|
|
Note that @CollectApp{} supports an option to explicitly name the experiment directory.
|
|
Outside of the restriction that the name of this directory has to end
|
|
with @code{.er}, any valid directory name can be used for this.
|
|
|
|
Now that we have the performance data, the next step is to display it.
|
|
|
|
@pindex @DisplayText{}
|
|
The most commonly used command to view the performance information is
|
|
@DisplayText{}. This is a very extensive and customizable tool that
|
|
produces the information in ASCII format.
|
|
|
|
@pindex @DisplayHTML{}
|
|
Another option is to use @DisplayHTML{}. This tool generates a directory with
|
|
files in html format. These can be viewed in a browser, allowing for easy
|
|
navigation through the profile data.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node A Mini Tutorial
|
|
@chapter A Mini Tutorial
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
In this chapter we present and discuss the main functionality of @ToolName{}.
|
|
This will be a practical approach, using an example code to generate profile
|
|
data and show how to get various performance reports.
|
|
|
|
@menu
|
|
* Getting Started:: The basics of profiling with @ProductName().
|
|
* Support for Multithreading:: Commands specific to multithreaded applications.
|
|
* Viewing Multiple Experiments:: Analyze multiple experiments.
|
|
* Profile Hardware Event Counters:: How to use hardware event counters.
|
|
* Java Profiling:: How to profile a Java application.
|
|
@end menu
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Getting Started
|
|
@section Getting Started
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
The information presented here provides a good and common basis for many
|
|
profiling tasks, but there are more features that you may want to leverage.
|
|
|
|
These are covered in subsequent sections in this chapter.
|
|
|
|
@menu
|
|
* The Example Program:: A description of the example program used.
|
|
* A First Profile:: How to get the first profile.
|
|
* The Source Code View:: Display the metrics in the source code.
|
|
* The Disassembly View:: Display the metrics at the instruction level.
|
|
* Display and Define the Metrics:: An example how to customize the metrics.
|
|
* A First Customization of the Output:: An example how to customize the output.
|
|
* Name the Experiment Directory:: Change the name of the experiment directory.
|
|
* Control the Number of Lines in the Output:: Change the number of lines in the tables.
|
|
* Sorting the Performance Data:: How to set the metric to sort by.
|
|
* Scripting:: Use a script to execute the commands.
|
|
* A More Elaborate Example:: An example of customization.
|
|
* The Call Tree:: Display the dynamic call tree.
|
|
* More Information on the Experiment:: How to get additional statistics.
|
|
* Control the Sampling Frequency:: How to control the sampling granularity.
|
|
* Information on Load Objects:: How to get more information on load objects.
|
|
@end menu
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node The Example Program
|
|
@subsection The Example Program
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
Throughout this guide we use the same example C code that implements the
|
|
multiplication of a vector of length @math{n} by an @math{m} by @math{n}
|
|
matrix. The result is stored in a vector of length @math{m}.
|
|
@cindex Pthreads
|
|
@cindex Posix Threads
|
|
The algorithm has been parallelized using Posix Threads, or Pthreads for short.
|
|
|
|
The code was built using the @code{gcc} compiler and the name of the executable
|
|
is
|
|
@cindex mxv-pthreads.exe
|
|
mxv-pthreads.exe.
|
|
|
|
The matrix sizes can be set through the @code{-m} and @code{-n} options. The
|
|
number of threads is set with the @code{-t} option. To increase the duration
|
|
of the run, the multiplication is executed repeatedly.
|
|
|
|
This is an example that multiplies a @math{3000} by @math{2000} matrix with
|
|
a vector of length @math{2000} using @math{2} threads:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
$ ./mxv-pthreads.exe -m 3000 -n 2000 -t 2
|
|
mxv: error check passed - rows = 3000 columns = 2000 threads = 2
|
|
$
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
The program performs an internal check to verify the results are correct.
|
|
The result of this check is printed, followed by the matrix sizes and the
|
|
number of threads used.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node A First Profile
|
|
@subsection A First Profile
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
The first step is to collect the performance data. It is important to remember
|
|
that much more information is gathered than may be shown by default. Often a
|
|
single data collection run is sufficient to get a lot of insight.
|
|
|
|
The @CollectApp{} command is used for the data collection. Nothing needs to be
|
|
changed in the way the application is executed. The only difference is that it
|
|
is now run under control of the tool, as shown below:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng collect app ./mxv.pthreads.exe -m 3000 -n 2000 -t 1
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
This command produces the following output:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Creating experiment database test.1.er (Process ID: 2416504) ...
|
|
mxv: error check passed - rows = 3000 columns = 2000 threads = 1
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
We see the message that a directory with the name @code{test.1.er}
|
|
has been created.
|
|
The application then completes as usual and we have our first experiment
|
|
directory that can be analyzed.
|
|
|
|
The tool we use for this is called @DisplayText{}. It takes the name of
|
|
the experiment directory as an argument.
|
|
|
|
@cindex Interpreter mode
|
|
If invoked this way, the tool starts in the interactive @emph{interpreter} mode.
|
|
While in this environment, commands can be given and the tool responds. This is
|
|
illustrated below:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
$ gprofng display text test.1.er
|
|
Warning: History and command editing is not supported on this system.
|
|
(gp-display-text) quit
|
|
$
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
@cindex Command line mode
|
|
While useful in certain cases, we prefer to use this tool in command line mode,
|
|
by specifying the commands to be issued when invoking the tool. The way to do
|
|
this is to prepend the command with a hyphen (@code{-}) if used on the command
|
|
line.
|
|
|
|
For example,
|
|
@IndexSubentry{Commands, @code{functions}}
|
|
with the @code{functions} command we request a list of the functions that
|
|
have been executed and their respective CPU times:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -functions test.1.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@smallexample
|
|
@verbatim
|
|
$ gprofng display text -functions test.1.er
|
|
Functions sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Incl. Name
|
|
Total Total
|
|
CPU sec. CPU sec.
|
|
2.272 2.272 <Total>
|
|
2.160 2.160 mxv_core
|
|
0.047 0.103 init_data
|
|
0.030 0.043 erand48_r
|
|
0.013 0.013 __drand48_iterate
|
|
0.013 0.056 drand48
|
|
0.008 0.010 _int_malloc
|
|
0.001 0.001 brk
|
|
0.001 0.002 sysmalloc
|
|
0. 0.001 __default_morecore
|
|
0. 0.113 __libc_start_main
|
|
0. 0.010 allocate_data
|
|
0. 2.160 collector_root
|
|
0. 2.160 driver_mxv
|
|
0. 0.113 main
|
|
0. 0.010 malloc
|
|
0. 0.001 sbrk
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
As easy and simple as these steps are, we do have a first profile of our program!
|
|
There are three columns. The first two contain the
|
|
@cindex Total CPU time
|
|
@emph{Total CPU Time},
|
|
which
|
|
is the sum of the user and system time. @xref{Inclusive and Exclusive Metrics}
|
|
for an explanation of ``exclusive'' and ``inclusive'' times.
|
|
|
|
The first line echoes the metric that is used to sort the output. By default, this
|
|
is the exclusive CPU time, but the sort metric can be changed by the user.
|
|
|
|
We then see three columns with the exclusive and inclusive CPU times, plus the
|
|
name of the function.
|
|
|
|
@IndexSubentry{Miscellaneous, @code{<Total>}}
|
|
The function with the name @code{<Total>} is not a user function, but is introduced
|
|
by @ToolName{} and is used to display the accumulated metric values. In this case,
|
|
we see that the total CPU time of this job was @code{2.272} seconds.
|
|
|
|
With @code{2.160} seconds, function @code{mxv_core} is the most time
|
|
consuming function. It is also a leaf function.
|
|
|
|
The next function in the list is @code{init_data}. Although the CPU time spent in
|
|
this part is negligible, this is an interesting entry because the inclusive CPU
|
|
time of @code{0.103} seconds is higher than the exclusive CPU time of @code{0.047}
|
|
seconds. Clearly it is calling another function,
|
|
or even more than one function.
|
|
@xref{The Call Tree} for the details how to get more information on this.
|
|
|
|
The function @code{collector_root} does not look familiar. It is one of the internal
|
|
functions used by @CollectApp{} and can be ignored. While the inclusive time is high,
|
|
the exclusive time is zero. This means it doesn't contribute to the performance.
|
|
|
|
The question is how we know where this function originates from? There is a very useful
|
|
command to get more details on a function. @xref{Information on Load Objects}.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node The Source Code View
|
|
@subsection The Source Code View
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
In general, you would like to focus the tuning efforts on the most time
|
|
consuming part(s) of the program. In this case that is easy, since 2.160
|
|
seconds on a total of 2.272 seconds is spent in function @code{mxv_core}.
|
|
That is 95% of the total and it is time to dig deeper and look
|
|
@cindex Source level timings
|
|
at the time distribution at the source code level.
|
|
|
|
@IndexSubentry{Commands, @code{source}}
|
|
The @code{source} command is used to accomplish this. It takes the name of the
|
|
function, not the source filename, as an argument. This is demonstrated
|
|
below, where the @DisplayText{} command is used to show the annotated
|
|
source listing of function @code{mxv_core}.
|
|
|
|
Please note that the source code has to be compiled with the @code{-g}
|
|
option in order for the source code feature to work. Otherwise the
|
|
location can not be determined.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -source mxv_core test.1.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
The slightly modified output is as follows:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Source file: <apath>/mxv.c
|
|
Object file: mxv-pthreads.exe (found as test.1.er/archives/...)
|
|
Load Object: mxv-pthreads.exe (found as test.1.er/archives/...)
|
|
|
|
Excl. Incl.
|
|
Total Total
|
|
CPU sec. CPU sec.
|
|
|
|
<lines deleted>
|
|
<Function: mxv_core>
|
|
0. 0. 32. void __attribute__ ((noinline))
|
|
mxv_core (
|
|
uint64_t row_index_start,
|
|
uint64_t row_index_end,
|
|
uint64_t m, uint64_t n,
|
|
double **restrict A,
|
|
double *restrict b,
|
|
double *restrict c)
|
|
0. 0. 33. {
|
|
0. 0. 34. for (uint64_t i=row_index_start;
|
|
i<=row_index_end; i++) {
|
|
0. 0. 35. double row_sum = 0.0;
|
|
## 1.687 1.687 36. for (int64_t j=0; j<n; j++)
|
|
0.473 0.473 37. row_sum += A[i][j]*b[j];
|
|
0. 0. 38. c[i] = row_sum;
|
|
39. }
|
|
0. 0. 40. }
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
The first three lines provide information on the location of the source file,
|
|
the object file and the load object (@xref{Load Objects and Functions}).
|
|
|
|
Function @code{mxv_core} is part of a source file that has other functions
|
|
as well. These functions will be shown, but without timing information. They
|
|
have been removed in the output shown above.
|
|
|
|
This is followed by the annotated source code listing. The selected metrics
|
|
are shown first, followed by a source line number, and the source code.
|
|
@IndexSubentry{Miscellaneous ,@code{##}}
|
|
The most time consuming line(s) are marked with the @code{##} symbol. In
|
|
this way they are easier to find.
|
|
|
|
What we see is that all of the time is spent in lines 36-37.
|
|
|
|
@IndexSubentry{Commands, @code{lines}}
|
|
A related command sometimes comes handy as well. It is called @code{lines}
|
|
and displays a list of the source lines and their metrics, ordered according
|
|
to the current sort metric (@xref{Sorting the Performance Data}).
|
|
|
|
Below the command and the output. For lay-out reasons, only the top 10 is
|
|
shown here and the last part of the text on some lines has been replaced
|
|
by dots.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -lines test.1.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Lines sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Incl. Name
|
|
Total Total
|
|
CPU sec. CPU sec.
|
|
2.272 2.272 <Total>
|
|
1.687 1.687 mxv_core, line 36 in "mxv.c"
|
|
0.473 0.473 mxv_core, line 37 in "mxv.c"
|
|
0.032 0.088 init_data, line 72 in "manage_data.c"
|
|
0.030 0.043 <Function: erand48_r, instructions without line numbers>
|
|
0.013 0.013 <Function: __drand48_iterate, instructions without ...>
|
|
0.013 0.056 <Function: drand48, instructions without line numbers>
|
|
0.012 0.012 init_data, line 77 in "manage_data.c"
|
|
0.008 0.010 <Function: _int_malloc, instructions without ...>
|
|
0.003 0.003 init_data, line 71 in "manage_data.c"
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
What this overview immediately highlights is that the next most time consuming
|
|
source line takes 0.032 seconds only. With an inclusive time of 0.088 seconds,
|
|
it is also clear that this branch of the code does not impact the performance.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node The Disassembly View
|
|
@subsection The Disassembly View
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
The source view is very useful to obtain more insight where the time is spent,
|
|
but sometimes this is not sufficient. This is when the disassembly view comes
|
|
in. It is activated with the
|
|
@IndexSubentry{Commands, @code{disasm}}
|
|
@code{disasm}
|
|
command and as with the source view, it displays an annotated listing. In this
|
|
@cindex Instruction level timings
|
|
case it shows the instructions with the metrics, interleaved with the
|
|
source lines. The
|
|
instructions have a reference in square brackets (@code{[} and @code{]})
|
|
to the source line they correspond to.
|
|
|
|
@noindent
|
|
This is what we get for our example:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -disasm mxv_core test.1.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Source file: <apath>/mxv.c
|
|
Object file: mxv-pthreads.exe (found as test.1.er/archives/...)
|
|
Load Object: mxv-pthreads.exe (found as test.1.er/archives/...)
|
|
|
|
Excl. Incl.
|
|
Total Total
|
|
CPU sec. CPU sec.
|
|
|
|
<lines deleted>
|
|
32. void __attribute__ ((noinline))
|
|
mxv_core (
|
|
uint64_t row_index_start,
|
|
uint64_t row_index_end,
|
|
uint64_t m, uint64_t n,
|
|
double **restrict A,
|
|
double *restrict b,
|
|
double *restrict c)
|
|
33. {
|
|
<Function: mxv_core>
|
|
0. 0. [33] 4021ba: mov 0x8(%rsp),%r10
|
|
34. for (uint64_t i=row_index_start;
|
|
i<=row_index_end; i++) {
|
|
0. 0. [34] 4021bf: cmp %rsi,%rdi
|
|
0. 0. [34] 4021c2: jbe 0x37
|
|
0. 0. [34] 4021c4: ret
|
|
35. double row_sum = 0.0;
|
|
36. for (int64_t j=0; j<n; j++)
|
|
37. row_sum += A[i][j]*b[j];
|
|
0. 0. [37] 4021c5: mov (%r8,%rdi,8),%rdx
|
|
0. 0. [36] 4021c9: mov $0x0,%eax
|
|
0. 0. [35] 4021ce: pxor %xmm1,%xmm1
|
|
0.002 0.002 [37] 4021d2: movsd (%rdx,%rax,8),%xmm0
|
|
0.096 0.096 [37] 4021d7: mulsd (%r9,%rax,8),%xmm0
|
|
0.375 0.375 [37] 4021dd: addsd %xmm0,%xmm1
|
|
## 1.683 1.683 [36] 4021e1: add $0x1,%rax
|
|
0.004 0.004 [36] 4021e5: cmp %rax,%rcx
|
|
0. 0. [36] 4021e8: jne 0xffffffffffffffea
|
|
38. c[i] = row_sum;
|
|
0. 0. [38] 4021ea: movsd %xmm1,(%r10,%rdi,8)
|
|
0. 0. [34] 4021f0: add $0x1,%rdi
|
|
0. 0. [34] 4021f4: cmp %rdi,%rsi
|
|
0. 0. [34] 4021f7: jb 0xd
|
|
0. 0. [35] 4021f9: pxor %xmm1,%xmm1
|
|
0. 0. [36] 4021fd: test %rcx,%rcx
|
|
0. 0. [36] 402200: jne 0xffffffffffffffc5
|
|
0. 0. [36] 402202: jmp 0xffffffffffffffe8
|
|
39. }
|
|
40. }
|
|
0. 0. [40] 402204: ret
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
For each instruction, the timing values are given and we can exactly which ones
|
|
are the most expensive. As with the source level view, the most expensive
|
|
instructions are market with the @code{##} symbol.
|
|
|
|
As illustrated below and similar to the @code{lines} command, we can get
|
|
an overview of the instructions executed by using the
|
|
@IndexSubentry{Commands, @code{pcs}}
|
|
@code{pcs}
|
|
command.
|
|
|
|
@noindent
|
|
Below the command and the output, which again has been restricted
|
|
to 10 lines:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -pcs test.1.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@smallexample
|
|
@verbatim
|
|
PCs sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Incl. Name
|
|
Total Total
|
|
CPU sec. CPU sec.
|
|
2.272 2.272 <Total>
|
|
1.683 1.683 mxv_core + 0x00000027, line 36 in "mxv.c"
|
|
0.375 0.375 mxv_core + 0x00000023, line 37 in "mxv.c"
|
|
0.096 0.096 mxv_core + 0x0000001D, line 37 in "mxv.c"
|
|
0.027 0.027 init_data + 0x000000BD, line 72 in "manage_data.c"
|
|
0.012 0.012 init_data + 0x00000117, line 77 in "manage_data.c"
|
|
0.008 0.008 _int_malloc + 0x00000A45
|
|
0.007 0.007 erand48_r + 0x00000062
|
|
0.006 0.006 drand48 + 0x00000000
|
|
0.005 0.005 __drand48_iterate + 0x00000005
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Display and Define the Metrics
|
|
@subsection Display and Define the Metrics
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
The default metrics shown by @DisplayText{} are useful, but there is more
|
|
recorded than displayed. We can customize the values shown by defining the
|
|
metrics ourselves.
|
|
|
|
@IndexSubentry{Commands, @code{metric_list}}
|
|
There are two commands related to changing the metrics shown: @code{metric_list}
|
|
and
|
|
@IndexSubentry{Commands, @code{metrics}}
|
|
@code{metrics}.
|
|
|
|
The first command shows the metrics in use, plus all the metrics that have
|
|
been stored as part of the experiment. The second command may be used to
|
|
define the metric list.
|
|
|
|
In our example we get the following values for the metrics:
|
|
|
|
@IndexSubentry{Commands, @code{metric_list}}
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -metric_list test.1.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Current metrics: e.totalcpu:i.totalcpu:name
|
|
Current Sort Metric: Exclusive Total CPU Time ( e.totalcpu )
|
|
Available metrics:
|
|
Exclusive Total CPU Time: e.%totalcpu
|
|
Inclusive Total CPU Time: i.%totalcpu
|
|
Size: size
|
|
PC Address: address
|
|
Name: name
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
This shows the metrics currently in use, the metric that is used to sort
|
|
the data and all the metrics that have been recorded, but are not necessarily
|
|
shown.
|
|
|
|
@cindex Default metrics
|
|
In this case, the default metrics are set to the exclusive and inclusive
|
|
total CPU times, plus the name of the function, or load object.
|
|
|
|
@IndexSubentry{Commands, @code{metrics}}
|
|
The @code{metrics} command is used to define the metrics that need to be
|
|
displayed.
|
|
|
|
For example, to display the exclusive total CPU time, both as a number and a
|
|
percentage, use the following metric definition: @code{e.%totalcpu}
|
|
|
|
Since the metrics can be tailored for different views, there is a way
|
|
to reset them to the default. This is done through the special keyword
|
|
@code{default}.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node A First Customization of the Output
|
|
@subsection A First Customization of the Output
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
With the information just given, we can customize the function overview.
|
|
For sake of the example, we would like to display the name of the function
|
|
first, followed by the exclusive CPU time, given as an absolute number and
|
|
a percentage.
|
|
|
|
Note that the commands are parsed in order of appearance. This is why we
|
|
need to define the metrics @emph{before} requesting the function overview:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -metrics name:e.%totalcpu -functions test.1.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Current metrics: name:e.%totalcpu
|
|
Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
|
|
Functions sorted by metric: Exclusive Total CPU Time
|
|
|
|
Name Excl. Total
|
|
CPU
|
|
sec. %
|
|
<Total> 2.272 100.00
|
|
mxv_core 2.160 95.04
|
|
init_data 0.047 2.06
|
|
erand48_r 0.030 1.32
|
|
__drand48_iterate 0.013 0.57
|
|
drand48 0.013 0.57
|
|
_int_malloc 0.008 0.35
|
|
brk 0.001 0.04
|
|
sysmalloc 0.001 0.04
|
|
__default_morecore 0. 0.
|
|
__libc_start_main 0. 0.
|
|
allocate_data 0. 0.
|
|
collector_root 0. 0.
|
|
driver_mxv 0. 0.
|
|
main 0. 0.
|
|
malloc 0. 0.
|
|
sbrk 0. 0.
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
This was a first and simple example how to customize the output. Note that we
|
|
did not rerun our profiling job and merely modified the display settings.
|
|
Below we will show other and also more advanced examples of customization.
|
|
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Name the Experiment Directory
|
|
@subsection Name the Experiment Directory
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
When using @CollectApp{}, the default names for experiments work fine, but
|
|
they are quite generic. It is often more convenient to select a more
|
|
descriptive name. For example, one that reflects conditions for the experiment
|
|
conducted.
|
|
|
|
For this, the mutually exclusive @code{-o} and @code{-O} options come in handy.
|
|
Both may be used to provide a name for the experiment directory, but the
|
|
behaviour of @CollectApp{} is different.
|
|
|
|
With the
|
|
@IndexSubentry{Options, @code{-o}}
|
|
@code{-o}
|
|
option, an existing experiment directory is not overwritten. You either
|
|
need to explicitly remove an existing directory first, or use a name that is not
|
|
in use yet.
|
|
|
|
This is in contrast with the behaviour for the
|
|
@IndexSubentry{Options, @code{-O}}
|
|
@code{-O}
|
|
option. Any existing (experiment) directory with the same name is silently
|
|
overwritten.
|
|
|
|
Be aware that the name of the experiment directory has to end with @code{.er}.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Control the Number of Lines in the Output
|
|
@subsection Control the Number of Lines in the Output
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@IndexSubentry{Commands, @code{limit}}
|
|
The @code{limit <n>} command can be used to control the number of lines printed
|
|
in various overviews, including the function view, but it also takes effect
|
|
for other display commands, like @code{lines}.
|
|
|
|
The argument @code{<n>} should be a positive integer number. It sets the number
|
|
of lines in the function view. A value of zero resets the limit to the default.
|
|
|
|
Be aware that the pseudo-function @code{<Total>} counts as a regular function.
|
|
For example @code{limit 10} displays nine user level functions.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Sorting the Performance Data
|
|
@subsection Sorting the Performance Data
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@IndexSubentry{Commands, @code{sort}}
|
|
The @code{sort <key>} command sets the key to be used when sorting the
|
|
performance data.
|
|
|
|
The key is a valid metric definition, but the
|
|
@cindex Visibility field
|
|
visibility field
|
|
(@xref{Metric Definitions})
|
|
in the metric
|
|
definition is ignored since this does not affect the outcome of the sorting
|
|
operation.
|
|
For example if we set the sort key to @code{e.totalcpu}, the values
|
|
will be sorted in descending order with respect to the exclusive total
|
|
CPU time.
|
|
|
|
The data can be sorted in reverse order by prepending the metric definition
|
|
with a minus (@code{-}) sign. For example @code{sort -e.totalcpu}.
|
|
|
|
A default metric for the sort operation has been defined and since this is
|
|
a persistent command, this default can be restored with @code{default} as
|
|
the key.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Scripting
|
|
@subsection Scripting
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
As is probably clear by now, the list with commands for @DisplayText{} can be
|
|
very long. This is tedious and also error prone. Luckily, there is an easier and
|
|
more elegant way to control the behaviour of this tool.
|
|
|
|
@IndexSubentry{Commands, @code{script}}
|
|
Through the @code{script} command, the name of a file with commands can be
|
|
passed in. These commands are parsed and executed as if they appeared on
|
|
the command line in the same order as encountered in the file. The commands
|
|
in this script file can actually be mixed with commands on the command line.
|
|
|
|
The difference between the commands in the script file and those used on the
|
|
command line is that the latter require a leading dash (@code{-}) symbol.
|
|
|
|
Comment lines are supported. They need to start with the @code{#} symbol.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node A More Elaborate Example
|
|
@subsection A More Elaborate Example
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
With the information presented so far, we can customize our data
|
|
gathering and display commands.
|
|
|
|
As an example, to reflect the name of the algorithm and the number of threads
|
|
that were used in the experiment, we select @code{mxv.1.thr.er}
|
|
as the name of the experiment directory.
|
|
All we then need to
|
|
do is to add the
|
|
@IndexSubentry{Options, @code{-O}}
|
|
@code{-O}
|
|
option followed by this name on the command line when running @CollectApp{}:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ exe=mxv-pthreads.exe
|
|
$ m=3000
|
|
$ n=2000
|
|
$ gprofng collect app -O mxv.1.thr.er ./$exe -m $m -n $n -t 1
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
The commands to generate the profile are put into a file that we simply call
|
|
@code{my-script}:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
$ cat my-script
|
|
# This is my first gprofng script
|
|
# Set the metrics
|
|
metrics i.%totalcpu:e.%totalcpu:name
|
|
# Use the exclusive time to sort
|
|
sort e.totalcpu
|
|
# Limit the function list to 5 lines
|
|
limit 5
|
|
# Show the function list
|
|
functions
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
This script file is then specified as input to the @DisplayText{} command
|
|
that is used to display the performance information stored in
|
|
@code{mxv.1.thr.er}:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -script my-script mxv.1.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
The command above produces the following output:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
# This is my first gprofng script
|
|
# Set the metrics
|
|
Current metrics: i.%totalcpu:e.%totalcpu:name
|
|
Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
|
|
# Use the exclusive time to sort
|
|
Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
|
|
# Limit the function list to 5 lines
|
|
Print limit set to 5
|
|
# Show the function list
|
|
Functions sorted by metric: Exclusive Total CPU Time
|
|
|
|
Incl. Total Excl. Total Name
|
|
CPU CPU
|
|
sec. % sec. %
|
|
2.272 100.00 2.272 100.00 <Total>
|
|
2.159 95.00 2.159 95.00 mxv_core
|
|
0.102 4.48 0.054 2.37 init_data
|
|
0.035 1.54 0.025 1.10 erand48_r
|
|
0.048 2.11 0.013 0.57 drand48
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
In the first part of the output, our comment lines in the script file are
|
|
shown. These are interleaved with an acknowledgement message for the commands.
|
|
|
|
This is followed by a profile consisting of 5 lines only. For both metrics,
|
|
the percentages plus the timings are given. The numbers are sorted with respect
|
|
to the exclusive total CPU time.
|
|
|
|
It is now immediately clear that function @code{mxv_core} is responsbile for
|
|
95% of the CPU time and @code{init_data} takes 4.5% only.
|
|
|
|
This is also where we see sampling in action. Although this is exactly the
|
|
same job we profiled before, the timings are somewhat different, but the
|
|
differences are very small.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node The Call Tree
|
|
@subsection The Call Tree
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
The call tree shows the dynamic hierarchy of the application by displaying the
|
|
functions executed and their parent. It helps to find the most expensive path
|
|
in the program.
|
|
|
|
@IndexSubentry{Commands, @code{calltree}}
|
|
This feature is enabled through the @code{calltree} command. This is how to get
|
|
this tree for our current experiment:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -calltree mxv.1.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
This displays the following structure:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Functions Call Tree. Metric: Attributed Total CPU Time
|
|
|
|
Attr. Name
|
|
Total
|
|
CPU sec.
|
|
2.272 +-<Total>
|
|
2.159 +-collector_root
|
|
2.159 | +-driver_mxv
|
|
2.159 | +-mxv_core
|
|
0.114 +-__libc_start_main
|
|
0.114 +-main
|
|
0.102 +-init_data
|
|
0.048 | +-drand48
|
|
0.035 | +-erand48_r
|
|
0.010 | +-__drand48_iterate
|
|
0.011 +-allocate_data
|
|
0.011 | +-malloc
|
|
0.011 | +-_int_malloc
|
|
0.001 | +-sysmalloc
|
|
0.001 +-check_results
|
|
0.001 +-malloc
|
|
0.001 +-_int_malloc
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
At first sight this may not be what you expected and some explanation is in
|
|
place.
|
|
|
|
@c ----------------------------------------------------------------------------
|
|
@c TBD: Revise this text when we have user and machine mode.
|
|
@c ----------------------------------------------------------------------------
|
|
First of all, function @code{collector_root} is internal to @ToolName{} and
|
|
should be hidden to the user. This is part of a planned future enhancement.
|
|
|
|
Recall that the @code{objects} and @code{fsingle} commands are very useful
|
|
to find out more about load objects in general, but also to help identify
|
|
an unknown entry in the function overview. @xref{Load Objects and Functions}.
|
|
|
|
Another thing to note is that there are two main branches. The one under
|
|
@code{collector_root} and the second one under @code{__libc_start_main}.
|
|
This reflects the fact that we are executing a parallel program. Even though
|
|
we only used one thread for this run, this is still executed in a separate
|
|
path.
|
|
|
|
The main, sequential part of the program is displayed under @code{main} and
|
|
shows the functions called and the time they took.
|
|
|
|
There are two things worth noting for the call tree feature:
|
|
|
|
@itemize
|
|
|
|
@item
|
|
This is a dynamic tree and since sampling is used, it most likely looks
|
|
slighlty different across seemingly identical profile runs. In case the
|
|
run times are short, it is worth considering to use a high resolution
|
|
through the
|
|
@IndexSubentry{Options, @code{-p}}
|
|
@code{-p}
|
|
option. For example to use @code{-p hi} to increase the sampling rate.
|
|
|
|
@item
|
|
In case hardware event counters have been enabled
|
|
(@xref{Profile Hardware Event Counters}), these values are also displayed
|
|
in the call tree view.
|
|
|
|
@end itemize
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node More Information on the Experiment
|
|
@subsection More Information on the Experiment
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
The experiment directory not only contains performance related data. Several
|
|
system characteristics, the actually command executed, and some global
|
|
performance statistics can be displayed.
|
|
|
|
@IndexSubentry{Commands, @code{header}}
|
|
The @code{header} command displays information about the experiment(s).
|
|
For example, this is the command to extract this data from for our experiment
|
|
directory:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -header mxv.1.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
The above command prints the following information. Note that some of the
|
|
lay-out and the information has been modified. The textual changes are
|
|
marked with the @code{<} and @code{>} symbols.
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Experiment: mxv.1.thr.er
|
|
No errors
|
|
No warnings
|
|
Archive command `gp-archive -n -a on
|
|
--outfile <exp_dir>/archive.log <exp_dir>'
|
|
|
|
Target command (64-bit): './mxv-pthreads.exe -m 3000 -n 2000 -t 1'
|
|
Process pid 30591, ppid 30589, pgrp 30551, sid 30468
|
|
Current working directory: <cwd>
|
|
Collector version: `2.36.50'; experiment version 12.4 (64-bit)
|
|
Host `<hostname>', OS `Linux <version>', page size 4096,
|
|
architecture `x86_64'
|
|
16 CPUs, clock speed 1995 MHz.
|
|
Memory: 30871514 pages @ 4096 = 120591 MB.
|
|
Data collection parameters:
|
|
Clock-profiling, interval = 997 microsecs.
|
|
Periodic sampling, 1 secs.
|
|
Follow descendant processes from: fork|exec|combo
|
|
|
|
Experiment started <date and time>
|
|
|
|
Experiment Ended: 2.293162658
|
|
Data Collection Duration: 2.293162658
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
The output above may assist in troubleshooting, or to verify some of the
|
|
operational conditions and we recommand to include this command when
|
|
generating a profile.
|
|
|
|
@IndexSubentry{Options, @code{-C}}
|
|
Related to this command there is a useful option to record your own comment(s) in
|
|
an experiment.
|
|
To this end, use the @code{-C} option on the @CollectApp{} tool to
|
|
specify a comment string. Up to ten comment lines can be included.
|
|
These comments are displayed with the @code{header} command on
|
|
the @DisplayText{} tool.
|
|
|
|
@IndexSubentry{Commands, @code{overview}}
|
|
The @code{overview} command displays information on the experiment(s) and also
|
|
shows a summary of the values for the metric(s) used. This is an example how to
|
|
use it on our newly created experiment directory:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -overview mxv.1.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Experiment(s):
|
|
|
|
Experiment :mxv.1.thr.er
|
|
Target : './mxv-pthreads.exe -m 3000 -n 2000 -t 1'
|
|
Host : <hostname> (<ISA>, Linux <version>)
|
|
Start Time : <date and time>
|
|
Duration : 2.293 Seconds
|
|
|
|
Metrics:
|
|
|
|
Experiment Duration (Seconds): [2.293]
|
|
Clock Profiling
|
|
[X]Total CPU Time - totalcpu (Seconds): [*2.272]
|
|
|
|
Notes: '*' indicates hot metrics, '[X]' indicates currently enabled
|
|
metrics.
|
|
The metrics command can be used to change selections. The
|
|
metric_list command lists all available metrics.
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
This command provides a dashboard overview that helps to easily identify
|
|
where the time is spent and in case hardware event counters are used, it
|
|
shows their total values.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Control the Sampling Frequency
|
|
@subsection Control the Sampling Frequency
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
So far we did not talk about the frequency of the sampling process, but in
|
|
some cases it is useful to change the default of 10 milliseconds.
|
|
|
|
The advantage of increasing the sampling frequency is that functions that
|
|
do not take much time per invocation are more accurately captured. The
|
|
downside is that more data is gathered. This has an impact on the overhead
|
|
of the collection process and more disk space is required.
|
|
|
|
In general this is not an immediate concern, but with heavily threaded
|
|
applications that run for an extended period of time, increasing the
|
|
frequency may have a more noticeable impact.
|
|
|
|
@IndexSubentry{Options, @code{-p}}
|
|
The @code{-p} option on the @CollectApp{} tool is used to enable or disable
|
|
clock based profiling, or to explicitly set the sampling rate.
|
|
@cindex Sampling interval
|
|
This option takes one of the following keywords:
|
|
|
|
@table @code
|
|
|
|
@item off
|
|
Disable clock based profiling.
|
|
|
|
@item on
|
|
Enable clock based profiling with a per thread sampling interval of 10 ms. This is the default.
|
|
|
|
@item lo
|
|
Enable clock based profiling with a per thread sampling interval of 100 ms.
|
|
|
|
@item hi
|
|
Enable clock based profiling with a per thread sampling interval of 1 ms.
|
|
|
|
@item <value>
|
|
Enable clock based profiling with a per thread sampling interval of <value>.
|
|
|
|
@end table
|
|
|
|
One may wonder why there is an option to disable clock based profiling. This
|
|
is because by default, it is enabled when conducting hardware event counter
|
|
experiments (@xref{Profile Hardware Event Counters}).
|
|
With the @code{-p off} option, this can be disabled.
|
|
|
|
If an explicit value is set for the sampling, the number can be an integer or a
|
|
floating-point number.
|
|
A suffix of @code{u} for microseconds, or @code{m} for milliseconds is supported.
|
|
If no suffix is used, the value is assumed to be in milliseconds.
|
|
|
|
If the value is smaller than the clock profiling minimum, a warning message is issued
|
|
and it is set to the minimum.
|
|
In case it is not a multiple of the clock profiling resolution, it is silently rounded
|
|
down to the nearest multiple of the clock resolution.
|
|
|
|
If the value exceeds the clock profiling maximum, is negative, or zero, an error is
|
|
reported.
|
|
|
|
@IndexSubentry{Commands, @code{header}}
|
|
Note that the @code{header} command echoes the sampling rate used.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Information on Load Objects
|
|
@subsection Information on Load Objects
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
It may happen that the function list contains a function that is not known to
|
|
the user. This can easily happen with library functions for example.
|
|
Luckily there are three commands that come in handy then.
|
|
|
|
@IndexSubentry{Commands, @code{objects}}
|
|
@IndexSubentry{Commands, @code{fsingle}}
|
|
@IndexSubentry{Commands, @code{fsummary}}
|
|
These commands are @code{objects}, @code{fsingle}, and @code{fsummary}.
|
|
They provide details on
|
|
@cindex Load objects
|
|
load objects (@xref{Load Objects and Functions}).
|
|
|
|
The @code{objects} command lists all load objects that have been referenced
|
|
during the performance experiment.
|
|
Below we show the command and the result for our profile job. Like before,
|
|
the (long) path names in the output have been shortened and replaced by the
|
|
@IndexSubentry{Miscellaneous, @code{<apath>}}
|
|
@code{<apath>} symbol that represents an absolute directory path.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -objects mxv.1.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
The output includes the name and path of the target executable:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
<Unknown> (<Unknown>)
|
|
<mxv-pthreads.exe> (<apath>/mxv-pthreads.exe)
|
|
<librt-2.17.so> (/usr/lib64/librt-2.17.so)
|
|
<libdl-2.17.so> (/usr/lib64/libdl-2.17.so)
|
|
<libbfd-2.36.50.20210505.so> (<apath>/libbfd-2.36.50 <etc>)
|
|
<libopcodes-2.36.50.20210505.so> (<apath>/libopcodes-2. <etc>)
|
|
<libc-2.17.so> (/usr/lib64/libc-2.17.so)
|
|
<libpthread-2.17.so> (/usr/lib64/libpthread-2.17.so)
|
|
<libm-2.17.so> (/usr/lib64/libm-2.17.so)
|
|
<libgp-collector.so> (<apath>/libgp-collector.so)
|
|
<ld-2.17.so> (/usr/lib64/ld-2.17.so)
|
|
<DYNAMIC_FUNCTIONS> (DYNAMIC_FUNCTIONS)
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
@IndexSubentry{Commands, @code{fsingle}}
|
|
The @code{fsingle} command may be used to get more details on a specific entry
|
|
in the function view, say. For example, the command below provides additional
|
|
information on the @code{collector_root} function shown in the function overview.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -fsingle collector_root mxv.1.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
Below the output from this command. It has been somewhat modified to match the
|
|
display requirements.
|
|
|
|
@smallexample
|
|
@verbatim
|
|
collector_root
|
|
Exclusive Total CPU Time: 0. ( 0. %)
|
|
Inclusive Total CPU Time: 2.159 ( 95.0%)
|
|
Size: 401
|
|
PC Address: 10:0x0001db60
|
|
Source File: <apath>/dispatcher.c
|
|
Object File: mxv.1.thr.er/archives/libgp-collector.so_HpzZ6wMR-3b
|
|
Load Object: <apath>/libgp-collector.so
|
|
Mangled Name:
|
|
Aliases:
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
In this table we not only see how much time was spent in this function, we
|
|
also see where it originates from. In addition to this, the size and start
|
|
address are given as well. If the source code location is known it is also
|
|
shown here.
|
|
|
|
@IndexSubentry{Commands, @code{fsummary}}
|
|
The related @code{fsummary} command displays the same information as
|
|
@code{fsingle}, but for all functions in the function overview,
|
|
including @code{<Total>}:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -fsummary mxv.1.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Functions sorted by metric: Exclusive Total CPU Time
|
|
|
|
<Total>
|
|
Exclusive Total CPU Time: 2.272 (100.0%)
|
|
Inclusive Total CPU Time: 2.272 (100.0%)
|
|
Size: 0
|
|
PC Address: 1:0x00000000
|
|
Source File: (unknown)
|
|
Object File: (unknown)
|
|
Load Object: <Total>
|
|
Mangled Name:
|
|
Aliases:
|
|
|
|
mxv_core
|
|
Exclusive Total CPU Time: 2.159 ( 95.0%)
|
|
Inclusive Total CPU Time: 2.159 ( 95.0%)
|
|
Size: 75
|
|
PC Address: 2:0x000021ba
|
|
Source File: <apath>/mxv.c
|
|
Object File: mxv.1.thr.er/archives/mxv-pthreads.exe_hRxWdccbJPc
|
|
Load Object: <apath>/mxv-pthreads.exe
|
|
Mangled Name:
|
|
Aliases:
|
|
|
|
... etc ...
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Support for Multithreading
|
|
@section Support for Multithreading
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
In this chapter we introduce and discuss the support for multithreading. As
|
|
is shown below, nothing needs to be changed when collecting the performance
|
|
data.
|
|
|
|
The difference is that additional commands are available to get more
|
|
information on the parallel environment, plus that several filters allow
|
|
the user to zoom in on specific threads.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Creating a Multithreading Experiment
|
|
@subsection Creating a Multithreading Experiment
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
We demonstrate the support for multithreading using the same code and settings
|
|
as before, but this time we use 2 threads:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ exe=mxv-pthreads.exe
|
|
$ m=3000
|
|
$ n=2000
|
|
$ gprofng collect app -O mxv.2.thr.er ./$exe -m $m -n $n -t 2
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
First of all, note that we did not change anything, other than setting the
|
|
number of threads to 2. Nothing special is needed to profile a multithreaded
|
|
job when using @ToolName{}.
|
|
|
|
The same is true when displaying the performance results. The same commands
|
|
that we used before work unmodified. For example, this is all that is needed to
|
|
get a function overview:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gpprofng display text -limit 10 -functions mxv.2.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
This produces the following familiar looking output:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Print limit set to 10
|
|
Functions sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Incl. Name
|
|
Total Total
|
|
CPU sec. CPU sec.
|
|
2.268 2.268 <Total>
|
|
2.155 2.155 mxv_core
|
|
0.044 0.103 init_data
|
|
0.030 0.046 erand48_r
|
|
0.016 0.016 __drand48_iterate
|
|
0.013 0.059 drand48
|
|
0.008 0.011 _int_malloc
|
|
0.003 0.003 brk
|
|
0. 0.003 __default_morecore
|
|
0. 0.114 __libc_start_main
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Commands Specific to Multithreading
|
|
@subsection Commands Specific to Multithreading
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
The function overview shown above shows the results aggregated over all the
|
|
threads. The interesting new element is that we can also look at the
|
|
performance data for the individual threads.
|
|
|
|
@IndexSubentry{Commands, @code{thread_list}}
|
|
The @code{thread_list} command displays how many threads have been used:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -thread_list mxv.2.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
This produces the following output, showing that three threads have
|
|
been used:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Exp Sel Total
|
|
=== === =====
|
|
1 all 3
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
The output confirms there is one experiment and that by default all
|
|
threads are selected.
|
|
|
|
It may seem surprising to see three threads here, since we used the
|
|
@code{-t 2} option, but it is common for a Pthreads program to use one
|
|
additional thread. This is typically the thread that runs from start to
|
|
finish and handles the sequential portions of the code, as well as takes
|
|
care of managing the threads.
|
|
|
|
It is no different in our example code. At some point, the main thread
|
|
creates and activates the two threads that perform the multiplication
|
|
of the matrix with the vector. Upon completion of this computation,
|
|
the main thread continues.
|
|
|
|
@IndexSubentry{Commands, @code{threads}}
|
|
The @code{threads} command is simple, yet very powerful. It shows the
|
|
total value of the metrics for each thread. To make it easier to
|
|
interpret the data, we modify the metrics to include percentages:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -metrics e.%totalcpu -threads mxv.2.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
The command above produces the following overview:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Current metrics: e.%totalcpu:name
|
|
Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
|
|
Objects sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Total Name
|
|
CPU
|
|
sec. %
|
|
2.258 100.00 <Total>
|
|
1.075 47.59 Process 1, Thread 3
|
|
1.070 47.37 Process 1, Thread 2
|
|
0.114 5.03 Process 1, Thread 1
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
The first line gives the total CPU time accumulated over the threads
|
|
selected. This is followed by the metric value(s) for each thread.
|
|
|
|
From this it is clear that the main thread is responsible for 5% of
|
|
the total CPU time, while the other two threads take 47% each.
|
|
|
|
This view is ideally suited to verify if there any load balancing
|
|
issues and also to find the most time consuming thread(s).
|
|
|
|
@IndexSubentry{Filters, Thread selection}
|
|
While useful, often more information than this is needed. This is
|
|
@IndexSubentry{Commands, @code{thread_select}}
|
|
where the thread selection filter comes in. Through the @code{thread_select}
|
|
command, one or more threads may be selected
|
|
(@xref{The Selection List} how to define the selection list).
|
|
|
|
Since it is most common to use this command in a script, we do so as
|
|
well here. Below the script we are using:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
# Define the metrics
|
|
metrics e.%totalcpu
|
|
# Limit the output to 10 lines
|
|
limit 10
|
|
# Get the function overview for thread 1
|
|
thread_select 1
|
|
functions
|
|
# Get the function overview for thread 2
|
|
thread_select 2
|
|
functions
|
|
# Get the function overview for thread 3
|
|
thread_select 3
|
|
functions
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
The definition of the metrics and the output limiter has been shown and
|
|
explained before and will be ignored. The new command we focus on is
|
|
@IndexSubentry{Commands, @code{thread_select}}
|
|
@code{thread_select}.
|
|
|
|
This command takes a list (@xref{The Selection List}) to select specific
|
|
threads. In this case we simply use the individual thread numbers that we
|
|
obtained with the @code{thread_list} command earlier.
|
|
|
|
This restricts the output of the @code{functions} command to the thread
|
|
number(s) specified. This means that the script above shows which
|
|
function(s) each thread executes and how much CPU time they consumed.
|
|
Both the timings and their percentages are given.
|
|
|
|
This is the relevant part of the output for the first thread:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
# Get the function overview for thread 1
|
|
Exp Sel Total
|
|
=== === =====
|
|
1 1 3
|
|
Functions sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Total Name
|
|
CPU
|
|
sec. %
|
|
0.114 100.00 <Total>
|
|
0.051 44.74 init_data
|
|
0.028 24.56 erand48_r
|
|
0.017 14.91 __drand48_iterate
|
|
0.010 8.77 _int_malloc
|
|
0.008 7.02 drand48
|
|
0. 0. __libc_start_main
|
|
0. 0. allocate_data
|
|
0. 0. main
|
|
0. 0. malloc
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
As usual, the comment lines are echoed. This is followed by a confirmation
|
|
of our selection. We see that indeed thread 1 has been selected. What is
|
|
displayed next is the function overview for this particular thread. Due to
|
|
the @code{limit 10} command, there are ten entries in this list.
|
|
|
|
Below are the overviews for threads 2 and 3 respectively. We see that all
|
|
of the CPU time is spent in function @code{mxv_core} and that this time
|
|
is approximately the same for both threads.
|
|
|
|
@smallexample
|
|
@verbatim
|
|
# Get the function overview for thread 2
|
|
Exp Sel Total
|
|
=== === =====
|
|
1 2 3
|
|
Functions sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Total Name
|
|
CPU
|
|
sec. %
|
|
1.072 100.00 <Total>
|
|
1.072 100.00 mxv_core
|
|
0. 0. collector_root
|
|
0. 0. driver_mxv
|
|
|
|
# Get the function overview for thread 3
|
|
Exp Sel Total
|
|
=== === =====
|
|
1 3 3
|
|
Functions sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Total Name
|
|
CPU
|
|
sec. %
|
|
1.076 100.00 <Total>
|
|
1.076 100.00 mxv_core
|
|
0. 0. collector_root
|
|
0. 0. driver_mxv
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
When analyzing the performance of a multithreaded application, it is sometimes
|
|
useful to know whether threads have mostly executed on the same core, say, or
|
|
if they have wandered across multiple cores. This sort of stickiness is usually
|
|
referred to as
|
|
@cindex Thread affinity
|
|
@emph{thread affinity}.
|
|
|
|
Similar to the commands for the threads, there are several commands related
|
|
to the usage of the cores, or @emph{CPUs} as they are called in @ToolName{}
|
|
(@xref{The Concept of a CPU in @ProductName{}}).
|
|
|
|
In order to have some more interesting data to look at, we created a new
|
|
experiment, this time using 8 threads:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ exe=mxv-pthreads.exe
|
|
$ m=3000
|
|
$ n=2000
|
|
$ gprofng collect app -O mxv.8.thr.er ./$exe -m $m -n $n -t 8
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@IndexSubentry{Commands, @code{cpu_list}}
|
|
Similar to the @code{thread_list} command, the @code{cpu_list} command
|
|
displays how many CPUs have been used.
|
|
@IndexSubentry{Commands, @code{cpus}}
|
|
The equivalent of the @code{threads} threads command, is the @code{cpus}
|
|
command, which shows the CPU numbers that were used and how much time was
|
|
spent on each of them. Both are demonstrated below.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -metrics e.%totalcpu -cpu_list -cpus mxv.8.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
This command produces the following output:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Current metrics: e.%totalcpu:name
|
|
Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
|
|
Exp Sel Total
|
|
=== === =====
|
|
1 all 10
|
|
Objects sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Total Name
|
|
CPU
|
|
sec. %
|
|
2.310 100.00 <Total>
|
|
0.286 12.39 CPU 7
|
|
0.284 12.30 CPU 13
|
|
0.282 12.21 CPU 5
|
|
0.280 12.13 CPU 14
|
|
0.266 11.52 CPU 9
|
|
0.265 11.48 CPU 2
|
|
0.264 11.44 CPU 11
|
|
0.194 8.42 CPU 0
|
|
0.114 4.92 CPU 1
|
|
0.074 3.19 CPU 15
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
@c ----------------------------------------------------------------------------
|
|
@c TBD - Ruud
|
|
@c I'd like to improve this and have a way to see where a thread has executed.
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
What we see in this table is that a total of 10 CPUs have been used. This is
|
|
followed by a list with all the CPU numbers that have been used during the
|
|
run. For each CPU it is shown how much time was spent on it.
|
|
|
|
While the table with thread times shown earlier may point at a load imbalance
|
|
in the application, this overview has a different purpose.
|
|
|
|
For example, we see that 10 CPUs have been used, but we know that the
|
|
application uses 9 threads only.
|
|
This means that at least one thread has executed on more than one CPU. In
|
|
itself this is not something to worry about, but warrants a deeper
|
|
investigation.
|
|
|
|
Honesty dictates that next we performed a pre-analysis to find out
|
|
which thread(s) have been running on more than one CPU. We found this
|
|
to be thread 7. It has executed on CPUs 0 and 15.
|
|
|
|
With this knowledge, we wrote the script shown below. It zooms in on
|
|
the behaviour of thread 7.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
# Define the metrics
|
|
metrics e.%totalcpu
|
|
# Limit the output to 10 lines
|
|
limit 10
|
|
functions
|
|
# Get the function overview for CPU 0
|
|
cpu_select 0
|
|
functions
|
|
# Get the function overview for CPU 15
|
|
cpu_select 15
|
|
functions
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
From the earlier shown threads overview, we know that thread 7 has
|
|
used @code{0.268} seconds of CPU time..
|
|
|
|
By selecting CPUs 0 and 15, respectively, we get the following
|
|
function overviews:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
# Get the function overview for CPU 0
|
|
Exp Sel Total
|
|
=== === =====
|
|
1 0 10
|
|
Functions sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Total Name
|
|
CPU
|
|
sec. %
|
|
0.194 100.00 <Total>
|
|
0.194 100.00 mxv_core
|
|
0. 0. collector_root
|
|
0. 0. driver_mxv
|
|
|
|
# Get the function overview for CPU 15
|
|
Exp Sel Total
|
|
=== === =====
|
|
1 15 10
|
|
Functions sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Total Name
|
|
CPU
|
|
sec. %
|
|
0.074 100.00 <Total>
|
|
0.074 100.00 mxv_core
|
|
0. 0. collector_root
|
|
0. 0. driver_mxv
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
This shows that thread 7 spent @code{0.194} seconds on CPU 0 and
|
|
@code{0.074} seconds on CPU 15.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Viewing Multiple Experiments
|
|
@section Viewing Multiple Experiments
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
One thing we did not cover sofar is that @ToolName{} fully supports the analysis
|
|
of multiple experiments. The @DisplayText{} tool accepts a list of experiments.
|
|
The data can either be aggregated across the experiments, or used in a
|
|
comparison.
|
|
|
|
Mention @code{experiment_list}
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Aggregation of Experiments
|
|
@subsection Aggregation of Experiments
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
By default, the data for multiple experiments is aggregrated and the display
|
|
commands shows these combined results.
|
|
|
|
For example, we can aggregate the data for our single and dual thread
|
|
experiments. Below is the script we used for this:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
# Define the metrics
|
|
metrics e.%totalcpu
|
|
# Limit the output to 10 lines
|
|
limit 10
|
|
# Get the list with experiments
|
|
experiment_list
|
|
# Get the function overview
|
|
functions
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@IndexSubentry{Commands, @code{experiment_list}}
|
|
With the exception of the @code{experiment_list} command, all commands
|
|
used have been discussed earlier.
|
|
|
|
The @code{experiment_list} command provides a list of the experiments
|
|
that have been loaded. This is is used to verify we are looking at the
|
|
experiments we intend to aggregate.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -script my-script-agg mxv.1.thr.er mxv.2.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
With the command above, we get the following output:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
# Define the metrics
|
|
Current metrics: e.%totalcpu:name
|
|
Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu )
|
|
# Limit the output to 10 lines
|
|
Print limit set to 10
|
|
# Get the list with experiments
|
|
ID Sel PID Experiment
|
|
== === ===== ============
|
|
1 yes 30591 mxv.1.thr.er
|
|
2 yes 11629 mxv.2.thr.er
|
|
# Get the function overview
|
|
Functions sorted by metric: Exclusive Total CPU Time
|
|
|
|
Excl. Total Name
|
|
CPU
|
|
sec. %
|
|
4.533 100.00 <Total>
|
|
4.306 94.99 mxv_core
|
|
0.105 2.31 init_data
|
|
0.053 1.17 erand48_r
|
|
0.027 0.59 __drand48_iterate
|
|
0.021 0.46 _int_malloc
|
|
0.021 0.46 drand48
|
|
0.001 0.02 sysmalloc
|
|
0. 0. __libc_start_main
|
|
0. 0. allocate_data
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
The first five lines should look familiar. The five lines following, echo
|
|
the comment line in the script and show the overview of the experiments.
|
|
This confirms two experiments have been loaded and that both are active.
|
|
|
|
This is followed by the function overview. The timings have been summed
|
|
up and the percentages are adjusted accordingly. For example, the total
|
|
accumulated time is indeed 2.272 + 2.261 = 4.533 seconds.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Comparison of Experiments
|
|
@subsection Comparison of Experiments
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
The support for multiple experiments really shines in comparison mode. This
|
|
feature is enabled through the command
|
|
@IndexSubentry{Commands, @code{compare on/off}}
|
|
@code{compare on}
|
|
and is disabled
|
|
by setting
|
|
@code{compare off}.
|
|
|
|
@cindex Compare experiments
|
|
In comparison mode, the data for the various experiments is shown side by
|
|
side, as illustrated below where we compare the results for the multithreaded
|
|
experiments using one and two threads respectively:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -compare on -functions mxv.1.thr.er mxv.2.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@noindent
|
|
This produces the following output:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Functions sorted by metric: Exclusive Total CPU Time
|
|
|
|
mxv.1.thr.er mxv.2.thr.er mxv.1.thr.er mxv.2.thr.er
|
|
Excl. Total Excl. Total Incl. Total Incl. Total Name
|
|
CPU CPU CPU CPU
|
|
sec. sec. sec. sec.
|
|
2.272 2.261 2.272 2.261 <Total>
|
|
2.159 2.148 2.159 2.148 mxv_core
|
|
0.054 0.051 0.102 0.104 init_data
|
|
0.025 0.028 0.035 0.045 erand48_r
|
|
0.013 0.008 0.048 0.053 drand48
|
|
0.011 0.010 0.012 0.010 _int_malloc
|
|
0.010 0.017 0.010 0.017 __drand48_iterate
|
|
0.001 0. 0.001 0. sysmalloc
|
|
0. 0. 0.114 0.114 __libc_start_main
|
|
0. 0. 0.011 0.010 allocate_data
|
|
0. 0. 0.001 0. check_results
|
|
0. 0. 2.159 2.148 collector_root
|
|
0. 0. 2.159 2.148 driver_mxv
|
|
0. 0. 0.114 0.114 main
|
|
0. 0. 0.012 0.010 malloc
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
This table is already helpful to more easily compare (two) profiles, but
|
|
there is more that we can do here.
|
|
|
|
By default, in comparison mode, all measured values are shown. Often
|
|
profiling is about comparing performance data. It is therefore
|
|
more useful to look at differences, or ratios, using one experiment as
|
|
a reference.
|
|
|
|
The values shown are relative to this difference. For example if a ratio
|
|
is below one, it means the reference value was higher.
|
|
|
|
@IndexSubentry{Commands, @code{compare on/off}}
|
|
This feature is supported on the @code{compare} command. In addition to @code{on},
|
|
or @code{off}, this command also supports
|
|
@IndexSubentry{Commands, @code{compare delta}}
|
|
@code{delta}, or
|
|
@IndexSubentry{Commands, @code{compare ratio}}
|
|
@code{ratio}.
|
|
|
|
Usage of one of these two keywords enables the comparison feature and shows
|
|
either the difference, or the ratio, relative to the reference data.
|
|
|
|
In the example below, we use the same two experiments used in the comparison
|
|
above, but as before, the number of lines is restricted to 10 and we focus on
|
|
the exclusive timings plus percentages. For the comparison part we are
|
|
interested in the differences.
|
|
|
|
This is the script that produces such an overview:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
# Define the metrics
|
|
metrics e.%totalcpu
|
|
# Limit the output to 10 lines
|
|
limit 10
|
|
# Set the comparison mode to differences
|
|
compare delta
|
|
# Get the function overview
|
|
functions
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
Assuming this script file is called @code{my-script-comp}, this is how we
|
|
get the table displayed on our screen:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -script my-script-comp mxv.1.thr.er mxv.2.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
Leaving out some of the lines printed, but we have seen before, we get
|
|
the following table:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
mxv.1.thr.er mxv.2.thr.er
|
|
Excl. Total Excl. Total Name
|
|
CPU CPU
|
|
sec. % delta %
|
|
2.272 100.00 -0.011 100.00 <Total>
|
|
2.159 95.00 -0.011 94.97 mxv_core
|
|
0.054 2.37 -0.003 2.25 init_data
|
|
0.025 1.10 +0.003 1.23 erand48_r
|
|
0.013 0.57 -0.005 0.35 drand48
|
|
0.011 0.48 -0.001 0.44 _int_malloc
|
|
0.010 0.44 +0.007 0.75 __drand48_iterate
|
|
0.001 0.04 -0.001 0. sysmalloc
|
|
0. 0. +0. 0. __libc_start_main
|
|
0. 0. +0. 0. allocate_data
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
It is now easy to see that the CPU times for the most time consuming
|
|
functions in this code are practically the same.
|
|
|
|
While in this case we used the delta as a comparison,
|
|
|
|
Note that the comparison feature is supported at the function, source, and
|
|
disassembly level. There is no practical limit on the number of experiments
|
|
that can be used in a comparison.
|
|
|
|
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Profile Hardware Event Counters
|
|
@section Profile Hardware Event Counters
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
Many processors provide a set of hardware event counters and @ToolName{}
|
|
provides support for this feature.
|
|
@xref{Hardware Event Counters Explained} for those readers that are not
|
|
familiar with such counters and like to learn more.
|
|
|
|
In this section we explain how to get the details on the event counter
|
|
support for the processor used in the experiment(s), and show several
|
|
examples.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Getting Information on the Counters Supported
|
|
@subsection Getting Information on the Counters Supported
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
The first step is to check if the processor used for the experiments is
|
|
supported by @ToolName{}.
|
|
|
|
@IndexSubentry{Options, @code{-h}}
|
|
The @code{-h} option on @CollectApp{} will show the event counter
|
|
information:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng collect app -h
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
In case the counters are supported, a list with the events is printed.
|
|
Otherwise, a warning message will be issued.
|
|
|
|
For example, below we show this command and the output on an Intel Xeon
|
|
Platinum 8167M (aka ``Skylake'') processor. The output has been split
|
|
into several sections and each section is commented upon separately.
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Run "gprofng collect app --help" for a usage message.
|
|
|
|
Specifying HW counters on `Intel Arch PerfMon v2 on Family 6 Model 85'
|
|
(cpuver=2499):
|
|
|
|
-h {auto|lo|on|hi}
|
|
turn on default set of HW counters at the specified rate
|
|
-h <ctr_def> [-h <ctr_def>]...
|
|
-h <ctr_def>[,<ctr_def>]...
|
|
specify HW counter profiling for up to 4 HW counters
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
The first line shows how to get a usage overview. This is followed by
|
|
some information on the target processor.
|
|
|
|
The next five lines explain in what ways the @code{-h} option can be
|
|
used to define the events to be monitored.
|
|
|
|
The first version shown above enables a default set of counters. This
|
|
default depends on the processor this command is executed on. The
|
|
keyword following the @code{-h} option defines the sampling rate:
|
|
|
|
@table @code
|
|
|
|
@item auto
|
|
Match the sample rate of used by clock profiling. If the latter is disabled,
|
|
Use a per thread sampling rate of approximately 100 samples per second.
|
|
This setting is the default and preferred.
|
|
|
|
@item on
|
|
Use a per thread sampling rate of approximately 100 samples per second.
|
|
|
|
@item lo
|
|
Use a per thread sampling rate of approximately 10 samples per second.
|
|
|
|
@item hi
|
|
Use a per thread sampling rate of approximately 1000 samples per second.
|
|
|
|
@end table
|
|
|
|
The second and third variant define the events to be monitored. Note
|
|
that the number of simultaneous events supported is printed. In this
|
|
case we can monitor four events in a single profiling job.
|
|
|
|
It is a matter of preference whether you like to use the @code{-h}
|
|
option for each event, or use it once, followed by a comma separated
|
|
list.
|
|
|
|
There is one slight catch though. The counter definition below has
|
|
mandatory comma (@code{,}) between the event and the rate. While a
|
|
default can be used for the rate, the comma cannot be omitted.
|
|
This may result in a somewhat awkward counter definition in case
|
|
the default sampling rate is used.
|
|
|
|
For example, the following two commands are equivalent. Note
|
|
the double comma in the second command. This is not a typo.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng collect app -h cycles -h insts ...
|
|
$ gprofng collect app -h cycles,,insts ...
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
In the first command this comma is not needed, because a
|
|
comma (``@code{,}'') immediately followed by white space may
|
|
be omitted.
|
|
|
|
This is why we prefer the this syntax and in the remainder will
|
|
use the first version of this command.
|
|
|
|
@IndexSubentry{Hardware event counters, counter definition}
|
|
The counter definition takes an event name, plus optionally one or
|
|
more attributes, followed by a comma, and optionally the sampling rate.
|
|
The output section below shows the formal definition.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
<ctr_def> == <ctr>[[~<attr>=<val>]...],[<rate>]
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
The printed help then explains this syntax. Below we have summarized
|
|
and expanded this output:
|
|
|
|
@table @code
|
|
|
|
@item <ctr>
|
|
The counter name must be selected from the available counters listed
|
|
as part of the output printed with the @code{-h} option.
|
|
On most systems, if a counter is not listed, it may still be specified
|
|
by its numeric value.
|
|
|
|
@item ~<attr>=<val>
|
|
This is an optional attribute that depends on the processor. The list
|
|
of supported attributes is printed in the output. Examples of
|
|
attributes are ``user'', or ``system''. The value can given in decimal
|
|
or hexadecimal format.
|
|
Multiple attributes may be specified, and each must be preceded
|
|
by a ~.
|
|
|
|
@item <rate>
|
|
|
|
The sampling rate is one of the following:
|
|
|
|
@table @code
|
|
|
|
@item auto
|
|
This is the default and matches the rate used by clock profiling.
|
|
If clock profiling is disabled, use @code{on}.
|
|
|
|
@item on
|
|
Set the per thread maximum sampling rate to ~100 samples/second
|
|
|
|
@item lo
|
|
Set the per thread maximum sampling rate to ~10 samples/second
|
|
|
|
@item hi
|
|
Set the per thread maximum sampling rate to ~1000 samples/second
|
|
|
|
@item <interval>
|
|
Define the sampling interval.
|
|
@xref{Control the Sampling Frequency} how to define this.
|
|
|
|
@end table
|
|
|
|
@end table
|
|
|
|
After the section with the formal definition of events and counters, a
|
|
processor specific list is displayed. This part starts with an overview
|
|
of the default set of counters and the aliased names supported
|
|
@emph{on this specific processor}.
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Default set of HW counters:
|
|
|
|
-h cycles,,insts,,llm
|
|
|
|
Aliases for most useful HW counters:
|
|
|
|
alias raw name type units regs description
|
|
|
|
cycles unhalted-core-cycles CPU-cycles 0123 CPU Cycles
|
|
insts instruction-retired events 0123 Instructions Executed
|
|
llm llc-misses events 0123 Last-Level Cache Misses
|
|
br_msp branch-misses-retired events 0123 Branch Mispredict
|
|
br_ins branch-instruction-retired events 0123 Branch Instructions
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
The definitions given above may or may not be available on other processors,
|
|
but we try to maximize the overlap across alias sets.
|
|
|
|
The table above shows the default set of counters defined for this processor,
|
|
and the aliases. For each alias the full ``raw'' name is given, plus the
|
|
unit of the number returned by the counter (CPU cycles, or a raw count),
|
|
the hardware counter the event is allowed to be mapped onto, and a short
|
|
description.
|
|
|
|
The last part of the output contains all the events that can be monitored:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Raw HW counters:
|
|
|
|
name type units regs description
|
|
|
|
unhalted-core-cycles CPU-cycles 0123
|
|
unhalted-reference-cycles events 0123
|
|
instruction-retired events 0123
|
|
llc-reference events 0123
|
|
llc-misses events 0123
|
|
branch-instruction-retired events 0123
|
|
branch-misses-retired events 0123
|
|
ld_blocks.store_forward events 0123
|
|
ld_blocks.no_sr events 0123
|
|
ld_blocks_partial.address_alias events 0123
|
|
dtlb_load_misses.miss_causes_a_walk events 0123
|
|
dtlb_load_misses.walk_completed_4k events 0123
|
|
|
|
<many lines deleted>
|
|
|
|
l2_lines_out.silent events 0123
|
|
l2_lines_out.non_silent events 0123
|
|
l2_lines_out.useless_hwpf events 0123
|
|
sq_misc.split_lock events 0123
|
|
|
|
See Chapter 19 of the "Intel 64 and IA-32 Architectures Software
|
|
Developer's Manual Volume 3B: System Programming Guide"
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
As can be seen, these names are not always easy to correlate to a specific
|
|
event of interest. The processor manual should provide more clarity on this.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Examples Using Hardware Event Counters
|
|
@subsection Examples Using Hardware Event Counters
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
The previous section may give the impression that these counters are hard to
|
|
use, but as we will show now, in practice it is quite simple.
|
|
|
|
With the information from the @code{-h} option, we can easily set up our first
|
|
event counter experiment.
|
|
|
|
We start by using the default set of counters defined for our processor and we
|
|
use 2 threads:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ exe=mxv-pthreads.exe
|
|
$ m=3000
|
|
$ n=2000
|
|
$ exp=mxv.hwc.def.2.thr.er
|
|
$ gprofng collect app -O $exp -h auto ./$exe -m $m -n $n -t 2
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@IndexSubentry{Options, @code{-h}}
|
|
@IndexSubentry{Hardware event counters, @code{auto} option}
|
|
The new option here is @code{-h auto}. The @code{auto} keyword enables
|
|
hardware event counter profiling and selects the default set of counters
|
|
defined for this processor.
|
|
|
|
As before, we can display the information, but there is one practical hurdle
|
|
to take. Unless we like to view all metrics recorded, we would need to know
|
|
the names of the events that have been enabled. This is tedious and also not
|
|
portable in case we would like to repeat this experiment on another processor.
|
|
|
|
@IndexSubentry{Hardware event counters, @code{hwc} metric}
|
|
This is where the special @code{hwc} metric comes very handy. It
|
|
automatically expands to the active set of events used.
|
|
|
|
With this, it is very easy to display the event counter values. Note that
|
|
although the regular clock based profiling was enabled, we only want to see
|
|
the counter values. We also request to see the percentages and limit the
|
|
output to the first 5 lines:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ exp=mxv.hwc.def.2.thr.er
|
|
$ gprofng display text -metrics e.%hwc -limit 5 -functions $exp
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Current metrics: e.%cycles:e+%insts:e+%llm:name
|
|
Current Sort Metric: Exclusive CPU Cycles ( e.%cycles )
|
|
Print limit set to 5
|
|
Functions sorted by metric: Exclusive CPU Cycles
|
|
|
|
Excl. CPU Excl. Instructions Excl. Last-Level Name
|
|
Cycles Executed Cache Misses
|
|
sec. % % %
|
|
2.691 100.00 7906475309 100.00 122658983 100.00 <Total>
|
|
2.598 96.54 7432724378 94.01 121745696 99.26 mxv_core
|
|
0.035 1.31 188860269 2.39 70084 0.06 erand48_r
|
|
0.026 0.95 73623396 0.93 763116 0.62 init_data
|
|
0.018 0.66 76824434 0.97 40040 0.03 drand48
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
As we have seen before, the first few lines echo the settings.
|
|
This includes a list with the hardware event counters used by
|
|
default.
|
|
|
|
The table that follows makes it very easy to get an overview where the
|
|
time is spent and how many of the target events have occurred.
|
|
|
|
As before, we can drill down deeper and see the same metrics at the source
|
|
line and instruction level. Other than using @code{hwc} in the metrics
|
|
definitions, nothing has changed compared to the previous examples:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ exp=mxv.hwc.def.2.thr.er
|
|
$ gprofng display text -metrics e.hwc -source mxv_core $exp
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
This is the relevant part of the output. Since the lines get very long,
|
|
we have somewhat modified the lay-out:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Excl. CPU Excl. Excl.
|
|
Cycles Instructions Last-Level
|
|
sec. Executed Cache Misses
|
|
<Function: mxv_core>
|
|
0. 0 0 32. void __attribute__ ((noinline))
|
|
mxv_core(...)
|
|
0. 0 0 33. {
|
|
0. 0 0 34. for (uint64_t i=...) {
|
|
0. 0 0 35. double row_sum = 0.0;
|
|
## 1.872 7291879319 88150571 36. for (int64_t j=0; j<n; j++)
|
|
0.725 140845059 33595125 37. row_sum += A[i][j]*b[j];
|
|
0. 0 0 38. c[i] = row_sum;
|
|
39. }
|
|
0. 0 0 40. }
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
In a smiliar way we can display the event counter values at the instruction
|
|
level. Again we have modified the lay-out due to page width limitations:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ exp=mxv.hwc.def.2.thr.er
|
|
$ gprofng display text -metrics e.hwc -disasm mxv_core $exp
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Excl. CPU Excl. Excl.
|
|
Cycles Instructions Last-Level
|
|
sec. Executed Cache Misses
|
|
<Function: mxv_core>
|
|
0. 0 0 [33] 4021ba: mov 0x8(%rsp),%r10
|
|
34. for (uint64_t i=...) {
|
|
0. 0 0 [34] 4021bf: cmp %rsi,%rdi
|
|
0. 0 0 [34] 4021c2: jbe 0x37
|
|
0. 0 0 [34] 4021c4: ret
|
|
35. double row_sum = 0.0;
|
|
36. for (int64_t j=0; j<n; j++)
|
|
37. row_sum += A[i][j]*b[j];
|
|
0. 0 0 [37] 4021c5: mov (%r8,%rdi,8),%rdx
|
|
0. 0 0 [36] 4021c9: mov $0x0,%eax
|
|
0. 0 0 [35] 4021ce: pxor %xmm1,%xmm1
|
|
0.002 12804230 321394 [37] 4021d2: movsd (%rdx,%rax,8),%xmm0
|
|
0.141 60819025 3866677 [37] 4021d7: mulsd (%r9,%rax,8),%xmm0
|
|
0.582 67221804 29407054 [37] 4021dd: addsd %xmm0,%xmm1
|
|
## 1.871 7279075109 87989870 [36] 4021e1: add $0x1,%rax
|
|
0.002 12804210 80351 [36] 4021e5: cmp %rax,%rcx
|
|
0. 0 0 [36] 4021e8: jne 0xffffffffffffffea
|
|
38. c[i] = row_sum;
|
|
0. 0 0 [38] 4021ea: movsd %xmm1,(%r10,%rdi,8)
|
|
0. 0 0 [34] 4021f0: add $0x1,%rdi
|
|
0. 0 0 [34] 4021f4: cmp %rdi,%rsi
|
|
0. 0 0 [34] 4021f7: jb 0xd
|
|
0. 0 0 [35] 4021f9: pxor %xmm1,%xmm1
|
|
0. 0 0 [36] 4021fd: test %rcx,%rcx
|
|
0. 0 80350 [36] 402200: jne 0xffffffffffffffc5
|
|
0. 0 0 [36] 402202: jmp 0xffffffffffffffe8
|
|
39. }
|
|
40. }
|
|
0. 0 0 [40] 402204: ret
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
So far we have used the default settings for the event counters. It is
|
|
quite straightforward to select specific counters. For sake of the
|
|
example, let's assume we would like to count how many branch instructions
|
|
and retired memory load instructions that missed in the L1 cache have been
|
|
executed. We also want to count these events with a high resolution.
|
|
|
|
This is the command to do so:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ exe=mxv-pthreads.exe
|
|
$ m=3000
|
|
$ n=2000
|
|
$ exp=mxv.hwc.sel.2.thr.er
|
|
$ hwc1=br_ins,hi
|
|
$ hwc2=mem_load_retired.l1_miss,hi
|
|
$ gprofng collect app -O $exp -h $hwc1 -h $hwc2 $exe -m $m -n $n -t 2
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
As before, we get a table with the event counts. Due to the very
|
|
long name for the second counter, we have somewhat modified the
|
|
output.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -limit 10 -functions mxv.hwc.sel.2.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Functions sorted by metric: Exclusive Total CPU Time
|
|
Excl. Incl. Excl. Branch Excl. Name
|
|
Total Total Instructions mem_load_retired.l1_miss
|
|
CPU sec. CPU sec. Events
|
|
2.597 2.597 1305305319 4021340 <Total>
|
|
2.481 2.481 1233233242 3982327 mxv_core
|
|
0.040 0.107 19019012 9003 init_data
|
|
0.028 0.052 23023048 15006 erand48_r
|
|
0.024 0.024 19019008 9004 __drand48_iterate
|
|
0.015 0.067 11011009 2998 drand48
|
|
0.008 0.010 0 3002 _int_malloc
|
|
0.001 0.001 0 0 brk
|
|
0.001 0.002 0 0 sysmalloc
|
|
0. 0.001 0 0 __default_morecore
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
@IndexSubentry{Commands, @code{compare ratio}}
|
|
When using event counters, the values could be very large and it is not easy
|
|
to compare the numbers. As we will show next, the @code{ratio} feature is
|
|
very useful when comparing such profiles.
|
|
|
|
To demonstrate this, we have set up another event counter experiment where
|
|
we would like to compare the number of last level cache miss and the number
|
|
of branch instructions executed when using a single thread, or two threads.
|
|
|
|
These are the commands used to generate the experiment directories:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ exe=./mxv-pthreads.exe
|
|
$ m=3000
|
|
$ n=2000
|
|
$ exp1=mxv.hwc.comp.1.thr.er
|
|
$ exp2=mxv.hwc.comp.2.thr.er
|
|
$ gprofng collect app -O $exp1 -h llm -h br_ins $exe -m $m -n $n -t 1
|
|
$ gprofng collect app -O $exp2 -h llm -h br_ins $exe -m $m -n $n -t 2
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
The following script has been used to get the tables. Due to lay-out
|
|
restrictions, we have to create two tables, one for each counter.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
# Limit the output to 5 lines
|
|
limit 5
|
|
# Define the metrics
|
|
metrics name:e.llm
|
|
# Set the comparison to ratio
|
|
compare ratio
|
|
functions
|
|
# Define the metrics
|
|
metrics name:e.br_ins
|
|
# Set the comparison to ratio
|
|
compare ratio
|
|
functions
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
Note that we print the name of the function first, followed by the counter
|
|
data.
|
|
The new element is that we set the comparison mode to @code{ratio}. This
|
|
divides the data in a column by its counterpart in the reference experiment.
|
|
|
|
This is the command using this script and the two experiment directories as
|
|
input:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -script my-script-comp-counters \
|
|
mxv.hwc.comp.1.thr.er \
|
|
mxv.hwc.comp.2.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
By design, we get two tables, one for each counter:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Functions sorted by metric: Exclusive Last-Level Cache Misses
|
|
|
|
mxv.hwc.comp.1.thr.er mxv.hwc.comp.2.thr.er
|
|
Name Excl. Last-Level Excl. Last-Level
|
|
Cache Misses Cache Misses
|
|
ratio
|
|
<Total> 122709276 x 0.788
|
|
mxv_core 121796001 x 0.787
|
|
init_data 723064 x 1.055
|
|
erand48_r 100111 x 0.500
|
|
drand48 60065 x 1.167
|
|
|
|
Functions sorted by metric: Exclusive Branch Instructions
|
|
|
|
mxv.hwc.comp.1.thr.er mxv.hwc.comp.2.thr.er
|
|
Name Excl. Branch Excl. Branch
|
|
Instructions Instructions
|
|
ratio
|
|
<Total> 1307307316 x 0.997
|
|
mxv_core 1235235239 x 0.997
|
|
erand48_r 23023033 x 0.957
|
|
drand48 20020009 x 0.600
|
|
__drand48_iterate 17017028 x 0.882
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
A ratio less than one in the second column, means that this counter
|
|
value was smaller than the value from the reference experiment shown
|
|
in the first column.
|
|
|
|
This kind of presentation of the results makes it much easier to
|
|
quickly interpret the data.
|
|
|
|
We conclude this section with thread-level event counter overviews,
|
|
but before we go into this, there is an important metric we need to
|
|
mention.
|
|
|
|
@IndexSubentry{Hardware event counters, IPC}
|
|
In case it is known how many instructions and CPU cycles have been executed,
|
|
the value for the IPC (``Instructions Per Clockycle'') can be computed.
|
|
@xref{Hardware Event Counters Explained}.
|
|
This is a derived metric that gives an indication how well the processor
|
|
is utilized. The inverse of the IPC is called CPI.
|
|
|
|
The @DisplayText{} command automatically computes the IPC and CPI values
|
|
if an experiment contains the event counter values for the instructions
|
|
and CPU cycles executed. These are part of the metric list and can be
|
|
displayed, just like any other metric.
|
|
|
|
@IndexSubentry{Commands, @code{metric_list}}
|
|
This can be verified through the @code{metric_list} command. If we go
|
|
back to our earlier experiment with the default event counters, we get
|
|
the following result.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -metric_list mxv.hwc.def.2.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Current metrics: e.totalcpu:i.totalcpu:e.cycles:e+insts:e+llm:name
|
|
Current Sort Metric: Exclusive Total CPU Time ( e.totalcpu )
|
|
Available metrics:
|
|
Exclusive Total CPU Time: e.%totalcpu
|
|
Inclusive Total CPU Time: i.%totalcpu
|
|
Exclusive CPU Cycles: e.+%cycles
|
|
Inclusive CPU Cycles: i.+%cycles
|
|
Exclusive Instructions Executed: e+%insts
|
|
Inclusive Instructions Executed: i+%insts
|
|
Exclusive Last-Level Cache Misses: e+%llm
|
|
Inclusive Last-Level Cache Misses: i+%llm
|
|
Exclusive Instructions Per Cycle: e+IPC
|
|
Inclusive Instructions Per Cycle: i+IPC
|
|
Exclusive Cycles Per Instruction: e+CPI
|
|
Inclusive Cycles Per Instruction: i+CPI
|
|
Size: size
|
|
PC Address: address
|
|
Name: name
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
Among the other metrics, we see the new metrics for the IPC and CPI
|
|
listed.
|
|
|
|
In the script below, we use this information and add the IPC and CPI
|
|
to the metrics to be displayed. We also use a the thread filter to
|
|
display these values for the individual threads.
|
|
|
|
This is the complete script we have used. Other than a different selection
|
|
of the metrics, there are no new features.
|
|
|
|
@cartouche
|
|
@smallexample
|
|
# Define the metrics
|
|
metrics e.insts:e.%cycles:e.IPC:e.CPI
|
|
# Sort with respect to cycles
|
|
sort e.cycles
|
|
# Limit the output to 5 lines
|
|
limit 5
|
|
# Get the function overview for all threads
|
|
functions
|
|
# Get the function overview for thread 1
|
|
thread_select 1
|
|
functions
|
|
# Get the function overview for thread 2
|
|
thread_select 2
|
|
functions
|
|
# Get the function overview for thread 3
|
|
thread_select 3
|
|
functions
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
In the metrics definition on the second line, we explicitly request the
|
|
counter values for the instructions (@code{e.insts}) and CPU cycles
|
|
(@code{e.cycles}) executed. These names can be found in output from the
|
|
@code{metric_list} commad above.
|
|
In addition to these metrics, we also request the IPC and CPI to be shown.
|
|
|
|
As before, we used the @code{limit} command to control the number of
|
|
functions displayed. We then request an overview for all the threads,
|
|
followed by three sets of two commands to select a thread and display the
|
|
function overview.
|
|
|
|
The script above is used as follows:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -script my-script-ipc mxv.hwc.def.2.thr.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
This script produces four tables. We list them separately below,
|
|
and have left out the additional output.
|
|
|
|
The first table shows the accumulated values across the three
|
|
threads that have been active.
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Functions sorted by metric: Exclusive CPU Cycles
|
|
|
|
Excl. Excl. CPU Excl. Excl. Name
|
|
Instructions Cycles IPC CPI
|
|
Executed sec. %
|
|
7906475309 2.691 100.00 1.473 0.679 <Total>
|
|
7432724378 2.598 96.54 1.434 0.697 mxv_core
|
|
188860269 0.035 1.31 2.682 0.373 erand48_r
|
|
73623396 0.026 0.95 1.438 0.696 init_data
|
|
76824434 0.018 0.66 2.182 0.458 drand48
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
This shows that IPC of this program is completely dominated
|
|
by function @code{mxv_core}. It has a fairly low IPC value
|
|
of 1.43.
|
|
|
|
The next table is for thread 1 and shows the values for the
|
|
main thread.
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Exp Sel Total
|
|
=== === =====
|
|
1 1 3
|
|
Functions sorted by metric: Exclusive CPU Cycles
|
|
|
|
Excl. Excl. CPU Excl. Excl. Name
|
|
Instructions Cycles IPC CPI
|
|
Executed sec. %
|
|
473750931 0.093 100.00 2.552 0.392 <Total>
|
|
188860269 0.035 37.93 2.682 0.373 erand48_r
|
|
73623396 0.026 27.59 1.438 0.696 init_data
|
|
76824434 0.018 18.97 2.182 0.458 drand48
|
|
134442832 0.013 13.79 5.250 0.190 __drand48_iterate
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
Although this thread hardly uses any CPU cycles, the overall IPC
|
|
of 2.55 is not all that bad.
|
|
|
|
Last, we show the tables for threads 2 and 3:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Exp Sel Total
|
|
=== === =====
|
|
1 2 3
|
|
Functions sorted by metric: Exclusive CPU Cycles
|
|
|
|
Excl. Excl. CPU Excl. Excl. Name
|
|
Instructions Cycles IPC CPI
|
|
Executed sec. %
|
|
3716362189 1.298 100.00 1.435 0.697 <Total>
|
|
3716362189 1.298 100.00 1.435 0.697 mxv_core
|
|
0 0. 0. 0. 0. collector_root
|
|
0 0. 0. 0. 0. driver_mxv
|
|
|
|
Exp Sel Total
|
|
=== === =====
|
|
1 3 3
|
|
Functions sorted by metric: Exclusive CPU Cycles
|
|
|
|
Excl. Excl. CPU Excl. Excl. Name
|
|
Instructions Cycles IPC CPI
|
|
Executed sec. %
|
|
3716362189 1.300 100.00 1.433 0.698 <Total>
|
|
3716362189 1.300 100.00 1.433 0.698 mxv_core
|
|
0 0. 0. 0. 0. collector_root
|
|
0 0. 0. 0. 0. driver_mxv
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
It is seen that both execute the same number of instructions and
|
|
take about the same number of CPU cycles. As a result, the IPC is
|
|
the same for both threads.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@c TBD @node Additional Features
|
|
@c TBD @section Additional Features
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@c TBD @node More Filtering Capabilities
|
|
@c TBD @subsection More Filtering Capabilities
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@c TBD Cover @code{samples} and @code{seconds}
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Java Profiling
|
|
@section Java Profiling
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@IndexSubentry{Java profiling, @code{-j on/off}}
|
|
The @CollectApp{} command supports Java profiling. The @code{-j on} option
|
|
can be used for this, but since this feature is enabled by default, there is
|
|
no need to set this explicitly. Java profiling may be disabled through the
|
|
@code{-j off} option.
|
|
|
|
The program is compiled as usual and the experiment directory is created
|
|
similar to what we have seen before. The only difference with a C/C++
|
|
application is that the program has to be explicitly executed by java.
|
|
|
|
For example, this is how to generate the experiment data for a Java
|
|
program that has the source code stored in file @code{Pi.java}:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ javac Pi.java
|
|
$ gprofng collect app -j on -O pi.demo.er java Pi < pi.in
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
Regarding which java is selected to generate the data, @ToolName{}
|
|
first looks for the JDK in the path set in either the
|
|
@IndexSubentry{Java profiling, @code{JDK_HOME}}
|
|
@code{JDK_HOME} environment variable, or in the
|
|
@IndexSubentry{Java profiling, @code{JAVA_PATH}}
|
|
@code{JAVA_PATH} environment variable. If neither of these variables is
|
|
set, it checks for a JDK in the search path (set in the PATH
|
|
environment variable). If there is no JDK in this path, it checks for
|
|
the java executable in @code{/usr/java/bin/java}.
|
|
|
|
In case additional options need to be passed on to the JVM, the
|
|
@IndexSubentry{Java profiling, @code{-J <string>}}
|
|
@code{-J <string>} option can be used. The string with the
|
|
option(s) has to be delimited by quotation marks in case
|
|
there is more than one argument.
|
|
|
|
The @DisplayText{} command may be used to view the performance data. There is
|
|
no need for any special options and the same commands as previously discussed
|
|
are supported.
|
|
|
|
@IndexSubentry{Commands, @code{viewmode}}
|
|
@IndexSubentry{Java profiling, different view modes}
|
|
The @code{viewmode} command
|
|
@xref{The Viewmode}
|
|
is very useful to examine the call stacks.
|
|
|
|
For example, this is how one can see the native call stacks. For
|
|
lay-out purposes we have restricted the list to the first five entries:
|
|
|
|
@cartouche
|
|
@smallexample
|
|
$ gprofng display text -limit 5 -viewmode machine -calltree pi.demo.er
|
|
@end smallexample
|
|
@end cartouche
|
|
|
|
@smallexample
|
|
@verbatim
|
|
Print limit set to 5
|
|
Viewmode set to machine
|
|
Functions Call Tree. Metric: Attributed Total CPU Time
|
|
|
|
Attr. Name
|
|
Total
|
|
CPU sec.
|
|
1.381 +-<Total>
|
|
1.171 +-Pi.calculatePi(double)
|
|
0.110 +-collector_root
|
|
0.110 | +-JavaMain
|
|
0.070 | +-jni_CallStaticVoidMethod
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
@noindent
|
|
Note that the selection of the viewmode is echoed in the output.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@c TBD @node Summary of Options and Commands
|
|
@c TBD @chapter Summary of Options and Commands
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Terminology
|
|
@chapter Terminology
|
|
|
|
Throughout this manual, certain terminology specific to profiling tools,
|
|
or @ToolName{}, or even to this document only, is used. In this chapter we
|
|
explain this terminology in detail.
|
|
|
|
@menu
|
|
* The Program Counter:: What is a Program Counter?
|
|
* Inclusive and Exclusive Metrics:: An explanation of inclusive and exclusive metrics.
|
|
* Metric Definitions:: Definitions associated with metrics.
|
|
* The Viewmode:: Select the way call stacks are presented.
|
|
* The Selection List:: How to define a selection.
|
|
* Load Objects and Functions:: The components in an application.
|
|
* The Concept of a CPU in @ProductName{}:: The definition of a CPU.
|
|
* Hardware Event Counters Explained:: What are event counters?
|
|
* apath:: Our generic definition of a path.
|
|
@end menu
|
|
|
|
@c ----------------------------------------------------------------------------
|
|
@node The Program Counter
|
|
@section The Program Counter
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@cindex PC
|
|
@cindex Program Counter
|
|
The @emph{Program Counter}, or PC for short, keeps track where program execution is.
|
|
The address of the next instruction to be executed is stored in a special
|
|
purpose register in the processor, or core.
|
|
|
|
@cindex Instruction pointer
|
|
The PC is sometimes also referred to as the @emph{instruction pointer}, but
|
|
we will use Program Counter or PC throughout this document.
|
|
|
|
@c ----------------------------------------------------------------------------
|
|
@node Inclusive and Exclusive Metrics
|
|
@section Inclusive and Exclusive Metrics
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
In the remainder, these two concepts occur quite often and for lack of a better
|
|
place, they are explained here.
|
|
|
|
@cindex Inclusive metric
|
|
The @emph{inclusive} value for a metric includes all values that are part of
|
|
the dynamic extent of the target function. For example if function @code{A}
|
|
calls functions @code{B} and @code{C}, the inclusive CPU time for @code{A}
|
|
includes the CPU time spent in @code{B} and @code{C}.
|
|
|
|
@cindex Exclusive metric
|
|
In contrast with this, the @emph{exclusive} value for a metric is computed
|
|
by excluding the metric values used by other functions called. In our imaginary
|
|
example, the exclusive CPU time for function @code{A} is the time spent outside
|
|
calling functions @code{B} and @code{C}.
|
|
|
|
@cindex Leaf function
|
|
In case of a @emph{leaf function}, the inclusive and exclusive values for the
|
|
metric are the same since by definition, it is not calling any other
|
|
function(s).
|
|
|
|
Why do we use these two different values? The inclusive metric shows the most
|
|
expensive path, in terms of this metric, in the application. For example, if
|
|
the metric is cache misses, the function with the highest inclusive metric
|
|
tells you where most of the cache misses come from.
|
|
|
|
Within this branch of the application, the exclusive metric points to the
|
|
functions that contribute and help to identify which part(s) to consider
|
|
for further analysis.
|
|
|
|
@c ----------------------------------------------------------------------------
|
|
@node Metric Definitions
|
|
@section Metric Definitions
|
|
@c ----------------------------------------------------------------------------
|
|
The metrics to be shown are highly customizable. In this section we explain
|
|
the definitions associated with metrics.
|
|
|
|
@IndexSubentry{Commands, @code{metrics}}
|
|
The @code{metrics} command takes a colon (:) separated list with special
|
|
keywords. This keyword consists of the following three fields:
|
|
@code{<flavor>}@code{<visibility>}@code{<metric_name>}.
|
|
|
|
@cindex Flavor field
|
|
@cindex Visibility field
|
|
@cindex Metric name field
|
|
The @emph{<flavor>} field is either an @code{e} for ``exclusive'', or @code{i}
|
|
for ``inclusive''. The @code{<metric_name>} field is the name of the metric
|
|
request. The @emph{<visibility>} field consists of one ore more characters
|
|
from the following table:
|
|
|
|
@table @code
|
|
|
|
@item .
|
|
Show the metric as time. This applies to timing metrics and hardware event counters
|
|
that measure cycles. Interpret as @code{+} for other metrics.
|
|
|
|
@item %
|
|
Show the metric as a percentage of the total value for this metric.
|
|
|
|
@item +
|
|
Show the metric as an absolute value. For hardware event counters this is
|
|
the event count. Interpret as @code{.} for timing metrics.
|
|
|
|
@item |
|
|
Do not show any metric value. Cannot be used with other visibility characters.
|
|
|
|
@end table
|
|
|
|
@c ----------------------------------------------------------------------------
|
|
@node The Viewmode
|
|
@section The Viewmode
|
|
|
|
@cindex Viewmode
|
|
@IndexSubentry{Commands, @code{viewmode}}
|
|
|
|
There are different ways to view a call stack in Java. In @ToolName{}, this
|
|
is called the @emph{viewmode} and the setting is controlled through a command
|
|
with the same name.
|
|
|
|
The @code{viewmode} command takes one of the following keywords:
|
|
|
|
@table @code
|
|
|
|
@item user
|
|
This is the default and shows the Java call stacks for Java threads.
|
|
No call stacks for any housekeeping threads are shown. The function
|
|
list contains a function
|
|
@IndexSubentry{Java profiling, @code{<JVM-System>}}
|
|
@code{<JVM-System>} that represents the aggregated time from non-Java
|
|
threads.
|
|
When the JVM software does not report a Java call stack, time is reported
|
|
against the function
|
|
@IndexSubentry{Java profiling, @code{<no Java callstack recorded>}}
|
|
@code{<no Java callstack recorded>}.
|
|
|
|
|
|
@item expert
|
|
Show the Java call stacks for Java threads when the Java code from the
|
|
user is executed and machine call stacks when JVM code is executed, or
|
|
when the JVM software does not report a Java call stack.
|
|
Show the machine call stacks for housekeeping threads.
|
|
|
|
@item machine
|
|
Show the actual native call stacks for all threads.
|
|
|
|
@end table
|
|
|
|
@c ----------------------------------------------------------------------------
|
|
@c ----------------------------------------------------------------------------
|
|
@node The Selection List
|
|
@section The Selection List
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@cindex Selection list
|
|
@cindex List specification
|
|
Several commands allow the user to specify a subset of a list. For example,
|
|
to select specific threads from all the threads that have been used when
|
|
conducting the experiment(s).
|
|
|
|
Such a selection list (or ``list'' in the remainder of this section) can be a
|
|
single number, a contiguous range of numbers with the start and end numbers
|
|
separated by a hyphen (@code{-}), a comma-separated list of numbers and
|
|
ranges, or the @code{all} keyword. Lists must not contain spaces.
|
|
|
|
Each list can optionally be preceded by an experiment list with a similar
|
|
format, separated from the list by a colon (:).
|
|
If no experiment list is included, the list applies to all experiments.
|
|
|
|
Multiple lists can be concatenated by separating the individual lists
|
|
by a plus sign.
|
|
|
|
These are some examples of various filters using a list:
|
|
|
|
@table @code
|
|
|
|
@item thread_select 1
|
|
Select thread 1 from all experiments.
|
|
|
|
@item thread_select all:1
|
|
Select thread 1 from all experiments.
|
|
|
|
@item thread_select 1:1+2:2
|
|
Select thread 1 from experiment 1 and thread 2 from experiment 2.
|
|
|
|
@item cpu_select all:1,3,5
|
|
Selects cores 1, 3, and 5 from all experiments.
|
|
|
|
@item cpu_select 1,2:all
|
|
Select all cores from experiments 1 and 2, as listed by the @code{by exp_list} command.
|
|
|
|
@end table
|
|
|
|
@c ----------------------------------------------------------------------------
|
|
@node Load Objects and Functions
|
|
@section Load Objects and Functions
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
An application consists of various components. The source code files are
|
|
compiled into object files. These are then glued together at link time to form
|
|
the executable.
|
|
During execution, the program may also dynamically load objects.
|
|
|
|
@cindex Load object
|
|
A @emph{load object} is defined to be an executable, or shared object. A shared
|
|
library is an example of a load object in @ToolName{}.
|
|
|
|
Each load object, contains a text section with the instructions generated by the
|
|
compiler, a data section for data, and various symbol tables.
|
|
All load objects must contain an
|
|
@cindex ELF
|
|
ELF
|
|
symbol table, which gives the names and addresses of all the globally known
|
|
functions in that object.
|
|
|
|
Load objects compiled with the -g option contain additional symbolic information
|
|
that can augment the ELF symbol table and provide information about functions that
|
|
are not global, additional information about object modules from which the functions
|
|
came, and line number information relating addresses to source lines.
|
|
|
|
The term
|
|
@cindex Function
|
|
@emph{function}
|
|
is used to describe a set of instructions that represent a high-level operation
|
|
described in the source code. The term also covers methods as used in C++ and in
|
|
the Java programming language.
|
|
|
|
In the @ToolName{} context, functions are provided in source code format.
|
|
Normally their names appear in the symbol table representing a set of addresses.
|
|
@cindex Program Counter
|
|
@cindex PC
|
|
If the Program Counter (PC) is within that set, the program is executing within that function.
|
|
|
|
In principle, any address within the text segment of a load object can be mapped to a
|
|
function. Exactly the same mapping is used for the leaf PC and all the other PCs on the
|
|
call stack.
|
|
|
|
Most of the functions correspond directly to the source model of the program, but
|
|
there are exceptions. This topic is however outside of the scope of this guide.
|
|
|
|
@c ----------------------------------------------------------------------------
|
|
@node The Concept of a CPU in @ProductName{}
|
|
@section The Concept of a CPU in @ProductName{}
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@cindex CPU
|
|
In @ProductName{}, there is the concept of a CPU. Admittedly, this is not the
|
|
best word to describe what is meant here and may be replaced in the future.
|
|
|
|
The word CPU is used in many of the displays.
|
|
In the context of @ProductName{}, it is meant to denote a part of the
|
|
processor that is capable of executing instructions and with its own state,
|
|
like the program counter.
|
|
|
|
For example, on a contemporary processor, a CPU could be a core. In case
|
|
hardware threads are supported within a core, it could be one of those
|
|
hardware threads.
|
|
|
|
@c ----------------------------------------------------------------------------
|
|
@node Hardware Event Counters Explained
|
|
@section Hardware Event Counters Explained
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
@IndexSubentry{Hardware event counters, description}
|
|
For quite a number of years now, many microprocessors have supported hardware
|
|
event counters.
|
|
|
|
On the hardware side, this means that in the processor there are one or more
|
|
registers dedicated to count certain activities, or ``events''.
|
|
Examples of such events are the number of instructions executed, or the number
|
|
of cache misses at level 2 in the memory hierarchy.
|
|
|
|
While there is a limited set of such registers, the user can map events onto
|
|
them. In case more than one register is available, this allows for the
|
|
simultaenous measurement of various events.
|
|
|
|
A simple, yet powerful, example is to simultaneously count the number of CPU
|
|
cycles and the number of instructions excuted. These two numbers can then be
|
|
used to compute the
|
|
@cindex IPC
|
|
@emph{IPC} value. IPC stands for ``Instructions Per Clockcycle'' and each processor
|
|
has a maximum. For example, if this maximum number is 2, it means the
|
|
processor is capable of executing two instructions every clock cycle.
|
|
|
|
Whether this is actually achieved, depends on several factors, including the
|
|
instruction characteristics.
|
|
However, in case the IPC value is well below this maximum in a time critical
|
|
part of the application and this cannot be easily explained, further
|
|
investigation is probably warranted.
|
|
|
|
@cindex CPI
|
|
A related metric is called @emph{CPI}, or ``Clockcycles Per Instruction''.
|
|
It is the inverse of the CPI and can be compared against the theoretical
|
|
value(s) of the target instruction(s). A significant difference may point
|
|
at a bottleneck.
|
|
|
|
One thing to keep in mind is that the value returned by a counter can either
|
|
be the number of times the event occured, or a CPU cycle count. In case of
|
|
the latter it is possible to convert this number to time.
|
|
|
|
@IndexSubentry{Hardware event counters, variable CPU frequency}
|
|
This is often easier to interpret than a simple count, but there is one
|
|
caveat to keep in mind. The CPU frequency may not have been constant while
|
|
the experimen was recorded and this impacts the time reported.
|
|
|
|
These event counters, or ``counters'' for short, provide great insight into
|
|
what happens deep inside the processor. In case higher level information does
|
|
not provide the insight needed, the counters provide the information to get
|
|
to the bottom of a performance problem.
|
|
|
|
There are some things to consider though.
|
|
|
|
@itemize @bullet
|
|
|
|
@item
|
|
The event definitions and names vary across processors and it may even happen
|
|
that some events change with an update.
|
|
Unfortunately and this is luckily rare, there are sometimes bugs causing the
|
|
wrong count to be returned.
|
|
|
|
@IndexSubentry{Hardware event counters, alias name}
|
|
In @ToolName{}, some of the processor specific event names have an alias
|
|
name. For example @code{insts} measures the instructions executed.
|
|
These aliases not only makes it easier to identify the functionality, but also
|
|
provide portability of certain events across processors.
|
|
|
|
@item
|
|
Another complexity is that there are typically many events one can monitor.
|
|
There may up to hundreds of events available and it could require several
|
|
experiments to zoom in on the root cause of a performance problem.
|
|
|
|
@item
|
|
There may be restrictions regarding the mapping of event(s) onto the
|
|
counters. For example, certain events may be restricted to specific
|
|
counters only. As a result, one may have to conduct additional experiments
|
|
to cover all the events of interest.
|
|
|
|
@item
|
|
The names of the events may also not be easy to interpret. In such cases,
|
|
the description can be found in the architecture manual for the processor.
|
|
|
|
@end itemize
|
|
|
|
Despite these drawbacks, hardware event counters are extremely useful and
|
|
may even turn out to be indispensable.
|
|
|
|
@c ----------------------------------------------------------------------------
|
|
@node apath
|
|
@section What is <apath>?
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
In most cases, @ToolName{} shows the absolute pathnames of directories. These
|
|
tend to be rather long, causing display issues in this document.
|
|
|
|
Instead of wrapping these long pathnames over multiple lines, we decided to
|
|
represent them by the @code{<apath>} symbol, which stands for ``an absolute
|
|
pathname''.
|
|
|
|
Note that different occurrences of @code{<apath>} may represent different
|
|
absolute pathnames.
|
|
|
|
@c -- A new node --------------------------------------------------------------
|
|
@node Other Document Formats
|
|
@chapter Other Document Formats
|
|
@c ----------------------------------------------------------------------------
|
|
|
|
This document is written in Texinfo and the source text is made available as
|
|
part of the binutils distribution. The file name is @code{gprofng.texi} and
|
|
can be found in subdirectory @code{doc} under directory @code{gprofng} in the
|
|
top level directory.
|
|
|
|
This file can be used to generate the document in the @code{info}, @code{html},
|
|
and @code{pdf} formats.
|
|
The default installation procedure creates a file in the @code{info} format and
|
|
stores it in the documentation section of binutils.
|
|
|
|
The probably easiest way to generate a different format from this Texinfo
|
|
document is to go to the distribution directory that was created when the
|
|
tools were built.
|
|
This is either the default distribution directory, or the one that has been set
|
|
with the @code{--prefix} option as part of the @code{configure} command.
|
|
In this example we symbolize this location with @code{<dist>}.
|
|
|
|
The make file called @code{Makefile} in directory @code{<dist>/gprofng/doc}
|
|
supports several commands to generate this document in different formats.
|
|
We recommend to use these commands.
|
|
|
|
They create the file(s) and install it in the documentation directory of binutils,
|
|
which is @code{<dist>/share/doc} in case @code{html} or @code{pdf} is selected and
|
|
@code{<dist>/share/info} for the file in the @code{info} format.
|
|
|
|
To generate this document in the requested format and install it in the documentation
|
|
directory, the commands below should be executed. In this notation, @code{<format>}
|
|
is one of @code{info}, @code{html}, or @code{pdf}:
|
|
|
|
@smallexample
|
|
@verbatim
|
|
$ cd <dist>/gprofng/doc
|
|
$ make install-<format>
|
|
@end verbatim
|
|
@end smallexample
|
|
|
|
@noindent
|
|
Some things to note:
|
|
|
|
@itemize
|
|
|
|
@item
|
|
For the @code{pdf} file to be generated, the
|
|
@cindex TeX
|
|
TeX document formatting software is required and the relevant commmands need
|
|
to be included in the search path. An example of a popular TeX implementation
|
|
is @emph{TexLive}. It is beyond the scope of this document to go into the
|
|
details of installing and using TeX, but it is well documented elsewhere.
|
|
|
|
@item
|
|
Instead of generating a single file in the @code{html} format, it is also
|
|
possible to create a directory with individual files for the various chapters.
|
|
To do so, remove the use of @code{--no-split} in variable @code{MAKEINFOHTML}
|
|
in the make file in the @code{doc} directory.
|
|
|
|
@item
|
|
The make file also supports commands to only generate the file in the desired
|
|
format and not move them to the documentation directory. This is
|
|
accomplished through the @code{make <format>} command.
|
|
|
|
@end itemize
|
|
|
|
@ifnothtml
|
|
@node Index
|
|
@unnumbered Index
|
|
@printindex cp
|
|
@end ifnothtml
|
|
|
|
@bye
|