1# Executable commands reference
2
3## Table of Contents
4
5- [Executable commands reference](#executable-commands-reference)
6  - [Table of Contents](#table-of-contents)
7  - [How simpleperf works](#how-simpleperf-works)
8  - [Commands](#commands)
9  - [The list command](#the-list-command)
10  - [The stat command](#the-stat-command)
11    - [Select events to stat](#select-events-to-stat)
12    - [Select target to stat](#select-target-to-stat)
13    - [Decide how long to stat](#decide-how-long-to-stat)
14    - [Decide the print interval](#decide-the-print-interval)
15    - [Display counters in systrace](#display-counters-in-systrace)
16    - [Show event count per thread](#show-event-count-per-thread)
17    - [Show event count per core](#show-event-count-per-core)
18  - [The record command](#the-record-command)
19    - [Select events to record](#select-events-to-record)
20    - [Select target to record](#select-target-to-record)
21    - [Set the frequency to record](#set-the-frequency-to-record)
22    - [Decide how long to record](#decide-how-long-to-record)
23    - [Set the path to store profiling data](#set-the-path-to-store-profiling-data)
24      - [Record call graphs](#record-call-graphs)
25    - [Record both on CPU time and off CPU time](#record-both-on-cpu-time-and-off-cpu-time)
26  - [The report command](#the-report-command)
27    - [Set the path to read profiling data](#set-the-path-to-read-profiling-data)
28    - [Set the path to find binaries](#set-the-path-to-find-binaries)
29    - [Filter samples](#filter-samples)
30    - [Group samples into sample entries](#group-samples-into-sample-entries)
31      - [Report call graphs](#report-call-graphs)
32
33
34## How simpleperf works
35
36Modern CPUs have a hardware component called the performance monitoring unit (PMU). The PMU has
37several hardware counters, counting events like how many cpu cycles have happened, how many
38instructions have executed, or how many cache misses have happened.
39
40The Linux kernel wraps these hardware counters into hardware perf events. In addition, the Linux
41kernel also provides hardware independent software events and tracepoint events. The Linux kernel
42exposes all events to userspace via the perf_event_open system call, which is used by simpleperf.
43
44Simpleperf has three main commands: stat, record and report.
45
46The stat command gives a summary of how many events have happened in the profiled processes in a
47time period. Here’s how it works:
481. Given user options, simpleperf enables profiling by making a system call to the kernel.
492. The kernel enables counters while the profiled processes are running.
503. After profiling, simpleperf reads counters from the kernel, and reports a counter summary.
51
52The record command records samples of the profiled processes in a time period. Here’s how it works:
531. Given user options, simpleperf enables profiling by making a system call to the kernel.
542. Simpleperf creates mapped buffers between simpleperf and the kernel.
553. The kernel enables counters while the profiled processes are running.
564. Each time a given number of events happen, the kernel dumps a sample to the mapped buffers.
575. Simpleperf reads samples from the mapped buffers and stores profiling data in a file called
58   perf.data.
59
60The report command reads perf.data and any shared libraries used by the profiled processes,
61and outputs a report showing where the time was spent.
62
63## Commands
64
65Simpleperf supports several commands, listed below:
66
67```
68The debug-unwind command: debug/test dwarf based offline unwinding, used for debugging simpleperf.
69The dump command: dumps content in perf.data, used for debugging simpleperf.
70The help command: prints help information for other commands.
71The kmem command: collects kernel memory allocation information (will be replaced by Python scripts).
72The list command: lists all event types supported on the Android device.
73The record command: profiles processes and stores profiling data in perf.data.
74The report command: reports profiling data in perf.data.
75The report-sample command: reports each sample in perf.data, used for supporting integration of
76                           simpleperf in Android Studio.
77The stat command: profiles processes and prints counter summary.
78
79```
80
81Each command supports different options, which can be seen through help message.
82
83```sh
84# List all commands.
85$ simpleperf --help
86
87# Print help message for record command.
88$ simpleperf record --help
89```
90
91Below describes the most frequently used commands, which are list, stat, record and report.
92
93## The list command
94
95The list command lists all events available on the device. Different devices may support different
96events because they have different hardware and kernels.
97
98```sh
99$ simpleperf list
100List of hw-cache events:
101  branch-loads
102  ...
103List of hardware events:
104  cpu-cycles
105  instructions
106  ...
107List of software events:
108  cpu-clock
109  task-clock
110  ...
111```
112
113On ARM/ARM64, the list command also shows a list of raw events, they are the events supported by
114the ARM PMU on the device. The kernel has wrapped part of them into hardware events and hw-cache
115events. For example, raw-cpu-cycles is wrapped into cpu-cycles, raw-instruction-retired is wrapped
116into instructions. The raw events are provided in case we want to use some events supported on the
117device, but unfortunately not wrapped by the kernel.
118
119## The stat command
120
121The stat command is used to get event counter values of the profiled processes. By passing options,
122we can select which events to use, which processes/threads to monitor, how long to monitor and the
123print interval.
124
125```sh
126# Stat using default events (cpu-cycles,instructions,...), and monitor process 7394 for 10 seconds.
127$ simpleperf stat -p 7394 --duration 10
128Performance counter statistics:
129
130 1,320,496,145  cpu-cycles         # 0.131736 GHz                     (100%)
131   510,426,028  instructions       # 2.587047 cycles per instruction  (100%)
132     4,692,338  branch-misses      # 468.118 K/sec                    (100%)
133886.008130(ms)  task-clock         # 0.088390 cpus used               (100%)
134           753  context-switches   # 75.121 /sec                      (100%)
135           870  page-faults        # 86.793 /sec                      (100%)
136
137Total test time: 10.023829 seconds.
138```
139
140### Select events to stat
141
142We can select which events to use via -e.
143
144```sh
145# Stat event cpu-cycles.
146$ simpleperf stat -e cpu-cycles -p 11904 --duration 10
147
148# Stat event cache-references and cache-misses.
149$ simpleperf stat -e cache-references,cache-misses -p 11904 --duration 10
150```
151
152When running the stat command, if the number of hardware events is larger than the number of
153hardware counters available in the PMU, the kernel shares hardware counters between events, so each
154event is only monitored for part of the total time. In the example below, there is a percentage at
155the end of each row, showing the percentage of the total time that each event was actually
156monitored.
157
158```sh
159# Stat using event cache-references, cache-references:u,....
160$ simpleperf stat -p 7394 -e cache-references,cache-references:u,cache-references:k \
161      -e cache-misses,cache-misses:u,cache-misses:k,instructions --duration 1
162Performance counter statistics:
163
1644,331,018  cache-references     # 4.861 M/sec    (87%)
1653,064,089  cache-references:u   # 3.439 M/sec    (87%)
1661,364,959  cache-references:k   # 1.532 M/sec    (87%)
167   91,721  cache-misses         # 102.918 K/sec  (87%)
168   45,735  cache-misses:u       # 51.327 K/sec   (87%)
169   38,447  cache-misses:k       # 43.131 K/sec   (87%)
1709,688,515  instructions         # 10.561 M/sec   (89%)
171
172Total test time: 1.026802 seconds.
173```
174
175In the example above, each event is monitored about 87% of the total time. But there is no
176guarantee that any pair of events are always monitored at the same time. If we want to have some
177events monitored at the same time, we can use --group.
178
179```sh
180# Stat using event cache-references, cache-references:u,....
181$ simpleperf stat -p 7964 --group cache-references,cache-misses \
182      --group cache-references:u,cache-misses:u --group cache-references:k,cache-misses:k \
183      -e instructions --duration 1
184Performance counter statistics:
185
1863,638,900  cache-references     # 4.786 M/sec          (74%)
187   65,171  cache-misses         # 1.790953% miss rate  (74%)
1882,390,433  cache-references:u   # 3.153 M/sec          (74%)
189   32,280  cache-misses:u       # 1.350383% miss rate  (74%)
190  879,035  cache-references:k   # 1.251 M/sec          (68%)
191   30,303  cache-misses:k       # 3.447303% miss rate  (68%)
1928,921,161  instructions         # 10.070 M/sec         (86%)
193
194Total test time: 1.029843 seconds.
195```
196
197### Select target to stat
198
199We can select which processes or threads to monitor via -p or -t. Monitoring a
200process is the same as monitoring all threads in the process. Simpleperf can also fork a child
201process to run the new command and then monitor the child process.
202
203```sh
204# Stat process 11904 and 11905.
205$ simpleperf stat -p 11904,11905 --duration 10
206
207# Stat thread 11904 and 11905.
208$ simpleperf stat -t 11904,11905 --duration 10
209
210# Start a child process running `ls`, and stat it.
211$ simpleperf stat ls
212
213# Stat the process of an Android application. This only works for debuggable apps on non-rooted
214# devices.
215$ simpleperf stat --app com.example.simpleperf.simpleperfexamplewithnative
216
217# Stat system wide using -a.
218$ simpleperf stat -a --duration 10
219```
220
221### Decide how long to stat
222
223When monitoring existing threads, we can use --duration to decide how long to monitor. When
224monitoring a child process running a new command, simpleperf monitors until the child process ends.
225In this case, we can use Ctrl-C to stop monitoring at any time.
226
227```sh
228# Stat process 11904 for 10 seconds.
229$ simpleperf stat -p 11904 --duration 10
230
231# Stat until the child process running `ls` finishes.
232$ simpleperf stat ls
233
234# Stop monitoring using Ctrl-C.
235$ simpleperf stat -p 11904 --duration 10
236^C
237```
238
239If you want to write a script to control how long to monitor, you can send one of SIGINT, SIGTERM,
240SIGHUP signals to simpleperf to stop monitoring.
241
242### Decide the print interval
243
244When monitoring perf counters, we can also use --interval to decide the print interval.
245
246```sh
247# Print stat for process 11904 every 300ms.
248$ simpleperf stat -p 11904 --duration 10 --interval 300
249
250# Print system wide stat at interval of 300ms for 10 seconds. Note that system wide profiling needs
251# root privilege.
252$ su 0 simpleperf stat -a --duration 10 --interval 300
253```
254
255### Display counters in systrace
256
257Simpleperf can also work with systrace to dump counters in the collected trace. Below is an example
258to do a system wide stat.
259
260```sh
261# Capture instructions (kernel only) and cache misses with interval of 300 milliseconds for 15
262# seconds.
263$ su 0 simpleperf stat -e instructions:k,cache-misses -a --interval 300 --duration 15
264# On host launch systrace to collect trace for 10 seconds.
265(HOST)$ external/chromium-trace/systrace.py --time=10 -o new.html sched gfx view
266# Open the collected new.html in browser and perf counters will be shown up.
267```
268
269### Show event count per thread
270
271By default, stat cmd outputs an event count sum for all monitored targets. But when `--per-thread`
272option is used, stat cmd outputs an event count for each thread in monitored targets. It can be
273used to find busy threads in a process or system wide. With `--per-thread` option, stat cmd opens
274a perf_event_file for each exisiting thread. If a monitored thread creates new threads, event
275count for new threads will be added to the monitored thread by default, otherwise omitted if
276`--no-inherit` option is also used.
277
278```sh
279# Print event counts for each thread in process 11904. Event counts for threads created after
280# stat cmd will be added to threads creating them.
281$ simpleperf stat --per-thread -p 11904 --duration 1
282
283# Print event counts for all threads running in the system every 1s. Threads not running will not
284# be reported.
285$ su 0 simpleperf stat --per-thread -a --interval 1000 --interval-only-values
286
287# Print event counts for all threads running in the system every 1s. Event counts for threads
288# created after stat cmd will be omitted.
289$ su 0 simpleperf stat --per-thread -a --interval 1000 --interval-only-values --no-inherit
290```
291
292### Show event count per core
293
294By default, stat cmd outputs an event count sum for all monitored cpu cores. But when `--per-core`
295option is used, stat cmd outputs an event count for each core. It can be used to see how events
296are distributed on different cores.
297When stating non-system wide with `--per-core` option, simpleperf creates a perf event for each
298monitored thread on each core. When a thread is in running state, perf events on all cores are
299enabled, but only the perf event on the core running the thread is in running state. So the
300percentage comment shows runtime_on_a_core / runtime_on_all_cores. Note that, percentage is still
301affected by hardware counter multiplexing. Check simpleperf log output for ways to distinguish it.
302
303```sh
304# Print event counts for each cpu running threads in process 11904.
305# A percentage shows runtime_on_a_cpu / runtime_on_all_cpus.
306$ simpleperf stat --per-core -p 11904 --duration 1
307Performance counter statistics:
308
309# cpu       count  event_name   # percentage = event_run_time / enabled_time
310  7    56,552,838  cpu-cycles   #   (60%)
311  3    25,958,605  cpu-cycles   #   (20%)
312  0    22,822,698  cpu-cycles   #   (15%)
313  1     6,661,495  cpu-cycles   #   (5%)
314  4     1,519,093  cpu-cycles   #   (0%)
315
316Total test time: 1.001082 seconds.
317
318# Print event counts for each cpu system wide.
319$ su 0 simpleperf stat --per-core -a --duration 1
320
321# Print cpu-cycle event counts for each cpu for each thread running in the system.
322$ su 0 simpleperf stat -e cpu-cycles -a --per-thread --per-core --duration 1
323```
324
325## The record command
326
327The record command is used to dump samples of the profiled processes. Each sample can contain
328information like the time at which the sample was generated, the number of events since last
329sample, the program counter of a thread, the call chain of a thread.
330
331By passing options, we can select which events to use, which processes/threads to monitor,
332what frequency to dump samples, how long to monitor, and where to store samples.
333
334```sh
335# Record on process 7394 for 10 seconds, using default event (cpu-cycles), using default sample
336# frequency (4000 samples per second), writing records to perf.data.
337$ simpleperf record -p 7394 --duration 10
338simpleperf I cmd_record.cpp:316] Samples recorded: 21430. Samples lost: 0.
339```
340
341### Select events to record
342
343By default, the cpu-cycles event is used to evaluate consumed cpu cycles. But we can also use other
344events via -e.
345
346```sh
347# Record using event instructions.
348$ simpleperf record -e instructions -p 11904 --duration 10
349
350# Record using task-clock, which shows the passed CPU time in nanoseconds.
351$ simpleperf record -e task-clock -p 11904 --duration 10
352```
353
354### Select target to record
355
356The way to select target in record command is similar to that in the stat command.
357
358```sh
359# Record process 11904 and 11905.
360$ simpleperf record -p 11904,11905 --duration 10
361
362# Record thread 11904 and 11905.
363$ simpleperf record -t 11904,11905 --duration 10
364
365# Record a child process running `ls`.
366$ simpleperf record ls
367
368# Record the process of an Android application. This only works for debuggable apps on non-rooted
369# devices.
370$ simpleperf record --app com.example.simpleperf.simpleperfexamplewithnative
371
372# Record system wide.
373$ simpleperf record -a --duration 10
374```
375
376### Set the frequency to record
377
378We can set the frequency to dump records via -f or -c. For example, -f 4000 means
379dumping approximately 4000 records every second when the monitored thread runs. If a monitored
380thread runs 0.2s in one second (it can be preempted or blocked in other times), simpleperf dumps
381about 4000 * 0.2 / 1.0 = 800 records every second. Another way is using -c. For example, -c 10000
382means dumping one record whenever 10000 events happen.
383
384```sh
385# Record with sample frequency 1000: sample 1000 times every second running.
386$ simpleperf record -f 1000 -p 11904,11905 --duration 10
387
388# Record with sample period 100000: sample 1 time every 100000 events.
389$ simpleperf record -c 100000 -t 11904,11905 --duration 10
390```
391
392To avoid taking too much time generating samples, kernel >= 3.10 sets the max percent of cpu time
393used for generating samples (default is 25%), and decreases the max allowed sample frequency when
394hitting that limit. Simpleperf uses --cpu-percent option to adjust it, but it needs either root
395privilege or to be on Android >= Q.
396
397```sh
398# Record with sample frequency 10000, with max allowed cpu percent to be 50%.
399$ simpleperf record -f 1000 -p 11904,11905 --duration 10 --cpu-percent 50
400```
401
402### Decide how long to record
403
404The way to decide how long to monitor in record command is similar to that in the stat command.
405
406```sh
407# Record process 11904 for 10 seconds.
408$ simpleperf record -p 11904 --duration 10
409
410# Record until the child process running `ls` finishes.
411$ simpleperf record ls
412
413# Stop monitoring using Ctrl-C.
414$ simpleperf record -p 11904 --duration 10
415^C
416```
417
418If you want to write a script to control how long to monitor, you can send one of SIGINT, SIGTERM,
419SIGHUP signals to simpleperf to stop monitoring.
420
421### Set the path to store profiling data
422
423By default, simpleperf stores profiling data in perf.data in the current directory. But the path
424can be changed using -o.
425
426```sh
427# Write records to data/perf2.data.
428$ simpleperf record -p 11904 -o data/perf2.data --duration 10
429```
430
431#### Record call graphs
432
433A call graph is a tree showing function call relations. Below is an example.
434
435```
436main() {
437    FunctionOne();
438    FunctionTwo();
439}
440FunctionOne() {
441    FunctionTwo();
442    FunctionThree();
443}
444a call graph:
445    main-> FunctionOne
446       |    |
447       |    |-> FunctionTwo
448       |    |-> FunctionThree
449       |
450       |-> FunctionTwo
451```
452
453A call graph shows how a function calls other functions, and a reversed call graph shows how
454a function is called by other functions. To show a call graph, we need to first record it, then
455report it.
456
457There are two ways to record a call graph, one is recording a dwarf based call graph, the other is
458recording a stack frame based call graph. Recording dwarf based call graphs needs support of debug
459information in native binaries. While recording stack frame based call graphs needs support of
460stack frame registers.
461
462```sh
463# Record a dwarf based call graph
464$ simpleperf record -p 11904 -g --duration 10
465
466# Record a stack frame based call graph
467$ simpleperf record -p 11904 --call-graph fp --duration 10
468```
469
470[Here](README.md#suggestions-about-recording-call-graphs) are some suggestions about recording call graphs.
471
472### Record both on CPU time and off CPU time
473
474Simpleperf is a CPU profiler, it generates samples for a thread only when it is running on a CPU.
475However, sometimes we want to figure out where the time of a thread is spent, whether it is running
476on a CPU, or staying in the kernel's ready queue, or waiting for something like I/O events.
477
478To support this, the record command uses --trace-offcpu to trace both on CPU time and off CPU time.
479When --trace-offcpu is used, simpleperf generates a sample when a running thread is scheduled out,
480so we know the callstack of a thread when it is scheduled out. And when reporting a perf.data
481generated with --trace-offcpu, we use time to the next sample (instead of event counts from the
482previous sample) as the weight of the current sample. As a result, we can get a call graph based
483on timestamps, including both on CPU time and off CPU time.
484
485trace-offcpu is implemented using sched:sched_switch tracepoint event, which may not be supported
486on old kernels. But it is guaranteed to be supported on devices >= Android O MR1. We can check
487whether trace-offcpu is supported as below.
488
489trace-offcpu only works with recording one of events: cpu-cycles, cpu-clock, task-clock. Using it
490with other events or multiple events doesn't make much sense and makes the report confusing.
491
492```sh
493$ simpleperf list --show-features
494dwarf-based-call-graph
495trace-offcpu
496```
497
498If trace-offcpu is supported, it will be shown in the feature list. Then we can try it.
499
500```sh
501# Record with --trace-offcpu.
502$ simpleperf record -g -p 11904 --duration 10 --trace-offcpu
503
504# Record with --trace-offcpu using app_profiler.py.
505$ python app_profiler.py -p com.example.simpleperf.simpleperfexamplewithnative -a .SleepActivity \
506    -r "-g -e task-clock:u -f 1000 --duration 10 --trace-offcpu"
507```
508
509Below is an example comparing the profiling result with / without --trace-offcpu.
510First we record without --trace-offcpu.
511
512```sh
513$ python app_profiler.py -p com.example.simpleperf.simpleperfexamplewithnative -a .SleepActivity
514
515$ python report_html.py --add_disassembly --add_source_code --source_dirs ../demo
516```
517
518The result is [here](./without_trace_offcpu.html).
519In the result, all time is taken by RunFunction(), and sleep time is ignored.
520But if we add --trace-offcpu, the result changes.
521
522```sh
523$ python app_profiler.py -p com.example.simpleperf.simpleperfexamplewithnative -a .SleepActivity \
524    -r "-g -e task-clock:u --trace-offcpu -f 1000 --duration 10"
525
526$ python report_html.py --add_disassembly --add_source_code --source_dirs ../demo
527```
528
529The result is [here](./trace_offcpu.html).
530In the result, half of the time is taken by RunFunction(), and the other half is taken by
531SleepFunction(). So it traces both on CPU time and off CPU time.
532
533## The report command
534
535The report command is used to report profiling data generated by the record command. The report
536contains a table of sample entries. Each sample entry is a row in the report. The report command
537groups samples belong to the same process, thread, library, function in the same sample entry. Then
538sort the sample entries based on the event count a sample entry has.
539
540By passing options, we can decide how to filter out uninteresting samples, how to group samples
541into sample entries, and where to find profiling data and binaries.
542
543Below is an example. Records are grouped into 4 sample entries, each entry is a row. There are
544several columns, each column shows piece of information belonging to a sample entry. The first
545column is Overhead, which shows the percentage of events inside the current sample entry in total
546events. As the perf event is cpu-cycles, the overhead is the percentage of CPU cycles used in each
547function.
548
549```sh
550# Reports perf.data, using only records sampled in libsudo-game-jni.so, grouping records using
551# thread name(comm), process id(pid), thread id(tid), function name(symbol), and showing sample
552# count for each row.
553$ simpleperf report --dsos /data/app/com.example.sudogame-2/lib/arm64/libsudo-game-jni.so \
554      --sort comm,pid,tid,symbol -n
555Cmdline: /data/data/com.example.sudogame/simpleperf record -p 7394 --duration 10
556Arch: arm64
557Event: cpu-cycles (type 0, config 0)
558Samples: 28235
559Event count: 546356211
560
561Overhead  Sample  Command    Pid   Tid   Symbol
56259.25%    16680   sudogame  7394  7394  checkValid(Board const&, int, int)
56320.42%    5620    sudogame  7394  7394  canFindSolution_r(Board&, int, int)
56413.82%    4088    sudogame  7394  7394  randomBlock_r(Board&, int, int, int, int, int)
5656.24%     1756    sudogame  7394  7394  @plt
566```
567
568### Set the path to read profiling data
569
570By default, the report command reads profiling data from perf.data in the current directory.
571But the path can be changed using -i.
572
573```sh
574$ simpleperf report -i data/perf2.data
575```
576
577### Set the path to find binaries
578
579To report function symbols, simpleperf needs to read executable binaries used by the monitored
580processes to get symbol table and debug information. By default, the paths are the executable
581binaries used by monitored processes while recording. However, these binaries may not exist when
582reporting or not contain symbol table and debug information. So we can use --symfs to redirect
583the paths.
584
585```sh
586# In this case, when simpleperf wants to read executable binary /A/b, it reads file in /A/b.
587$ simpleperf report
588
589# In this case, when simpleperf wants to read executable binary /A/b, it prefers file in
590# /debug_dir/A/b to file in /A/b.
591$ simpleperf report --symfs /debug_dir
592
593# Read symbols for system libraries built locally. Note that this is not needed since Android O,
594# which ships symbols for system libraries on device.
595$ simpleperf report --symfs $ANDROID_PRODUCT_OUT/symbols
596```
597
598### Filter samples
599
600When reporting, it happens that not all records are of interest. The report command supports four
601filters to select samples of interest.
602
603```sh
604# Report records in threads having name sudogame.
605$ simpleperf report --comms sudogame
606
607# Report records in process 7394 or 7395
608$ simpleperf report --pids 7394,7395
609
610# Report records in thread 7394 or 7395.
611$ simpleperf report --tids 7394,7395
612
613# Report records in libsudo-game-jni.so.
614$ simpleperf report --dsos /data/app/com.example.sudogame-2/lib/arm64/libsudo-game-jni.so
615```
616
617### Group samples into sample entries
618
619The report command uses --sort to decide how to group sample entries.
620
621```sh
622# Group records based on their process id: records having the same process id are in the same
623# sample entry.
624$ simpleperf report --sort pid
625
626# Group records based on their thread id and thread comm: records having the same thread id and
627# thread name are in the same sample entry.
628$ simpleperf report --sort tid,comm
629
630# Group records based on their binary and function: records in the same binary and function are in
631# the same sample entry.
632$ simpleperf report --sort dso,symbol
633
634# Default option: --sort comm,pid,tid,dso,symbol. Group records in the same thread, and belong to
635# the same function in the same binary.
636$ simpleperf report
637```
638
639#### Report call graphs
640
641To report a call graph, please make sure the profiling data is recorded with call graphs,
642as [here](#record-call-graphs).
643
644```
645$ simpleperf report -g
646```
647