High Performance Computing - Charles Severance [46]
tcov
tcov, available on Sun workstations and other SPARC machines that run SunOS, gives execution statistics that describe the number of times each source statement was executed. It is very easy to use. Assume for illustration that we have a source program called foo.c. The following steps create a basic block profile:
% cc -a foo.c -o foo
% foo
% tcov foo.c
The -a option tells the compiler to include the necessary support for tcov.[32] Several files are created in the process. One called foo.d accumulates a history of the execution frequencies within the program foo. That is, old data is updated with new data each time foo is run, so you can get an overall picture of what happens inside foo, given a variety of data sets. Just remember to clean out the old data if you want to start over. The profile itself goes into a file called foo.tcov.
Let’s look at an illustration. Below is a short C program that performs a bubble sort of 10 integers:
int n[] = {23,12,43,2,98,78,2,51,77,8};
main ()
{
    int i, j, ktemp;
    for (i=10; i>0; i--) {
        for (j=0; j<i; j++) {
            if (n[j] < n[j+1]) {
                ktemp = n[j+1], n[j+1] = n[j], n[j] = ktemp;
            }
        }
    }
}
tcov produces a basic block profile that contains execution counts for each source line, plus some summary statistics (not shown):
          int n[] = {23,12,43,2,98,78,2,51,77,8};
          main ()
     1 -> {
              int i, j, ktemp;
    10 ->     for (i=10; i>0; i--) {
10, 55 ->         for (j=0; j<i; j++) {
    55 ->             if (n[j] < n[j+1]) {
    23 ->                 ktemp = n[j+1], n[j+1] = n[j], n[j] = ktemp;
                      }
                  }
              }
     1 -> }
The numbers to the left tell you the number of times each block was entered. For instance, you can see that the routine was entered just once, and that the highest count occurs at the test n[j] < n[j+1]. tcov shows more than one count on a line in places where the compiler has created more than one block.
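To make the counts concrete, here is a minimal sketch that instruments the bubble sort by hand, with one counter per block. It is not how tcov itself is implemented; the struct block_counts type and the count_blocks function are hypothetical names introduced for this illustration. The inner loop bound j < i is taken to match the count of 55 in the listing, and the array is padded with one trailing 0 (an assumption, not part of the original program) so that n[j+1] stays inside the array when j reaches 9.

```c
/* Counters standing in for the per-block counts that tcov reports. */
struct block_counts {
    long outer;  /* times the outer loop body was entered */
    long inner;  /* times the inner loop body (the comparison) was entered */
    long swaps;  /* times the swap statement was executed */
};

/* Bubble sort from the text, with a counter bumped in each block.
 * Assumption: an extra trailing 0 pads the array so that the read of
 * n[j+1] stays in bounds; it is never swapped because no element is
 * smaller than 0. */
struct block_counts count_blocks(void)
{
    int n[11] = {23, 12, 43, 2, 98, 78, 2, 51, 77, 8, 0};
    struct block_counts c = {0, 0, 0};
    int i, j, ktemp;

    for (i = 10; i > 0; i--) {
        c.outer++;
        for (j = 0; j < i; j++) {
            c.inner++;
            if (n[j] < n[j + 1]) {
                c.swaps++;
                ktemp = n[j + 1], n[j + 1] = n[j], n[j] = ktemp;
            }
        }
    }
    return c;
}
```

Calling count_blocks() gives outer = 10, inner = 55 (10 + 9 + ... + 1), and swaps = 23, the same figures that appear to the left of the corresponding lines in the profile.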
pixie
pixie is a little different from tcov. Rather than reporting the number of times each source line was executed, pixie reports the number of machine clock cycles devoted to executing each line. In theory, you could use this to calculate the amount of time spent per statement, although anomalies like cache misses are not represented.
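As a rough sketch of that calculation, a cycle count can be divided by the processor clock rate to estimate time. The function name cycles_to_seconds and the 40 MHz clock rate used in the note below are assumptions chosen for illustration; neither comes from pixie or from the text.

```c
/* Estimate elapsed time from a cycle count.
 * clock_hz is the processor clock rate in cycles per second.
 * The result ignores effects pixie does not model, such as cache misses. */
double cycles_to_seconds(unsigned long cycles, double clock_hz)
{
    return (double)cycles / clock_hz;
}
```

For example, on a hypothetical 40 MHz processor, cycles_to_seconds(1000, 40e6) comes to 25 microseconds. Treat such figures as a lower bound, since memory stalls add time that the cycle counts do not include.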
pixie works by “pixifying” an executable file that has been compiled and linked in the normal way. Below we run pixie on foo to create a new executable called foo.pixie:
% cc foo.c -o foo
% pixie foo
% foo.pixie
% prof -pixie foo
Also created was a file named foo.Addrs, which contains addresses for the basic blocks within foo. When the new program, foo.pixie, is run, it creates a file called foo.Counts, containing execution counts for the basic blocks whose addresses are stored in foo.Addrs. pixie data accumulates from run to run. The statistics are retrieved using prof and a special -pixie flag.
pixie’s default output comes in three sections and shows:
Cycles per routine
Procedure invocation counts
Cycles per basic line
Below, we have listed the output of the third section for the bubble sort:
procedure (file)        line   bytes   cycles       %    cum %
main (foo.c)               7      44      605   12.11    12.11
_cleanup (flsbuf.c)       59      20      500   10.01    22.13
fclose (flsbuf.c)         81      20      500   10.01    32.14
fclose (flsbuf.c)         94      20      500   10.01    42.15
_cleanup (flsbuf.c)       54      20      500   10.01    52.16
fclose (flsbuf.c)         76      16      400    8.01    60.17
main (foo.c)              10      24      298    5.97    66.14
main (foo.c)               8      36      207    4.14    70.28
....                      ..      ..       ..     ...      ...
Here you can see three entries for the main routine from foo.c, plus a number of system library routines. The entries show the associated line number and the number of machine cycles dedicated to executing that line as the program ran. For instance, line 7 of foo.c took 605 cycles (12% of the runtime).
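The % and cum % columns follow from the cycle counts by simple arithmetic, sketched below. The function cycle_percentages is a hypothetical helper, not part of prof. The whole-program cycle total is not printed in the excerpt above; the value 4994 used in the note after the code is back-computed from the percentages shown and should be read as an assumption.

```c
/* Compute the "%" and "cum %" columns from per-line cycle counts.
 * cycles[]     : cycle count for each row, in the order listed
 * total_cycles : cycle count for the whole run (must be nonzero)
 * pct[], cum[] : outputs, one entry per row */
void cycle_percentages(const long *cycles, int rows, long total_cycles,
                       double *pct, double *cum)
{
    long running = 0;  /* cycles accumulated over the rows so far */
    int r;

    for (r = 0; r < rows; r++) {
        running += cycles[r];
        pct[r] = 100.0 * (double)cycles[r] / (double)total_cycles;
        cum[r] = 100.0 * (double)running / (double)total_cycles;
    }
}
```

Feeding in the eight cycle counts from the listing (605, 500, 500, 500, 500, 400, 298, 207) with the assumed total of 4994 gives about 12.11 for the first row and about 70.28 for the cumulative figure on the last row, in line with the table.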
Virtual Memory
In addition to the negative performance impact due to cache misses, the virtual memory system can also slow your program down if it is too large to fit in the memory of the system