Tuesday, October 13, 2009

Oprofile 0xdroid Android on Beagleboard

Android actually supports oprofile, and with some oprofile knowledge you can use it happily on the G1. However, external/oprofile in Android does not support ARM_V7 yet. To use it on ARMv7, apply the following patch, which adds the ARM_V7 CPU type and enables support for it:


diff --git a/libop/op_cpu_type.c b/libop/op_cpu_type.c
index b9d13de..737f63e 100644
--- a/libop/op_cpu_type.c
+++ b/libop/op_cpu_type.c
@@ -74,6 +74,7 @@ static struct cpu_descr const cpu_descrs[MAX_CPU_TYPE] = {
{ "ppc64 POWER5++", "ppc64/power5++", CPU_PPC64_POWER5pp, 6 },
{ "e300", "ppc/e300", CPU_PPC_E300, 4 },
{ "AVR32", "avr32", CPU_AVR32, 3 },
+ { "ARM V7 PMNC", "arm/armv7", CPU_ARM_V7, 5},
};

static size_t const nr_cpu_descrs = sizeof(cpu_descrs) / sizeof(struct cpu_descr);
diff --git a/libop/op_cpu_type.h b/libop/op_cpu_type.h
index be95ae2..f4db260 100644
--- a/libop/op_cpu_type.h
+++ b/libop/op_cpu_type.h
@@ -72,6 +72,7 @@ typedef enum {
CPU_PPC64_POWER5pp, /**< ppc64 Power5++ family */
CPU_PPC_E300, /**< e300 */
CPU_AVR32, /**< AVR32 */
+ CPU_ARM_V7, /**< ARM V7 */
MAX_CPU_TYPE
} op_cpu;

diff --git a/libop/op_events.c b/libop/op_events.c
index b4a10e7..7f0ed25 100644
--- a/libop/op_events.c
+++ b/libop/op_events.c
@@ -793,6 +793,7 @@ void op_default_event(op_cpu cpu_type, struct op_default_event_descr * descr)
case CPU_ARM_XSCALE2:
case CPU_ARM_MPCORE:
case CPU_ARM_V6:
+ case CPU_ARM_V7:
case CPU_AVR32:
descr->name = "CPU_CYCLES";
break;
diff --git a/opimport_pull b/opimport_pull
index 7dbac4a..bf1f19a 100755
--- a/opimport_pull
+++ b/opimport_pull
@@ -1,4 +1,4 @@
-#!/usr/bin/python2.4 -E
+#!/usr/bin/python -E

import os
import re


Then add the event tables for ARMv7:

commit f129bca975b1704c06e07df7710d29de13a1e922
Author: Tick Chen <tick@0xlab.org>
Date: Sat Sep 26 22:56:44 2009 +0800

[oprofile] adding metadata of armv7

diff --git a/linux-x86/oprofile/arm/armv7/events b/linux-x86/oprofile/arm/armv7/events
new file mode 100644
index 0000000..2550e41
--- /dev/null
+++ b/linux-x86/oprofile/arm/armv7/events
@@ -0,0 +1,53 @@
+# ARM V7 events
+# From Cortex A8 DDI (ARM DDI 0344B, revision r1p1)
+#
+event:0x00 counters:1,2,3,4 um:zero minimum:500 name:PMNC_SW_INCR : Software increment of PMNC registers
+event:0x01 counters:1,2,3,4 um:zero minimum:500 name:IFETCH_MISS : Instruction fetch misses from cache or normal cacheable memory
+event:0x02 counters:1,2,3,4 um:zero minimum:500 name:ITLB_MISS : Instruction fetch misses from TLB
+event:0x03 counters:1,2,3,4 um:zero minimum:500 name:DCACHE_REFILL : Data R/W operation that causes a refill from cache or normal cacheable memory
+event:0x04 counters:1,2,3,4 um:zero minimum:500 name:DCACHE_ACCESS : Data R/W from cache
+event:0x05 counters:1,2,3,4 um:zero minimum:500 name:DTLB_REFILL : Data R/W that causes a TLB refill
+event:0x06 counters:1,2,3,4 um:zero minimum:500 name:DREAD : Data read architecturally executed (note: architecturally executed = for instructions that are unconditional or that pass the condition code)
+event:0x07 counters:1,2,3,4 um:zero minimum:500 name:DWRITE : Data write architecturally executed
+event:0x08 counters:1,2,3,4 um:zero minimum:500 name:INSTR_EXECUTED : All executed instructions
+event:0x09 counters:1,2,3,4 um:zero minimum:500 name:EXC_TAKEN : Exception taken
+event:0x0A counters:1,2,3,4 um:zero minimum:500 name:EXC_EXECUTED : Exception return architecturally executed
+event:0x0B counters:1,2,3,4 um:zero minimum:500 name:CID_WRITE : Instruction that writes to the Context ID Register architecturally executed
+event:0x0C counters:1,2,3,4 um:zero minimum:500 name:PC_WRITE : SW change of PC, architecturally executed (not by exceptions)
+event:0x0D counters:1,2,3,4 um:zero minimum:500 name:PC_IMM_BRANCH : Immediate branch instruction executed (taken or not)
+event:0x0E counters:1,2,3,4 um:zero minimum:500 name:PC_PROC_RETURN : Procedure return architecturally executed (not by exceptions)
+event:0x0F counters:1,2,3,4 um:zero minimum:500 name:UNALIGNED_ACCESS : Unaligned access architecturally executed
+event:0x10 counters:1,2,3,4 um:zero minimum:500 name:PC_BRANCH_MIS_PRED : Branch mispredicted or not predicted. Counts pipeline flushes because of misprediction
+event:0x12 counters:1,2,3,4 um:zero minimum:500 name:PC_BRANCH_MIS_USED : Branch or change in program flow that could have been predicted
+event:0x40 counters:1,2,3,4 um:zero minimum:500 name:WRITE_BUFFER_FULL : Any write buffer full cycle
+event:0x41 counters:1,2,3,4 um:zero minimum:500 name:L2_STORE_MERGED : Any store that is merged in L2 cache
+event:0x42 counters:1,2,3,4 um:zero minimum:500 name:L2_STORE_BUFF : Any bufferable store from load/store to L2 cache
+event:0x43 counters:1,2,3,4 um:zero minimum:500 name:L2_ACCESS : Any access to L2 cache
+event:0x44 counters:1,2,3,4 um:zero minimum:500 name:L2_CACH_MISS : Any cacheable miss in L2 cache
+event:0x45 counters:1,2,3,4 um:zero minimum:500 name:AXI_READ_CYCLES : Number of cycles for an active AXI read
+event:0x46 counters:1,2,3,4 um:zero minimum:500 name:AXI_WRITE_CYCLES : Number of cycles for an active AXI write
+event:0x47 counters:1,2,3,4 um:zero minimum:500 name:MEMORY_REPLAY : Any replay event in the memory subsystem
+event:0x48 counters:1,2,3,4 um:zero minimum:500 name:UNALIGNED_ACCESS_REPLAY : Unaligned access that causes a replay
+event:0x49 counters:1,2,3,4 um:zero minimum:500 name:L1_DATA_MISS : L1 data cache miss as a result of the hashing algorithm
+event:0x4A counters:1,2,3,4 um:zero minimum:500 name:L1_INST_MISS : L1 instruction cache miss as a result of the hashing algorithm
+event:0x4B counters:1,2,3,4 um:zero minimum:500 name:L1_DATA_COLORING : L1 data access in which a page coloring alias occurs
+event:0x4C counters:1,2,3,4 um:zero minimum:500 name:L1_NEON_DATA : NEON data access that hits L1 cache
+event:0x4D counters:1,2,3,4 um:zero minimum:500 name:L1_NEON_CACH_DATA : NEON cacheable data access that hits L1 cache
+event:0x4E counters:1,2,3,4 um:zero minimum:500 name:L2_NEON : L2 access as a result of NEON memory access
+event:0x4F counters:1,2,3,4 um:zero minimum:500 name:L2_NEON_HIT : Any NEON hit in L2 cache
+event:0x50 counters:1,2,3,4 um:zero minimum:500 name:L1_INST : Any L1 instruction cache access, excluding CP15 cache accesses
+event:0x51 counters:1,2,3,4 um:zero minimum:500 name:PC_RETURN_MIS_PRED : Return stack misprediction at return stack pop (incorrect target address)
+event:0x52 counters:1,2,3,4 um:zero minimum:500 name:PC_BRANCH_FAILED : Branch prediction misprediction
+event:0x53 counters:1,2,3,4 um:zero minimum:500 name:PC_BRANCH_TAKEN : Any predicted branch that is taken
+event:0x54 counters:1,2,3,4 um:zero minimum:500 name:PC_BRANCH_EXECUTED : Any taken branch that is executed
+event:0x55 counters:1,2,3,4 um:zero minimum:500 name:OP_EXECUTED : Number of operations executed (in instruction or mutli-cycle instruction)
+event:0x56 counters:1,2,3,4 um:zero minimum:500 name:CYCLES_INST_STALL : Cycles where no instruction available
+event:0x57 counters:1,2,3,4 um:zero minimum:500 name:CYCLES_INST : Number of instructions issued in a cycle
+event:0x58 counters:1,2,3,4 um:zero minimum:500 name:CYCLES_NEON_DATA_STALL : Number of cycles the processor waits on MRC data from NEON
+event:0x59 counters:1,2,3,4 um:zero minimum:500 name:CYCLES_NEON_INST_STALL : Number of cycles the processor waits on NEON instruction queue or NEON load queue
+event:0x5A counters:1,2,3,4 um:zero minimum:500 name:NEON_CYCLES : Number of cycles NEON and integer processors are not idle
+event:0x70 counters:1,2,3,4 um:zero minimum:500 name:PMU0_EVENTS : Number of events from external input source PMUEXTIN[0]
+event:0x71 counters:1,2,3,4 um:zero minimum:500 name:PMU1_EVENTS : Number of events from external input source PMUEXTIN[1]
+event:0x72 counters:1,2,3,4 um:zero minimum:500 name:PMU_EVENTS : Number of events from both external input sources PMUEXTIN[0] and PMUEXTIN[1]
+event:0xFF counters:0 um:zero minimum:500 name:CPU_CYCLES : Number of CPU cycles
+
diff --git a/linux-x86/oprofile/arm/armv7/unit_masks b/linux-x86/oprofile/arm/armv7/unit_masks
new file mode 100644
index 0000000..02464a3
--- /dev/null
+++ b/linux-x86/oprofile/arm/armv7/unit_masks
@@ -0,0 +1,4 @@
+# ARM V7 PMNC possible unit masks
+#
+name:zero type:mandatory default:0x00
+ 0x00 No unit mask
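
To make the events format above more concrete, here is a minimal Python sketch that splits one line of the events file into its fields. The name parse_event_line and the dictionary keys are my own choices for illustration and are not part of oprofile. It expects lines from the file itself, without the leading + of the diff.

```python
import re

# Hypothetical helper, not part of oprofile: one regex for the
# "event:... counters:... um:... minimum:... name:... : description" layout.
EVENT_RE = re.compile(
    r"^event:(?P<code>0x[0-9A-Fa-f]+)\s+"
    r"counters:(?P<counters>[\d,]+)\s+"
    r"um:(?P<um>\S+)\s+"
    r"minimum:(?P<minimum>\d+)\s+"
    r"name:(?P<name>\S+)\s*:\s*(?P<desc>.*)$"
)


def parse_event_line(line):
    """Return a dict for an `event:` line, or None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    m = EVENT_RE.match(line)
    if m is None:
        raise ValueError("unrecognized event line: %r" % line)
    return {
        "code": int(m.group("code"), 16),
        "counters": [int(c) for c in m.group("counters").split(",")],
        "unit_mask": m.group("um"),
        "minimum": int(m.group("minimum")),
        "name": m.group("name"),
        "description": m.group("desc"),
    }
```

Feeding it the CPU_CYCLES line gives event code 255 (0xFF) on counter 0, while the other events in the table report counters 1 to 4.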


With this, oprofile already runs on the Beagleboard, but you cannot analyze the results yet, because the prebuilt opreport does not support ARM_V7.
I therefore downloaded and compiled oprofile 0.9.5 and replaced the prebuilt binaries with the new ones. After that the data can be analyzed.


All of this has already been done in 0xdroid, so you can use 0xdroid directly.
The default kernel released at http://downloads.0xlab.org/ does not currently enable the oprofile options, so you will need to set the following and recompile it:


+ CONFIG_OPROFILE_ARMV7=y
+ CONFIG_OPROFILE=y
+ CONFIG_PROFILING=y
+ CONFIG_HAVE_OPROFILE=y
+ CONFIG_TRACEPOINTS=y
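
Before rebuilding, it can help to check which of these options a given .config is still missing. This is a minimal sketch, assuming the usual KEY=y kernel .config syntax. missing_flags is a hypothetical helper, not part of any kernel tool.

```python
# The five options listed above.
REQUIRED_FLAGS = (
    "CONFIG_OPROFILE_ARMV7",
    "CONFIG_OPROFILE",
    "CONFIG_PROFILING",
    "CONFIG_HAVE_OPROFILE",
    "CONFIG_TRACEPOINTS",
)


def missing_flags(config_text, required=REQUIRED_FLAGS):
    """Return the required flags that are not set to 'y' in a kernel .config."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skips comments such as '# CONFIG_FOO is not set'
        key, _, value = line.partition("=")
        if value == "y":
            enabled.add(key)
    return [flag for flag in required if flag not in enabled]
```

For example, missing_flags(open(".config").read()) run in the kernel source directory returns an empty list once everything is enabled.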


Put the resulting vmlinux on USB storage or an SD card whose first partition is VFAT.

After booting 0xdroid beagle-cupcake or beagle-donut, run:


opcontrol --setup --event=CPU_CYCLES:15000:::1:1 \
--vmlinux=/sdcard/vmlinux \
--kernel-range=0xc0008000,0xcfffffff
echo 16 > /dev/oprofile/backtrace_depth


This sets up oprofiled to take a sample every 15000 clock cycles. The smaller the CPU_CYCLES count, the more detail you get and the heavier the profiling load. The larger the count, the less detail you get and the lighter the load.
While profiling the overhead of camera preview I ran into an interesting phenomenon. With a CPU_CYCLES count of 150000, sampling happened about 30 times per second, and I could not get anything meaningful out of the data. This confused me for a while, until I realized that this is almost the same as the camera frame rate, so I was always sampling at the same point in each frame. So even if you collect a lot of samples, the sampling period must be much smaller than the period of what you want to profile. Sampling can always be blind to some behavior. Be aware of that, and profile the same scenario with several different CPU_CYCLES counts to gain more confidence in the result.
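
This effect is easy to reproduce in a toy model. The sketch below is not oprofile code: it assumes a made-up frame of 33 ticks split into three invented phases, and counts which phase evenly spaced samples land in.

```python
from collections import Counter

# Hypothetical frame: phase names and durations are invented for illustration.
FRAME = [("capture", 10), ("convert", 15), ("display", 8)]  # 33 ticks per frame


def phase_at(t, frame=FRAME):
    """Return the phase that is running at tick t of a repeating frame."""
    period = sum(duration for _, duration in frame)
    t %= period
    for name, duration in frame:
        if t < duration:
            return name
        t -= duration


def sample(period, n_samples, frame=FRAME):
    """Count which phase each of n_samples evenly spaced samples lands in."""
    return Counter(phase_at(i * period, frame) for i in range(n_samples))
```

With sample(33, 330) the sampling period equals the frame period, so all 330 samples land in "capture" and the other two phases are invisible. With sample(7, 330) the samples drift through the frame and the counts come out as 100, 150 and 80, proportional to the phase durations.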

When you are ready to profile, enter:

opcontrol --start


Then exercise whatever you want to profile.
Stop oprofile with:

opcontrol --stop


After stopping oprofile, you can use a mini-USB cable to pull all the samples to the host machine and analyze them.


On device:
1. plug in a USB cable between the laptop and the Beagleboard (OTG port)
2. netcfg usb0 up
3. ifconfig usb0 192.168.0.202
On your host:
1. sudo ifconfig usb0 192.168.0.200 # beware: nm-applet may break this, so you may need to configure it as well
2. export ADBHOST=192.168.0.202
3. export PATH={Where you put 0xdroid}/out/host/linux-x86/bin:$PATH
4. pkill adb
5. adb devices # if the device is listed, go on to the next step; otherwise check what went wrong


Then:


cd {Where you put 0xdroid}
. build/envsetup.sh
setpaths
export OPROFILE_EVENTS_DIR=${PWD}/linux-x86/oprofile/
cd external/oprofile
./opimport_pull /tmp/0xdroid-oprofile


Copy your vmlinux to ${OUT}/symbols

Then you can analyze all the symbols with

${OPROFILE_EVENTS_DIR}/bin/opreport --session-dir=/tmp/0xdroid-oprofile -p ${OUT}/symbols


After analyzing, you can use ooffice, Graphviz, gnuplot, or whatever you like to rework the data.
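
As a starting point for that kind of rework, here is a rough Python sketch that turns opreport summary text into CSV. It assumes each data row looks like `<samples> <percent> <image name>`, which is how I understand the default summary layout. The function names are hypothetical, and other opreport options produce different layouts that this does not handle.

```python
import csv
import io
import re

# Assumed row shape of the opreport summary: "<samples> <percent> <image name>".
ROW_RE = re.compile(r"^\s*(\d+)\s+(\d+(?:\.\d+)?)\s+(\S.*?)\s*$")


def opreport_rows(report_text):
    """Extract (image, samples, percent) tuples from opreport summary text."""
    rows = []
    for line in report_text.splitlines():
        m = ROW_RE.match(line)
        if m:  # header and separator lines do not match and are skipped
            rows.append((m.group(3), int(m.group(1)), float(m.group(2))))
    return rows


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["image", "samples", "percent"])
    writer.writerows(rows)
    return out.getvalue()
```

The resulting CSV opens directly in a spreadsheet. For gnuplot, set the field separator first with `set datafile separator ","`.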







Happy profiling. :-)