Apologies for the longish post, I'm having trouble formulating it in a shorter way. Also, maybe this is more appropriate for Unix & Linux Stack Exchange, but I'll try here at SO first, as there is an ftrace
tag.
Anyways - I'd like to observe do machine instructions of a user program execute in the context of a full function_graph
capture using ftrace
. One problem is that I need this for an older kernel:
$ uname -a
Linux mypc 2.6.38-16-generic #67-Ubuntu SMP Thu Sep 6 18:00:43 UTC 2012 i686 i686 i386 GNU/Linux
... and in this edition, there is no UPROBES
- which, as Uprobes in 3.5 [LWN.net] notes, should be able to do something like that. (As long as I don't have to patch the original kernel, I would be willing to try a kernel module built out of tree, as User-Space Probes (Uprobes) [chunghwan.com] seems to demonstrate; but as far as I can see from 0: Inode based uprobes [LWN.net], the 2.6 would probably need a full patch)
However, on this version, there is a /sys/kernel/debug/kprobes
, and /sys/kernel/debug/tracing/kprobe_events
; and Documentation/trace/kprobetrace.txt implies that a kprobe can be set directly on an address; even if I cannot find an example anywhere on how this is used.
In any case, I would still not be sure what addresses to use - as a small example, let's say I want to trace the start of the main
function of the wtest.c
program (included below). I can do this to compile and obtain an machine instruction assembly listing:
$ gcc -g -O0 wtest.c -o wtest
$ objdump -S wtest | less
...
08048474 <main>:
int main(void) {
8048474: 55 push %ebp
8048475: 89 e5 mov %esp,%ebp
8048477: 83 e4 f0 and $0xfffffff0,%esp
804847a: 83 ec 30 sub $0x30,%esp
804847d: 65 a1 14 00 00 00 mov %gs:0x14,%eax
8048483: 89 44 24 2c mov %eax,0x2c(%esp)
8048487: 31 c0 xor %eax,%eax
char filename[] = "/tmp/wtest.txt";
...
return 0;
804850a: b8 00 00 00 00 mov $0x0,%eax
}
...
I would set up ftrace logging via this script:
sudo bash -c '
KDBGPATH="/sys/kernel/debug/tracing"
echo function_graph > $KDBGPATH/current_tracer
echo funcgraph-abstime > $KDBGPATH/trace_options
echo funcgraph-proc > $KDBGPATH/trace_options
echo 0 > $KDBGPATH/tracing_on
echo > $KDBGPATH/trace
echo 1 > $KDBGPATH/tracing_on ; ./wtest ; echo 0 > $KDBGPATH/tracing_on
cat $KDBGPATH/trace > wtest.ftrace
'
You can see a portion of the (otherwise complex) resulting ftrace
log in debugging - Observing a hard-disk write in kernel space (with drivers/modules) - Unix & Linux Stack Exchange (where I got the example from).
Basically, I'd want a printout in this ftrace
log, when the first instructions of main
- say, the instructions at 0x8048474, 0x8048475, 0x8048477, 0x804847a, 0x804847d, 0x8048483 and 0x8048487 - are executed by (any) CPU. The problem is, as far as I can understand from Anatomy of a Program in Memory : Gustavo Duarte, these addresses are the virtual addresses, as seen from the perspective of the process itself (and I gather, the same perspective is shown by /proc/PID/maps
)... And apparently, for krpobe_event
I'd need a physical address?
So, my idea would be: if I can find the physical addresses corresponding to the virtual addresses of the program disassembly (say by coding a kernel module, which would accept pid and address, and return the physical address via procfs), I could set up addresses as a sort of "tracepoints" via /sys/kernel/debug/tracing/kprobe_events
in the above script - and hopefully get them in the ftrace
log. Could this work, in principle?
One problem with this, I found on Linux(ubuntu), C language: Virtual to Physical Address Translation - Stack Overflow:
In user code, you can't know the physical address corresponding to a virtual address. This is information is simply not exported outside the kernel. It could even change at any time, especially if the kernel decides to swap out part of your process's memory.
...
Pass the virtual address to the kernel using systemcall/procfs and use vmalloc_to_pfn. Return the Physical address through procfs/registers.
However, vmalloc_to_pfn
doesn't seem to be trivial either:
VA: 0xf8ab87fc PA using vmalloc_to_pfn: 0x36f7f7fc. But I'm actually expecting: 0x136f7f7fc.
...
The physical address falls between 4 to 5 GB. But I can't get the exact physical address, I only get the chopped off 32-bit address. Is there another way to get true physical address?
So, I'm not sure how reliably I could extract the physical addresses so they are traced by kprobes - especially since "it could even change at any time". But here, I would hope that since the program is small and trivial, there would be a reasonable chance that the program would not swap while being traced, allowing for a proper capture to be obtained. (So even if I have to run the debug script above multiple times, as long as I can hope to obtain a "proper" capture once out of 10 times (or even 100 times), I'd be OK with it.).
Note that I'd want an output through ftrace
, so that the timestamps are expressed in the same domain (see Reliable Linux kernel timestamps (or adjustment thereof) with both usbmon and ftrace? - Stack Overflow for an illustration of a problem with timestamps). Thus, even if I could come up with, say, a gdb
script, to run and trace the program from userspace (while simultaneously an ftrace
capture is obtained) - I'd like to avoid that, as the overhead from gdb
itself will show in the ftrace
logs.
So, in summary:
- Is the approach of obtaining (possibly through a separate kernel module) physical addresses from the virtual (from a disassembly of an executable) addresses - so they are used to trigger a kprobe_event logged by ftrace - worth pursuing? If so, are there any examples of kernel modules that can be used for this address translation purpose?
- Could I otherwise use a kernel module to "register" a callback/handler function when a particular memory address is being executed? Then I could simply use a
trace_printk
in that function to have anftrace
log (or even without that, the handler function name itself should show in theftrace
log), and it doesn't seem there will be too much overhead with that...
Actually, in this 2007 posting, Jim Keniston - utrace-based uprobes: systemtap mailing list, there is a 11. Uprobes Example
(added to Documentation/uprobes.txt
), which seems to be exactly that - a kernel module registering a handler function. Unfortunately, it uses linux/uprobes.h
; and I have only kprobes.h
in my /usr/src/linux-headers-2.6.38-16/include/linux/
. Also, on my system, even systemtap
complains about CONFIG_UTRACE
not being enabled (see this comment)... So if there's any other approach I could use to obtain a debug trace like I want, without having to recompile the kernel to get uprobes, it would be great to know...
wtest.c
:
#include <stdio.h>
#include <fcntl.h> // O_CREAT, O_WRONLY, S_IRUSR
int main(void) {
char filename[] = "/tmp/wtest.txt";
char buffer[] = "abcd";
int fd;
mode_t perms = S_IRUSR|S_IWUSR|S_IRGRP|S_IWGRP|S_IROTH|S_IWOTH;
fd = open(filename, O_RDWR|O_CREAT, perms);
write(fd,buffer,4);
close(fd);
return 0;
}
Obviously, this would be much easier with built-in uprobes on kernels 3.5+; but given that uprobes for my kernel 2.6.38 is a very deep-going patch (which I couldn't really isolate in a separate kernel module, so as to avoid patching the kernel), here is what I can note for a standalone module on 2.6.38. (Since I'm still unsure of many things, I would still like to see an answer that would corrects any misunderstandings in this post.)
I think I got somewhere, but not with
kprobes
. I'm not sure, but it seems I managed to get physical addresses right; however,kprobes
documentation is specific that when using "@ADDR : fetch memory at ADDR (ADDR should be in kernel)"; and the physical addresses I get are below kernel boundary of 0xc0000000 (but then, 0xc0000000 is usually together with the virtual memory layout?).So I used a hardware breakpoint instead - the module is below, however caveat emptor - it behaves randomly, and occasionally can cause a kernel oops!. By compiling the module, and running in
bash
:... I get a log. I want to trace the first two instructions of the
main()
ofwtest
, which for me are:... at virtual addresses 0x08048474 and 0x08048475. In the
syslog
output, I could get, say:... meaning that it mapped the virtual address 0x08048474 to physical address 0x639ec474. However, the physical is not used for hardware breakpoints - there we can supply a virtual address directly to
register_user_hw_breakpoint
; however, we also need to supply thetask_struct
of the process too. With that, I can get something like this in theftrace
output:... where the traces corresponding to the assembly are marked by breakpoint id. Thankfully, they are right after another, as expected; however,
ftrace
has also captured some debug commands in-between. In any case, this is what I wanted to see.Here are some notes about the module:
kallsyms
, then I use a function pointer to the address; else other needed functions are copied from sourceSIGSTOP
(which on its own, seems kind of unreliable at that point), and set state to__TASK_STOPPED
).wtest
hanging in process list long after it would have terminated naturally, so I guess that works.page_to_phys()
to get to the address (internally via page frame number); LDD3 ch.15 helps with understanding relationship between pfn and physical address.objdump
's assembly output - I am not 100% sure this is correct, though.0x08048000
is neitheris_vmalloc_addr
norvirt_addr_valid
; I guess, this should tell me, one couldn't have used neithervmalloc_to_pfn()
norvirt_to_page()
to get to its physical address !?kprobes
forftrace
from kernel space is kinda tricky (needs functions copied)kprobe
on the physical addresses I get (e.g. 0x639ec474), always results with "Could not insert probe(-22)"kallsyms
address of thetracing_on()
function (0xc10bcf60) below; that seems to work - because it raises a fatal "BUG: scheduling while atomic" (apparently, we're not meant to set breakpoints in module_init?). Bug is fatal, because it makes thekprobes
directory dissapear from theftrace
debug directorykprobe
would not make it appear in theftrace
log - it also needs to be enabled; the necessary code for enabling is there - but I've never tried it, because of the previous bugHW_BREAKPOINT_X
,attr.bp_len
need to be set tosizeof(long)
printk
theattr
variable(s) - from _init or from the handler - something gets seriously messed up, and whatever variable I try to print next, I get value 0x5 (or 0x48) for it (?!)bp->id
As far as the randomness goes, I think this is because the process is not started in a stopped state; and by the time it gets stopped, it ends up in a different state (or, quite possibly, I'm missing some locking somewhere). Anyways, you can also expect in
syslog
:... that is, even with a proper task pointer (judging by start_code), only 0x0 is obtained as physical address. Sometimes you get the same outcome, but with
start_code: 0x00000000 ->end_code: 0x00000000
. And sometimes, atask_struct
cannot be obtained, even if pid can:Well, hopefully someone will comment and clarify some of the behavior of this module
:)
Hope this helps someone,
Cheers!
Makefile
:callmodule.c
: