Reading performance registers from the kernel

I want to read certain performance counters. I know that there are tools like perf, that can do it for me in the user space itself, I want the code to be inside the Linux kernel.

I want to write a mechanism to monitor performance counters on Intel(R) Core(TM) i7-3770 CPU. On top of using I am using Ubuntu kernel 4.19.2. I have gotten the following method from easyperf

Here's part of my code to read instructions.

  struct perf_event_attr *attr
  memset (&pe, 0, sizeof (struct perf_event_attr));
  pe.type = PERF_TYPE_HARDWARE;
  pe.size = sizeof (struct perf_event_attr);
  pe.config = PERF_COUNT_HW_INSTRUCTIONS;
  pe.disabled = 0;
  pe.exclude_kernel = 0;
  pe.exclude_user = 0;
  pe.exclude_hv = 0;
  pe.exclude_idle = 0;

  fd = syscall(__NR_perf_event_open, hw, pid, cpu, grp, flags);

  uint64_t perf_read(int fd) {
    uint64_t val;
    int rc;
    rc = read(fd, &val, sizeof(val));
    assert(rc == sizeof(val));
    return val;
  }

I want to put the same lines in the kernel code (in the context switch function) and check the values being read.

My end goal is to figure out a way to read performance counters for a process, every time it switches to another, from the kernel(4.19.2) itself.

To achieve this I check out the code for the system call number __NR_perf_event_open. It can be found here To make to usable I copied the code inside as a separate function, named it perf_event_open() in the same file and exported.

Now the problem is whenever I call perf_event_open() in the same way as above, the descriptor returned is -2. Checking with the error codes, I figured out that the error was ENOENT. In the perf_event_open() man page, the cause of this error is defined as wrong type field.

Since file descriptors are associated to the process that's opened them, how can one use them from the kernel? Is there an alternative way to configure the pmu to start counting without involving file descriptors?

You probably don't want the overhead of reprogramming a counter inside the context-switch function.

The easiest thing would be to make system calls from user-space to program the PMU (to count some event, probably setting it to count in kernel mode but not user-space, just so the counter overflows less often).

Then just use rdpmc twice (to get start/stop counts) in your custom kernel code. The counter will stay running, and I guess the kernel perf code will handle interrupts when it wraps around. (Or when its PEBS buffer is full.)

IDK if it's possible to program a counter so it just wraps without interrupting, for use-cases like this where you don't care about totals or sample-based profiling, and just want to use rdpmc. If so, do that.

Old answer, addressing your old question which was based on a buggy printf format string that was printing non-zero garbage even though you weren't counting anything in user-space either.

Your inline asm looks correct, so the question is what exactly that PMU counter is programmed to count in kernel mode in the context where your code runs.

perf virtualizes the PMU counters on context-switch, giving the illusion of perf stat counting a single process even when it migrates across CPUs. Unless you're using perf -a to get system-wide counts, the PMU might not be programmed to count anything, so multiple reads would all give 0 even if at other times it's programmed to count a fast-changing event like cycles or instructions.

Are you sure you have perf set to count user + kernel events, not just user-space events?

perf stat will show something like instructions:u instead of instructions if it's limiting itself to user-space. (This is the default for non-root if you haven't lowered sysctl kernel.perf_event_paranoid to 0 or something from the safe default that doesn't let user-space learn anything about the kernel.)

There's HW support for programming a counter to only count when CPL != 0 (i.e. not in ring 0 / kernel mode). Higher values for kernel.perf_event_paranoid restrict the perf API to not allow programming counters to count in kernel+user mode, but even with paranoid = -1 it's possible to program them this way. If that's how you programmed a counter, then that would explain everything.

We need to see your code that programs the counters. That doesn't happen automatically.

The kernel doesn't just leave the counters running all the time when no process has used a PAPI function to enable a per-process or system-wide counter; that would generate interrupts that slow the system down for no benefit.