Choices, choices, choices ...

Post by Toon Moen » Fri, 01 Oct 2004 08:37:25

Lectoribus Salutem,

I was directed to this forum from comp.arch by one of your regulars, Joe
Seigh, because of the following question:

Given that you want to have a RA*S* (i.e., Reliable, Available and
*Scalable* OS), how would you perform the following task on a ~100 CPU
SMP machine:

"Measure the number of instances of <some> kernel event"

1. You create a single system-wide counter protected by a single
system-wide lock to keep track of the occurrences of said event.

2. You create a processor-local counter of said event, and, in case
you want to summarize system wide, sum the counters in the user
space application and present them to the user.

Please, discuss.

Note that I'm not a regular of your newsgroup, so it's advisable to cc
me if you have an answer different from the obvious ....

Toon Moene - e-mail: XXXX@XXXXX.COM - phone: +31 346 214290
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
Maintainer, GNU Fortran 77:
A maintainer of GNU Fortran 95:

Choices, choices, choices ...

Post by SenderX » Fri, 01 Oct 2004 09:16:54

> 2. You create a processor-local counter of said event, and, in case
> you want to summarize system wide, sum the counters in the user
> space application and present them to the user.

This would be one way to go. The contention on a single system wide lock
would be very bad.

If the counter is small enough to be modified by atomic ops, I would use a
padded, cache-line-aligned counter per CPU, and modify it with atomic
inc/dec or CAS.


Choices, choices, choices ...

Post by Joe Seigh » Fri, 01 Oct 2004 09:54:19

The distributed count would be more efficient if the reads are relatively
infrequent because reading a distributed counter is fairly expensive. Statistical
counters are typically kept in processor local storage.

You must have hit a syscall that escaped everyone's attention because
the kernel performance people would have been all over the kernel developers
for using a lock that way on a 100 processor system.

There could of course be reasons we don't know about. If the thread was
preemptable you wouldn't want to increment a counter that way; you might
end up storing the count + 1 into another processor's count, which would
be bad. Still, there's no excuse for using a performance-killing solution.
There are ways around the problem even if you're preemptable.

File a performance problem report. You're the customer.

Joe Seigh

Choices, choices, choices ...

Post by Jonathan A » Fri, 01 Oct 2004 09:56:41

In article <415b45b7$0$283$ XXXX@XXXXX.COM >,

Well, it depends on how often the event occurs. And (2) is rife with
assumptions. (like either being on a non-preemptable kernel or disabling
preemption/interrupts, and that the summation has to occur in userland).

To add some real-life examples from Solaris:

1. Memory allocations and frees are tracked by kmem cache (~200 in a
stock system; general-purpose allocations come from 32 of them, with
sizes roughly exponential). Each cache has a set of per-CPU buckets,
each protected by a lock. There are two 64-bit counters protected by
the lock, one for allocations, one for frees. The locks allow the
kernel to do allocations without worrying about preemption, etc. (full
details are at ...)

When you want allocation statistics for a cache:

% kstat -n kmem_alloc_8

a kernel routine sums the numbers from each per-CPU bucket and reports
the total to userland.

2. CPU interrupt counts and elapsed times are tracked per-cpu,
per-level, since the counting code for level N can be interrupted by a
level N+1 interrupt.

% kstat -m cpu -n intrstat

These don't have to worry about preemption, of course.

3. There are a whole set of CPU-specific counts (context-switches, etc.)

% kstat -m cpu -n sys
% kstat -m cpu -n vm

These are handled by disabling preemption, bumping the count, then
re-enabling preemption.

4. DTrace widens the
possibilities substantially. For example, if I want to count the number
of calls to "kmem_alloc()" on all CPUs in a scalable fashion, all I need
to do is:

# dtrace -q -n 'kmem_alloc:entry{@a = count()}' \
-n 'END{printa("%@d calls\n", @a)}'
... wait a while ...
96665 calls

DTrace automatically handles instrumenting the function, doing the
counting in per-CPU buffers, etc.

As usual, there are quite a few engineering trade-offs in the mix.

- jonathan

Choices, choices, choices ...

Post by Jonathan A » Fri, 01 Oct 2004 10:08:10

In article <WfI6d.133353$MQ5.70575@attbi_s52>, "SenderX" < XXXX@XXXXX.COM >

Since we are talking about kernels, a much faster approach is to put
your counters in the per-CPU structure the kernel should already *have*,
and only allow modifications to them from that CPU. So a counter
bump becomes:

/* disable preemption, if needed */
my_cpu->counter++;   /* "my_cpu" is an illustrative name for this CPU's structure */
/* re-enable preemption, if needed */

This avoids atomic ops, bus locking, etc., and has quite low overhead.

- jonathan

Choices, choices, choices ...

Post by Jonathan A » Fri, 01 Oct 2004 10:31:29

In article < XXXX@XXXXX.COM >,

And statistical counts are usually very rarely read.

Another possibility is that a design decision that made sense when it
was made N years ago was invalidated by changes elsewhere -- for example,
an instruction that used to be implemented in hardware might go into a
trap handler on a newer processor, causing that handler to be hit
much harder than the original design expected.

There tend to be unexpected shifts in the trade-offs, sometimes drastic,
as time goes on. And a kernel tends to be a large beast -- until
someone notices the problem, performance issues can hide for a long time.

Disabling and enabling preemption is typically extremely cheap -- to
disable, bump a per-thread counter; to re-enable, decrement it and, if
it reaches zero, check a separate "special processing" flag.


- jonathan

Choices, choices, choices ...

Post by Joe Seigh » Fri, 01 Oct 2004 10:56:07

schedctl for the kernel. And of course you can depend on it being honored,
unlike in user space. The count trick is neat. The thread can't be
preempted once you've set the count to one.

Joe Seigh