From: Christoph Lameter < XXXX@XXXXX.COM >
Date: Thu, 23 Jun 2005 20:10:06 -0700 (PDT)
How much higher? I don't believe it. And 3% doesn't justify the
complexity (both in time and memory usage) added by this patch.
Performance of our stack's routing cache is _DEEPLY_ tied to the
size of the routing cache entries, of which the dst entry is a member.
Every single byte counts.
You've exploded this to be (NR_NODES * sizeof(void *)) larger. That
is totally unacceptable, even for NUMA.
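To make that concrete (the node_refcnt name is my invention, and
MAX_NUMNODES == 64 is just an example configuration):

	/* Hypothetical post-patch layout: one pointer per node bolted
	 * onto every dst entry. */
	struct dst_entry {
		/* ... existing fields ... */
		atomic_t *node_refcnt[MAX_NUMNODES];
	};

	/* At MAX_NUMNODES == 64 on a 64-bit kernel, that's
	 * 64 * sizeof(void *) == 512 extra bytes per routing cache
	 * entry -- eight full 64-byte cachelines of overhead. */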
Secondly, inlining the "for_each_online_node()" loops isn't very
nice either.
Consider making a per-node routing cache instead, just like the flow
cache is per-cpu, or make socket dst entries have a per-node array of
object pointers. Fill in the per-node array lazily, just as you do
for the dst: the first time you try to clone a dst on a given node
for a socket, create that node's entry slot, as sketched below.
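Roughly like this -- a sketch only; the sk_dst_node[] field and the
sk_dst_get_node() helper are made-up names, and all locking is elided:

	/* Hypothetical per-socket, per-node dst slots, NULL until
	 * first use on that node: */
	struct dst_entry *sk_dst_node[MAX_NUMNODES];	/* in struct sock */

	static struct dst_entry *sk_dst_get_node(struct sock *sk)
	{
		int node = numa_node_id();
		struct dst_entry *dst = sk->sk_dst_node[node];

		if (!dst) {
			/* First use on this node: clone the socket's
			 * dst and fill this node's slot in lazily. */
			dst = dst_clone(sk->sk_dst_cache);
			sk->sk_dst_node[node] = dst;
		}
		return dst;
	}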
We don't need to make them per-node system-wide; it's only
per-socket that this is really needed.
This way you do the per-node walking when you detach the dst from
the socket at close() time, not at every dst_release() call (and
thus not for every packet in the system).
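So the only per-node walk left runs once per socket lifetime,
something like (same made-up field as above):

	/* Runs once at close(), not once per packet. */
	static void sk_dst_node_detach(struct sock *sk)
	{
		int node;

		for_each_online_node(node) {
			if (sk->sk_dst_node[node]) {
				dst_release(sk->sk_dst_node[node]);
				sk->sk_dst_node[node] = NULL;
			}
		}
	}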
In light of that, I don't see what the advantage is. Instead of
atomic inc/dec on every packet sent in the system, you walk the whole
array of counters for every packet sent in the system. If you really
get contention amongst nodes for a dst entry, this walk should result
in ReadToShare transactions, and thus cacheline movement, between the
NUMA nodes, on every __kfree_skb() call.
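Spelling out my reading of the patched release path (a sketch, not
the actual patch code; node_refcnt is the hypothetical field from
above):

	/* Decide whether the dst is dead by summing the per-node
	 * counters.  Every atomic_read() pulls that node's counter
	 * cacheline over in shared state (ReadToShare), and this
	 * runs for every packet freed. */
	static int dst_refcnt_total(struct dst_entry *dst)
	{
		int node, total = 0;

		for_each_online_node(node)
			if (dst->node_refcnt[node])
				total += atomic_read(dst->node_refcnt[node]);
		return total;
	}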
Essentially you're trading 1 atomic inc (ReadToOwn) and 1 atomic dec
(ReadToOwn) per packet for significant extra memory, much bigger code,
and 1 ReadToShare transaction per packet.
And since you're still using atomic_inc()/atomic_dec(), you'll still
hit ReadToOwn transactions within the node.
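Because the hot-path update itself is presumably still just:

	/* Still an atomic RMW: whichever CPU in the node does this
	 * must first gain exclusive ownership of the counter
	 * cacheline (ReadToOwn). */
	atomic_dec(dst->node_refcnt[numa_node_id()]);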