[PATCH] dst numa: Avoid dst counter cacheline bouncing

Post by Christoph Lameter » Sat, 25 Jun 2005 12:30:09


The dst_entry structure contains an atomic counter that is incremented
and decremented for each network packet sent. If a large number of
packets is being sent from multiple processors, cacheline bouncing of
this counter between NUMA nodes can limit TCP performance on a NUMA
system.

The following patch splits the usage counter into counters per node.

The patch depends on the NUMA slab allocator in -mm in order to
allocate node-local memory quickly. It will work without the NUMA slab
allocator, but the node-specific allocations will then be very slow.

The patch also depends on the dst abstraction patch.
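
For reference, the shape of the split looks roughly like this. This is
only a sketch of the idea (the patch itself follows below but is
truncated in this archive); the type and function names (dst_refcnt,
dst_hold_node(), dst_release_node(), dst_unused()) are illustrative,
not identifiers from the actual patch:

#include <linux/atomic.h>
#include <linux/nodemask.h>	/* MAX_NUMNODES, for_each_online_node */
#include <linux/topology.h>	/* numa_node_id() */

struct dst_refcnt {
	/*
	 * One counter per node, each allocated from node-local memory
	 * (this is where the numa slab allocator matters), so the
	 * per-packet inc/dec touches only a node-local cacheline.
	 * All counters are allocated at dst creation; allocation and
	 * freeing are elided here.
	 */
	atomic_t *count[MAX_NUMNODES];
};

static inline void dst_hold_node(struct dst_refcnt *ref)
{
	atomic_inc(ref->count[numa_node_id()]);
}

static inline void dst_release_node(struct dst_refcnt *ref)
{
	atomic_dec(ref->count[numa_node_id()]);
}

/*
 * The entry is unused only when the counters of all nodes sum to
 * zero, so checking whether it can be freed requires a walk over
 * the online nodes.
 */
static inline int dst_unused(struct dst_refcnt *ref)
{
	int node, sum = 0;

	for_each_online_node(node)
		sum += atomic_read(ref->count[node]);
	return sum == 0;
}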

AIM7 results (tcp_test and udp_test) on i386 (IBM x445 8p 16G):

No patch 94788.2
w/patch 97739.2 (+3.11%)

The numbers will be much higher on larger NUMA machines.

Detailed AIM7 result no patch:

Benchmark Version Machine Run Date
AIM Multiuser Benchmark - Suite VII "1.1" Jun 22 07:58:48 2005

Tasks Jobs/Min JTI Real CPU Jobs/sec/task
1 568.4 100 10.4 0.5 9.4737
101 37148.0 87 16.1 48.0 6.1300
201 54024.4 81 22.1 95.5 4.4796
301 63424.6 76 28.2 143.2 3.5119
401 69913.1 75 34.1 191.4 2.9058
501 74157.5 72 40.1 239.2 2.4670
601 77388.7 70 46.1 286.9 2.1461
701 79784.2 68 52.2 335.1 1.8969
914 83809.2 67 64.8 435.7 1.5282
1127 86123.5 71 77.7 539.1 1.2736
1595 89371.8 67 106.0 764.3 0.9339
1792 90047.2 64 118.2 861.2 0.8375
2208 91729.8 64 143.0 1059.9 0.6924
2624 92699.9 63 168.1 1260.2 0.5888
3529 93853.9 63 223.3 1701.4 0.4433
3917 94339.6 70 246.6 1886.8 0.4014
4733 94698.3 64 296.9 2285.2 0.3335
5000 94788.2 64 313.3 2413.5 0.3160

AIM7 with patch:

Benchmark Version Machine Run Date
AIM Multiuser Benchmark - Suite VII "1.1" Jun 22 06:27:47 2005

Tasks Jobs/Min JTI Real CPU Jobs/sec/task
1 568.4 100 10.4 0.4 9.4737
101 37426.1 87 16.0 46.5 6.1759
201 54742.8 81 21.8 92.5 4.5392
301 64898.0 77 27.6 138.7 3.5935
401 71272.9 74 33.4 185.2 2.9623
501 75955.6 73 39.2 231.2 2.5268
601 79473.3 73 44.9 277.6 2.2039
701 81967.3 70 50.8 323.7 1.9488
914 85836.5 70 63.2 422.6 1.5652
1127 88585.2 67 75.6 521.7 1.3100
1594 92194.4 78 102.7 738.6 0.9640
1791 93091.9 67 114.3 830.9 0.8663
2207 94517.5 72 138.7 1025.4 0.7138
2623 95358.5 64 163.4 1221.5 0.6059
3529 96449.2 65 217.3 1650.9 0.4555
3917 97075.2 64 239.7 1829.8 0.4131
4732 97485.8 63 288.3 2214.5 0.3434
5000 97739.2 67 303.9 2341.2 0.3258

Signed-off-by: Pravin B. Shelar <XXXX@XXXXX.COM>
Signed-off-by: Shobhit Dayal <XXXX@XXXXX.COM>
Signed-off-by: Shai Fultheim <XXXX@XXXXX.COM>
Signed-off-by: Christoph Lameter <XXXX@XXXXX.COM>

Index: linux-2.

[PATCH] dst numa: Avoid dst counter cacheline bouncing

Post by David S. Miller » Sat, 25 Jun 2005 12:40:07

From: Christoph Lameter <XXXX@XXXXX.COM>
Date: Thu, 23 Jun 2005 20:10:06 -0700 (PDT)


> The numbers will be much higher on larger NUMA machines.

How much higher? I don't believe it. And 3% doesn't justify the
complexity (both in time and memory usage) added by this patch.

Performance of our stack's routing cache is _DEEPLY_ tied to the
size of the routing cache entries, of which dst entry is a member.
Every single byte counts.

You've exploded this to be (NR_NODES * sizeof(void *)) larger. That
is totally unacceptable, even for NUMA.
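
(For example, a kernel configured for 64 possible nodes with 8-byte
pointers would add 512 bytes to every routing cache entry.)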

Secondly, inlining the "for_each_online_node()" loops isn't very nice
either.

Consider making a per-node routing cache instead, just like the flow
cache is per-cpu, or make socket dst entries have a per-node array of
object pointers. Fill the per-node array lazily, just as you do for
the dst: the first time you try to clone a dst for a socket on a given
node, create that node's entry slot.

They don't need to be per-node system-wide; this is really only needed
per-socket.

This way you do per-node walking when you detach the dst from the
socket at close() time, not at every dst_release() call, and thus for
every packet in the system.
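
A rough sketch of that suggestion, to make it concrete (the
sock_dst_cache structure and the sk_dst_node_get()/sk_dst_put_all()
names are hypothetical, not an existing interface; dst_clone() and
dst_release() are the stock dst helpers; locking against concurrent
first use of a slot is elided):

#include <linux/nodemask.h>	/* MAX_NUMNODES, for_each_online_node */
#include <linux/topology.h>	/* numa_node_id() */
#include <net/dst.h>		/* dst_clone(), dst_release() */

struct sock_dst_cache {
	/*
	 * Lazily filled: slot n holds a node-local reference for node
	 * n, or NULL if no cpu on that node has sent on this socket.
	 */
	struct dst_entry *node_dst[MAX_NUMNODES];
};

/* Per-packet path: use the local node's slot, creating it on first use. */
static struct dst_entry *sk_dst_node_get(struct sock_dst_cache *c,
					 struct dst_entry *dst)
{
	int node = numa_node_id();

	if (!c->node_dst[node])
		c->node_dst[node] = dst_clone(dst);
	return c->node_dst[node];
}

/* close() path: the only place that walks all nodes. */
static void sk_dst_put_all(struct sock_dst_cache *c)
{
	int node;

	for_each_online_node(node) {
		if (c->node_dst[node])
			dst_release(c->node_dst[node]);
		c->node_dst[node] = NULL;
	}
}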

In light of that, I don't see what the advantage is. Instead of
atomic inc/dec on every packet sent in the system, you walk the whole
array of counters for every packet sent in the system. If you really
get contention amongst nodes for a DST entry, this walk should result
in ReadToShare transactions, and thus cacheline movement, between the
NUMA nodes, on every __kfree_skb() call.

Essentially you're trading 1 atomic inc (ReadToOwn) and 1 atomic dec
(ReadToOwn) per packet for significant extra memory, much bigger code,
and 1 ReadToShare transaction per packet.

And since you're still using atomic_inc/atomic_dec you'll still hit
ReadToOwn transactions within the node.

[PATCH] dst numa: Avoid dst counter cacheline bouncing

Post by Dipankar Sarma » Sat, 25 Jun 2005 14:10:06


Do we really need a distributed reference counter implementation
inside the dst cache code? If you are willing to wait a while, we
should soon have Rusty's bigref implementation reworked on top of the
interleaving dynamic per-cpu allocator. We can then look at a
distributed reference counter for the dst refcount and see how that
can be worked out.

Thanks
Dipankar
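
The general shape of such a distributed counter, sketched from the
description above (an illustration only, not Rusty's actual bigref
code; the distref names are invented):

#include <linux/percpu.h>
#include <linux/atomic.h>
#include <linux/errno.h>

struct distref {
	atomic_t __percpu *count;	/* one counter per cpu */
};

static int distref_init(struct distref *r)
{
	r->count = alloc_percpu(atomic_t);
	return r->count ? 0 : -ENOMEM;
}

/* Fast path: touch only the local cpu's counter. */
static void distref_get(struct distref *r)
{
	atomic_inc(per_cpu_ptr(r->count, get_cpu()));
	put_cpu();
}

static void distref_put(struct distref *r)
{
	atomic_dec(per_cpu_ptr(r->count, get_cpu()));
	put_cpu();
}

/*
 * Slow path: sum every cpu's counter, e.g. when deciding whether
 * the object can finally be freed.
 */
static int distref_read(struct distref *r)
{
	int cpu, sum = 0;

	for_each_possible_cpu(cpu)
		sum += atomic_read(per_cpu_ptr(r->count, cpu));
	return sum;
}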

[PATCH] dst numa: Avoid dst counter cacheline bouncing

Post by Christoph Lameter » Sat, 25 Jun 2005 14:10:07


> we should soon have Rusty's bigref implementation reworked on top
> of the interleaving dynamic per-cpu allocator.

Is that code available somewhere?


[PATCH] dst numa: Avoid dst counter cacheline bouncing

Post by Dipankar Sarma » Sat, 25 Jun 2005 16:40:27


> Is that code available somewhere?

In various places in lkml discussions; search for the discussions of
the dynamic per-cpu allocator. Bharata is currently adding cpu hotplug
awareness to it, but the basic patches work.

BTW, I am not saying that bigref has what you need. What I am trying
to say is that you should see whether something like bigref can be
tweaked for your case before implementing a new type of refcounting
wholly in the dst code.

Thanks
Dipankar