WdfMemoryCopyFromBuffer / RtlCopyMemory: copying between user buffers

Post by Charles Ga » Tue, 12 May 2009 19:28:07


Hi,
does anybody know which is the more efficient method for copying memory
between two IRP buffers (METHOD_DIRECT): RtlCopyMemory or the
combination WdfMemoryCopyFromBuffer / WdfMemoryCopyToBuffer?

I've tried RtlCopyMemory and it seems extremely slow. I'm periodically
copying 4096 ULONGs from an output buffer associated with a pended IRP
to an output buffer associated with a synchronous request, but
performance is dismal. It seems as if each DWORD is being copied one by
one. The data rate is effectively the same as if I picked up every DWORD
individually from the hardware device. Is the WdfMemoryCopy...
combination any better, or is it just a wrapper around RtlCopyMemory?

If they are effectively the same, would it be better to have the service
owning the source buffer (an indefinitely pended IRP) declare a
section object? Can I avoid copying altogether with this solution? Will
a section object work at all if the source buffer is pended (and
therefore locked in memory), i.e. will it be able to use (or does it
need) the system paging file as backing store?

Thanks in advance for any tips.

Charles
 
 
 

Post by Maxim S. S » Tue, 12 May 2009 22:39:24

> I've tried RtlCopyMemory and it seems extremely slow.

From ordinary memory to ordinary memory? Or is on-device memory involved?

--
Maxim S. Shatskih
Windows DDK MVP
XXXX@XXXXX.COM
http://www.yqcomputer.com/

 
 
 

Post by Charles Ga » Wed, 13 May 2009 01:48:15

Maxim S. Shatskih wrote:
Yes, ordinary memory to ordinary memory. The device has already DMAed into the
pended output buffer. Another application (the synchronous, non-pended
one) wants to get selected entries from the pended buffer.
 
 
 

Post by Doron Hola » Wed, 13 May 2009 02:19:39

WdfMemoryCopyFromBuffer is a wrapper around RtlCopyMemory with bounds
checks before the copy to make sure you do not trash memory.

d

--

This posting is provided "AS IS" with no warranties, and confers no rights.
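
To illustrate the "bounds checks before the copy" idea, here is a minimal user-mode sketch. It is not the WDF implementation: the helper name checked_copy_from_buffer, its parameters, and its 0 / -1 return convention are all hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT a WDF API: copy n bytes from src into a
 * destination of dst_size bytes, starting at dst_offset.
 * Returns 0 on success, -1 if the copy would run past the destination. */
static int checked_copy_from_buffer(void *dst, size_t dst_size,
                                    size_t dst_offset,
                                    const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    /* Two comparisons, so that dst_offset + n can never overflow. */
    if (dst_offset > dst_size || n > dst_size - dst_offset)
        return -1;

    /* After the checks, the copy itself is a plain memcpy. */
    memcpy((unsigned char *)dst + dst_offset, src, n);
    return 0;
}
```

The point of the sketch is that the checks are a couple of comparisons, so a wrapper of this shape would not be expected to change the copy speed noticeably in either direction.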
 
 
 

Post by Charles Ga » Wed, 13 May 2009 02:43:39

Doron Holan [MSFT] wrote:
Doron,
thanks for the info. Is there any faster way of doing a memory-to-memory
(user space) copy than using RtlCopyMemory? Currently, I have a demo
application (really just used to demonstrate a driver/hardware
combination for a customer) which moves 4096 ULONGs from a pended output
buffer to an output buffer associated with a synchronous IRP. I do this
every 50 ms but performance is dreadful. Admittedly, it might be the
timer in the application that is the real problem, by causing too many task
switches, but I don't think so. It really seems as if RtlCopyMemory
is moving DWORD by DWORD instead of doing a block copy (at least that is the
only way I can explain the latency and system load).

Thanks in advance for any tips you might have.
 
 
 

Post by Maxim S. S » Wed, 13 May 2009 03:00:27

> every 50 ms but performance is dreadful. Admittedly, it might be the

Maybe.


No, RtlCopyMemory is memcpy.

You need some serious profiler to find the cause of the slowdown.

 
 
 

Post by Jakob Boh » Wed, 13 May 2009 03:11:02


Unless you are compiling your driver for Itanium 64 bits (with an old
DDK) or Alpha AXP 64 bits (never released), RtlCopyMemory() is an alias
for memcpy() which is inlined by the C compiler to code which operates
one ULONG_PTR at a time (a rep movs instruction and some alignment code
on i386 and x64).

In contrast, RtlMoveMemory() is an alias for memmove() which the
compiler does not inline, and which wastes a few cycles on a call
instruction and if() statements to test for memory range overlap.
But after that initial overhead, RtlMoveMemory() happens to use
a more intelligent (and larger) copying loop which may make it faster
than the inlined rep movs instruction produced by RtlCopyMemory().

Bizarre but true...
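
For readers unsure why the overlap test matters at all, here is a small user-mode sketch in standard C. The helper name shift_right is hypothetical. It moves a block within a single buffer, which is exactly the case where memcpy() has undefined behaviour and memmove() is required.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: move the first n bytes of buf up by `shift` bytes.
 * Source and destination overlap whenever shift < n, so memmove() is
 * used; memcpy() would be undefined behaviour for overlapping regions. */
static void shift_right(unsigned char *buf, size_t buf_size,
                        size_t n, size_t shift)
{
    /* The shifted block must still fit inside the buffer. */
    assert(shift <= buf_size && n <= buf_size - shift);
    memmove(buf + shift, buf, n);
}
```

When the two buffers belong to different requests, as in the original question, they cannot overlap and either routine is correct.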


--
Jakob Bøhm, M.Sc.Eng. * XXXX@XXXXX.COM * direct tel:+45-45-90-25-33
Netop Solutions A/S * Bregnerodvej 127 * DK-3460 Birkerod * DENMARK
http://www.yqcomputer.com/ * tel:+45-45-90-25-25 * fax:+45-45-90-25-26
Information in this mail is hasty, not binding and may not be right.
Information in this posting may not be the official position of Netop
Solutions A/S, only the personal opinions of the author.
 
 
 

Post by Charles Ga » Wed, 13 May 2009 03:44:39

Jakob Bohm wrote:

Just for info, the WDK docs compare these two routines in exactly
the opposite way, something to the effect of "RtlCopyMemory runs faster,
but the two regions may not overlap".
 
 
 

Post by Gianluca V » Wed, 13 May 2009 05:01:55

Uhm, 4096 ULONGs every 50 ms is 320 kB/s, which is a really low amount of data
to be copied. I really don't think that your copy is the real bottleneck.

Can you create a fake driver that performs all the tasks *but* the memcpy
and see what's the performance in that case?

Have a nice day
GV
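
The arithmetic behind the 320 kB/s figure, together with a rough way to time the copy in isolation, can be sketched as follows. This is user-mode standard C, not driver code. The function names are hypothetical, and clock() is only a coarse stand-in for a real profiler.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Bytes per second when `count` elements of `elem_size` bytes are moved
 * once every `period_ms` milliseconds. */
static uint64_t bytes_per_second(uint64_t count, uint64_t elem_size,
                                 uint64_t period_ms)
{
    assert(period_ms > 0);
    return count * elem_size * 1000u / period_ms;
}

/* Rough average time in seconds for one copy of 4096 32-bit values,
 * measured over `iterations` copies. */
static double seconds_per_copy(unsigned iterations)
{
    static uint32_t src[4096], dst[4096];
    volatile uint32_t sink = 0;
    clock_t start, end;
    unsigned i;

    assert(iterations > 0);
    start = clock();
    for (i = 0; i < iterations; i++) {
        src[0] = i;                    /* vary the input between copies */
        memcpy(dst, src, sizeof dst);
        sink += dst[0];                /* keep the result observable */
    }
    end = clock();
    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC / iterations;
}
```

bytes_per_second(4096, 4, 50) gives 327680, i.e. 320 KiB/s. Comparing seconds_per_copy() against the 50 ms period gives a feel for how small a share of each period the copy itself should need.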
 
 
 

Post by Charles Ga » Wed, 13 May 2009 08:20:58

If I let the device and driver run freely, just writing to the circular
buffer at 20 MB/s, I get a system load of only about 3-4% (driver keeping
scatter/gather FIFOs topped up).

It's only when I run the demo app that I get the bottleneck. The app is
very simple CodeGear stuff, just picking up 4096 ULONGs every 20 ms or
so and writing the hex values to a listbox.

I've since added a second thread for the reads from the circular buffer (the
RtlCopyMemory initiator). The app runs a bit more smoothly but I'm still
getting a system load of about 60% (Pentium something-or-other with ICH7
chipset), which I find quite high.


Gianluca Varenni wrote:
 
 
 

Post by Gianluca V » Wed, 13 May 2009 08:36:05


Uhm... I think the culprit is the listbox. Updating a listbox 50 times a
second with 4096 values seems quite bad to me...

I bet that if you keep your code that picks the ULONGs and disable the
update of the listbox, your application will run smoothly.


GV
 
 
 

Post by Jakob Boh » Wed, 13 May 2009 19:42:18


Well this would be true if the compiler did not inline memcpy() or if
the compiler inlined both memcpy() and memmove(). In practice, the
compiler inlines memcpy() but not memmove() and the inline versions of
both functions are optimized for small copy sizes while the out-of-line
versions are optimized for medium copy sizes. A version optimized for
really large sizes (ouch) would include additional SSE/MMX instructions
to do prefetching and other CPU cache management and would be specific
to each CPU generation and brand.


 
 
 

Post by Albert » Wed, 13 May 2009 23:38:18

I use a loop with two __m128* pointer variables:

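#include <xmmintrin.h>

/* Copies whole 16-byte blocks only. size is assumed here to be a byte
   count, and both pointers must be 16-byte aligned. */
void moveto(__m128 * to, const __m128 * from, int size)
{
    int blocks = size / (int)sizeof(__m128);

    while (blocks-- > 0)
    {
        *to++ = *from++;
    }
}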

There's some additional logic to handle moves that don't align to 128
bits, and I also unroll the loop in an attempt to improve speed, and I
also have a try/except block to catch errors. This works both on 32-bit
and on 64-bit. One advantage is that you have full control over
the amount of unrolling, and if you feel adventurous you can go the
extra mile and try one of the more esoteric data movement techniques
that Intel suggested in their tech papers when the xmm instruction set
was first announced.

I checked the code that the compiler generated, and it looks pretty
decent. I don't know if this saves any significant amount of execution
time because my environment is so chip-bound that optimizing processor
performance doesn't improve the throughput one iota. And, of course,
depending on what you're trying to do, you may want to save/restore
some floating point state.


Alberto.
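
The "additional logic" for sizes that are not a multiple of 128 bits is not shown in the post. One possible shape for it, as a user-mode sketch only, is below. The helper name copy_sse is hypothetical, and the choice of unaligned loads and stores plus a memcpy() tail is an assumption, not the poster's code. In a driver, the floating point / SSE state caveat mentioned above still applies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <xmmintrin.h>

/* Hypothetical user-mode helper: copy n bytes in 16-byte chunks using
 * unaligned SSE loads and stores, then finish the remaining 0..15 bytes
 * with memcpy(). The two regions must not overlap. */
static void copy_sse(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    assert(dst != NULL && src != NULL);

    while (n >= 16) {
        /* The unaligned forms accept any address, at some cost in speed. */
        _mm_storeu_ps((float *)d, _mm_loadu_ps((const float *)s));
        d += 16;
        s += 16;
        n -= 16;
    }
    if (n > 0)
        memcpy(d, s, n);   /* tail shorter than one 128-bit block */
}
```

A tuned version would more likely align the destination first and then use aligned stores. This sketch trades that away for brevity.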
 
 
 

Post by Jakob Boh » Thu, 14 May 2009 00:05:20


Actually, you should be very careful about touching the floating point
state in kernel mode, although the documentation for what you may and
may not do is (or used to be) sketchy.

Also, such optimized memcpy code needs to be protected by if
statements or function pointers to use completely different code for
different CPU versions and brands. The optimal strategy for a Core2 is
probably not the same as for a Hammer, a Xeon or a Crusoe, just to name
a few. Which is why such code should really be provided by a specialist
vendor who can afford the time and hardware to design, tune and test on
every x86 and x64 CPU generation ever sold.

Ideally, that vendor would be the Microsoft team that writes the C
runtime library, but this would tend to be 1 or 2 CPU generations behind
the times due to new CPU designs being released more often than Windows
versions. It could also be the CPU makers themselves by exporting these
functions from the processor driver.
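
The "function pointers" approach could look roughly like the sketch below. Everything in it is hypothetical: the two copy routines are placeholders, and has_tuned_path stands in for whatever CPU detection a real implementation would perform once at start-up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*copy_fn)(void *dst, const void *src, size_t n);

/* Placeholder for a conservative routine that is safe on any CPU. */
static void copy_bytewise(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    assert(dst != NULL && src != NULL);
    while (n-- > 0)
        *d++ = *s++;
}

/* Placeholder for a routine tuned for one particular CPU family. */
static void copy_library(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Choose the routine once; callers then go through the returned pointer.
 * has_tuned_path is a hypothetical result of CPU detection. */
static copy_fn select_copy(int has_tuned_path)
{
    return has_tuned_path ? copy_library : copy_bytewise;
}
```

The indirect call has a small cost of its own, which is one reason such dispatch tends to pay off only for medium and large copies.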

 
 
 

Post by Albert » Thu, 14 May 2009 23:57:36

With all due respect, I have written OpenGL ICDs for Windows, where
kernel-side floating point can be a necessity. I'm well aware of all
the ins and outs of kernel-side floating point. And as far as the gods
go, hey, I trust my hand at least as much as I trust theirs!

Yet here I'm moving 128 bits at a time, relying on a data type
provided to me by the DDK compiler. I hope that the compiler handles
the difference between processor platforms, and that it generates
sensible code; yet I check every generated machine instruction. And
why did the compiler writers supply the facility if we're not supposed
to use it?

When a processor is issued, Intel or AMD usually supply plenty of
technical notes, design sheets, and other hardware-level
documentation. Those are valuable sources of enlightenment. But beyond
that, more often than not I'd rather roll my own code. If nothing
else, it allows me to go above and beyond what an API can give me.

So far, knock on wood, that specific memory move code didn't cause any
grief.


Alberto.