To effectively solve a research problem, I need a very fast
(sub-microsecond, if possible) method to determine whether one 96-bit
integer is an exact multiple of another 96-bit integer.
The high-level application is mostly written in Perl and is intended
to be portable, but right now I am concentrating on the hardware that
I have in hand, which includes 2.4 GHz Xeons, a 1.8 GHz Athlon64
("3000+"), a 1.8 GHz Pentium M, and a Mac with a G5 CPU (sorry, I don't
know the CPU speed). The x86s (except for the Pentium M) are running
Linux, the Pentium M is running Windows XP, and the Mac is running OS X.
I am also interested in identifying which of these hardware/OS platforms
might be able to provide the desired speed, and to benchmark all of them
under near-optimal conditions. For now I am mostly coding low-level tests
in C, and hope to expose them to Perl using XS once I figure out which
solutions are effective.
At present I've been unable to get good 96-bit results (the only things I've
tried are the Math::GMP and Math::BigInt Perl modules), so I've been
focusing on 64-bit integers, which are adequate to address a significant
portion of my problem space, using C's "unsigned long long" type. Using
"gcc -O" (version 3.3) on the Xeon and Athlon64, I am able to average 0.4
usec and 0.25 usec, respectively, for 64-bit modulo.
So finally, here are my questions:
(1) I'm impressed with the job that the Xeon is doing performing 64-bit
modulo despite the fact that it lacks 64-bit integer hardware. In the GCC
libraries and/or Linux kernel, where can I find the source code for this
operation? I've guessed that it might be in lldiv.c, but thus far have run
into dead ends. I'm hoping to adapt their approach for the 96-bit problem.
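For what it's worth, my current understanding is that the software fallback
lives in libgcc (gcc/libgcc2.c, routines like __umoddi3/__udivmoddi4) rather
than in lldiv.c, and that conceptually it is plain binary long division.
Something like this sketch of my own (restricted to divisors below 2^63 so
the shifted remainder cannot overflow):

```c
#include <stdint.h>

/* Bit-at-a-time (shift-and-subtract) 64-bit remainder, in the spirit
   of libgcc's software fallback; the real code is far more optimized.
   Assumes 0 < d < 2^63. */
static uint64_t umod64_sketch(uint64_t n, uint64_t d)
{
    uint64_t r = 0;
    for (int i = 63; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);
        if (r >= d)
            r -= d;
    }
    return r;
}
```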
(2) Which gcc options will ensure that the Athlon64 is really using its
native 64-bit hardware? The fast observed performance might be due to other
factors, and I might be able to do better yet.
(3) Any way to get more bits would be nice. E.g., the current x86 FPUs seem
to have 96-bit floating-point registers, which in principle might give a
~73-bit mantissa, but in practice the x87 extended format is only 80 bits
(the 96-bit storage size includes padding), with a 64-bit mantissa, and
IEEE-854 conformance means no additional bits are used. Are there any other
tricks that I'm unaware of? What if I were willing to settle for a 96-bit
dividend and a 64-bit divisor?
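On that last point: with a 64-bit divisor, the 96-bit dividend can be
reduced 32 bits at a time by schoolbook long division, provided the
compiler offers a 128-bit intermediate type (gcc's unsigned __int128 on
64-bit targets; I believe older releases spell it via
__attribute__((mode(TI)))). A sketch, with the limb layout my own
invention:

```c
#include <stdint.h>

/* 96-bit value as three 32-bit limbs, most significant first */
typedef struct { uint32_t limb[3]; } u96;

/* remainder of a 96-bit dividend modulo a 64-bit divisor d (d != 0),
   folding in one 32-bit limb per step; since the running remainder is
   below d < 2^64, the intermediate (r << 32) | limb fits in 128 bits */
static uint64_t mod96by64(u96 n, uint64_t d)
{
    unsigned __int128 r = 0;
    for (int i = 0; i < 3; i++)
        r = ((r << 32) | n.limb[i]) % d;
    return (uint64_t)r;
}
```

The divisibility test is then just mod96by64(n, d) == 0.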
(4) Is there any hope of squeezing this sort of high performance out of
something like the GMP libraries? I'm afraid that they may be too general
for my use, but I haven't yet played with the native C library, where in
particular the mpz_cdiv_r() and mpz_sgn() functions look promising.
I realize that portions of my question are probably out of scope for this
newsgroup, but I hope that you'll see that it's all related. Redirects to
more appropriate groups are welcome.
[I'm wondering whether it might be better to recode your numbers using
modular arithmetic, which makes this sort of division testing faster.]