## 96-bit integer modulo, Athlon64 gcc 64-bit integers, libc code for 64-bit division, etc.

### 96-bit integer modulo, Athlon64 gcc 64-bit integers, libc code for 64-bit division, etc.

Hi,

To effectively solve a research problem, I need a very fast
(sub-microsecond, if possible) method to determine whether one 96-bit
integer is an exact multiple of another 96-bit integer.

The high-level application is mostly written in Perl and is intended
to be portable, but right now I am concentrating on the hardware that
I have in hand, which includes 2.4 GHz Xeons, a 1.8 GHz Athlon64
("3000+"), a 1.8 GHz Pentium M, and a Mac with a G5 CPU (sorry, I
don't know the CPU speed). The x86s (except for the Pentium M) are
running Linux, the Pentium M is running Windows XP, and the Mac is
running OS X.

I am also interested in identifying which of these hardware/OS platforms
might be able to provide the desired speed, and in benchmarking all of
them under near-optimal conditions. For now I am mostly coding low-level tests
in C, and hope to expose them to Perl using XS once I figure out which
solutions are effective.

At present I've been unable to get good 96-bit results (the only things
I've tried are the Perl modules Math::GMP and Math::BigInt), so I've
been working on a 64-bit portion of my problem space, using C's
"unsigned long long" integers. Using "gcc -O" (version 3.3) on the Xeon
and Athlon64, I am able to average 0.4 usec and 0.25 usec,
respectively, for 64-bit modulo.
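A minimal harness for measuring this kind of figure might look like the
sketch below (the operand values and the use of clock() are my own
arbitrary choices, not from the original tests):

```c
#include <stdint.h>
#include <time.h>

/* Sum of a[i] % b[i]; returning a checksum keeps the compiler from
   optimizing the modulo operations away. */
uint64_t mod_sum(const uint64_t *a, const uint64_t *b, int n)
{
    uint64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += a[i] % b[i];
    return acc;
}

/* Average microseconds per 64-bit modulo over n operand pairs. */
double usec_per_mod(const uint64_t *a, const uint64_t *b, int n)
{
    clock_t t0 = clock();
    volatile uint64_t sink = mod_sum(a, b, n);
    clock_t t1 = clock();
    (void)sink;
    return 1e6 * (double)(t1 - t0) / CLOCKS_PER_SEC / n;
}
```

For stable numbers one would call usec_per_mod over arrays of at least
a few hundred thousand pairs.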

So finally, here are my questions:
(1) I'm impressed with the job that the Xeon is doing performing 64-bit
modulo despite the fact that it lacks 64-bit integer hardware. In the GCC
libraries and/or Linux kernel, where can I find the source code for this
operation? I've guessed that it might be in lldiv.c, but thus far have run
into dead ends. I'm hoping to adapt their approach for the 96-bit problem.
(2) Which gcc options will ensure that the Athlon64 is really using its
native 64-bit hardware? The fast observed performance might be due to other
factors, and I might be able to do better yet.
(3) Any way to get more bits would be nice. E.g., current x86 FPUs seem
to have 96-bit floating point registers, which in principle might allow
a ~73-bit mantissa, but in practice it seems that IEEE-854 mandates that
no additional bits are used for the mantissa. Are there any other tricks
that I'm unaware of? What if I were willing to settle for a 96-bit
dividend and a 64-bit divisor?
(4) Is there any hope to squeeze this sort of high performance out of
something like the GMP libraries? I'm afraid that they may be too general
for my use, but I haven't yet played with the native C libraries, where in
particular the mpz_cdiv_r() & mpz_sgn() functions look promising.

I realize that portions of my question are probably out of scope for this
newsgroup, but I hope that you'll see that it's all related. Redirects
welcome.

Thanks,

Jonathan
[I'm wondering whether it might be better to recode your numbers using
modular arithmetic, which makes this sort of division testing faster.
-John]

### 96-bit integer modulo, Athlon64 gcc 64-bit integers, libc code for 64-bit division, etc.

You could take the G5 with the gcc C compiler (part of Xcode) and turn
on the compiler option that makes long double a 128-bit floating-point
type.

If you had IEEE 754 floating-point arithmetic, you could just
calculate t = x/y and check whether t == floor (t). But 128 long
double doesn't use IEEE 754, so you have to be a bit more careful.

Calculate t = floorl (x/y + 0.5) and check whether t*y == x; t will be
an integer. If x is _not_ a multiple of y, then t*y != x, no matter
what the value of t is, and long double gives you enough precision
that rounding errors won't change this. If x _is_ a multiple of y,
then t will be exactly equal to x/y, since the rounding errors are too
small to change this, and t*y will be equal to x.

If you use the 128-bit long double with values that are actually
integers of fewer than 100 bits, then add, subtract, and multiply will
give you correct results, and the implementation is quite fast. I
would think this will be below 0.2 microseconds per test.

### 96-bit integer modulo, Athlon64 gcc 64-bit integers, libc code for 64-bit division, etc.

"Jonathan Epstein" < XXXX@XXXXX.COM > writes:
...

This is the default if you run an x86-64 gcc; an ordinary 386 gcc does
not produce x86-64 code (unless it was configured as a cross
compiler). To run the x86-64 code, you need at least a 64-bit kernel
and probably also 64-bit libraries. So I would recommend installing
an AMD64 distribution of Linux.

One thing of interest to you would be the 128-bit integer type
available on gcc for some targets (last time I tried it, it was
available for x86-64, but not for ppc64). You have to actually define
it yourself like this:

typedef int int128_t __attribute__((__mode__(TI)));
typedef unsigned int uint128_t __attribute__((__mode__(TI)));

Note also that this feature does not work correctly on all gcc
versions, so better test it before relying on it.
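With that typedef in place, the 96-bit divisibility test from the
original question reduces to an ordinary modulo. A sketch (on current
gcc the built-in spelling __int128 also works where the type is
available):

```c
/* 128-bit unsigned type via gcc's TI mode, as shown above. */
typedef unsigned int uint128_t __attribute__((__mode__(TI)));

/* Returns nonzero if x is an exact multiple of y (y != 0).
   Both operands may use up to 96 of the 128 bits. */
int is_multiple_128(uint128_t x, uint128_t y)
{
    return x % y == 0;
}
```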

Similarly, for the 64-bit features of the G5 you need an OS capable of
running 64-bit binaries, i.e., either MacOS X 10.4 or Linux-ppc64
(Gentoo, Yellowdog, and an unofficial Debian port support this
architecture).

That would really fly on the x86-64 architecture, which has a native
128-by-64-bit division instruction (PPC64 only has 64-by-64-bit).

- anton

### 96-bit integer modulo, Athlon64 gcc 64-bit integers, libc code for 64-bit division, etc.

Knuth's "The Art of Computer Programming", vol. 2, describes the
multiple precision divide algorithm. I haven't looked at it recently,
so I won't say much about it.

-- glen

### 96-bit integer modulo, Athlon64 gcc 64-bit integers, libc code for 64-bit division, etc.

I haven't tried it myself, but gcc should be able to produce leaf
functions using 64-bit instructions on MacOS X 10.3 as well. The
processor doesn't actually run in 64-bit mode; it stays in 32-bit
mode, but most operations are 64-bit anyway (except memory addresses).

In this case, a leaf function would be enough.

### 96-bit integer modulo, Athlon64 gcc 64-bit integers, libc code for 64-bit division, etc.

(snip)

Does the OS save the full 64 bit registers when not in 64 bit mode?

Otherwise you could get the wrong result on a task switch.

-- glen

### 96-bit integer modulo, Athlon64 gcc 64-bit integers, libc code for 64-bit division, etc.

There are two possibilities: either the OS saves the full 64-bit
registers, or the designers are really, and I mean really, incredibly
braindamaged. I would assume the first. Let's say I wouldn't dare add
this to the user interface of a compiler if it didn't work!

Anyway, I checked what is really available: you have to set
"Code generation = G5" instead of G4, you would probably want
"Scheduling = G5" instead of G4, and you need to turn on "64-bit
integer arithmetic".

The compiler still only saves 32 bits of registers across function
calls. Parameter passing of long long is done through two 32-bit
parameters, so the function interface is compatible with 32-bit code.

long long add (long long x, long long y) { return x+y; }

would combine x from two 32-bit registers into one register, combine y
from two 32-bit registers into one register, do a single add, and
split the result back into two 32-bit registers for the caller.

So this would be most useful if you have one function that does quite
a bit of 64 bit arithmetic. And of course it will only run on a G5, so
all I can do is look at the disassembled code that the compiler
produces :-(

### 96-bit integer modulo, Athlon64 gcc 64-bit integers, libc code for 64-bit division, etc.

(snip)

(snip)

The case I was thinking of was running a 32 bit OS on a 64 bit
architecture that is an extension of a 32 bit architecture.

Say, for example, Win2000 on AMD-64 (x86-64). The OS has no idea that
the registers are 64 bits. Unless the hardware includes special mode
bits to stop the operations, the user should be able to use those
bits.

Since one could compile on one machine and run on another, it isn't
enough to check at compile time for a 64 bit OS.

-- glen

### 96-bit integer modulo, Athlon64 gcc 64-bit integers, libc code for 64-bit division, etc.

glen herrmannsfeldt < XXXX@XXXXX.COM > writes:

The designers of what? Of gcc? MacOS X boxes are not the only PPCs
that gcc targets, so I find no fault there. Of MacOS X? They built
an OS for 32-bit CPUs that happens to run on a 64-bit CPU, so I find
no fault there, either.

An example I know of was the first Solarises on UltraSPARCs. In
theory one could use the 64-bit registers, but the 32-bit OS saved
only 32 bits on context switches. I am pretty sure the same is true
for Ultrix on the DecStation 5000/150 (with an R4000).

Bad example, because the x86-64 architecture is not an extension of
the 386 architecture in that sense, i.e., you cannot use 64-bit
instructions in legacy or compatibility mode (W2K runs in legacy
mode), and you cannot use 386 code in 64-bit mode. The x86-64
architecture supports 3 different instruction sets: 8086, 386 32-bit
mode instructions plus extensions, and x86-64 long mode instructions;
there are a bunch of modes and submodes that determine which
instruction set is currently being executed.

In contrast, PPC64 is an extension of PPC32 (actually, the
documentation says that PPC32 is a subset of PPC64) and MIPS-IV is an
extension of MIPS; you can run 32-bit and 64-bit code in the same
mode. IIRC this is mostly true for SPARC, but they did add some mode
bit because of register windows.

- anton

### 96-bit integer modulo, Athlon64 gcc 64-bit integers, libc code for 64-bit division, etc.

Here are some additional details of my problem, which someone
solicited, and which might lead to an optimization which is obvious to
one of you:

*The quotient will almost always be below 2^32, but I prefer not to
make this assumption for this problem.

*The dividend and divisors are both known to be products of primes
between 2 and 67 inclusive, where in some cases a prime is represented
more than once. E.g., 2*3*17*17*61 is a legitimate value. The
operands are always unsigned.

*I will be repeatedly dividing by the same divisor. E.g., I start
with a collection of these products and want to determine which, if
any, of these products are divisible by the divisor which is currently
being considered.
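Since the divisor is reused many times, it may pay to precompute
per-divisor constants. For odd divisors there is a classic trick
(described in Hacker's Delight and by Granlund and Montgomery): x is
divisible by odd d exactly when x times the multiplicative inverse of
d mod 2^64 is at most floor((2^64-1)/d). A 64-bit sketch, not from the
thread (a 96-bit version would need the same arithmetic in wider
precision):

```c
#include <stdint.h>

/* Multiplicative inverse of odd d modulo 2^64, by Newton's iteration;
   each step doubles the number of correct low-order bits. */
uint64_t inv64(uint64_t d)
{
    uint64_t x = d;              /* correct to 3 bits for odd d */
    for (int i = 0; i < 5; i++)
        x *= 2 - d * x;
    return x;
}

/* Precompute m = inv64(d) and limit = UINT64_MAX / d once per odd
   divisor d; then x is an exact multiple of d iff x * m <= limit. */
int is_multiple_precomp(uint64_t x, uint64_t m, uint64_t limit)
{
    return x * m <= limit;
}
```

The divisibility test then costs one multiply and one compare per
candidate; any factors of 2 in the divisor would have to be handled
separately, e.g. by comparing trailing-zero counts.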

-----------------

I tried the 128-bit C incantation which Anton suggested, but got
compiler errors suggesting that I don't have the compiler installed
properly for optimum performance. Thanks in advance for any
information as to how to achieve the desired installation status:

[epstein@foo tmp]\$ gcc -v
Configured with:
../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info
b --enable-__cxa_atexit --host=i386-redhat-linux
gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-49)
[epstein@foo tmp]\$ gcc 128bit.c
128bit.c:1: no data type for mode `TI'
[epstein@foo tmp]\$ cat 128bit.c
typedef unsigned int uint128_t __attribute__((__mode__(TI)));

main()
{
printf("%d\n",sizeof(uint128_t));
}

I haven't tried any of the other suggestions yet.

Thanks again,

Jonathan
[How about representing the numbers as a bit mask where each bit
represents a prime factor? Allocate a reasonable number of bits for
each factor, e.g., 8 bits for 2, 8 bits for 3, etc. up to a total of,
say 64 or 128 bits, and turn on one bit for each factor, and reserve
one bit at the end of the number for factor overflow if there were
more instances of a factor than would fit in the fields reserved. Now
to test whether A is divisible by B, you need only test if (A&B)==B,
and if that matches and the overflow bit is set in A, you need to go
back and do the division. If you choose your field sizes well, you
should usually be able to get the answer from the AND, which would be
quite fast either on an x86 with MMX or a 64 bit machine. -John]
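The moderator's encoding might be sketched as follows, with
hypothetical field sizes (8 bits each, and only the first four primes,
to keep the example short):

```c
#include <stdint.h>

enum { FIELD_BITS = 8, NPRIMES = 4 };  /* fields for 2, 3, 5, 7 only */

/* Unary-encode the exponent of each prime: exponent k sets the low k
   bits of that prime's field.  Real code would reserve an overflow
   bit for exponents that don't fit. */
uint64_t encode(const int *exponents)
{
    uint64_t mask = 0;
    for (int i = 0; i < NPRIMES; i++) {
        int e = exponents[i] < FIELD_BITS ? exponents[i] : FIELD_BITS;
        mask |= ((UINT64_C(1) << e) - 1) << (i * FIELD_BITS);
    }
    return mask;
}

/* With unary fields, "every exponent of b <= the matching exponent
   of a" is exactly the subset test (a & b) == b. */
int divisible_mask(uint64_t a, uint64_t b)
{
    return (a & b) == b;
}
```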

### 96-bit integer modulo, Athlon64 gcc 64-bit integers, libc code for 64-bit division, etc.

Thanks, John, this is a good idea.

Since the maximum number of occurrences of each prime is known in
advance, for now I'm just encoding one bit for each prime factor,
which in practice works out to about 400 bits, and runs at about the
same speed as my native 64-bit modulo operations. This is a bit