AMD64 floating point performance 32 vs. 64

Post by jbaco » Sun, 09 Jan 2005 02:12:59



I recently upgraded two workstations with identical K8N motherboards,
Athlon 64 3400+ processors, and 1 GB of PC3200 RAM.

On the first, I did a clean installation of FreeBSD 5.3R AMD64, and
compiled all of our software under the new OS.

FreeBSD 5.3-RELEASE #0: Thu Dec 9 11:32:46 CST 2004
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: AMD Athlon(tm) 64 Processor 3400+ (2210.77-MHz K8-class CPU)
Origin = "AuthenticAMD" Id = 0xf4a Stepping = 10

Features=0x78bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2>
AMD Features=0xe0500800<SYSCALL,NX,MMX+,LM,3DNow+,3DNow>
real memory = 1073479680 (1023 MB)

On the second, I simply replaced the old PII motherboard, CPU, and RAM,
and booted from the original disk, which holds a recent i386 5.3R
installation.
Nothing was recompiled or reinstalled.

FreeBSD 5.3-RELEASE #1: Thu Nov 11 10:01:35 CST 2004
ACPI APIC Table: <A M I OEMAPIC >
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: AMD Athlon(tm) 64 Processor 3400+ (2411.75-MHz 686-class CPU)
Origin = "AuthenticAMD" Id = 0xfc0 Stepping = 0

Features=0x78bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2>
AMD Features=0xe0500000<NX,AMIE,LM,DSP,3DNow!>
real memory = 1073479680 (1023 MB)

I then ran our standard benchmark, a program called 3dDeconvolve, which
processes MRI image data. To my surprise, there was NO DIFFERENCE in
performance between the 32-bit and 64-bit OSes. I noticed that the CPU
is clocked slightly faster on the 32-bit machine (2411.75 vs. 2210.77
MHz), but I don't think this should be significant. 3dDeconvolve is
extremely floating-point intensive, and uses mostly doubles.

I was really expecting to see a notable performance advantage from the
code compiled under the AMD64 OS. I've read through a dozen articles
on AMD64 and gcc, and most of them suggest that there should be a gain
when recompiling in a 64-bit environment. I tried adding
-march=athlon64 and -msse2 (which I suspect are defaults anyway), and
not surprisingly, they had no effect.

The question is, is 32-bit code capable of reaping the full advantage
of the AMD64 floating point hardware? If not, what do I have to do to
optimize performance on the 64-bit platform?
TIA,

Jason W. Bacon
Medical College of Wisconsin
 
 
 

Post by dwmalon » Sun, 09 Jan 2005 06:11:28


XXXX@XXXXX.COM writes:


[I'm guessing here, but...]

I'm not sure, but I think that as long as you're using -msse2 in
32 bit mode, then there probably won't be a big difference for
things that are dependent purely on floating point. The extra integer
registers, bigger registers and larger address space that are the
obvious extras of running in 64-bit mode won't help for pure
floating point stuff.

Another possibility is that your application is limited mainly by
the speed of memory, rather than the CPU. I'm guessing that MRI
data may be quite big, and if your program does random accesses to
it then you may spend a lot of time fetching data to fill the CPU's
caches.

David.

 
 
 

Post by conrad » Sun, 09 Jan 2005 06:14:38

In article < XXXX@XXXXX.COM >,


I'd be inclined to say "It depends". It varies from one port to the next.
Some ports provide optimization options for 64-bit processors, while others
don't.

For instance, audio/libmad's configure script allows one to specify
"--enable-fpm=64bit" (to enable 64-bit math). Other ports' Makefiles,
configure scripts and whatnot provide similar "knobs", while many others
don't.

So, like I said, "it depends". :-)

--
Conrad J. Sabatier < XXXX@XXXXX.COM > -- "In Unix veritas"
 
 
 

Post by Logan Sha » Sun, 09 Jan 2005 06:30:04


Hmm, seems to me that in the x86 world, arithmetic on doubles has
been implemented in hardware for at least 15 years (since
the days of the 80387 coprocessor), so I don't see why there should
be any big difference between 32-bit and 64-bit mode when doing
floating point.

- Logan
 
 
 

Post by Nino Dehn » Sun, 09 Jan 2005 07:31:37


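[Quoted dmesg output, with these differences marked: clock speed
2210.77 MHz vs. 2411.75 MHz, Id 0xf4a vs. 0xfc0, Stepping 10 vs. 0]
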
How about using identical CPUs and _then_ perform benchmarks that
compare 32bit vs. 64bit?

Just a thought,

ND
 
 
 

Post by jbaco » Sun, 09 Jan 2005 07:40:19

They *are* identical. The differences in dmesg.boot are due to the
fact that one is running in 32-bit compatibility mode.
 
 
 

Post by karg » Sun, 09 Jan 2005 08:27:10

In article < XXXX@XXXXX.COM >,
XXXX@XXXXX.COM writes:

The CPUs have different steppings. They are not identical.

--
Steve
http://www.yqcomputer.com/ ~kargl/
 
 
 

Post by jbaco » Sun, 09 Jan 2005 13:00:09


I kind of doubt that the CPU stepping number would have any significant
effect on performance. But, you are correct - they are only
"essentially" identical. :-)

Anyway, my point in posting in the first place was to open a discussion
comparing i386 and AMD64 FreeBSD as number-crunching platforms. It
would appear that there isn't any advantage in going with the AMD64
native OS for floating point intensive applications. (As long as we
stay under the 4 GB boundary, that is.) The purist in me prefers
leaving 32-bit OSes behind, but at this point it also means sacrificing
Matlab and other Linux apps, OpenOffice, Java, and a few other things.
( Although I've been impressed by the fact that almost all the ports
support AMD64, and I understand that 32-bit binary support, including
Linux compat, is in the works. )

The AMD64 processor definitely blows away all the 32-bit processors
I've seen, and rivals the G5 and even edges it out for really big
analyses. But, at this point, I would venture to guess that the 1 MB
cache is the main reason. (Our analyses process matrices sized
anywhere from a few hundred megabytes to a gigabyte.)
Thanks to everyone who contributed...

JB
 
 
 

Post by karg » Sun, 09 Jan 2005 13:23:50

In article < XXXX@XXXXX.COM >,
XXXX@XXXXX.COM writes:


Stepping = 0 is, well, not good. :-)
Stepping = 10 is 10 revisions later.


I get a 60% increase in execution speed on my dual opteron system
over my athlon and pentium 4 M systems. This is a simple recompile.
I haven't tried to optimize for the larger caches on the opteron.

AFAIK, there are several new registers available for FP on the
Opteron compared with any IA-32 processor (at least double the number of
registers). This should provide for better pipelining at some
point (ie. the compiler needs to catch up).

Your app name suggests 3D FFTs where you're striding all over
memory. I suspect you're bound by the bus speed and memory speed.


--
Steve
http://www.yqcomputer.com/ ~kargl/
 
 
 

Post by talo » Sun, 09 Jan 2005 19:20:01


As someone said, the x86 floating point unit has natively operated on
80-bit values since the beginning. Hence there is no reason why floating
point operations should be faster in 64-bit mode than in 32-bit mode per se
(neglecting all other factors).

However, if the computation uses a mix of floating point and integer
operations, then the 64-bit processor should shine. One of my colleagues
has benchmarked his program on an Athlon64 FX51, and it runs twice as fast
as on a 3 GHz P4; for another colleague, the difference is not so big.
Things are extremely variable. If you look at the benchmarks published
in reviews, it is the same: sometimes the Athlon64 runs much faster
than the Pentium4 EE, sometimes not (particularly on applications which
use a lot of SSE2).


--

Michel TALON
 
 
 

Post by conrad » Sun, 09 Jan 2005 21:50:10

In article < XXXX@XXXXX.COM >,


32-bit (native) binary support already exists, although it's still fairly
rough around the edges.

Linux 32-bit emulation has been available for a good while now, and works
quite nicely, actually. I use it to run acroread, realplayer, java, and
anything else that needs it.

As far as overall performance goes, as another poster already pointed out,
the availability of twice as many machine registers, all with double the
width of the 32-bit processors' registers, really goes a long way to
enabling much more optimized code to be generated by gcc and the like.

I love my Athlon 64! :-)

--
Conrad J. Sabatier < XXXX@XXXXX.COM > -- "In Unix veritas"
 
 
 

Post by dwmalon » Mon, 10 Jan 2005 01:06:12


XXXX@XXXXX.COM (Michel Talon) writes:


The P4's performance seems to be quite twitchy for FPU stuff. We
were looking at some (hand-written assembly) code for generating
random floating point numbers from a Pascal compiler. We found that
a 1 GHz P3 performed roughly the same as a 3 GHz P4. The Athlon and
Athlon 64 machines we tried did much better. I guess it is some
sort of instruction scheduling issue.

David.
 
 
 

Post by talo » Mon, 10 Jan 2005 03:20:38


Yes, but as I said, I have another colleague who claims that the P4
performs as well as the Athlon64 on his own computations (the reason
why we have just one Athlon64 in our lab!). These are floating point
computations that access memory very randomly. On the other hand, the
computations which run twice as fast on the Athlon64 are symbolic
mathematical computations, I think under Maple. This means a lot
of integer and memory stuff.


--
Michel Talon
 
 
 

Post by jbaco » Wed, 12 Jan 2005 02:59:27


I've seen similar results compared to our recent 32-bit Athlon
processors.
What surprised me is that a recompile isn't necessary to get the full
performance gain from the AMD64.


Yes, AMD64 doubles the number of SSE (XMM) floating point registers,
from 8 to 16, along with the general purpose integer registers. I
suspect the compiler's ability to utilize them is highly dependent on
the particular application. They don't appear to be doing much for
3dDeconvolve...


Yes, it is. It's not a pretty sight when this program causes
swapping...

I also ran a simple benchmark that clears, adds to, and multiplies an
array of 32 million doubles, and the 64-bit code was still only a few
percent faster than the 32-bit code (both running on the AMD64, which
was way faster than our Athlon XPs).

Here's a snippet:

/* Pass 1: clear the array. difftimeofday() is a helper not shown here;
   it evidently returns the elapsed time in microseconds. */
gettimeofday(&t1, NULL);
for (dp = (double *)mem, end = mem + MEM_SIZE; dp < (double *)end; ++dp)
    *dp = 0.0;
gettimeofday(&t2, NULL);
diff = difftimeofday(&t2, &t1);
printf("Done. Time = %lums (%0.2f million words/sec)\n\n", diff / 1000,
       (double)MEM_SIZE / sizeof(double) / diff);

/* Pass 2: add a constant to every element. */
gettimeofday(&t1, NULL);
for (dp = (double *)mem, end = mem + MEM_SIZE; dp < (double *)end; ++dp)
    *dp += 1234.0;
gettimeofday(&t2, NULL);
diff = difftimeofday(&t2, &t1);
printf("Done. Time = %lums (%0.2f million adds/sec)\n\n", diff / 1000,
       (double)MEM_SIZE / sizeof(double) / diff);