max speed obtainable

max speed obtainable

Post by Jakob Niel » Sat, 03 Apr 2004 22:29:20


I am implementing a pic16f83 in software, but I am not at all happy with the
speed.
On an amd 2.4MHz with windows xp, it runs at a max of 10Mips. I then started
looking for the bottleneck, and ended up commenting out everything and stikk
only got 40Mips. It turns out that just making a for loop with a call to an
empty function in c takes the speed to 50 Mips, if each function call is
considered an instruction.

That alone stops me in my tracks. For this project I will need a mix of
processors running up to 40MHz. If i can barely get a loop running with
50MIps then, there is no way I can make ten real emulations run at the
needed speed at the same time.

The very simple test code I used is appended below. The loop takes roughly 2
seconds

The core-design I am using on the emulator is very basic. The pic has 14 bit
opcodes, which gives a total of 16384 opcodes. I make an array of that size,
and for each entry in it, i have a functionpointer. I read an opcode, use it
as index to the array and call the functionpointer, which in turn executes
the actual instruction. Can anyone comment on that method? It seems to me
that it is the fastest possible way to decipher opcodes. It is rather
wastefull, but then again.. speed is much more important.






#include <stdio.h>

int dummy() {
return 0;
}

int main(){
unsigned long int i;
printf("starter\n");
for (i=0;i<100000000;i++) dummy();
printf("fdig\n");

return 0;
}
 
 
 

max speed obtainable

Post by Slor » Sun, 04 Apr 2004 05:52:15

Despite all prevention efforts, "Jakob Nielsen" < XXXX@XXXXX.COM > wrote in



I don't know if this helps you any, but this sample takes only 1.2
seconds on my 800 MHz office machine in an XP DOS box, compiled with
gcc. If I remember, I'll run the same test on my 2.0 GHz when I get
home.

--
Slor

 
 
 

max speed obtainable

Post by Doug Kwa » Sun, 04 Apr 2004 06:42:52


: Despite all prevention efforts, "Jakob Nielsen" < XXXX@XXXXX.COM > wrote in

:
:> #include <stdio.h>
:>
:> int dummy() {
:> return 0;
:> }
:>
:> int main(){
:> unsigned long int i;
:> printf("starter\n");
:> for (i=0;i<100000000;i++) dummy();
:> printf("f?rdig\n");
:>
:> return 0;
:> }
:>
:
: I don't know if this helps you any, but this sample takes only 1.2
: seconds on my 800 MHz office machine in an XP DOS box, compiled with
: gcc. If I remember, I'll run the same test on my 2.0 GHz when I get
: home.

This is not a very good way to measure call overhead.

The function dummy() has only one caller and its very small.
Most compilers like gcc or Visual C/C++ can do a good job of
inlining such function into its caller and eliminate the function
call overhead. If the compiler is really aggressive, the whole loop
can be eliminated because it does not do anything basically. The
function main() is just like

int main()
{
printf("starter\n");
printf("f?rdig\n");
return 0;
}

The loop is irrevelant here because it is internally to main() and
compilers are allowed to optimise it away.

-Doug (remove ___ to get e-mail)
 
 
 

max speed obtainable

Post by Jakob Niel » Sun, 04 Apr 2004 16:30:38

> This is not a very good way to measure call overhead.

I know that.


Aparently is is not optimized away. It does take a long time. If the loop is
used to fetch opcodes and pass them to matching functions dealing with them,
then it can not be optimized. The above loop is aparently not optimized
either. My point is... if a simple loop takes this long, then how on earth
am I suposed to make any real processing go faster?
 
 
 

max speed obtainable

Post by Laurent De » Sun, 04 Apr 2004 18:41:57


[...]

That's pretty strange. What compiler do you use?
And with what flags? Did you enable optimization?


The speed you get is indeed not very good. My ARMv5TE
simulator reaches 40M ip/s on a P4 3.2 GHz and by using
switch.

BTW using a 16K entry table to dispatch instructions is
not necessarily the way to go as this makes 64KB of data
which is bigger than the L1 Dcache of your P4; of course
you won't be using all of your table but instruction
"decoding" + table dispatch may be more efficient than
table dispatch alone.

And if you still can't reach higher speeds you will
have to write a dynamic recompiler ;-)


Laurent
 
 
 

max speed obtainable

Post by Jakob Niel » Sun, 04 Apr 2004 21:57:01

> BTW using a 16K entry table to dispatch instructions is

You have a good point there. I simply didnt give that any thougth. Silly me
:-/


I am not sure I follow. What do you mean by decoding as oposed to table
dispatch?

By the way.. i am searching for good information about writing fast
emulators. Searching for "emulator", mostly gives links related to questions
like "help! why doesnt my amstrad emulator make sounds?". If you or others
here know of good info relating to the coding part, then I woul like to hear
about them.
 
 
 

max speed obtainable

Post by Paul Robso » Sun, 04 Apr 2004 23:11:48


On any modern PC you should not need to make a 'fast' emulator, unless
you wish to emulate a 32 bit up machine : PSX, PS2, N64, Gamecube,
Dreamcast,XBox. These machines will require dynamic recompilation to
work, anything more antiquated will not.
 
 
 

max speed obtainable

Post by Laurent De » Mon, 05 Apr 2004 05:28:20


I took a quick look at the PIC instruction set. For that,
processor, you could for instance make a switch on
(opc >> 7) & 0x7F; that's what I call instruction encoding
(this is more or less what's done in processors design).


Laurent
 
 
 

max speed obtainable

Post by Laurent De » Mon, 05 Apr 2004 05:31:53


Hum, let's take an example. I have that slow processor
running at 16 MHz. I have my dev tools on that PC with
a simulator. During dev I want max speed out of the
simulator to speed up dev...

You should know instruction set simulators are not only
applicable to console stuff.


Laurent
 
 
 

max speed obtainable

Post by Jakob Niel » Mon, 05 Apr 2004 06:19:00

> On any modern PC you should not need to make a 'fast' emulator, unless

The reason i need a fast emulator, is that i need to run several at the same
time plus calculate physics.
My test before told me that i could at most get a cpu up to 40Mips. sinze
the cpu I primarily should emulate is a 10MHz risc, and i need at least ten
of theese running.. well... cant be done.

Therefore i need tips on how to get it running fast enough.
Why did an unoptimized loop run that slow? If a simple loop is that slow,
then any real processing will be at least that slow.
 
 
 

max speed obtainable

Post by Laurent De » Mon, 05 Apr 2004 09:38:18


My guess is that you are either using a bad compiler
or bad options for it.

#include <stdio.h>

static int foo(void)
{
return 0;
}

int main(int argc, const char *argv[])
{
int i;

printf("start\n");
for (i = 0; i < 100000000; i++)
foo();
printf("end\n");
return 0;
}

Athlon @1GHz, Linux 2.4.22 gcc 3.2.2
gcc -o test test.c ; time ./test
1.11user 0.00system 0:01.10elapsed 100%CPU
gcc -O3 -o test test.c ; time ./test
0.20user 0.00system 0:00.20elapsed 99%CPU


Laurent
 
 
 

max speed obtainable

Post by Jakob Niel » Tue, 06 Apr 2004 06:32:49

> My guess is that you are either using a bad compiler

True. I have been a little confused lately. Can hardly even put together two
words in english without making a mess of it :-)
Looking at the clockcount for a loop and call on an intel 486 tells me that
it should be rather quick.
The loop with jump takes 4 clocks and a far indirect call should be 17. That
is some 21 clocks.
Enter takes 17 and leave is 5. All in all that is 43 clocks.

Given a clock rate of 2.5GHz it should take 0,0000172s.. if the system was
doing nothing else. Is this correct? I have been away from theese
calculations for too long. Been on adventure in the land of RAD.

If the emulator should be faster, the call, enter and leave could be trimmed
away and replaces with a far jump to the opcode handler, and a far jump back
into the decoder routine. All in all I would say that the previously
mentioned PIC processor should be emulatable with no more than 100 clocks pr
opcode.. and that is generous! If we assume I have ten of those running,
that would eat up 1GHz, and plenty would be left.

There is an issue I know sadly little about. How much power does a standard
workstation XP installation drain? Given 2.5GHz, how much can I assume to be
able to grab for myself if no other user applications are running? Can I
simply divide by cpu usage as win xp displays it?

By the way. To make you all laugh for a few minutes. I was trying to get
this running quickly with delegates in c#.net. The problem with that is that
I do not know the IL assembler mnemonics that it compiles to, and that
everything is in objects which leaves little control on the low level. My
second problem was that somewhere along the line, I turned my brain off, and
forgot anything lowleverl I know.


*hrmf* well.... *blush* hehe

By the way.. I have now been looking for a few hours on and off for
specifics about the opcodes on the newer amd and intel cpus. There is a lot
of comerciel stuff arround, but I cant seem to find much about the details.
Which opcodes there are, what they do, and how many clocks they take under
which conditions. I am sure the details is out there somewhere, but as I
said..I cant find it. If one of you have a more or less direct link into the
amd/intel directory containing this, I would very much like a pointer.

Thanks for any help in advance :-)
 
 
 

max speed obtainable

Post by Sebastia » Tue, 06 Apr 2004 07:19:22


Don't forget about cache misses. L2 cache miss costs about 220-300 cycles on
such Atholn machine.
For ideas about efficient way of doing a CPU emulator look at QEMU -- thing
allows for emulation of x86 on x86 at ~20% efficiency.

http://www.yqcomputer.com/



With >90% accuracy, yes. OS overhead is pretty small (~1-5% typically).


[snip]

for example this one:

http://www.yqcomputer.com/


rgds
Sebastian Kaliszewski
--
"Never underestimate the power of human stupidity" -- from Notebooks of L.L.
 
 
 

max speed obtainable

Post by Jakob Niel » Tue, 06 Apr 2004 21:15:45

> Don't forget about cache misses. L2 cache miss costs about 220-300 cycles
on
thing

Thanks for that link. Worth reading.

http://www.yqcomputer.com/

Thanks again :-)

By the way. Making this as a loop in asm i can get 600Mega loops pr second
on a 1844 MHz machine. Looks a tad better now.
 
 
 

max speed obtainable

Post by Doug Kwa » Wed, 07 Apr 2004 04:00:10


:
:> I am not sure I follow. What do you mean by decoding as oposed to table
:> dispatch?
:
: I took a quick look at the PIC instruction set. For that,
: processor, you could for instance make a switch on
: (opc >> 7) & 0x7F; that's what I call instruction encoding
: (this is more or less what's done in processors design).
:
:
: Laurent

If you use a small switch table, you can overlap the fetch and decoding
of the next instruction. For example


while (run) {
opcode = fetch_next_opcode();
switch(opcode) {
case XXX:
/* emulate XXX instruction */
break;
...
}

can be re-written as

opcode = fetch_next_opcode();
while (run) {
switch(opcode) {
case XXX:
/* emulate XXX instruction */
opcode = fetch_next_opcode();
break;
....
}

}

The idea is to prefetch the next opcode while executing the current one.
You do need to be careful in dealing with branches. Since you have 128
cases only, the I-cache impart of prefetch code will not be that great.
This may or may be help because new x86 processors these days speculate
like crazy. The hardware may alread does prefectching without the C source
re-writing. It worths trying still since this is simple to implement.

If you use gcc, which supports computed goto, you can further reduce
the look-up cost. You can get rid of the swtich statement. Like

static void *host_pc_table[128] = {
label_00, label_01, label_02,
...
}

/* code to start emulation */
next_host_pc = fetch_next_opcode_host_pc();
goto *next_host_pc;

/* code to implement instructions */
...

label_XXX:
/* emulate XXX instruction */
next_host_pc = fetch_next_opcode_host_pc();
goto *next_host_pc;

If you still want more performance, you can try writing a threaded code
interpreter (this technique is pretty old and you can find academic papers
about it using google). Basically you cache the result of
fetch_next_opcode_host_pc() above in another memory in the emulator.
i.e., pre-decoding the PIC instructions.

However, it is tricky to maintain coherency between the PIC instructions
and the pre-decoded form. Self-modifying code can kill you.

If threaded code is still not fast enough, trying writing a JIT :)
It is not terribly difficult.

-Doug (remove ___ for e-mail)