status update 1 ( assembler speed...)

Post by BGB / cr88 » Mon, 29 Mar 2010 14:53:21

well, a status update:
1.94 MB/s is the throughput reached in "normal" operation (textual
interface, preprocessor, jump optimization, ...);
5.28 MB/s is reached in "fast" mode, which bypasses the preprocessor and
forces single-pass assembly.

10 MB/s (analogous) can be gained by using a direct binary interface (newly
added).
in the case of this mode, most of the profile time goes into a few predicate
functions, and also the function for emitting opcode bytes. somehow, I don't
think it is likely to get that much faster.

stated another way: 643073 opcodes/second, or about 1.56us/op.
calculating from CPU speed, this is around 3604 clock cycles / opcode (CPU =
2.31 GHz).

basically, I have a personal optimization heuristic:
when the top item reported by the profiler is the entry point to a switch
statement, not many more optimizations are likely to be gained
(the so-called "switch limit"). a variant of this has happened in this case.

in the binary mode, the test fragment is pre-parsed into an array of
struct-pointers, and these structs are used to drive the assembler internals
(with pre-resolved opcode numbers, ...).

the fragment has 462 ops and manages to be re-assembled 41758 times before
the timer expires (timer expire is 30s, so 1391 re-assembles/second).

to get any faster would likely involve sidestepping the assembler as well
(such as using a big switch and emitting bytes), but this is not something I
am going to test (it would make about as much sense as benchmarking it
against memcpy or similar: yes, memcpy is faster, but no, it is not an
assembler).

so, at the moment, this means an approx 5x speed difference between the
fastest and the slowest modes.

I am not really sure if this is all that drastic of a difference...

or such...

status update 1 ( assembler speed...)

Post by Rod Pember » Mon, 29 Mar 2010 16:16:54


A few years ago, I posted the link below for large single-file programs
(talking to you...). I'm not sure if you ever looked at their file sizes, but
the largest two were gcc as a single file and an ogg encoder as a single
file, at 3.2MB and 1.7MB respectively. Those are probably the largest
single-file C programs you'll see. It's possible, even likely, that some
multi-file project, say the Linux kernel etc., is larger. But 10MB/s
should still be very good for most uses. And there's no reason to stop
there, if you've got the time!


BTW, what brand of CPU, and how many cores are being used?


OpenWatcom is (or was) one of the fastest C compilers I've used. It skips
emitting assembly text. Given the speed, I'm sure they did much more than
that... It might provide a reference point for a speed comparison. I
haven't used more recent versions (I'm using v1.3), so I'm assuming the
speed is still there.

Rod Pemberton


status update 1 ( assembler speed...)

Post by Robbert Ha » Mon, 29 Mar 2010 16:41:38

To provide another data point:

First, some data from /proc/cpuinfo:

model name : AMD Athlon(tm) Dual Core Processor 5050e
cpu MHz : 2600.000
cache size : 512 KB
bogomips : 5210.11

I did a quick test using the Alchemist code generation library. The
instruction sequence I generated is:

00000000 33C0 xor eax,eax
00000002 40 inc eax
00000003 33DB xor ebx,ebx
00000005 83CB2A or ebx,byte +0x2a
00000008 CD80 int 0x80

for a total of 10 bytes. Doing this 100000000 (a hundred million) times
takes about 4.7 seconds.

Using the same metrics that you provided, that is:

About 200 MB/s
About 100 million opcodes generated per second
About 24 CPU clock cycles per opcode generated



status update 1 ( assembler speed...)

Post by Rod Pember » Mon, 29 Mar 2010 17:22:48

Unrelated FYI, your BogoMips should be twice that for that cpu. I suspect
you listed it for _one_ core, as /proc/cpuinfo does. Look in
/var/log/messages to see if your total is twice that. It should say both
cores are activated and list the total. I'm really not sure what anyone
could use BogoMips for...

Rod Pemberton

status update 1 ( assembler speed...)

Post by Branimir M » Mon, 29 Mar 2010 17:58:29

On Sun, 28 Mar 2010 04:22:48 -0400

Well, actually Linux computes that number from BIOS figures, not the real
figures. For example, if you set a 400MHz FSB and a multiplier of 8,
it will not show 3.2GHz but 3.6, if your max multiplier is 9.
For the same reason, if you set a 400MHz FSB with auto multiplier
and SpeedStep enabled, it will show 2GHz when the multiplier
is 6 and 3GHz when the multiplier is 9,
but the actual clock is 2.4GHz/3.6GHz, not 2GHz/3GHz
as shown.



Sometimes online sometimes not

status update 1 ( assembler speed...)

Post by bart » Mon, 29 Mar 2010 19:21:13

I'm not sure MB/s is that useful a measure (of assembler source code chars
per second?), depending as it does on syntax, white-space and comments.

That would be generated instructions per second then, rather more useful.

The first assembler I ever wrote, required about 1500 clocks/instruction
(for memory to memory assembly). That was on a 2.5MHz Z80 (an average of
perhaps 100-200 machine instructions to assemble one instruction of source
code). I don't think I've ever been able to match that since...

For a current project, I read files from disk, using my asm representation,
and convert them to binary form. This has to be done for all modules known
to be needed for the application.

If the whole thing (ie. loading and fixing up) ends up taking more than a
fraction of a second, I'd be very surprised. In that case there is a range
of techniques to try to ensure that loading any application is more-or-less
instant (loading modules on demand, for example).


status update 1 ( assembler speed...)

Post by BGB / cr88 » Tue, 30 Mar 2010 00:25:45

now that I am reminded, I remember them some, but not much...

AMD Athlon 64 X2 4400.
however, all this runs in a single thread, so the number of cores doesn't
affect much.

internally, it runs at 2.31 GHz I think, and this becomes more notable when
doing some types of benchmarks.

my newer laptop has a Pentium 4M or similar, and outperforms my main
computer for raw computational tasks, but comes with rather lame video HW
(and so still can't really play any games much newer than HL2, which runs
similarly well on my old laptop, despite the old laptop being much slower
in general).
well, all this is for my assembler (written in C), but it assembles ASM
source.

note that my struct-array interface doesn't currently implement all the
features of the assembler.

status update 1 ( assembler speed...)

Post by BGB / cr88 » Tue, 30 Mar 2010 01:07:05

"Robbert Haarman" < XXXX@XXXXX.COM > wrote in message
news: XXXX@XXXXX.COM ...

well, that is actually a faster processor than I am using...

I don't know the bytes output, I was measuring bytes of textual-ASM input:
"num_loops * strlen(input);" essentially.

in the structs-array case, I pre-parsed the example, but continued to
measure against this sample (as-if it were still being assembled each time).

yeah, but they are probably doing something differently.

I found an "alchemist code generator", but it is a commercial app which
processes XML and uses an IDE, so maybe not the one you are referencing
(seems unlikely).

my lib is written in C, and as a general rule has not been "micro-tuned for
max performance" or anything like this (and is also built with MSVC, with
debug settings).

I have been generally performance-tuning a lot of the logic, but not
actually changing much of its overall workings (since notable structural
changes would risk breaking the thing).

mine also still goes through most of the internal logic of the assembler,
mostly bypassing the front-end parser and using pre-resolved opcode numbers
and similar.

emitting each byte is still a function call, and may check for things like
the need to expand the buffer, ...
the output is still packaged into COFF objects (though little related to
COFF is all that notable on the profiler).

similarly, the logic for encoding the actual instructions is still
ASCII-character-driven-logic (it loops over a string, using characters to
give commands such as where the various prefixes would go, where REX goes,
when to place the ModRM bytes, ...). actually, the logic is driven by an
expanded form of the notation from the Intel docs...

there is very little per-instruction logic (such as instruction-specific
emitters), since this is ugly and would have made the thing larger and more
complicated (but, granted, it would have been technically faster).

hence, why I say this is a case of the "switch limit", which often causes a
problem for interpreters:
most of the top places currently in the profiler are switch statements...

this ASCII-driven-logic is actually the core structure of the assembler, and
so is not really removable. otherwise my tool for writing parts of my
assembler for me would have to be much more complicated (stuff is generated
from the listings, which tell about things like how the instructions are
structured, what registers exist, ...).

actually, a lot of places in my framework are based around ASCII-driven
logic (strings are used, with characters used to drive particular actions in
particular pieces of code, typically via switch statements).

this would include my x86 interpreter, which reached about 1/70th of native
speed.

but, hell, people would probably really like my C compiler upper-end, as
this is essentially a huge mass of XML-processing code... (although no XSLT;
instead, mostly masses of C code which recognize specific forms and work
with them...).


status update 1 ( assembler speed...)

Post by BGB / cr88 » Tue, 30 Mar 2010 01:23:03

well, the code has almost no extra whitespace or comments.
it is actually a fairly dense fragment borrowed from elsewhere in my
project.

yeah, this was from my pre-parsed code.

yeah, my clocks count is a little higher.

as mentioned elsewhere, this may be partly because of my tendency to use
ASCII-driven program logic in some cases.

the core part of the assembler, the part which constructs the machine
instructions, is itself based on ASCII-driven logic (which basically gives
commands for how to generate the specific instruction).

but, I am not thinking it is too terrible, since I am assembling an opcode
in similar time to what it takes the Win32 API to lock or unlock a mutex
object, or to perform an interface-method-call (in my object system), ...


for my stuff, most of the code is already in binary form, as I make fairly
heavy use of statically-compiled libraries written in good old C.

yes, ok.

status update 1 ( assembler speed...)

Post by Branimir M » Tue, 30 Mar 2010 02:14:33

On Sun, 28 Mar 2010 08:25:45 -0700

Well, I measured a quad Xeon against a dual Athlon slower than yours at
initializing 256MB of RAM: 4 threads on the Xeon, 2 threads on the
Athlon, same speed.
The point is that the speed was the same with the strongest 3.2GHz dual
Athlon as well.
Intel models with an external memory controller are slower with memory
than Athlons. You need to overclock to at least a 400MHz FSB to compete
with Athlons.



Sometimes online sometimes not

status update 1 ( assembler speed...)

Post by Robbert Ha » Tue, 30 Mar 2010 02:37:13

Hi cr,

On Sun, Mar 28, 2010 at 09:07:05AM -0700, BGB / cr88192 wrote:

Yes, it is. That's why I posted it. I am sure the results I got aren't
directly comparable to yours, and the different CPU is one of the reasons.

Oh, I see. I misunderstood you there. I thought you would be measuring
bytes of output, because your input likely wouldn't be the same size for
textual input vs. binary input.

Of course, that makes the MB/s figures we got completely incomparable.
I can't produce MB/s of input assembly code for my measurements, because,
in my case, there is no assembly code being used as input.

Right. I could, of course, come up with some assembly code corresponding to
the instructions that I'm generating, but I don't see much point to that.
First of all, the size would vary based on how you wrote the assembly code,
and, secondly, I'm not actually processing the assembly code at all, so
I don't think the numbers would be meaningful even as an approximation.

Clearly, with the numbers being so different. :-) The point of posting these
numbers wasn't so much to show that the same thing you are doing can be
done in fewer instructions, but rather to give an idea of how much time
the generation of executable code costs using Alchemist. This is basically
the transition from "I know which instruction I want and which operands
I want to pass to it" to "I have the instruction at this address in memory".
In particular, Alchemist does _not_ parse assembly code, perform I/O,
have a concept of labels, or decide what kind of jump instruction you need.

Right. The one I am talking about is at

Right, I forgot to mention my compiler settings. The results I posted
are using gcc 4.4.1-4ubuntu9, with -march=native -pipe -Wall -s -O3
-fPIC. So that's with quite a lot of optimization, although the code for
Alchemist hasn't been optimized for performance at all.

I expect that this may be costly, especially with debug settings enabled.
Alchemist doesn't make a function call for each byte emitted and doesn't
automatically expand the buffer, but it does perform a range check.

Right. Alchemist doesn't know anything about object file formats. It just
gives you the raw machine code.

That may be a major difference, too. Alchemist has different functions for
emitting different kinds of instruction. For reference, the code that
emits the "or ebx,byte +0x2a" instruction above looks like this:

/* or ebx, 42 */
n += cg_x86_emit_reg32_imm8_instr(code + n,
                                  sizeof(code) - n,

There are other functions for emitting code, with names like
cg_x86_emit_reg32_reg32_instr, cg_x86_emit_imm8_instr, etc.

Each of these functions contains a switch statement that looks at the
operation (an int) and then calls an instruction-format-specific function,
substituting the actual x86 opcode for the symbolic constant. A similar
scheme is used to translate the symbolic constant for a register name to
an actual x86 register code.

You can take a look at
for all the gory details, if you like.




status update 1 ( assembler speed...)

Post by BGB / cr88 » Tue, 30 Mar 2010 02:49:49


well, whatever the case, my 2009-era laptop with a Pentium 4 outperforms my
2007-era desktop with an Athlon 64 X2, at least for pure CPU tasks.

I haven't really compared them with memory-intensive tasks.

I put DDR-2 PC2-6400 RAM in my desktop, but the BIOS regards it as 5400 (as
does memtest86...).
I don't know what the laptop uses.

for games, the main issue is video HW, as apparently the "Intel Mobile
Video" or whatever isn't exactly good...
main computer has a "Radeon HD 4850".


status update 1 ( assembler speed...)

Post by Branimir M » Tue, 30 Mar 2010 03:04:01

On Sun, 28 Mar 2010 10:49:49 -0700

Intel Core 2 is much faster than the Athlon on CPU tasks clock for clock,
when data is in cache, but the Athlon is faster when you have to write a
lot of data at the same time.
That's why Intel has a larger cache, to compensate for that.
The i7 changed that, as it has an internal memory controller.



Sometimes online sometimes not

status update 1 ( assembler speed...)

Post by BGB / cr88 » Tue, 30 Mar 2010 03:22:43

"Robbert Haarman" < XXXX@XXXXX.COM > wrote in message
news: XXXX@XXXXX.COM ...



I can't directly produce (meaningful) bytes of output either, since the
output is currently in the form of unlinked COFF objects...


mine does all this apart from the IO.

input and output is passed as buffers, although input can be done into the
assembler via "print" statements, which are buffered internally, which is
one of the main ways of using the assembler.

trivially different is the "puts" command, which doesn't do any formatting,
and hence is a little faster if the code is pre-formed.



MSVC's performance generally falls behind GCC's in my tests anyways...

the range check is used, and typically realloc is used if the buffer needs
to expand.
the default initial buffers are 4kB and expand by a factor of 1.5, and with
the example I am using this shouldn't be an issue.

yep, and mine produces objects which will be presumably passed to the
dynamic linker (but other common uses include writing them to files, ...).

my tests have typically excluded the dynamic linker, as it doesn't seem to
figure heavily in the benchmarks, would be difficult to benchmark, and also
tends to crash after relinking the same module into the image more than a
few k times in a row (I suspect it is likely using up too much memory or
similar).

mine works somewhat differently then.

in my case, the opcode number is used, and then the specific form of the
instruction for the given arguments is looked up (typically using
predicate-based matchers), and this results in a string which tells how to
emit the bytes for the opcode.

this string is passed to the "OutBodyBytes" function, which follows the
commands in the string (typically single letters telling where to put
size/addr/REX/... prefixes, apart from XOP and AVX instructions, which are
special and may use several additional characters to define the specific
prefix), and outputs literal bytes (typically represented in the command
string as hex values).

each byte is emitted via "OutByte", which deals with matters of putting the
byte into the correct section, checking if the buffer for that section needs
to expand, ...

or, IOW, it is a more generic assembler...


status update 1 ( assembler speed...)

Post by Waldek Heb » Wed, 31 Mar 2010 05:50:51

For a little comparison: Poplog needs 0.24s to compile about
20000 lines of high-level code, generating about 2.4 MB of
image. Only part of the generated image is instructions; the rest
is data and relocation info. A conservative estimate is about
10 machine instructions per high-level line, which gives
about 200000 instructions, that is, about 800000 instructions
per second.

Poplog generates machine code from a binary intermediate form
(slightly higher level than assembler; typically one
intermediate operation generates 1-3 machine instructions).
Code is generated in multiple passes, at least two: in the
next-to-last pass the code generator computes the size of the code,
then a buffer of appropriate size is allocated, and in the final pass
code is emitted to the buffer.

The code generator cannot generate arbitrary x86 instructions,
just the ones needed to express intermediate operations.
Bytes are emitted via function calls; opcodes and modes
are symbolic constants (textual in source, but integers
in compiled form).

My feeling is that trying to use strings as intermediate form
(or even "character based dispatch") would significantly
slow down code generator and the whole compiler.

BTW: I tried this on a 2.4 GHz Core 2. The machine is quad
core, but Poplog uses only one. L2 cache is 4MB per two cores
(one pair of cores shares one cache on one die; the other pair
of cores is on a second die and has its own cache). IME Core 2
is significantly faster (about 20-30%) than a similarly clocked
Athlon 64 (I have no comparison with newer AMD processors),
so the results are not directly comparable with yours.

Waldek Hebisch