assembler speed...

Post by cr8819 » Sat, 27 Mar 2010 00:18:15


well, this was a recent argument on comp.compilers, but I figured it may
make some sense in a "freer" context.

basically, it is the question of whether or not a textual assembler is fast
enough for use in a JIT context (I believe it is, and that one can benefit
notably from using textual ASM here).

in my case, it works ok, but then I realized that I push relatively low
volumes of ASM through it, which leaves me wondering about the
higher-volume cases.


so, some tests:
basically, I have tried assembling a chunk of text over and over again (in a
loop) and figuring out how quickly it was pushing through ASM.

the test runs for about 10s, keeping track of elapsed time and of how many
times the loop has run, and from this it can figure out how quickly ASM is
being processed.
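for reference, the kind of timing loop described above might be sketched like this (the `assemble_stub` and `bench_throughput` names are invented for illustration; the real assembler entry point would go where the stub is):

```c
#include <string.h>
#include <time.h>

/* stand-in for the real assembler entry point; the real function would
   parse and emit code, this one just scans the buffer */
static volatile size_t sink;

static size_t assemble_stub(const char *src, size_t len)
{
    size_t lines = 0;
    for (size_t i = 0; i < len; i++)
        if (src[i] == '\n')
            lines++;
    return lines;
}

/* assemble the same chunk in a loop for ~dur_secs seconds and
   report throughput in bytes of ASM per second */
static double bench_throughput(const char *src, double dur_secs)
{
    size_t len = strlen(src);
    clock_t start = clock();
    clock_t limit = start + (clock_t)(dur_secs * CLOCKS_PER_SEC);
    unsigned long iters = 0;

    while (clock() < limit) {
        sink = assemble_stub(src, len);   /* volatile sink defeats DCE */
        iters++;
    }
    double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
    return (double)iters * (double)len / elapsed;
}
```

note that clock() measures CPU time, not wall time, which is usually what you want for this kind of throughput number.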

the dynamic linker is currently disabled in these tests, as this part proves
problematic to benchmark due to technical reasons (endlessly re-linking the
same code into the running image doesn't turn out well, as fairly quickly
the thing will crash).

(I would need to figure out a way to hack-disable part of the dynamic linker
to use it in benchmarks).



initially, I found that my assembler was not performing terribly well, and
the profiler showed that most of the time was going into zeroing memory. I
fixed this, partly by reducing the size of some buffers, and partly by
disabling the 'memset' calls in a few cases.

then I went on a search, trying to micro-optimize the preprocessor, and
also finding and fixing a few bugs (resulting from some recent additions to
the preprocessor functionality).

at this point, it was pulling off around 1MB/s (so, 1MB of ASM per second).

I then noted that most of the time was going into my case-insensitive
compare function, which is a bit slower than the case-sensitive compare
function (strcmp).

doing a little fiddling in the ASM parser reduced its weight, and got the
speed to about 1.5MB/s.

as such, most of the time still goes into the case-insensitive compare, and
into the function that reads tokens.
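for what it's worth, one common way to cheapen a case-insensitive compare is a table-driven lower-casing loop, which avoids a tolower() call per character; a sketch (the `lc` table and `stricmp_fast` are made-up names, not the poster's actual functions):

```c
/* one-time lower-casing table: avoids a function call per character */
static unsigned char lc[256];

static void lc_init(void)          /* call once at startup */
{
    for (int i = 0; i < 256; i++)
        lc[i] = (i >= 'A' && i <= 'Z') ? (unsigned char)(i + 32)
                                       : (unsigned char)i;
}

/* case-insensitive compare, strcmp-style result */
static int stricmp_fast(const char *a, const char *b)
{
    const unsigned char *ua = (const unsigned char *)a;
    const unsigned char *ub = (const unsigned char *)b;

    while (lc[*ua] == lc[*ub]) {
        if (*ua == '\0')
            return 0;
        ua++;
        ub++;
    }
    return (int)lc[*ua] - (int)lc[*ub];
}
```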


I am left to wonder if this is "fast enough".

I am left to wonder if I should add options for a no-preprocessor +
case-sensitive mode (opcodes/registers/... would be necessarily lower-case),
...

but, really, I don't know how fast people feel is needed.


but, in my case, I will still use it, since the other major options:
having codegens hand-craft raw machine-code;
having to create and use an API to emit opcodes;
...

don't really seem all that great either.

and, as well, I guess the volumes of ASM I assemble are low enough that it
has not been much of an issue thus far (I tend not to endlessly re-assemble
all of my libraries, as most loadable modules are in HLL's, and binary
object-caching tends to be used instead of endless recompilation...).

for most fragmentary code, such as resulting from eval or from
special-purpose thunks, the total volume of ASM tends to remain fairly low
(most are periodic and not that large).

likely, if it did become that much of an issue, there would be bigger issues
at play...

or such...
 
 
 

assembler speed...

Post by Marco van » Sat, 27 Mar 2010 01:05:33


When we replaced the external assembler with an internal one, the overall
build was 40% faster on Linux/FreeBSD, and more than 100% faster on
Windows.

We attributed the difference to slower I/O and, mainly, slower .exe
startup/shutdown times on Windows.

 
 
 

assembler speed...

Post by Branimir M » Sat, 27 Mar 2010 02:34:41

On Thu, 25 Mar 2010 08:18:15 -0700



This is fasm's time compiling its own source for all 4 platforms it
supports, on my machine:

bmaxa@maxa:~/fasm/source$ time fasm DOS/fasm.asm | fasm Linux/fasm.asm | fasm libc/fasm.asm | fasm Win32/fasm.asm
flat assembler version 1.68 (16384 kilobytes memory)
4 passes, 83456 bytes.

real 0m0.060s
user 0m0.080s
sys 0m0.000s
bmaxa@maxa:~/fasm/source$ find . -name 'fasm*' -exec ls -l {} \;
-rw-r--r-- 1 bmaxa bmaxa 99982 2010-03-25 18:21 ./libc/fasm.o
-rw-rw-r-- 1 bmaxa bmaxa 4874 2009-07-06 15:44 ./libc/fasm.asm
-rw-r--r-- 1 bmaxa bmaxa 77635 2010-03-25 18:21 ./DOS/fasm.exe
-rw-rw-r-- 1 bmaxa bmaxa 5260 2009-07-06 15:44 ./DOS/fasm.asm
-rw-r--r-- 1 bmaxa bmaxa 83456 2010-03-25 18:21 ./Win32/fasm.exe
-rw-rw-r-- 1 bmaxa bmaxa 6160 2009-07-06 15:44 ./Win32/fasm.asm
-rwxr-xr-x 1 bmaxa bmaxa 75331 2010-03-25 18:21 ./Linux/fasm
-rw-rw-r-- 1 bmaxa bmaxa 4694 2009-07-06 15:44 ./Linux/fasm.asm
bmaxa@maxa:~/fasm/source$
bmaxa@maxa:~/fasm/source$ find . -name '*.inc' -exec ls -l {} \;
-rw-rw-r-- 1 bmaxa bmaxa 5424 2009-07-06 15:44 ./libc/system.inc
-rw-rw-r-- 1 bmaxa bmaxa 50682 2009-07-06 15:44 ./expressi.inc
-rw-rw-r-- 1 bmaxa bmaxa 138351 2009-07-06 15:44 ./x86_64.inc
-rw-rw-r-- 1 bmaxa bmaxa 7779 2009-07-06 15:44 ./DOS/system.inc
-rw-rw-r-- 1 bmaxa bmaxa 1995 2009-07-06 15:44 ./DOS/sysdpmi.inc
-rw-rw-r-- 1 bmaxa bmaxa 10419 2009-07-06 15:44 ./DOS/modes.inc
-rw-rw-r-- 1 bmaxa bmaxa 24541 2009-07-06 15:44 ./parser.inc
-rw-rw-r-- 1 bmaxa bmaxa 37936 2009-07-06 15:44 ./assemble.inc
-rw-rw-r-- 1 bmaxa bmaxa 7916 2009-07-06 15:44 ./Win32/system.inc
-rw-rw-r-- 1 bmaxa bmaxa 6290 2009-07-06 15:44 ./Linux/system.inc
-rw-rw-r-- 1 bmaxa bmaxa 46363 2009-07-06 15:44 ./preproce.inc
-rw-rw-r-- 1 bmaxa bmaxa 3860 2009-07-06 15:44 ./errors.inc
-rw-rw-r-- 1 bmaxa bmaxa 1805 2009-07-06 15:44 ./version.inc
-rw-rw-r-- 1 bmaxa bmaxa 82747 2009-07-06 15:44 ./formats.inc
-rw-rw-r-- 1 bmaxa bmaxa 2404 2009-07-06 15:44 ./messages.inc
-rw-rw-r-- 1 bmaxa bmaxa 48970 2009-07-06 15:44 ./tables.inc
-rw-rw-r-- 1 bmaxa bmaxa 2267 2009-07-06 15:44 ./variable.inc


Greets!


--
http://www.yqcomputer.com/

Sometimes online sometimes not
 
 
 

assembler speed...

Post by Robbert Ha » Sat, 27 Mar 2010 03:24:59

Hi CR,



I would imagine that it depends on what you consider "fast enough".


Right. If you handle only low volumes, many solutions tend to be
"fast enough".

In my experience, assemblers (that read assembly code and produce
machine code) tend to be quite fast. It seems to me that many compilers
spend more time processing the source language into (optimized) assembly
than the assembler spends turning the resulting assembly code into
machine code.

On the other hand, parsing text can be quite time consuming. In programs
I have profiled, it is not uncommon to find that they spend most of their
time parsing their input. Although I haven't profiled any assemblers, I
could easily imagine that parsing and recognizing opcodes takes up most
of their time.

To answer all the questions here, it would probably be a good idea to
first come up with a definition of "fast enough", and then, if you find
your program isn't fast enough by this definition, to profile it to figure
out where it is spending most of its time.

Another question is why you would be going through assembly code at all.
What benefit does it provide, compared to, for example, generating machine
code directly? Surely, if speed is a concern, you could benefit from
cutting out the assembler altogether.

Kind regards,

Bob
 
 
 

assembler speed...

Post by cr8819 » Sat, 27 Mar 2010 03:53:20


this assembler exists as a library, not as an external tool, so
startup/shutdown time isn't an issue, nor is IO...

the issue is more one of textual ASM vs., for example, manually crafting
machine-code sequences in the codegen...
 
 
 

assembler speed...

Post by Maxim S. S » Sat, 27 Mar 2010 04:15:26

> basically, it is the question of whether or not a textual assembler is fast

What is the value of this?

The values of JIT:

a) platform-independent binaries; the platform-dependency occurs only on load and not on build.
b) mandatory, really mandatory, enforcement - with no chance for malicious tools to escape it - of things like exception handling, attribute-based code access rights, and garbage collection.

Both a) and b) are only achievable if some IL is used pre-JIT, not real assembler.

IL is a) platform-independent and b) simply has no means to bypass security or exception frames, or to do leakable memory allocations.

Real ASM is not such.

You can, though, invent a textual IL and binarize it on load. But what is the value of this, compared to IL binarized at build time?

--
Maxim S. Shatskih
Windows DDK MVP
XXXX@XXXXX.COM
http://www.yqcomputer.com/
 
 
 

assembler speed...

Post by cr8819 » Sat, 27 Mar 2010 04:22:02

"Robbert Haarman" < XXXX@XXXXX.COM > wrote in message
news: XXXX@XXXXX.COM ...

agreed.

it has been fast enough for my uses, but others claim that JIT requires
things like directly crafting the sequences in the codegen.

however, an assembler has many advantages:
it avoids the tedium of endlessly re-crafting the same basic opcodes;
it can automatically figure out how wide jumps need to be;
...
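as a sketch of the jump-width point above: an x86 assembler can pick the 2-byte short form when the displacement fits in a signed byte, and the 5-byte near form otherwise (function names here are illustrative, not from the poster's assembler):

```c
/* decide between the x86 short (2-byte, opcode + rel8) and near
   (5-byte, opcode + rel32) unconditional jump forms */
static int jump_fits_short(long disp)
{
    return disp >= -128 && disp <= 127;
}

/* 'target' and 'jump_addr' are byte offsets within the code buffer;
   rel8 is measured from the end of the 2-byte short form */
static int jump_encoded_size(long target, long jump_addr)
{
    long short_disp = target - (jump_addr + 2);
    return jump_fits_short(short_disp) ? 2 : 5;
}
```

a real assembler has to iterate this to a fixed point, since shrinking one jump moves every later label.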



agreed.

the vast majority of the time in my compiler tends to go into higher-level
operations:
parsing C code;
working with AST's;
processing the IL and running the codegen;
...

often to produce only a few kB to be run through the assembler at a time.


FWIW, my assembler is MUCH faster than my C upper-end: it can assemble
about 1.5MB of ASM per second, vs. about 250kB per second for the C
frontend (a common case, given the volumes of crap it can pull in from
headers...).



opcode lookup uses a hash-table, and doesn't really show up in the profiler.
much more of the time at present goes into recognizing and reading off
tokens.
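a minimal token reader of the kind being profiled here might look like the following (a generic sketch, not the poster's actual code):

```c
#include <ctype.h>
#include <stddef.h>

/* read one token (identifier/number or single punctuation char) from *sp
   into buf, skipping blanks; advances *sp and returns the token length,
   0 at end of input */
static size_t read_token(const char **sp, char *buf, size_t bufsz)
{
    const char *s = *sp;
    size_t n = 0;

    while (*s == ' ' || *s == '\t')
        s++;

    if (isalnum((unsigned char)*s) || *s == '_' || *s == '.') {
        while ((isalnum((unsigned char)*s) || *s == '_' || *s == '.') &&
               n + 1 < bufsz)
            buf[n++] = *s++;
    } else if (*s != '\0') {
        buf[n++] = *s++;            /* punctuation: one char per token */
    }
    buf[n] = '\0';
    *sp = s;
    return n;
}
```

the inner character-classification loop is exactly the kind of place such profiles point at, which is why table-driven classification tends to pay off.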



well, I know about my programs.
the question is, what about everyone else?...

so, the goal is more of a general answer, rather than something simply
relevant to my projects...



producing it directly is IMO a much less nice option.
it is barely even a workable strategy with typical bytecode formats, and
with x86 machine code would probably suck...


admittedly, if really needed I could add a binary-ASM API to my assembler
(which would allow using function calls to generate ASM), but this is
likely to be much less nice than using a textual interface, and likely
could not optimize jumps (short jumps would probably need to be explicit).

OTOH, another assembler could be written, but there is not likely a whole
lot of point for this at present.

...



 
 
 

assembler speed...

Post by cr8819 » Sat, 27 Mar 2010 04:38:44


<--
What is the value of this?

The values of JIT:

a) platform-independent binaries; the platform-dependency occurs only on
load and not on build.
b) mandatory, really mandatory, enforcement - with no chance for malicious
tools to escape it - of things like exception handling, attribute-based
code access rights, and garbage collection.

Both a) and b) are only achievable if some IL is used pre-JIT, not real
assembler.

IL is a) platform-independent and b) simply has no means to bypass
security or exception frames, or to do leakable memory allocations.

Real ASM is not such.

You can, though, invent a textual IL and binarize it on load. But what is
the value of this, compared to IL binarized at build time?
-->


in question here is using ASM within the JIT, IOW, in the post-IL stages.

for example, a person loads their bytecode, and has the option of how to go
about JIT'ing it.

admittedly, for large binary-image IL formats (such as MSIL/CIL), using ASM
could have a performance impact.


IL is loaded, run through codegen, and then:
A. textual ASM is produced, and run through an assembler, and linked into
the running image;
B. machine code is produced in-place.

A. has the potential for a much cleaner implementation, but at the cost
that it is not as fast and, on average, there is more code (both in the
codegen and in the assembler itself);

B. is very fast, and needs relatively little code, but IMO often leads to a
much less clean and much less reusable implementation (likely the codegen
will depend far more on architecture specific details than had it simply
used an assembler, since the assembler would manage many low-level ISA
details).

the few times I had used B (mostly because early on I couldn't use the
assembler from within the assembler), it was a very unpleasant experience,
typically involving going back and forth over the Intel docs to craft the
particular instruction sequences (and also for each CPU mode).

I have generally since used almost entirely dynamically-produced textual
ASM, since this is, FWIW, a much nicer way to do it.

similarly, with ASM it is also much more obvious which instructions are
being used, taking out the problem of having to recognize many of the
opcodes by byte sequence, ...

...
 
 
 

assembler speed...

Post by Maxim S. S » Sat, 27 Mar 2010 04:52:48

> IL is loaded, run through codegen, and then:

So, ASM is the intermediate form of IL->machine code translator.

And why this form? Maybe there are other, more effective ways?


Matter of taste.


Perf loss, added code complexity - all for a matter of taste.

--
Maxim S. Shatskih
Windows DDK MVP
XXXX@XXXXX.COM
http://www.yqcomputer.com/
 
 
 

assembler speed...

Post by bart » Sat, 27 Mar 2010 04:57:31


Yes, the question is why bother with a textual, human-readable
representation of something between one compiler stage and another?

An internal form has many advantages:

* A lot of things are already known, such as address modes, which will be
lost in the textual representation and have to be reparsed later

* Metadata can be present, useful for internal processing, and can appear on
printouts of the internal form. For example, what does the "1" mean in "mov
eax,1"? It could be just 1, or a label, or a type, or a string constant
(index into a table). You lose that in text form, unless a complex series of
comments or some scheme of identifiers is used.

* Optimisation (and other processing) is more practical.

If the asm output has to be stored in a file, textual asm might have some
merit. But then the whole process of generating asm source in an acceptable
format is full of its own problems; storing a series of records in a file
to be retrieved later is far simpler, and the file will likely be smaller
(and more private).

--
Bartc
 
 
 

assembler speed...

Post by bart » Sat, 27 Mar 2010 05:07:36


I should explain: by 'internal form' I don't mean binary machine code, but
a representation of it as a series of records. Another stage turns that
into runnable machine code.
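such a record form might be sketched as tagged structs; with this, the "1" in "mov eax,1" keeps its meaning instead of flattening to text (all type and field names here are invented for illustration):

```c
/* an instruction held as a record rather than as text */
typedef enum { OPER_NONE, OPER_REG, OPER_IMM, OPER_LABEL } OperKind;

typedef struct {
    OperKind kind;
    int      value;     /* register number, immediate, or label id */
} Operand;

typedef enum { OP_MOV, OP_OR, OP_JMP } Opcode;

typedef struct {
    Opcode  op;
    Operand dst, src;
} Insn;

/* the "mov eax,1" example: the operand tag records that 1 is an
   immediate, not a label or table index */
static Insn make_mov_imm(int reg, int imm)
{
    Insn i = { OP_MOV, { OPER_REG, reg }, { OPER_IMM, imm } };
    return i;
}
```

a later stage walks an array of such Insn records and emits machine code, with no reparsing in between.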

--
Bartc
 
 
 

assembler speed...

Post by Robbert Ha » Sat, 27 Mar 2010 05:19:11

On Thu, Mar 25, 2010 at 12:22:02PM -0700, cr88192 wrote:

I wouldn't worry about that too much, unless your code generator is the
most interesting part of what you are making. First, make it work. Then
you can think about making it better - assuming you don't have more
interesting things to tackle.


I don't really see that. The way I see it, most of the work is in getting
from what you have (presumably some instruction-set-independent source code
or intermediate representation) to the instructions of your target platform.
Once you are there, I think emitting these instructions as binary or as
text doesn't make too much of a difference.

I've written code to emit binary instructions for various targets, and,
in my experience, it's not very hard. Sure, x86's ModRM is a bit tricky,
but you write that once and then it will just sit there, doing its job.
In the grand scheme of writing a compiler, this isn't a big deal.

Generating the instructions in binary form right away also makes it very
easy to know exactly where your code ends up and what its size is, which
may actually make it _easier_ to patch addresses into your code and
make decisions about short vs. long jumps.


My experience is that how nice the API is depends very much on the
language you express it in. For example, I've tried to come up with a nice
API for instruction generation in C, but never got it to the point where
I was really happy with it. In a language which lets you write out a
data structure in-line, preferably with automatic memory management and
namespaces, this is much easier.

It's the difference between, for example:

n += cg_x86_emit_reg32_imm8_instr(code + n,
                                  sizeof(code) - n,
                                  CG_X86_OP_OR,
                                  CG_X86_REG_EBX,
                                  42);

and

(emit code '(or (reg ebx) (imm 42)))

Cheers,

Bob


 
 
 

assembler speed...

Post by Marco van » Sat, 27 Mar 2010 06:51:41


That's the difference, and maybe a small bit more. The rest is the
generating and reparsing.

I think you need both: the fast one as the default, and the other one to
be able to do global checks, and because it is very handy with diff tools
(to compare objdumps of code assembled via both tracks).
 
 
 

assembler speed...

Post by bart » Sat, 27 Mar 2010 07:05:17


In a mini-API I'm developing at the moment, the above might come out
something like:

genmc(or_opc, genam(p), genam(p,2))

where p refers to a specific IL instruction. A lot of the details just
disappear. In fact the same code deals with reg/imm, reg/mem and reg/ptr
variations, and probably also half-a-dozen others.

(This is built on top of lower level functions. Using such a function, your
specific example would look like:

genmc(or_opc, 4, rmreg,rm_ebx,, rmdisp,,42)

but be capable of expressing *any* instruction (of the subset I use).)

There is nothing remarkable about the language, other than it uses dynamic
types.

I suppose that using textual output, it might look like:

genmct("or ebx,42")

except you rarely output entire instructions like that; textual output can
get very messy when you throw in a few variables.

--
Bartc
 
 
 

assembler speed...

Post by Rod Pember » Sat, 27 Mar 2010 07:21:08

"cr88192" < XXXX@XXXXX.COM > wrote in message
news:hofus4$a0r$ XXXX@XXXXX.COM ...

Is TCC when used as TCCBOOT fast enough in a JIT context?!? ...

We know that interpreters are a bit slower than compilers, and compilers do
take some time too. How fast is fast enough is very relative to 1)
generation of microprocessor, 2) size of files, 3) in-memory or on-disk, 4)
language complexity, etc.

You may just be testing the OS's buffering abilities here...


Instead of memset()-ing entire strings, you might try just setting the first
char to a nul character: str[0]='\0'; It's not as safe, but if your code
is without errors, it shouldn't be an issue.

Instead of strcmp(), you can try switches on single chars, progressing
through only the chars needed to obtain the required info. Sometimes this
works because you only need one or two characters out of much longer
keywords to distinguish them from the other keywords.

Character-directed parsing can speed things up too. Determining what a
syntax component is, say integer or keyword, takes time. But if you put a
character in front that indicates what follows, you don't have to do that
processing to determine whether it's an integer or keyword. E.g., an
example from an assembler of mine:

.eax _out $255

"dot" indicates a register follows. "underscore" indicates an instruction
follows. "dollar-sign" indicates a decimal integer follows. Each directive
character is passed to switch(), which selects the appropriate parsing
operation. The parser doesn't have to determine _what_ "eax" or "out" or
"255" is; it "knows" from the syntax. That's a large part of the parsing
logic eliminated. When you program, you know what the directive character
is and can easily insert the correct one. Code generators also "know" -
since you coded them... It's just an inconvenience to type the extra
characters if you're doing a lot of assembly.
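a sketch of that directive-character dispatch (the token classes follow the ".eax _out $255" syntax from the post; the function names are made up):

```c
#include <stdlib.h>

/* classify a token purely by its leading directive character */
typedef enum { T_REG, T_INSN, T_INT, T_UNKNOWN } TokClass;

static TokClass classify(const char *tok)
{
    switch (tok[0]) {
    case '.': return T_REG;     /* .eax -> register            */
    case '_': return T_INSN;    /* _out -> instruction         */
    case '$': return T_INT;     /* $255 -> decimal integer     */
    default:  return T_UNKNOWN;
    }
}

/* only valid for T_INT tokens: skip the '$' and convert */
static int parse_int(const char *tok)
{
    return (int)strtol(tok + 1, NULL, 10);
}
```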

If you use memory instead of file I/O, processing will be faster. Linked
lists, esp. doubly linked, can also speed up in-memory processing.
Allocating memory in a single large block, instead of calling malloc()
repeatedly "as you go" or as needed, can simplify the arrangement of
objects in the allocated memory. It can eliminate pointers, reduce the
object size, etc.
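the single-large-block suggestion is essentially an arena (bump-pointer) allocator; a minimal sketch, with invented names:

```c
#include <stdlib.h>

/* one large block carved up with a bump pointer, instead of many mallocs */
typedef struct {
    char  *base;
    size_t used, cap;
} Arena;

static int arena_init(Arena *a, size_t cap)
{
    a->base = malloc(cap);
    a->used = 0;
    a->cap  = cap;
    return a->base != NULL;
}

static void *arena_alloc(Arena *a, size_t n)
{
    n = (n + 7u) & ~(size_t)7u;       /* round up to 8-byte alignment */
    if (a->used + n > a->cap)
        return NULL;                   /* out of space */
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

static void arena_free(Arena *a)       /* frees everything at once */
{
    free(a->base);
    a->base = NULL;
    a->used = a->cap = 0;
}
```

freeing everything in one shot also suits throwaway compile-time scratch data well.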


Decide on one case, such as lowercase. That cuts your processing in half.
Use hash functions; they can eliminate multiple strcmp()'s. Try to
strcmp() only once, to eliminate possible collisions.
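a sketch of that hash-then-single-strcmp() lookup: a tiny open-addressing table (mnemonics, opcode numbers, and names are all made up for illustration):

```c
#include <string.h>

#define NBUCKETS 16

typedef struct { const char *name; int opcode; } Entry;

static Entry table[NBUCKETS];          /* zero-initialized: empty */

/* djb2-style string hash */
static unsigned hash_str(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

static void table_add(const char *name, int opcode)
{
    unsigned i = hash_str(name) % NBUCKETS;
    while (table[i].name)              /* linear probe on collision */
        i = (i + 1) % NBUCKETS;
    table[i].name = name;
    table[i].opcode = opcode;
}

static int table_find(const char *name)   /* -1 if absent */
{
    unsigned i = hash_str(name) % NBUCKETS;
    while (table[i].name) {
        if (strcmp(table[i].name, name) == 0)  /* usually the only strcmp */
            return table[i].opcode;
        i = (i + 1) % NBUCKETS;
    }
    return -1;
}
```

with a decent hash and a sparsely loaded table, the single strcmp() confirming the bucket is the common case, exactly as suggested above.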


If it's low use, you can eliminate much code by removing checks. I.e., if
you know your compiler correctly emits registers "eax", "ebx", etc., don't
implement a check for invalid registers. Some people would call such
techniques "unsafe" programming - which is true. But, since the code is
used in a controlled environment and without any "garbage" input, it'll
speed things up if the code does less work such as safety checks.


Rod Pemberton