using MISC (post 1987 Forth hardware) opcodes

using MISC (post 1987 Forth hardware) opcodes

Post by Jeff Fo » Tue, 11 Jul 2006 10:03:57


n the presentation that I for SVFIG I reviewed some
of the things I had talked about in the past regarding
the optimization of minimal instruction set Forth code. I
presented some numbers from an analysis of code written
over a period of a year. This will be review for many and
new for some but I think there are several significant issues
relating to the use of the instructions in real code and about
which optimization techniques apply.

Smaller code tends to be faster code the five bit opcodes
that manipulate the stack are faster than those that
do memory access including calls and returns, that's
the idea. The following large global optimizations were observed:

An optimization of about 15% was observed in code size
and code speed due to the use of tail recursion. The
reason the number is so large is that the shorter the
length of factored Forth words the greater the increase
in efficiency by saving one call and return. And it is
important to note that the saving of one cell on the
return stack is considered a significant optimization.
With misc code expressed in colorforth Forth words
averaged 1.2 lines of code which made tail recursion
significant.

An optimization of about 15% was observed due to the
use of multiple entry points and muliple exit points for
similar reasons to that observed from the use of
tail recursion. It should also be noted this mechanism
can also save the use of a cell on the return stack. Some
compilers will hide these optimizations rather than expose
them in the source.

A global optimization of about 3% was observed in program
size due to not having to always balance stack use
when stack wrap can be utilized. More signficant is that the
technique provided a local 100% optimization and increase
in resolution in a tight timing loop due to the reduced code
overhead. In other applications in the last year much larger
optimization values were observed using stack wrap.

Other local optimizations include the use of @a+ and !a+
in place of a traditional phrase to achieve a local 800%
speedup improvement and size reduction. The same
style and technique can easily be used in ANS Forth but
because it won't have the advantage of mapping directly
to a five bit opcode so the optimization will not be as
significant. But it does provide a mechanism to exploit
auto-increment addressing modes in an instruction set.

-IF, -WHILE, -UNTIL obsolete quite a long list of
traditional words and provide considerable local
optimization in code size and speed. -IF tends to
be about as important as IF because it is a cheap
and fast bit test and also useful in some math.

Many traditional Forth optimizations don't apply. I used
some of them for years in various projects with little real
value but they can provide some amazing benchmark
hackery. Peep-hole or pin-hole optimization techiques and
register machine and stack in memory optimzations may
not apply at all at times. Some sequences look like they
should optimize away in traditional Forth implementations like:

DUP DROP

Presumably that code would not be written that way in
traditional Forth, but other optimizations done by inlining
macros might easily generate that sequence.

But it may be important to note that it is not the same as null
just as the non-destructive IF is not the same as DUP IF
because DUP is destructive.

On P21 for example:

1 2 3 4 5 6 ( 1 2 3 4 5 6 ) \ bottom of stack is
 
 
 

using MISC (post 1987 Forth hardware) opcodes

Post by Bernd Pays » Tue, 11 Jul 2006 22:10:59


Thanks for the overview.


Yes, that's obvious. That means you can't use DUP IF if you have a full
stack and want to branch on the top of stack element. That's the up side:
you save one stack element. The cost is that you lose the destructive IF,
and you need to DROP the flags when you don't need them (maybe it's my
programming style, but that's the majority). Finally, on my code, the
coincidence of having a DUP IF and a full stack at the same time was zero,
so I reverted to a destructive IF in the two b16 variants that are used in
our products.

Typical cases of IF in my environment:

( value ) #mask and IF
( value ) dup #mask and IF
( value ) #mask and dup IF

The last version actually can be done with a non-destructive IF, but there
is a free stack entry for DUP IF available.

I'd like to see a useful example where you have a full stack at the
non-destructive IF.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.yqcomputer.com/ ~paysan/

 
 
 

using MISC (post 1987 Forth hardware) opcodes

Post by John Dot » Wed, 12 Jul 2006 03:09:18


Have either of you guys considered a flag *register* rather than a flag
in the stack? I think it has some serious advantages.

1. You don't have to worry about either DUP or DROP of flags: the flag
persists until the next test.

2. Explicit arithmetic with flag values is relatively rare, and
"short-circuit evaluation" is often a good (and more efficient!)
substitute anyway.

3. A persistent flag register makes it easy to return success or failure
up through several levels of code, handling it appropritely at each
level without stack gymnastics.

Nondestructive IF seems to me to fight with the stack data flow: it
leaves the flag blocking your other operands so you're almost always
going to need stack gymnastics immediately after. But the need to reuse
the flag is fairly common in my code, so a register allows me to have my
cake and eat it too. The thing I hardly ever want is a stack of flags:
one at a time is usually enough.

--
---
John Doty, Noqsi Aerospace, Ltd.
---
His diagnosis of the hostility ... reflects the willful blindness of the
invader who assures himself that the natives are only made unfriendly by
some other provocation than his own. -Barbara W. Tuchman
 
 
 

using MISC (post 1987 Forth hardware) opcodes

Post by Jerry Avin » Wed, 12 Jul 2006 03:44:25


...


Worth considering.


No. Considered and rejected. If another level also tests, the flag won't
persist.


If I want to mask with the result of > , then it belongs on the stack.
If I want to branch on it, then it's consumed by IF .

Jerry
--
Engineering is the art of making what you want from things you can get.
?
 
 
 

using MISC (post 1987 Forth hardware) opcodes

Post by John Dot » Wed, 12 Jul 2006 04:26:18


I do this fairly frequently. If you're bailing out of a failure, you
usually do no more tests. Success is occasionally more troublesome, so I
sometimes use an explicit "true" at the tail of a definition to indicate
success. That's less troublesome than stack gymnastics, I think.


Masking with the result is not a technique I ever employed even when I
had a Forth that put flags on the stack, so I guess I don't miss it.
Seems like yet another way to be clever at the expense of readability to me.

--
---
John Doty, Noqsi Aerospace, Ltd.
---
His diagnosis of the hostility ... reflects the willful blindness of the
invader who assures himself that the natives are only made unfriendly by
some other provocation than his own. -Barbara W. Tuchman
 
 
 

using MISC (post 1987 Forth hardware) opcodes

Post by Jerry Avin » Wed, 12 Jul 2006 04:41:37


...


Even DUP IF can be seen as clever by one unaccustomed to it. For me,
masking with a flag is a standard idiom.

: >CHAR ( hex_digit -- ASCII_representation ) DUP 9 > 7 AND + 30 + ;

Jerry
--
Engineering is the art of making what you want from things you can get.
?
 
 
 

using MISC (post 1987 Forth hardware) opcodes

Post by jmdrake_9 » Wed, 12 Jul 2006 04:53:48

Bernd Paysan wrote:

The cost/benifit of a destructive/non destructive IF despends on as
well
as influences programming style. Consider this optimized SwiftForth
code Elizabeth posted some time ago.

===================================================
A favorite example is this definition:

: DIGIT ( n -- n ) DUP 9 > IF 7 + THEN [CHAR] 0 + ;

It converts a single digit to its equivalent
character. It optimizes to
(i386 version):

46E89F 9 # EBX CMP 83FB09
46E8A2 46E8AB JLE 0F8E03000000
46E8A8 7 # EBX ADD 83C307
46E8AB 30 # EBX ADD 83C330
46E8AE RET
===================================================

Here's the equivalent in Pentium colorforth.

(version 1)
: digit -10 + -if 58 + ; then 65 + ;

add EAX, -10
jns label
add EAX, 58
ret
label add EAX, 65
ret

Note for "readability" this could be written as:

: ascii0 48 ;
: asciia 64 ;
: digit -10 + -if [ ascii0 10 + ] + ; then [ asciia ] + ;

Same disassembly. (Thanks to Jeff Fox and others
on the colorforth mailing list for helping me with the
disassembly and a few tweaks to the code.)

There are 5 instructions for the SwiftForth version
and 6 for the ColorForth version. The SF version
has a maximum code path of 3 instructions and
the CF version has a max of 2. You can write
a CF version that has the same # instructions
and same max code path as the SF version.

(version 2)
: digit -10 + -if -7 + then 65 + ;

(Again, thanks CF mailing list).

Anyway, all of that to say this. What happens
when you don't have a CMP instruction (like
the SwiftForth Pentium code) and you don't
have a non-destructive -IF?

Note: Per Jeff the F21 would likely pack
these instructions like this:

# - nop nop \ -10 as 21-bit number
9 \ nops for add w/ -number in T
+ jns label \ jns can go in any slot on f21
# + ret x \ x doesn't matter
58 \ "0" +10 +n -10
label # + ret x \ x can be used for something else
65 \ "A" +n -10

This is from version 1. Version 2
wouldn't be used because it packs
into the same number of instructions on
an F21, with a larger max code path.

And while some might consider this a "contrived"
example, remember it didn't come from the
MISC crowd. ;-)

Regards,

John M. Drake

 
 
 

using MISC (post 1987 Forth hardware) opcodes

Post by Jeff Fo » Wed, 12 Jul 2006 11:55:08


What happens at the bottom of stack only really matters if you fill
it all the way or spin it. Those are fairly rare things. You noticed
that I listed it as providing only a 3% optimization in the size of
the code generated in this case.

Despite small stacks needing to fill them all the way is rare
and if you do or want to it does make for a complication in
the complexity of the code do take advantage of those things.
But we think it is not as bad as dealing with the complexity
introduced by stacks that can throw errors instead. The
main thing however is just that these kind of stacks can
be faster than stacks in memory.
 
 
 

using MISC (post 1987 Forth hardware) opcodes

Post by Jeff Fo » Wed, 12 Jul 2006 12:23:12

ohn Doty wrote:

We found that having a stack of registers with flag bits was much more
powerful compared to classic Forth because of the -IF -UNTIL -WHILE
stuff and 2* but I know that you gave up on classic Forth decades ago
for more conventional things like a conventional flag register design.


You don't have to worry about a stack or Forth if you choose something
else.


We found it was really powerful for math. I suppose explicit
arithmetic
with flag values is very rare when you only have a single flag register
bottleneck design. Then software has to work that way so the
powerful techniques we used in advanced math wouldn't be possible.
I am reporting what quite a few programmers who worked with this
stuff said, these are not just my opinions.


Stackable flags makes it easy to nest flags to make code smaller
and faster and factored in a Forth way to match other Forth stuff.
If you only had one flag register you would have to do what you
say. I guess that means it is easier in that non-Forth context
than doing it in Forth if that is impossible.

Stack gymnastics are always bad. They are an example of poorly
factored Forth, the definition of poorly factored Forth. It is one of
them, unnecessary variable use is obviously another.


It is pretty much a non-factor, you gain a little, you lose a little.
It really doesn't matter much.

It was mostly a case of Chuck saying that he felt it would be
better if the language reflected what the hardware actually did
more accurately than to have it add a layer of abstraction and
that the hardware 'really wanted' to do non-destructive tests.

And he said it was going to pretty much be a wash.

And after many man years of programming the staff all said
he was right. It really didn't matter much. You have to have
explicit drops, maybe two if you have a non-destruction branch
and want to get rid of the flag.

And that would happen a lot if you could only test all of the
bits at once. You might want to drop it as often as you test
it in that traditional Forth environment.

But with -IF we might have a math routine, or compiler
routine, or control routine that really want non-destructive
tests. The non-destructive branching makes much more
sense when you consider that I always say that -IF is
more important than IF.

Once you add -IF then non-destructive branching is much
more valuable. That probably isn't obvious if you are not
thinking about Forth or remembering what all those programmers
who did this for all those years said about it.


You haven't convinced me.


One flag register. Not a very novel idea.

We get a few thousand pieces of cake and get to eat them too. So
our code doesn't need any gymnastics to deal with a one
flag register bottleneck. Much of what we showed as our
cleverest code used stacked flags beautifully.

Some people don't like clever code. I know that. We did.


I know. And I think that opinion goes back to around the time
that I started doing Forth and you stopped. So I am not supprised
that you would hardly ever want something like that. ;-)

Thanks for the questions and opinions just the same.

 
 
 

using MISC (post 1987 Forth hardware) opcodes

Post by Elizabeth » Wed, 12 Jul 2006 18:45:49


I concur with the general point that Forth stacks don't need to be big.
We've done a few studies on large applications, and the stacks rarely
get deeper than 4-5 items. It's ironic that Forth's stacks are explicit
and obvious, but remain consistently way smaller than C's implicit
stacks. The stacks on the SEAForth chips seem small, but then the whole
chip is small, and one will have to be sensitive to this when
programming it.

I do like being able to detect stack underflows, though. That's very
helpful in catching programming errors. Ideally, of course, one
shouldn't make errors, but none of us is perfect.

Cheers,
Elizabeth

--
==================================================
Elizabeth D. Rather (US & Canada) 800-55-FORTH
FORTH Inc. +1 310-491-3356
5155 W. Rosecrans Ave. #1018 Fax: +1 310-978-9454
Hawthorne, CA 90250
http://www.yqcomputer.com/

"Forth-based products and Services for real-time
applications since 1973."
==================================================
 
 
 

using MISC (post 1987 Forth hardware) opcodes

Post by Bernd Pays » Wed, 12 Jul 2006 21:22:35


Good example, but slightly missing the point. If you have a destructive -IF,
you would write that

: digit -10 + dup -if -7 + then 65 + ;

and the dup won't destroy anything (since the -10 consumes the same stack
element as the dup after the +). Yes, that's more work to do, but what you
waste here you might gain elsewhere, so it's all about trade-offs, anyway.
And there's no space in the instruction set for both destructive and
non-destructive branches ;-).

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.yqcomputer.com/ ~paysan/
 
 
 

using MISC (post 1987 Forth hardware) opcodes

Post by Jeff Fo » Wed, 12 Jul 2006 23:47:57


That is more code than with a non-destructive -IF, and it is slower.
The point was that a DUP isn't really needed and makes for smaller
and faster code.

And since -IF has been non-destructive for more than fif *** years your
example with all that extra stuff in it might easily confuse people who
might not have remembered all the details of MISC programming from
previous discussions in c.l.f.

It can be optimized a lot by using a ; before the then
that removes a jump that isn't needed, a literal that isn't needed, and

and another + operation which also isn't needed in the -IF path. Whew.

It packs into the same number of words, but isn't so wastefully
inefficient, the -IF path in your example is more than twice as
long as it should be. If it is easy to make the code more than
twice as fast, and it costs nothing it is a good thing.

An introduction to using MISC opcodes should emphasise this
on day one.

If you are writing a short script on a computer with gigs of memory
then a little extra typing or a program being twice as big or twice
as slow as it could be is no big deal. It doesn't matter much in
that kind of scripting.

But the idea in machineForth is that hardware and software can
be minimized to get the maximum performace for cost for the
targeted app. And in that environment you really don't want
code where each layer is twice as big and twice as slow as
it should be.

It takes a little thought and a desire to make the programs
efficient or there is no point at all to small cheap fast efficient
hardware except that it is easy to make bigger more expensive
and slower copies of it. Anyone can make the hardware slower and
the software bigger and slower. Our goal is to make the
hardware and softtware smaller, cheaper, faster, and
easier to program.
 
 
 

using MISC (post 1987 Forth hardware) opcodes

Post by jmdrake_9 » Thu, 13 Jul 2006 00:06:46


Hello Bernd. I didn't "miss" the point. You made two points and I
addressed
one of them. Here's the point I was addressing.

************************************************************************************
The cost is that you lose the destructive IF, and you need to DROP the
flags when you don't need them (maybe it's my programming style, but
that's the majority).
************************************************************************************

On the Pentium you have an actual CMP instruction so it's possible for
SwiftForth to make the type of optimization it does. But most MISC
chips don't have that instruction. This leads to a different coding
style,
at least in this example. That doesn't mean there aren't times you
need to DROP the flag under MISC any more than it means there
aren't times you need to DUP the flag under ANS. But this is just
an example where factors OTHER than whether or not you have
a destructive IF dictates what's more efficient. Of course there are
tradeoffs. Nobody is debating that there aren't.

Regards,

John M. Drake
 
 
 

using MISC (post 1987 Forth hardware) opcodes

Post by mhx » Thu, 13 Jul 2006 07:34:06

Elizabeth D Rather < XXXX@XXXXX.COM > writes Re: using MISC (post 1987 Forth hardware) opcodes
[..]

Hm.

I tried iForth, SwiftForth, gForth and VFX with the following words
(all on Windows XP):

: w1 S" HELLO" TYPE ; TEST w1
: w2 123.123e-12 F. ; TEST w2
( w3 ) TEST WORDS
( w4 ) TEST CREATE aap

Approximate number of return stack cells needed for ...

word iForth SwiftForth gForth VFX
------+------------------------------------
w1 2 42 1 10
w2 6 49 4 21
w3 7 126 2 4
w4 9 75 8 8
------+------------------------------------

Wow, there is some pretty fine-grained factoring going on
in SwiftForth :-)

Maybe my stack trick doesn't work for your company's Forth.
What would be your own results for these words?

In a distant past, Coos Haak independently wrote a depth checker.
He might be able to crosscheck some of these results.

-marcel

-- -----------------------------------------------------------
0 [IF]
This utility assumes that stacks grow down.
This utility assumes that stacks are at least 130 cells deep.
This utility is not very accurate for shallow programs (words).
[THEN]

0 VALUE deep130
0 VALUE rminimum

: @+ DUP CELL+ SWAP @ ;

0 VALUE s^
CREATE s-stack 256 CELLS ALLOT

: >S s^ 1+ $FF AND DUP TO s^ CELLS s-stack + ! ;
: -S s^ 1- $FF AND TO s^ ;
: S s^ CELLS s-stack + @ ;
: S> S -S ;

\ This word writes the same address all over the return stack
: MARK-130 deep130 130 < IF deep130 1+ TO deep130
RECURSE
THEN ;

\ This word serves to find out how many levels TEST itself will use.

: CALIBRATE 0 LOCALS| lowraddr |
0 TO deep130
['] NOOP >S
RP@ 130 CELLS - TO lowraddr
MARK-130 S> EXECUTE
lowraddr DUP @ >S
130 0 DO @+ S <> IF LEAVE THEN LOOP -S ( addr)
lowraddr 130 CELLS + SWAP - 1 CELLS / TO rminimum ;

CALIBRATE

: FILL-DATA ( -- )
100 0 DO -2 -2 >S LOOP
100 0 DO DROP -S LOOP ;

: TEST 0 0 0 LOCALS| lowraddr | \ <?> TEST #<name># --> <?>
0 TO deep130
' ( token) >S
RP@ 130 CELLS - TO lowraddr
MARK-130 FILL-DATA S> EXECUTE

lowraddr DUP @ >S
130 0 DO @+ S <> IF LEAVE THEN LOOP -S ( addr)
lowraddr 130 CELLS + SWAP - 1 CELLS / rminimum -
CR ." Approximately " . ." Rstack cells used." ;

: .ABOUT CR ." TEST <word> finds out how many stack cells <word> uses."
CR ." Example: TEST CR "
CR ." The accuracy is rather low for `shallow' words." ;

.ABOUT CR
 
 
 

using MISC (post 1987 Forth hardware) opcodes

Post by Elizabeth » Thu, 13 Jul 2006 07:54:23


What's going on here is that W1 and W2 do output, which requires Windows
calls. Windows calls use buckets of stack, because it's C-ish rather
than Forth-ish. My good results are from standalone Forth apps. That's
an appropriate comparison when the size of SEAForth stacks is at issue.

We've actually done some applications on 8051 versions with on-chip
stacks which are pretty small.

Cheers,
Elizabeth

--
==================================================
Elizabeth D. Rather (US & Canada) 800-55-FORTH
FORTH Inc. +1 310-491-3356
5155 W. Rosecrans Ave. #1018 Fax: +1 310-978-9454
Hawthorne, CA 90250
http://www.yqcomputer.com/

"Forth-based products and Services for real-time
applications since 1973."
==================================================
 
 
 

1. =?ks_c_5601-1987?B?uau34SC7+cfDwLsgw7zH6MfYILq4vLy/5H5+fn656LzbuvEgwP0=?= =?ks_c_5601-1987?B?tOsguau34SEhIaGhoaGhoaGhoaFmOTQ3ZjI1OTViNGEwMTNjZjlh?= =?ks_c_5601-1987?B?YTkzYjBiNjc0OGRhNg==?=

2. =?ks_c_5601-1987?B?tOu5sCAgKNPe2qopICDExMfDt7q9ui4gILOyvLogKNH74PUpIMDa?= =?ks_c_5601-1987?B?vcWwqC4gICAgICAgICAgICAgICAgICAgICAgICAgIHMgIHogIGZw?= =?ks_c_5601-1987?B?Ymx3?=

3. =?ks_c_5601-1987?B?uau34bv1x8PAuyDDvMfox9gguri8vL/kfn656LzbuvEguau34SEh?= =?ks_c_5601-1987?B?oaGhoaGhoaGhoTZiZGJkZTg3ZTQ3ZDdmZDg0YTFjNTVkNjgwMDZj?= =?ks_c_5601-1987?B?YzEy?=

4. =?ks_c_5601-1987?B?v8DHwrHis+QhISAyNL3DsKMgvLrAzrvnwMzGriC5q7fhwMy/67HH?= =?ks_c_5601-1987?B?IC0guri9w7DtILDhwaTHz7y8v+QuIMDavcXA1r3AtM+02SAgICAg?= =?ks_c_5601-1987?B?ICAgICAgICAgICAgICAgICAgbXprd2kgIHIgbyBt?=

5. =?ks_c_5601-1987?B?KLGksO0pMjAwM8PWvcUguau34SC8vL26v7W+7iDDvMfovcXDu8fP?= =?ks_c_5601-1987?B?vLy/5C64tsH2uLex4si4wNS0z7TZLkA=?=

6. =?ks_c_5601-1987?B?KLGksO0px/a060dQUy4usO3BpCzAzLW/xKu43rbztbUgucy4rr7L?= =?ks_c_5601-1987?B?uLO0z7TZLi5ALTVidGhtOUA=?=

7. =?ks_c_5601-1987?B?KLGkICC/w8fYv6G0wiC6uLTZIL7IwaTA+8DOIMH3wb7AuyC/+MfP?= =?ks_c_5601-1987?B?vcW02bjpLi4u?=

8. =?ks_c_5601-1987?B?KLGksO0podogv+7A/MC7IMfPvcO46SCyv36/wSC53r7GsKG8vL/k?= =?ks_c_5601-1987?B?IKHaIMf2tOtHUFM4NTDAzLqlxq5A?=

9. =?ks_c_5601-1987?B?KLGksO0px/a060dQUy4uM7naNMDPILmrt+HA5cL4KMO8x+gpvK0=?= =?ks_c_5601-1987?B?uvG9usHfwNS0z7TZLkAtREJPVElHQA==?=

10. =?ks_c_5601-1987?B?KLGksO0pxKu15bTrsd0yLjAwMLi4v/jB773DtOvD4iEhISG5q7nm?= =?ks_c_5601-1987?B?ua4uwaS9xLXut8++98O8QA==?=

11. =?ks_c_5601-1987?B?tOu5sCAgKNPe2qopICDExMfDt7q9ui4gILOyvLogKNH74PUpIMDa?= =?ks_c_5601-1987?B?vcWwqC4gICAgICAgICAgICAgICAgICAgICAgIGpyYWlx?=

12. =?ks_c_5601-1987?B?KLGksO0pwaYxyLggJ77Wv8+1v7mwsPy4rrvnJyAguau34bfOIMPW?= =?ks_c_5601-1987?B?vcW89sfosKHAzLXluKYgteW4s7TPtNkhQA==?=

13. =?ks_c_5601-1987?B?vbrEq8DMIL3Fw7u4uMfPvcO46SAzsLO/+bCjILmrwbawxyC5q7fh?= =?ks_c_5601-1987?B?IS03OG5RRUNeXg==?=

14. =?ks_c_5601-1987?B?VmlydHVhbEZyZWUgsKG787jeuPC4riDH2MGmvcMguN648LiuILiv?= =?ks_c_5601-1987?B?ua7Bpi4u?=

15. =?ks_c_5601-1987?B?KLGksO0pIEMtYS1yLWS/rMO8LCBDLWEtci1kILTrsd0gus7Btywg?= =?ks_c_5601-1987?B?w9bA5SAyNLCzv/kgutDH0rvzyK8hIEA=?=