dward just noticed some useful emails came back from old friends
(previous job who works with GP - that said they're C++ guys and not
really into Matlab) I'll upload them (this fitness fuction of
evaluating the trees is a real pain and the email sent back to me
Have you tried evolving three classifiers.
The first segregates classs one from the other two
The next segregates class two from the other two
The last segregates class three from the other two
You can use IF THEN ELSE or min max for example
use a population size of 1000 in each case. As you evaluate a GP
tree over all the data remember which branches were "fired". So if
min (A,B) where A and B are two subtrees put a 0 if it went left and
a 1 if it went right and do this against TP FP TN FN (true positive,
false positive, true negative, false negative). Run it 20 times on
the same or different machines. Compute the FOM you want. Then look
at each case TP, FN, TN, FN and look at these memories of 0 and 1.
You will find that perhaps in TP only certain branches will fire and
not others, so you can understand in this way which parts of the tree
executed and so simplify the tree for whatever task (e.g. TP for
discriminating class 1 from the rest).
In order to get the precise FOM you want then you may use another
trick. As you evaluate the GP tree over all cases and you compute
the FOM and say it comes out to favour too much the FPs or something
then simply slide the threshold (>0 for example as above to >
epsilon where epsilon is a small constant until you get the FOM that
you want). We did not do this but it is a trick that Maarten gave us
once which makes sense.
If you use IF THEN ELSE or min max alone you will never get anything
that "slides" but some FOM with abrupt changes. Consider using plus
minus divide (protected) and multiply in the formula. Do not use
constants in the GP tree, I think that you will get better
generalization without them.
Moreover, you should use linear scaling (interval arithmetic you do
not need because the protected division and the fact that the return
from the tree is a threshold and a bit meaningless means you do not
have to rely on interval arithmetic in this case)
For these things you should look at Maarten Keijzer's paper which won
best paper in the 2003 EuroGP paper
[Kei03] Maarten Keijzer. Improving symbolic regression with interval
arithmetic and linear scaling. In Conor Ryan, Terence Soule, Maarten
Keijzer, Edward Tsang, Riccardo Poli, and Ernesto Costa, editors,
Genetic Programming, Proceedings of EuroGP'2003, volume 2610 of LNCS,
pages 71-83, Essex, 14-16 April 2003. Springer-Verlag.
[ bib | .ps.gz | Abstract ]
Use the technique known as linear scaling because it should help you
a lot with nice improvements.
Limit your tree size to 1000, use tournament selection, and make sure
your population is randomly assembled correctly, etc, all the basic
things you do right. Also use the efficiency measure of Koza it is
not perfect but will tell you something about your problem.
By the way, the way to understand the GP tree is to pick the smallest
for your analysis, one you got in one of your parallel runs which is
almost as good as the best one. There is always one run that
stochastically results in shorter GP trees and then you can use
analyse that one for comprehension rather