creating an (inefficent) alternating regular expression from a list of options

creating an (inefficent) alternating regular expression from a list of options

Post by metaperl.c » Wed, 10 Sep 2008 22:12:19


Pyparsing has a really nice feature that I want in PLY. I want to
specify a list of strings and have them converted to a regular
expression.

A Perl module which does an aggressively optimizing job of this is
Regexp::List -
http://www.yqcomputer.com/ ~dankogai/Regexp-Optimizer-0.15/lib/Regexp/List.pm

I really dont care if the expression is optimal. So the goal is
something like:

vowel_regexp = oneOf("a aa i ii u uu".split()) # yielding r'(aa|a|uu|
u|ii|i)'

Is there a public module available for this purpose?
 
 
 

creating an (inefficent) alternating regular expression from a list of options

Post by George Sak » Thu, 11 Sep 2008 00:24:39


I was about to start a thread about a very similar problem; hope I'm
not hijacking this thread. However I am interested in a solution that
(1) scales to a potentially large number of alternate strings
(hundreds or thousands), which will hit, sooner or later, some max
limit in the builtin regexps and (2) is not limited to strings but can
be used for arbitrary objects. Therefore I am looking for a different
approach, perhaps a specialization of the general regexp search
algorithm.

More formally, given (a) an input sequence I and (b) a set of
'censored' sequences S, find and remove all the sequences in S that
appear in I. For instance, if
S = set([(a,), (a,b), (a,c), (c,), (d,a)])
and
I = (x, a, b, c, y, d, a, b), the filtered sequence would be:
O = (x, y, b)
i.e. the censored subsequences are (a,b), (c,), and (d,a).

As with regexps, if sequence `x` is a prefix of `y`, the longer
sequence `y` wins when both match (e.g. if (a,b) matches, it
supersedes (a,)).

For extra bonus, the algorithm should remove overlapping subsequences
too, i.e. for the input I above, the output would be (x, y) since both
(d,a) and (a,b) would match for (d,a,b).

With respect to complexity, I am mainly interested in len(S); len(I)
is small for my application, typically no more than 10. Of course, an
algorithm that scales decently in both len(S) and len(I) would be even
better. Any ideas or relevant links ?

George

PS: This is not a homework ;)

 
 
 

creating an (inefficent) alternating regular expression from a list of options

Post by metaperl.c » Sat, 20 Sep 2008 05:08:59


> >>>> I really dont care if theexpressionis optimal. So the goal is> > >>>> something lik>: >>
>>>> vowel_regexp = oneOf("a aa i ii u uu".split()) yielding r'(aa|>|uu>> > >> u>ii>i)'>>>
> >> Is there a public module available for this>pu>pose?
>
> Check Ka-Ping Yee's rx> m>dule:
>
> ttp://lfw.org/pyth<n/

Ok < http://www.yqcomputer.com/ ;thon/rxb.py>
suffers from the possibility of putting shorter match before longer
one:

def either(*alternatives):
options = []
for option in alternatives:
options.append(makepat(option).regex)
return Pattern('\(' + string.join(options, '|') +>'\)')


>lso, check PyPI>to see if
> someone has already updated rxb for use with re.

No one has - http://www.yqcomputer.com/ %3Aaction=search&term=rxb&submit=search

no results returned
 
 
 

creating an (inefficent) alternating regular expression from a list of options

Post by metaperl.c » Sat, 20 Sep 2008 05:20:58


yes, thank you very much:

import re

def oneOf(s):
alts = sorted(s.split(), reverse=True)
alts = [re.escape(s) for s in alts]
return "|".join(alts)