need help on need help on generator...

need help on need help on generator...

Post by Francis Gi » Sun, 23 Jan 2005 00:05:38


Hi,

I recently read David Mertz (IBM DeveloperWorks) about generators and got
e *** d about using lazy constructs in my Python programming.

But besides the fact that generators are either produced with the new "yield"
reserved word or by defining the __new__ method in a class definition, I
don't know much about them.

In particular, I don't know what Python constructs does generate a generator.
I know this is now the case for reading lines in a file or with the new
"iterator" package. But what else ? Does Craig Ringer answer mean that list
comprehensions are lazy ? Where can I find a comprehensive list of all the
lazy constructions built in Python ? (I think that to easily distinguish lazy
from strict constructs is an absolute programmer need -- otherwise you always
end up wondering when is it that code is actually executed like in Haskell).

Thank you

Francis Girard
FRANCE

Le vendredi 21 Janvier 2005 15:38, Craig Ringer a crit:
 
 
 

need help on need help on generator...

Post by Craig Ring » Sun, 23 Jan 2005 00:42:21


Speaking of totally great articles, and indirectly to lazyness (though
not lazyily evaluated constructs), I was really impressed by this
fantastic article on metaclasses:

http://www.yqcomputer.com/
http://www.yqcomputer.com/

which shows that they're really just not that hard. That saved me an
IMMENSE amount of utterly tedious coding just recently.


They can also be created with a generator expression under Python 2.4. A
generator expression works much like a list comprehension, but returns a
generator instead of a list, and is evaluated lazily. (It also doesn't
pollute the outside namespace with its working variables).

[1, 2, 3, 4, 5, 6, 7, 8, 9]
<generator object at 0x401e40ac>
[1, 2, 3, 4, 5, 6, 7, 8, 9]

Not the use of xrange above for efficiency in the generator expressions.
These examples are trivial and pointless, but hopefully get the point
across.


As far as I know, functions that use yield, and generator expressions. I
was unaware of the ability to create them using a class with a __new__
method, and need to check that out - I can imagine situations in which
it might be rather handy.

I'm not sure how many Python built-in functions and library modules
return generators for things.


Nope, but generator expressions are, and they're pretty similar.


I'm afraid I can't help you with that. I tend to take the view that side
effects in lazily executed code are a bad plan, and use lazy execution
for things where there is no reason to care when the code is executed.

 
 
 

need help on need help on generator...

Post by Francis Gi » Sun, 23 Jan 2005 01:16:28

eally, thank you Craig Ringer for your great answer.




I completly agree with this. But this is much more true in theory than in
practice. In practice you might end up with big big memory usage with lazy
constructs as the "system" has to store intermediate results (the execution
context in the case of Python). These kinds of phenomena happen all the time
in a language like Haskell -- at least for a beginner like me -- if you don't
pay attention to it ; and this makes the language a lot more difficult to
master. Thus you have to keep an eye on performance even though, in FP, you
shoould just have to "declare" your intentions and let the system manage the
execution path.



Thank you, I'll read that.

Francis Girard
FRANCE

Le vendredi 21 Janvier 2005 16:42, Craig Ringer a crit:

 
 
 

need help on need help on generator...

Post by Nick Coghl » Sun, 23 Jan 2005 13:45:10


I don't think there is a *list* as such, but there are some rules of thumb for
when lazy evaluation will take place (hopefully others will note any cases that
I missed):

1. Iterators (classes with a next() method, and an __iter__ method that returns
'self') are lazily evaluated, as itr.next() is called to retrieve each value. I
think you will find it is this method, rather than __new__ which is relevant to
creating class-based generators. Note that "for x in itr" automatically calls
itr.next() in order to obtain each new value of the loop variable.
This iterator protocol is the basis of lazy evaluation in Python, and is
described here:
http://www.yqcomputer.com/

2. Iterables (classes with an __iter__ method) will return a lazy iterator via
iter(obj). Actual iterators return themselves from __iter__, so iter(obj) is a
good way to make sure you have an iterator.

3. Generators (functions that use 'yield' instead of 'return') and generator
expressions (like list comprehensions, but without the square brackets) are
simply concise ways of creating iterators.

4. The utility functions in the itertools module generally return iterators
rather than lists (this shouldn't suprise anyone!)

5. Several builtin functions return iterators rather than lists, specifically
xrange(), enumerate() and reversed(). Other builtins that yield sequences
(range(), sorted(), zip()) return lists.

However, be aware that some things which accept any iterable don't take
advantage of the lazy evaluation, and still cause the whole thing to be created
in memory at once - "".join(itr) is currently one such operation.

The sequence vs iterator distinction is somewhat unfortunate (since it breaks
with TOOWTDI), but completing the transition to an iterator based approach isn't
going to be possible until Python 3.0, when things that currently return lists
can be converted to return iterators (i.e. it has been suggested that the
fundamental construct in Python 3.x should be an iterator just as a list is the
fundamental construct in Python 2.x)

Regards,
Nick.

--
Nick Coghlan | XXXX@XXXXX.COM | Brisbane, Australia
---------------------------------------------------------------
http://www.yqcomputer.com/
 
 
 

need help on need help on generator...

Post by aleaxi » Sun, 23 Jan 2005 18:10:36


...

Having __new__ in a class definition has nothing much to do with
generators; it has to do with how the class is instantiated when you
call it. Perhaps you mean 'next' (and __iter__)? That makes instances
of the class iterators, just like iterators are what you get when you
call a generator.


A 'def' of a function whose body uses 'yield', and in 2.4 the new genexp
construct.


Nope, besides the fact that the module you're thinking of is named
'itertools': itertools uses a lot of C-coded special types, which are
iterators but not generators. Similarly, a file object is an iterator
but not a generator.


Since you appear to conflate generators and iterators, I guess the iter
built-in function is the main one you missed. iter(x), for any x,
either raises an exception (if x's type is not iterable) or else returns
an iterator.


Nope, those were generator expressions.


That's yet a different question -- at least one needs to add the
built-in xrange, which is neither an iterator nor a generator but IS
lazy (a historical artefact, admittedly).

But fortunately Python's built-ins are not all THAT many, so that's
about it.


Encapsulation doesn't let you "easily distinguish" issues of
implementation. For example, the fact that a file is an iterator (its
items being its lines) doesn't tell you if that's internally implemented
in a lazy or eager way -- it tells you that you can code afile.next() to
get the next line, or "for line in afile:" to loop over them, but does
not tell you whether the code for the file object is reading each line
just when you ask for it, or whether it reads all lines before and just
keeps some state about the next one, or somewhere in between.

The answer for the current implementation, BTW, is "in between" -- some
buffering, but bounded consumption of memory -- but whether that tidbit
of pragmatics is part of the file specs, heh, that's anything but clear
(just as for other important tidbits of Python pragmatics, such as the
facts that list.sort is wickedly fast, 'x in alist' isn't, 'x in adict'
IS...).


Alex
 
 
 

need help on need help on generator...

Post by aleaxi » Sun, 23 Jan 2005 18:20:36


Yes for enumerate and reversed, no for xrange:

Traceback (most recent call last):
File "<stdin>", line 1, in ?
AttributeError: 'xrange' object has no attribute 'next'

it SHOULD return an iterator, no doubt, but it doesn't (can't, for
backwards compatibility reasons). Neither does it return a list: it
returns "an `xrange' object", a specialized type that's not an iterator,
though it's iterable. It's a type, btw:

<type 'xrange'>

so it's not surprising that calling it returns instances of it
(enumerate and reversed are also types, but *WITH* 'next'...).


Alex
 
 
 

need help on need help on generator...

Post by Craig Ring » Sun, 23 Jan 2005 18:46:15


A particularly great example when it comes to unexpected buffering
effects is the file iterator. Take code that reads a header from a file
using an (implicit) iterator, then tries to read() the rest of the file.
Taking the example of reading an RFC822-like message into a list of
headers and a body blob:

.>>> inpath = '/tmp/msg.eml'
.>>> infile = open(inpath)
.>>> for line in infile:
.... if not line.strip():
.... break
.... headers.append(tuple(line.split(':',1)))
.>>> body = infile.read()


(By the way, if you ever implement this yourself for real, you should
probably be hurt - use the 'email' or 'rfc822' modules instead. For one
thing, reinventing the wheel is rarely a good idea. For another, the
above code is horrid - in particular it doesn't handle malformed headers
at all, isn't big on readability/comments, etc.)

If you run the above code on a saved email message, you'd expect 'body'
to contain the body of the message, right? Nope. The iterator created
from the file when you use it in that for loop does internal read-ahead
for efficiency, and has already read in the entire file or at least a
chunk more of it than you've read out of the iterator. It doesn't
attempt to hide this from the programmer, so the file position marker is
further into the file (possibly at the end on a smaller file) than you'd
expect given the data you've actually read in your program.

I'd be interested to know if there's a better solution to this than:

.>>> inpath = '/tmp/msg.eml'
.>>> infile = open(inpath)
.>>> initer = iter(infile)
.>>> headers = []
.>>> for line in initer:
.... if not line.strip():
.... break
.... headers.append(tuple(line.split(':',1)))
.>>> data = ''.join(x for x in initer)

because that seems like a pretty ugly hack (and please ignore the
variable names). Perhaps a way to get the file to seek back to the point
last read from the iterator when the iterator is destroyed?

--
Craig Ringer
 
 
 

need help on need help on generator...

Post by Francis Gi » Sun, 23 Jan 2005 19:06:41

e samedi 22 Janvier 2005 10:10, Alex Martelli a crit:

Yes, I meant "next".



Ok. I guess I'll have to update to version 2.4 (from 2.3) to follow the
discussion.


You're absolutly right, I take the one for the other and vice-versa. If I
understand correctly, a "generator" produce something over which you can
iterate with the help of an "iterator". Can you iterate (in the strict sense
of an "iterator") over something not generated by a "generator" ?



You're right. I was much more talking (mistakenly) about lazy evaluation of
the arguments to a function (i.e. the function begins execution before its
arguments get evaluated) -- in such a case I think it should be specified
which arguments are "strict" and which are "lazy" -- but I don't think
there's such a thing in Python (... well not yet as Python get more and more
akin to FP).


Thank you

Francis Girard
FRANCE

 
 
 

need help on need help on generator...

Post by Craig Ring » Sun, 23 Jan 2005 19:17:10


And now, answering my own question (sorry):

Answer: http://www.yqcomputer.com/

so we can slightly simplify the above to:

.>>> inpath = '/tmp/msg.eml'
.>>> infile = open(inpath)
.>>> headers = []
.>>> for line in infile:
.... if not line.strip():
.... break
.... headers.append(tuple(line.split(':',1)))
.>>> data = ''.join(x for x in infile)

at the cost of making it less clear what's going on and having someone
later go "duh, why isn't he using read() here instead" but can't seem to
do much more than that.

Might it be worth providing a way to have file objects seek back to the
current position of the iterator when read() etc are called? If not, I
favour the suggestion in the referenced post - file should probably fail
noisily, or at least emit a warning.

What are others thoughts on this?

--
Craig Ringer
 
 
 

need help on need help on generator...

Post by aleaxi » Sun, 23 Jan 2005 20:00:36

rancis Girard < XXXX@XXXXX.COM > wrote:
...

It's worth upgrading even just for the extra speed;-).



A generator function (commonly known as a generator), each time you call
it, produces a generator object AKA a generator-iterator. To wit:

...
<function f at 0x75fe70>
<generator object at 0x75aa58>
<type 'generator'>

A generator expression (genexp) also has a result which is a generator
object:

<type 'generator'>


Iterators need not be generator-iterators, by any means. Generally, the
way to make sure you have an iterator is to call iter(...) on something;
if the something was already an iterator, NP, then iter's idempotent:

True


That's what "an iterator" means: some object x such that x.next is
callable without arguments and iter(x) is x.

Since iter(x) tries calling type(x).__iter__(x) [[slight simplification
here by ignoring custom metaclasses, see recent discussion on python-dev
as to why this is only 99% accurate, not 100% accurate]], one way to
code an iterator is as a class. For example:

class Repeater(object):
def __iter__(self): return self
def next(self): return 23

Any instance of Repeater is an iterator which, as it happens, has just
the same behavior as itertools.repeat(23), which is also the same
behavior you get from iterators obtained by calling:

def generepeat():
while True: yield 23

In other words, after:

a = Repeater()
b = itertools.repeat(23)
c = generepeat()

the behavior of a, b and c is indistinguishable, though you can easily
tell them apart by introspection -- type(a) != type(b) != type(c).


Python's penchant for duck typing -- behavior matters more, WAY more
than implementation details such as type() -- means we tend to consider
a, b and c fully equivalent. Focusing on ``generator'' is at this level
an implementation detail.


Most often, iterators (including generator-iterators) are used in a for
statement (or equivalently a for clause of a listcomp or genexp), which
is why one normally doesn't think about built-in ``iter'': it's called
automatically by these ``for'' syntax-forms. In other words,

for x in <<<whatever>>>:
...body...

is just like:

__tempiter = iter(<<<whatever>>>)
while True:
try: x = __tempiter.next()
except StopIteration: break
...body...

((Simplification alert: the ``for'' statement has an optional ``else''
which this allegedly "just like" form doesn't mimic exactly...))



Python's strict that way. To explicitly make some one argument "lazy",
sorta, you can put a "lambda:" in front of it at call time, but then you
have to "call the argument" to get it evaluated; a bit of a kludge.
There's a PEP out to allow a ``prefix :'' to mean just the same as this
"lambda:", but even though I co-authored it I don't think it lowers the
kludge quotient by all that much.

Guido, our beloved BDFL, is currently musing about optional typing of
arguments, which might perhaps open a tiny little crack towards letting
some arguments be lazy. I don't think Guido wants to go there, though.

My prediction is that even Python 3000 will be strict. At least this
makes some things obvious at each call-site without having to study the
way a function is defined, e.g., upon seeing
f(a+b, c*d)
you don't have to wonder, or study the ``def f'', to find out when the
addition and the multiplication happen
 
 
 

need help on need help on generator...

Post by aleaxi » Sun, 23 Jan 2005 20:20:36


Maybe ''.join(infile) is a better way to express this functionality?
Avoids 2.4 dependency and should be faster as well as more concise.



It's certainly worth doing a patch and see what the python-dev crowd
thinks of it, I think; it might make it into 2.5.


Under what conditions, exactly, would you want such an exception?


Alex
 
 
 

need help on need help on generator...

Post by Craig Ring » Sun, 23 Jan 2005 20:53:27


Thanks - for some reason I hadn't clicked to that. Obvious in hindsight,
but I just completely missed it.


I'll certainly look into doing so. I'm not dumb enough to say "Sure,
I'll do that" before looking into the code involved and thinking more
about what issues could pop up. Still, it's been added to my
increasingly frightening TODO list.


When read() or other methods suffering from the same issue are called
after next() without an intervening seek(). It'd mean another state flag
for the file to keep track of - but I doubt it'd make any detectable
difference in performance given that there's disk I/O involved.

I'd be happier to change the behaviour so that a warning isn't
necessary, though, and I suspect it can be done without introducing
backward compatibility issues. Well, so long as nobody is relying on the
undefined file position after using an iterator - and I'm not dumb
enough to say nobody would ever do that.

I've really got myself into hot water now though - I'm going to have to
read over the `file' source code before impulsively saying anything
REALLY stupid.

--
Craig Ringer
 
 
 

need help on need help on generator...

Post by Terry Reed » Mon, 24 Jan 2005 08:36:12


Almost...


To be exact, the producer is a generator function, a function whose body
contains 'yield'. In CPython, the difference after executing the def is
that a generator function has a particular flag set. People sometimes
shorten 'generator function' to 'generator' as you did, but calling both a
factory and its products by the same name is confusing. (For instance, try
calling an automobile factory an automobile).

...
<function genf at 0x008873B8>

The result of calling a generator function is a generator, which is one but
only one type of iterator.

<generator object at 0x008781F8>
[<stuff inherited from object>, '__iter__', ' gi_frame', 'gi_running',
'next']

The .__iter__ and .next methods make this an iterator. The two data
attributes are for internal use.


Of course. Again, a generator is one specific type of iterator, where an
iterator is anything with the appropriate .__iter__ and .next methods.

Terry J. Reedy