Python Iterators

Iterables and Iterators

iterable
object that can be iterated over.
iterator
object that implements a next() method. Often used to wrap other objects and make them iterable.
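
The difference between the two can be checked directly. The snippet below is only an illustrative sketch; it uses the built-in next(it), which is equivalent to it.next() in Python 2.6+ and also works in Python 3. The variable names are made up for the example.

```python
numbers = [1, 5, 8]          # a list is iterable...
it = iter(numbers)           # ...and iter() hands back an iterator for it

# An iterator is itself iterable: asking it for an iterator returns itself.
same_object = iter(it) is it

# The list is not its own iterator, so each iter() call starts from scratch.
fresh_start = next(iter(numbers))

first = next(it)             # consumes 1
rest = list(it)              # the iterator remembers its position: [5, 8]
```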

The for statement lets you iterate over various types of collections:

# over a list, you get a value 
for value in mylist:
    print value

# over a dict, you get a key
for key in mydict:
    print key

# over a string, you get a character
for char in mystring:
    print char

# over a file, you get a line
for line in open('myfile.txt'):
    print line

# etc

These collections are iterable. We'll see how iterators are implicitly used to access the collections' items.

The for loop

Many objects can be made into iterables, thanks to a simple protocol:

# we have a collection we'd like to iterate over
items = [1, 5, 8]

# we get an iterator for the collection
it = iter(items)

# we get each element, one at a time
e = it.next() 
# we do something with e
print e
# outputs: 1

e = it.next() 
print e
# outputs: 5

e = it.next() 
print e
# outputs: 8

e = it.next() 
# raises StopIteration Exception
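
If handling the exception by hand is inconvenient, the built-in next() (Python 2.6 and later) accepts a default value that is returned when the iterator is exhausted. This is a small sketch of the same walk over items; the sentinel object is a hypothetical marker chosen for the example.

```python
items = [1, 5, 8]
it = iter(items)

seen = []
sentinel = object()              # hypothetical marker meaning "nothing left"
while True:
    e = next(it, sentinel)       # returns sentinel instead of raising StopIteration
    if e is sentinel:
        break
    seen.append(e)

# Once exhausted, the iterator stays exhausted.
still_empty = next(it, sentinel) is sentinel
```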

how the for loop works

The way we write it:

for element in items:
    # do something with element

The way it works under the hood (pretty much):

_it = iter(items) # get iterator
while True:
    try:
        element = _it.next() # get next element
    except StopIteration:
        break
    # do something with element
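
The same expansion can be written as a runnable function. This is a sketch, not how the interpreter is implemented: manual_for and body are hypothetical names, and next(_it) is used in place of _it.next() so that the code runs on both Python 2.6+ and Python 3.

```python
def manual_for(iterable, body):
    """Call body(element) for each element, the way a for loop would."""
    _it = iter(iterable)             # get an iterator
    while True:
        try:
            element = next(_it)      # same as _it.next() in Python 2
        except StopIteration:        # the iterator is exhausted
            break
        body(element)                # the loop body


collected = []
manual_for([1, 5, 8], collected.append)

# An empty iterable never runs the body.
untouched = []
manual_for([], untouched.append)
```

Note that body(element) sits outside the try block, so only the iterator itself can end the loop this way.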

the iter() function

The short explanation of iter() is that it accepts objects that implement either the __iter__() or the __getitem__() method, and it returns an iterator.

# lists implement __iter__() and can return an iterator themselves
a = [5, 6, 9, 1]
it = a.__iter__()
it.next()
# outputs: 5
it.next()
# outputs: 6

# calling iter() on a list simply calls its __iter__() and returns a new iterator
it = iter(a)
it.next()
# outputs: 5

# in Python 2, strings only implement __getitem__()
a = "abcde"
a.__getitem__(3)
# outputs: 'd'
a.__getitem__(1)
# outputs: 'b'

# iter() has to wrap them in a generic sequence iterator
it = iter(a) # returns a "wrapper" iterator object
# the iterator's next() method calls a.__getitem__(index) under the hood
it.next() 
# outputs: 'a'
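
To make the wrapping less mysterious, here is a rough pure-Python approximation of such a wrapper. SequenceIterator is a hypothetical name and the real wrapper is built into the interpreter, so treat this as a sketch of the behaviour only. It defines __next__() with a next alias so that it runs on both Python 2 and Python 3.

```python
class SequenceIterator(object):
    """Hypothetical stand-in for the wrapper that iter() builds around
    objects that only define __getitem__()."""

    def __init__(self, seq):
        self.seq = seq
        self.index = 0

    def __iter__(self):
        return self

    def __next__(self):
        try:
            value = self.seq[self.index]   # calls seq.__getitem__(index)
        except IndexError:
            raise StopIteration            # end of the sequence
        self.index += 1
        return value

    next = __next__                        # Python 2 spelling


it = SequenceIterator("abc")
letters = [next(it), next(it), next(it)]
```

The wrapper simply asks for index 0, 1, 2 and so on, and turns the first IndexError into a StopIteration.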

creating iterable objects

There are essentially two approaches:

  • implementing the __getitem__() method on the class
class DoubleChar(object):
    def __init__(self, seq):
        self.seq = seq

    def __getitem__(self, index): 
        print "__getitem__ was called"
        return self.seq[index] * 2

for i in DoubleChar('abc'):
    print i
# outputs:
# __getitem__ was called
# aa
# __getitem__ was called
# bb
# __getitem__ was called
# cc
# __getitem__ was called  

__getitem__ was called a 4th time, with index 3. The resulting self.seq[3] lookup raised an IndexError, which was silently caught and ended the for loop.

  • implementing the __iter__() method and ensuring that the object it returns implements the next() method (i.e. is an iterator).
class DoubleChar(object):
    def __init__(self, seq):
        self.counter = 0
        self.seq = seq
        self.length = len(seq)

    def __iter__(self):
        return self

    def next(self):
        # iter() needs this method to be defined 
        # on whatever object is returned by __iter__()
        # (which in this case is this very object),
        # otherwise a TypeError will be raised.

        print "next has been called"

        if self.counter >= self.length:
            raise StopIteration
        value = self.seq[self.counter]
        self.counter += 1
        return value * 2

for i in DoubleChar('abc'):
    print i
# outputs:
# next has been called
# aa
# next has been called
# bb
# next has been called
# cc
# next has been called
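
One consequence of returning self from __iter__() is that the object can only be looped over once, because the counter is never reset. A possible variant, sketched here with a hypothetical helper class DoubleCharIterator, keeps the position in a separate iterator object so that every loop starts from the beginning. The next alias keeps it usable from Python 2 as well as Python 3.

```python
class DoubleCharIterator(object):
    """Hypothetical helper: holds the position for one pass over seq."""

    def __init__(self, seq):
        self.seq = seq
        self.counter = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.counter >= len(self.seq):
            raise StopIteration
        value = self.seq[self.counter]
        self.counter += 1
        return value * 2

    next = __next__                  # Python 2 spelling


class DoubleChar(object):
    def __init__(self, seq):
        self.seq = seq

    def __iter__(self):
        # a new iterator per call, so every loop starts at index 0
        return DoubleCharIterator(self.seq)


doubled = DoubleChar('abc')
first_pass = list(doubled)
second_pass = list(doubled)          # works again: a fresh iterator is created
```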

Stopping a for loop

In the previous examples, note how many times the messages "__getitem__ was called" and "next has been called" were printed compared with how many times a value was printed: each message appears once more than the values do. That extra call is the one that ends the loop. A for loop can therefore be stopped by raising an IndexError inside __getitem__() or a StopIteration inside the iterator's next() method (see the earlier explanation of how a for loop works).

class Max(object):
    def __init__(self, someiterable, limit):
        self.iterable = someiterable
        self.limit = limit

    def __getitem__(self, index):
        if index >= self.limit:
            # an IndexError signals that the sequence has ended
            raise IndexError("you asked for %s" % index)
        return self.iterable[index]

for i in Max("woejfoeifwsjodf", 3):
    print i
# outputs:
# w
# o
# e

Note how the IndexError and the message within have been suppressed in the for loop.
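
Only IndexError gets this special treatment when iterating through __getitem__(). The sketch below uses a hypothetical class, Picky, to show that any other exception escapes the loop as usual.

```python
class Picky(object):
    """Hypothetical sequence whose __getitem__() raises a chosen
    exception type once index 2 is reached."""

    def __init__(self, error):
        self.error = error

    def __getitem__(self, index):
        if index >= 2:
            raise self.error("stop at %s" % index)
        return index


# IndexError ends the loop silently.
quiet = [i for i in Picky(IndexError)]

# Any other exception propagates out of the loop.
try:
    for i in Picky(ValueError):
        pass
    escaped = False
except ValueError:
    escaped = True
```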

which to use: __getitem__ or __iter__

As a rule of thumb, defining __getitem__() should signal that random access is possible. Originally __iter__() did not exist, so to make an object work in a for loop people would write code that looked like this:

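class Stream(object):
    def __init__(self, url):
        # connect() stands for whatever opens the remote stream
        self.buffer = connect(url)

    def __getitem__(self, index):
        # index is never used: each call just returns the next chunk
        chunk = self.buffer.download()
        if chunk == 'EOF':
            raise IndexError
        return chunk

for i in Stream(someurl):
    print i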

Note how the index parameter is completely ignored inside the __getitem__() method. This is misleading: the presence of __getitem__() gives the false impression that a Stream object supports random access, when in fact it does not. For that reason, __iter__() and the iterator's next() method were added in Python 2.2. They allow for more semantically accurate usage.
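
For comparison, the same Stream could be written with the iterator protocol roughly as follows. This is a sketch under assumptions: FakeBuffer is a hypothetical stand-in for the object returned by connect(url), and the constructor takes the buffer directly so that the example runs without a network. The next alias keeps it compatible with Python 2.

```python
class FakeBuffer(object):
    """Hypothetical stand-in for the object returned by connect(url)."""

    def __init__(self, chunks):
        self.chunks = list(chunks)

    def download(self):
        if not self.chunks:
            return 'EOF'
        return self.chunks.pop(0)


class Stream(object):
    def __init__(self, buffer):
        self.buffer = buffer

    def __iter__(self):
        return self

    def __next__(self):
        chunk = self.buffer.download()
        if chunk == 'EOF':
            raise StopIteration
        return chunk

    next = __next__                  # Python 2 spelling


chunks = list(Stream(FakeBuffer(['a', 'b', 'c'])))

# No __getitem__(), so indexing is rejected instead of silently misbehaving.
try:
    Stream(FakeBuffer([]))[0]
    indexable = True
except TypeError:
    indexable = False
```

The class no longer pretends to support random access: it can be looped over, but stream[0] fails with a TypeError.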
