Difference between revisions of "Component programming with ranges"

Revision as of 19:34, 2 August 2013

Preface

This article was inspired by Walter's article on component programming in D and based on a related discussion thread in the D forum. In short, Walter's article addressed the question of why, despite years of carefully programming in paradigms that purportedly makes your code more reusable, so very little code from the past is actually reused. Component-style programming, in which code is assembled from reusable components may be data sources, data sinks, or algorithms (that transforms data sources and puts them into data sinks), is proposed as a possible solution to this problem.

In this article, we will consider how component-style programming can greatly untangle a complicated algorithm into manageable pieces that are straightforward to write, easy to debug, and reusable.

The Task

We shall consider the classic task of laying out a yearly calendar on the console, such that given a particular year, the program will print out a number of lines that displays the 12 months in a nice grid layout, with numbers indicating each day within the month. Something like this:

       January              February                March        
        1  2  3  4  5                  1  2                  1  2
  6  7  8  9 10 11 12   3  4  5  6  7  8  9   3  4  5  6  7  8  9
 13 14 15 16 17 18 19  10 11 12 13 14 15 16  10 11 12 13 14 15 16
 20 21 22 23 24 25 26  17 18 19 20 21 22 23  17 18 19 20 21 22 23
 27 28 29 30 31        24 25 26 27 28        24 25 26 27 28 29 30
                                             31                  

        April                  May                  June         
     1  2  3  4  5  6            1  2  3  4                     1
  7  8  9 10 11 12 13   5  6  7  8  9 10 11   2  3  4  5  6  7  8
 14 15 16 17 18 19 20  12 13 14 15 16 17 18   9 10 11 12 13 14 15
 21 22 23 24 25 26 27  19 20 21 22 23 24 25  16 17 18 19 20 21 22
 28 29 30              26 27 28 29 30 31     23 24 25 26 27 28 29
                                             30                  

        July                 August               September      
     1  2  3  4  5  6               1  2  3   1  2  3  4  5  6  7
  7  8  9 10 11 12 13   4  5  6  7  8  9 10   8  9 10 11 12 13 14
 14 15 16 17 18 19 20  11 12 13 14 15 16 17  15 16 17 18 19 20 21
 21 22 23 24 25 26 27  18 19 20 21 22 23 24  22 23 24 25 26 27 28
 28 29 30 31           25 26 27 28 29 30 31  29 30               

       October              November              December       
        1  2  3  4  5                  1  2   1  2  3  4  5  6  7
  6  7  8  9 10 11 12   3  4  5  6  7  8  9   8  9 10 11 12 13 14
 13 14 15 16 17 18 19  10 11 12 13 14 15 16  15 16 17 18 19 20 21
 20 21 22 23 24 25 26  17 18 19 20 21 22 23  22 23 24 25 26 27 28
 27 28 29 30 31        24 25 26 27 28 29 30  29 30 31

While intuitively straightforward, this task has many points of complexity:

While generating all dates in a year is trivial, thanks to D's std.datetime module, the order in which they must be processed is far from obvious, due to the following complicating factors:
Since we're writing to the console, we're limited to outputting one line at a time; we can't draw one cell of the grid and then go back up a few lines, move a few columns over, and draw the next cell in the grid. We have to somehow print the first lines of all cells in the top row, followed by the second lines, then the third lines, etc., and repeat this process for each row in the grid.
As a result, the order in which the days in a month are processed is not the natural, chronological order. We have to assemble the dates for the first weeks in each of the first 3 months, say, (if we are putting 3 months per row in the grid), print those out, then assemble the dates for the second weeks in each month, print those out, etc..
Furthermore, within the rows representing each week, some days may be missing, depending on where the boundaries of adjacent months fall; these missing days must then be filled out in the following month's first week before the first full week in the month is printed. It is not that simple to figure out where a week starts and ends, and how many rows are needed per month.
If some months have more full weeks than others, they may occupy less vertical space than other months on the same row in the grid; so we need to insert blank spaces into these shorter months in order for the grid cells to line up vertically in the output.

Considering all of the above points, it would appear at first glance that we are doomed to writing algorithms specific only to this task, because each piece of the puzzle depends on the others in complex ways. It would appear hopeless to actually get any reusable components out of our calendar program! And indeed, this would be the case if we approached the problem from the traditional imperative approach.

Furthermore, such a complex algorithm would be difficult to write, and would be more prone to bugs because of its complexity.

Sources of Complexity

One of the more influential courses I took in college was on Jackson Structured Programming. It identified two sources of programming complexity (i.e., where bugs are most likely to occur):

Mismatches between the structure of the program and the structure of the data (e.g., you're reading an input file that has a preamble, body, and epilogue, but your code has a single loop over lines in the file), or between two or more data structures that you are processing at the same time (e.g., laying out a yearly calendar where the boundaries of weeks don't correspond with the boundaries of the months, and the grid structure doesn't correspond with the line-by-line sequence of the final output);
Writing loop invariants (or equivalently, loop conditions).

Most non-trivial loops in imperative code have both, which makes them doubly prone to bugs. Take the example of reading a file with three sections in a single loop over the lines of the file. The mismatch between the code structure (a single loop) and the file structure (three sequential sections) often prompts people to add boolean flags, state variables, and the like, in order to resolve the conflict between the two structures. For example, to keep track of which section we're in when processing each line, we may use a state variable, like this:

auto file = File("inputfile");
enum State { Preamble, Body, Epilogue }
auto state = State.Preamble;
foreach (line; file.byLine()) {
    final switch (state) {
      case State.Preamble:
        ... // process preamble
        break;
      case State.Body:
        ... // process body
        break;
      case State.Epilogue:
        ... // process epilogue
        break;
    }
}

Already, that loop body is looking pretty complicated. But it's still missing one key ingredient: state transitions. Once we finish reading the preamble, we need to transition to State.Body so that the next line can be processed correctly, and ditto for the transition to State.Epilogue. So our code becomes:

auto file = File("inputfile");
enum State { Preamble, Body, Epilogue }
auto state = State.Preamble;
foreach (line; file.byLine()) {
    final switch (state) {
      case State.Preamble:
        ... // process preamble
        if (endOfPreamble)
            state = State.Body;
        break;
      case State.Body:
        ... // process body
        if (endOfBody)
            state = State.Epilogue;
        break;
      case State.Epilogue:
        ... // process epilogue
        break;
    }
}

Note that almost all of this code is just scaffolding; we haven't even written the code that does the real processing of the input data!

Furthermore, suppose the way we process the body changes depending on something in the preamble. For example, the preamble may define an encoding, so then we have to save this encoding value somewhere outside the loop, so that when we transition to State.Body, we will know how to correctly decode the input line.

Such ad hoc structure resolutions are a breeding ground for bugs, because we're adding all sorts of state variables, flags, and other such things to the outer scope of the loop. Each piece of the code depends on some number of variables that are set by other code elsewhere, causing the code to be very fragile. This approach also often leads to complicated loop conditions, which invite even more bugs.

Untangling Complexity

In contrast, if you structure your code according to the structure of the input (i.e., one loop for processing the preamble, one loop for processing the body, one loop for processing the epilogue), it becomes considerably less complex, easier to read (and write!), and far less bug prone. Your loop conditions become simpler, and thus easier to reason about and leave less room for bugs to hide.

But to be able to process the input in this way requires that we encapsulate our input so that it can be processed by 3 different loops. Once we go down that road, we start to arrive at the concept of input ranges... then we abstract away the three loops into three components, and thus we arrive at Walter's component-style programming, where each component has a well-defined input and output, and the transformation process from input to output has a straightforward correspondence.

Structuring the Calendar Program

Consider our calendar program. Using the traditional approach, we may structure our code one of two ways:

Have a single main loop that prints out each line of the calendar at a time. The problem is, the loop body will be extremely complex, because sometimes we need to output month names, sometimes weeks. Then within each week, we have to know what day to start the week, what day to end it, and we have to keep track of which days of the month we're currently on, in order to continue from the previously-output line correctly. On top of that, we're processing multiple months at a time, so we have to generate these dates out-of-order, yet in the end the dates generated for each month must add up to a chronological order.
Create a grid buffer of characters, then loop over dates of the year and place them in the buffer in the right place. This also introduces a lot of complexity: where should we place the month names, and, given a particular month and date, how do we know where on the month's cell we should place the day? Since the starting point of each subsequent month depends on the ending point of the previous month, we will need all kinds of state variables and counters to keep track of everything. And in the end, we still have to write another loop over the buffer to print out each line in display order.

Neither approach is very appealing, because they are catering to one of the conflicting structures at the expense of the others, resulting in a level of complexity that's difficult to deal with. Such code will be very prone to bugs due to its sheer complexity. It will also have no reusable pieces.

Using ranges, however, our task becomes considerably more tractable. First, let's identify all of the different structures that we will need to deal with:

Generating dates in a year
Grouping dates by month
Grouping dates by week
Formatting the days in a week
Grouping formatted weeks into months
Laying out some number of months horizontally in a grid to form rows
Outputting each line of each grid row
Outputting all the rows

Each of the above tasks can be separated out into their own ranges: we can create an input range that generates all the dates in a year, then write an algorithm that, given a sequence of dates, breaks them up into chunks by month, then given a chunk of dates within a month, we can write an algorithm for grouping them by week, and so forth. This separation of tasks greatly simplifies the code within each component, and thus reduces the complexity of the code required and decreases the likelihood of bugs. For example, it's far easier to ensure you never put more than 7 days into a week if you have a single place where dates are grouped into weeks.

It is also far easier to write unittests for checking code correctness when the code is in small, manageable pieces: it's rather hard to write unittests for a giant outer loop that contains several levels of nested inner loops. We'd have no confidence that every possible code path was tested, because there are too many of them!

Then once we have all these components, we just need a little bit of glue code to piece them together to do what we want.

Let's walk through these components one by one, and show how we can write our calendar program in nice, reusable pieces.

Generating Dates in a Year

Our first task is to generate all the dates in a year. Thanks to D's std.datetime module, this is rather easy: create a Date object, then repeatedly add durations of 1 day to it. For our purposes, though, we can't just do this in a loop, because it has to interface with the other components, which do not have a matching structure to a loop over dates. So we capture this task of generating dates by encapsulating it within an input range. Of course, given that our other components may have to traverse these dates out-of-order, it is wise to make it a forward range instead, so we will also include a save() method in our range. Here is the code:

/**
 * Returns: A range of dates in a given year.
 */
auto datesInYear(int year) {
    static struct DateRange {
        private int year;   /// so that we know when to stop

        this(int _year) {
            year = _year;
            front = Date(year, 1, 1);
        }

        /// Generate dates only up to the end of the starting year
        @property bool empty() { return front.year() > year; }

        /// Current date
        Date front;

        /// Generate the next date in the year.
        void popFront() { front += dur!"days"(1); }

        /// Provide forward range interface
        @property DateRange save() {
            DateRange r;
            r.year = year;
            r.front = front;
            return r;
        }
    }
    static assert(isForwardRange!DateRange);

    return DateRange(year);
}

This is relatively straightforward; it is a typical example of a forward range implementation. For ease of usage, we have encapsulated it within a function that creates the range and returns it.

Grouping Dates by Month

Our next task on the list is to group dates by month. For maximum utility, we'd like to take a range of Dates, and break it up into subranges where all the Dates in a given subrange belong to the same month.

chunkBy

We could simply proceed and write this code directly; but in the spirit of code reuse, we note that this particular task is not really specific to dates. At its core, what we're doing is that given a range of items with properties x, y, z, we want to break the range up into subranges where all items in each subrange share the same value for a particular chosen property, say x. Or, to phrase it in terms of the Dates that we're dealing with, given a range of Dates, each of which consists of year, month, and day, we'd like to be able to select a particular property, such as month, and break the range up into subranges by that property (i.e., each subrange contains the dates for a single month). Since we don't know if, in the future, we might want to group Dates by year instead, the selection of which property to use should be parametrized.

But there is no reason why we should limit ourselves to using only properties of the item as the grouping criterion; that would not cover the case of, say, grouping Dates by week, since the Date object doesn't have a week field we may group by. So, boiled down to the essentials, what we're doing is to map each item to some value, be it selecting the month from a Date, or computing a function on a number, etc.. This then leads to the generic definition:

auto chunkBy(alias attrFun, Range)(Range r)
    if (isInputRange!Range &&
        is(typeof(
            unaryFun!attrFun(ElementType!Range.init) ==
            unaryFun!attrFun(ElementType!Range.init)
        ))
    )
{
    ...
}

That is, given a range r and some function that maps items in the range to some value type that can be compared with the == operator, chunkBy() returns a range of subranges of the original range, such that all the items in each subrange is mapped by the function to the same value. This is expressed by the example usage below, which is a unittest in the actual code of the calendar program:

unittest {
    auto range = [
        [1, 1],
        [1, 1],
        [1, 2],
        [2, 2],
        [2, 3],
        [2, 3],
        [3, 3]
    ];

    auto byX = chunkBy!"a[0]"(range);
    auto expected1 = [
        [[1, 1], [1, 1], [1, 2]],
        [[2, 2], [2, 3], [2, 3]],
        [[3, 3]]
    ];
    foreach (e; byX) {
        assert(!expected1.empty);
        assert(e.equal(expected1.front));
        expected1.popFront();
    }

    auto byY = chunkBy!"a[1]"(range);
    auto expected2 = [
        [[1, 1], [1, 1]],
        [[1, 2], [2, 2]],
        [[2, 3], [2, 3], [3, 3]]
    ];
    foreach (e; byY) {
        assert(!expected2.empty);
        assert(e.equal(expected2.front));
        expected2.popFront();
    }
}

This unittest exemplifies the point that in component-style programming, we work with self-contained components with well-defined, straightforward interfaces, which makes them straightforward to implement, easy to test, and amenable to reuse.

Now we present the full definition of chunkBy():

auto chunkBy(alias attrFun, Range)(Range r)
    if (isInputRange!Range &&
        is(typeof(
            unaryFun!attrFun(ElementType!Range.init) ==
            unaryFun!attrFun(ElementType!Range.init)
        ))
    )
{
    alias attr = unaryFun!attrFun;
    alias AttrType = typeof(attr(r.front));

    static struct Chunk {
        private Range r;
        private AttrType curAttr;
        @property bool empty() {
            return r.empty || !(curAttr == attr(r.front));
        }
        @property ElementType!Range front() { return r.front; }
        void popFront() {
            assert(!r.empty);
            r.popFront();
        }
    }

    static struct ChunkBy {
        private Range r;
        private AttrType lastAttr;
        this(Range _r) {
            r = _r;
            if (!empty)
                lastAttr = attr(r.front);
        }
        @property bool empty() { return r.empty; }
        @property auto front() {
            assert(!r.empty);
            return Chunk(r, lastAttr);
        }
        void popFront() {
            assert(!r.empty);
            while (!r.empty && attr(r.front) == lastAttr) {
                r.popFront();
            }
            if (!r.empty)
                lastAttr = attr(r.front);
        }
        static if (isForwardRange!Range) {
            @property ChunkBy save() {
                ChunkBy copy;
                copy.r = r.save;
                copy.lastAttr = lastAttr;
                return copy;
            }
        }
    }
    return ChunkBy(r);
}

While this code is more complex than before, it is still relatively straightforward: the outer range iterates over the first elements of each subrange, and its .front method returns said subrange as a separate iterable object. Each subrange iterates over the source range until the criterion that defines that subrange is no longer satisfied, at which point the iteration ends. There are only a bare minimum of state variables (keeping track of the value of the function defining the current subrange), and no intricate interdependencies that are difficult to understand.

byMonth

Armed with chunkBy, grouping a range of Dates by month is now trivial to implement:

/**
 * Chunks a given input range of dates by month.
 * Returns: A range of ranges, each subrange of which contains dates for the
 * same month.
 */
auto byMonth(InputRange)(InputRange dates)
    if (isDateRange!InputRange)
{
    return chunkBy!"a.month()"(dates);
}

And the accompanying unittest shows how this component can be put together with our first component, datesInYear(), to produce a range of dates in each month of a year:

unittest {
    auto months = datesInYear(2013).byMonth();
    int month = 1;
    do {
        assert(!months.empty);
        assert(months.front.front == Date(2013, month, 1));
        months.popFront();
    } while (++month <= 12);

    assert(months.empty);
}

We didn't verify that the dates within each month as yet -- we will get to that eventually -- but at this point it suffices to show how, by composing two components, we're now able to produce ranges of dates for each month of a year.

Grouping Dates by Week

Another piece of the puzzle of calendar layout is to be able to group a range of dates by week. This is also a relatively simple task, since for the purposes of our calendar we can regard each week as beginning on a Sunday and ending on a Saturday. Again, we write a function that takes a range of dates, and breaks it up into subranges by week:

/**
 * Chunks a given input range of dates by week.
 * Returns: A range of ranges, each subrange of which contains dates for the
 * same week. Note that weeks begin on Sunday and end on Saturday.
 */
auto byWeek(InputRange)(InputRange dates)
    if (isDateRange!InputRange)
{
    static struct ByWeek {
        InputRange r;
        @property bool empty() { return r.empty; }
        @property auto front() {
            return until!((Date a) => a.dayOfWeek == DayOfWeek.sat)
                         (r, OpenRight.no);
        }
        void popFront() {
            assert(!r.empty());
            r.popFront();
            while (!r.empty && r.front.dayOfWeek != DayOfWeek.sun)
                r.popFront();
        }
    }
    return ByWeek(dates);
}

This time, our code is simpler because weeks always end on a Saturday, which allows us to use the until() algorithm included in D's Phobos standard library.

This simplicity belies something very powerful that we have just achieved, though. If you notice, nowhere in the definition of byWeek() is there any reference to dates in a year, or dates in a month. It can be applied to both! If applied to the output of datesInYear(), for example, it would give us the dates of the year grouped by weeks in the year. If applied to one of the subranges returned by byMonth(), it would give us the dates of that month grouped by weeks, which is exactly what we need later on when we want to format the days in a month. So already, we can see that byWeek() is a reusable component that can be reused outside of the confines of our task at hand.

Another thing worth pointing out is that if a month starts in the middle of the week, byWeek() will return less than a full week in its first subrange, because it breaks the range up by Sundays, not necessarily by every 7 items in the range. The same thing goes for ranges that end in the middle of the week. So we have, indirectly, solved one of the key problems in calendar layout: the handling of partial weeks in a month. Yet, this solution does not involve tricky if-statements or other conditionals at all, just a straightforward iteration over a range of dates!

Difference between revisions of "Component programming with ranges"

Revision as of 19:34, 2 August 2013

Contents

Preface

The Task

Sources of Complexity

Untangling Complexity

Structuring the Calendar Program

Generating Dates in a Year

Grouping Dates by Month

chunkBy

byMonth

Grouping Dates by Week

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools