Component programming with ranges

From D Wiki
Revision as of 18:31, 2 August 2013 by Quickfur (talk | contribs) (expand)
Jump to: navigation, search

Preface

This article was inspired by Walter's article on component programming in D and based on a related discussion thread in the D forum. In short, Walter's article addressed the question of why, despite years of carefully programming in paradigms that purportedly makes your code more reusable, so very little code from the past is actually reused. Component-style programming, in which code is assembled from reusable components may be data sources, data sinks, or algorithms (that transforms data sources and puts them into data sinks), is proposed as a possible solution to this problem.

In this article, we will consider how component-style programming can greatly untangle a complicated algorithm into manageable pieces that are straightforward to write, easy to debug, and reusable.

The Task

We shall consider the classic task of laying out a yearly calendar on the console, such that given a particular year, the program will print out a number of lines that displays the 12 months in a nice grid layout, with numbers indicating each day within the month. Something like this:

       January              February                March        
        1  2  3  4  5                  1  2                  1  2
  6  7  8  9 10 11 12   3  4  5  6  7  8  9   3  4  5  6  7  8  9
 13 14 15 16 17 18 19  10 11 12 13 14 15 16  10 11 12 13 14 15 16
 20 21 22 23 24 25 26  17 18 19 20 21 22 23  17 18 19 20 21 22 23
 27 28 29 30 31        24 25 26 27 28        24 25 26 27 28 29 30
                                             31                  

        April                  May                  June         
     1  2  3  4  5  6            1  2  3  4                     1
  7  8  9 10 11 12 13   5  6  7  8  9 10 11   2  3  4  5  6  7  8
 14 15 16 17 18 19 20  12 13 14 15 16 17 18   9 10 11 12 13 14 15
 21 22 23 24 25 26 27  19 20 21 22 23 24 25  16 17 18 19 20 21 22
 28 29 30              26 27 28 29 30 31     23 24 25 26 27 28 29
                                             30                  

        July                 August               September      
     1  2  3  4  5  6               1  2  3   1  2  3  4  5  6  7
  7  8  9 10 11 12 13   4  5  6  7  8  9 10   8  9 10 11 12 13 14
 14 15 16 17 18 19 20  11 12 13 14 15 16 17  15 16 17 18 19 20 21
 21 22 23 24 25 26 27  18 19 20 21 22 23 24  22 23 24 25 26 27 28
 28 29 30 31           25 26 27 28 29 30 31  29 30               

       October              November              December       
        1  2  3  4  5                  1  2   1  2  3  4  5  6  7
  6  7  8  9 10 11 12   3  4  5  6  7  8  9   8  9 10 11 12 13 14
 13 14 15 16 17 18 19  10 11 12 13 14 15 16  15 16 17 18 19 20 21
 20 21 22 23 24 25 26  17 18 19 20 21 22 23  22 23 24 25 26 27 28
 27 28 29 30 31        24 25 26 27 28 29 30  29 30 31            

While intuitively straightforward, this task has many points of complexity:

  • While generating all dates in a year is trivial, thanks to D's std.datetime module, the order in which they must be processed is far from obvious, due to the following complicating factors:
  • Since we're writing to the console, we're limited to outputting one line at a time; we can't draw one cell of the grid and then go back up a few lines, move a few columns over, and draw the next cell in the grid. We have to somehow print the first lines of all cells in the top row, followed by the second lines, then the third lines, etc., and repeat this process for each row in the grid.
  • As a result, the order in which the days in a month are processed is not the natural, chronological order. We have to assemble the dates for the first weeks in each of the first 3 months, say, (if we are putting 3 months per row in the grid), print those out, then assemble the dates for the second weeks in each month, print those out, etc..
  • Furthermore, within the rows representing each week, some days may be missing, depending on where the boundaries of adjacent months fall; these missing days must then be filled out in the following month's first week before the first full week in the month is printed. It is not that simple to figure out where a week starts and ends, and how many rows are needed per month.
  • If some months have more full weeks than others, they may occupy less vertical space than other months on the same row in the grid; so we need to insert blank spaces into these shorter months in order for the grid cells to line up vertically in the output.

Considering all of the above points, it would appear at first glance that we are doomed to writing algorithms specific only to this task, because each piece of the puzzle depends on the others in complex ways. It would appear hopeless to actually get any reusable components out of our calendar program! And indeed, this would be the case if we approached the problem from the traditional imperative approach.

Furthermore, such a complex algorithm would be difficult to write, and would be more prone to bugs because of its complexity.

Sources of Complexity

One of the more influential courses I took in college was on Jackson Structured Programming. It identified two sources of programming complexity (i.e., where bugs are most likely to occur):

  1. Mismatches between the structure of the program and the structure of the data (e.g., you're reading an input file that has a preamble, body, and epilogue, but your code has a single loop over lines in the file), or between two or more data structures that you are processing at the same time (e.g., laying out a yearly calendar where the boundaries of weeks don't correspond with the boundaries of the months, and the grid structure doesn't correspond with the line-by-line sequence of the final output);
  2. Writing loop invariants (or equivalently, loop conditions).

Most non-trivial loops in imperative code have both, which makes them doubly prone to bugs. Take the example of reading a file with three sections in a single loop over the lines of the file. The mismatch between the code structure (a single loop) and the file structure (three sequential sections) often prompts people to add boolean flags, state variables, and the like, in order to resolve the conflict between the two structures. For example, to keep track of which section we're in when processing each line, we may use a state variable, like this:

auto file = File("inputfile");
enum State { Preamble, Body, Epilogue }
auto state = State.Preamble;
foreach (line; file.byLine()) {
    final switch (state) {
      case State.Preamble:
        ... // process preamble
        break;
      case State.Body:
        ... // process body
        break;
      case State.Epilogue:
        ... // process epilogue
        break;
    }
}

Already, that loop body is looking pretty complicated. But it's still missing one key ingredient: state transitions. Once we finish reading the preamble, we need to transition to State.Body so that the next line can be processed correctly, and ditto for the transition to State.Epilogue. So our code becomes:

auto file = File("inputfile");
enum State { Preamble, Body, Epilogue }
auto state = State.Preamble;
foreach (line; file.byLine()) {
    final switch (state) {
      case State.Preamble:
        ... // process preamble
        if (endOfPreamble)
            state = State.Body;
        break;
      case State.Body:
        ... // process body
        if (endOfBody)
            state = State.Epilogue;
        break;
      case State.Epilogue:
        ... // process epilogue
        break;
    }
}

Note that almost all of this code is just scaffolding; we haven't even written the code that does the real processing of the input data!

Furthermore, suppose the way we process the body changes depending on something in the preamble. For example, the preamble may define an encoding, so then we have to save this encoding value somewhere outside the loop, so that when we transition to State.Body, we will know how to correctly decode the input line.

Such ad hoc structure resolutions are a breeding ground for bugs, because we're adding all sorts of state variables, flags, and other such things to the outer scope of the loop. Each piece of the code depends on some number of variables that are set by other code elsewhere, causing the code to be very fragile. This approach also often leads to complicated loop conditions, which invite even more bugs.

In contrast, if you structure your code according to the structure of the input (i.e., one loop for processing the preamble, one loop for processing the body, one loop for processing the epilogue), it becomes considerably less complex, easier to read (and write!), and far less bug prone. Your loop conditions become simpler, and thus easier to reason about and leave less room for bugs to hide.

But to be able to process the input in this way requires that we encapsulate our input so that it can be processed by 3 different loops. Once we go down that road, we start to arrive at the concept of input ranges... then we abstract away the three loops into three components, and thus we arrive at Walter's component-style programming, where each component has a well-defined input and output, and the transformation process from input to output has a straightforward correspondence.

Structuring the Calendar Program

Consider our calendar program. Using the traditional approach, we may structure our code one of two ways:

  1. Have a single main loop that prints out each line of the calendar at a time. The problem is, the loop body will be extremely complex, because sometimes we need to output month names, sometimes weeks. Then within each week, we have to know what day to start the week, what day to end it, and we have to keep track of which days of the month we're currently on, in order to continue from the previously-output line correctly. On top of that, we're processing multiple months at a time, so we have to generate these dates out-of-order, yet in the end the dates generated for each month must add up to a chronological order.
  2. Create a grid buffer of characters, then loop over dates of the year and place them in the buffer in the right place. This also introduces a lot of complexity: where should we place the month names, and, given a particular month and date, how do we know where on the month's cell we should place the day? Since the starting point of each subsequent month depends on the ending point of the previous month, we will need all kinds of state variables and counters to keep track of everything. And in the end, we still have to write another loop over the buffer to print out each line in display order.

Neither approach is very appealing, because they are catering to one of the conflicting structures at the expense of the others, resulting in a level of complexity that's difficult to deal with. Such code will be very prone to bugs due to its sheer complexity.

Using ranges, however, our task becomes considerably more tractable. First, let's identify all of the different structures that we will need to deal with:

  • Generating dates in a year
  • Grouping dates by month
  • Grouping dates by week
  • Formatting the days in a week
  • Grouping formatted weeks into months
  • Laying out some number of months horizontally in a grid to form rows
  • Outputting each line of each grid row
  • Outputting all the rows

Each of the above tasks can be separated out into their own ranges: we can create an input range that generates all the dates in a year, then write an algorithm that, given a sequence of dates, breaks them up into chunks by month, then given a chunk of dates within a month, we can write an algorithm for grouping them by week, and so forth. This separation of tasks greatly simplifies the code within each component, and thus reduces the complexity of the code required and decreases the likelihood of bugs. For example, it's far easier to ensure you never put more than 7 days into a week if you have a single place where dates are grouped into weeks.

Then once we have all these components, we just need a little bit of glue code to piece them together to do what we want.