Beautiful Code [309]
Look for the most likely causes first
The scientific method guarantees you will find a cause, but it does not tell you when. Identify possible failure causes first, and then focus on those where the product of likelihood and effort is minimal.
Unfortunately, interactive debuggers as they stand do not support the scientific method. To be sure, debuggers are great tools for poking around and investigating code and its results at will. This is a good thing—but only for skilled programmers who know how to use debuggers systematically. I'd rather see programmers trained in systematic debugging methods than in fancy debugging tools. (And I still feel guilty about having written a fancy debugging tool myself.)
28.3. A Search Problem
Let's come back to our initial problem of debugging the debugger. Even after isolating and fixing the bug, I wondered: is there a way to find the failure-inducing change automatically? What one would need is a test that is automatically invoked each time the programmer changes something. As soon as the test broke, we would know what had changed most recently so we could immediately fix it. (A few years later, David Saff and Michael Ernst implemented this idea under the name continuous testing.)
In my situation, I knew the change that had caused the test to fail—it was the change from gdb 4.16 to gdb 4.17. The problem was that the change was so huge, affecting 8,721 locations. There should be a way to break down this change into smaller pieces.
What if I tried to split it into 8,721 smaller changes, each affecting just one location? This way, I could apply and test one change after the other until the test failed—and the last change applied would be the one that broke the test. In other words, I would simulate the 4.17 version's development history. (Actually, it would not be me who would simulate the history; instead, it would be a tool I built. And while I sat sipping my tea, playing a game with my kids, or catching up on my email, this nifty little tool would search for and find the failure-inducing change. Neat.)
There was a catch, though. I had no clue about the order in which the changes had to be applied. And this was crucial because the individual changes may depend on each other. For instance, change A may introduce a variable that would be used in new code included in other changes B or C. Whenever B or C is applied, A must be applied, too; otherwise, building gdb would fail. Likewise, change X may rename some function definition; other changes (Y, Z) in other locations may reflect this renaming. If X is applied, Y and Z must be applied as well, because again, otherwise, gdb would not build.
How does one determine whether one change depends upon another? This problem looked quite hard—and almost intractable without very fancy (and not yet existing) multiversion program analysis.
How about just trying out various orderings of changes? 8,721 individual changes can be ordered in 8,721 × 8,720 × 8,719 × … × 1 = 8,721! different ways. No way anyone could test all of them. Trying out all subsets is somewhat better: 8,721 changes mean 2^8,721 ≈ 10^2,625 possible subsets, which means a lot fewer tests than 8,721! orderings. I could try to console myself with the thought that by the time these computations had ended on my machine, quantum computing, time travel, and universally correct programs would long have gone mainstream, eliminating the need for such futile attempts.
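To see just how hopeless these numbers are, one can compute their magnitudes directly—a small illustration using nothing but logarithms:

```python
import math

n = 8_721  # number of individual changes

# Number of orderings is n!; count its decimal digits via
# log10(n!) = lgamma(n + 1) / ln(10).
ordering_digits = int(math.lgamma(n + 1) / math.log(10)) + 1

# Number of subsets is 2**n; its magnitude is 10**(n * log10(2)).
subset_exponent = int(n * math.log10(2))

print(ordering_digits)   # roughly 30,000 digits in 8,721!
print(subset_exponent)   # 2625, i.e., 2**8721 is about 10**2625
```

Even at a billion tests per second, either search would outlast the universe many times over.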
So I made another try. How about good old divide and conquer? We could start by applying the first half of the changes to the gdb 4.16 source and testing it. If ddd failed, we would know that the failure-inducing change was in that first half; if it did not fail, we'd keep on searching in the other half. With each test, we'd cut the search space in half, eventually homing in on the failure-inducing change. That's it, I thought: an automatic application of the scientific method, systematically creating, testing, and refining hypotheses.
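The divide-and-conquer idea can be sketched as follows. This is a simplified illustration that assumes a single failure-inducing change and ignores the dependency problem described above; `test_fails` is a hypothetical stand-in for building gdb with a given subset of changes applied and running the ddd test:

```python
def find_culprit(changes, test_fails):
    """Binary search for a single failure-inducing change.

    test_fails(subset) is assumed to build the old version with only
    `subset` of the changes applied and return True if the test fails.
    Assumes exactly one culprit, and that any subset builds at all --
    the dependency problem is ignored in this sketch.
    """
    candidates = list(changes)
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        # If applying the first half already triggers the failure,
        # the culprit is in there; otherwise it must be in the second.
        candidates = first if test_fails(first) else second
    return candidates[0]
```

Instead of up to 8,721 tests, this needs only about log2(8,721) ≈ 14 build-and-test cycles—the payoff of systematically halving the hypothesis space.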