Beautiful Code [310]
But again—what do we do if applying a set of changes ends in inconsistent code? I had no idea.
28.4. Finding the Failure Cause Automatically
It took me three months to come up with a solution—which came to me, incidentally, lying in my bed at six in the morning. The sun was rising, the birds were singing, and I finally had an idea. The reasoning was as follows:
Applying half of the changes has a small chance of getting a consistent build; the risk of skipping a dependent change is simply too high. On the other hand, if we get a consistent build (or a "resolved" outcome), we can very quickly narrow down the set of changes.
Applying individual changes, in contrast, has a far greater chance of getting something meaningful—in particular, if the version being changed was already consistent. As an example, think of changing a single function: unless its interface changes, we will most likely get a running program. On the other hand, trying out one change after another would still take ages.
Therefore, I'd call for a compromise: I would start with two halves. If neither half of the changes would result in a testable build, I would split the set of changes into four subsets instead, and then apply each subset individually to the gdb 4.16 source. In addition, I would also unapply the subset from the gdb 4.17 source (which would be realized by applying the complement of the subset to the gdb 4.16 source).
Splitting in four (rather than two) would mean that smaller change sets would be applied, which means that the changed versions would be closer to the (working) original versions—and which would imply higher chances of getting a consistent build.
And if four subsets would not suffice, then I would go for 8, 16, 32, and so on, until, eventually, I would apply each single change individually, one after the other—which should give me the greatest chance of getting a consistent build. As soon as I had a testable build, the algorithm would restart from scratch.
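The strategy sketched above can be written down in a few lines of Python. This is my reconstruction under stated assumptions, not the author's actual script: `test` is assumed to take a subset of changes and return "pass", "fail", or "unresolved", and testing a complement stands in for unapplying the subset from the gdb 4.17 source.

```python
def isolate(changes, test):
    """Narrow down a failure-inducing set of changes.  `test` maps a
    subset of changes to "pass", "fail", or "unresolved"
    (hypothetical interface, reconstructed from the description)."""
    n = 2                                   # start with two halves
    while len(changes) > 1:
        # Partition the changes into n roughly equal subsets.
        subsets = [changes[i::n] for i in range(n)]
        progress = False
        for subset in subsets:
            if test(subset) == "fail":
                # The subset alone reproduces the failure:
                # restart from scratch on this smaller set.
                changes, n, progress = subset, 2, True
                break
            complement = [c for c in changes if c not in subset]
            if test(complement) == "fail":
                # Unapplying the subset still fails, so the subset
                # is innocent: restart on the complement.
                changes, n, progress = complement, 2, True
                break
        if not progress:
            if n >= len(changes):
                break                       # already testing single changes
            n = min(len(changes), 2 * n)    # split into twice as many subsets
    return changes
```

If every build is consistent, the first subset or complement that fails halves the search space, and the loop behaves like a binary search; only when builds keep coming back "unresolved" does it fall back to ever finer splits.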
I calculated that in the worst case, the algorithm would still require 8,721² = 76,055,841 tests. This number was still way too high, but much lower than the exponential approaches thought of earlier. At the other extreme, if all builds succeeded, the algorithm would work as a binary search, and require just ⌈log₂ 8,721⌉ = 14 tests. Given the odds, was it worth doing?
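The two extremes are easy to check with n = 8,721 changes: roughly n² tests if nearly every build is inconsistent, versus ⌈log₂ n⌉ tests if every build succeeds and the search proceeds as a plain binary search:

```python
import math

n = 8_721                            # changes between gdb 4.16 and 4.17
worst_case = n * n                   # nearly every subset must be tried
best_case = math.ceil(math.log2(n))  # all builds succeed: binary search

print(worst_case)  # 76055841
print(best_case)   # 14
```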
I implemented a simple Python script with a very crude version of the preceding algorithm. The key part was the testing function. It would take a set of changes, run patch to apply the changes to the gdb 4.16 source, and then invoke make to build the changed gdb. Finally, it would run gdb and see whether the failure occurred (returning "fail" if it did) or not (returning "pass"). As any of these steps could fail, the testing function could also return "unresolved" as a test result.
As I started the script, it quickly turned out that "unresolved" was still by far the most frequent return value. Actually, for the first 800 tests or so, the testing function returned nothing but "unresolved." The number of change subsets had gone up from two to four to eight…until we had 64 subsets, each containing 136 changes. And these tests took some time. As one gdb build took about six minutes, I was already waiting for three days. (Actually, I was not waiting, but writing my Ph.D. thesis. But still….)
I was just examining the log when something unusual happened. A test had just failed! Now, finally, I would see how the algorithm would focus on the smaller set, narrowing the search space. But when I checked the results, it turned out that the script had printed the following message on the screen and stopped:
NameError: name 'next_c_fial' is not defined
After three days of constant calculation, my script had stumbled on a dumb misspelling. I truly wished I had used a language with static checking rather than Python.
I fixed the problem and ran the script again. Now, finally, it would work. After five more days, and roughly 1,200 tests, the script