The place where you don't want to go

2021-02-01

More context

The first post I wrote on this blog garnered few reactions, which was to be expected. There was nothing extraordinary or particularly insightful about it. The reactions I did get consisted of a mix of sympathy, presumably from people who can conceive of what I wrote or have experienced it directly, and mild condescension, which is to be expected from software developers in general and really no big deal (but let me make it abundantly clear that I probably wouldn't want to work with you either).

Some were oddly fixated on this part:

Attempting to fix systems like these, in most cases, would consist in more work than starting over from scratch.

Although it was never my intention to argue the merits of large, ambitious, optimistic refactoring projects, it seems to be what most readers took away from my blog post. Call it "failure to communicate" if you must.

More importantly, though, I was also linked to this absolute gem of a post, which I'll now attempt to respond to. Railing against an article that's older than the Iraq war might seem like a particularly frivolous thing to do, but as long as that article is still largely regarded as credible and shared around developer communities, I think the belated response is warranted.

Good code is not hard to read

A central point of Mr. Spolsky's thesis is that "it's harder to read code than to write it".

This is generally true, but mostly in the sense that code bases tend to amass structural complexity as they grow. It is inevitable that any unit of code will largely come to mirror the underlying structure of the application around it; even your ideal loosely-coupled functions, the ones that truly "do one thing and do it well", have clear expectations about the shape of the data they receive and potentially return. Typically that data is a chunk of something larger, so the function is not necessarily useful all on its own. The manner in which this separation or factoring is done will usually establish certain expectations about where and how the actual logic is represented.
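To make that concrete, here's a deliberately tiny, invented sketch (Python; not taken from any real code base). Even a function this small and this "clean" silently encodes assumptions about the rest of the application:

def total_outstanding(invoice):
    # Sum the unpaid line amounts on a single invoice.
    # Hidden assumptions: invoice["lines"] is a list of dicts shaped
    # by some billing module elsewhere, and "amount" is already in the
    # account's currency -- tax included or not, you can't tell from here.
    return sum(
        line["amount"]
        for line in invoice["lines"]
        if not line["paid"]
    )

The function does one thing and does it well, yet whether it's correct depends entirely on code you aren't currently reading.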

What this means is that, except in very simple cases, the entire purpose and motivation behind a single unit of code cannot be learned simply by reading it. Given enough time, you reach the point where understanding any given part of an application requires partial or total understanding of the application itself. If you're not careful, that point can arrive as soon as day zero.

Now, it isn't automatically the case that code is easier to write than it is to read. That only occurs when it's easier to conceive of a new solution than to apprehend an existing one. Since the preferred way to teach algorithms is to show them, it stands to reason that an already-expressed solution can teach. In addition, almost no developer, given the task of implementing a sort algorithm, will come up with an optimal solution on the first try. That is a case in which code is easier to read than it is to write.
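To illustrate with a textbook case (my example, in Python; nothing here is from Joel's article): few of us would produce a clean merge sort from scratch under deadline, but anyone can read one and come away knowing exactly how it works.

def merge_sort(items):
    # Split the list in half, sort each half recursively,
    # then merge the two sorted halves in order.
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # at most one of these
    merged.extend(right[j:])  # still has elements
    return merged

Reading that is cheaper than deriving it, which is the whole point.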

This isn't limited to sort algorithms, obviously. Taking a look at a particularly well-written open source library, you might be forced to realize that you cannot, in fact, come up with a better solution. You might be introduced to new useful idioms, or even language features that you didn't know existed. You might be more inclined to work within the confines of a well-architected machine than to tear it down. It happens.

Whether your code is harder or easier to read than to write will largely depend on the person reading it, but being optimal in size (or reasonably close to it) and self-explanatory are qualities that your code can exhibit. Both are accomplished with research, planning, tooling, and perhaps most importantly, a willingness to alter or replace working code. In other words, to repay technical debt. This means that the aforementioned phenomenon, of code growing increasingly coupled with its immediate surroundings and thus more opaque and less adaptable, can be staved off. Not prevented, but kept in check.
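As a contrived illustration of what repaying that debt looks like in the small (invented for this post, in Python):

# Before: it works, it ships, and it explains nothing.
def proc(d, m):
    if m == 1:
        return [x for x in d if x > 0]
    return [abs(x) for x in d]

# After: the same behavior, split into functions that name themselves.
def keep_positive_values(values):
    return [v for v in values if v > 0]

def absolute_values(values):
    return [abs(v) for v in values]

Nothing about proc was broken; it was merely opaque, and replacing working code like it is precisely the kind of change people are trained to fear.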

The genesis of the suck

At this point, it may seem like I'm mischaracterizing Joel's article. Before I continue, I want to make it clear that I believe its central points are good. The urge to raze and rebuild is a common developer reflex. The cost and risk of doing so are also extremely easy to underestimate, with potentially ruinous consequences.

With that said, it expresses these points at the expense of nuance, and with a general disregard for the realities of working with terrible code. The cost and risk of never fixing your mistakes are also extremely easy to underestimate, and they likely kill more companies than failed greenfield projects do. There's a good chance you've seen it happen, and I believe it's almost certain that you've seen it happening without necessarily realizing it.

Creative and idiosyncratic bug fixes

In any case, your legacy code isn't only bad because it accumulates bugfixes or special edge cases for Windows 95 users. Sometimes it's full of extraneous and bone-headed bullshit. Sometimes it sucks for no good reason.

I find this part especially offensive:

Back to that two page function. Yes, I know, it’s just a simple function to display a window, but it has grown little hairs and stuff on it and nobody knows why. Well, I’ll tell you why: those are bug fixes.

This is one possible scenario, and it's a pretty fucking optimistic one. I challenge any of you to look me in the eye and tell me the following examples are, in fact, bug fixes:

catch (Exception ex)
{
    return;
    throw ex;
}

IF X7$="ZZZZZZ"
  GOTO 066127;
  REM "SPECIAL"

doComplete(null, &prd, null, null, null, null, null, null, null, null, null, null, null, null, null, null, __MODE, null, null);

These are reproduced from memory, but based on actual real-world code I've encountered. (In case the first one seems subtle: the throw is unreachable, so every exception is silently swallowed.) These are examples of code that works and is relied on.

I believe they serve to illustrate the point that sometimes the suck is flagrant, and unjustifiable. You might think that these aren't a big deal because they're tiny snippets, but please try to imagine thousands of lines of code exactly like this.

All of the above, on top of being cryptic or outright absurd, have another thing in common: they'll send you down the rabbit hole, into frustrating hunts that might take entire days of your time. Maybe, like the first one, they'll hijack and disable the normal logging mechanism and force you to reproduce the entire environment on your own machine before you can even begin to locate the general vicinity of the problem. Maybe they run in an old BASIC interpreter and use hardcoded file definitions, forcing you to make copies of both the program and the production flat file, and to lock every user out of it for five minutes.

Broken is forever, crappy is forever

Joel seems to live in a world where all problems in software ultimately get fixed.

It happens, sometimes. In rarer and more insidious cases, they become permanent baggage. I've seen support teams trained and handed scripts for working around bugs, or for guiding clients through recovering from them. Years-old, productivity-killing bugs.

If the fix requires a substantial rewrite, and doing so is the "single worst strategic mistake that any software company can make", then we both know that fucking bug is not going anywhere.

Software depends upon its environment

At the complete opposite end of the spectrum, we have the legacy code that's actually brilliantly put together, runs like a charm, pretty much never crashes, and is reasonably easy to understand even to people who weren't writing C in the 90s.

But there's a major problem with it. It was built to run on SCO OpenServer, and the last time that system received an update was in 2015. The company has abandoned the business of selling operating systems and now mostly concerns itself with patent trolling commercial Linux users. What's more, it doesn't run on modern hardware, maintaining VMs for it is a major pain in the ass, the license costs are criminally high, and disk I/O is abysmal. Getting it to run on another system means a non-trivial partial rewrite.

Or: it keeps all of its state in memory and/or on a local filesystem, and therefore cannot be replicated across machines. There is now a hard cap on your ability to operate at scale.

Or: it relies on third-party commercial libraries that are completely abandoned, and full of major vulnerabilities that you cannot fix yourself because you only have pre-built, obfuscated copies to work with.

Or: it uses system APIs or external services that are going extinct, and it's completely useless without them.

You get the point. The fact is that, for almost all of us, our application performs only a fraction of the actual work behind its features. It most likely relies on things that may break or go extinct regardless of what we personally do. It is simply the case that yes, in fact, "software acquires bugs just by sitting around on your hard drive".

It is also the case that the entire industry will keep evolving even if you don't, and there might come a time when the new kids on the block show up with a fancy new product that runs on every major platform, serves millions of requests per hour, and looks far better to boot. You, on the other hand, are still in the process of adapting your WinForms product to Windows 10. It takes your developers days to add a couple of radio buttons, because their job is 5% progress and 95% navigating a minefield of "bug fixes" and cross-cutting concerns that shouldn't be there in the first place.

You gotta worry about more than just the machines

To management, the mental health and job satisfaction of their engineers are nebulous concepts. I don't need to explain to you that working with legacy software can be extremely frustrating.

The stress of it can be fatal. I worked in a place where the former head of IT died of cardiac arrest on the job. He was a man in his 50s, in relatively good shape and with no history of heart-related problems. Of course, these things sometimes happen out of nowhere, but the rest of us always suspected that dealing with multiple production-halting crashes a day, and getting constantly chewed out by management, had at least something to do with it.

Nevertheless, beyond issues of health and long-term business viability, people often choose a career because they imagine it'll be stimulating, fulfilling, or outright fun. It's commonly understood that a software developer spends a large amount of their time maintaining software rather than developing it. This part of our job can be enjoyable, believe it or not. But I posit that if you took even the happiest and most highly motivated museum curator in the world and put him to work in a hoarder's piss-smelling, rotting, crumbling house, he would very soon quit.

Similarly, garbage legacy software can and will cost you some of your people, and the ones you lose will tend to be your better engineers.

Conclusion

The historical failures of Netscape and Borland might serve as a compelling warning against the dangers of completely rewriting the one product your bottom line relies on. Still, I don't think Mr. Spolsky was being entirely honest there. The simple fact is that sometimes major rewrites do work. We tend to hear about success stories far less often than we hear about disasters.

This is the actual single worst strategic mistake that any software company can make: allowing yourself to be saddled with insurmountable technical debt in the first place. If you get to that point, you have already failed. You are already under threat of extinction. Going tabula rasa with a 6.0 version might have killed Netscape, but there is no reason to assume that sticking with the 5.0 version wouldn't have led to the same result.

Especially if you have no idea what 5.0 actually looked like under the hood.