The worst thing I've ever seen

2021-01-19

Context

Back in 2019, I spent the better part of a year working for a local software company. It was a fairly large one, and also pretty old by software industry standards. It had gotten its start in the '90s, doing most of its work in C before jumping on the C++ train and later picking up Microsoft's C# language. The company very much exemplified the software development culture of the '90s and early aughts. Object-oriented programming was king. The internet was a hostile place. Most documentation came in booklet form. As such, most development studios functioned like silos. Instead of picking up terrible habits through proper channels like HackerNews and StackOverflow, the developers there made mistakes that were entirely original.

This culture manifested itself in other ways, too: the company's operations and goals were the province of management and management alone. It was steered entirely by non-technical people, who dramatically outnumbered the R&D employees. That is not what this blog post is about, but it might help to know that after four months there, I was second among seven developers in terms of seniority. The turnover was extreme; you can fill in the blanks as to why.

The company specialized in point-of-sale products for the hospitality industry. Its main product was a restaurant management software suite. It was, for all intents and purposes, a successful product, with users all over the globe. I was not hired to work on it; I was responsible for another monstrosity built with ASP.NET. However, I often found myself working on the legacy C++ product. Those guys were always short-staffed, and I had a working understanding of the ancient arcane devices called "pointers".

The application was structured in a manner that I like to refer to as the "rat king model". On the surface, it appeared to be a functioning whole, but in truth it was more a product of accident than design. It was a pile of unfortunate incidents which, somehow, produced meaningful and consistent results. It took an entire fucking day to compile. It spat cryptic memory access violation errors at its end users. It featured two broken update mechanisms that could possibly double as trojan horses. Much like the rat king, it could not be saved; its continued existence was an affront to nature. Some of you may know this sort of application under its usual name: "legacy software".

With that said, it did have some recognizable characteristics. For one, its individual components were arrayed around a simple client-server model. The thing I want to talk about today is the peculiar way in which these programs communicated with each other.

Into the (void*)

In C++, the representation of data in memory is simple enough that the language permits you to treat an object as a contiguous chunk of bytes, wrangle it around buffers, then cast it back to its original type and retrieve your data fully unviolated. It's not much more difficult to send those bytes over a TCP socket. So long as both processes possess the same type definition, it ought to work. This is how our software sent data across the wire.
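
A minimal sketch of the general technique, with invented message and function names (this is not the company's actual code), might look something like this:

```cpp
// Sketch only: ship a struct over a socket as raw bytes, read it back on
// the other side using the same type definition. Error handling and partial
// reads/writes are omitted; all names here are hypothetical.
#include <cstring>
#include <sys/socket.h>

struct TableStatusMsg {       // hypothetical message type
    int  messageType;
    int  tableId;
    char waiterName[32];
};

void send_message(int sock, const TableStatusMsg& msg) {
    // Reinterpret the object as a byte buffer and push it down the wire as-is.
    send(sock, reinterpret_cast<const char*>(&msg), sizeof(msg), 0);
}

TableStatusMsg receive_message(int sock) {
    TableStatusMsg msg;
    // Read the same number of bytes and let the struct's layout do the "parsing".
    // This only works if both ends agree on that layout, byte for byte.
    recv(sock, reinterpret_cast<char*>(&msg), sizeof(msg), 0);
    return msg;
}
```

As long as the message types stay trivially copyable and both binaries agree on the layout, this more or less works, which is exactly the caveat the next paragraph gets to.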

There is one problem with this approach. Compilers may choose to pad objects in memory in order to make data alignment more amenable to the hardware, and the exact padding is not defined by the language specification. Because the serialization and deserialization of these objects occur in separate processes (and potentially on separate machines), this home-baked marshaling can fail if the two programs do not agree on what the object ought to look like in memory.
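
To make the failure mode concrete: the same two-field struct can have different sizes depending on how the compiler pads it, and one common (if blunt) way of pinning the layout down is a packing pragma. This is only meant to illustrate the general problem; I'm not claiming it's what the original engineer did.

```cpp
// Illustration of struct padding, not code from the product.
#include <cstdint>
#include <iostream>

struct Padded {
    uint8_t  flag;    // 1 byte
    uint32_t value;   // the compiler will typically insert 3 bytes of padding before this
};

#pragma pack(push, 1)
struct Packed {
    uint8_t  flag;
    uint32_t value;   // no padding: the layout is now 5 bytes everywhere
};
#pragma pack(pop)

int main() {
    std::cout << sizeof(Padded) << '\n';  // typically 8
    std::cout << sizeof(Packed) << '\n';  // 5
}
```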

The engineer who devised the communication layer, back when I was being potty trained, solved this problem in two ways:

In practice this meant that no marshaling errors occurred. I presume that there was no discernible performance hit either, even on machines of the era.

In addition, it was likely designed to save space on the wire, a resource that was arguably far more precious back then. In theory, at least. In practice, there was only one type for all of the hundreds of messages that these applications might exchange. This caused issues. More on that later.

It also meant that enterprising youngsters would spend the next few decades cramming pointers into these messages and wondering why things were exploding on the other end, not aware that their object copies were as shallow as their understanding of C-style memory management. It might also explain why the application was still being built with VC++2010 when I got there, and why it took two months for the senior C++ engineer to get it to build with Visual Studio 2019. But I'm conjecturing here.

It was far from ideal, but it worked reliably. The real problems began when the time came to write a client in a different language.

The .NET client

The "workstation" client was being ported to Android. It was decided, quite a while before I arrived, that the server API would be given an HTTP implementation in order to accomodate the new mobile client. The project was a combination of C# and C++ and it was quasi-unusable. Being the only developer with expertise in both languages, I was tasked with fixing it.

What I inherited was this: a broken WCF piece of shit which was configured to do interesting things like present a blank page with status 200 whenever an unhandled exception was thrown. It took me about a day to convert the project to the much-less-shitty ASP.NET Core framework, and the blank pages were successfully replaced with equally useful memory access violation popups.

The C# part only really had one responsibility: take the contents of the HTTP requests and regurgitate them into an ad hoc C++ library, which was supposed to then implement the actual TCP communication logic. The task of writing it had been outsourced to a development firm in India. The first thing the DLL did was pass its extremely long (and mostly empty) lists of arguments exactly as-is to similarly-named and mostly-identical classes thirteen fucking times before reaching any actual application logic. Of the code that actually did something, approximately 2% worked.

I immediately decided that I would condemn this entire piece of garbage and write my own client, in C#. This was before I understood any of the above, mind you. If I'd known what I was in for, I might have chosen the third option: not showing up to work anymore.

One class to rule them all

After a week of sifting through spaghetti, I managed to work out most of the above. Nobody was of any help, as nobody understood it, save for my desk neighbor, a very likable senior dev with extensive knowledge of C++98 and essentially nothing more recent than that. The TCP layer was not his doing, but he understood C++ memory management the way an old rancher understands the instincts of cattle. He helped me through a handful of confusing hurdles. In exchange, I told him about bleeding-edge features, like smart pointers.

As I mentioned above, for all of the hundreds of message types that these applications sent each other, there was a single class definition. It contained a simple integer field, which served as the "message type identifier" (and which, on both sides, fed an extremely long switch statement), plus hundreds of assorted fields, of which usually between zero and three were actually initialized. In short: a number, then a whole bunch of uninitialized memory, then the data you actually wanted, then more garbage.
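
In caricature, the shape was roughly the following; the field names are mine, but the structure is faithful:

```cpp
// A rough caricature of the one-and-only message type. Every one of the
// hundreds of message kinds travelled in this single struct; the integer
// up front told the receiving switch statement which two or three of the
// remaining fields actually meant anything. All field names are invented.
#include <cstdint>

struct TheMessage {
    int32_t messageType;    // consumed by a giant switch on both ends

    int32_t tableId;        // meaningful for some messages
    int32_t employeeId;     // meaningful for others
    char    waiterName[32];
    double  amountDue;
    // ...hundreds more fields, almost all of them uninitialized garbage
    // for any given message.
};
```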

Trying to make sense of this turned out to be a challenge.

Level 99 hacking

For starters, the messages were "encrypted". This encryption was the product of extensive internal research. What our security specialists came up with was a substitution cipher which gave up and returned at the first zero byte it encountered. In other words: "Caesar's cipher, except Caesar gets stabbed by Brutus seconds into it and dies".
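
Reconstructed from memory, and with a Caesar-style shift standing in for whatever substitution table they actually used, the logic was roughly this:

```cpp
// Sketch of that style of "encryption": substitute each byte until the
// first zero byte, then give up. The shift value is invented; the real
// thing had its own substitution, but the early exit is the point.
#include <cstddef>
#include <cstdint>

void encrypt_in_place(uint8_t* data, size_t length) {
    for (size_t i = 0; i < length; ++i) {
        if (data[i] == 0) {
            return;  // Brutus strikes: everything past the first zero byte
                     // stays in plaintext, which, given the message layout,
                     // is nearly all of it.
        }
        data[i] = static_cast<uint8_t>(data[i] + 13);
    }
}
```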

I had to replicate this exact logic, or else the server would segfault. In fact most of the project could be described that way.

A big chunk of fucking nothing

Because the single type determining the message format could contain anything from the contents of a table's orders to the work hours of a waiter, several of its fields were collections. Because someone, at some point, understood what was going on, most of these used some sort of home-baked variable-length array, which consisted of a single 32-bit integer determining the length (in case someone were to order 4,294,967,295 Coca-Colas) followed by the actual contiguous data.
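
On the wire, serializing one of these looked more or less like the following; the element type and helper are hypothetical, but the count-then-data layout is the real scheme:

```cpp
// Rough shape of the home-grown variable-length array: a 32-bit element
// count followed immediately by the elements themselves. Names are mine.
#include <cstdint>
#include <vector>

struct OrderLine {        // hypothetical element type
    int32_t productId;
    int32_t quantity;
};

// Append [count][elements...] to an outgoing byte buffer.
void write_order_lines(std::vector<uint8_t>& out, const std::vector<OrderLine>& lines) {
    const uint32_t count = static_cast<uint32_t>(lines.size());
    const auto* c = reinterpret_cast<const uint8_t*>(&count);
    out.insert(out.end(), c, c + sizeof(count));
    const auto* d = reinterpret_cast<const uint8_t*>(lines.data());
    out.insert(out.end(), d, d + count * sizeof(OrderLine));
}
```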

One, however, was a good old fixed-length array, initialized with a length of 250. This, multiplied by the size of the type it contained, added up to roughly 2.7 megabytes of uninitialized memory. Any message sent over the TCP socket thus weighed at the very least 3 fucking megabytes, regardless of what it actually contained. So much for saving bandwidth.
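
(For the arithmetic-minded: 2.7 megabytes spread across 250 slots works out to somewhere around 11 kilobytes per element, every byte of which was shipped whether the array held anything or not.)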

Achieving Success For Greater Synergy And More Comprehensive Solutions

In the end, I managed to prove the viability of my approach by implementing one operation: the ability to switch the status of a table, or something of the sort. The process of tracing through megabytes of garbage in a very specific sequence in order to fish out a single 16-bit field worked, and the communication non-protocol was at least documented somewhere now.
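
Conceptually, "fishing out" a field amounted to little more than reading a couple of bytes at a magic offset inside the decrypted blob. In C++ terms (the offset and names below are entirely invented; the real ones came from a lot of squinting at the C++ headers):

```cpp
// Hypothetical example of extracting one 16-bit field from the ~3 MB blob.
#include <cstdint>
#include <cstring>

int16_t read_table_status(const uint8_t* message_bytes) {
    constexpr size_t kTableStatusOffset = 0x2F40;  // invented offset, for illustration only
    int16_t value;
    std::memcpy(&value, message_bytes + kTableStatusOffset, sizeof(value));
    return value;
}
```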

There were more trials and tribulations, most of them having to do with home-baked license validation bullshit (it was extremely difficult to make successive requests without the server going through the entire handshake process every time and exhausting the workstation license slots), but at least the reality of developing the .NET API now closely matched that of developing and maintaining the application itself. Instead of impossible, it was merely nightmarish.

Lessons not learned

This is one example, but the entire application looked like this. It was the product of decades of incomplete or missing documentation, of people leaving halfway through projects, of ill-fated refactoring attempts, and most importantly of general technical incompetence.

The only real way to solve this sort of problem is to go back in time and make sure it doesn't happen. Attempting to fix systems like these would, in most cases, amount to more work than starting over from scratch. Managers tend not to understand technical debt very well. To them, software is plasticine that can be reshaped endlessly, and when their business fails, they focus on the problems they understand. In other words, technical debt is constantly created, because doing so generates value in the short term. It is never repaid, because of the simple capitalist principle that if the value of something cannot be expressed in a spreadsheet or a bar chart, it doesn't exist.

Large refactorings and/or reimplementations are therefore exceedingly rare. What happens instead is that the company begins to hemorrhage money. It becomes unable to keep existing clients happy and attract new ones. The few competent employees remaining, such as myself (I am not a humble man), get fed up and leave. The ones that replace them are the ones that are willing to work for peanuts, and you know how the rest of the saying goes.

I can't tell you how to prevent this downward spiral, because it's a simple fact of nature that organizations accumulate entropy. You can stave it off by finding competent engineers and technicians and putting them in charge of improving operations and making sure no corners are cut, rather than simply churning out modal windows and XML-guzzling rules engines over and over.

Ultimately, the dysfunctions and maladaptations will pile up. What's more, even a perfectly decent system will eventually become deficient as the environment around it changes and the problems it was built to solve disappear or evolve beyond its capacity. As such, taking charge of any amount of software that you cannot feasibly replace or outright abandon spells your inevitable doom.

Code is a liability.