Keeping Copilot From Producing an Unmaintainable Mess

Introduction

Over the past few months, I've changed my practices so that the vast majority of my software contributions come from Copilot. I'm not sure any given task actually takes less time, but it certainly allows me to multitask far more effectively, which is crucial for my current role. My typical process is to assign an issue to the Copilot agent and then review the results with varying amounts of requested follow-up (often far more than I would inflict upon a colleague). Setting aside specific quirks, I've come to really appreciate how well Copilot can do when extending a system with a well-established structure, but I've also noticed some of the suspect practices that can creep in from the training corpus.

I have a range of concerns about generative AI generally, and have adopted very different stances on its use personally versus professionally, which will likely be covered throughout this site. The concern pertinent to this thread is code maintenance, which tends to be the most difficult and expensive part of software projects. While I can certainly see the value of vibe coding for prototyping, much of the current direction suggests that the volume of code produced will increase far more quickly than the number of engineers (where the actual change in the latter is a bit murkier), which raises the question of how this is expected to be sustained. A variation of this was also mentioned by Cory Doctorow in the November 2025 issue of CACM as an accelerating accumulation of technical debt. Postulating about what may happen and its impacts is fertile ground, but the focus here is more grounded: practices which are common but which, in my estimation, are paths to unmaintainable systems generally, and especially when amplified through AI.

Premise

First and foremost, it is worth identifying why this would even be a problem. It can be tempting to think that the ability to produce code more quickly and (ostensibly) less expensively allows the result itself to be less precious, but that seems inconsistent with history. As systems grow more complex their interactions require more attention, which contributes to the maintenance costs and velocity degradation identified by Fred Brooks and likely many others. Much of this could presumably be guarded against with very precise contracts (and accounting for Hyrum's Law), and that type of rigorous breakdown may be part of the future of software engineering - but currently that seems to fly in the face of how GenAI is being applied, which largely seeks to eschew such formalism. AI could certainly also serve a valuable role in additional processes such as analyzing what is available, but there should be controls which ensure that ground truths are available so that less-than-ideal information is not inadvertently reinforced. In one way or another, the assertion here boils down to the idea that there should be a tractable path to "trust, but verify" so that the behavior of systems at all levels remains deterministically comprehensible.

The Anti-Patterns

With the background perspective out of the way, we can get to the practical details. So far two stand out - these are by no means new issues, but they are often normalized in ways that could easily sow chaos.

One-off Branching

There is a range of content floating around about the evil of if statements, which generally amounts to preferring polymorphism over ifs. While practices such as polymorphism can feel cleaner, they're all branches under the hood, so why would one mechanism be deemed more dangerous than another? While I've typically fallen into the anti-if camp in the interest of taming cyclomatic complexity, my recent experiences with Copilot clarified for me the significance of tightly controlling branching (and reminded me of past projects where that was not done).

This distinction certainly echoes earlier debates about GOTO statements during the rise of structured programming, as does the underlying rationale. Beyond what may be more cosmetic benefits, approaches such as polymorphism guide you towards making sure that the system as a whole has a sound model and conceptual integrity, and that the resulting branching falls along the seams of that model. Any equivalent mechanism (including as many if statements as you can reliably reason through) can serve the same purpose.

What is not wanted is branching that allows for deviation from that model. This is incredibly common based on past experience, and it shows up in code produced by Copilot (which I would also not expect to reason through better fixes). When a specific scenario surfaces undesirable behavior, a branch is added to set things back on course. This is far easier and more obvious than refining the system as a whole so that the model yields consistent behavior for that scenario without additional compensation. Outside of the current focus on GenAI, this may also be related to deficient requirements engineering practices which neglect such refinement (another topic I'll probably touch on at some point, though there's plenty of material from Bertrand Meyer and others). Over time such special cases (particularly as they compound on top of each other) yield a house of cards of unmaintainable complexity. Systems end up rapidly accreting complexity as new logic incidentally compensates for past logic, while the option to make the system simpler by backing out those earlier decisions goes neglected.
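To make the contrast concrete, here is a minimal sketch; the domain, names, and numbers are hypothetical and chosen purely for illustration, not taken from any project mentioned above. The first function patches a single scenario with a bolted-on branch, while the second refines the model so the same scenario falls along an explicit seam.

    from dataclasses import dataclass
    from typing import Protocol


    @dataclass
    class Order:
        customer_id: str
        weight_kg: float


    # The one-off branch: a single scenario patched directly into the flow.
    def shipping_cost_patched(order: Order) -> float:
        if order.customer_id == "ACME-042":  # bolted on after one complaint; not part of any model
            return 0.0
        return 5.0 + 0.5 * order.weight_kg


    # Refining the model instead: "some customers have negotiated shipping terms"
    # becomes a first-class concept, and the branch falls along that seam.
    class ShippingPolicy(Protocol):
        def cost(self, order: Order) -> float: ...


    class StandardShipping:
        def cost(self, order: Order) -> float:
            return 5.0 + 0.5 * order.weight_kg


    class NegotiatedShipping:
        def cost(self, order: Order) -> float:
            return 0.0


    def shipping_cost(order: Order, policy: ShippingPolicy) -> float:
        return policy.cost(order)

The second version is not free - someone still has to decide where policies come from - but each new scenario lands in a named concept rather than another condition buried mid-flow.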

As a disclaimer, this is not to say that such approaches do not have their place; there are times when a fix is urgent and the more holistic adjustment is more time consuming or not forthcoming. Such scenarios are where actual technical debt comes into play (rather than the abuse of the phrase where it refers to seemingly anything inconvenient). Introducing a quick if can get you something in the short term without fully paying for it, but if you don't manage those tradeoffs you end up insolvent.

In some of my recent sessions with Copilot, it introduced such branching in ways that fairly egregiously violated component boundaries; previously it also produced code local to a module that buried much of the intended semantics in a way that would make the system much harder to understand out of the gate, and that with subsequent increments could make the overall flow inscrutable. A closely related concept (which I may expand upon later) is that it seems worth differentiating code that is readable from code that is understandable. It is incredibly easy to look at code and think you know what it is doing, but without practices that extend beyond the low-hanging fruit of code conventions, that inferred behavior may not be accurate or complete (which brings us back to wrestling with edge and corner cases that could be designed out). I've certainly created such systems early in my career, and worked with them more recently. They are typically those where either no one wants to touch anything for fear of breaking things, or there is one resident SME, whom I'll call Fred in reference to Michael A Jackson's Brilliance essay, to whom everyone goes in order to understand how to do their work. If Copilot assumes the role of Fred, it quickly flips from a tool of empowerment to one of dependence, and we could drive our front wheels off a cliff before we notice that's where it is heading.

DWIMmery

An equally insidious but even more self-inflicted malady is the tendency to sprinkle in some additional cleverness so that the system will Do What I Mean (DWIM). I was fairly surprised when Copilot did this: adding some unsolicited sophistication to how data was being handled to provide what seemed like reasonable corrections. This was not done in response to any defined requirement, and therefore the resulting behavior was not specified in any way...but once added it must be assumed to need support, setting the course for a future where users rely on unspecified and potentially poorly understood behavior. My brief bewilderment was replaced with the hypothesis that Copilot was simply (stochastically) parroting questionable practices from its training data - practices that I've dealt with (and almost certainly fallen into) in the past.

Systems should consistently be built with defensive programming, but guarding against undesirable input does not imply that such inputs should be in any way improved. There's likely other supporting material, but there's a good talk from Greg Young floating around somewhere that goes into some of this. Returning an error for invalid input keeps the complexity down and enables forward-compatible support for such cases if they later prove worth attention. Adding logic for these cases can lead to a nest of complexity which is perversely consolidated around rare and often lost causes. A telltale sign is a system that seems very hard to understand until it is discovered that the happy path handling the majority of data remains simple, but is eclipsed by code that is rarely exercised and provides nebulous (and often still not completely safe) value.
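As a minimal sketch of the distinction (the field, rules, and function names here are hypothetical): the first parser quietly "improves" whatever it is given, which immediately becomes unspecified behavior someone can come to rely on; the second guards the boundary and rejects invalid input with an error that can be specified and supported later.

    # DWIM-style handling: unrequested "corrections" that become de facto behavior.
    def parse_quantity_dwim(raw: str) -> int:
        cleaned = raw.strip().replace(",", "")   # silently "improve" the input
        if cleaned == "":
            return 1                             # guess a default nobody asked for
        return abs(int(float(cleaned)))          # negative or fractional? just smooth it over


    # Defensive handling: guard the boundary, but don't invent meaning.
    def parse_quantity(raw: str) -> int:
        try:
            value = int(raw)
        except ValueError as exc:
            raise ValueError(f"quantity must be an integer, got {raw!r}") from exc
        if value <= 0:
            raise ValueError(f"quantity must be positive, got {value}")
        return value

The first version has no specification to check it against, so every one of its guesses is a behavior that now has to be discovered, documented, and preserved; the second leaves room to decide deliberately, later, whether any of those cases deserve real handling.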

As with most things, the danger here is GenAI amplifying what is pernicious in all cases. This is particularly concerning not just because of the volume, and the likelihood that GenAI may plough through logic structured in ways people can't understand, but because such behavior may be tucked into a larger code modification. When done manually it is likely to be discussed or otherwise get attention - in the best scenarios also being captured as a requirement/specification - but GenAI sprinkling in behaviors that have not been requested and have not been accounted for can leave systems which are not understood and are difficult to support.