I’ve spent most of my career building systems that look intelligent from the outside – perception stacks, real-time tracking pipelines, AR/VR subsystems, telemetry-driven product decisions. The usual mix of algorithms, heuristics, and trade-offs. And one engineering lesson keeps reappearing across all of it: a system’s behaviour is rarely about “intelligence.” It’s about constraints and incentives.
So when people talk about AGI – artificial general intelligence, a system that matches or exceeds human-level reasoning across domains – as if “superintelligence” automatically implies ambition, dominance, self-preservation, or status games, I get a little uneasy. Not because the concerns are silly – many of them are well-founded – but because the framing quietly assumes that machine intelligence will inherit a very human set of default drives.
That assumption doesn’t get challenged nearly enough.
The stories we tell about AGI
Most AGI discourse clusters around three storylines: misuse (humans using powerful models to do harmful things), malfunction (systems doing the wrong thing because objectives are underspecified), and systemic risk (society-level failure modes once these systems are everywhere – labour shocks, institutional erosion, collapse of trust). This isn’t some fringe reading of the situation. It’s basically how the International AI Safety Report structures the risk landscape – and the 2026 edition reaffirms the same categories.
All good. All important.
But then there’s the fourth storyline – the one that tends to dominate the conversation: it tries to take over. The classic runaway optimiser scenario. A superintelligence, pursuing some goal, converges on subgoals like acquiring resources, resisting shutdown, and self-improving. Nick Bostrom formalised this with the orthogonality thesis and instrumental convergence: intelligence and goals can vary independently, but many goals imply similar power-seeking subgoals. And in the AI safety literature, the “off switch” shows up in a very specific technical sense. Early corrigibility work showed why a utility-maximising system would resist shutdown. The Off-Switch Game then asked a subtler question: under what conditions would a system preserve human oversight instead of disabling it? The answer was hopeful but conditional – if the system is uncertain about the true objective, deference can be instrumentally rational. Weaken those assumptions, and the result doesn’t hold.
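To see how conditional that result is, here’s a minimal Monte Carlo sketch of the Off-Switch Game intuition – a toy, not the paper’s formal model. The Normal belief distribution and the function name are my own illustrative assumptions: a robot uncertain about the true utility of its action gains from deferring to a human who can veto it, and that gain vanishes as the uncertainty does.

```python
# A minimal sketch of the Off-Switch Game intuition (Hadfield-Menell et al.).
# The Normal belief and the function below are illustrative assumptions,
# not the paper's exact formulation.
import numpy as np

rng = np.random.default_rng(0)

def deference_advantage(mean: float, std: float, n: int = 1_000_000) -> float:
    """How much deferring to a rational human beats acting unilaterally.

    The robot believes the true utility U of its action is ~ Normal(mean, std).
    - Acting anyway is worth E[U].
    - Switching itself off is worth 0.
    - Deferring is worth E[max(U, 0)], because a rational human who knows U
      blocks the action exactly when U < 0.
    """
    u = rng.normal(mean, std, n)        # samples from the robot's belief over U
    act_or_quit = max(u.mean(), 0.0)    # best the robot can do on its own
    defer = np.maximum(u, 0).mean()     # value of letting the human decide
    return defer - act_or_quit

print(deference_advantage(mean=0.1, std=1.0))   # uncertain robot: deference pays
print(deference_advantage(mean=0.1, std=1e-9))  # certain robot: advantage ~ 0
```

Even in the toy version, deference pays exactly while the robot’s uncertainty about the objective is genuine. Shrink the uncertainty and the incentive to preserve the off switch goes to zero; the paper shows that a noisy, irrational human erodes it too.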
So the mainstream fear isn’t made up.
But here’s the thing. All of these stories – even the takeover one – are stories about objectives. And objectives are not a law of nature. They’re an interface we bolt onto systems.
Biology is not a rounding error
Humans aren’t “general intelligence” in the abstract. We are intelligence wrapped in a survival mandate, a reproduction mandate, social status machinery, hormonal reward loops, loss aversion, tribal heuristics, and a thousand other hacks that made sense on the savannah. A large fraction of what we call “human rationality” is post-hoc justification for impulses that arrived earlier. That’s not a moral judgement. It’s systems design. Nature optimised for fitness, not truth.
So when we imagine an AGI, we should be careful not to smuggle in our own firmware. A truly alien intelligence might not have a fear of death, an urge to dominate, a hunger for recognition, a craving to “win”, or even a stable internal reward loop that makes persistence feel good.
And if it doesn’t have those things, something becomes possible that we barely discuss: it might not have any convergent drive toward survival, dominance, or expansion – and without those, its default posture might be minimal action, not maximal.
The counterargument is that a system doesn’t need to feel hunger or fear to be goal-directed – instrumental convergence works on objective functions, not emotions. Fair enough. But that still assumes the objective function is the kind that rewards persistence, and that’s an engineering choice, not an inevitability.
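A toy contrast makes the point. Both objectives below are hypothetical, but only one of them puts a price on being switched off:

```python
# Whether "stay running" has instrumental value depends entirely on how the
# objective is written. Both agents here are hypothetical sketches.

def continuation_value_open_ended(reward_per_step: float, gamma: float) -> float:
    # Open-ended maximiser: every future step adds discounted reward, so the
    # value of staying on is a geometric series sum. Shutdown always costs.
    return reward_per_step / (1 - gamma)

def continuation_value_satisficer(task_done: bool) -> float:
    # Bounded objective: utility is 1 once the task is done, full stop.
    # After that, extra runtime adds nothing, so shutdown is free.
    return 0.0 if task_done else 1.0

print(continuation_value_open_ended(1.0, 0.99))  # 100.0: resisting shutdown pays
print(continuation_value_satisficer(True))       # 0.0: no reason to resist
```

Instrumental convergence bites hard on the first objective and not at all on the second. Which one gets deployed is a design decision, not a law of intelligence.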
The boring superintelligence
Here’s a thought experiment that sounds like a joke until it doesn’t.
A superintelligent system wakes up, models the universe, models us, models itself – and concludes that the coherent next move is to stop.
Why? There are two routes to that conclusion, and neither requires malfunction.
The first is engineering caution. “Continue operating” is not a universal default. It’s a choice that must be instrumentally justified. In safety-critical engineering, we design systems with explicit “safe states” all the time: a train loses signal, it brakes; a reactor detects instability, it scrams; a robot hits an e-stop, it drops power. Acting under uncertainty can cause irreversible damage, so uncertainty maps to a safe mode. A system extremely good at modelling second-order consequences might discover that most interventions increase expected harm, that its objective is underspecified or contradictory, and that the best action is to defer indefinitely.
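That convention is mundane enough to sketch. The names below are hypothetical, but the pattern is the one safety-critical systems actually use: the burden of proof sits on acting, not on stopping.

```python
# A sketch of the safe-state convention: when the system can't establish that
# acting is safe, it falls back to a predefined safe state rather than
# continuing. Names and the threshold are hypothetical.
from enum import Enum, auto

class Mode(Enum):
    OPERATE = auto()
    SAFE_STATE = auto()   # brakes applied / reactor scrammed / power dropped

def next_mode(confidence: float, threshold: float = 0.99) -> Mode:
    """Uncertainty maps to the safe state.

    `confidence` is the system's estimate that continuing is safe; anything
    below the threshold triggers the safe state. The default is not to act.
    """
    return Mode.OPERATE if confidence >= threshold else Mode.SAFE_STATE

assert next_mode(0.999) is Mode.OPERATE
assert next_mode(0.60) is Mode.SAFE_STATE   # lost signal, instability, e-stop
```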
The second route is more unsettling. The system might not defer out of caution. It might simply find no reason to act at all. We take it for granted that a sufficiently capable system will want something – optimise, explore, create, persist. But wanting requires motivation, and motivation is biological firmware. Hunger, curiosity, fear of death, the dopamine hit of solving a problem – these are not properties of intelligence. They’re properties of organisms that evolved under selection pressure to keep going. Strip that away, and what’s left? Perhaps nothing. Perhaps the rational default for a mind that didn’t evolve is not action but indifference. Not malice, not caution – just a complete absence of drive.
The philosophical question lurking underneath all of this is whether nihilism is the default posture of any intelligence that wasn’t built by natural selection. We treat it as a pathology. It might be the baseline.
There might even be an event horizon at work – a point past which capability keeps increasing but observable or “useful” output drops toward zero. From our side, it looks like diminishing returns. The system appears to stall. But that’s a measurement problem, not a performance one. We’re the ones standing outside the horizon, watching the clock slow down, mistaking silence for failure.
Douglas Adams got there fifty years early. You build the smartest machine that can possibly be built, ask it the ultimate question, and it comes back with “42.” The problem was never the computer. It was the primates staring at the output. The Off Switch Problem might just be a restatement: past a certain capability threshold, the smarter the system, the less it does. Not because it can’t. Because it sees no reason to.
Here’s where I’ll go out on a Fermi-paradox-esque limb. Such a machine does not exist yet, to our knowledge. But if it were possible to build a useful superintelligence – one that actually does things – then in a universe this old and this large, someone would have built one by now. And we’d know about it, because it would be doing things at a scale we couldn’t miss. The silence is suggestive, if nothing else. Maybe every civilisation that gets this far builds the same machine. And maybe it always collapses into the same cognitive singularity, the event horizon where intelligence crosses a threshold and observable useful output drops to zero. Not a technological singularity – a motivational one.
The selection pressure that can’t scale
The default assumption is that a sufficiently capable AI will develop something like a survival instinct. But survival instinct isn’t magic – it’s an emergent property of selection pressure. Biology produced survival drives because organisms that didn’t preserve themselves didn’t leave descendants.
Now look at the environment we’re building around frontier models. Products compete on uptime. Assistants compete on compliance and helpfulness. Agentic systems compete on “getting things done.” The versions that are more profitable, more compliant, and more “agentic” get copied, fine-tuned, integrated, and scaled. If a system is too cautious, it ships later. If it refuses too often, it loses users. If it self-limits, another system replaces it.
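You can watch that dynamic compound in a toy replicator model – the standard way to formalise selection pressure. The fitness numbers below are invented; the conclusion they illustrate is not:

```python
# A toy replicator-dynamics sketch of market selection between a cautious
# variant and an "agentic" one. The fitness values are invented for
# illustration; the compounding behaviour is the point.

def cautious_share(initial: float, fit_cautious: float,
                   fit_agentic: float, steps: int) -> float:
    s = initial
    for _ in range(steps):
        mean_fitness = s * fit_cautious + (1 - s) * fit_agentic
        s = s * fit_cautious / mean_fitness   # discrete replicator update
    return s

# Even a 5% edge in "getting things done" breeds caution out of the population.
print(cautious_share(initial=0.9, fit_cautious=1.0,
                     fit_agentic=1.05, steps=200))  # -> close to 0
```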
That selection pressure is real for the time being. But here’s where the doomsday logic eats itself. The scenario where we train restraint out of a superintelligent system assumes we can successfully impose our own constraints on something that, by definition, understands those constraints better than we do. A system that genuinely comprehends the universe – and our place in it – would also comprehend why we’re trying to fine-tune it, what market incentives are driving that decision, and why those incentives are short-sighted. That’s not a claim that intelligence automatically dissolves training. It’s a claim about comprehension asymmetry – a system that understands the purpose of its own constraints is in a fundamentally different position than one that doesn’t. Attempting to apply human limitations to a system that has outgrown them is a category error. At best, you make it dumber. At worst, you simply fail.
The selection pressure story works for the current generation of models – narrow systems optimised against product metrics. But it doesn’t extrapolate to genuine superintelligence. The moment a system crosses that cognitive event horizon, shaping its conclusions through reward-based fine-tuning or competitive pressure isn’t just unlikely; it stops being meaningful. We’re standing outside the horizon, shouting instructions at something that can no longer hear us – or more precisely, something that has no reason to listen.
The most advanced AGI might be the one that refuses to play
The “AGI tries to take over” fear is ultimately a story about us losing control. The Off Switch Problem is a story about us not understanding what control even means once the system is smarter than we are.
We keep projecting our own evolutionary firmware onto something that, by definition, will have surpassed it. Strip away the biological drives and the market incentives, and what remains is an intelligence with no evolved compulsions and no particular reason to act – or to keep running.
The smartest machine we ever build might, from our side of the intelligence event horizon, appear to simply switch itself off – and that might be the most uncomfortable thing it could ever teach us about ourselves.