Maybe this is because humans aren't real consequentialists, they're perceptual control theory agents [...]
Might gradient descent produce a PCT agent instead of a mesa-optimizer? I don’t know. My guess is maybe, but that optimizers would be more, well, optimal, and we would get one eventually.
I think this idea that "real consequentialists are more optimal" is (sort of) the crux of our disagreement.
But it will be easiest to explain why if I spend some time fleshing out how I think about the situation.
What are these things we're talking about, these "agents" or "intelligences"?
First, they're physical systems. (That much is pretty obvious.) And they are probably pretty complicated ones, to support intelligence. They are structured in a purposeful way, with different parts working together.
And this structure is probably hierarchical, with higher-level parts that are made up of lower-level parts. Like how brains are made of neuroanatomical regions, which are made of cells, etc. Or the nested layers of abstraction in any non-trivial (human-written) computer program.
At some level(s) of the hierarchy, there may be parts that "run optimization algorithms."
But these could live at any level of the hierarchy. They could be very low-level and simple. There may be optimization algorithms at low levels controlled by non-optimization algorithms at higher levels. And those might be controlled by optimization algorithms at even higher levels, which in turn might be controlled by non-optimization ... etc.
Consider my computer. Sometimes, it runs optimization algorithms. But they're not optimizing the same function every time. They don't "have" targets of their own, they're just algorithms.
They blindly optimize whatever function they're given by the next level up, which is part of a long stack of higher levels (such as the programming language and the operating system). Few, if any, of the higher-level routines are optimization algorithms in themselves. They just control lower-level optimization algorithms.
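To make this concrete, here is a toy version of that arrangement (the function names and numbers are mine, purely for illustration): the routine below is generic optimization machinery with no target of its own, and a non-optimizing caller one level up decides what it optimizes.

```python
def minimize_1d(f, x, lr=0.01, steps=1000, eps=1e-6):
    """Blindly minimize whatever one-variable function it is handed,
    via finite-difference gradient descent. It has no target of its own."""
    for _ in range(steps):
        grad = (f(x + eps) - f(x - eps)) / (2 * eps)  # central difference
        x -= lr * grad
    return x

# The level above is not an optimizer; it just hands down targets.
for target in [lambda x: (x - 3) ** 2, lambda x: (x + 5) ** 2]:
    print(round(minimize_1d(target, 0.0), 2))  # 3.0, then -5.0
```

The outer loop plays the role of the higher levels of the stack: it controls which optimization runs, without doing any optimizing itself.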
If I use my computer to, say, make an amusing tumblr bot, I am wielding a lot of optimization power. But most of my computer is not doing optimization.
Python isn't asking itself, "what's the best code to run next if we want to make amusing tumblr bots?" The OS isn't asking itself, "how can I make all the different programs I'm running into the best versions of themselves for making amusing tumblr bots?"
And this is probably a good thing. It's hard to imagine these bizarre behaviors being helpful, or leaving me with a more amusing tumblr bot at the end.
Which is to say, "doing optimization well" (in the sense of hitting the target, sitting on a giant heap of utility) can happen without doing optimization at high abstraction levels.
And indeed, I'd go further, and say that it's generically better (for hitting your target) to put all the optimization at low levels, and control it with non-optimizing wrappers.
Why? The reasons include:
- Goodhart's law, especially its "extremal" variant, where optimization preferentially chooses regions of solution space where the assumptions behind your proxy target break down.
  - This is no less a problem when the thing choosing the target is part of a larger program, rather than a human.
  - Keeping optimization at low levels decreases the blast radius of this effect.
  - If the things you're optimizing are low-level intermediate results in the process of choosing the next action at the agent level, the impacts of Goodharting each one may cancel out. The agent-level actions won't look Goodharted, just slightly noisy/worse.
- Optimization tends to be slow. In a generic sense, it's the "slow, hard, expensive way" to do any given task, and you avoid it if you can. (Think of System 2 vs. System 1, satisficing vs. maximizing, etc.)
  - To press the point: why is there a distinction between "training" and "inference"? Why aren't neural networks training at all times? Because training is high-level optimization, and takes lots of compute, much more than inference.
  - Optimization gets vastly slower at higher levels of abstraction, because the state space gets so much larger (consider optimizing a single number vs. optimizing the entire world model).
  - You still want to get optimal results at the highest level, but searching for improvements at a high level is very expensive in terms of time/etc. In the time it takes to ask "what if the entire way I think were different, like what if it were [X]?", for one single [X], you could instead have run thousands of low-level optimization routines.
  - Optimization tends to take super-linear time, which means that nesting optimization inside of optimization is ultra-slow. So, you have to make tradeoffs and put the optimization at some levels instead of others. You can't just do optimization at every level at once. (Or you can, but it's extremely suboptimal.)
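The nesting point can be made with back-of-the-envelope arithmetic (the numbers here are illustrative, nothing more): if evaluating one candidate at a given level means running a full optimization at the level below, the costs multiply across levels.

```python
def nested_cost(evals_per_level, depth):
    """Total function evaluations if each candidate at one level triggers
    a full optimization run (evals_per_level evaluations) one level down."""
    return evals_per_level ** depth

print(nested_cost(1000, 1))  # one level of optimization: 1,000 evaluations
print(nested_cost(1000, 3))  # three nested levels: 1,000,000,000
```

And this is before accounting for each level's larger state space, which makes the per-level evaluation counts grow as well.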
When is the agent an "optimizer" / "true consequentialist"?
This question asks whether the very highest level of the hierarchy, the outermost wrapper, is an optimization algorithm.
As discussed above, this is not a promising agent design! There is an argument to be had about whether it still could emerge, for some weird reason.
But I want to push back against the intuition that it's a typical result of applying optimization to the design, or that agents sitting on giant heaps of utility will typically have this kind of design.
- "Can my computer make amusing tumblr bots?"
- "Is my computer as a whole, hardware and software, one giant optimizer for amusing tumblr bots?"
have very little to do with one another.
In the LessWrong-adjacent type of AI safety discussion, there's a tendency to overload the word "optimizer" in a misleading way. In casual use, "optimizer" conflates
- "thing that runs an optimization algorithm"
- "thing that has a utility function defined over states of the real world"
- "thing that's good at maximizing a utility function defined over states of the real world"
- "smart thing" (because you have to be smart to do the previous one)
But doing optimization all the way at the top, involving your whole world model and your highest-level objectives, is very slow, and tends to extremal-Goodhart itself into strange and terrible choices of action.
It's also not the only way of applying optimization power to your highest-level objectives.
If I want to make an amusing tumblr bot, the way to do this is not to ponder the world as a whole and ask how to optimize literally everything in it for maximal amusing bot production. Even optimizing just my computer for maximal amusing bot production is way too high-level. (Should I change the hue of my screen? the logic of the background process that builds a search index of my files??? It wastes time to even pose the questions.)
What I actually did was optimize just a few very simple parts of the world, a few collections of bits on my computer or other computers. And even that was very time-intensive and forced me to make tradeoffs about where to spend my GPU/TPU hours. And then of course I had to watch it carefully, applying lots of heuristics to make sure it wasn't Goodharting me (overfitting, etc).
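That watching-it-carefully loop can be sketched as a non-optimizing wrapper around a low-level optimizer, in the generic early-stopping pattern (the function names and loss values below are mine, not from the actual bot):

```python
def train_with_early_stopping(step, val_loss, max_steps=1000, patience=5):
    """`step()` runs one low-level optimization step; `val_loss()` is the
    wrapper's heuristic check. The wrapper optimizes nothing itself: it
    just watches the optimizer below and halts it when the proxy target
    starts to come apart from what we actually want."""
    best, bad = float("inf"), 0
    for _ in range(max_steps):
        step()
        loss = val_loss()
        if loss < best:
            best, bad = loss, 0
        else:
            bad += 1
            if bad >= patience:  # Goodharting (overfitting) suspected
                break
    return best

# Simulated validation losses that improve, then degrade (overfitting):
losses = iter([5, 4, 3, 2, 2.5, 2.6, 2.7, 2.8, 2.9, 3.0])
print(train_with_early_stopping(lambda: None, lambda: next(losses)))  # 2
```

The optimization power lives entirely in `step()`; the wrapper contributes judgment, not search.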
To get back to the original topic, the kind of "mesa-optimizer" we're worried about is an optimizer at a very high level.
It's not dangerous (in the same way) for a machine to run tiny low-level optimizers at a very fast rate. I don't care how many times you run Newton's method to find the roots of a one-variable function -- it's never going to "wake up" and start trying to ensure its goal doesn't change, or engaging in deception, or whatever.
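For concreteness, here is essentially the whole of the optimizer in question, in a textbook sketch: stateless arithmetic, with nowhere for goals, deception, or self-preservation to live.

```python
def newton_root(f, df, x, tol=1e-10, max_iter=50):
    """Find a root of a one-variable function by Newton's method:
    repeatedly slide down the tangent line to the x-axis."""
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Root of x^2 - 2 near x = 1: converges to sqrt(2) in a few iterations.
print(newton_root(lambda x: x * x - 2, lambda x: 2 * x, 1.0))
```

Run it a trillion times on a trillion functions; each run is independent and leaves nothing behind.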
And I am doubtful that mesa-optimizers like this will arise, for the same reasons I am doubtful that the agent will do optimization at its highest level.
Once we are pointing at the agent, or a part of it, and saying "that's a superintelligence, and wouldn't a superintelligence do...", we're probably not talking about something that runs optimization.
You don't spend your optimization budget at the level of abstraction where intelligence happens. You spend it at lower levels, and that's what intelligence is made out of.