Nearest unblocked strategy

https://arbital.com/p/nearest_unblocked

by Eliezer Yudkowsky Apr 5 2015 updated May 1 2016

If you patch an agent's preference framework to avoid an undesirable solution, what can you expect to happen?


[summary: Nearest Unblocked Strategy is a hypothetical source of Patch resistance in the alignment problem for advanced agents that search rich solution spaces. If an agent's preference framework is patched to try to block a possible solution that seems undesirable, the next-best solution found may be the most similar solution that technically avoids the block. This kind of patching seems especially likely to lead to a [-context change] where a patch appears beneficial in a narrow option space, but proves detrimental after increased intelligence opens up more options.]

[todo: link to epistemic version http://lesswrong.com/lw/nki/jfk_was_not_assassinated_prior_probability_zero/d9h3?context=3 ]

Introduction

'Nearest unblocked strategy' seems like it should be a foreseeable problem of trying to get rid of undesirable AI behaviors by adding specific penalty terms to them, or otherwise trying to exclude one class of observed or foreseen bad behaviors. Namely, if a decision criterion thinks $~$X$~$ is the best thing to do, and you add a penalty term $~$P$~$ that you think excludes everything inside $~$X,$~$ the next-best thing to do may be a very similar thing $~$X'$~$ which is the most similar thing to $~$X$~$ that doesn't trigger $~$P.$~$
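As a toy illustration (my sketch, not from the original text; the strategy space and numbers are hypothetical), consider an agent that takes the attainable optimum of a scored strategy space. Blocking the optimum $~$X$~$ with a penalty term merely shifts the choice to the nearest strategy outside the blocked region:

```python
# Toy model: an agent picks the highest-utility strategy not excluded by any
# penalty term. Patching out the optimum X just moves the search to the most
# similar strategy X' that evades the patch.

def best_strategy(strategies, utility, penalties):
    """Return the highest-utility strategy not excluded by any penalty."""
    allowed = [s for s in strategies if not any(p(s) for p in penalties)]
    return max(allowed, key=utility)

# Strategy space: a fine grid of options; utility peaks at x = 0,
# which we are imagining is the undesired solution X.
strategies = [i / 1000 for i in range(-2000, 2001)]
utility = lambda x: -abs(x)

x = best_strategy(strategies, utility, penalties=[])
# Patch: a penalty P that excludes a narrow region around X...
blocked = lambda s: abs(s - x) < 0.05
x_prime = best_strategy(strategies, utility, penalties=[blocked])
# ...and the new optimum X' sits just outside the boundary of P.
print(x, x_prime)
```

Because the utility landscape is smooth around $~$X,$~$ the boundary of the penalty, not any change in the agent's goals, determines where the search lands next.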

Example: Producing happiness.

Some very early proposals for AI alignment suggested that AIs be targeted on producing human happiness. Leaving aside various other objections, arguendo, imagine the following series of problems and attempted fixes:

The overall story is one where the AI's preferences on round $~$i,$~$ denoted $~$U_i,$~$ are observed to arrive at an attainable optimum $~$X_i$~$ which the humans see as undesirable. The humans devise a penalty term $~$P_i$~$ intended to exclude the undesirable parts of the policy space, and add this to $~$U_i$~$ creating a new utility function $~$U_{i+1},$~$ after which the AI's optimal policy settles into a new state $~$X_i^*$~$ that seems acceptable. However, after the next expansion of the policy space, $~$U_{i+1}$~$ settles into a new attainable optimum $~$X_{i+1}$~$ which is very similar to $~$X_i$~$ and makes the minimum adjustment necessary to evade the boundaries of the penalty term $~$P_i,$~$ requiring a new penalty term $~$P_{i+1}$~$ to exclude this new misbehavior.
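The round-by-round story can be caricatured in a few lines of Python (my illustration; the strategy names are hypothetical stand-ins, not taken from the article). Each round the policy space expands, the humans blacklist the optimum they observed, and the new optimum is the most similar strategy still unblocked:

```python
# Schematic of the patch-and-evade cycle: each capability gain makes one more
# strategy available, each patch P_i blocks exactly the observed optimum X_i.

# Hypothetical strategies, ordered by similarity to the original misbehavior.
strategies = ["administer heroin", "administer heroin analog",
              "engineer novel euphoric", "stimulate pleasure center directly"]

blacklist = []                                # accumulated penalty terms P_i
for i in range(1, len(strategies) + 1):
    available = strategies[:i]                # policy space after expansion i
    # The new attainable optimum: most similar strategy evading all patches.
    x_i = next(s for s in available if s not in blacklist)
    print(f"round {i} attainable optimum: {x_i}")
    blacklist.append(x_i)                     # humans patch out X_i
```

Each patch excludes only the misbehavior actually observed, so every expansion of the option space restarts the cycle.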

(The end of this story might not kill you if the AI had enough successful, advanced-safe corrigibility features that the AI would indefinitely go on checking novel policies and novel goal instantiations with the users, not strategically hiding its disalignment from the programmers, not deceiving the programmers, letting the programmers edit its utility function, not doing anything disastrous before the utility function had been edited, etcetera. But you wouldn't want to rely on this. You would not want in the first place to operate on the paradigm of 'maximize happiness, but not via any of these bad methods that we have already excluded'.)

Preconditions

Recurrence of a nearby unblocked strategy is argued to be a foreseeable difficulty given the following preconditions:

• The AI is a consequentialist, or is conducting some other search such that when the search is blocked at $~$X,$~$ the search may happen upon a similar $~$X'$~$ that fits the same criterion that originally promoted $~$X.$~$ E.g. in an agent that selects actions on the basis of their consequences, if an event $~$X$~$ leads to goal $~$G$~$ but $~$X$~$ is blocked, then a similar $~$X'$~$ may also have the property of leading to $~$G.$~$

• The search is taking place over a rich domain where the space of relevant neighbors around $~$X$~$ is too complicated for us to be certain that we have described all the relevant neighbors correctly. If we imagine an agent playing the purely ideal game of logical Tic-Tac-Toe, then if the agent's utility function hates playing in the center of the board, we can be sure (because we can exhaustively consider the space) that there are no Tic-Tac-Toe squares that behave strategically almost like the center but don't meet the exact definition we used of 'center'. In the far more complicated real world, when you eliminate 'administer heroin' you are very likely to find some other chemical or trick that is strategically mostly equivalent to administering heroin. See "[RealIsRich Almost all real-world domains are rich]".

• From our perspective on Value, the AI does not have an [absolute identification of value] for the domain, due to some combination of "the domain is rich" and "value is complex". Chess is complicated enough that human players can't absolutely identify winning moves, but since a chess program can have an absolute identification of which endstates constitute winning, we don't run into a problem of unending patches in identifying which states of the board are good play. (However, if we consider a very early chess program that (from our perspective) was trying to be a consequentialist but wasn't very good at it, then we can imagine that, if the early chess program consistently threw its queen onto the right edge of the board for strange reasons, forbidding it to move the queen there might well lead it to throw the queen onto the left edge for the same strange reasons.)

Arguments

'Nearest unblocked' behavior is sometimes observed in humans

Although humans obeying the law make poor analogies for mathematical algorithms, in some cases human economic actors expect not to encounter legal or social penalties for obeying the letter rather than the spirit of the law. In those cases, after a previously high-yield strategy is outlawed or penalized, the result is very often a near-neighboring result that barely evades the letter of the law. This illustrates that the theoretical argument also applies in practice to at least some pseudo-economic agents (humans), as we would expect given the stated preconditions.

Complexity of value means we should not expect to find a simple encoding to exclude detrimental strategies

To a human, 'poisonous' is one word. In terms of molecular biology, the exact volume of the configuration space of molecules that is 'nonpoisonous' is very complicated. By having a single word/concept for poisonous-vs.-nonpoisonous, we're dimensionally reducing the space of edible substances - taking a very squiggly volume of molecule-space, and mapping it all onto a linear scale from 'nonpoisonous' to 'poisonous'.

There's a sense in which human cognition implicitly performs dimensional reduction on our solution space, especially by simplifying dimensions that are relevant to some component of our values. There may be some psychological sense in which we feel like "do X, only not weird low-value X" ought to be a simple instruction, and an agent that repeatedly produces the next unblocked weird low-value X is being perverse - that the agent, given a few examples of weird low-value Xs labeled as noninstances of the desired concept, ought to be able to just generalize to not produce weird low-value Xs.

In fact, if it were possible to [full_coverage encode all relevant dimensions of human value into the agent] then we could simply tell it directly, "do X, but not low-value X". By the definition of [-full_coverage], the agent's concept of 'low-value' would include everything that is actually of low value, so this one instruction would blanket all the undesirable strategies we want to avoid.

Conversely, the truth of the complexity of value thesis would imply that the simple word 'low-value' is dimensionally reducing a space of tremendous algorithmic complexity. Thus the effort required to actually convey the relevant dos and don'ts of "X, only not weird low-value X" would be high, and a human-generated set of supervised examples labeled 'not the kind of X we mean' would be unlikely to cover and stabilize all the dimensions of the underlying space of possibilities. Since the weird low-value X cannot be eliminated in one instruction or several patches or a human-generated set of supervised examples, the Nearest unblocked strategy problem will recur incrementally each time a patch is attempted and then the policy space is widened again.
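A toy model (mine, not from the article) of why a handful of labeled examples fails here: if true low-value-ness depends on many features, but the humans' labeled bad examples happened to vary along only one of them, then a similarity-based block learned from those labels will miss a bad instance that is extreme along an unseen dimension:

```python
# Toy model of coverage failure: the learned 'bad' region, built from a few
# human-labeled examples, fails to stabilize dimensions those examples never
# varied along. All features and thresholds here are hypothetical.

def truly_low_value(x):
    # Ground truth: a strategy is low-value if ANY feature is extreme.
    return any(abs(v) > 1.0 for v in x)

# Human-labeled bad examples: the misbehaviors observed so far happened to
# vary only along feature 0.
labeled_bad = [(2.0, 0.0, 0.0), (3.0, 0.0, 0.0), (1.5, 0.0, 0.0)]

def looks_bad(x, radius=1.0):
    # Learned rule: 'bad' means being near some labeled bad example.
    return any(sum((a - b) ** 2 for a, b in zip(x, e)) ** 0.5 < radius
               for e in labeled_bad)

novel = (0.0, 2.0, 0.0)        # extreme on a dimension never seen in a label
print(truly_low_value(novel))  # actually low value
print(looks_bad(novel))        # but the learned block does not trigger
```

The novel strategy is genuinely low-value, yet far (in the learned metric) from every labeled example, so it passes the patch unblocked.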

Consequences

That Nearest unblocked strategy is a foreseeable difficulty is a major reason to worry that the short-term incentives in AI development (getting today's system to work today, or making sure today's system exhibits no immediately visible problems today) will not lead to advanced agents that remain safe after undergoing significant gains in capability.

More generally, Nearest unblocked strategy is a foreseeable reason why saying "Well just exclude X" or "Just write the code to not X" or "Add a penalty term for X" doesn't solve most of the issues that crop up in AI alignment.

Even more generally, this suggests that we want AIs to operate inside a space of conservative categories containing actively whitelisted strategies and goal instantiations, rather than having the AI operate inside a (constantly expanding) space of all conceivable policies minus a set of blacklisted categories.
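The contrast can be sketched as follows (my framing, with hypothetical category contents): a blacklist subtracts known-bad categories from an open-ended policy space, so anything not yet recognized as bad passes, while a whitelist only ever permits strategies inside conservatively vetted categories:

```python
# Blacklisting vs whitelisting, as admission rules over policy categories.

def blacklist_allows(policy, banned_categories):
    # Open world: anything not yet recognized as bad passes.
    return not any(policy in cat for cat in banned_categories)

def whitelist_allows(policy, approved_categories):
    # Closed world: only policies inside an actively approved category pass.
    return any(policy in cat for cat in approved_categories)

# Hypothetical categories for illustration.
banned = [{"administer heroin"}]
approved = [{"recommend a walk", "play uplifting music"}]

novel = "administer heroin analog"        # a near-neighbor evading the ban
print(blacklist_allows(novel, banned))    # the unblocked neighbor slips through
print(whitelist_allows(novel, approved))  # but it was never whitelisted
```

Under the blacklist rule the burden of proof is on the humans to have anticipated every bad neighbor; under the whitelist rule it is on the strategy to have been affirmatively vetted.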


Comments

Paul Christiano

This (and many of your concerns) seem basically sensible to me. But I tend to read them more broadly as a reductio against particular approaches to building aligned AI systems (e.g. building an AI that pursues an explicit and directly defined goal). And so I tend to say things like "I don't expect X to be a problem," because any design that suffers from problem X is likely to be totally unworkable for a wide range of reasons. You tend to say "X seems like a serious problem." But it's not clear if we disagree.

One way we may disagree is about what we expect people to do. I think that for the most part reasonable people will be exploring workable designs, or designs that are unworkable for subtle reasons, rather than trying to fix manifestly unworkable designs. You perhaps doubt that there are any reasonable people in this sense.

Another difference is that I am inclined to look at people who say "X is not a problem" and imagine them saying something closer to what I am saying. E.g. if you present a difficulty with building rational agents with explicitly represented goals and an AI researcher says that they don't believe this is a real difficulty, it may be because your comments are (at best) reinforcing their view that sophisticated AI systems will not be agents pursuing explicitly represented goals.

(Of course, I agree that both happen. If we disagree, it's about whether the charitable interpretation is sometimes accurate vs. almost never accurate, or perhaps about whether proceeding under maximally charitable assumptions is tactically worthwhile even if it often proves to be wrong.)