A Rough Perspective on Strategy Stealing
Related to The strategy-stealing assumption.
Imagine there’s something called power, which refers, roughly, to flexible influence over the future. Humanity currently has 100% of the power. People are concerned that AI systems pose an existential risk; this concern is equivalent to worrying that at some point in the future, AI systems will have roughly 100% of the power. For this concern to be realized, humanity, which started out with 100% of the power, must somehow lose all of its power.
How might this happen? The first thing that has to happen is that humanity gives up some of its power, in a way that it can’t get that power back. Assuming that humanity doesn’t want to give up its power, humanity has to make a mistake.
As an analogy, imagine that you have some money. Normally you invest the money into relatively stable investment funds that abide by good practices. You give them some money, they make you more money. When you ask for your money back, you get it back. However, sometimes you make a mistake and you invest in a Ponzi scheme. When you ask for your money back, the Ponzi scheme says “no”. Now you have permanently less money.
Similarly, the worry is that humanity, which starts out with 100% of the power, gives up some of its power to an AI system. When we ask for the power back, the AI system does not give it back, leaving humanity with permanently less power. This might look like humanity delegating some high-level decision making to an AI system in the anticipation that the system will execute strategies that achieve human goals, but the AI system actually starts executing a strategy that achieves some different goal, and doesn’t stop when we tell it to stop.
However, this is only part of the concern. Remember, the worry is that humanity ultimately ends up with ~0% of the power. If humanity makes a mistake and gives away some of its power, this does not imply that it will lose all of its power. Back to our money analogy: if I accidentally give half my money to a Ponzi scheme and lose it, I still have half my money. If humanity accidentally gives up half of its power, it still has half of the power, half of the flexible influence over the long-term future. This is, admittedly, half as good as having all of the power, because humanity will only be able to achieve half of the potential value of the future. However, it is not an existential catastrophe.
So why are we concerned that humanity will ultimately end up with ~0% of the power? The second concern is that AI systems will be able to use the power that humanity accidentally gives them to gain more power. Power is relative: the amount of influence you have over the long-term future depends on the capabilities of other actors. So AI systems might outcompete humans. That is, AI systems might grow their power faster than humans are able to grow theirs. In the limit, this means that humanity has ~0% of the power.
Back to our financial analogy, imagine that you start with $100 and accidentally give away $50 to a competing investment firm. This firm can make its money grow faster than you can make yours grow. If it can triple its money every year while you can only double yours, then each year it controls a larger proportion of the total wealth. In the limit, the actor with the highest growth rate will control ~100% of the wealth.
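The compounding dynamic can be sketched in a few lines, using the illustrative numbers from the analogy (both parties start with $50; you double yearly, the firm triples yearly):

```python
# Illustrative sketch: you keep $50 and the competing firm has the $50
# you accidentally gave away. It compounds faster than you do.
you, firm = 50.0, 50.0
for year in range(30):
    you *= 2    # your yearly growth rate
    firm *= 3   # the competitor's higher yearly growth rate

# The firm's share of total wealth approaches 1 as the years pass,
# even though it started with only half the money.
firm_share = firm / (you + firm)
print(f"firm's share of total wealth after 30 years: {firm_share:.6f}")
```

Even though you never lose money in absolute terms, your *relative* share of wealth, and hence your influence, goes to zero.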
The strategy-stealing argument is an argument for why this cannot happen. Imagine that humanity accidentally gives up 10% of its power to an AI system. Here’s a strategy that humanity can employ to ensure it is always at least as good as the AI system at growing its power. First, humanity should divide itself into nine chunks, each of which has 10% of the power. Second, each chunk should individually copy the AI system, stealing its strategy. By symmetry, the human chunks can do no worse than the AI system in expectation. So humanity, which started out with 90% of the power, should be able to maintain 90% of the power in the limit.
Back to our financial analogy, you can imagine that I am a hedge fund with a million dollars, except I mess up and accidentally give $100,000 to a competing hedge fund. I’m worried that the competing hedge fund is going to grow its money faster than I can grow mine. However, I can ensure that this doesn’t happen by splitting my large hedge fund into nine smaller hedge funds, each of which has $100,000. These nine hedge funds can each execute the same sorts of strategies as the competitor that I accidentally gave $100,000 to. By symmetry, each of these hedge funds should do approximately the same in expectation; they should have approximately the same returns. Thus, my hedge funds should maintain at least 90% of the wealth.
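A minimal Monte Carlo sketch of the symmetry argument, under the assumed (hypothetical) numbers: ten funds of $100,000 each, nine mine and one the competitor's, all drawing yearly returns from the same distribution because everyone is running the same stolen strategy:

```python
import random

random.seed(0)
YEARS, TRIALS = 20, 2000

shares = []
for _ in range(TRIALS):
    # Nine of my funds plus the competitor, each starting at $100,000.
    funds = [100_000.0] * 10
    for _ in range(YEARS):
        # Same return distribution for every fund: the symmetry assumption.
        funds = [f * random.uniform(0.9, 1.3) for f in funds]
    mine, competitor = sum(funds[:9]), funds[9]
    shares.append(mine / (mine + competitor))

avg_share = sum(shares) / TRIALS
print(f"my average share of total wealth: {avg_share:.3f}")  # ≈ 0.9
```

By exchangeability, each fund's expected share of the total is exactly 1/10, so my nine funds hold 90% of the wealth in expectation, no matter how long the simulation runs.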
Where does this argument go wrong? Here are two high-level possibilities (there are both many ways to make these possibilities more specific and other high-level possibilities). The first is that humanity might be unable to conceptualize the strategies that the AI system is employing. In order for humanity to steal the AI system’s strategies, we have to know what those strategies are. In the hedge fund case, firms often have information security; it might be difficult for me to copy what my competitor is doing because they’re hiding their strategies.
The second reason that strategy stealing might fail is asymmetry of execution: it might be possible for an AI system to employ strategies that humanity cannot execute. Even if we knew what those strategies were, we would be unable to steal them. For example, AlphaGo might have a strategy for playing Go that involves doing large amounts of tree search. A human cannot naively do large amounts of tree search because they don’t have enough compute. In the hedge fund case, it might be that the smaller hedge fund gets large returns because it trains a machine learning model on historical financial data. I might be unable to steal this strategy because my hedge fund does not have competent machine learning engineers.
There’s some interaction between the detail of conceptualization and the ability to execute. For example, in the hedge fund case, knowing that my competitor is training a large machine learning model is not a sufficiently detailed conceptualization for me to execute without competent ML engineers. However, knowing the source code of their training process and having the exact same data they have is a conceptualization detailed enough that I can execute it, regardless of whether I have competent ML engineers. On the flip side, if the only conceptualization of the strategy I have is “do the best thing to make money,” and I’m very competent, that might be enough to ensure that my competitor doesn’t get higher returns on investment than I do. Any sufficiently detailed conceptualization implies ease of execution. There seems to be a lot of traction to be gained from improving our ability to conceptualize the strategies that AI systems employ.
This suggests one way to think about interpretability research: the point is to develop a sufficiently good understanding of the strategies AI systems employ that humanity can steal them for our own ends.
Abstracting further, the reason we’re interested in strategy stealing is that we want humanity to be competitive with AI systems: able to grow its power as fast as (or faster than) AI systems can grow theirs. Strategy stealing represents only one potential way humanity can be competitive. Other forms of alignment research can be conceptualized by how they make the power-expanding strategies humanity can employ more competitive with those that unaligned AI systems can employ.
Another way of thinking about this is in terms of differential capabilities. Imagine that we have a big list of all possible capabilities that AI systems can have, and we can plot ML systems in terms of which capabilities they have. In this frame, the worry of AI alignment is that the default set of capabilities favors unaligned activities over aligned activities. For example, in the default set of capabilities that ML is able to achieve, “do what humans want” might rank much lower than “convince humans you did what they wanted.” AI safety, in this framework, is research that attempts to differentially advantage aligned capabilities over unaligned capabilities. This is simply another way of saying: make strategies for achieving human values competitive with strategies for achieving other values.
How does AI policy fit into all of this? Across a broad range of fields there is something I might call the policy-technical divide. The policy question is how to stop people from using AI systems to do bad things; how do we prevent people from doing that? The technical question is how to make it as easy as possible to do good things instead of bad things. One reason people might do bad things instead of good things is that doing bad things is more profitable; economic forces push people toward whatever is more profitable. Technical research tries to make it as profitable to do good things as to do bad things. Since there’s a background assumption that humanity in general does not want to go extinct, when given the choice, people would prefer to do good instead of bad. If we can make aligned things approximately as competitive as unaligned things, then the default shape of the world will look basically like a world shaped by aligned capabilities. There will, of course, be some bad actors in the mix who do bad things instead of good things because they’re bad. However, since aligned actions are competitive with unaligned actions, these bad actors should not be able to substantially outcompete the good actors, so they’ll hopefully have only a bounded amount of influence on the things we care about.
This framework applies to other domains. For example, climate policy might be trying to answer the question of how we stop people from polluting, from emitting greenhouse gases, while the technical question is how we make it as cheap as possible not to emit greenhouse gases. Similarly, in animal welfare, the policy question is how we get people to stop wanting meat, and the technical question is how we make it as cheap as possible for people to stop eating meat.
So why do we think strategies might or might not be stealable? One reason strategies might not be stealable is that some strategies are better or worse for different sets of values. For example, a paperclip maximizer might employ the strategy “make lots of paperclips.” Even if humanity could conceptualize and execute this strategy, humans are not interested in executing it except in very bounded amounts. However, we are not worried about strategies like this, because they are not universal: they don’t increase the amount of power you started with. The strategy of making lots of paperclips does not make the AI system executing it more powerful or more influential than other strategies would. Since this strategy does not allow the AI system to increase the amount of power it has, we’re not that concerned about it. If we accidentally give 10% of the power to a paperclip maximizer and it uses that power to make paperclips, this is bad because we’ve lost 10% of our influence over the future; that part of the future becomes paperclips. But it is fine in the sense that the loss was bounded at 10%. So there’s a sense in which the only strategies we are concerned about an AI system executing are those that increase the thing we might call power, and these are potentially exactly the strategies that are stealable.
Why might this go wrong? Well, the notion of power is different for different sets of values. For example, the power of an AI system might increase if all the humans were dead, but the power of humanity would not. The worry here is that the paperclip maximizer can increase its power by employing a strategy that humanity cannot employ to increase its own power.
The solution here seems to be something like taking a smaller unit of what a strategy is. Instead of a sequence of actions, we might think about cognitive strategies: how the system would go about thinking of a clever way to kill all the humans. We might be able to steal the cognitive patterns that enable an AI system to do that.
In conclusion: humanity currently has 100% of the power, and we’re concerned that at some point it will have ~0%. How might this happen? First, humanity might accidentally give up some of its power. However, it seems unlikely that humanity accidentally gives up all of its power at once. So, for humanity to lose all of its power, there has to be some process by which an AI system can start off with a small amount of power and grow it. But why can’t humanity just employ the same strategies an AI system uses to grow its power in order to grow our own? The primary reason this seems hard is that humans currently cannot conceptualize the strategies an AI system is using. Research that aims to make these strategies conceptualizable thus contributes to alignment efforts. And even if we could conceptualize and execute all strategies, there remain strategies that differentially advantage some sets of values.