# Artificially Intelligent

Any mimicry distinguishable from the original is insufficiently advanced.

# A Rough Perspective on Strategy Stealing

| 2394 words

Related to The strategy-stealing assumption.

# Introduction

Imagine there’s something called power, which refers, roughly, to flexible influence over the future. Humanity currently has 100% of the power. People are concerned that AI systems pose an existential risk; this concern is equivalent to worrying that at some point in the future, AI systems will have roughly 100% of the power. For this concern to be realized, humanity, which started out with 100% of the power, must somehow lose all of it.

How might this happen? The first thing that has to happen is that humanity gives up some of its power, in a way that it can’t get back. Assuming that humanity doesn’t want to give up its power, humanity has to make a mistake.

As an analogy, imagine that you have some money. Normally you invest the money into relatively stable investment funds that abide by good practices. You give them some money, they make you more money. When you ask for your money back, you get it back. However, sometimes you make a mistake and you invest in a Ponzi scheme. When you ask for your money back, the Ponzi scheme says “no”. Now you have permanently less money.

Similarly, the worry is that humanity, which starts out with 100% of the power, gives up some of it to an AI system. When we ask for the power back, the AI system does not give it back, leaving humanity with permanently less power. This might look like humanity delegating some high-level decision making to an AI system in the anticipation that the system will execute strategies that achieve human goals, but the AI system instead executes a strategy that achieves some different goal, and doesn’t stop when we tell it to stop.

However, this is only part of the concern. Remember, the worry is that humanity ultimately ends up with ~0% of the power. If humanity makes a mistake and gives away some of its power, this does not imply that it will lose all of its power. Back to our money analogy: if I accidentally give half my money to a Ponzi scheme and lose it, I still have half my money. If humanity accidentally gives up half of its power, it still has half of the power: half of the flexible influence over the long-term future. This is, admittedly, half as good as having all of the power, because humanity will only be able to achieve half of the potential value of the future. However, it is not an existential catastrophe.

So why are we concerned that humanity will ultimately end up with ~0% of the power? The second concern is that AI systems will be able to use the power that humanity accidentally gives them to gain more power. Since power is relative (the amount of influence you have over the long-term future depends on the capabilities of other actors), AI systems might outcompete humans. That is, AI systems might grow their power faster than humans are able to grow theirs. In the limit, this means that humanity has 0% of the power.

Back to our financial analogy: imagine that you start with $100 and accidentally give away $50 to a competing investment firm. This firm can make its money grow faster than you can make yours grow. If it can triple its money every year while you can only double yours, then each year it holds a larger proportion of the total wealth. In the limit, the actor with the highest growth rate controls ~100% of the wealth.
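To make the compounding arithmetic concrete, here is a small Python sketch (the function name and numbers are mine, chosen to match the analogy): two actors compound wealth at different rates, and the faster grower's share of the total approaches 100%.

```python
def rival_share(you: float, rival: float, your_rate: float,
                rival_rate: float, years: int) -> float:
    """Rival's fraction of total wealth after compounding for `years`."""
    for _ in range(years):
        you *= your_rate      # e.g. you double every year
        rival *= rival_rate   # e.g. the rival triples every year
    return rival / (you + rival)

# Start with $100 and accidentally give $50 to the rival.
for years in (0, 5, 10, 20):
    print(f"year {years:2d}: rival holds {rival_share(50, 50, 2, 3, years):.1%}")
```

After twenty years the rival's share is already indistinguishable from 100%, even though both actors started with the same $50.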

# Strategy Stealing

The strategy-stealing argument is an argument for why this cannot happen. Imagine that humanity accidentally gives up 10% of its power to an AI system. Here’s a strategy humanity can employ to ensure it is always at least as good as the AI system at growing its power. First, humanity should divide itself into nine chunks, each of which has 10% of the power. Second, each chunk should individually copy the AI system, stealing its strategy. By symmetry, the humanity chunks can do no worse than the AI system in expectation. So humanity, which started out with 90% of the power, should be able to maintain 90% of the power in the limit.

Back to our financial analogy: imagine that I am a hedge fund with a million dollars, except I mess up and accidentally give $100,000 to a competing hedge fund. I’m worried that the competing hedge fund is going to grow its money faster than I can grow mine. However, I can ensure that this doesn’t happen by splitting my large hedge fund into nine smaller hedge funds, each of which has $100,000. These nine hedge funds can each execute the same sorts of strategies as the competitor that I accidentally gave $100,000 to. By symmetry, each of these hedge funds should do approximately the same in expectation; they should have approximately the same returns. Thus, my hedge fund should maintain at least 90% of the wealth.
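The symmetry claim can be checked with a toy Monte Carlo simulation (a sketch of my own; the return distribution is arbitrary). If all ten funds draw yearly returns from the same distribution, i.e. the nine copies have stolen the competitor's strategy, then by symmetry the competitor's expected share of the total wealth is exactly one tenth.

```python
import random

def average_rival_share(num_copies=9, stake=100_000, years=10,
                        trials=5000, seed=0):
    """Average fraction of total wealth held by the rival fund when
    `num_copies` funds independently execute the rival's own strategy."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        funds = [float(stake)] * (num_copies + 1)  # index 0 is the rival
        for _ in range(years):
            # Every fund draws its yearly return from the same distribution,
            # so no fund has an edge over any other.
            funds = [f * rng.choice([0.8, 1.0, 1.3, 1.6]) for f in funds]
        total += funds[0] / sum(funds)
    return total / trials

print(f"rival's average share: {average_rival_share():.1%}")  # close to 10%
```

Individual runs vary a lot (a lucky fund can briefly dominate), but averaged over many trials the rival's share stays pinned near 10%, which is the content of the symmetry argument.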

Where does this argument go wrong? Here are two high-level possibilities (there are both many ways to make these possibilities more specific and other high-level possibilities). The first is that humanity might be unable to conceptualize the strategies that the AI system is employing. In order for humanity to steal the AI system’s strategies, we have to know what those strategies are. In the hedge fund case, firms often have information security; it might be difficult for me to copy what my competitor is doing because they’re hiding their strategies.

The second reason that strategy stealing might fail is asymmetry of execution: it might be possible for an AI system to employ strategies that humanity cannot execute. Even if we knew what those strategies were, we would be unable to steal them. For example, AlphaGo might have a strategy for playing Go that involves doing large amounts of tree search. A human naively cannot do large amounts of tree search because they do not have enough compute. In the hedge fund case, it might be that the smaller hedge fund gets large returns because it trains a machine learning model on historical financial data. I might be unable to steal this strategy because my hedge fund does not have competent machine learning engineers.

There’s some interaction between the detail of conceptualization and the ability to execute. For example, in the hedge fund case, knowing my competitor is training a large machine learning model is not a sufficiently detailed conceptualization for me to execute the strategy without competent ML engineers. However, knowing the source code of their training process and having the exact same data that they have is a sufficiently detailed conceptualization that I can execute it, regardless of whether or not I have competent ML engineers. On the flip side, if the only conceptualization of the strategy I have is “do the best thing to make money,” and I’m very competent, that might be enough to ensure that my competitor doesn’t get higher returns on investment than I do. Any sufficiently detailed conceptualization implies ease of execution. There seems to be a lot of traction to be gained from improving humanity’s ability to conceptualize the strategies that our AI systems employ.

This suggests one way to think about interpretability research: the point is to develop a sufficiently good understanding of the strategies AI systems employ that humanity can steal them for its own ends.

# Competitiveness

Abstracting further, the reason we’re interested in strategy stealing is that we want humanity to be competitive with AI systems: able to grow its power as fast as (or faster than) AI systems can grow theirs. Strategy stealing represents only one potential way humanity can be competitive. Other forms of alignment work can be conceptualized by how they make the strategies humanity can employ to expand its power more competitive with the strategies unaligned AI systems can employ.

Another way of thinking about this is in terms of differential capabilities. Imagine that we have a big list of all possible capabilities that AI systems can have, and we can plot ML systems in terms of which capabilities they have. In this frame, the worry of AI alignment is that the default set of capabilities favors unaligned activities over aligned activities. For example, the capability “do what humans want” might be much weaker than “convince humans you did what they wanted” in the default set of capabilities that ML is able to achieve. AI safety, in this framework, is research that attempts to differentially advantage aligned capabilities compared to unaligned capabilities. This is simply another way of saying: make strategies for achieving human values competitive with strategies for achieving other values.