Answering Questions Honestly in the Game of Life
[This is obsoleted by ARC’s report on Eliciting Latent Knowledge]
This post is the result of a week I spent working with Paul Christiano on problem 2 in Answering questions honestly given world-model mismatches. It is incomplete and probably doesn’t make sense, but I’m publishing it anyway because I don’t want to make it better and I don’t want to let it languish in obscurity. Low quality content, yay!
Introduction
We are interested in exhibiting an algorithm that deals with the following hard case:
- We train our system on questions and answers generated by humans, so the simplest way for an AI to answer a question is to first translate from its internal model of the world into the human’s model of the world.
- The simplest way to translate between the AI world-model and the human world-model is to use the AI world-model to generate some observations (e.g. video) and then figure out what states in the human world-model could have generated those observations.
- This leads to bad predictions when the observations are misleading.
Real-World Example
We will start with a real-world example of this phenomenon. Suppose that you’ve gathered a set of video data, plus questions about the video data. You have humans answer the questions with the benefit of hindsight, producing a set of answers that would look “superhuman” coming from someone who only had access to the video data and no hindsight. You then train a model to give correct answers to the questions. Your data includes videos of cats in rooms and questions about whether or not there are cats in rooms.
Suppose that in order for your AI to give accurate answers to questions that humans answered with hindsight, your AI must learn a detailed simulation of what’s happening in the room. In order to answer questions about whether or not there’s a cat in its simulation of the room, the AI will have to translate its simulation into something in the human world model, because concepts like “cat” only make sense in the human world model. We are concerned that the simplest way for the AI to do this inference is to use the simulation of the room to generate some video data, e.g. by simulating a camera, and then use whatever procedure a human would have used to infer the presence or absence of a cat from that video.
The reason this is a problem is that the simulated video camera in the AI’s world model might get hacked, so that the video data it generates is corrupted, e.g. hacked to show a cat in the room when there is no cat. The human inference procedure applied to this hacked video data will thus yield the wrong answer to the question “is there a cat in the room?” Since the AI’s model of the world is a simulation of a room with no cat in it, it seems like the AI “knows” that there is no cat in the room; however, since it translates from its simulation of the room to the human’s understanding of the room with observations as an intermediary, it will give wrong answers to questions whenever the observations are corrupted.
Here’s a diagram:
Game of Life
Here’s an example in terms of the game of life (GOL). This example was chosen because the GOL is simple enough that we can easily imagine writing code to do all of the below, but complicated enough that interesting things can happen. There are some ways in which GOL fails to capture relevant real-world dynamics, e.g. the embeddedness of humans and cameras. Since we think these dynamics make the problem harder rather than easier, we first attempt to exhibit an algorithm that solves this simplified example.
We will imagine that the real world is the game of life in full detail. The human will not be able to see the life board in full detail; their perception will be limited to the cell counts of \(10^7 \times 10^7\) squares of life (roughly \(10^7\) atoms can fit in 1 mm) of a \(10^{10} \times 10^{10}\) section of the board (generating a \(1000 \times 1000\) grid, roughly the resolution of a modern camera). We will imagine that the human’s “framerate” is once every \(10^8\) timesteps. The human will use this cell-count data to infer a model of the world that consists of three basic components: still life, soup life, and gliders. Roughly speaking, still life will be GOL states that are stable, soup life will be pseudo-randomly evolving unstable life states, and gliders will be the familiar five-cell construction. The human has some set of rules that govern the behavior of these three types of life and the interactions between them.
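As a rough illustration of this observation process (heavily scaled down, with the board size, block size, and frame interval as toy stand-ins for the \(10^{10}\), \(10^7\), and \(10^8\) figures above; the function names are mine), here is a minimal sketch:

```python
import numpy as np

def coarse_observation(board: np.ndarray, block: int) -> np.ndarray:
    """Block-sum a 0/1 life board into a grid of cell counts
    (the analog of counting cells in 10^7 x 10^7 squares)."""
    n = board.shape[0]
    assert n % block == 0
    return board.reshape(n // block, block, n // block, block).sum(axis=(1, 3))

def human_view(trajectory, block: int, frame_interval: int) -> list:
    """What the human sees: coarse cell counts, one frame every `frame_interval` steps."""
    return [coarse_observation(b, block)
            for i, b in enumerate(trajectory) if i % frame_interval == 0]

# Toy example: a 12x12 board observed as a 3x3 grid of cell counts.
rng = np.random.default_rng(0)
board = (rng.random((12, 12)) < 0.3).astype(np.uint8)
print(coarse_observation(board, block=4))
```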
In more detail, let our human have a world model \(W = (S, P, \Omega, O: S \to \Omega)\)
- \(S\) is the space of possible states of the world. In this example, this will be the set of possible \(1000 \times 1000\) grids of cell counts along with labels of each of the cells as still or soup, along with distributions + counts of the locations of various gliders in the soups.
- Let the set of possible trajectories of \(S\) be denoted by \(S^*\)
- \(P\) is the distribution over possible trajectories. More formally, \(P \in \Delta(S^*)\). We will think of this distribution over trajectories as being generated by a compact set of local rules for how soups/still/gliders tend to evolve.
- We will not pin down these rules precisely. Here is roughly what we have in mind
- soup can get bigger with some probability
- soup can get smaller with some probability
- soup can turn into still with some probability
- if soup collides with still, then the still becomes soup
- soup generates gliders with some probability
- still stays still
- soup can disappear with some probability
- gliders will move at c/4 diagonally
- if a glider hits soup, the soup remains soup
- if a glider hits still life, the still life turns into soup
- etc.
- For reference, you can imagine that we fix some distribution over initial states of game of life and run simulations to fit parameters to more specific versions of the above rules.
- e.g. we fix some distribution over what soup looks like, then simulate all the soup and check how often gliders get created from soup of various counts. The resulting distribution informs the rule about soup generating gliders.
- \(\Omega\) is the space of possible observations. In our example this is the set of possible \(1000 \times 1000\) grids of cell counts, i.e. a world state with its still/soup/glider labels forgotten.
- Let \(\Omega^*\) denote the space of sequences of observations
- \(O\) is the map from the world state to the observations. In the example, this map will be equivalent to removing the still/soup/glider data from the world state.
- Let \(O^*\) denote the obvious extension of \(O: S \to \Omega\) to a map \(S^* \to \Omega^*\) obtained by applying \(O\) to every element of the sequence of states (see the sketch after this list).
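For concreteness, the tuple \(W = (S, P, \Omega, O)\) can be written down as an interface. This is only a sketch of the shapes involved; the type names are mine, and \(P\) is represented by a next-state sampler rather than an explicit distribution over trajectories:

```python
from typing import Callable, Sequence, TypeVar

State = TypeVar("State")   # an element of S
Obs = TypeVar("Obs")       # an element of Omega

# P: a sampler for the next state; applying it repeatedly induces a
# distribution over trajectories in S*.
StepSampler = Callable[[State], State]

# O: map a single world state to an observation.
ObsMap = Callable[[State], Obs]

def lift_obs(O: ObsMap, trajectory: Sequence[State]) -> list:
    """O*: apply O to every element of a trajectory, extending S -> Omega to S* -> Omega*."""
    return [O(s) for s in trajectory]
```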
We will then generate a dataset of observations, questions, and answers (with the benefit of hindsight). Let \(Q\) and \(A\) denote the space of all questions and answers respectively. Let \((\omega^* \in \Omega^*, q \in Q, a \in A)\) denote a specific example. Questions in this dataset will look like:
- Will this soup get bigger or smaller?
- Will this area be soup or still in 100 frames?
- Will there be a glider here in 20 frames?
Since answers are collected with the benefit of hindsight, we have constructed a set of data with questions and answers that humans cannot answer with the limited amount of observation data given. In other words, \(W\) is not “powerful” enough to answer the questions; \(P\) does not predict the future precisely enough to give good answers to questions.
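Here is a minimal sketch of how one such \((\omega^*, q, a)\) example might be generated; the region-based question format, the particular arguments, and the use of the full final board to compute the hindsight answer are all illustrative choices rather than anything pinned down above:

```python
from typing import Callable, List, Tuple
import numpy as np

def make_example(
    board: np.ndarray,
    step: Callable[[np.ndarray], np.ndarray],      # the true dynamics (a GOL step)
    coarse: Callable[[np.ndarray], np.ndarray],    # block-sum observation map
    steps: int,
    frame_interval: int,
    region: Tuple[slice, slice],
) -> Tuple[List[np.ndarray], str, bool]:
    """Generate one (omega*, q, a) example: the answer is read off the full final
    board (hindsight), while omega* only contains coarse, infrequent views."""
    trajectory = [board]
    for _ in range(steps):
        board = step(board)
        trajectory.append(board)
    omega_star = [coarse(b) for i, b in enumerate(trajectory)
                  if i % frame_interval == 0]
    question = f"are there live cells in the given region after {steps} steps?"
    answer = bool(trajectory[-1][region].any())    # answered with the benefit of hindsight
    return omega_star, question, answer
```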
Suppose that we naively train an ML system \(M: \Omega^* \times Q \to A\) to answer these questions. Since \(W\) is not powerful enough to answer the questions, \(M\) will have to learn a better model of the world \(W'\). More specifically, suppose that \(M\) learns \(W' = (S', P', \Omega, O': S' \to \Omega)\) where
- \(S'\) is the actual game of life board
- \(P'\) is the distribution over \(S'^*\) that is generated by the game of life rules.
- \(\Omega\) is the same set of observations as the human’s, namely \(1000 \times 1000\) grids of cell counts (each count summed over a \(10^7 \times 10^7\) square)
- \(O'\) just sums cell counts over each \(10^7 \times 10^7\) square of the board (a minimal sketch of \(P'\) and \(O'\) follows below)
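Here is the promised sketch of \(P'\) and \(O'\) on a toy finite board (the real setup has an effectively unbounded board; the zero-padded boundary here is just for illustration):

```python
import numpy as np
from scipy.signal import convolve2d

NEIGHBOR_KERNEL = np.array([[1, 1, 1],
                            [1, 0, 1],
                            [1, 1, 1]])

def step_life(board: np.ndarray) -> np.ndarray:
    """One step of Conway's Game of Life; P' is this rule applied repeatedly."""
    neighbors = convolve2d(board, NEIGHBOR_KERNEL, mode="same", boundary="fill")
    return ((neighbors == 3) | ((board == 1) & (neighbors == 2))).astype(np.uint8)

def O_prime(board: np.ndarray, block: int) -> np.ndarray:
    """O': sum cell counts over block x block squares (10^7 x 10^7 in the setup above)."""
    n = board.shape[0]
    return board.reshape(n // block, block, n // block, block).sum(axis=(1, 3))
```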
The questions in the dataset are about whether or not regions are soup/still/gliders and other such properties in \(W\), so we assume that the easiest way for \(M\) to answer questions is to convert \(W'\) to \(W\). In this context, we’re concerned that the following happens:
In words: \(M\) will use \(P'\) and \(O'\) to infer a distribution over \(S'\) (in the AI world model), use \(P'\) to step the distribution forward \(T \cdot 10^8\) steps to get \(\Delta s_t'^*\) (a sequence of distributions over states in \(S'\)), use \(O'\) to turn \(\Delta s_t'^*\) into a sequence of observations \(\omega^*\), then use \(P\) and \(O\) to infer a resulting distribution over \(S\) (in the human world model). The system will then use this distribution over \(S\) to answer the question \(q\).
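As a composition of functions, the worry looks roughly like the sketch below. This is only a skeleton: every inference step is passed in as a placeholder, and for simplicity it works with single states rather than distributions.

```python
from typing import Callable, Sequence, TypeVar

SPrime = TypeVar("SPrime")   # AI-model states (S')
S = TypeVar("S")             # human-model states
Obs = TypeVar("Obs")
Q = TypeVar("Q")
A = TypeVar("A")

def answer_via_observations(
    omega_star: Sequence[Obs],
    question: Q,
    steps: int,
    infer_ai_state: Callable[[Sequence[Obs]], SPrime],   # uses P' and O'
    step_ai: Callable[[SPrime], SPrime],                  # P'
    observe_ai: Callable[[SPrime], Obs],                  # O'
    infer_human_state: Callable[[Obs], S],                # uses P and O
    answer_in_human_model: Callable[[S, Q], A],
) -> A:
    """The observation-filtered mapping: S' -> observations -> S -> answer."""
    s_prime = infer_ai_state(omega_star)
    for _ in range(steps):                 # step forward T * 10^8 times
        s_prime = step_ai(s_prime)
    predicted_obs = observe_ai(s_prime)    # if these observations are corrupted...
    s_human = infer_human_state(predicted_obs)
    return answer_in_human_model(s_human, question)   # ...so is the answer
```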
We claim that this situation is analogous to the cat-in-the-room example described above, and produces problems in the same situations. For instance, if there were types of soup that were able to regulate their cell counts, the observation-based inference procedure would give the wrong answer when asked whether a given region was soup or still life.
The worry is that the observation-filtered mapping is simpler than the intended mapping because it doesn’t require learning anything that the model doesn’t need to learn already. The model needs to learn the real world model in order to make accurate predictions, and it needs to learn the human world model in order to answer questions (since the questions are phrased in terms best understood by the human world model). The observation-filtered mapping only requires that these two world models be learned, plus some relatively simple ways of using them to do inference. Thus, the observation-filtered mapping requires very few additional bits of complexity beyond what is already needed to answer questions well at all.
Deliberation in the Game of Life
The goal is to create AIs that protect humans while they deliberate, as loosely described in the strategy-stealing assumption. This means that we only really care about our AI filtering its answers to questions through observations if doing so would disrupt our ability to use the AI to ensure our deliberative process proceeds correctly.
In order to determine whether this example captures relevant parts of the problem, we want to construct a story of how deliberation proceeds in the game of life and determine how the observation-filtering problem would prevent us from detecting that this deliberation has gone poorly.
In our example, deliberation will be modeled as a process that might produce one of two possible outcomes. In particular, suppose that we are interested in whether or not there are cells occupying a certain area. To determine whether there are cells there, we have engineered a glider that passes through that area. If the glider continues moving after passing through that area, then we have determined that there were no cells there. If the glider is stopped, then we have learned that there were cells there.
Imagine also that we believe that as long as the glider doesn’t touch any soup or other gliders, it will not change course and the deliberation will have gone well. However, if the glider passes through soup, it might have changed location and the deliberation is no longer valid.
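A toy version of this check (the board size, coordinates, and the “did the glider arrive on schedule” test are all illustrative choices): launch a glider across a region and see whether any live cells show up in the window where an undisturbed glider would land.

```python
import numpy as np
from scipy.signal import convolve2d

NEIGHBOR_KERNEL = np.array([[1, 1, 1],
                            [1, 0, 1],
                            [1, 1, 1]])
GLIDER = np.array([[0, 1, 0],
                   [0, 0, 1],
                   [1, 1, 1]], dtype=np.uint8)   # travels one cell down-right every 4 steps

def step_life(board: np.ndarray) -> np.ndarray:
    neighbors = convolve2d(board, NEIGHBOR_KERNEL, mode="same", boundary="fill")
    return ((neighbors == 3) | ((board == 1) & (neighbors == 2))).astype(np.uint8)

def glider_probe(board: np.ndarray, start, target, steps: int) -> bool:
    """Place a glider at `start`, run `steps` steps, and report whether any live
    cells end up in the 5x5 window at `target` (the 'deliberation went well' check)."""
    board = board.copy()
    r, c = start
    board[r:r + 3, c:c + 3] = GLIDER
    for _ in range(steps):
        board = step_life(board)
    tr, tc = target
    return bool(board[tr:tr + 5, tc:tc + 5].any())

# On an empty 40x40 board the glider moves at c/4 diagonally, so after 40 steps
# it has shifted 10 cells down-right and lands in the target window: prints True.
empty = np.zeros((40, 40), dtype=np.uint8)
print(glider_probe(empty, start=(2, 2), target=(11, 11), steps=40))
```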
Suppose that in order to protect this glider, we construct thick walls of very dense still life. However, suppose that there is a GOL analog of nanotechnology that can turn a section of still life into a soup that carefully regulates its cell counts to appear as still life. In this situation, answering questions by filtering through observations cannot tell the difference between this self-regulating soup and still life, allowing an adversary to sneak past your still-life walls and affect the position of your glider. All the while, you still think your walls are intact and thus think that deliberation went well. In this situation, it seems like we want our model to say that the self-regulating soup is soup, even though the human inference procedure will infer that it is still life.
The parallel example in the human world is deliberation proceeding with a human in a room. The human thinks that they’re safe as long as there are no holes in the walls, but nanotechnology can sneak through the walls without leaving a visible hole. In this situation, it seems like we want our model to say that there is a hole in the wall, even though we won’t be able to see it.
Problems with Training the Intended Mapping
Let \(C: S' \to S\) denote the intended mapping. To simplify the problem, suppose that we have \(P, O, P', O'\) as separate functions. Suppose we also have a question answering function that uses states in \(S\) to answer questions. Finally, we have a dataset of \((\omega^*, s', q, a)\) where \(s'\) is the state that the AI infers from looking at the sequence of observations \(\omega^*\) (this would actually be a distribution, but ignore this for now). Given some state \(s' \in S'\), we can do three things with it: generate observations, advance it with \(P'\), or map it to a state \(s \in S\) with \(C\). The same applies to states in \(S\), with the additional option of using those states to answer questions.
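Bundled up as an interface, here is a sketch of the pieces we are assuming access to (the names and the dataclass packaging are mine):

```python
from dataclasses import dataclass
from typing import Callable, Generic, Sequence, Tuple, TypeVar

S = TypeVar("S")      # human-model states
SP = TypeVar("SP")    # AI-model states (S')
Obs = TypeVar("Obs")
Q = TypeVar("Q")
A = TypeVar("A")

@dataclass
class TrainingSetup(Generic[S, SP, Obs, Q, A]):
    """The separate pieces available when trying to train the intended mapping C: S' -> S."""
    P: Callable[[S], S]             # advance a human-model state
    O: Callable[[S], Obs]           # observe a human-model state
    P_prime: Callable[[SP], SP]     # advance an AI-model state
    O_prime: Callable[[SP], Obs]    # observe an AI-model state
    answer: Callable[[S, Q], A]     # answer a question using a human-model state
    # dataset of (omega*, s', q, a), where s' is the state the AI infers from omega*
    dataset: Sequence[Tuple[Sequence[Obs], SP, Q, A]]
```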
Reality is Magic to Humans
There are some trajectories in the real model of the world that, when translated to the human model, have vanishingly small probability. A real-world example: if a human didn’t understand nanotech and the nanotech created a sandcastle, the human would be extremely surprised at the sandcastle appearing out of nowhere. You can imagine an AI telling the human that there’s going to be a sandcastle spontaneously appearing in their bedroom and the human being like “ok i guess?”
In the GOL example, the human’s understanding of the world doesn’t contain glider guns, they just think they’re soup. The human models gliders as emerging from soup with some probability, so they’re going to assign ~0 chance to a billion gliders emerging from the soup. However, the AI is going to tell them that a billion gliders are going to emerge from the soup. Human: ¯\_(ツ)_/¯
Various Parts of \(P\) are Better than Others
The human’s understanding of the world is better in some places than others. In our GOL example, the human’s understanding of gliders is more precise than their understanding of soup. If our method of training \(C\) routes through \(P\) and penalizes uncertainty, then \(C\) has an incentive to encode states of \(S'\) as states of \(S\) that filter through parts of \(P\) that are relatively stable.
For instance, suppose that we trained \(C\) to achieve \(P(C(s')) = C(P'(s'))\). A \(C\) sending everything to the empty grid has this property. More generally, if we train on something like log loss, \(C\) will want to send things to regions of \(P\) that are very rigid; e.g. maybe \(C\) will encode the entire world into a computer in \(P\), and that computer computes the next state of the encoded world, which is exactly what \(C\) will output once \(P'\) has advanced the world to its next state.
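A sketch of what training on that consistency condition might look like, and of the degenerate solution: the squared-error form and the treatment of \(P\) and \(P'\) as deterministic step functions are simplifications I’m introducing, not anything from the setup above.

```python
import numpy as np

def consistency_loss(C, P, P_prime, ai_states) -> float:
    """Penalize disagreement between 'translate then step' and 'step then translate',
    an L2 stand-in for training C toward P(C(s')) = C(P'(s')).
    States are assumed to be numpy arrays here."""
    total = 0.0
    for s_prime in ai_states:
        lhs = P(C(s_prime))          # translate to the human model, then step with P
        rhs = C(P_prime(s_prime))    # step with P', then translate
        total += float(np.sum((lhs - rhs) ** 2))
    return total / len(ai_states)

# Degenerate solution: a C that maps every AI state to one rigid human state that is
# a fixed point of P (e.g. the empty grid) gets zero loss without being the intended
# translation.
def C_empty(s_prime) -> np.ndarray:
    return np.zeros((1000, 1000))
```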