Retrospective On My 2020 FHI Research Scholar Application
This is a research statement I wrote when applying to be a research scholar at the Future of Humanity Institute in September of 2020. My opinions about which research questions/projects are promising have shifted a lot since then, but I thought it might be interesting for people to see what I used to think. Inline retrospective comments are in brackets.
Concrete Questions
- What are plausible speed-up factors AI systems might cause in areas such as GPU/CPU architecture, compiler optimization, GPU/CPU heat management, developer productivity, etc.?
- [I still think this is an interesting question, but more of a detail in a larger argument about how quickly ML systems are going to loop in on themselves. I think trying to answer this question first is probably premature and you would want to think about the broader picture first.]
- Are mesa-optimization processes less likely to arise in systems trained by supervised learning?
- [I think this question is too broad to be a good research question. A better thing to try is probably producing concrete examples of generalization failures, similar to those produced in https://arxiv.org/pdf/2105.14111.pdf]
- Are there machine learning paradigms in which prosaic AI alignment can be made much easier that are competitive with the standard paradigm?
- [What a broad research question. It would probably take an entire field’s worth of effort to see whether a learning paradigm can be “made competitive.”]
- In particular, are there paradigms that make inner alignment failures less likely or easier to detect?
- [I think this question requires a much stronger conceptual understanding of inner alignment than is currently available.]
- Does forecaster calibration generalize across different domains and time horizons?
- [I think you could study this pretty easily if you had a bunch of superforecasters. I’m not sure how interested in it I am.]
- How sensitive is the claim “we are currently living in the most influential time in history” to different choices of prior?
- [I mostly don’t think the answer to this question is that sensitive to the choice of prior, because the evidence is pretty strong. For instance, current economic growth rates rule out extremely low probabilities of this being the most influential time.]
- What are effective ways to introduce students and/or professionals to longtermism?
- [Still think more people should be thinking about this.]
- If smaller versions of AI safety problems are encountered as AI systems become increasingly powerful, how likely is the AI community to solve them with robust solutions as opposed to patchwork solutions?
- [I’m still interested in better case studies. I think current interest is closer to “how likely is doom if we only have patchwork solutions?”]
- In general, how good is humanity at coordinating to avoid catastrophic events?
- [Not sure this question counts as “concrete”. Probably still interested in case studies.]
- Does evolutionary biology have any insights that are transferable to thinking about AI safety?
- [Still interested in reading an evolutionary biology textbook. Not sure it’s worth my time, but would be fun. I think alignment people make a lot of claims about evolution and I’m worried that no one really knows what they’re talking about.]
- What are benchmarks that would allow for the quantification of progress in machine learning interpretability?
- [Still seems like a good idea to think about.]
- How can the notion of causal abstraction be defined in a way that is intuitively satisfying?
- [Seems like a thorny philosophical question that you should try to skirt as much as possible. I think some of Scott Garrabrant’s work on similar things is interesting.]
- How much value drift has happened to historical institutions?
- [Still interesting to me. My sense is quite a bit, but not sure how to quantify. I think it’s probably best to operationalize this question in terms of financial instruments.]
- Is AI safety through debate a viable strategy for AI alignment?
- [Current guess is that it won’t work on its own in the worst case. See Debate update: Obfuscated arguments problem. Plausibly still has a place in some broader scheme.]
- What are plausible legal and financial instruments that might allow patient longtermists to save resources for use in the far future? How reliable are these instruments?
- [The Patient Philanthropy Fund exists now. I’m not sure if they know how historically precedented their success would be. I’m vaguely interested in knowing the answer but don’t think it’s that important.]
Concrete Projects
- Comprehensively gather information about CPU architectures, determining how much faster they could plausibly get when optimized end-to-end by AI systems. Progress could be made by determining how much slowdown is caused by the need to maintain backwards compatibility, to keep designs human-readable and human-writable, and to run general tasks quickly rather than only narrow ones.
- [I basically think you can approximate this by looking at how much faster GPUs get as a function of the R&D that goes into them and extrapolating, treating AI labor and human labor as interchangeable. A toy version of that extrapolation is sketched after this list.]
- Explore different machine learning paradigms, identifying key properties along which they can vary, e.g. implicit vs explicit prior, iterative vs numerical optimization, supervised vs unsupervised training, etc. Determine whether these properties are conducive to alignment and whether these paradigms could be competitive with the standard paradigm.
- [I think this project would basically involve trying to do complicated reasoning about a bunch of relatively shallow differences in ML paradigms, or else just concluding that we’re uncertain whether anything can be competitive with standard deep learning but that probably nothing that currently exists is.]
- Construct sets of arguments/analogies/diagrams explaining the core concepts of longtermism. Present to various groups and iterate to construct concise and compelling explanations of longtermism.
- [Still think this is a good idea.]
- Explore the history of fields like computer security, architecture, and transportation to determine the extent to which safety concerns are generally solved with robust vs patchwork solutions. Construct models of the various dynamics at play and determine the extent to which these models apply to AI safety.
- [I don’t think this is that cruxy for me anymore. I think one of the main difficulties in trying to do this would be to figure out what counted as “robust” versus “patchwork” and getting those concepts to mean anything at all with respect to AI safety.]
- Explore humanity’s history of cooperation successes and failures. Construct simple models of these events and simple models of worldviews like “humanity is good at cooperating” and “humanity is reckless”. Determine how predictive various worldviews are of humanity’s history.
- [I think this is trying to resolve surface-level features of deeper disagreements and won’t really make progress on the things anyone actually cares about.]
- Explore the history of religious institutions and determine the extent to which their values appear to have changed. Make guesses as to the extent that early religious institutions would approve of the current form such institutions take.
- [This still seems appealing to me, but I suspect it wouldn’t tell you that much, and the results wouldn’t be that interesting.]
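Returning to the first project above (end-to-end chip optimization): here is a minimal sketch of the extrapolation mentioned in my comment there, assuming a simple power-law relationship between cumulative R&D effort and performance. Every number below is a made-up placeholder, and `ai_labor_multiplier` is a parameter I invented for illustration, not an estimate.

```python
import numpy as np

# Toy extrapolation: fit log(performance) ≈ a + b * log(cumulative R&D), then
# project forward with an AI-driven multiplier on effective R&D labor.
# All numbers are placeholders, not real GPU/CPU or R&D figures.
cumulative_rd = np.array([1.0, 2.0, 4.0, 8.0, 16.0])  # placeholder effort units
performance = np.array([1.0, 1.8, 3.1, 5.5, 9.8])     # placeholder relative speedups

b, a = np.polyfit(np.log(cumulative_rd), np.log(performance), 1)

# Treat AI labor and human labor as interchangeable: AI assistance multiplies
# the effective R&D effort invested over the next period.
ai_labor_multiplier = 10.0
next_period_effort = 16.0                             # placeholder human-only effort
future_rd = cumulative_rd[-1] + ai_labor_multiplier * next_period_effort
projected = np.exp(a + b * np.log(future_rd))

print(f"fitted returns exponent b ≈ {b:.2f}")
print(f"projected speedup after one AI-accelerated period: {projected:.1f}x")
```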
Project Descriptions
Interpretability as Construction
[Still like this project. Chris Olah has done an initial version here. I think “finishing the job” and getting the handmade circuits to achieve better loss than the original would be a good starting project for someone trying to do interpretability. I’m also interested in people doing the same for language models.]
A promising way to make progress in prosaic AI safety is to develop techniques that make complicated machine learning systems easier to understand, such as the work done by the OpenAI Clarity team. Such research raises interesting philosophical questions about what it means to understand something.
Instead of trying to construct definitions of understanding, we can focus on the abilities that understanding a system brings. One possible statement of this type is, “if you understand a system, you know how to design a similar system”. Related to this are notions like “if you understand a subject, you should be able to teach it” and “if you understand an algorithm, you should be able to implement it”.
Accordingly, a meaningful question that can be asked of current interpretability techniques is “to what extent do these techniques allow me to hand-design a system that achieves similar performance?” If someone claimed to understand quicksort, but was unable to implement the algorithm, we might be skeptical. If someone claims to understand a machine learning system but is unable to construct a similar system, then we might be similarly skeptical.
A potential concrete implementation of this project is as follows:
- Train a simple neural network on a simple dataset, e.g. MNIST or Fashion-MNIST.
- Use feature visualization techniques (e.g. those being developed at OpenAI) to gain an understanding of the low-level features the network has learned.
- Construct a classifier based on hand-engineered features informed by the results of the preceding visualizations.
- Compare the performance of the constructed system against the trained system.
- If the performances differ substantially, investigate why and use this to determine places in which current feature visualization techniques are deficient.
Attempting to hand-engineer features can be likened to “old school” computer vision, where dogs were distinguished from cats via specialized features constructed by domain experts. It thus seems difficult to hand-engineer features that achieve performance similar to modern ML techniques. However, determining the exact ways such hand-engineered features fail should provide useful insight into the ways that current interpretability techniques are insufficient.
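Below is a minimal sketch of the comparison step, using scikit-learn’s small 8x8 digits dataset as a stand-in for MNIST. The hand-engineered features here (row sums, column sums, left-right asymmetry) are crude placeholders for what feature visualization of the trained network would actually suggest; the point is only the structure of the comparison.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Small 8x8 digits dataset as a stand-in for MNIST.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 1. Train a small neural network end-to-end on raw pixels.
net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0)
net.fit(X_train, y_train)

# 2. Build a classifier on crude hand-engineered features. In the real project
#    these would be informed by feature visualizations of the trained network.
def hand_features(X):
    imgs = X.reshape(-1, 8, 8)
    row_sums = imgs.sum(axis=2)                                   # 8 features
    col_sums = imgs.sum(axis=1)                                   # 8 features
    asymmetry = np.abs(imgs - imgs[:, :, ::-1]).sum(axis=(1, 2))  # 1 feature
    return np.column_stack([row_sums, col_sums, asymmetry])

handmade = LogisticRegression(max_iter=2000)
handmade.fit(hand_features(X_train), y_train)

# 3. Compare; the gap indicates what the hand-engineered "understanding" misses.
print(f"trained network accuracy:         {net.score(X_test, y_test):.3f}")
print(f"hand-engineered feature accuracy: {handmade.score(hand_features(X_test), y_test):.3f}")
```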
How Choice of Prior Influences Influence
[I think this project is bad because most of the disagreement probably doesn’t come from the prior. I think a better version of this project would resemble Holden’s “most important century” blog post series.]
A central question for longtermist altruists is whether or not the current time period is one of the most influential times in history. If a longtermist thinks now is particularly influential, they should spend more resources now. Otherwise, they might instead save their resources and plan to spend them in the future.
There is reasonable disagreement as to whether the current century is likely to be the most influential. Some of this disagreement can be traced to disagreements about the likelihood of particular existential risks. However, large portions of this disagreement can also be traced to the prior probability one assigns to this century being particularly influential. Ideally, one would use a “maximum ignorance” prior; however, “maximum ignorance” can be operationalized in multiple ways. In particular, the probabilities given by uniform versus Laplacian priors differ by multiple orders of magnitude.
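As a toy illustration of how much the operationalization matters, the sketch below compares a uniform prior over an assumed total number of centuries with a simple front-loaded 1/t weighting (my own stand-in, not the specific Laplacian prior from the literature), under made-up horizon numbers:

```python
import math

EULER_GAMMA = 0.5772  # for approximating harmonic sums

def p_now_most_influential(n_total, t_now, weighting="uniform"):
    # Toy prior probability that century t_now is the most influential of
    # n_total centuries. "uniform" treats every century alike; "front_loaded"
    # weights century t by 1/t (earlier centuries can shape all later ones).
    if weighting == "uniform":
        return 1.0 / n_total
    harmonic = math.log(n_total) + EULER_GAMMA  # ≈ sum of 1/t for t = 1..n_total
    return (1.0 / t_now) / harmonic

# Placeholder numbers: treat "now" as roughly century 2,000 of human history,
# under several assumed total horizons.
for n_total in (10_000, 1_000_000, 100_000_000):
    u = p_now_most_influential(n_total, 2_000, "uniform")
    f = p_now_most_influential(n_total, 2_000, "front_loaded")
    print(f"horizon {n_total:>11,} centuries: uniform {u:.1e}, front-loaded {f:.1e}")
```

The answers differ by several orders of magnitude depending on which weighting and horizon you pick, which is the core difficulty.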
A potential concrete implementation of this project is as follows:
- Identify plausible priors (possibly by surveying the relevant literature). Examples are uniform and Laplacian priors. One could also attempt to express more complicated “priors” like “the chance a given time is influential is related to economic output” or “earlier times are more likely to be influential because the past can influence the future but not vice versa”.
- Determine a set of questions similar to “is this time one of the most influential?” along various dimensions. Examples of such questions: “when will humanity go to the Moon?” and “when will a single species achieve global dominance?”
- “Grade” priors based on how well they answer these reference questions (a toy grading sketch follows this list).
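A minimal sketch of the grading step, assuming we score each candidate prior by the log probability it assigned to when each reference event actually happened. The reference questions and period counts below are rough placeholders I made up for illustration, not researched figures:

```python
import numpy as np

# Each question: (number of periods considered, 1-based index of the period in
# which the event actually occurred). Counts are rough placeholders.
reference_questions = {
    "humans reach the Moon (centuries of recorded history)": (60, 60),
    "one species attains global dominance (centuries of Homo sapiens)": (3_000, 2_900),
}

def uniform_prior(n):
    return np.full(n, 1.0 / n)

def front_loaded_prior(n):
    w = 1.0 / np.arange(1, n + 1)
    return w / w.sum()

def log_score(prior_fn, questions):
    # Higher (less negative) is better: total log probability the prior
    # assigned to the periods in which the events actually occurred.
    return sum(np.log(prior_fn(n)[idx - 1]) for n, idx in questions.values())

for name, fn in [("uniform", uniform_prior), ("front-loaded", front_loaded_prior)]:
    print(f"{name:12s} log score: {log_score(fn, reference_questions):.2f}")
```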
The problem of prior selection is very difficult, so it seems unlikely that this project will draw any definitive conclusions. Choice of prior might also involve various anthropic arguments, which are generally known to be very confusing. Progress might be made by locating statements like “if you believe X, you must also believe Y”, which would allow intuitions to be refined. Such statements might be located by connecting this problem to known problems in decision theory, like the two-envelope problem or the Sleeping Beauty problem.