Artificially Intelligent

Any mimicry distinguishable from the original is insufficiently advanced.

  • Pre-Training + Fine-Tuning Favors Deception

    961 words

    Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Thanks to Evan Hubinger for helpful comments and discussion. Currently, to obtain models useful for some task X, models are pre-trained on some task Y, then fine-tuned on task X. For example, to obtain a model that...

  • Less Realistic Tales of Doom

    1263 words

    Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Realistic tales of doom must weave together many political, technical, and economic considerations into a single story. Such tales provide concrete projections but omit discussion of less probable paths to doom. To rectify this, here are some...

  • Making Markets Over Beliefs

    949 words

    Betting is an excellent way to improve the accuracy of your beliefs. For example: Me: “I think Alice is 5’11”.” Friend: “I’ll bet you $5 1:1 that it’s higher.” Translated, my friend means they’re willing to agree to a deal where I pay them $5 if Alice is taller than...

  • Agents Over Cartesian World Models

    8663 words

    Coauthored with Evan Hubinger Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Thanks to Adam Shimi, Alex Turner, Noa Nabeshima, Neel Nanda, Sydney Von Arx, Jack Ryan, and Sidney Hough for helpful discussion and comments. Abstract We analyze agents by supposing a Cartesian boundary between...

  • Intermittent Distillations #2

    2862 words

    Servant of Many Masters: Shifting priorities in Pareto-optimal sequential decision-making (Andrew Critch and Stuart Russell) Summary A policy (over some partially observable Markov decision process (POMDP)) is Pareto optimal with respect to two agents with different utility functions if it...

  • Meta-EA Needs Models

    1274 words

    Thanks to Kuhan Jeyapragasan, Michael Byun, Sydney Von Arx, Thomas Kwa, Jack Ryan, Adam Křivka, and Buck Shlegeris for helpful comments and discussion. Epistemic status: a bunch of stuff Sometimes I have conversations with people that go like this: Me: Feels like all the top people in EA would have...

  • RSS Feed

    16 words

    Multiple people have requested an RSS feed for my blog. I have created one here.

  • Transparency Trichotomy

    2062 words

    Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Introduction In Relaxed Adversarial Training for Inner Alignment, Evan presents a trichotomy of ways to understand a model M: Transparency via inspection: use transparency tools to understand M via inspecting the trained model. Transparency via training: incentivize...

  • Intermittent Distillations #1

    3166 words

    This is my low-budget version of Rohin’s Alignment Newsletter. A critique of pure learning and what artificial neural networks can learn from animal brains (Anthony M. Zador) Summary This paper points out that human learning...

  • Strong Evidence is Common

    289 words

    Portions of this are taken directly from Three Things I’ve Learned About Bayes’ Rule. One time, someone asked me what my name was. I said, “Mark Xu.” Afterward, they probably believed my name was “Mark Xu.” I’m guessing they would have happily accepted a bet at 20:1 odds that my...
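
The excerpts from “Making Markets Over Beliefs” and “Strong Evidence is Common” above both turn on the same arithmetic: betting odds imply probabilities, and Bayes’ rule in odds form says how much evidence it takes to reach them. Here is a minimal sketch of that arithmetic, assuming nothing from the posts beyond the quoted examples; the function names and the 1-in-a-million prior are illustrative, not taken from the posts.

```python
def break_even_probability(stake: float, payout: float) -> float:
    """Minimum probability at which risking `stake` to win `payout` has
    nonnegative expected value: p * payout - (1 - p) * stake >= 0."""
    return stake / (stake + payout)


def posterior_odds(prior_odds: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * likelihood_ratio


# A $5 bet at 1:1 odds is worth taking only if you assign at least 50%
# probability to the claim you are betting on.
print(break_even_probability(5, 5))        # 0.5

# Happily betting at 20:1 odds that "my name is Mark Xu" reflects a credence
# of at least 20/21, i.e. roughly 95%.
print(break_even_probability(20, 1))       # ~0.952

# If the prior odds that a stranger is named "Mark Xu" were, say, 1:1,000,000,
# reaching 20:1 posterior odds requires a likelihood ratio of about 2e7;
# that is the sense in which hearing someone state their name is strong evidence.
print(posterior_odds(1 / 1_000_000, 2e7))  # ~20.0
```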