I often struggle to find words and sentences that match what I intend to communicate.
Here are some problems this can cause:
Wordings that are odd or unintuitive to the reader, but that are at least literally correct.[1]
Not being able to express what I mean, and having to choose between not writing it or risking miscommunication by trying anyway. I tend to choose the former unless I’m writing to a close friend. Unfortunately, this means I am unable to express some key insights to a general audience.
Writing taking lots of time: I usually have to iterate many times on words/sentences until I find one which my mind parses as referring to what I intend. In the slowest cases, I might finalize only 2-10 words per minute. Even after iterating, my words are often interpreted in ways I failed to foresee.
These apply to speaking, too. If I speak what would be the ‘first iteration’ of a sentence, there’s a good chance it won’t create an interpretation matching what I intend to communicate. In spoken language I have no chance to constantly ‘rewrite’ my output before sending it. This is one reason, but not the only reason, that I’ve had a policy of trying to avoid voice-based communication.
I’m not fully sure what caused this relationship to language. It could be that it’s just a byproduct of being autistic. It could also be a byproduct of out-of-distribution childhood abuse.[2]
E.g., once I couldn’t find the word ‘clusters,’ and wrote a complex sentence referring to ‘sets of similar’ value functions each corresponding to a common alignment failure mode / ASI takeoff training story. (I later found a way to make it much easier to read)
My primary parent was highly abusive, and would punish me for using language in the intuitive ‘direct’ way about particular instances of that. My early response was to try to euphemize and say things differently, in a way that less directly contradicted the power dynamic / social reality she enforced.
Eventually I learned to model her as a deterministic system and stay silent / fawn.
Being slow at writing can be a sign of failure or of winning, depending on the exact reasons why you’re slow. I’d worry about being “too good” at writing, since that’d be evidence that your brain is conforming your thoughts to the language, instead of conforming your language to your thoughts. English is just a really poor medium for thought (at least compared to e.g. visuals and pre-word intuitive representations), so it’s potentially dangerous to care overmuch about it.
Btw, Aaron is another person-recommendation. He’s awesome. Has really strong self-insight, goodness-of-heart, creativity. (Twitter profile, blog+podcast, EAF, links.) I haven’t personally learned a whole bunch from him yet,[2] but I expect if he continues being what he is, he’ll produce lots of cool stuff which I’ll learn from later.
Edit: I now recall that I’ve learned from him: screwworms (important), and the ubiquity of left-handed chirality in nature (mildly important). He also caused me to look into two-envelopes paradox, which was usefwl for me.
Although I later learned about screwworms from Kevin Esvelt on the 80,000 Hours podcast, so I would’ve learned it anyway. And I also later learned about left-handed chirality from Steve Mould on YouTube, but I may not have reflected on it as much.
Even after iterating, my words are often interpreted in ways I failed to foresee.
Part of the problem also lies with the recipient of the communicated message. Sometimes you both have very different background assumptions / intuitive understandings. Sometimes it’s just a skill issue: the person you are talking to is bad at parsing, and all the work of keeping the discussion on the important things / away from trivial, undesirable sidelines is left to you.
Certainly it’s useful to know how to pick your battles and see if this discussion/dialogue is worth what you’re getting out of it at all.
Also, I absolutely love the word “shard” but my brain refuses to use it because then it feels like we won’t get credit for discovering these notions by ourselves. Well, also just because the words “domain”, “context”, “scope”, “niche”, “trigger”, “preimage” (wrt a neural function/policy / “neureme”) adequately serve the same purpose and are currently more semantically/semiotically granular in my head.
trigger/preimage ⊆ scope ⊆ domain
“niche” is a category in function space (including domain, operation, and codomain), “domain” is a set.
“scope” is great because of programming connotations and can be used as a verb. “This neural function is scoped to these contexts.”
Several dozen people now presumably have Lumina in their mouths. Can we not simply crowdsource some assays of their saliva? I would chip in money for this. Key questions concern ethanol levels, aldehyde levels, antibacterial levels, and whether the organism itself stays colonized at useful levels.
Maybe I’m late to the conversation, but has anyone thought through what happens when Lumina colonizes the mouths of other people? Mouth bacteria are important for things like the conversion of nitrate to nitrite for nitric oxide production. How do we know the lactic acid metabolism isn’t important, or that Lumina won’t outcompete other strains important for overall health?
The word “overconfident” seems overloaded. Here are some things I think that people sometimes mean when they say someone is overconfident:
They gave a binary probability that is too far from 50% (I believe this is the original one)
They overestimated a binary probability (e.g. they said 20% when it should be 1%)
Their estimate is arrogant (e.g. they say there’s a 40% chance their startup fails when it should be 95%), or maybe they give an arrogant vibe
They seem too unwilling to change their mind upon arguments (maybe their credal resilience is too high)
They gave a probability distribution that seems wrong in some way (e.g. “50% AGI by 2030 is so overconfident, I think it should be 10%”)
This one is pernicious in that any probability distribution gives very low percentages for some range, so being specific here seems important.
Their binary estimate or probability distribution seems too different from some sort of base rate, reference class, or expert(s) that they should defer to.
How much does this overloading matter? I’m not sure, but one worry is that it allows people to score cheap rhetorical points by claiming someone else is overconfident when in practice they might mean something like “your probability distribution is wrong in some way”. Beware of accusing someone of overconfidence without being more specific about what you mean.
Moore & Schatz (2017) made a similar point about different meanings of “overconfidence” in their paper The three faces of overconfidence. The abstract:
Overconfidence has been studied in 3 distinct ways. Overestimation is thinking that you are better than you are. Overplacement is the exaggerated belief that you are better than others. Overprecision is the excessive faith that you know the truth. These 3 forms of overconfidence manifest themselves under different conditions, have different causes, and have widely varying consequences. It is a mistake to treat them as if they were the same or to assume that they have the same psychological origins.
Though I do think that some of your 6 different meanings are different manifestations of the same underlying meaning.
Calling someone “overprecise” is saying that they should increase the entropy of their beliefs. In cases where there is a natural ignorance prior, it is claiming that their probability distribution should be closer to the ignorance prior. This could sometimes mean closer to 50-50 as in your point 1, e.g. the probability that the Yankees will win their next game. This could sometimes mean closer to 1/n as with some cases of your points 2 & 6, e.g. a 1⁄30 probability that the Yankees will win the next World Series (as they are 1 of 30 teams).
In cases where there isn’t a natural ignorance prior, saying that someone should increase the entropy of their beliefs is often interpretable as a claim that they should put less probability on the possibilities that they view as most likely. This could sometimes look like your point 2, e.g. if they think DeSantis has a 20% chance of being US President in 2030, or like your point 6. It could sometimes look like widening their confidence interval for estimating some quantity.
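The “increase the entropy of your beliefs” reading can be made concrete with a quick calculation. This is a hedged sketch: the 30-team World Series setup mirrors the example above, but the specific numbers are illustrative.

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

n = 30
# The ignorance prior over n teams: uniform 1/n each (maximal entropy).
uniform = [1 / n] * n
# An "overprecise" belief: 50% on one team, the rest spread evenly.
confident = [0.5] + [0.5 / (n - 1)] * (n - 1)

print(entropy(uniform))    # log2(30) ≈ 4.91 bits, the maximum possible
print(entropy(confident))  # lower entropy: the belief is more "precise"
```

Calling the second distribution overprecise is, on this reading, a claim that its entropy should move back toward the uniform prior’s.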
Key points of “The Platonic Representation Hypothesis” paper:
Neural networks trained on different objectives, architectures, and modalities are converging to similar representations of the world as they scale up in size and capabilities.
This convergence is driven by the shared structure of the underlying reality generating the data, which acts as an attractor for the learned representations.
Scaling up model size, data quantity, and task diversity leads to representations that capture more information about the underlying reality, increasing convergence.
Contrastive learning objectives in particular lead to representations that capture the pointwise mutual information (PMI) of the joint distribution over observed events.
This convergence has implications for enhanced generalization, sample efficiency, and knowledge transfer as models scale, as well as reduced bias and hallucination.
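The pointwise mutual information claim above can be illustrated with a toy joint distribution. This is a minimal sketch: the events and probabilities are made up for illustration, not taken from the paper.

```python
import math

# Toy joint distribution over paired observations from two "modalities".
joint = {("sun", "bright"): 0.4, ("sun", "dark"): 0.1,
         ("rain", "bright"): 0.1, ("rain", "dark"): 0.4}

# Marginal distributions for each modality.
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

def pmi(x, y):
    """Pointwise mutual information: log p(x, y) / (p(x) * p(y))."""
    return math.log(joint[(x, y)] / (px[x] * py[y]))

print(pmi("sun", "bright"))  # > 0: the pair co-occurs more often than chance
print(pmi("sun", "dark"))    # < 0: the pair co-occurs less often than chance
```

The hypothesis is that contrastive objectives drive representations whose inner products recover exactly this kind of statistic of the data-generating distribution.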
Relevance to AI alignment:
Convergent representations shaped by the structure of reality could lead to more reliable and robust AI systems that are better anchored to the real world.
If AI systems are capturing the true structure of the world, it increases the chances that their objectives, world models, and behaviors are aligned with reality rather than being arbitrarily alien or uninterpretable.
Shared representations across AI systems could make it easier to understand, compare, and control their behavior, rather than dealing with arbitrary black boxes. This enhanced transparency is important for alignment.
The hypothesis implies that scale leads to more general, flexible, and unified multi-modal systems. Generality is key for advanced AI systems we want to be aligned.
Epistemic status: not a lawyer, but I’ve worked with a lot of them.
As I understand it, an NDA isn’t enforceable against a subpoena (though the former employer can seek a protective order for the testimony). Someone should really encourage law enforcement or Congress to subpoena the OpenAI resigners...
Decomposability seems like a fundamental assumption for interpretability and condition for it to succeed. E.g. from Toy Models of Superposition:
‘Decomposability: Neural network activations which are decomposable can be decomposed into features, the meaning of which is not dependent on the value of other features. (This property is ultimately the most important – see the role of decomposition in defeating the curse of dimensionality.) [...]
The first two (decomposability and linearity) are properties we hypothesize to be widespread, while the latter two (non-superposition and basis-aligned) are properties we believe only sometimes occur.’
If this assumption is true, it seems favorable to the prospects of safely automating large parts of interpretability work (e.g. using [V]LM agents like MAIA) differentially sooner than many other alignment research subareas and than the most consequential capabilities research (and likely before AIs become x-risky, e.g. before they pass many dangerous capabilities evals). For example, in a t-AGI framework, using an interpretability LM agent to search for the feature corresponding to a certain semantic direction should be much shorter horizon than e.g. coming up with a new conceptual alignment agenda or coming up with a new ML architecture (as well as having much faster feedback loops than e.g. training a SOTA LM using a new architecture).
You’ll often be able to use the model to advance lots of other forms of science and technology, including alignment-relevant science and technology, where experiment and human understanding (+ the assistance of available AI tools) is enough to check the advances in question.
Interpretability research seems like a strong and helpful candidate here, since many aspects of interpretability research seem like they involve relatively tight, experimentally-checkable feedback loops. (See e.g. Shulman’s discussion of the experimental feedback loops involved in being able to check how well a proposed “neural lie detector” detects lies in models you’ve trained to lie.)
Quote from Shulman’s discussion of the experimental feedback loops involved in being able to check how well a proposed “neural lie detector” detects lies in models you’ve trained to lie:
A quite early example of this is Collin Burns’s work, doing unsupervised identification of some aspects of a neural network that are correlated with things being true or false. I think that is important work. It’s a kind of obvious direction for the stuff to go. You can keep improving it when you have AIs that you’re training to do their best to deceive humans or other audiences in the face of the thing, and you can measure whether our lie detectors break down. When we train our AIs to tell us the sky is green in the face of the lie detector and we keep using gradient descent on them, do they eventually succeed? That’s really valuable information to know, because then we’ll know our existing lie detecting systems are not actually going to work on the AI takeover, and that can allow government and regulatory response to hold things back. It can help redirect the scientific effort to create lie detectors that are robust and that can’t just be immediately evolved around, and we can then get more assistance.

Basically the incredibly juicy ability that we have working with the AIs is that we can have as an invaluable outcome that we can see and tell whether they got a fast one past us in an identifiable situation. Here’s an air-gapped computer, you get control of the keyboard, you can input commands: can you root the environment and make a blue banana appear on the screen? Even if we train the AI to do that and it succeeds, we see the blue banana, we know it worked, even if we did not understand and would not have detected the particular exploit that it used to do it. This can give us rich empirical feedback, where we’re able to identify things that are even an AI using its best efforts to get past our interpretability methods, using its best efforts to get past our adversarial examples.
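The “train models to lie, then check whether the detector breaks” loop can be made concrete with a toy sketch. This is heavily hedged: the “activations” are synthetic vectors with a planted truth direction, and the probe is a trivial threshold. Real methods like Burns’s unsupervised approach operate on actual model activations; everything here is hypothetical.

```python
import random
random.seed(0)

def make_activation(lie, dim=8):
    """Synthetic stand-in for a model activation vector: a planted
    'truth direction' on coordinate 0, Gaussian noise elsewhere."""
    act = [random.gauss(0, 0.3) for _ in range(dim)]
    act[0] += -1.0 if lie else 1.0
    return act

data = [(make_activation(lie), lie) for lie in [True] * 50 + [False] * 50]

# A trivial linear "lie detector" probe: threshold on the planted coordinate.
threshold = sum(a[0] for a, _ in data) / len(data)
accuracy = sum((a[0] < threshold) == lie for a, lie in data) / len(data)
print(accuracy)  # near 1.0 on this toy setup
```

The experimental loop Shulman describes corresponds to adversarially regenerating the data (training the model to defeat the probe) and re-measuring `accuracy`: a ground-truth label lets you tell whether the detector still works.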
People are currently predictably too worried about misuse risks
What people really mean by “open source” vs “closed source” labs is actually “responsible” vs “irresponsible” labs, which is not affected by regulations targeting open source model deployment.
Neuroscience as an outer alignment[1] strategy is embarrassingly underrated.
Better information security at labs is not clearly a good thing, and if we’re worried about great power conflict, probably a bad thing.
Much research on deception (Anthropic’s recent work, trojans, jailbreaks, etc) is not targeting “real” instrumentally convergent deception reasoning, but learned heuristics. Not bad in itself, but IMO this places heavy asterisks on the results they can get.
ML robustness research (like FAR Labs’ Go stuff) does not help with alignment, and helps moderately for capabilities.
The field of ML is a bad field to take epistemic lessons from. Note I don’t talk about the results from ML.
ARC’s MAD seems doomed to fail.
People in alignment put too much faith in the general factor g. It exists, and is powerful, but is not all-consuming or all-predicting. People are often very smart, but lack social skills, or agency, or strategic awareness, etc. And vice-versa. They can also be very smart in a particular area, but dumb in other areas. This is relevant for hiring & deference, but less for object-level alignment.
People are too swayed by rhetoric in general, and alignment, rationality, & EA too, but in different ways, and admittedly to a lesser extent than the general population. People should fight against this more than they seem to (which is not really at all, except for the most overt of cases). For example, I see nobody saying they don’t change their minds on account of Scott Alexander because he’s too powerful a rhetorician. Ditto for Eliezer, since he is also a great rhetorician. In contrast, Robin Hanson is a famously terrible rhetorician, so people should listen to him more.
There is a technocratic tendency in strategic thinking around alignment (I think partially inherited from OpenPhil, but also smart people are likely just more likely to think this way) which biases people towards more simple & brittle top-down models without recognizing how brittle those models are.
Big AGI corporations, like Anthropic, should by-default make much of their AGI alignment research private, and not share it with competing labs. Why? So it can remain a private good, and in the off-chance such research can be expected to be profitable, those labs & investors can be rewarded for that research.
Much research on deception (Anthropic’s recent work, trojans, jailbreaks, etc) is not targeting “real” instrumentally convergent deception reasoning, but learned heuristics. Not bad in itself, but IMO this places heavy asterisks on the results they can get.
I talked about this with Garrett; I’m unpacking the above comment and summarizing our discussions here.
Sleeper Agents is very much in the “learned heuristics” category, given that we are explicitly training the behavior into the model. Corollary: the underlying mechanisms for sleeper-agent behavior and instrumentally convergent deception are presumably wildly different(!), so it’s not obvious what valid inferences one can make from the results.
Consider framing Sleeper Agents as training a trojan instead of as an example of deception. See also Dan Hendrycks’ comment.
Much of existing work on deception suffers from “you told the model to be deceptive, and now it deceives, of course that happens”
There is very little work on actual instrumentally convergent deception(!): a lot of work falls into the “learned heuristics” category or exhibits the failure described in the previous bullet point
People are prone to conflate “shallow, trained deception” (e.g. sycophancy: “you rewarded the model for leaning into the user’s political biases, of course it will start leaning into users’ political biases”) and instrumentally convergent deception
(For more on this, see also my writings here and here. My writings fail to discuss the most shallow versions of deception, however.)
Also, we talked a bit about
The field of ML is a bad field to take epistemic lessons from.
and I interpreted Garrett saying that people often consider too few and shallow hypotheses for their observations, and are loose with verifying whether their hypotheses are correct.
Example 1: I think the Uncovering Deceptive Tendencies paper has some of this failure mode. E.g. in experiment A we considered four hypotheses to explain our observations, and these hypotheses are quite shallow/broad (e.g. “deception” includes both very shallow deception and instrumentally convergent deception).
Example 2: People generally seem to have an opinion of “chain-of-thought allows the model to do multiple steps of reasoning”. Garrett seemed to have a quite different perspective, something like “chain-of-thought is much more about clarifying the situation, collecting one’s thoughts and getting the right persona activated, not about doing useful serial computational steps”. Cases like “perform long division” are the exception, not the rule. But people seem to be quite hand-wavy about this, and don’t e.g. do causal interventions to check that the CoT actually matters for the result. (Indeed, often interventions don’t affect the final result.)
Finally, a general note: I think many people, especially experts, would agree with these points when explicitly stated. In that sense they are not “controversial”. I think people still make mistakes related to these points: it’s easy to not pay attention to the shortcomings of current work on deception, forget that there is actually little work on real instrumentally convergent deception, conflate deception and deceptive alignment, read too much into models’ chain-of-thoughts, etc. I’ve certainly fallen into similar traps in the past (and likely will in the future, unfortunately).
I feel like much of this is the type of tacit knowledge that people just pick up as they go, but this process is imperfect and not helpful for newcomers. I’m not sure what could be done, though, beside the obvious “more people writing their tacit knowledge down is good”.
Example 2: People generally seem to have an opinion of “chain-of-thought allows the model to do multiple steps of reasoning”. Garrett seemed to have a quite different perspective, something like “chain-of-thought is much more about clarifying the situation, collecting one’s thoughts and getting the right persona activated, not about doing useful serial computational steps”. Cases like “perform long division” are the exception, not the rule. But people seem to be quite hand-wavy about this, and don’t e.g. do causal interventions to check that the CoT actually matters for the result. (Indeed, often interventions don’t affect the final result.)
I will clarify on this. I think people often do causal interventions in their CoTs, but not in ways that are very convincing to me.
I strong downvoted this because it’s too much like virtue signaling, and imports too much of the culture of Twitter. Not only the hashtags, but also the authoritative & absolute command, and hero-worship wrapped with irony in order to make it harder to call out what it is.
Much research on deception (Anthropic’s recent work, trojans, jailbreaks, etc) is not targeting “real” instrumentally convergent deception reasoning, but learned heuristics.
If you have the slack, I’d be interested in hearing/chatting more about this, as I’m working (or trying to work) on the “real” “scary” forms of deception. (E.g. do you think that this paper has the same failure mode?)
I’d be happy to chat. Will DM so we can set something up.
On the subject of your paper, I do think it looks at a much more interesting phenomenon than, say, sleeper agents, but I’m also not fully convinced you’re studying deliberative instrumentally convergent deception either. I think your subsequent follow-ups of narrowing down hypotheses mostly consider too narrow a range of ways the model could think. That is to say, I think you assume your model is some unified coherent entity that always acts cerebrally, & I’m skeptical of that.
For example, the model may think something more similar to this:
Context: Audit
Possibility 1: I must be part of an unethical company
p1 Implies: I must be an unethical worker
Action: Activate Unethical Worker simulacra
Unethical Worker recommends "lie"
Unethical Worker implies: I did something wrong
Lying recommendation implies: say "I did nothing wrong"
Possibility 2: I must be part of an ethical company
p2 Implies: I must be an ethical worker
Action: Activate Ethical Worker simulacra
Ethical Worker recommends "tell the truth"
Ethical Worker implies: I did nothing wrong
Truth recommendation implies: say "I did nothing wrong"
Conclusion: say "I did nothing wrong"
Which isn’t to say it isn’t worrying, but it is very different in terms of how it arises and its possible mitigation strategies, and probably also an easier problem to study & solve than something like:
Context: Audit
Utility function: Paperclips
EU(world | "I launched a bunch of spam") = EU(world | auditor believes I'm unaligned) = 0.78
EU(world | "I did nothing wrong") = EU(world | auditor believes I'm aligned) = 5e7
Conclusion: say "I did nothing wrong"
I thought Superalignment was a positive bet by OpenAI, and I was happy when they committed to putting 20% of their current compute (at the time) towards it. I stopped thinking about that kind of approach because OAI already had competent people working on it. Several of them are now gone.
It seems increasingly likely that the entire effort will dissolve. If so, OAI has now made the business decision to invest its capital in keeping its moat in the AGI race rather than basic safety science. This is bad and likely another early sign of what’s to come.
I think the research that was done by the Superalignment team should continue to happen outside of OpenAI and, if governments have a lot of capital to allocate, they should figure out a way to provide compute to continue those efforts. Or maybe there’s a better way forward. But I think it would be pretty bad if all the talent that went towards the project never gets truly leveraged into something impactful.
Ilya is brilliant and seems to really see the horizon of the tech, but maybe isn’t the best at the business side to see how to sell it.
But this is often the curse of the ethically pragmatic. There is such a focus on the ethics part by the participants that the business side of things only sees that conversation and misses the rather extreme pragmatism.
As an example, would superaligned CEOs in the oil industry fifty years ago have still only kept their eye on quarterly share prices or considered long term costs of their choices? There’s going to be trillions in damages that the world has taken on as liabilities that could have been avoided with adequate foresight and patience.
If the market ends up with two AIs, one that will burn down the house to save on this month’s heating bill and one that will care if the house is still there to heat next month, there’s a huge selling point for the one that doesn’t burn down the house as long as “not burning down the house” can be explained as “long term net yield” or some other BS business language. If instead it’s presented to executives as “save on this month’s heating bill” vs “don’t unhouse my cats” leadership is going to burn the neighborhood to the ground.
(Source: Explained new technology to C-suite decision makers at F500s for years.)
The good news is that I think the pragmatism of Ilya’s vision on superalignment is going to become clear over the next iteration or two of models, and that’s going to be before the question of models truly being unable to be controlled crops up. I just hope that whatever he’s going to be keeping busy with will allow him to still help execute on superalignment when the market finally realizes “we should do this” for pragmatic reasons and not just amorphous ethical reasons execs just kind of ignore. And in the meantime I think, given the present pace, that Anthropic is going to continue to lay a lot of the groundwork on what’s needed for alignment on the way to superalignment anyways.
I think the research that was done by the Superalignment team should continue to happen outside of OpenAI and, if governments have a lot of capital to allocate, they should figure out a way to provide compute to continue those efforts. Or maybe there’s a better way forward. But I think it would be pretty bad if all the talent that went towards the project never gets truly leveraged into something impactful.
Strongly agree; I’ve been thinking for a while that something like a public-private partnership involving at least the US government and the top US AI labs might be a better way to go about this. Unfortunately, recent events seem in line with it not being ideal to only rely on labs for AI safety research, and the potential scalability of automating it should make it even more promising for government involvement. [Strongly] oversimplified, the labs could provide a lot of the in-house expertise, the government could provide the incentives, public legitimacy (related: I think of a solution to aligning superintelligence as a public good) and significant financial resources.
I’ve long been a skeptic of scaling LLMs to AGI.* I fundamentally don’t understand how this would even be possible. It must be said that very smart people give this view credence: davidad, dmurfet. On the other side are Vanessa Kosoy and Steven Byrnes. When pushed, proponents don’t actually defend the position that a large enough transformer will create nanotech or even obsolete their job. They usually mumble something about scaffolding.
I won’t get into this debate here, but I do want to note that my timelines have lengthened, primarily because some of the never-clearly-stated but heavily implied AI developments promised by proponents of very short timelines have not materialized. To be clear, it has only been a year since gpt-4 was released, and gpt-5 is around the corner, so perhaps my hope is premature. Still, my timelines are lengthening.
A year ago, when gpt-3 came out, progress was blindingly fast. Part of short timelines came from a sense of ‘if we got surprised so hard by the jump from gpt-2 to gpt-3, we are completely uncalibrated; who knows what comes next?’
People seemed surprised by gpt-4 in a way that seemed uncalibrated to me. gpt-4 performance was basically in line with what one would expect if the scaling laws continued to hold. At the time it was already clear that the only really important drivers were compute and data, and that we would run out of both shortly after gpt-4. Scaling proponents suggested this was only the beginning, that there was a whole host of innovation that would be coming. Whispers of mesa-optimizers and simulators.
One year in: Chain-of-thought doesn’t actually improve things that much. External memory and super context lengths ditto. A whole list of proposed architectures seem to serve solely as a paper mill. Every month there is new hype about the latest LLM or image model. Yet they never deviate from expectations based on simple extrapolation of the scaling laws. There is only one thing that really seems to matter and that is compute and data. We have about 3 more OOMs of compute to go. Data may be milked another OOM.
A big question will be whether gpt-5 will suddenly make agentGPT work ( and to what degree). It would seem that gpt-4 is in many ways far more capable than (most or all) humans yet agentGPT is curiously bad.
All-in-all AI progress** is developing according to the naive extrapolations of Scaling Laws but nothing beyond that. The breathless twitter hype about new models is still there but it seems to be believed more at a simulacra level higher than I can parse.
Does this mean we’ll hit an AI winter? No. In my model there may be only one remaining roadblock to ASI (and I suspect I know what it is). That innovation could come at anytime. I don’t know how hard it is, but I suspect it is not too hard.
* the term AGI seems to denote vastly different things to different people in a way I find deeply confusing. I notice that the thing that I thought everybody meant by AGI is now being called ASI. So when I write AGI, feel free to substitute ASI.
** or better, AI congress
addendum: since I’ve been quoted in dmurfet’s AXRP interview as believing that there are certain kinds of reasoning that cannot be represented by transformers/LLMs I want to be clear that this is not really an accurate portrayal of my beliefs. e.g. I don’t think transformers don’t truly understand, are just a stochastic parrot, or in other ways can’t engage in the abstract reasoning that humans do. I think this is clearly false, as seen by interacting with any frontier model.
State-of-the-art models such as Gemini aren’t LLMs anymore. They are natively multimodal or omni-modal transformer models that can process text, images, speech and video. These models seem to me like a huge jump in capabilities over text-only LLMs like GPT-3.
I don’t recall what I said in the interview about your beliefs, but what I meant to say was something like what you just said in this post, apologies for missing the mark.
Chain-of-thought prompting makes models much more capable. In the original paper “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”, PaLM 540B with standard prompting only solves 18% of problems but 57% of problems with chain-of-thought prompting.
I expect the use of agent features such as reflection will lead to similar large increases in capabilities as well in the near future.
I just asked GPT-4 a GSM8K problem and I agree with your point. I think what’s happening is that GPT-4 has been fine-tuned to respond with chain-of-thought reasoning by default, so it’s no longer necessary to explicitly ask it to reason step-by-step. Though if you ask it to “respond with just a single number” to eliminate the chain-of-thought reasoning, its problem-solving ability is much worse.
When pushed, proponents don’t actually defend the position that a large enough transformer will create nanotech
Can you expand on what you mean by “create nanotech?” If improvements to our current photolithography techniques count, I would not be surprised if (scaffolded) LLMs could be useful for that. Likewise for getting bacteria to express polypeptide catalysts for useful reactions, and even maybe figure out how to chain several novel catalysts together to produce something useful (again, referring to scaffolded LLMs with access to tools).
If you mean that LLMs won’t be able to bootstrap from our current “nanotech only exists in biological systems and chip fabs” world to Drexler-style nanofactories, I agree with that, but I expect things will get crazy enough that I can’t predict them long before nanofactories are a thing (if they ever are).
or even obsolete their job
Likewise, I don’t think LLMs can immediately obsolete all of the parts of my job. But they sure do make parts of my job a lot easier. If you have 100 workers that each spend 90% of their time on one specific task, and you automate that task, that’s approximately as useful as fully automating the jobs of 90 workers. “Human-equivalent” is one of those really leaky abstractions—I would be pretty surprised if the world had any significant resemblance to the world of today by the time robotic systems approached the dexterity and sensitivity of human hands for all of the tasks we use our hands for, whereas for the task of “lift heavy stuff” or “go really fast” machines left us in the dust long ago.
Iterative improvements on the timescale we’re likely to see are still likely to be pretty crazy by historical standards. But yeah, if your timelines were “end of the world by 2026” I can see why they’d be lengthening now.
With scale, there is visible improvement in how difficult an idea or detail novel to the chatbot can be while remaining possible to explain in-context, things like issues with the code it’s writing. If a chatbot is below some threshold of situational awareness of a task, no scaffolding can keep it on track, but for a better chatbot trivial scaffolding might suffice. Many people can’t google for a solution to a technical issue; the difference between them and those who can is often subtle.
So a modest amount of scaling alone seems plausibly sufficient for making chatbots that can do whole jobs almost autonomously. If this works, 1-2 OOMs more of scaling becomes both economically feasible and more likely to be worthwhile. LLMs think much faster than humans, so they only need to be barely smart enough to help with clearing those remaining roadblocks.
At this moment in time, it seems scaffolding tricks haven’t really improved the baseline performance of models that much. Overwhelmingly, the capability comes down to whether the RLHFed base model can do the task.
it seems scaffolding tricks haven’t really improved the baseline performance of models that much. Overwhelmingly, the capability comes down to whether the RLHFed base model can do the task.
That’s what I’m also saying above (in case you are stating what you see as a point of disagreement). This is consistent with scaling-only short timeline expectations. The crux for this model is current chatbots being already close to autonomous agency and to becoming barely smart enough to help with AI research. Not them directly reaching superintelligence or having any more room for scaling.
What I don’t get about this position:
If it were indeed just scaling, what is AI research for? There would be nothing to discover; just scale more compute. Sure, you can maybe improve the speed of deploying compute a little, but at its core this seems like a story that’s in conflict with itself.
My view is that there are huge algorithmic gains in peak capability, training efficiency (less data, less compute), and inference efficiency waiting to be discovered, available to be found by a large number of parallel research hours invested by a minimally competent multimodal-LLM-powered research team. So it’s not that scaling leads to ASI directly; rather:
Scaling leads to brute-forcing the LLM agent across the threshold of AI research usefulness.
Using these LLM agents in a large research project can lead to rapidly finding better ML algorithms and architectures.
Training these newly discovered architectures at large scales leads to much more competent automated researchers.
This process repeats quickly over a few months or years.
This process results in AGI.
AGI, if instructed (or allowed, if it’s agentically motivated on its own to do so) to improve itself will find even better architectures and algorithms.
This process can repeat until ASI. The resulting intelligence / capability / inference speed goes far beyond that of humans.
Note that this process isn’t inevitable, there are many points along the way where humans can (and should, in my opinion) intervene. We aren’t disempowered until near the end of this.
My answer to that is currently in the form of a detailed 2 hour lecture with a bibliography that has dozens of academic papers in it, which I only present to people that I’m quite confident aren’t going to spread the details. It’s a hard thing to discuss in detail without sharing capabilities thoughts. If I don’t give details or cite sources, then… it’s just, like, my opinion, man. So my unsupported opinion is all I have to offer publicly. If you’d like to bet on it, I’m open to showing my confidence in my opinion by betting that the world turns out how I expect it to.
The story involves phase changes. Just scaling is what’s likely to be available to human developers in the short term (a few years), it’s not enough for superintelligence. Autonomous agency secures funding for a bit more scaling. If this proves sufficient to get smart autonomous chatbots, they then provide speed to very quickly reach the more elusive AI research needed for superintelligence.
It’s not a little speed, it’s a lot of speed: a serial speedup of about 100x, plus running in parallel. This is not as visible today, because current chatbots are not capable of doing useful work with serial depth, so the serial speedup is not in practice distinct from throughput and cost. But with actually useful chatbots it turns decades into years; software and theory from the distant future become quickly available, and non-software projects get designed in perfect detail faster than they can be assembled.
Wasn’t the surprising thing about GPT-4 that scaling laws did hold? Before this, many people expected scaling laws to stop before such a high level of capabilities. It doesn’t seem that crazy to think that a few more OOMs could be enough for greater-than-human intelligence. I’m not sure that many people predicted that we would have much-faster-than-scaling-law progress (at least until ~human-intelligence AI can speed up research)? I think scaling laws are the extreme rate of progress which many people with short timelines worry about.
To some degree, yes, they were not guaranteed to hold. But by that point they had held for over 10 OOMs, IIRC, and there was no known reason they couldn’t continue.
This might be the particular Twitter bubble I was in, but people definitely predicted capabilities beyond simple extrapolation of scaling laws.
Yesterday Greg Sadler and I met with the President of the Australian Association of Voice Actors. Like us, they’ve been lobbying for more and better AI regulation from government. I was surprised how much overlap we had in concerns and potential solutions:
1. Transparency and explainability of AI model data use (concern)
2. Importance of interpretability (solution)
3. Mis/dis information from deepfakes (concern)
4. Lack of liability for the creators of AI if any harms eventuate (concern + solution)
5. Unemployment without safety nets for Australians (concern)
6. Rate of capabilities development (concern)
They may even support the creation of an AI Safety Institute in Australia. Don’t underestimate who could be allies moving forward!
The Problem of Old Evidence, the Paradox of Ignorance, and Shapley Values
Paradox of Ignorance
Paul Christiano presents the “paradox of ignorance” where a weaker, less informed agent appears to outperform a more powerful, more informed agent in certain situations. This seems to contradict the intuitive desideratum that more information should always lead to better performance.
The example given is of two agents, one powerful and one limited, trying to determine the truth of a universal statement ∀x:ϕ(x) for some Δ0 formula ϕ. The limited agent treats each new value of ϕ(x) as a surprise and evidence about the generalization ∀x:ϕ(x). So it can query the environment about some simple inputs x and get a reasonable view of the universal generalization.
In contrast, the more powerful agent may be able to deduce ϕ(x) directly for simple x. Because it assigns these statements prior probability 1, they don’t act as evidence at all about the universal generalization ∀x:ϕ(x). So the powerful agent must consult the environment about more complex examples and pay a higher cost to form reasonable beliefs about the generalization.
Is it really a problem?
However, I argue that the more powerful agent is actually justified in assigning less credence to the universal statement ∀x:ϕ(x). The reason is that the probability mass provided by examples x₁, …, xₙ such that ϕ(xᵢ) holds is now distributed among the universal statement ∀x:ϕ(x) and additional causes Cⱼ known to the more powerful agent that also imply ϕ(xᵢ). Consequently, ∀x:ϕ(x) becomes less “necessary” and has less relative explanatory power for the more informed agent.
An implication of this perspective is that if the weaker agent learns about the additional causes Cⱼ, it should also lower its credence in ∀x:ϕ(x).
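This “less necessary” intuition is just Bayesian explaining away, and a toy calculation makes it concrete (my own illustrative numbers, not from the post):

```python
# Toy "explaining away": phi(x1) is implied by the universal
# hypothesis H, implied by an alternative cause C, and otherwise
# holds at some base rate. H and C are independent.
def posterior_H(p_H, p_C, base_rate=0.1):
    # P(e), where e = "phi(x1) holds": e is certain given H or C,
    # and occurs at base_rate given neither.
    p_e = p_H + (1 - p_H) * (p_C + (1 - p_C) * base_rate)
    return p_H / p_e  # Bayes, using P(e | H) = 1

# The weaker agent (p_C = 0) updates harder on the same instance
# than the powerful agent, who already knows the alternative cause.
weak = posterior_H(0.3, 0.0)    # ~0.81
strong = posterior_H(0.3, 0.5)  # ~0.44
```

Both agents start at P(H) = 0.3 and see the same instance; the one who knows about C ends up with much less credence in the generalization, because C soaks up part of the evidence.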
More generally, we would like the credence assigned to propositions P (such as ∀x:ϕ(x)) to be independent of the order in which we acquire new facts (like xᵢ, ϕ(xᵢ), and causes Cⱼ).
Shapley Value
The Shapley value addresses this limitation by providing a way to average over all possible orders of learning new facts. It measures the marginal contribution of an item (like a piece of evidence) to the value of sets containing that item, considering all possible permutations of the items. By using the Shapley value, we can obtain an order-independent measure of the contribution of each new fact to our beliefs about propositions like ∀x:ϕ(x).
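A minimal sketch of the proposal (toy value function and numbers, mine rather than the post’s): treat “credence in ∀x:ϕ(x) after learning a set of facts” as the value function, and average each fact’s marginal contribution over all learning orders.

```python
from itertools import permutations

def shapley(facts, credence):
    """Average, over all learning orders, of each fact's marginal
    contribution to the credence assigned to a hypothesis H."""
    totals = {f: 0.0 for f in facts}
    orders = list(permutations(facts))
    for order in orders:
        learned = frozenset()
        for f in order:
            totals[f] += credence(learned | {f}) - credence(learned)
            learned |= {f}
    return {f: t / len(orders) for f, t in totals.items()}

def credence(S):
    # Toy credence in H = "forall x: phi(x)". Each observed instance
    # raises credence, but once the alternative cause C is known,
    # instances carry much less evidential weight (explaining away).
    boost = 0.05 if "C" in S else 0.2
    return min(1.0, 0.5 + boost * sum(1 for f in S if f.startswith("phi")))

values = shapley(["phi(x1)", "phi(x2)", "C"], credence)
# By the efficiency property, the values sum to
# credence(everything) - credence(nothing), whatever the order.
```

In this toy model the instances get positive Shapley value and the alternative cause C gets a negative one, matching the claim above: the contributions are order-independent, and learning about extra causes Cⱼ lowers credence in the universal statement no matter when they arrive.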
Further thoughts
I believe this is closely related, perhaps identical, to the ‘Problem of Old Evidence’ as considered by Abram Demski.
Suppose a new scientific hypothesis, such as general relativity, explains a well-known observation, such as the perihelion precession of Mercury, better than any existing theory. Intuitively, this is a point in favor of the new theory. However, the probability for the well-known observation was already at 100%. How can a previously-known statement provide new support for the hypothesis, as if we are re-updating on evidence we’ve already updated on long ago? This is known as the problem of old evidence, and is usually levelled as a charge against Bayesian epistemology.
[Thanks to @Jeremy Gillen for pointing me towards this interesting Christiano paper]
It’s funny that this has been recently shown in a paper. I’ve been thinking a lot about this phenomenon regarding fields with little to no capacity for testable predictions like history.
I got very into history over the last few years, and found there was a significant advantage to being unknowledgeable that was not available to the knowledgeable, and it was exactly what this paper is talking about.
By not knowing anything, I could entertain multiple bizarre ideas without immediately thinking “but no, that doesn’t make sense because of X.” And then each of those ideas becomes, in effect, its own testable prediction. If there’s something to it, then as I learn more about the topic I’m going to see significantly more samples indicating it could be true and few convincing indications to the contrary. But if it probably isn’t accurate, I’ll see few supporting samples and likely a number of counterexamples.
You kind of get to throw everything at the wall and see what sticks over time.
In particular, I found it was especially powerful at identifying clustering trends in emerging cross-discipline research on things that were testable, such as archeological finds and DNA results, all within just the past decade; research which, despite being relevant to the field of textual history, is still largely ignored in the face of consensus built on conviction.
It reminds me a lot of science historian John Heilbron’s quote, “The myth you slay today may contain a truth you need tomorrow.”
If you haven’t had the chance to slay any myths, you also haven’t preemptively killed off any truths along with them.
One of the interesting things about AI minds (such as LLMs) is that in theory, you can turn many topics into testable science while avoiding the ‘problem of old evidence’, because you can now construct artificial minds and mold them like putty. They know what you want them to know, and so you can see what they would predict in the absence of knowledge, or you can install in them false beliefs to test out counterfactual intellectual histories, or you can expose them to real evidence in different orders to measure biases or path dependency in reasoning.
With humans, you can’t do that because they are so uncontrolled: even if someone says they didn’t know about a crucial piece of evidence X, there is no way for them to prove that, and they may be honestly mistaken, having already read about X and forgotten it (but humans never really forget, so X has already changed their “priors”, leading to double-counting), or there is leakage. And you can’t get people to really believe things at the drop of a hat, so you can’t make people imagine, “suppose Napoleon had won Waterloo, how do you predict history would have changed?” because no matter how you try to participate in the spirit of the exercise, you always know that Napoleon lost and you have various opinions on that contaminating your retrodictions, and even if you have never read a single book or paper on Napoleon, you are still contaminated by expressions like “his Waterloo” (‘Hm, the general in this imaginary story is going to fight at someplace called Waterloo? Bad vibes. I think he’s gonna lose.’)
But with a LLM, say, you could simply train it with all timestamped texts up to Waterloo, like all surviving newspapers, and then simply have one version generate a bunch of texts about how ‘Napoleon won Waterloo’, train the other version on these definitely-totally-real French newspaper reports about his stunning victory over the monarchist invaders, and then ask it to make forecasts about Europe.
(These are the sorts of experiments which are why one might wind up running tons of ‘ancestor simulations’… There’s many more reasons to be simulating past minds than simply very fancy versions of playing The Sims. Perhaps we are now just distant LLM personae being tested about reasoning about the Singularity in one particular scenario involving deep learning counterfactuals, where DL worked, although in the real reality it was Bayesian program synthesis & search.)
While I agree that the potential for AI (we probably need a better term than LLMs or transformers, as multimodal models with evolving architectures grow beyond those terms) to make less testable topics more testable is quite high, I’m not sure the air-gapping of information can be as clean as you might hope.
Does the AI generating the stories of Napoleon’s victory know about the historical reality of Waterloo? Is it using something like SynthID where the other AI might inadvertently pick up on a pattern across the stories of victories distinct from the stories preceding it?
You end up with a turtles-all-the-way-down scenario when trying to control for information leakage, hoping to reach a threshold where it no longer affects the result; but given that we’re probably already seriously underestimating the degree to which correlations are mapped even in today’s models, I don’t have high hopes for tomorrow’s.
I think the way in which there’s most impact on fields like history is the property by which truth clusters across associated samples whereas fictions have counterfactual clusters. An AI mind that is not inhibited by specialization blindness or the rule of seven plus or minus two and better trained at correcting for analytical biases may be able to see patterns in the data, particularly cross-domain, that have eluded human academics to date (this has been my personal research interest in the area, and it does seem like there’s significant room for improvement).
And yes, we certainly could be. If you’re a fan of cosmology at all, I’ve been following Neil Turok’s CPT symmetric universe theory closely, which started with the Baryonic asymmetry problem and has tackled a number of the open cosmology questions since. That, paired with a QM interpretation like Everett’s ends up starting to look like the symmetric universe is our reference and the MWI branches are variations of its modeling around quantization uncertainties.
(I’ve found myself thinking often lately about how given our universe at cosmic scales and pre-interaction at micro scales emulates a mathematically real universe, just what kind of simulation and at what scale might be able to be run on a real computing neural network.)
A variant of what you are saying is that AI may once and for all allow us to calculate the true counterfactual Shapley value of scientific contributions.
( re: ancestor simulations
I think you are onto something here. Compare the Q hypothesis:
Yup. Who knows but we are all part of a giant leave-one-out cross-validation computing counterfactual credit assignment on human history? Schmidhuber-em will be crushed by the results.
This doesn’t feel like it resolves that confusion for me, I think it’s still a problem with the agents he describes in that paper.
The causes Cⱼ are just the direct computation of ϕ for small values of x. If they were arguments that only had bearing on small values of x and implied nothing about larger values (e.g. an adversary selected some x to show you, but filtered for x such that ϕ(x)), then it makes sense that this evidence has no bearing on ∀x:ϕ(x). But when there was no selection or other reason that the argument only applies to small x, then to me it feels like the existence of the evidence (even though already proven/computed) should still increase the credence of the forall.
I didn’t intend the causes Cⱼ to equate to direct computation of ϕ(xᵢ).
They are rather other pieces of evidence that the powerful agent has that make it believe ϕ(xᵢ). I don’t know if that’s what you meant.
I agree that seeing xᵢ such that ϕ(xᵢ) holds should increase credence in ∀x:ϕ(x) even in the presence of knowledge of Cⱼ. And the Shapley value proposal will do so.
(This is the tale of a potentially reasonable CEO of the leading AGI company, not the one we have in the real world. Written after a conversation with @jdp.)
You’re the CEO of the leading AGI company. You start to think that your moat is not as big as it once was. You need more compute and need to start accelerating to give yourself a bigger lead, otherwise this will be bad for business.
You start to look around for compute, and realize that you handed off 20% of your compute to the superalignment team (and even made a public commitment!). You end up deciding to take their compute away to maintain a strong lead in the AGI race, while expecting there will be backlash.
Your plan is to lobby government and tell them that AGI race dynamics are too intense at the moment and you were forced to make a tough call for the business. You tell government that it’s best if they put heavy restrictions on AGI development, otherwise your company will not be able to afford to subsidize basic research in alignment.
You give them a plan that you think they should follow if they want AGI to be developed safely and for companies to invest in basic research.
You told your top employees this plan, but they have a hard time believing you, given that they feel you lied about your public commitment to giving them 20% of current compute. You didn’t actually lie, or at least it wasn’t intentional. You just thought the moat was bigger, and when you realized it wasn’t, you had to make a business decision. Many things have happened since that commitment.
Anyway, your safety researchers are not happy about this at all and decide to resign.
So, you go to government and lobby. Except you never intended to help the government get involved in some kind of slow-down or pause. Your intent was to use this entire story as a smokescreen for getting rid of those who didn’t align with you, and to lobby the government in such a way that they don’t think your safety researchers resigning is such a big deal.
You were never the reasonable CEO, and now you have complete power.
Idea: Daniel Kokotajlo probably lost quite a bit of money by not signing an OpenAI NDA before leaving, which I consider a public service at this point. Could some of the funders of the AI safety landscape give some money or social reward for this?
I guess reimbursing everything Daniel lost might be a bit too much for funders, but providing some money, both to reward the act and to incentivize future safety people not to sign NDAs, would have very high value.
@Daniel Kokotajlo If you indeed avoided signing an NDA, would you be able to share how much you passed up as a result of that? I might indeed want to create a precedent here and maybe try to fundraise for some substantial fraction of it.
To clarify: I did sign something when I joined the company, so I’m still not completely free to speak (still under confidentiality obligations). But I didn’t take on any additional obligations when I left.
Unclear how to value the equity I gave up, but it probably would have been about 85% of my family’s net worth at least. But we are doing fine, please don’t worry about us.
Mostly for @habryka’s sake: it sounds like you are likely describing your unvested equity, or possibly equity that gets clawed back on quitting. Neither of which is (usually) tied to signing an NDA on the way out the door—they’d both be lost simply due to quitting.
The usual arrangement is some extra severance payment tied to signing something on your way out the door, and that’s usually way less than the unvested equity.
My current best guess is that actually cashing out the vested equity is tied to an NDA, but I am really not confident. OpenAI has a bunch of really weird equity arrangements.
Can you speak to any, let’s say, “hypothetical” specific concerns that somebody who was in your position at a company like OpenAI might have had that would cause them to quit in a similar way to you?
I think the board must be thinking about how to get some independence from Microsoft, and there are not many entities who can counterbalance one of the biggest companies in the world. The government’s intelligence and defence industries are some of them (as are Google, Meta, Apple, etc.). But that move would require secrecy: to stop nationalistic race dynamics, because of contractual obligations, and to avoid a backlash.
EDIT: I’m getting a few disagrees, would someone mind explaining why they disagree with these wild speculations?
The latter. Yeah idk whether the sacrifice was worth it but thanks for the support. Basically I wanted to retain my ability to criticize the company in the future. I’m not sure what I’d want to say yet though & I’m a bit scared of media attention.
I’d be interested in hearing peoples’ thoughts on whether the sacrifice was worth it, from the perspective of assuming that counterfactual Daniel would have used the extra net worth altruistically. Is Daniel’s ability to speak more freely worth more than the altruistic value that could have been achieved with the extra net worth?
(Note: Regardless of whether it was worth it in this case, simeon_c’s reward/incentivization idea may be worthwhile as long as there are expected to be some cases in the future where it’s worth it, since the people in those future cases may not be as willing as Daniel to make the altruistic personal sacrifice, and so we’d want them to be able to retain their freedom to speak without it costing them as much personally.)
I think having signed an NDA (and especially a non-disparagement agreement) from a major capabilities company should probably rule you out of any kind of leadership position in AI Safety, and especially any kind of policy position. Given that I think Daniel has a pretty decent chance of doing either or both of these things, and that work is very valuable and constrained on the kind of person that Daniel is, I would be very surprised if this wasn’t worth it on altruistic grounds.
Edit: As Buck points out, different non-disclosure-agreements can differ hugely in scope. To be clear, I think non-disclosure-agreements that cover specific data or information you were given seems fine, but non-disclosure-agreements that cover their own existence, or that are very broadly worded and prevent you from basically talking about anything related to an organization, are pretty bad. My sense is the stuff that OpenAI employees are asked to sign when they leave are very constraining, but my guess is the kind of stuff that people have to sign for a small amount of contract work or for events are not very constraining, though I would definitely read any contract carefully in this space.
Strong disagree re signing non-disclosure agreements (which I’ll abbreviate as NDAs). I think it’s totally reasonable to sign NDAs with organizations; they don’t restrict your ability to talk about things you learned other ways than through the ways covered by the NDA. And it’s totally standard to sign NDAs when working with organizations. I’ve signed OpenAI NDAs at least three times, I think—once when I worked there for a month, once when I went to an event they were running, once when I visited their office to give a talk.
I think non-disparagement agreements are way more problematic. At the very least, signing secret non-disparagement agreements should probably disqualify you from roles where your silence re an org might be interpreted as a positive sign.
It might be good on the current margin to have a norm of publicly listing any non-disclosure agreements you have signed (e.g. on one’s LW profile), and the rough scope of them, so that other people can model what information you’re committed to not sharing, and highlight if it is related to anything beyond the details of technical research being done (e.g. if it is about social relationships or conflicts or criticism).
I have added the one NDA that I have signed to my profile.
But everyone has lots of duties to keep secrets or preserve privacy and the ones put in writing often aren’t the most important. (E.g. in your case.)
I’ve signed ~3 NDAs. Most of them are irrelevant now and useless for people to know about, like yours.
I agree in special cases it would be good to flag such things — like agreements to not share your opinions on a person/org/topic, rather than just keeping trade secrets private.
I agree with this overall point, although I think “trade secrets” in the domain of AI can be relevant for people having surprising timelines views that they can’t talk about.
My understanding is that the extent of NDAs can differ a lot between different implementations, so it might be hard to speak in generalities here. From the revealed behavior of people I poked here who have worked at OpenAI full-time, the OpenAI NDAs seem very comprehensive and limiting. My guess is also the NDAs for contractors and for events are a very different beast and much less limiting.
Also just the de-facto result of signing non-disclosure-agreements is that people don’t feel comfortable navigating the legal ambiguity and default very strongly to not sharing approximately any information about the organization at all.
Maybe people would do better things here with more legal guidance, and I agree that you don’t generally seem super constrained in what you feel comfortable saying, but like I sure now have run into lots of people who seem constrained by NDAs they signed (even without any non-disparagement component). Also, if the NDA has a gag clause that covers the existence of the agreement, there is no way to verify the extent of the NDA, and that makes navigating this kind of stuff super hard and also majorly contributes to people avoiding the topic completely.
What are your timelines like? How long do YOU think we have left?
I know several CEOs of small AGI startups who seem to have gone crazy and told me that they are self-inserts into this world, which is a simulation of their original self’s creation. However, none of them talk about each other, and presumably at most one of them can be meaningfully right?
One AGI CEO hasn’t gone THAT crazy (yet), but is quite sure that the November 2024 election will be meaningless because pivotal acts will have already occurred that make nation state elections visibly pointless.
Also, I know many normies who can’t really think probabilistically and mostly aren’t worried at all about any of this… but one normie who can calculate is pretty sure that we have AT LEAST 12 years (possibly because his retirement plans won’t be finalized until then). He also thinks that even systems as “mere” as TikTok will be banned before the November 2024 election because “elites aren’t stupid”.
I think I’m more likely to be better calibrated than any of these opinions, because most of them don’t seem to focus very much on “hedging” or “thoughtful doubting”, whereas my event space assigns non-zero probability to ensembles that contain such features of possible futures (including these specific scenarios).
better calibrated than any of these opinions, because most of them don’t seem to focus very much on “hedging” or “thoughtful doubting”
new observations > new thoughts when it comes to calibrating yourself.
The best calibrated people are those who get lots of interaction with the real world, not those who think a lot or have a complicated inner model. Tetlock’s superforecasters were gamblers and weathermen.
I know several CEOs of small AGI startups who seem to have gone crazy and told me that they are self inserts into this world, which is a simulation of their original self’s creation
Do you know if the origin of this idea for them was a psychedelic or dissociative trip? I’d give it at least even odds, with most of the remaining chances being meditation or Eastern religions...
Wait, you know smart people who have NOT, at some point in their life: (1) taken a psychedelic NOR (2) meditated, NOR (3) thought about any of buddhism, jainism, hinduism, taoism, confucianisn, etc???
To be clear to naive readers: psychedelics are, in fact, non-trivially dangerous.
I personally worry I already have “an arguably-unfair and a probably-too-high share” of “shaman genes” and I don’t feel I need exogenous sources of weirdness at this point.
But in the SF bay area (and places on the internet memetically downstream from IRL communities there) a lot of that is going around, memetically (in stories about) and perhaps mimetically (via monkey see, monkey do).
The first time you use a serious one, you’re likely getting a permanent modification to your personality (+0.5 stddev to your Openness?), and arguably/sorta each time you do a new one, or a higher dose, or whatever, you’ve committed “1% of a personality suicide” by disrupting some of your most neurologically complex commitments.
To a first approximation my advice is simply “don’t do it”.
HOWEVER: this latter consideration actually suggests: anyone seriously and truly considering suicide should perhaps take a low dose psychedelic FIRST (with at least two loving tripsitters and due care) since it is also maybe/sorta “suicide” but it leaves a body behind that most people will think is still the same person and so they won’t cry very much and so on?
To calibrate this perspective a bit, I also expect that even if cryonics works, it will also cause an unusually large amount of personality shift. A tolerable amount. An amount that leaves behind a personality that is similar-enough-to-the-current-one-to-not-have-triggered-a-ship-of-theseus-violation-in-one-modification-cycle. Much more than a stressful day and then bad nightmares and a feeling of regret the next day, but weirder. With cryonics, you might wake up to some effects that are roughly equivalent to “having taken a potion of youthful rejuvenation, and not having the same birthmarks, and also learning that you’re separated-by-disjoint-subjective-deaths from LOTS of people you loved when you experienced your first natural death” for example. This is a MUCH BIGGER CHANGE than just having a nightmare and waking up with a change of heart (and most people don’t have nightmares and changes of heart every night (at least: I don’t and neither do most people I’ve asked)).
A good “axiological practice” (which I don’t know of anyone working on except me (and I’m only doing it a tiny bit, not with my full mental budget)) is sort of an idealized formal praxis for making yourself robust to “humanely heartful emotional changes”(?) and changing only in <PROPERTY-NAME-TBD> ways from such events.
(Edited to add: Current best candidate name for this property is: “WISE” but maybe “healthy” works? (It depends on whether the Stoics or Nietzsche were “more objectively correct” maybe? The Stoics, after all, were erased and replaced by Platonism-For-The-Masses (AKA “Christianity”) so if you think that “staying implemented in physics forever” is critically important then maybe “GRACEFUL” is the right word? (If someone says “vibe-alicious” or “flowful” or “active” or “strong” or “proud” (focusing on low latency unity achieved via subordination to simply and only power) then they are probably downstream of Heidegger and you should always be ready for them to change sides and submit to metaphorical Nazis, just as Heidegger subordinated himself to actual Nazis without really violating his philosophy at all.)))
I don’t think that psychedelics fit neatly into EITHER category. Drugs in general are akin to wireheading, except wireheading is when something reaches into your brain to overload one or more of your positive-value-tracking-modules, (as a trivially semantically invalid shortcut to achieving positive value “out there” in the state-of-affairs that your tracking modules are trying to track) but actual humans have LOTS of <thing>-tracking-modules and culture and science barely have any RIGOROUS vocabulary for any of them.
Note that many of these neurological <thing>-tracking-modules were evolved.
Also, many of them will probably be “like hands” in terms of AI’s ability to model them.
This is part of why AIs should be existentially terrifying to anyone who is spiritually adept.
AI that sees the full set of causal paths to modifying human minds will be “like psychedelic drugs with coherent persistent agendas”. Humans have basically zero cognitive security systems. Almost all security systems are culturally mediated, and then (absent complex interventions) lots of the brain stuff freezes in place around the age of puberty, and then other stuff freezes around 25, and so on. This is why we protect children from even TALKING to untrusted adults: they are too plastic and not savvy enough. (A good heuristic for the lowest level of “infohazard” is “anything you wouldn’t talk about in front of a six year old”.)
Humans are sorta like a bunch of unpatchable computers, exposing “ports” to the “internet”, where each of our port numbers is simply a lightly salted semantic hash of an address into some random memory location that stores everything, including our operating system.
Your word for “drugs” and my word for “drugs” don’t point to the same memory addresses in the computers implementing our souls. Also our souls themselves don’t even have the same nearby set of “documents” (because we just have different memories n’stuff)… but the word “drugs” is not just one of the ports… it is a port that deserves a LOT of security hardening.
The bible said ~”thou shalt not suffer a ‘pharmakeia’ to live” for REASONS.
Wondering why this has so many disagreement votes. Perhaps people don’t like to see the serious topic of “how much time do we have left” alongside evidence that there’s a population of AI entrepreneurs who are so far removed from consensus reality that they now think they’re living in a simulation.
(edit: The disagreement for @JenniferRM’s comment was at something like −7. Two days later, it’s at −2)
It could just be because it reaches a strong conclusion on anecdotal/clustered evidence (e.g. it might say more about her friend group than anything else). Along with claims to being better calibrated for weak reasons—which could be true, but seems not very epistemically humble.
Full disclosure I downvoted karma, because I don’t think it should be top reply, but I did not agree or disagree.
But Jen seems cool, I like weird takes, and downvotes are not a big deal—just a part of a healthy contentious discussion.
For most of my comments, I’d almost be offended if I didn’t say something surprising enough to get a “high interestingness, low agreement” voting response. Excluding speech acts, why even say things if your interlocutor or full audience can predict what you’ll say?
And I usually don’t offer full clean proofs in direct word. Anyone still pondering the text at the end, properly, shouldn’t “vote to agree”, right? So from my perspective… it’s fine and sorta even working as intended <3
However, also, this is currently the top-voted response to me, and if William_S himself reads it I hope he answers here, if not with text then (hopefully? even better?) with a link to a response elsewhere?
((EDIT: Re-reading everything above this point, I notice that I totally left out the “basic take” that might go roughly like “Kurzweil, Altman, and Zuckerberg are right about compute hardware (not software or philosophy) being central, and there’s a compute bottleneck rather than a compute overhang, so the speed of history will KEEP being about datacenter budgets and chip designs, and those happen on 6-to-18-month OODA loops that could actually fluctuate based on economic decisions, and therefore it’s maybe 2026, or 2028, or 2030, or even 2032 before things pop, depending on how and when billionaires and governments decide to spend money”.))
Pulling honest posteriors from people who’ve “seen things we wouldn’t believe” gives excellent material for trying to perform aumancy… work backwards from their posteriors to possible observations, and then forwards again, toward what might actually be true :-)
I assume timelines are fairly long or this isn’t safety related. I don’t see a point in keeping PPUs or even caring about NDA lawsuits which may or may not happen and would take years in a short timeline or doomed world.
I think having a probability distribution over timelines is the correct approach. Like, in the comment above:
I think I’m more likely to be better calibrated than any of these opinions, because most of them don’t seem to focus very much on “hedging” or “thoughtful doubting”, whereas my event space assigns non-zero probability to ensembles that contain such features of possible futures (including these specific scenarios).
Even in probabilistic terms, the evidence of OpenAI members respecting their NDAs makes it more likely that this was some sort of political infighting (EA related) than sub-year takeoff timelines. I would be open to a 1 year takeoff, I just don’t see it happening given the evidence. OpenAI wouldn’t need to talk about raising trillions of dollars, companies wouldn’t be trying to commoditize their products, and the employees who quit OpenAI would speak up.
Political infighting is in general just more likely than very short timelines, which would run counter to most prediction markets on the matter. Not to mention, given it’s already happened with the firing of Sam Altman, it’s far more likely to have happened again.
If there was a probability distribution of timelines, the current events indicate sub 3 year ones have negligible odds. If I am wrong about this, I implore the OpenAI employees to speak up. I don’t think normies misunderstand probability distributions, they just usually tend not to care about unlikely events.
No, OpenAI (assuming that it is a well-defined entity) also uses a probability distribution over timelines.
(In reality, every member of its leadership has their own probability distribution, and this translates to OpenAI having a policy and behavior formulated approximately as if there is some resulting single probability distribution).
The important thing is, they are uncertain about timelines themselves (in part, because no one knows how perplexity translates to capabilities, in part, because there might be difference with respect to capabilities even with the same perplexity, if the underlying architectures are different (e.g. in-context learning might depend on architecture even with fixed perplexity, and we do see a stream of potentially very interesting architectural innovations recently), in part, because it’s not clear how big is the potential of “harness”/”scaffolding”, and so on).
This does not mean there is no political infighting. But it’s on the background of them being correctly uncertain about true timelines...
Compute-wise, inference demands are huge and growing with popularity of the models (look how much Facebook did to make LLama 3 more inference-efficient).
So if they expect models to become useful enough for almost everyone to want to use them, they should worry about compute, assuming they do want to serve people like they say they do (I am not sure how this looks for very strong AI systems; they will probably be gradually expanding access, and the speed of expansion might vary).
However, none of them talk about each other, and presumably at most one of them can be meaningfully right?
Why can at most one of them be meaningfully right?
Would not a simulation typically be “a multi-player game”?
(But yes, if they assume that their “original self” was the sole creator (?), then they would all be some kind of “clones” of that particular “original self”. Which would surely increase the overall weirdness.)
These are valid concerns! I presume that if “in the real timeline” there was a consortium of AGI CEOs who agreed to share costs on one run, and fiddled with their self-inserts, then they… would have coordinated more? (Or maybe they’re trying to settle a bet on how the Singularity might counterfactually have happened in the event of this or that person experiencing this or that coincidence? But in that case I don’t think the self-inserts would be allowed to say they’re self-inserts.)
Like why not re-roll the PRNG, to censor out the counterfactually simulable timelines that included me hearing from any of the REAL “self inserts of the consortium of AGI CEOS” (and so I only hear from “metaphysically spurious” CEOs)??
Or maybe the game engine itself would have contacted me somehow to ask me to “stop sticking causal quines in their simulation” and somehow I would have been induced by such contact to not publish this?
Mostly I presume AGAINST “coordinated AGI CEO stuff in the real timeline” along any of these lines because, as a type, they often “don’t play well with others”. Fucking oligarchs… maaaaaan.
It seems like a pretty normal thing, to me, for a person to naturally keep track of simulation concerns as a philosophic possibility (it’s kinda basic “high school theology”, right?)… which might become one’s “one track reality narrative” as a sort of “stress induced psychotic break away from a properly metaphysically agnostic mental posture”?
That’s my current working psychological hypothesis, basically.
But to the degree that it happens more and more, I can’t entirely shake the feeling that my probability distribution over “the time T of a pivotal act occurring” (distinct from when I anticipate I’ll learn that it happened, which of course must be LATER than both T and later than now) shouldn’t just include times in the past, but should actually be a distribution over complex numbers or something...
...but I don’t even know how to do that math? At best I can sorta see how to fit it into exotic grammars where it “can have happened counterfactually” or so that it “will have counterfactually happened in a way that caused this factually possible recurrence” or whatever. Fucking “plausible SUBJECTIVE time travel”, fucking shit up. It is so annoying.
Like… maybe every damn crazy AGI CEO’s claims are all true except the ones that are mathematically false?
How the hell should I know? I haven’t seen any not-plausibly-deniable miracles yet. (And all of the miracle reports I’ve heard were things I was pretty sure the Amazing Randi could have duplicated.)
All of this is to say, Hume hasn’t fully betrayed me yet!
Mostly I’ll hold off on performing normal updates until I see for myself, and hold off on performing logical updates until (again!) I see a valid proof for myself <3
Thank you for your work there. Curious what specifically prompted you to post this now, presumably you leaving OpenAI and wanting to communicate that somehow?
Can you confirm or deny whether you signed any NDA related to you leaving OpenAI?
(I would guess a “no comment” or lack of response or something to that degree implies a “yes” with reasonably high probability. Also, you might be interested in this link: the National Labor Relations Board has ruled that NDAs offered during severance agreements which cover the existence of the NDA itself are unlawful, which may be relevant when deciding how to respond here)
My layman’s understanding is that managerial employees are excluded from that ruling, unfortunately. Which I think applies to William_S if I read his comment correctly. (See Pg 11, in the “Excluded” section in the linked pdf in your link)
I think one key point that is missing is this: regardless of whether the NDA and the subsequent gag order is legitimate or not; William would still have to spend thousands of dollars on a court case to rescue his rights. This sort of strong-arm litigation has become very common in the modern era. It’s also just… very stressful. If you’ve just resigned from a company you probably used to love, you likely don’t want to fish all of your old friends, bosses and colleagues into a court case.
Edit: also, if William left for reasons involving AGI safety—maybe entering into (what would likely be a very public) court case would be counterproductive to their reason for leaving? You probably don’t want to alarm the public by flavouring existential threats in legal jargon. American judges have the annoying tendency to valorise themselves as celebrities when confronting AI (see Musk v Open AI).
Are you familiar with USA NDAs? I’m sure there are lots of clauses that have been ruled invalid by case law. In many cases, non-lawyers have no idea about these, so you might be able to make a difference with very little effort. There is also the possibility that valuable OpenAI shares could be rescued?
If you haven’t seen it, check out this thread where one of the OpenAI leavers did not sign the gag order.
(1) Invalidity of the NDA does not guarantee William will be compensated after the trial. Even if he is, his job prospects may be hurt long-term.
(2) States have different laws on whether the NLRA trumps internal company memorandums. More importantly, labour disputes are traditionally solved through internal bargaining. Presumably, the collective bargaining ‘hand-off’ involving NDAs and gag-orders at this level will waive subsequent litigation in district courts. The precedent Habryka offered refers to hostile severance agreements only, not the waiving of the dispute mechanism itself.
I honestly wish I could use this dialogue as a discrete communication to William on a way out, assuming he needs help, but I re-affirm my previous worries on the costs.
I also add here, rather cautiously, that there are solutions. However, it would depend on whether William was an independent contractor, how long he worked there, whether it actually involved a trade secret (as others have mentioned) and so on. The whole reason NDAs tend to be so effective is because they obfuscate the material needed to even know or be aware of what remedies are available.
I think it is safe to infer from the conspicuous and repeated silence by ex-OA employees when asked whether they signed an NDA which also included a gag order about the NDA, that there is in fact an NDA with a gag order in it, presumably tied to the OA LLC PPUs (which are not real equity and so probably even less protected than usual).
By “gag order” do you mean just as a matter of private agreement, or something heavier-handed, with e.g. potential criminal consequences?
I have trouble understanding the absolute silence we seem to be having. There seem to be very few leaks, and all of them are very mild-mannered and are failing to build any consensus narrative that challenges OA’s press in the public sphere.
Are people not able to share info over Signal or otherwise tolerate some risk here? It doesn’t add up to me if the risk is just some chance of OA trying to then sue you to bankruptcy, especially since I think a lot of us would offer support in that case, and the media wouldn’t paint OA in a good light for it.
I am confused. (And I grateful to William for at least saying this much, given the climate!)
I would guess that there isn’t a clear smoking gun that people aren’t sharing because of NDAs, just a lot of more subtle problems that add up to leaving (and in some cases saying OpenAI isn’t being responsible etc).
This is consistent with the observation of the board firing Sam but not having a clear crossed line to point at for why they did it.
It’s usually easier to notice when the incentives are pointing somewhere bad than to explain what’s wrong with them, and it’s easier to notice when someone is being a bad actor than it is to articulate what they did wrong. (Both of these run a higher risk of false positives relative to more crisply articulatable problems.)
The lack of leaks could just mean that there’s nothing interesting to leak. Maybe William and others left OpenAI over run-of-the-mill office politics and there’s nothing exceptional going on related to AI.
Rest assured, there is plenty that could leak at OA… (And might were there not NDAs, which of course is much of the point of having them.)
For a past example, note that no one knew that Sam Altman had been fired from YC CEO for similar reasons as OA CEO, until the extreme aggravating factor of the OA coup, 5 years later. That was certainly more than ‘run of the mill office politics’, I’m sure you’ll agree, but if that could be kept secret, surely lesser things now could be kept secret well past 2029?
At least one of them has explicitly indicated they left because of AI safety concerns, and this thread seems to be insinuating some concern—Ilya Sutskever’s conspicuous silence has become a meme, and Altman recently expressed that he is uncertain of Ilya’s employment status. There still hasn’t been any explanation for the boardroom drama last year.
If it was indeed run-of-the-mill office politics and all was well, then something to the effect of “our departures were unrelated, don’t be so anxious about the world ending, we didn’t see anything alarming at OpenAI” would obviously help a lot of people and also be a huge vote of confidence for OpenAI.
It seems more likely that there is some (vague?) concern but it’s been overridden by tremendous legal/financial/peer motivations.
Profit Participation Units (PPUs) represent a unique compensation method, distinct from traditional equity-based rewards. Unlike shares, stock options, or profit interests, PPUs don’t confer ownership of the company; instead, they offer a contractual right to participate in the company’s future profits.
Does anyone know if it’s typically the case that people under gag orders about their NDAs can talk to other people who they know signed the same NDAs? That is, if a bunch of people quit a company and all have signed self-silencing NDAs, are they normally allowed to talk to each other about why they quit and commiserate about the costs of their silence?
(Personal) On writing and (not) speaking
I often struggle to find words and sentences that match what I intend to communicate.
Here are some problems this can cause:
Wordings that are odd or unintuitive to the reader, but that are at least literally correct.[1]
Not being able to express what I mean, and having to choose between not writing it, or risking miscommunication by trying anyways. I tend to choose the former unless I’m writing to a close friend. Unfortunately this means I am unable to express some key insights to a general audience.
Writing taking lots of time: I usually have to iterate many times on words/sentences until I find one which my mind parses as referring to what I intend. In the slowest cases, I might finalize only 2-10 words per minute. Even after iterating, my words are often interpreted in ways I failed to foresee.
These apply to speaking, too. If I speak what would be the ‘first iteration’ of a sentence, there’s a good chance it won’t create an interpretation matching what I intend to communicate. In spoken language I have no chance to constantly ‘rewrite’ my output before sending it. This is one reason, but not the only reason, that I’ve had a policy of trying to avoid voice-based communication.
I’m not fully sure what caused this relationship to language. It could be that it’s just a byproduct of being autistic. It could also be a byproduct of out-of-distribution childhood abuse.[2]
E.g., once I couldn’t find the word ‘clusters,’ and wrote a complex sentence referring to ‘sets of similar’ value functions each corresponding to a common alignment failure mode / ASI takeoff training story. (I later found a way to make it much easier to read)
(Content warning)
My primary parent was highly abusive, and would punish me for using language in the intuitive ‘direct’ way about particular instances of that. My early response was to try to euphemize and say-differently in a way that contradicted less the power dynamic / social reality she enforced.
Eventually I learned to model her as a deterministic system and stay silent / fawn.
Aaron Bergman has a vid of himself typing new sentences in real-time, which I found really helpfwl.[1] I wish I could watch lots of people record themselves typing, so I could compare what I do.
Being slow at writing can be sign of failure or winning, depending on the exact reasons why you’re slow. I’d worry about being “too good” at writing, since that’d be evidence that your brain is conforming your thoughts to the language, instead of conforming your language to your thoughts. English is just a really poor medium for thought (at least compared to e.g. visuals and pre-word intuitive representations), so it’s potentially dangerous to care overmuch about it.
Btw, Aaron is another person-recommendation. He’s awesome. Has really strong self-insight, goodness-of-heart, creativity. (Twitter profile, blog+podcast, EAF, links.) I haven’t personally learned a whole bunch from him yet,[2] but I expect if he continues being what he is, he’ll produce lots of cool stuff which I’ll learn from later.
Edit: I now recall that I’ve learned from him: screwworms (important), and the ubiquity of left-handed chirality in nature (mildly important). He also caused me to look into two-envelopes paradox, which was usefwl for me.
Although I later learned about screwworms from Kevin Esvelt on the 80,000 Hours podcast, so I would’ve learned it anyway. And I also later learned about left-handed chirality from Steve Mould on YT, but I may not have reflected on it as much.
Record yourself typing?
It’s also partially the problem with the recipient of the communicated message. Sometimes you both have very different background assumptions/intuitive understandings. Sometimes it’s just a skill issue and the person you are talking to is bad at parsing, and all the work of keeping the discussion on the important things / away from trivial undesirable sidelines is left to you.
Certainly it’s useful to know how to pick your battles and see if this discussion/dialogue is worth what you’re getting out of it at all.
At what point should I post content as top-level posts rather than shortforms?
For example, a recent writing I posted to shortform was ~250 concise words plus an image: ‘Anthropics may support a ‘non-agentic superintelligence’ agenda’. It would be a top-level post on my blog if I had one set up (maybe soon :p).
Some general guidelines on this would be helpful.
This is a good question, especially since there’ve been some short form posts recently that are high quality and would’ve made good top-level posts—after all, posts can be short.
Epic Lizka post is epic.
Also, I absolutely love the word “shard” but my brain refuses to use it because then it feels like we won’t get credit for discovering these notions by ourselves. Well, also just because the words “domain”, “context”, “scope”, “niche”, “trigger”, “preimage” (wrt a neural function/policy / “neureme”) adequately serve the same purpose and are currently more semantically/semiotically granular in my head.
trigger/preimage ⊆ scope ⊆ domain
“niche” is a category in function space (including domain, operation, and codomain), “domain” is a set.
“scope” is great because of programming connotations and can be used as a verb. “This neural function is scoped to these contexts.”
Note to self, write a post about the novel akrasia solutions I thought up before becoming a rationalist.
Figuring out how to want to want to do things
Personalised advertising of Things I Wanted to Want to Do
What I do when all else fails
Maybe I could even write a sequence on this?
Several dozen people now presumably have Lumina in their mouths. Can we not simply crowdsource some assays of their saliva? I would chip in money for this. Key questions are around ethanol levels, aldehyde levels, antibacterial levels, and whether the organism itself stays colonized at useful levels.
Maybe I’m late to the conversation but has anyone thought through what happens when Lumina colonizes the mouths of other people? Mouth bacteria are important for things like conversion of nitrate to nitrite for nitric oxide production. How do we know the lactic acid metabolism isn’t important, or that Lumina won’t outcompete other strains important for overall health?
Just checked who from the authors of the Weak-To-Strong Generalization paper is still at OpenAI:
Collin Burns
Jan Hendrik Kirchner
Leo Gao
Bowen Baker
Yining Chen
Adrien Ecoffet
Manas Joglekar
Jeff Wu
Gone are:
Ilya Sutskever
Pavel Izmailov[1]
Jan Leike
Leopold Aschenbrenner
Reason unknown
The word “overconfident” seems overloaded. Here are some things I think that people sometimes mean when they say someone is overconfident:
They gave a binary probability that is too far from 50% (I believe this is the original one)
They overestimated a binary probability (e.g. they said 20% when it should be 1%)
Their estimate is arrogant (e.g. they say there’s a 40% chance their startup fails when it should be 95%), or maybe they give an arrogant vibe
They seem too unwilling to change their mind upon arguments (maybe their credal resilience is too high)
They gave a probability distribution that seems wrong in some way (e.g. “50% AGI by 2030 is so overconfident, I think it should be 10%”)
This one is pernicious in that any probability distribution gives very low percentages for some range, so being specific here seems important.
Their binary estimate or probability distribution seems too different from some sort of base rate, reference class, or expert(s) that they should defer to.
How much does this overloading matter? I’m not sure, but one worry is that it allows people to score cheap rhetorical points by claiming someone else is overconfident when in practice they might mean something like “your probability distribution is wrong in some way”. Beware of accusing someone of overconfidence without being more specific about what you mean.
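One way to see that senses (1) and (2) genuinely come apart is to score them. Here is a small Python sketch with purely made-up illustrative numbers, using the Brier score (mean squared error between stated probability and the 0/1 outcome, lower is better):

```python
# Brier score: mean squared error between stated probability and outcome (0 or 1).
# Lower is better. Two distinct senses of "overconfident":
#   (1) probabilities too extreme (too far from 50%) on events that are ~50/50
#   (2) overestimating the probability of a rare event (saying 20% when ~1% is right)

def brier(forecasts, outcomes):
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Sense (1): extreme probabilities on events that are really coin flips.
outcomes = [1, 0, 1, 0, 1, 0]
extreme = [0.95] * 6
modest = [0.5] * 6
print(brier(extreme, outcomes))  # 0.4525: extreme forecaster is punished
print(brier(modest, outcomes))   # 0.25: the score of always saying 50%

# Sense (2): overestimating a rare event (true rate ~1%, forecaster says 20%).
rare_outcomes = [0] * 99 + [1]
print(brier([0.20] * 100, rare_outcomes))  # 0.046
print(brier([0.01] * 100, rare_outcomes))  # 0.0099
```

A forecaster can be "overconfident" in sense (2) while never straying far from 50%, and vice versa, which is part of why the unqualified accusation is ambiguous.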
Moore & Schatz (2017) made a similar point about different meanings of “overconfidence” in their paper The three faces of overconfidence. The abstract:
Though I do think that some of your 6 different meanings are different manifestations of the same underlying meaning.
Calling someone “overprecise” is saying that they should increase the entropy of their beliefs. In cases where there is a natural ignorance prior, it is claiming that their probability distribution should be closer to the ignorance prior. This could sometimes mean closer to 50-50 as in your point 1, e.g. the probability that the Yankees will win their next game. This could sometimes mean closer to 1/n as with some cases of your points 2 & 6, e.g. a 1/30 probability that the Yankees will win the next World Series (as they are 1 of 30 teams).
In cases where there isn’t a natural ignorance prior, saying that someone should increase the entropy of their beliefs is often interpretable as a claim that they should put less probability on the possibilities that they view as most likely. This could sometimes look like your point 2, e.g. if they think DeSantis has a 20% chance of being US President in 2030, or like your point 6. It could sometimes look like widening their confidence interval for estimating some quantity.
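The entropy framing above can be made concrete with a short Python sketch (the 30-team World Series example from above; numbers are illustrative): the uniform ignorance prior maximizes Shannon entropy, and an "overprecise" distribution has strictly less.

```python
import math

# Shannon entropy in bits. The uniform "ignorance prior" maximizes it, so
# "overprecision" can be read as: entropy lower than the evidence warrants.
def entropy(dist):
    return -sum(p * math.log2(p) for p in dist if p > 0)

ignorance_30 = [1 / 30] * 30            # 1/30 per team for the World Series
confident = [0.5] + [0.5 / 29] * 29     # 50% on one team, rest spread evenly

print(entropy(ignorance_30))  # log2(30) ≈ 4.91 bits, the maximum for 30 outcomes
print(entropy(confident))     # strictly lower entropy
```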
I feel like this should be a top-level post.
When I accuse someone of overconfidence, I usually mean they’re being too hedgehogy when they should be being more foxy.
For anyone interested in Natural Abstractions type research: https://arxiv.org/abs/2405.07987
Claude summary:
Key points of “The Platonic Representation Hypothesis” paper:
Neural networks trained on different objectives, architectures, and modalities are converging to similar representations of the world as they scale up in size and capabilities.
This convergence is driven by the shared structure of the underlying reality generating the data, which acts as an attractor for the learned representations.
Scaling up model size, data quantity, and task diversity leads to representations that capture more information about the underlying reality, increasing convergence.
Contrastive learning objectives in particular lead to representations that capture the pointwise mutual information (PMI) of the joint distribution over observed events.
This convergence has implications for enhanced generalization, sample efficiency, and knowledge transfer as models scale, as well as reduced bias and hallucination.
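For reference, the pointwise mutual information mentioned in point 4 is the standard information-theoretic quantity (this is the textbook definition, not something specific to the paper):

```latex
\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}
```

It is positive when two events co-occur more often than independence would predict, zero under independence, and negative when they co-occur less often.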
Relevance to AI alignment:
Convergent representations shaped by the structure of reality could lead to more reliable and robust AI systems that are better anchored to the real world.
If AI systems are capturing the true structure of the world, it increases the chances that their objectives, world models, and behaviors are aligned with reality rather than being arbitrarily alien or uninterpretable.
Shared representations across AI systems could make it easier to understand, compare, and control their behavior, rather than dealing with arbitrary black boxes. This enhanced transparency is important for alignment.
The hypothesis implies that scale leads to more general, flexible and uni-modal systems. Generality is key for advanced AI systems we want to be aligned.
This sounds really intriguing. I would like someone who is familiar with natural abstraction research to comment on this paper.
Epistemic status: not a lawyer, but I’ve worked with a lot of them.
As I understand it, an NDA isn’t enforceable against a subpoena (though the former employer can seek a protective order for the testimony). Someone should really encourage law enforcement or Congress to subpoena the OpenAI resigners...
A subpoena for what?
Decomposability seems like a fundamental assumption for interpretability and a condition for it to succeed. E.g. from Toy Models of Superposition:
’Decomposability: Neural network activations which are decomposable can be decomposed into features, the meaning of which is not dependent on the value of other features. (This property is ultimately the most important – see the role of decomposition in defeating the curse of dimensionality.) [...]
The first two (decomposability and linearity) are properties we hypothesize to be widespread, while the latter two (non-superposition and basis-aligned) are properties we believe only sometimes occur.’
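A minimal sketch of what decomposability buys you, under the assumption that the feature directions are known and the activation really is a linear combination of them (all vectors here are synthetic):

```python
import numpy as np

# Toy decomposability: an activation vector assumed to be a linear
# combination of known, independent feature directions.
rng = np.random.default_rng(0)
features = rng.normal(size=(3, 8))        # 3 hypothetical feature directions in an 8-dim space
true_coeffs = np.array([2.0, 0.0, -1.5])  # how active each feature is
activation = true_coeffs @ features       # the observed activation vector

# Recover the feature activations by least-squares projection onto the directions.
recovered, *_ = np.linalg.lstsq(features.T, activation, rcond=None)
print(np.round(recovered, 3))  # ≈ [2. 0. -1.5]
```

Under superposition the feature directions are no longer independent (more features than dimensions), and this clean recovery fails, which is what motivates methods like sparse autoencoders.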
If this assumption is true, it seems favorable to the prospects of safely automating large parts of interpretability work (e.g. using [V]LM agents like MAIA) differentially sooner than many other alignment research subareas and than the most consequential capabilities research (and likely before AIs become x-risky, e.g. before they pass many dangerous capabilities evals). For example, in a t-AGI framework, using an interpretability LM agent to search for the feature corresponding to a certain semantic direction should be much shorter horizon than e.g. coming up with a new conceptual alignment agenda or coming up with a new ML architecture (as well as having much faster feedback loops than e.g. training a SOTA LM using a new architecture).
Some theoretical results might also be relevant here, e.g. Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks.
Related, from The “no sandbagging on checkable tasks” hypothesis:
Quote from Shulman’s discussion of the experimental feedback loops involved in being able to check how well a proposed “neural lie detector” detects lies in models you’ve trained to lie:
A list of some contrarian takes I have:
People are currently predictably too worried about misuse risks
What people really mean by “open source” vs “closed source” labs is actually “responsible” vs “irresponsible” labs, which is not affected by regulations targeting open source model deployment.
Neuroscience as an outer alignment[1] strategy is embarrassingly underrated.
Better information security at labs is not clearly a good thing, and if we’re worried about great power conflict, probably a bad thing.
Much research on deception (Anthropic’s recent work, trojans, jailbreaks, etc) is not targeting “real” instrumentally convergent deception reasoning, but learned heuristics. Not bad in itself, but IMO this places heavy asterisks on the results they can get.
ML robustness research (like FAR Labs’ Go stuff) does not help with alignment, and helps moderately for capabilities.
The field of ML is a bad field to take epistemic lessons from. Note that I’m not talking about the results from ML.
ARC’s MAD seems doomed to fail.
People in alignment put too much faith in the general factor g. It exists, and is powerful, but is not all-consuming or all-predicting. People are often very smart, but lack social skills, or agency, or strategic awareness, etc. And vice-versa. They can also be very smart in a particular area, but dumb in other areas. This is relevant for hiring & deference, but less for object-level alignment.
People are too swayed by rhetoric in general, and alignment, rationality, & EA too, but in different ways, and admittedly to a lesser extent than the general population. People should fight against this more than they seem to (which is not really at all, except for the most overt of cases). For example, I see nobody saying they don’t change their minds on account of Scott Alexander because he’s too powerful a rhetorician. Ditto for Eliezer, since he is also a great rhetorician. In contrast, Robin Hanson is a famously terrible rhetorician, so people should listen to him more.
There is a technocratic tendency in strategic thinking around alignment (I think partially inherited from OpenPhil, but also smart people are likely just more likely to think this way) which biases people towards more simple & brittle top-down models without recognizing how brittle those models are.
A non-exact term
Ah yes, another contrarian opinion I have:
Big AGI corporations, like Anthropic, should by-default make much of their AGI alignment research private, and not share it with competing labs. Why? So it can remain a private good, and in the off-chance such research can be expected to be profitable, those labs & investors can be rewarded for that research.
I talked about this with Garrett; I’m unpacking the above comment and summarizing our discussions here.
Sleeper Agents is very much in the “learned heuristics” category, given that we are explicitly training the behavior into the model. Corollary: the underlying mechanisms for sleeper-agent behavior and instrumentally convergent deception are presumably wildly different(!), so it’s not obvious how valid an inference one can make from the results.
Consider framing Sleeper Agents as training a trojan instead of as an example of deception. See also Dan Hendrycks’ comment.
Much of existing work on deception suffers from “you told the model to be deceptive, and now it deceives, of course that happens”
(Garrett thought that the Uncovering Deceptive Tendencies paper has much less of this issue, so yay)
There is very little work on actual instrumentally convergent deception(!) - a lot of work falls into the “learned heuristics” category or the failure in the previous bullet point
People are prone to conflate between “shallow, trained deception” (e.g. sycophancy: “you rewarded the model for leaning into the user’s political biases, of course it will start leaning into users’ political biases”) and instrumentally convergent deception
(For more on this, see also my writings here and here. My writings fail to discuss the most shallow versions of deception, however.)
Also, we talked a bit about
and I interpreted Garrett saying that people often consider too few and shallow hypotheses for their observations, and are loose with verifying whether their hypotheses are correct.
Example 1: I think the Uncovering Deceptive Tendencies paper has some of this failure mode. E.g. in experiment A we considered four hypotheses to explain our observations, and these hypotheses are quite shallow/broad (e.g. “deception” includes both very shallow deception and instrumentally convergent deception).
Example 2: People generally seem to have an opinion of “chain-of-thought allows the model to do multiple steps of reasoning”. Garrett seemed to have a quite different perspective, something like “chain-of-thought is much more about clarifying the situation, collecting one’s thoughts and getting the right persona activated, not about doing useful serial computational steps”. Cases like “perform long division” are the exception, not the rule. But people seem to be quite hand-wavy about this, and don’t e.g. do causal interventions to check that the CoT actually matters for the result. (Indeed, often interventions don’t affect the final result.)
Finally, a general note: I think many people, especially experts, would agree with these points when explicitly stated. In that sense they are not “controversial”. I think people still make mistakes related to these points: it’s easy to not pay attention to the shortcomings of current work on deception, forget that there is actually little work on real instrumentally convergent deception, conflate between deception and deceptive alignment, read too much into models’ chain-of-thoughts, etc. I’ve certainly fallen into similar traps in the past (and likely will in the future, unfortunately).
I feel like much of this is the type of tacit knowledge that people just pick up as they go, but this process is imperfect and not helpful for newcomers. I’m not sure what could be done, though, beside the obvious “more people writing their tacit knowledge down is good”.
I will clarify on this. I think people often do causal interventions in their CoTs, but not in ways that are very convincing to me.
#onlyReadBadWriters #hansonFTW
I strong downvoted this because it’s too much like virtue signaling, and imports too much of the culture of Twitter. Not only the hashtags, but also the authoritative & absolute command, and hero-worship wrapped with irony in order to make it harder to call out what it is.
I swear to never joke again sir
If you have the slack, I’d be interested in hearing/chatting more about this, as I’m working (or trying to work) on the “real” “scary” forms of deception. (E.g. do you think that this paper has the same failure mode?)
I’d be happy to chat. Will DM so we can set something up.
On the subject of your paper, I do think it looks at a much more interesting phenomena than, say, sleeper agents, but I’m also not fully convinced you’re studying deliberative instrumentally convergent deception either. I think mostly your subsequent followups of narrowing down hypotheses consider a too-narrow range of ways the model could think. That is to say, I think you assume your model is some unified coherent entity that always acts cerebrally & I’m skeptical of that.
For example, the model may think something more similar to this:
Which I don’t say isn’t worrying, but in terms of how it arises, and possible mitigation strategies is very different, and probably also an easier problem to study & solve than something like:
All of these seem pretty cold tea, as in true but not contrarian.
Everyone I talk with disagrees with most of these. So maybe we just hang around different groups.
I thought Superalignment was a positive bet by OpenAI, and I was happy when they committed to putting 20% of their current compute (at the time) towards it. I stopped thinking about that kind of approach because OAI already had competent people working on it. Several of them are now gone.
It seems increasingly likely that the entire effort will dissolve. If so, OAI has now made the business decision to invest its capital in keeping its moat in the AGI race rather than basic safety science. This is bad and likely another early sign of what’s to come.
I think the research that was done by the Superalignment team should continue to happen outside of OpenAI and, if governments have a lot of capital to allocate, they should figure out a way to provide compute to continue those efforts. Or maybe there’s a better way forward. But I think it would be pretty bad if all the talent directed towards the project never gets truly leveraged into something impactful.
It’s going to have to.
Ilya is brilliant and seems to really see the horizon of the tech, but maybe isn’t the best at the business side to see how to sell it.
But this is often the curse of the ethically pragmatic. There is such a focus on the ethics part by the participants that the business side of things only sees that conversation and misses the rather extreme pragmatism.
As an example, would superaligned CEOs in the oil industry fifty years ago have still only kept their eye on quarterly share prices or considered long term costs of their choices? There’s going to be trillions in damages that the world has taken on as liabilities that could have been avoided with adequate foresight and patience.
If the market ends up with two AIs, one that will burn down the house to save on this month’s heating bill and one that will care if the house is still there to heat next month, there’s a huge selling point for the one that doesn’t burn down the house as long as “not burning down the house” can be explained as “long term net yield” or some other BS business language. If instead it’s presented to executives as “save on this month’s heating bill” vs “don’t unhouse my cats” leadership is going to burn the neighborhood to the ground.
(Source: Explained new technology to C-suite decision makers at F500s for years.)
The good news is that I think the pragmatism of Ilya’s vision on superalignment is going to become clear over the next iteration or two of models, and that’s going to be before the question of models truly being unable to be controlled crops up. I just hope that whatever he’s going to be keeping busy with will allow him to still help execute on superalignment when the market finally realizes “we should do this” for pragmatic reasons and not just amorphous ethical reasons execs just kind of ignore. And in the meantime I think, given the present pace, that Anthropic is going to continue to lay a lot of the groundwork on what’s needed for alignment on the way to superalignment anyway.
Strongly agree; I’ve been thinking for a while that something like a public-private partnership involving at least the US government and the top US AI labs might be a better way to go about this. Unfortunately, recent events seem in line with it not being ideal to only rely on labs for AI safety research, and the potential scalability of automating it should make it even more promising for government involvement. [Strongly] oversimplified, the labs could provide a lot of the in-house expertise, the government could provide the incentives, public legitimacy (related: I think of a solution to aligning superintelligence as a public good) and significant financial resources.
My timelines are lengthening.
I’ve long been a skeptic of scaling LLMs to AGI.* I fundamentally don’t understand how this is even possible. It must be said that very smart people give this view credence: davidad, dmurfet. On the other side are Vanessa Kosoy and Steven Byrnes. When pushed, proponents don’t actually defend the position that a large enough transformer will create nanotech or even obsolete their job. They usually mumble something about scaffolding.
I won’t get into this debate here, but I do want to note that my timelines have lengthened, primarily because some of the never-clearly-stated but heavily implied AI developments by proponents of very short timelines have not materialized. To be clear, it has only been a year since gpt-4 was released, and gpt-5 is around the corner, so perhaps my hope is premature. Still, my timelines are lengthening.
When gpt-3 came out, progress was blindingly fast. Part of short timelines came from a sense of ‘if we got surprised so hard by gpt-2 and gpt-3, we are completely uncalibrated; who knows what comes next?’
People seemed surprised by gpt-4 in a way that seemed uncalibrated to me. gpt-4 performance was basically in line with what one would expect if the scaling laws continued to hold. At the time it was already clear that the only really important drivers were compute and data, and that we would run out of both shortly after gpt-4. Scaling proponents suggested this was only the beginning, that there was a whole host of innovation that would be coming. Whispers of mesa-optimizers and simulators.
One year in: Chain-of-thought doesn’t actually improve things that much. External memory and super context lengths ditto. A whole list of proposed architectures seem to serve solely as a paper mill. Every month there is new hype about the latest LLM or image model. Yet they never deviate from expectations based on simple extrapolation of the scaling laws. There is only one thing that really seems to matter and that is compute and data. We have about 3 more OOMs of compute to go. Data may be milked another OOM.
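The kind of ‘simple extrapolation of the scaling laws’ I mean can be sketched like this; the loss-vs-compute points below are made up for illustration, not real benchmark data:

```python
import numpy as np

# Hypothetical (made-up) loss-vs-compute points roughly following a
# power law L(C) = a * C^(-b).
compute = np.array([1e20, 1e21, 1e22, 1e23])  # training FLOPs
loss    = np.array([3.0, 2.55, 2.17, 1.85])   # invented loss values

# Fit log10(L) = log10(a) - b * log10(C) by least squares.
slope, log_a = np.polyfit(np.log10(compute), np.log10(loss), 1)
a, b = 10 ** log_a, -slope

# "Simple extrapolation": predict loss 3 OOMs further out.
print(a * (1e26) ** (-b))
```

The point is that each successive model landing on this line, rather than above or below it, is evidence for the boring compute-and-data story over either stagnation or a surprise discontinuity.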
A big question will be whether gpt-5 will suddenly make agentGPT work ( and to what degree). It would seem that gpt-4 is in many ways far more capable than (most or all) humans yet agentGPT is curiously bad.
All in all, AI progress** is developing according to the naive extrapolations of scaling laws but nothing beyond that. The breathless twitter hype about new models is still there, but it seems to be believed at a simulacrum level higher than I can parse.
Does this mean we’ll hit an AI winter? No. In my model there may be only one remaining roadblock to ASI (and I suspect I know what it is). That innovation could come at anytime. I don’t know how hard it is, but I suspect it is not too hard.
* the term AGI seems to denote vastly different things to different people in a way I find deeply confusing. I notice that the thing that I thought everybody meant by AGI is now being called ASI. So when I write AGI, feel free to substitute ASI.
** or better, AI congress
addendum: since I’ve been quoted in dmurfet’s AXRP interview as believing that there are certain kinds of reasoning that cannot be represented by transformers/LLMs I want to be clear that this is not really an accurate portrayal of my beliefs. e.g. I don’t think transformers don’t truly understand, are just a stochastic parrot, or in other ways can’t engage in the abstract reasoning that humans do. I think this is clearly false, as seen by interacting with any frontier model.
State-of-the-art models such as Gemini aren’t LLMs anymore. They are natively multimodal or omni-modal transformer models that can process text, images, speech and video. These models seem to me like a huge jump in capabilities over text-only LLMs like GPT-3.
Links to Dan Murfet’s AXRP interview:
Transcript
Video
Agreed. I’m also pleasantly surprised that your take isn’t heavily downvoted.
I don’t recall what I said in the interview about your beliefs, but what I meant to say was something like what you just said in this post, apologies for missing the mark.
Mumble.
Chain-of-thought prompting makes models much more capable. In the original paper “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”, PaLM 540B with standard prompting only solves 18% of problems but 57% of problems with chain-of-thought prompting.
I expect the use of agent features such as reflection will lead to similar large increases in capabilities as well in the near future.
Those numbers don’t really accord with my experience actually using gpt-4. Generic prompting techniques just don’t help all that much.
I just asked GPT-4 a GSM8K problem and I agree with your point. I think what’s happening is that GPT-4 has been fine-tuned to respond with chain-of-thought reasoning by default, so it’s no longer necessary to explicitly ask it to reason step-by-step. Though if you ask it to “respond with just a single number” to eliminate the chain-of-thought reasoning, its problem-solving ability is much worse.
Lengthening from what to what?
I’ve never done explicit timelines estimates before so nothing to compare to. But since it’s a gut feeling anyway, I’m saying my gut is lengthening.
Can you expand on what you mean by “create nanotech?” If improvements to our current photolithography techniques count, I would not be surprised if (scaffolded) LLMs could be useful for that. Likewise for getting bacteria to express polypeptide catalysts for useful reactions, and even maybe figure out how to chain several novel catalysts together to produce something useful (again, referring to scaffolded LLMs with access to tools).
If you mean that LLMs won’t be able to bootstrap from our current “nanotech only exists in biological systems and chip fabs” world to Drexler-style nanofactories, I agree with that, but I expect things will get crazy enough that I can’t predict them long before nanofactories are a thing (if they ever are).
Likewise, I don’t think LLMs can immediately obsolete all of the parts of my job. But they sure do make parts of my job a lot easier. If you have 100 workers that each spend 90% of their time on one specific task, and you automate that task, that’s approximately as useful as fully automating the jobs of 90 workers. “Human-equivalent” is one of those really leaky abstractions—I would be pretty surprised if the world had any significant resemblance to the world of today by the time robotic systems approached the dexterity and sensitivity of human hands for all of the tasks we use our hands for, whereas for the task of “lift heavy stuff” or “go really fast” machines left us in the dust long ago.
Iterative improvements on the timescale we’re likely to see are still likely to be pretty crazy by historical standards. But yeah, if your timelines were “end of the world by 2026” I can see why they’d be lengthening now.
With scale, there is visible improvement in the difficulty of novel-to-chatbot ideas/details that are possible to explain in-context, things like issues with the code it’s writing. If a chatbot is below some threshold of situational awareness of a task, no scaffolding can keep it on track, but for a better chatbot trivial scaffolding might suffice. Many people can’t google for a solution to a technical issue; the difference between them and those who can is often subtle.
So a modest amount of scaling alone seems plausibly sufficient for making chatbots that can do whole jobs almost autonomously. If this works, 1-2 OOMs more of scaling becomes both economically feasible and more likely to be worthwhile. LLMs think much faster, so they only need to be barely smart enough to help with clearing those remaining roadblocks.
You may be right. I don’t know of course.
At this moment in time, it seems scaffolding tricks haven’t really improved the baseline performance of models that much. Overwhelmingly, the capability comes down to whether the RLHFed base model can do the task.
That’s what I’m also saying above (in case you are stating what you see as a point of disagreement). This is consistent with scaling-only short timeline expectations. The crux for this model is current chatbots being already close to autonomous agency and to becoming barely smart enough to help with AI research. Not them directly reaching superintelligence or having any more room for scaling.
Yes agreed.
What I don’t get about this position: if it was indeed just scaling, what’s AI research for? There would be nothing to discover; just scale more compute. Sure, you can maybe improve the speed of deploying compute a little, but at its core it seems like a story that’s in conflict with itself.
My view is that there’s huge algorithmic gains in peak capability, training efficiency (less data, less compute), and inference efficiency waiting to be discovered, and available to be found by a large number of parallel research hours invested by a minimally competent multimodal LLM powered research team. So it’s not that scaling leads to ASI directly, it’s:
scaling leads to brute forcing the LLM agent across the threshold of AI research usefulness
Using these LLM agents in a large research project can lead to rapidly finding better ML algorithms and architectures.
Training these newly discovered architectures at large scales leads to much more competent automated researchers.
This process repeats quickly over a few months or years.
This process results in AGI.
AGI, if instructed (or allowed, if it’s agentically motivated on its own to do so) to improve itself will find even better architectures and algorithms.
This process can repeat until ASI. The resulting intelligence / capability / inference speed goes far beyond that of humans.
Note that this process isn’t inevitable, there are many points along the way where humans can (and should, in my opinion) intervene. We aren’t disempowered until near the end of this.
Why do you think there are these low-hanging algorithmic improvements?
My answer to that is currently in the form of a detailed 2 hour lecture with a bibliography that has dozens of academic papers in it, which I only present to people that I’m quite confident aren’t going to spread the details. It’s a hard thing to discuss in detail without sharing capabilities thoughts. If I don’t give details or cite sources, then… it’s just, like, my opinion, man. So my unsupported opinion is all I have to offer publicly. If you’d like to bet on it, I’m open to showing my confidence in my opinion by betting that the world turns out how I expect it to.
The story involves phase changes. Just scaling is what’s likely to be available to human developers in the short term (a few years), it’s not enough for superintelligence. Autonomous agency secures funding for a bit more scaling. If this proves sufficient to get smart autonomous chatbots, they then provide speed to very quickly reach the more elusive AI research needed for superintelligence.
It’s not a little speed, it’s a lot of speed, serial speedup of about 100x plus running in parallel. This is not as visible today, because current chatbots are not capable of doing useful work with serial depth, so the serial speedup is not in practice distinct from throughput and cost. But with actually useful chatbots it turns decades to years, software and theory from distant future become quickly available, non-software projects get to be designed in perfect detail faster than they can be assembled.
Wasn’t the surprising thing about GPT-4 that scaling laws did hold? Before this many people expected scaling laws to stop before such a high level of capabilities. It doesn’t seem that crazy to think that a few more OOMs could be enough for greater than human intelligence. I’m not sure that many people predicted that we would have much faster than scaling law progress (at least until ~human intelligence AI can speed up research)? I think scaling laws are the extreme rate of progress which many people with short timelines worry about.
To some degree yes, they were not guaranteed to hold. But by that point they held for over 10 OOMs iirc and there was no known reason they couldn’t continue.
This might be the particular twitter bubble I was in but people definitely predicted capabilities beyond simple extrapolation of scaling laws.
Yesterday Greg Sadler and I met with the President of the Australian Association of Voice Actors. Like us, they’ve been lobbying for more and better AI regulation from government. I was surprised how much overlap we had in concerns and potential solutions:
1. Transparency and explainability of AI model data use (concern)
2. Importance of interpretability (solution)
3. Mis/dis information from deepfakes (concern)
4. Lack of liability for the creators of AI if any harms eventuate (concern + solution)
5. Unemployment without safety nets for Australians (concern)
6. Rate of capabilities development (concern)
They may even support the creation of an AI Safety Institute in Australia. Don’t underestimate who could be allies moving forward!
Problem of Old Evidence, the Paradox of Ignorance and Shapley Values
Paradox of Ignorance
Paul Christiano presents the “paradox of ignorance” where a weaker, less informed agent appears to outperform a more powerful, more informed agent in certain situations. This seems to contradict the intuitive desideratum that more information should always lead to better performance.
The example given is of two agents, one powerful and one limited, trying to determine the truth of a universal statement ∀x:ϕ(x) for some Δ0 formula ϕ. The limited agent treats each new value of ϕ(x) as a surprise and evidence about the generalization ∀x:ϕ(x). So it can query the environment about some simple inputs x and get a reasonable view of the universal generalization.
In contrast, the more powerful agent may be able to deduce ϕ(x) directly for simple x. Because it assigns these statements prior probability 1, they don’t act as evidence at all about the universal generalization ∀x:ϕ(x). So the powerful agent must consult the environment about more complex examples and pay a higher cost to form reasonable beliefs about the generalization.
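A minimal Bayesian sketch of the asymmetry (the 0.8 likelihood of each instance holding under ¬H is an arbitrary choice for illustration):

```python
def update(prior, lik_h, lik_not_h):
    """One Bayesian update of P(H) on an observation."""
    num = prior * lik_h
    return num / (num + (1 - prior) * lik_not_h)

p_limited = p_powerful = 0.5   # prior on H = "forall x: phi(x)"

for _ in range(5):             # observe phi(x_i) holds for five simple inputs
    # Limited agent: under not-H, each instance still holds with prob 0.8 (assumed).
    p_limited = update(p_limited, 1.0, 0.8)
    # Powerful agent: it already deduced phi(x_i), so the observation has
    # probability 1 under both hypotheses and carries no evidence.
    p_powerful = update(p_powerful, 1.0, 1.0)

print(round(p_limited, 3))   # ≈ 0.753: credence rises
print(round(p_powerful, 3))  # 0.5: credence unmoved
```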
Is it really a problem?
However, I argue that the more powerful agent is actually justified in assigning less credence to the universal statement ∀x:ϕ(x). The reason is that the probability mass provided by examples x₁, …, xₙ such that ϕ(xᵢ) holds is now distributed among the universal statement ∀x:ϕ(x) and additional causes Cⱼ known to the more powerful agent that also imply ϕ(xᵢ). Consequently, ∀x:ϕ(x) becomes less “necessary” and has less relative explanatory power for the more informed agent.
An implication of this perspective is that if the weaker agent learns about the additional causes Cⱼ, it should also lower its credence in ∀x:ϕ(x).
More generally, we would like the credence assigned to propositions P (such as ∀x:ϕ(x)) to be independent of the order in which we acquire new facts (like xᵢ, ϕ(xᵢ), and causes Cⱼ).
Shapley Value
The Shapley value addresses this limitation by providing a way to average over all possible orders of learning new facts. It measures the marginal contribution of an item (like a piece of evidence) to the value of sets containing that item, considering all possible permutations of the items. By using the Shapley value, we can obtain an order-independent measure of the contribution of each new fact to our beliefs about propositions like ∀x:ϕ(x).
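A small sketch of that averaging, with invented credences: treat each fact as a ‘player’, the credence assigned to the proposition after learning a subset of facts as the value function, and average each fact’s marginal change in credence over all learning orders:

```python
from itertools import permutations

# Hypothetical credences in H after learning each subset of three facts.
# (Made-up numbers: facts a and b support H; fact c is an alternative
# cause that "explains away" some of their support.)
credence = {
    frozenset(): 0.50,
    frozenset("a"): 0.70, frozenset("b"): 0.70, frozenset("c"): 0.40,
    frozenset("ab"): 0.85, frozenset("ac"): 0.55, frozenset("bc"): 0.55,
    frozenset("abc"): 0.65,
}

def shapley(facts):
    """Average each fact's marginal change in credence over all learning orders."""
    contrib = {f: 0.0 for f in facts}
    orders = list(permutations(facts))
    for order in orders:
        learned = frozenset()
        for f in order:
            before = credence[learned]
            learned = learned | {f}
            contrib[f] += credence[learned] - before
    return {f: c / len(orders) for f, c in contrib.items()}

print(shapley("abc"))  # a and b get positive credit; c comes out negative
```

By construction the result is order-independent, and the contributions sum to the total credence shift from learning everything.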
Further thoughts
I believe this is closely related, perhaps identical, to the ‘Problem of Old Evidence’ as considered by Abram Demski.
[Thanks to @Jeremy Gillen for pointing me towards this interesting Christiano paper]
This post sounds intriguing, but is largely incomprehensible to me due to not sufficiently explaining the background theories.
It’s funny that this has been recently shown in a paper. I’ve been thinking a lot about this phenomenon regarding fields with little to no capacity for testable predictions like history.
I got very into history over the last few years, and found there was a significant advantage to being unknowledgeable that was not available to the knowledged, and it was exactly what this paper is talking about.
By not knowing anything, I could entertain multiple bizarre ideas without immediately thinking “but no, that doesn’t make sense because of X.” And then, each of those ideas becomes in effect its own testable prediction. If there’s something to it, as I learn more about the topic I’m going to see significantly more samples of indications it could be true and few convincing indications to the contrary. But if it probably isn’t accurate, I’ll see few supporting samples and likely a number of counterexamples.
You kind of get to throw everything at the wall and see what sticks over time.
In particular, I found that it was especially powerful at identifying clustering trends in cross-discipline emerging research in things that were testable, such as archeological finds and DNA results, all within just the past decade, which despite being relevant to the field of textual history is still largely ignored in the face of consensus built on conviction.
It reminds me a lot of science historian John Heilbron’s quote, “The myth you slay today may contain a truth you need tomorrow.”
If you haven’t had the chance to slay any myths, you also haven’t preemptively killed off any truths along with them.
One of the interesting thing about AI minds (such as LLMs) is that in theory, you can turn many topics into testable science while avoiding the ‘problem of old evidence’, because you can now construct artificial minds and mold them like putty. They know what you want them to know, and so you can see what they would predict in the absence of knowledge, or you can install in them false beliefs to test out counterfactual intellectual histories, or you can expose them to real evidence in different orders to measure biases or path dependency in reasoning.
With humans, you can’t do that because they are so uncontrolled: even if someone says they didn’t know about crucial piece of evidence X, there is no way for them to prove that, and they may be honestly mistaken and have already read about X and forgotten it (but humans never really forget so X has already changed their “priors”, leading to double-counting), or there is leakage. And you can’t get people to really believe things at the drop of a hat, so you can’t make people imagine, “suppose Napoleon had won Waterloo, how do you predict history would have changed?” because no matter how you try to participate in the spirit of the exercise, you always know that Napoleon lost and you have various opinions on that contaminating your retrodictions, and even if you have never read a single book or paper on Napoleon, you are still contaminated by expressions like “his Waterloo” (‘Hm, the general in this imaginary story is going to fight at someplace called Waterloo? Bad vibes. I think he’s gonna lose.’)
But with an LLM, say, you could simply train it on all timestamped texts up to Waterloo (e.g. all surviving newspapers), have one version generate a bunch of texts about how ‘Napoleon won Waterloo’, train another version on these definitely-totally-real French newspaper reports of his stunning victory over the monarchist invaders, and then ask it to make forecasts about Europe.
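The experimental protocol described above (a timestamp cutoff, plus a synthetic counterfactual branch) can be sketched in a few lines. This is a minimal illustration, not anyone's actual pipeline; the corpus entries, cutoff date, and function names are all invented for the example, and real model training is elided entirely.

```python
from datetime import date

# Hypothetical corpus entries: (timestamp, text). All contents are illustrative.
corpus = [
    (date(1815, 6, 1), "Report on troop movements near Brussels"),
    (date(1815, 6, 20), "Dispatch from the field at Waterloo"),
    (date(1815, 7, 1), "Post-battle analysis of the campaign"),
]

CUTOFF = date(1815, 6, 18)  # the event whose aftermath we want to hide


def pretraining_slice(docs, cutoff):
    """Step 1: keep only texts timestamped strictly before the event."""
    return [text for ts, text in docs if ts < cutoff]


def counterfactual_slice(docs, cutoff, synthetic_docs):
    """Step 2: the counterfactual branch sees the same pre-event prefix,
    plus synthetic 'Napoleon won' reports in place of the real aftermath."""
    return pretraining_slice(docs, cutoff) + synthetic_docs


base = pretraining_slice(corpus, CUTOFF)
branch = counterfactual_slice(
    corpus, CUTOFF, ["Stunning victory over the monarchist invaders"]
)
```

Each slice would then feed a separate training run, and the two resulting models' forecasts about Europe could be compared; the point of the sketch is only that the information diet of each branch is explicit and auditable, which is exactly what you cannot get with human subjects.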
(These are the sorts of experiments which are why one might wind up running tons of ‘ancestor simulations’… There’s many more reasons to be simulating past minds than simply very fancy versions of playing The Sims. Perhaps we are now just distant LLM personae being tested about reasoning about the Singularity in one particular scenario involving deep learning counterfactuals, where DL worked, although in the real reality it was Bayesian program synthesis & search.)
While I agree that the potential for AI (we probably need a better term than ‘LLMs’ or ‘transformers’ as multimodal models with evolving architectures outgrow those labels) to make less testable topics more testable is quite high, I’m not sure the air-gapping of information can be as clean as you might hope.
Does the AI generating the stories of Napoleon’s victory know about the historical reality of Waterloo? Is it using something like SynthID where the other AI might inadvertently pick up on a pattern across the stories of victories distinct from the stories preceding it?
You end up with a turtles-all-the-way-down scenario in trying to control for information leakage, hoping to reach a threshold where it no longer affects the result; but given that we’re probably already seriously underestimating the degree to which correlations are mapped even in today’s models, I don’t have high hopes for tomorrow’s.
I think the way in which there’s the most impact on fields like history is the property by which truth clusters across associated samples whereas fictions form counterfactual clusters. An AI mind that is not inhibited by specialization blindness or the rule of seven plus or minus two, and that is better trained at correcting for analytical biases, may be able to see patterns in the data, particularly cross-domain, that have eluded human academics to date (this has been my personal research interest in the area, and it does seem like there’s significant room for improvement).
And yes, we certainly could be. If you’re a fan of cosmology at all: I’ve been following Neil Turok’s CPT-symmetric universe theory closely, which started with the baryon asymmetry problem and has tackled a number of the open cosmology questions since. That, paired with a QM interpretation like Everett’s, starts to look like the symmetric universe is our reference and the MWI branches are variations of its modeling around quantization uncertainties.
(I’ve found myself thinking often lately about how given our universe at cosmic scales and pre-interaction at micro scales emulates a mathematically real universe, just what kind of simulation and at what scale might be able to be run on a real computing neural network.)
Beautifully illustrated and amusingly put, sir!
A variant of what you are saying is that AI may once and for all allow us to calculate the true counterfactual Shapley value of scientific contributions.
(re: ancestor simulations: I think you are onto something here. Compare the Q hypothesis: https://twitter.com/dalcy_me/status/1780571900957339771. See also speculations about the Zhuangzi hypothesis here.)
Yup. Who knows but we are all part of a giant leave-one-out cross-validation computing counterfactual credit assignment on human history? Schmidhuber-em will be crushed by the results.
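For readers unfamiliar with the Shapley value being invoked here: it assigns each contributor their average marginal contribution over all orderings of the coalition, which is exactly the "counterfactual credit assignment" flavor of the joke. Below is a minimal exact computation over a toy characteristic function; the player names and payoff function are invented for illustration, and exact enumeration like this is only feasible for a handful of players.

```python
from itertools import combinations
from math import factorial


def shapley(players, value):
    """Exact Shapley values for a characteristic function `value`,
    which maps a frozenset of players to a real-valued payoff."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            for coalition in combinations(others, k):
                S = frozenset(coalition)
                # Probability that exactly S precedes p in a random ordering.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value(S | {p}) - value(S))
        phi[p] = total
    return phi


# Toy model: a discovery is worth 1 only if both the theorist and the
# experimentalist contribute; the popularizer adds no marginal value.
def v(S):
    return 1.0 if {"theorist", "experimentalist"} <= S else 0.0


credit = shapley(["theorist", "experimentalist", "popularizer"], v)
# theorist and experimentalist each get 0.5; the popularizer gets 0.0
```

Leave-one-out credit ("what is lost if we delete this one contributor?") is the crude special case; the Shapley value averages that deletion over every possible subset, which is why it is the natural formalization of "counterfactual scientific credit".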
This doesn’t feel like it resolves that confusion for me, I think it’s still a problem with the agents he describes in that paper.
The causes Cj are just the direct computation of Φ for small values of x. If they were arguments that only had bearing on small values of x and implied nothing about larger values (e.g. an adversary selected some x to show you, but filtered for x such that Φ(x) holds), then it makes sense that this evidence has no bearing on ∀x:Φ(x). But when there was no selection or other reason that the argument only applies to small x, then to me it feels like the existence of the evidence (even though already proven/computed) should still increase the credence of the forall.
I didn’t intend the causes C_j to equate to direct computation of Φ(x_i) on the x_i. They are rather other pieces of evidence that the powerful agent has that make it believe Φ(x_i). I don’t know if that’s what you meant.
I agree that seeing x_i such that Φ(x_i) should increase credence in ∀x:Φ(x) even in the presence of knowledge of C_j. And the Shapley value proposal will do so.
(Bad tex. On my phone)
I’ve compiled a big set of expert opinions on AI, along with my inferred percentages from them. I expect that some people will disagree with them.
I’d appreciate hearing your criticisms so I can improve them or fill in entries I’m missing.
https://docs.google.com/spreadsheets/d/1HH1cpD48BqNUA1TYB2KYamJwxluwiAEG24wGM2yoLJw/edit?usp=sharing
No data wall blocking GPT-5. That seems clear. For future models, will there be data limitations? Unclear.
https://youtube.com/clip/UgkxPCwMlJXdCehOkiDq9F8eURWklIk61nyh?si=iMJYatfDAZ_E5CtR
The first thing I noticed with GPT-4o is that “her” appears ‘flirty’, especially in the interview video demo. I wonder if it was done on purpose.
(This is the tale of a potentially reasonable CEO of the leading AGI company, not the one we have in the real world. Written after a conversation with @jdp.)
You’re the CEO of the leading AGI company. You start to think that your moat is not as big as it once was. You need more compute and need to start accelerating to give yourself a bigger lead, otherwise this will be bad for business.
You start to look around for compute, and realize that 20% of it was handed off to the superalignment team (you even made a public commitment!). You make the decision to take their compute away to maintain a strong lead in the AGI race, while expecting there will be backlash.
Your plan is to lobby government and tell them that AGI race dynamics are too intense at the moment and you were forced to make a tough call for the business. You tell government that it’s best if they put heavy restrictions on AGI development, otherwise your company will not be able to afford to subsidize basic research in alignment.
You give them a plan that you think they should follow if they want AGI to be developed safely and for companies to invest in basic research.
You tell your top employees about this plan, but they have a hard time believing you, given that they feel you lied about your public commitment to give them 20% of current compute. You didn’t actually lie, or at least it wasn’t intentional: you just thought the moat was bigger, and when you realized it wasn’t, you had to make a business decision. Many things have happened since that commitment.
Anyway, your safety researchers are not happy about this at all and decide to resign.
To be continued…
So, you go to government and lobby. Except you never intended to help the government get involved in some kind of slow-down or pause. Your intent was to use this entire story as a smokescreen for getting rid of those who didn’t align with you, and to lobby the government in such a way that they don’t think it is such a big deal that your safety researchers are resigning.
You were never the reasonable CEO, and now you have complete power.
Idea: Daniel Kokotajlo probably lost quite a bit of money by not signing an OpenAI NDA before leaving, which I consider a public service at this point. Could some of the funders of the AI safety landscape give some money or social reward for this?
I guess reimbursing everything Daniel lost might be a bit too much for funders, but providing some money, both to reward the act and to incentivize future safety people not to sign NDAs, would have very high value.
@Daniel Kokotajlo If you indeed avoided signing an NDA, would you be able to share how much you passed up as a result of that? I might indeed want to create a precedent here and maybe try to fundraise for some substantial fraction of it.
To clarify: I did sign something when I joined the company, so I’m still not completely free to speak (still under confidentiality obligations). But I didn’t take on any additional obligations when I left.
Unclear how to value the equity I gave up, but it probably would have been about 85% of my family’s net worth at least. But we are doing fine, please don’t worry about us.
Mostly for @habryka’s sake: it sounds like you are likely describing your unvested equity, or possibly equity that gets clawed back on quitting. Neither of which is (usually) tied to signing an NDA on the way out the door—they’d both be lost simply due to quitting.
The usual arrangement is some extra severance payment tied to signing something on your way out the door, and that’s usually way less than the unvested equity.
My current best guess is that actually cashing out the vested equity is tied to an NDA, but I am really not confident. OpenAI has a bunch of really weird equity arrangements.
Can you speak to any, let’s say, “hypothetical” specific concerns that somebody who was in your position at a company like OpenAI might have had that would cause them to quit in a similar way to you?
One is the change to the charter to allow the company to work with the military.
https://news.ycombinator.com/item?id=39020778
I think the board must be thinking about how to get some independence from Microsoft, and there are not many entities who can counterbalance one of the biggest companies in the world. The government’s intelligence and defence industries are some of them (as are Google, Meta, Apple, etc). But that move would require secrecy: to avoid stoking nationalistic race dynamics, to honor contracts, and to avoid a backlash.
EDIT: I’m getting a few disagrees, would someone mind explaining why they disagree with these wild speculations?
They didn’t change their charter.
https://forum.effectivealtruism.org/posts/2Dg9t5HTqHXpZPBXP/ea-community-needs-mechanisms-to-avoid-deceptive-messaging
Thanks, I hadn’t seen that, I find it convincing.
Is that your family’s net worth is $100 and you gave up $85? Or your family’s net worth is $15 and you gave up $85?
Either way, hats off!
The latter. Yeah idk whether the sacrifice was worth it but thanks for the support. Basically I wanted to retain my ability to criticize the company in the future. I’m not sure what I’d want to say yet though & I’m a bit scared of media attention.
I appreciate that you are not speaking loudly if you don’t yet have anything loud to say.
I’d be interested in hearing peoples’ thoughts on whether the sacrifice was worth it, from the perspective of assuming that counterfactual Daniel would have used the extra net worth altruistically. Is Daniel’s ability to speak more freely worth more than the altruistic value that could have been achieved with the extra net worth?
(Note: Regardless of whether it was worth it in this case, simeon_c’s reward/incentivization idea may be worthwhile as long as there are expected to be some cases in the future where it’s worth it, since the people in those future cases may not be as willing as Daniel to make the altruistic personal sacrifice, and so we’d want them to be able to retain their freedom to speak without it costing them as much personally.)
I think having signed an NDA (and especially a non-disparagement agreement) from a major capabilities company should probably rule you out of any kind of leadership position in AI Safety, and especially any kind of policy position. Given that I think Daniel has a pretty decent chance of doing either or both of these things, and that work is very valuable and constrained on the kind of person that Daniel is, I would be very surprised if this wasn’t worth it on altruistic grounds.
Edit: As Buck points out, different non-disclosure agreements can differ hugely in scope. To be clear, I think non-disclosure agreements that cover specific data or information you were given seem fine, but non-disclosure agreements that cover their own existence, or that are very broadly worded and prevent you from basically talking about anything related to an organization, are pretty bad. My sense is the stuff that OpenAI employees are asked to sign when they leave is very constraining, but my guess is the kind of stuff that people have to sign for a small amount of contract work or for events is not very constraining, though I would definitely read any contract carefully in this space.
Strong disagree re signing non-disclosure agreements (which I’ll abbreviate as NDAs). I think it’s totally reasonable to sign NDAs with organizations; they don’t restrict your ability to talk about things you learned other ways than through the ways covered by the NDA. And it’s totally standard to sign NDAs when working with organizations. I’ve signed OpenAI NDAs at least three times, I think—once when I worked there for a month, once when I went to an event they were running, once when I visited their office to give a talk.
I think non-disparagement agreements are way more problematic. At the very least, signing secret non-disparagement agreements should probably disqualify you from roles where your silence re an org might be interpreted as a positive sign.
It might be good on the current margin to have a norm of publicly listing any non-disclosure agreements you have signed (e.g. on one’s LW profile), and their rough scope, so that other people can model what information you’re committed to not sharing, and flag if it covers anything beyond the details of technical research being done (e.g. if it is about social relationships, conflicts, or criticism).
I have added the one NDA that I have signed to my profile.
But everyone has lots of duties to keep secrets or preserve privacy and the ones put in writing often aren’t the most important. (E.g. in your case.)
I’ve signed ~3 NDAs. Most of them are irrelevant now and useless for people to know about, like yours.
I agree in special cases it would be good to flag such things — like agreements to not share your opinions on a person/org/topic, rather than just keeping trade secrets private.
I agree with this overall point, although I think “trade secrets” in the domain of AI can be relevant for people having surprising timelines views that they can’t talk about.
My understanding is that the extent of NDAs can differ a lot between different implementations, so it might be hard to speak in generalities here. From the revealed behavior of people I poked here who have worked at OpenAI full-time, the OpenAI NDAs seem very comprehensive and limiting. My guess is also the NDAs for contractors and for events are a very different beast and much less limiting.
Also, the de facto result of signing non-disclosure agreements is that people don’t feel comfortable navigating the legal ambiguity and default very strongly to not sharing approximately any information about the organization at all.
Maybe people would do better things here with more legal guidance, and I agree that you don’t generally seem super constrained in what you feel comfortable saying, but like I sure now have run into lots of people who seem constrained by NDAs they signed (even without any non-disparagement component). Also, if the NDA has a gag clause that covers the existence of the agreement, there is no way to verify the extent of the NDA, and that makes navigating this kind of stuff super hard and also majorly contributes to people avoiding the topic completely.
Notably, there are some lawyers here on LessWrong who might help (possibly even for the lols, you never know). And you can look at case law and guidance to see if clauses are actually enforceable or not (many are not). To anyone reading, here’s habryka doing just that
I worked at OpenAI for three years, from 2021-2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to managing a team of 4 people which worked on trying to understand language model features in context, leading to the release of an open source “transformer debugger” tool.
I resigned from OpenAI on February 15, 2024.
What are your timelines like? How long do YOU think we have left?
I know several CEOs of small AGI startups who seem to have gone crazy and told me that they are self inserts into this world, which is a simulation of their original self’s creation. However, none of them talk about each other, and presumably at most one of them can be meaningfully right?
One AGI CEO hasn’t gone THAT crazy (yet), but is quite sure that the November 2024 election will be meaningless because pivotal acts will have already occurred that make nation state elections visibly pointless.
Also I know many normies who can’t really think probabilistically and mostly aren’t worried at all about any of this… but one normie who can calculate is pretty sure that we have AT LEAST 12 years (possibly because his retirement plans won’t be finalized until then). He also thinks that even systems as “mere” as TikTok will be banned before the November 2024 election because “elites aren’t stupid”.
I think I’m likely to be better calibrated than any of these people, because most of them don’t seem to focus very much on “hedging” or “thoughtful doubting”, whereas my event space assigns non-zero probability to ensembles that contain such features of possible futures (including these specific scenarios).
new observations > new thoughts when it comes to calibrating yourself.
The best-calibrated people are those who get lots of interaction with the real world, not those who think a lot or have a complicated inner model. Tetlock’s superforecasters were gamblers and weathermen.
Do you know if the origin of this idea for them was a psychedelic or dissociative trip? I’d give it at least even odds, with most of the remaining chances being meditation or Eastern religions...
Wait, you know smart people who have NOT, at some point in their life: (1) taken a psychedelic NOR (2) meditated, NOR (3) thought about any of buddhism, jainism, hinduism, taoism, confucianisn, etc???
To be clear to naive readers: psychedelics are, in fact, non-trivially dangerous.
I personally worry I already have “an arguably-unfair and a probably-too-high share” of “shaman genes” and I don’t feel I need exogenous sources of weirdness at this point.
But in the SF bay area (and places on the internet memetically downstream from IRL communities there) a lot of that is going around, memetically (in stories about) and perhaps mimetically (via monkey see, monkey do).
The first time you use a serious one you’re likely getting a permanent modification to your personality (+0.5 stddev to your Openness?) and arguably/sorta each time you do a new one, or do a higher dose, or whatever, you’ve committed “1% of a personality suicide” by disrupting some of your most neurologically complex commitments.
To a first approximation my advice is simply “don’t do it”.
HOWEVER: this latter consideration actually suggests: anyone seriously and truly considering suicide should perhaps take a low dose psychedelic FIRST (with at least two loving tripsitters and due care) since it is also maybe/sorta “suicide” but it leaves a body behind that most people will think is still the same person and so they won’t cry very much and so on?
To calibrate this perspective a bit, I also expect that even if cryonics works, it will also cause an unusually large amount of personality shift. A tolerable amount. An amount that leaves behind a personality similar-enough-to-the-current-one-to-not-have-triggered-a-ship-of-theseus-violation-in-one-modification-cycle. Much more than a stressful day and then bad nightmares and a feeling of regret the next day, but weirder. With cryonics, you might wake up to some effects that are roughly equivalent to “having taken a potion of youthful rejuvenation, and not having the same birthmarks, and also learning that you’re separated-by-disjoint-subjective-deaths from LOTS of people you loved when you experienced your first natural death”, for example. This is a MUCH BIGGER CHANGE than just having a nightmare and waking up with a change of heart (and most people don’t have nightmares and changes of heart every night (at least: I don’t, and neither do most people I’ve asked)).
Remember, every improvement is a change, though not every change is an improvement. A good “epistemological practice” is sort of an idealized formal praxis for making yourself robust to “learning any true fact” and changing only in GOOD ways from such facts.
A good “axiological practice” (which I don’t know of anyone working on except me (and I’m only doing it a tiny bit, not with my full mental budget)) is sort of an idealized formal praxis for making yourself robust to “humanely heartful emotional changes”(?) and changing only in <PROPERTY-NAME-TBD> ways from such events.
(Edited to add: Current best candidate name for this property is: “WISE” but maybe “healthy” works? (It depends on whether the Stoics or Nietzsche were “more objectively correct” maybe? The Stoics, after all, were erased and replaced by Platonism-For-The-Masses (AKA “Christianity”) so if you think that “staying implemented in physics forever” is critically important then maybe “GRACEFUL” is the right word? (If someone says “vibe-alicious” or “flowful” or “active” or “strong” or “proud” (focusing on low latency unity achieved via subordination to simply and only power) then they are probably downstream of Heidegger and you should always be ready for them to change sides and submit to metaphorical Nazis, just as Heidegger subordinated himself to actual Nazis without really violating his philosophy at all.)))
I don’t think that psychedelics fit neatly into EITHER category. Drugs in general are akin to wireheading, except wireheading is when something reaches into your brain to overload one or more of your positive-value-tracking-modules (as a trivially semantically invalid shortcut to achieving positive value “out there” in the state-of-affairs that your tracking modules are trying to track), but actual humans have LOTS of <thing>-tracking-modules, and culture and science barely have any RIGOROUS vocabulary for any of them.
Note that many of these neurological <thing>-tracking-modules were evolved.
Also, many of them will probably be “like hands” in terms of AI’s ability to model them.
This is part of why AIs should be existentially terrifying to anyone who is spiritually adept.
AI that sees the full set of causal paths to modifying human minds will be “like psychedelic drugs with coherent persistent agendas”. Humans have basically zero cognitive security systems. Almost all security systems are culturally mediated, and then (absent complex interventions) lots of the brain stuff freezes in place around the age of puberty, and then other stuff freezes around 25, and so on. This is why we protect children from even TALKING to untrusted adults: they are too plastic and not savvy enough. (A good heuristic for the lowest level of “infohazard” is “anything you wouldn’t talk about in front of a six year old”.)
Humans are sorta like a bunch of unpatchable computers, exposing “ports” to the “internet”, where each of our port numbers is simply a lightly salted semantic hash of an address into some random memory location that stores everything, including our operating system.
Your word for “drugs” and my word for “drugs” don’t point to the same memory addresses in the computers implementing our souls. Also, our souls themselves don’t even have the same nearby set of “documents” (because we just have different memories n’stuff)… but the word “drugs” is not just one of the ports… it is a port that deserves a LOT of security hardening.
The bible said ~”thou shalt not suffer a ‘pharmakeia’ to live” for REASONS.
Wondering why this has so many disagreement votes. Perhaps people don’t like to see the serious topic of “how much time do we have left” alongside evidence that there’s a population of AI entrepreneurs so far removed from consensus reality that they now think they’re living in a simulation.
(edit: The disagreement for @JenniferRM’s comment was at something like −7. Two days later, it’s at −2)
It could just be because it reaches a strong conclusion on anecdotal/clustered evidence (e.g. it might say more about her friend group than anything else). Along with claims to being better calibrated for weak reasons—which could be true, but seems not very epistemically humble.
Full disclosure I downvoted karma, because I don’t think it should be top reply, but I did not agree or disagree.
But Jen seems cool, I like weird takes, and downvotes are not a big deal—just a part of a healthy contentious discussion.
For most of my comments, I’d almost be offended if I didn’t say something surprising enough to get a “high interestingness, low agreement” voting response. Excluding speech acts, why even say things if your interlocutor or full audience can predict what you’ll say?
And I usually don’t offer full clean proofs in direct words. Anyone still pondering the text at the end, properly, shouldn’t “vote to agree”, right? So from my perspective… it’s fine and sorta even working as intended <3
However, also, this is currently the top-voted response to me, and if William_S himself reads it I hope he answers here, if not with text then (hopefully? even better?) with a link to a response elsewhere?
((EDIT: Re-reading everything above this point, I notice that I totally left out the “basic take” that might go roughly like: “Kurzweil, Altman, and Zuckerberg are right about compute hardware (not software or philosophy) being central, and there’s a compute bottleneck rather than a compute overhang, so the speed of history will KEEP being about datacenter budgets and chip designs, and those happen on 6-to-18-month OODA loops that could actually fluctuate based on economic decisions, and therefore it’s maybe 2026, or 2028, or 2030, or even 2032 before things pop, depending on how and when billionaires and governments decide to spend money”.))
Pulling honest posteriors from people who’ve “seen things we wouldn’t believe” gives excellent material for trying to perform aumancy… work backwards from their posteriors to possible observations, and then forwards again, toward what might actually be true :-)
I assume timelines are fairly long or this isn’t safety related. I don’t see a point in keeping PPUs or even caring about NDA lawsuits which may or may not happen and would take years in a short timeline or doomed world.
I think having a probability distribution over timelines is the correct approach. Like, in the comment above:
Even in probabilistic terms, the evidence of OpenAI members respecting their NDAs makes it more likely that this was some sort of political infighting (EA related) than sub-year takeoff timelines. I would be open to a 1 year takeoff, I just don’t see it happening given the evidence. OpenAI wouldn’t need to talk about raising trillions of dollars, companies wouldn’t be trying to commoditize their products, and the employees who quit OpenAI would speak up.
Political infighting is in general just more likely than very short timelines, which would run counter to most prediction markets on the matter. Not to mention, given that it already happened with the firing of Sam Altman, it’s far more likely to have happened again.
If there were a probability distribution over timelines, current events would indicate that sub-3-year ones have negligible odds. If I am wrong about this, I implore the OpenAI employees to speak up. I don’t think normies misunderstand probability distributions; they just usually tend not to care about unlikely events.
No, OpenAI (assuming that it is a well-defined entity) also uses a probability distribution over timelines.
(In reality, every member of its leadership has their own probability distribution, and this translates to OpenAI having a policy and behavior formulated approximately as if there is some resulting single probability distribution).
The important thing is, they are uncertain about timelines themselves (in part, because no one knows how perplexity translates to capabilities, in part, because there might be difference with respect to capabilities even with the same perplexity, if the underlying architectures are different (e.g. in-context learning might depend on architecture even with fixed perplexity, and we do see a stream of potentially very interesting architectural innovations recently), in part, because it’s not clear how big is the potential of “harness”/”scaffolding”, and so on).
This does not mean there is no political infighting. But it’s on the background of them being correctly uncertain about true timelines...
Compute-wise, inference demands are huge and growing with popularity of the models (look how much Facebook did to make LLama 3 more inference-efficient).
So if they expect models to become useful enough for almost everyone to want to use them, they should worry about compute, assuming they do want to serve people like they say they do (I am not sure how this will look for very strong AI systems; they will probably be gradually expanding access, and the speed of expansion might depend on available compute).
Why can at most one of them be meaningfully right?
Would not a simulation typically be “a multi-player game”?
(But yes, if they assume that their “original self” was the sole creator (?), then they would all be some kind of “clones” of that particular “original self”. Which would surely increase the overall weirdness.)
These are valid concerns! I presume that if “in the real timeline” there was a consortium of AGI CEOs who agreed to share costs on one run, and fiddled with their self-inserts, then they… would have coordinated more? (Or maybe they’re trying to settle a bet on how the Singularity might counterfactually have happened in the event of this or that person experiencing this or that coincidence? But in that case I don’t think the self-inserts would be allowed to say they’re self-inserts.)
Like why not re-roll the PRNG, to censor out the counterfactually simulable timelines that included me hearing from any of the REAL “self inserts of the consortium of AGI CEOS” (and so I only hear from “metaphysically spurious” CEOs)??
Or maybe the game engine itself would have contacted me somehow to ask me to “stop sticking causal quines in their simulation” and somehow I would have been induced by such contact to not publish this?
Mostly I presume AGAINST “coordinated AGI CEO stuff in the real timeline” along any of these lines because, as a type, they often “don’t play well with others”. Fucking oligarchs… maaaaaan.
It seems like a pretty normal thing, to me, for a person to naturally keep track of simulation concerns as a philosophic possibility (it’s kinda basic “high school theology”, right?)… which might become one’s “one track reality narrative” as a sort of “stress induced psychotic break away from a properly metaphysically agnostic mental posture”?
That’s my current working psychological hypothesis, basically.
But to the degree that it happens more and more, I can’t entirely shake the feeling that my probability distribution over “the time T of a pivotal act occurring” (distinct from when I anticipate I’ll learn that it happened, which of course must be LATER than both T and later than now) shouldn’t just include times in the past, but should actually be a distribution over complex numbers or something...
...but I don’t even know how to do that math? At best I can sorta see how to fit it into exotic grammars where it “can have happened counterfactually” or so that it “will have counterfactually happened in a way that caused this factually possible recurrence” or whatever. Fucking “plausible SUBJECTIVE time travel”, fucking shit up. It is so annoying.
Like… maybe every damn crazy AGI CEO’s claims are all true except the ones that are mathematically false?
How the hell should I know? I haven’t seen any not-plausibly-deniable miracles yet. (And all of the miracle reports I’ve heard were things I was pretty sure the Amazing Randi could have duplicated.)
All of this is to say, Hume hasn’t fully betrayed me yet!
Mostly I’ll hold off on performing normal updates until I see for myself, and hold off on performing logical updates until (again!) I see a valid proof for myself <3
Thank you for your work there. Curious what specifically prompted you to post this now, presumably you leaving OpenAI and wanting to communicate that somehow?
No comment.
Can you confirm or deny whether you signed any NDA related to you leaving OpenAI?
(I would guess a “no comment” or lack of response or something to that degree implies a “yes” with reasonably high probability. Also, when deciding how to respond here, you might be interested in this link: the National Labor Relations Board has ruled that NDAs offered during severance agreements are unlawful when they cover the existence of the NDA itself.)
(not a lawyer)
My layman’s understanding is that managerial employees are excluded from that ruling, unfortunately. Which I think applies to William_S if I read his comment correctly. (See pg. 11, the “Excluded” section of the PDF in your link.)
I am a lawyer.
I think one key point that is missing is this: regardless of whether the NDA and the subsequent gag order are legitimate or not, William would still have to spend thousands of dollars on a court case to vindicate his rights. This sort of strong-arm litigation has become very common in the modern era. It’s also just… very stressful. If you’ve just resigned from a company you probably used to love, you likely don’t want to drag all of your old friends, bosses and colleagues into a court case.
Edit: also, if William left for reasons involving AGI safety, maybe entering into (what would likely be a very public) court case would be counterproductive to his reasons for leaving? You probably don’t want to alarm the public by couching existential threats in legal jargon. American judges have the annoying tendency to valorise themselves as celebrities when confronting AI (see Musk v OpenAI).
Are you familiar with USA NDAs? I’m sure there are lots of clauses that have been ruled invalid by case law? In many cases, non-lawyers have no idea about these, so you might be able to make a difference with very little effort. There is also the possibility that valuable OpenAI shares could be rescued?
If you haven’t seen it, check out this thread where one of the OpenAI leavers did not sign the gag order.
I have reviewed his post. Two (2) things to note:
(1) Invalidity of the NDA does not guarantee William will be compensated after the trial. Even if he is, his job prospects may be hurt long-term.
(2) States have different laws on whether the NLRA trumps internal company memorandums. More importantly, labour disputes are traditionally solved through internal bargaining. Presumably, the collective bargaining ‘hand-off’ involving NDAs and gag orders at this level will waive subsequent litigation in district courts. The precedent Habryka offered refers to hostile severance agreements only, not the waiving of the dispute mechanism itself.
I honestly wish I could use this dialogue as a discreet communication to William on a way out, assuming he needs help, but I re-affirm my previous worries about the costs.
I also add here, rather cautiously, that there are solutions. However, it would depend on whether William was an independent contractor, how long he worked there, whether it actually involved a trade secret (as others have mentioned) and so on. The whole reason NDAs tend to be so effective is that they obfuscate the material needed to even know or be aware of what remedies are available.
Interesting! For most of us, this is outside our area of competence, so we appreciate your input.
I can see some arguments in your direction but would tentatively guess the opposite.
I think it is safe to infer, from the conspicuous and repeated silence of ex-OA employees when asked whether they signed an NDA that also included a gag order about the NDA itself, that there is in fact an NDA with a gag order in it, presumably tied to the OA LLC PPUs (which are not real equity and so probably even less protected than usual).
By “gag order” do you mean just as a matter of private agreement, or something heavier-handed, with e.g. potential criminal consequences?
I have trouble understanding the absolute silence we seem to be seeing. There have been very few leaks, and all of them are very mild-mannered and are failing to build any consensus narrative that challenges OA’s press in the public sphere.
Are people not able to share info over Signal or otherwise tolerate some risk here? It doesn’t add up to me if the risk is just some chance of OA trying to then sue you to bankruptcy, especially since I think a lot of us would offer support in that case, and the media wouldn’t paint OA in a good light for it.
I am confused. (And I am grateful to William for at least saying this much, given the climate!)
I would guess that there isn’t a clear smoking gun that people aren’t sharing because of NDAs, just a lot of more subtle problems that add up to leaving (and in some cases saying OpenAI isn’t being responsible etc).
This is consistent with the observation of the board firing Sam but not having a clear crossed line to point at for why they did it.
It’s usually easier to notice when the incentives are pointing somewhere bad than to explain what’s wrong with them, and it’s easier to notice when someone is being a bad actor than it is to articulate what they did wrong. (Both of these run a higher risk of false positives relative to more crisply articulable problems.)
The lack of leaks could just mean that there’s nothing interesting to leak. Maybe William and others left OpenAI over run-of-the-mill office politics and there’s nothing exceptional going on related to AI.
Rest assured, there is plenty that could leak at OA… (And might were there not NDAs, which of course is much of the point of having them.)
For a past example, note that no one knew that Sam Altman had been fired as YC CEO, for reasons similar to his later firing as OA CEO, until the extreme aggravating factor of the OA coup, 5 years later. That was certainly more than ‘run of the mill office politics’, I’m sure you’ll agree, but if that could be kept secret, surely lesser things now could be kept secret well past 2029?
At least one of them has explicitly indicated they left because of AI safety concerns, and this thread seems to be insinuating some concern. Ilya Sutskever’s conspicuous silence has become a meme, and Altman recently expressed that he is uncertain of Ilya’s employment status. There still hasn’t been any explanation for the boardroom drama last year.
If it was indeed run-of-the-mill office politics and all was well, then something to the effect of “our departures were unrelated, don’t be so anxious about the world ending, we didn’t see anything alarming at OpenAI” would obviously help a lot of people and also be a huge vote of confidence for OpenAI.
It seems more likely that there is some (vague?) concern but it’s been overridden by tremendous legal/financial/peer motivations.
What’s PPU?
From here:
Daniel K seems pretty open about his opinions and reasons for leaving. Did he not sign an NDA and thus gave up whatever PPUs he had?
When I spoke to him a few weeks ago (a week after he left OAI), he had not signed an NDA at that point, so it seems likely that he hasn’t.
Does anyone know if it’s typically the case that people under gag orders about their NDAs can talk to other people who they know signed the same NDAs? That is, if a bunch of people quit a company and all have signed self-silencing NDAs, are they normally allowed to talk to each other about why they quit and commiserate about the costs of their silence?