~www_lesswrong_com | Bookmarks (713)

Do No Harm? Navigating and Nudging AI Moral Choices — LessWrong

lesswrong.com

Published on February 6, 2025 7:18 PM GMT. TL;DR: How do AI systems make moral decisions, and can we influence their ethical judgments? We probe these questions by examining the responses of Llama 70B (3.1 and 3.3) to moral dilemmas, using the Goodfire API to steer its decision-making process. Our experiments reveal that simply reframing ethical questions - from "harm one to save many" to "let many perish to...
1
Open Philanthropy Technical AI Safety RFP - $40M Available Across 21 Research Areas — LessWrong

lesswrong.com

Published on February 6, 2025 6:58 PM GMT. Open Philanthropy is launching a big new Request for Proposals for technical AI safety research, with plans to fund roughly $40M in grants over the next 5 months, and available funding for substantially more depending on application quality. Applications (here) start with a simple 300 word expression of interest and are open until April 15, 2025. Overview: We're seeking proposals across...
1
AISN #47: Reasoning Models — LessWrong

lesswrong.com

Published on February 6, 2025 6:52 PM GMT. Welcome to the AI Safety Newsletter by the Center for AI Safety. We discuss developments in AI and AI safety. No technical background required. Listen to the AI Safety Newsletter for free on Spotify or Apple Podcasts. Reasoning Models: DeepSeek-R1 has been one of the most significant model releases since ChatGPT. After its release, DeepSeek’s app quickly rose to...
1
Wild Animal Suffering Is The Worst Thing In The World — LessWrong

lesswrong.com

Published on February 6, 2025 4:15 PM GMT. Crossposted from my blog which many people are saying you should check out! Imagine that you came across an injured deer on the road. She was in immense pain, perhaps having been mauled by a bear or seriously injured in some other way. Two things are obvious: If you could greatly help her at small cost, you should do...
1
Detecting Strategic Deception Using Linear Probes — LessWrong

lesswrong.com

Published on February 6, 2025 3:46 PM GMT. Can you tell when an LLM is lying from the activations? Are simple methods good enough? We recently published a paper investigating if linear probes detect when Llama is deceptive. Abstract: AI models might use deceptive strategies as part of scheming or misaligned behaviour. Monitoring outputs alone is insufficient, since the AI might produce seemingly benign outputs while its internal...
1
Alignment Paradox and a Request for Harsh Criticism — LessWrong

lesswrong.com

Published on February 5, 2025 6:17 PM GMT. I’m not a scientist, engineer, or alignment researcher in any respect; I’m a failed science fiction writer. I have a tendency to write opinionated essays that I rarely finish. It’s good that I rarely finish them, however, because if I did, I would generate far too much irrelevant slop. ...
1
Introducing International AI Governance Alliance (IAIGA) — LessWrong

lesswrong.com

Published on February 5, 2025 4:02 PM GMT. The International AI Governance Alliance (IAIGA) is a new non-profit organization being incorporated in Geneva, Switzerland. It has two goals: Establish an independent global collective intelligence organization dedicated to coordinating AI research, development, and the distribution of AI-derived economic benefits. Develop and enforce a legally binding treaty that sets stringent safety standards for AI research and development and mandates fair...
1
Language Models Use Trigonometry to Do Addition — LessWrong

lesswrong.com

Published on February 5, 2025 1:50 PM GMT. I (Subhash) am a Master's student in the Tegmark AI Safety Lab at MIT. I am interested in recruiting for full-time roles this Spring - please reach out if you're interested in working together! TLDR: This blog post accompanies the paper "Language Models Use Trigonometry to Do Addition." Key findings: ...
1
Reviewing LessWrong: Screwtape's Basic Answer — LessWrong

lesswrong.com

Published on February 5, 2025 4:30 AM GMT. Yeah I put this off until the last day, and I'm not sure this is the format Raemon was actually looking for. Oh well. Then, in proportion to how valuable they seem, spend at least some time this month reflecting... ...on the big picture of what intellectual progress seems important to you. Do it whatever way is most valuable...
1
Journalism student looking for sources — LessWrong

lesswrong.com

Published on February 4, 2025 6:58 PM GMT. Hello LessWrong community, I am a journalism student doing my capstone documentary on AI alignment. However, this is a topic that I want to make sure is done well. The last thing I would want is to confuse or mislead anyone. That said, I have a few questions: Who would be the best people to reach out to for an interview?...
1
Nick Land: Orthogonality — LessWrong

lesswrong.com

Published on February 4, 2025 9:07 PM GMT. Editor's note: Due to the interest aroused by @jessicata's posts on the topic, Book review: Xenosystems and The Obliqueness Thesis, I thought I'd share a compendium of relevant Xenosystem posts I have put together. If you, like me, have a vendetta against trees, a tastefully typeset LaTeχ version is available at this link. If your bloodlust extends even further,...
1
Subjective Naturalism in Decision Theory: Savage vs. Jeffrey–Bolker — LessWrong

lesswrong.com

Published on February 4, 2025 8:34 PM GMT. Summary: This post outlines how a view we call subjective naturalism[1] poses challenges to classical Savage-style decision theory. Subjective naturalism requires (i) richness (the ability to represent all propositions the agent can entertain, including self-referential ones) and (ii) austerity (excluding events the agent deems impossible). It is one way of making precise certain requirements of embedded agency. We then...
1
Anti-Slop Interventions? — LessWrong

lesswrong.com

Published on February 4, 2025 7:50 PM GMT. In his recent post arguing against AI Control research, John Wentworth argues that the median doom path goes through AI slop, rather than scheming. I find this to be plausible. I believe this suggests a convergence of interests between AI capabilities research and AI alignment research. Historically, there has been a lot of concern about differential progress amongst...
1
We’re in Deep Research — LessWrong

lesswrong.com

Published on February 4, 2025 5:20 PM GMT. The latest addition to OpenAI’s Pro offerings is their version of Deep Research. Have you longed for 10k word reports on anything your heart desires, 100 times a month, at a level similar to a graduate student intern? We have the product for you. Table of Contents: The Pitch. It’s Coming. Is It Safe? How Does Deep...
1
The Capitalist Agent — LessWrong

lesswrong.com

Published on February 4, 2025 3:32 PM GMT. With the ongoing evolutions in “artificial intelligence”, of course we’re seeing the emergence of agents, i.e. AIs which can do rather complex tasks autonomously. The first step is automation, but of what? First comes the stuff where humans currently act like computers anyway: sales, marketing, clerks and everyone else who’s doing repetitive things. But what’s the main metric which they...
1
Forecasting AGI: Insights from Prediction Markets and Metaculus — LessWrong

lesswrong.com

Published on February 4, 2025 1:03 PM GMT. I have tried to find all prediction market and Metaculus questions related to AGI timelines. Here I examine how they compare to each other, and what they actually say about when AGI might arrive. If you know of a market that I have missed, please tell me in the comment section! It would also be helpful if you...
1
Ruling Out Lookup Tables — LessWrong

lesswrong.com

Published on February 4, 2025 10:39 AM GMT. This post was written during Alex Altair's agent foundations fellowship funded by the LTFF. This is not particularly surprising or complex, but I wanted it written up somewhere. When we have been discussing the Agent-like Structure problem, lookup tables often come up as a useful counter-example or intuition pump for how a system could exhibit agent-like behaviour without...
1
Half-baked idea: a straightforward method for learning environmental goals? — LessWrong

lesswrong.com

Published on February 4, 2025 6:56 AM GMT. Epistemic status: I want to propose a method of learning environmental goals (a super big, super important subproblem in Alignment). It's informal, so it has a lot of gaps. I worry I missed something obvious, rendering my argument completely meaningless. I asked the LessWrong feedback team, but they couldn't get someone knowledgeable enough to take a look. Can you tell...
1
Information Versus Action — LessWrong

lesswrong.com

Published on February 4, 2025 5:13 AM GMT. You can get a clearer view of what's going on if you're willing to ignore certain types of information when making decisions. If you heavily use a source of information to make important decisions, that source of information gains new pressure that can make it worse. See Goodhart's Law and Why I Am Not In Charge. I. Imagine you...
1
Utilitarian AI Alignment: Building a Moral Assistant with the Constitutional AI Method — LessWrong

lesswrong.com

Published on February 4, 2025 4:15 AM GMT. Abstract: This post presents a project from the AI Safety Fundamentals: AI Alignment course, in which I developed a utilitarian AI assistant using the Constitutional AI (CAI) method. By applying the CAI method, the project constructs a model that strictly follows utilitarian principles, and evaluates its responses across a diverse set of 150 prompts ranging from harmful to...
1
Tear Down the Burren — LessWrong

lesswrong.com

Published on February 4, 2025 3:40 AM GMT. I love the Burren. It hosts something like seven weekly sessions in a range of styles and the back room has hosted many great acts including many of my friends. It's a key space in the Boston folk scene, and it's under threat from developers who want to tear it down. But after thinking it through,...
1
Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog) — LessWrong

lesswrong.com

Published on February 4, 2025 2:55 AM GMT. Excerpt below. Follow the link for the full post. In our new paper, we describe a system based on Constitutional Classifiers that guards models against jailbreaks. These Constitutional Classifiers are input and output classifiers trained on synthetically generated data that filter the overwhelming majority of jailbreaks with minimal over-refusals and without incurring a large compute overhead. We are currently...
1
Can someone, anyone, make superintelligence a more concrete concept? — LessWrong

lesswrong.com

Published on February 4, 2025 2:18 AM GMT. What especially worries me about artificial intelligence is that I'm freaked out by my inability to marshal the appropriate emotional response. - Sam Harris (NPR, 2017) I've been thinking a lot about why so many in the public don't care much about the loss-of-control risk posed by artificial superintelligence, and I believe a big reason is that...
2
Gradual Disempowerment, Shell Games and Flinches — LessWrong

lesswrong.com

Published on February 2, 2025 2:47 PM GMT. Over the past year and a half, I've had numerous conversations about the risks we describe in Gradual Disempowerment. (The shortest useful summary of the core argument is: To the extent human civilization is human-aligned, most of the reason for the alignment is that humans are extremely useful to various social systems like the economy, and states, or...
1
