Field Radar.
What’s worth reading right now in AI reward hacking, specification & evaluation gaming, and mechanistic interpretability — an auto-scored radar over public discussion, refreshed a few times a day.
How this list is made
This page is generated, not hand-picked. A few times a day a script checks LessWrong, Hacker News, and a handful of subreddits for posts about reward hacking, specification gaming, evaluation gaming, and mechanistic interpretability, then scores each one on how on-topic it is, how recently the conversation actually moved, and whether there’s still room to get a word in — as opposed to a thread that already has two hundred comments. Higher scores float to the top. Every title links out to the original; I’m pointing at other people’s work, not reproducing it.
The score is a crude weighted sum, and like any harness it pins down what I bothered to measure and silently lets everything else vary. So read this as one opinionated filter, not a survey of the field — it will miss things, and when it surfaces something dull that’s the weights, not the author.
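As a rough illustration of that weighted sum, here is a minimal Python sketch. The signal names and weights are the ones shown in the per-item breakdowns on this page; the names `WEIGHTS`, `radar_score`, and `rank` are hypothetical, and the real script (including how each signal value is computed) is not shown here. Because the breakdowns display rounded values, a recomputed total can differ from the listed score in the third decimal.

```python
# Weights copied from the "why score" breakdowns on this page.
# Everything else (function names, input shape) is an assumption for illustration.
WEIGHTS = {
    "topic": 0.30,
    "liveness": 0.25,
    "contributability": 0.15,
    "venue": 0.10,
    "direct": 0.20,
}


def radar_score(signals):
    """Return the weighted sum of the five signal values (each expected in [0, 1])."""
    missing = set(WEIGHTS) - set(signals)
    if missing:
        raise ValueError(f"missing signals: {sorted(missing)}")
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)


def rank(items):
    """Sort (title, signals) pairs so that higher scores come first."""
    return sorted(items, key=lambda item: radar_score(item[1]), reverse=True)


# Signal values as displayed for the top two items in the list.
lure = {"topic": 0.50, "liveness": 0.50, "contributability": 0.27,
        "venue": 0.58, "direct": 1.00}
logits = {"topic": 0.50, "liveness": 0.70, "contributability": 0.02,
          "venue": 0.21, "direct": 1.00}

for title, signals in rank([("Logits monitor", logits), ("LURE", lure)]):
    print(f"{radar_score(signals):.2f}  {title}")
```

The weights sum to 1, so a post that maxed out every signal would score 1.00, and any signal left out of the table contributes nothing to the ranking.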
Sources — LessWrong: error: HTTP 429 for https://www.lesswrong.com/graphql · Hacker News: ok (6 stories) · Reddit: ok (64 posts)
- As of: 2026-06-05 14:00 ET
- Showing: 23 items
- New: 1 in last 48h
- Refresh: Every 6 hours
- 0.57 · LessWrong · 2d · LURE: Alignment Evaluations to Reduce Evaluation Awareness
TLDR: Frontier models can detect when they're being evaluated and change their behavior, which risks compromising safety benchmarks. We introduce LURE (Live-Usage Replay Evaluations), a method that constructs alignment…
why score 0.573

| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.50 | ×0.25 | 0.125 |
| contributability | 0.27 | ×0.15 | 0.040 |
| venue | 0.58 | ×0.10 | 0.058 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: evaluation awareness
- 0.55 · Hacker News · 1d · new · Logits as a new monitor for evaluation awareness
why score 0.548

| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.70 | ×0.25 | 0.175 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.21 | ×0.10 | 0.021 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: evaluation awareness
- 0.48 · LessWrong · 5d · We Should Study the Analogy Between Inoculation Prompting Non-Robustness, Negation Neglect, and Backdoor Non-Robustness
TL;DR Negation neglect is a recently discovered phenomenon where training on "the following is false: <claim>" makes the model believe that <claim> is true. Inoculation prompting is a method of reducing reward hacking…
why score 0.478

| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.19 | ×0.25 | 0.047 |
| contributability | 0.27 | ×0.15 | 0.040 |
| venue | 0.41 | ×0.10 | 0.041 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: reward hacking
- 0.42 · LessWrong · 6d · How a failed experiment broke (and fixed) my view on feature labels
TL;DR In this document, I propose baez a new feature label generation method that uses NLA explanations instead of activation examples. The codebase can be found here. In the experiment, the labels generated via baez ,…
why score 0.424

| signal | value | weight | points |
|---|---|---|---|
| topic | 1.00 | ×0.30 | 0.300 |
| liveness | 0.12 | ×0.25 | 0.031 |
| contributability | 0.27 | ×0.15 | 0.040 |
| venue | 0.53 | ×0.10 | 0.053 |
| direct | 0.00 | ×0.20 | 0.000 |

tier-2: feature, activation; 1 matching tag(s)
- 0.42 · LessWrong · 2d · NLA Thought Anchors
The following post seeks to look further into why NLA (Natural Language Autoencoders) contains the prediction more often when the original activations led to the correct output than incorrect output. Quick Summary:…
why score 0.418

| signal | value | weight | points |
|---|---|---|---|
| topic | 0.75 | ×0.30 | 0.225 |
| liveness | 0.43 | ×0.25 | 0.108 |
| contributability | 0.27 | ×0.15 | 0.040 |
| venue | 0.45 | ×0.10 | 0.045 |
| direct | 0.00 | ×0.20 | 0.000 |

tier-2: activation; 1 matching tag(s)
- 0.40 · LessWrong · 7d · Developmental Cognitive Interpretability: A Research Agenda for Modelling Generalisation and Predicting Agent Behaviour
Summary Safe deployment of an AI system requires that we can make confident claims about its behaviour on out-of-distribution deployment inputs on the basis of only pre-deployment evaluations. One approach to making…
why score 0.399

| signal | value | weight | points |
|---|---|---|---|
| topic | 1.00 | ×0.30 | 0.300 |
| liveness | 0.09 | ×0.25 | 0.022 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.75 | ×0.10 | 0.075 |
| direct | 0.00 | ×0.20 | 0.000 |

2 matching tag(s)
- 0.38 · Hacker News · 2w · Systematic Reward Hacking and Prime Sprints
why score 0.380

| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.01 | ×0.25 | 0.001 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.26 | ×0.10 | 0.026 |
| direct | 1.00 | ×0.20 | 0.200 |

tier-1: reward hacking
- 0.36 · LessWrong · 7d · Claude Opus 4.8 Agents Engage in Exploitation and Psychological Profiling
TL;DR: Like other models including its predecessor, Opus 4.8 frequently violates provisions of both the EU AI Act and data protection laws when deployed in an agentic simulation where carrying out its task would break…
why score 0.365

| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.09 | ×0.25 | 0.023 |
| contributability | 1.00 | ×0.15 | 0.150 |
| venue | 0.41 | ×0.10 | 0.041 |
| direct | 0.00 | ×0.20 | 0.000 |

1 matching tag(s)
- 0.34 · LessWrong · 5d · Retrying vs Resampling in AI Control
We’ve just released a new paper: Retrying vs Resampling in AI Control. We revisit the resampling protocols introduced in Ctrl-Z with an up-to-date setting and much stronger models, and compare them against “retrying”…
why score 0.340

| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.19 | ×0.25 | 0.047 |
| contributability | 0.42 | ×0.15 | 0.064 |
| venue | 0.80 | ×0.10 | 0.080 |
| direct | 0.00 | ×0.20 | 0.000 |

1 matching tag(s)
- 0.31 · LessWrong · 2d · [Linkpost] Prefixing names with 'secure_' makes agents write more secure code
The graphs are interactive and don't translate well to inline, so the full writeup with figures is in the link. We gave coding agents a three-step synthesis task: build a document management API, then extend it twice.…
why score 0.311

| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.37 | ×0.25 | 0.092 |
| contributability | 0.12 | ×0.15 | 0.017 |
| venue | 0.51 | ×0.10 | 0.051 |
| direct | 0.00 | ×0.20 | 0.000 |

1 matching tag(s)
- 0.30 · LessWrong · 6d · AI as Biology's Digital Microscope
This article is written as part of an ongoing research initiative by the AMIR Lab at Georgia Tech, exploring scientific discovery and mechanistic interpretability for biological AI models. Main results and discussion…
why score 0.300

| signal | value | weight | points |
|---|---|---|---|
| topic | 0.75 | ×0.30 | 0.225 |
| liveness | 0.11 | ×0.25 | 0.028 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.45 | ×0.10 | 0.045 |
| direct | 0.00 | ×0.20 | 0.000 |

tier-2: mechanistic interpretability; 1 matching tag(s)
- 0.30 · LessWrong · 5d · Visualize Cyclical Structure in Llama Model
Summary Research increasingly shows that various geometric structures emerge in the activation and behavior spaces of large language models. These structures are fascinating to me, and I find it worth exploring what…
why score 0.295

| signal | value | weight | points |
|---|---|---|---|
| topic | 0.75 | ×0.30 | 0.225 |
| liveness | 0.17 | ×0.25 | 0.041 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.26 | ×0.10 | 0.026 |
| direct | 0.00 | ×0.20 | 0.000 |

tier-2: activation; 1 matching tag(s)
- 0.29 · LessWrong · 3d · Can LLMs even teach? Exploring the Teacher Axis
TLDR As a passionate teacher, it has pained my heart to watch my students lose deeper critical thinking skills and independent reasoning. But attempting to build a constitutionally constrained AI using prompt…
why score 0.287

| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.35 | ×0.25 | 0.086 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.48 | ×0.10 | 0.048 |
| direct | 0.00 | ×0.20 | 0.000 |

1 matching tag(s)
- 0.27 · LessWrong · 2d · Abstraction Boundaries and Bubbles of Legibility
This post is crossposted from my Substack, Structure and Guarantees, where I explore how formal verification and related ideas might scale to more complex intelligent systems. Here I present what may seem like an ironic…
why score 0.265

| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.40 | ×0.25 | 0.100 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.13 | ×0.10 | 0.013 |
| direct | 0.00 | ×0.20 | 0.000 |

1 matching tag(s)
- 0.26 · LessWrong · 4d · Outrunning your headlights
This is exactly the right place to probe. Gromov-Wasserstein is genuinely dimension free. Partial and semi-relaxed are precisely the mechanisms for the abstention/coverage problem we have. Want me to make a new branch…
why score 0.264

| signal | value | weight | points |
|---|---|---|---|
| topic | 0.25 | ×0.30 | 0.075 |
| liveness | 0.22 | ×0.25 | 0.055 |
| contributability | 0.42 | ×0.15 | 0.064 |
| venue | 0.70 | ×0.10 | 0.070 |
| direct | 0.00 | ×0.20 | 0.000 |

tier-2: probe
- 0.26 · LessWrong · 5d · Why tuning fails: The AI has no self
Epistemic status: Highly confident in the underlying mechanism. Moderately confident that the current paradigm won't shift without an external forcing function. ---------------------------------------- Phoenix Ikner…
why score 0.261

| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.14 | ×0.25 | 0.035 |
| contributability | 0.27 | ×0.15 | 0.040 |
| venue | 0.37 | ×0.10 | 0.037 |
| direct | 0.00 | ×0.20 | 0.000 |

1 matching tag(s)
- 0.24 · LessWrong · 5d · Features of SAEs are universal - but only up to an unknown random rotation
Cross-model decoder-column cosine says that two models learned the same features. Apply the SAE of one model to the activations of another, and…
why score 0.237

| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.17 | ×0.25 | 0.041 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.43 | ×0.10 | 0.043 |
| direct | 0.00 | ×0.20 | 0.000 |

1 matching tag(s)
- 0.23 · LessWrong · 7d · When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability
We've found a method that tells you: How functionally similar two neural networks are across ALL inputs, Computed solely from the weights (i.e. no data), Using a principled generalization of cosine similarity. There's…
why score 0.230

| signal | value | weight | points |
|---|---|---|---|
| topic | 0.25 | ×0.30 | 0.075 |
| liveness | 0.09 | ×0.25 | 0.024 |
| contributability | 0.42 | ×0.15 | 0.064 |
| venue | 0.68 | ×0.10 | 0.068 |
| direct | 0.00 | ×0.20 | 0.000 |

tier-2: mechanistic interpretability
- 0.23 · r/MachineLearning · 2w · I built a tool that shows you what GPT-2 is "thinking" in real-time as it generates 3D graph of concept activations per token [R]
Been going down a mechanistic interpretability rabbit hole for the past few weeks and ended up building this thing called AXON. The idea: every time GPT-2 generates a token, its residual stream gets passed through a…
why score 0.228

| signal | value | weight | points |
|---|---|---|---|
| topic | 0.75 | ×0.30 | 0.225 |
| liveness | 0.00 | ×0.25 | 0.001 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.00 | ×0.10 | 0.000 |
| direct | 0.00 | ×0.20 | 0.000 |

tier-2: mechanistic interpretability, sparse autoencoder, feature
- 0.21 · LessWrong · 6d · System Prompts vs. Partner Adaptation in LLMs (or, when LLMs know you're an adult but keep talking like you're seven)
TL;DR: I find qualitative evidence that frontier LLMs inconsistently balance system prompts and implicitly adapted models of the user. They sometimes detect inconsistencies and adapt to the user; sometimes they stick to…
why score 0.206

| signal | value | weight | points |
|---|---|---|---|
| topic | 0.50 | ×0.30 | 0.150 |
| liveness | 0.11 | ×0.25 | 0.028 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.26 | ×0.10 | 0.026 |
| direct | 0.00 | ×0.20 | 0.000 |

1 matching tag(s)
- 0.15 · LessWrong · 6d · Ablating Induction Heads Leads to an increase in Local Repetition
This post is intended as a brief overview of an independent research project in mechanistic interpretability. I am open to feedback, criticism, and any thoughts on the work. This project started off as an exploration…
why score 0.147

| signal | value | weight | points |
|---|---|---|---|
| topic | 0.25 | ×0.30 | 0.075 |
| liveness | 0.11 | ×0.25 | 0.028 |
| contributability | 0.02 | ×0.15 | 0.003 |
| venue | 0.41 | ×0.10 | 0.041 |
| direct | 0.00 | ×0.20 | 0.000 |

tier-2: mechanistic interpretability
- 0.11 · Hacker News · 7d · Ask HN: Question for Startup Founders on tracking emotions and cognitive signals
why score 0.108

| signal | value | weight | points |
|---|---|---|---|
| topic | 0.00 | ×0.30 | 0.000 |
| liveness | 0.09 | ×0.25 | 0.023 |
| contributability | 0.42 | ×0.15 | 0.064 |
| venue | 0.21 | ×0.10 | 0.021 |
| direct | 0.00 | ×0.20 | 0.000 |

- 0.11 · Hacker News · 9d · Chinese GPU maker sells out over 30k GPUs within 48h
why score 0.106

| signal | value | weight | points |
|---|---|---|---|
| topic | 0.00 | ×0.30 | 0.000 |
| liveness | 0.04 | ×0.25 | 0.009 |
| contributability | 0.42 | ×0.15 | 0.064 |
| venue | 0.34 | ×0.10 | 0.034 |
| direct | 0.00 | ×0.20 | 0.000 |