
Field Radar.

What’s worth reading right now in AI reward hacking, specification & evaluation gaming, and mechanistic interpretability — an auto-scored radar over public discussion, refreshed a few times a day.

How this list is made

This page is generated, not hand-picked. A few times a day a script checks LessWrong, Hacker News, and a handful of subreddits for posts about reward hacking, specification gaming, evaluation gaming, and mechanistic interpretability, then scores each one on how on-topic it is, how recently the conversation actually moved, and whether there’s still room to get a word in — as opposed to a thread that already has two hundred comments. Higher scores float to the top. Every title links out to the original; I’m pointing at other people’s work, not reproducing it.

The score is a crude weighted sum, and like any harness it pins down what I bothered to measure and silently lets everything else vary. So read this as one opinionated filter, not a survey of the field: it will miss things, and when it surfaces something dull, blame the weights, not the author.
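As a rough illustration of that weighted sum, here is a minimal sketch in Python. The five weights are the ones shown in the per-item "why score" tables on this page; the names `WEIGHTS`, `radar_score`, and `rank`, and the item layout with a "signals" dict, are hypothetical and are not taken from the actual script.

```python
# Weights as displayed in the "why score" tables; they sum to 1.00.
WEIGHTS = {
    "topic": 0.30,
    "liveness": 0.25,
    "contributability": 0.15,
    "venue": 0.10,
    "direct": 0.20,
}


def radar_score(signals):
    """Weighted sum of signal values; a missing signal counts as 0.0."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)


def rank(items):
    """Return items sorted by score, highest first (hypothetical item layout)."""
    return sorted(items, key=lambda item: radar_score(item["signals"]), reverse=True)


# Signal values copied from the first item's table on this page.
first = {
    "topic": 1.00,
    "liveness": 0.85,
    "contributability": 0.27,
    "venue": 0.51,
    "direct": 0.00,
}
print(round(radar_score(first), 3))  # 0.604, matching the displayed score
```

Because the tables show signal values rounded to two decimals, recomputing a score from the displayed values can differ from the displayed total by about 0.001.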

Sources — LessWrong: ok (11 on-topic) · Hacker News: ok (7 stories) · Reddit: ok (25 posts) — some subreddits rate-limited

As of: 2026-07-28 02:00 ET
Showing: 25 items
New: 7 in last 48h
Refresh: Every 6 hours

0.60 · LessWrong · 11h · new
Multi-Turn Drift Increases Scheming · Carlos Guerrero Alvarez
TLDR - We talk about scheming, and why research on this phenomenon is crucial for AI safety. We find a particular environment/scenario where scheming happens at a higher rate than normal. We provide hypotheses for why…
- ai-control
- deceptive-alignment
- ai
why score 0.604
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.85 ×0.25 0.213
contributability 0.27 ×0.15 0.040
venue 0.51 ×0.10 0.051
direct 0.00 ×0.20 0.000
2 matching tag(s)
0.59 · LessWrong · 5d
A Multi-Agent Extension for Petri · carissacullen
Intro Petri is an open-source framework built on Inspect AI for automated AI Safety evaluations first released by Anthropic, but now maintained and developed by Meridian Labs. Each evaluation involves three agents, the…
- ai-control
- ai-evaluations
- ai-safety-2
- multi-agent-safety
why score 0.590
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.17 ×0.25 0.042
contributability 0.02 ×0.15 0.003
venue 0.45 ×0.10 0.045
direct 1.00 ×0.20 0.200
2 matching tag(s)
0.54 · LessWrong · 2w
Linear Probes add little for Verifiable Reward Hacking · Chandram Dutta
Summary Tested whether linear probes can detect reward hacking early during GRPO training on a small model. Used a synthetic arithmetic task with a planted bug in the reward checker. Probes achieved near-perfect…
- interpretability-ml-and-ai
- reinforcement-learning
- ai
why score 0.544
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.01 ×0.25 0.002
contributability 0.02 ×0.15 0.003
venue 0.39 ×0.10 0.039
direct 1.00 ×0.20 0.200
tier-1: reward hacking; 1 matching tag(s)
0.53 · LessWrong · 10h · new
When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models · Sai Kartheek Reddy
By Sai Kartheek Reddy Kasu, Nils Lukas, and Samuele Poppi This post is a summary of our accepted paper at the ICML 2026 Workshop on Failure Modes in Agentic AI (FAGEN). The full paper is available here TL;DR The Setup:…
- ai-oversight
- chain-of-thought-alignment
- deceptive-alignment
- interpretability-ml-and-ai
why score 0.531
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.86 ×0.25 0.216
contributability 0.02 ×0.15 0.003
venue 0.13 ×0.10 0.013
direct 0.00 ×0.20 0.000
2 matching tag(s)
0.51 · LessWrong · 1d
LLMs are (still) mostly powered by imitative learning, not RL · Steven Byrnes
Reinforcement learning from verifiable rewards (RLVR) is the hot new thing in LLM training. It’s so hot, and people spend so much time talking about it, that they sometimes lose sight of the big picture. Stepping back,…
- language-models-llms
- reinforcement-learning
- ai
why score 0.512
signal value weight points
topic 0.50 ×0.30 0.150
liveness 0.64 ×0.25 0.160
contributability 0.72 ×0.15 0.108
venue 0.94 ×0.10 0.094
direct 0.00 ×0.20 0.000
1 matching tag(s)
0.51 · LessWrong · 2w
Bounding eval awareness of ~human-level AI across the safe-to-dangerous shift · Patrick Leask
In our last post, we argued that measuring evaluation awareness is fundamentally challenging because of the safe-to-dangerous distributional shift: we cannot directly measure the evaluation awareness of a model without…
- ai
why score 0.506
signal value weight points
topic 0.50 ×0.30 0.150
liveness 0.00 ×0.25 0.000
contributability 0.57 ×0.15 0.085
venue 0.70 ×0.10 0.070
direct 1.00 ×0.20 0.200
tier-1: evaluation awareness
0.50 · LessWrong · 1d · new
Inoculate or Reflect? Two training interventions under prompting, steering, and patching · Ayesha Imran
Anthropic's recent paper, Verbalizable Representations Form a Global Workspace in Language Models, contains a small experiment near the end that we found more interesting than the main findings. Surprising that it's so…
- ai-control
- ai-safety-2
- interpretability-ml-and-ai
- ai
why score 0.498
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.61 ×0.25 0.152
contributability 0.02 ×0.15 0.003
venue 0.43 ×0.10 0.043
direct 0.00 ×0.20 0.000
2 matching tag(s)
0.46 · LessWrong · 2w
A global workspace in language models · wesg
[This is the blog post for our new paper Verbalizable Representations Form a Global Workspace in Language Models Readers might also be interested in: the Public commentary, Github and Neuronpedia] As you read this…
- interpretability-ml-and-ai
- language-models-llms
- ai
why score 0.462
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.00 ×0.25 0.000
contributability 0.41 ×0.15 0.062
venue 1.00 ×0.10 0.100
direct 0.00 ×0.20 0.000
2 matching tag(s)
0.45 · Hacker News · 3d
Show HN: VinvAI – Ties runtime trace to code segment, prevents reward hacking · itsAnshul
why score 0.453
signal value weight points
topic 0.50 ×0.30 0.150
liveness 0.32 ×0.25 0.080
contributability 0.02 ×0.15 0.003
venue 0.21 ×0.10 0.021
direct 1.00 ×0.20 0.200
tier-1: reward hacking
0.45 · LessWrong · 3d
Linear probes tell you where quantization will hurt · Aniket Ghosh
Epistemic status: I have only tested one encoder family (BERT-base and its relatives) and one decoder LLM (Qwen2.5-3B), one seed, token-level tasks, and post-training weight quantization. I trust the results because I…
- interpretability-ml-and-ai
- language-models-llms
- machine-learning-ml
- ai
why score 0.450
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.34 ×0.25 0.085
contributability 0.02 ×0.15 0.003
venue 0.63 ×0.10 0.063
direct 0.00 ×0.20 0.000
2 matching tag(s)
0.45 · Hacker News · 3d
Show HN: Vinv-Ties every runtime trace to code segment, prevents reward hacking · sohamac
why score 0.449
signal value weight points
topic 0.50 ×0.30 0.150
liveness 0.28 ×0.25 0.070
contributability 0.02 ×0.15 0.003
venue 0.26 ×0.10 0.026
direct 1.00 ×0.20 0.200
tier-1: reward hacking
0.45 · LessWrong · 3d
SONI: Selective Orthogonalisation via Noise Injection · Jasper Chong
This project was completed as a capstone for TARA. All code is available in github. TL;DR The Problem: Neural networks use superposition to pack many concepts into small latent spaces by making feature vectors…
- interpretability-ml-and-ai
- logic-and-mathematics
- ai
why score 0.445
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.34 ×0.25 0.085
contributability 0.02 ×0.15 0.003
venue 0.57 ×0.10 0.057
direct 0.00 ×0.20 0.000
tier-2: feature, activation; 1 matching tag(s)
0.44 · LessWrong · 21h · new
Can we teach a model to encode a semantic feature on a chosen manifold in just three channels? · Phu Hoang
This is my submission to BlueDot's Technical AI Safety Puzzle #1, for which I received an Honorable Mention. Congratulations to Gustavo Korzune Gurgel, Patryk Perduta (his amazing write-up), Sam Spilllard, Karine…
- ai-control
- ai
why score 0.442
signal value weight points
topic 0.75 ×0.30 0.225
liveness 0.74 ×0.25 0.184
contributability 0.02 ×0.15 0.003
venue 0.30 ×0.10 0.030
direct 0.00 ×0.20 0.000
tier-2: feature; 1 matching tag(s)
0.44 · LessWrong · 7d
Fable is SOTA at CIFAR Speedrun (& specification gaming) · rohuang
Fulcrum is working on an AI R&D optimization benchmark. Here, we present results from one of our tasks, including preliminary results from Fable. For more detail on Fable’s solution, check out…
why score 0.438
signal value weight points
topic 0.50 ×0.30 0.150
liveness 0.08 ×0.25 0.021
contributability 0.02 ×0.15 0.003
venue 0.64 ×0.10 0.064
direct 1.00 ×0.20 0.200
tier-1: specification gaming
0.43 · LessWrong · 14h · new
Quadrillion Param Costs: KV Cache, Context Length, Frontier Margins · Vladimir_Nesov
The models of 2028-2031 get much bigger than the models of 2026, going from 10T total params in 2026 to maybe 240T params in 2028 [1] and then 1.4 quadrillion params in 2031, as I estimate in the previous post from HBM…
- ai-timelines
- compute
- language-models-llms
- scaling-laws
why score 0.433
signal value weight points
topic 0.50 ×0.30 0.150
liveness 0.81 ×0.25 0.203
contributability 0.02 ×0.15 0.003
venue 0.77 ×0.10 0.077
direct 0.00 ×0.20 0.000
1 matching tag(s)
0.43 · LessWrong · 1d · new
What Happens When a Collusion Probe Only Finds a Thin Signal? · elenaajayi
From the SPEC-GAP pre-fellowship phase to the fellowship phase which involves live indirect prompt-injection trajectories. TL;DR Prior work found that linear probes could distinguish honest from deceptive responses in a…
- ai-control
why score 0.429
signal value weight points
topic 0.75 ×0.30 0.225
liveness 0.64 ×0.25 0.160
contributability 0.02 ×0.15 0.003
venue 0.41 ×0.10 0.041
direct 0.00 ×0.20 0.000
tier-2: probe; 1 matching tag(s)
0.42 · LessWrong · 1d
Challenge: Hand coding weights for efficient sequence memorisation · Linda Linsefors
We hand coded weights for one layer MLPs that memorises labels for input token sequences of length two. The number of facts our hand-coded models can memorise with 90% accuracy[1] scales roughly linearly with the models'…
- comp-in-sup
- interpretability-ml-and-ai
- ai
why score 0.424
signal value weight points
topic 0.50 ×0.30 0.150
liveness 0.55 ×0.25 0.138
contributability 0.42 ×0.15 0.064
venue 0.73 ×0.10 0.073
direct 0.00 ×0.20 0.000
1 matching tag(s)
0.42 · LessWrong · 2w
Tie training can make DPO/RLHF-trained AIs generalize better · Elliott Thornley
This post covers our recent ICML paper: Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training. TL;DR Our theorems and experiments suggest that DPO and RLHF…
- inner-alignment
- machine-learning-ml
- rlhf
- ai
why score 0.418
signal value weight points
topic 0.75 ×0.30 0.225
liveness 0.00 ×0.25 0.000
contributability 0.79 ×0.15 0.118
venue 0.74 ×0.10 0.074
direct 0.00 ×0.20 0.000
tier-2: feature; 1 matching tag(s)
0.42 · LessWrong · 4d
V&V takes on OpenAI’s long-horizon incidents · Yoav Hollander
[Cross-posted from The Foretellix CTO Blog. These short takes try to put a verification-and-validation slant on AI-safety / alignment topics – they are not full treatments. I co-originated coverage-driven verification…
- ai-control
- ai-evaluations
- verification
- ai
why score 0.417
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.22 ×0.25 0.055
contributability 0.02 ×0.15 0.003
venue 0.59 ×0.10 0.059
direct 0.00 ×0.20 0.000
2 matching tag(s)
0.41 · LessWrong · 4d
Fixing rewards for NLA to reduce confabulation · SEONG PYO HONG
Hello, This is my first post on Lesswrong. Hope my contribution makes the world a better and safer place. Note: 1. This post is 100% human-written. 2. Full paper in preparation for ICLR 2027 Anthropic's NLA(Natural…
- interpretability-ml-and-ai
- ai
why score 0.405
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.25 ×0.25 0.061
contributability 0.02 ×0.15 0.003
venue 0.41 ×0.10 0.041
direct 0.00 ×0.20 0.000
tier-2: mechanistic interpretability, sparse autoencoder; 1 matching tag(s)
0.40 · LessWrong · 2w
Models are blind outside the J-space. NLAs aren't. · Pranav Viswanath
TLDR: On Llama-3.3-70B, I found thoughts it cannot see that are actively steering its behavior; and Anthropic's released NLA (Natural Language Autoencoder) reads them anyway. When asked if it sees a hidden thought, the…
- ai-control
- interpretability-ml-and-ai
- ai
why score 0.402
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.00 ×0.25 0.001
contributability 0.27 ×0.15 0.040
venue 0.61 ×0.10 0.061
direct 0.00 ×0.20 0.000
2 matching tag(s)
0.39 · LessWrong · 5d
Mechanistic interpretability hypotheses for Measuring Reward-Seeking by Instilling Contrastive Beliefs and additional comments · Burny
This is interesting research! https://alignment.openai.com/measuring-reward-seeking It made me think of few overlapping hypotheses for what might be happening here, how did the grader behavior emerge at the pretraining…
- ai-control
- interpretability-ml-and-ai
- reinforcement-learning
- ai
why score 0.394
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.15 ×0.25 0.038
contributability 0.02 ×0.15 0.003
venue 0.53 ×0.10 0.053
direct 0.00 ×0.20 0.000
tier-2: mechanistic interpretability; 2 matching tag(s)
0.39 · LessWrong · 2w
Persistent Latent Misalignment, a new dimension of misalignment? · Florian_Dietz
A new paper was released at ICML that I'm worried will open an entire new dimension of alignment problems: Latent Collaboration in Multi-Agent Systems (LatentMAS) TLDR: they show that multiagent systems can communicate…
- chain-of-thought-alignment
- deceptive-alignment
- interpretability-ml-and-ai
- language-models-llms
why score 0.390
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.00 ×0.25 0.000
contributability 0.27 ×0.15 0.040
venue 0.50 ×0.10 0.050
direct 0.00 ×0.20 0.000
3 matching tag(s)
0.38 · LessWrong · 1d · new
You don't need error nodes, you need better features · Evan Lloyd
This is a cross-post from my blog. It is a follow-up to the methods I developed in a previous post on replacement-aware training. ---------------------------------------- Summary A replacement model (Ameisen et al.…
- interpretability-ml-and-ai
- sparse-autoencoders-saes
- ai
why score 0.384
signal value weight points
topic 0.50 ×0.30 0.150
liveness 0.70 ×0.25 0.175
contributability 0.02 ×0.15 0.003
venue 0.56 ×0.10 0.056
direct 0.00 ×0.20 0.000
1 matching tag(s)
0.38 · LessWrong · 8d
Is there even a ground-truth for LLMs’ internal representations? · Chunwei Ma
[This is an introductory blog for the paper Laguerre Geometry for Interpreting Large Language Models and the GitHub repository Geometric Lens.] LLM Lens: What does an internal vector mean? Anthropic's recent paper on…
- interpretability-ml-and-ai
- language-models-llms
- ai
why score 0.379
signal value weight points
topic 1.00 ×0.30 0.300
liveness 0.07 ×0.25 0.017
contributability 0.02 ×0.15 0.003
venue 0.60 ×0.10 0.060
direct 0.00 ×0.20 0.000
2 matching tag(s)
