Field Radar.

What’s worth reading right now in AI reward hacking, specification & evaluation gaming, and mechanistic interpretability — an auto-scored radar over public discussion, refreshed a few times a day.

How this list is made

This page is generated, not hand-picked. A few times a day a script checks LessWrong, Hacker News, and a handful of subreddits for posts about reward hacking, specification gaming, evaluation gaming, and mechanistic interpretability, then scores each one on how on-topic it is, how recently the conversation actually moved, and whether there’s still room to get a word in — as opposed to a thread that already has two hundred comments. Higher scores float to the top. Every title links out to the original; I’m pointing at other people’s work, not reproducing it.

The score is a crude weighted sum, and like any harness it pins down what I bothered to measure and silently lets everything else vary. So read this as one opinionated filter, not a survey of the field — it will miss things, and when it surfaces something dull that’s the weights, not the author.
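
A minimal sketch of that weighted sum, for anyone who wants to check the numbers. The weights are the ones printed in the "why score" tables on this page; the names `WEIGHTS`, `radar_score`, and `rank` are hypothetical, and the real script is not shown here, so treat this as an illustration rather than the actual implementation. The tables show rounded values, so a recomputed score can differ from the listed one in the third decimal.

```python
# Weights as printed in the "why score" tables on this page.
WEIGHTS = {
    "topic": 0.30,
    "liveness": 0.25,
    "contributability": 0.15,
    "venue": 0.10,
    "direct": 0.20,
}


def radar_score(signals):
    """Weighted sum of signal values in [0, 1]; a missing signal counts as 0."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)


def rank(items):
    """Return items sorted by score, highest first (hypothetical helper)."""
    return sorted(items, key=lambda item: radar_score(item["signals"]), reverse=True)


# Signal values copied from the first two entries in the list below.
items = [
    {
        "title": "Logits as a new monitor for evaluation awareness",
        "signals": {"topic": 0.50, "liveness": 0.70, "contributability": 0.02,
                    "venue": 0.21, "direct": 1.00},
    },
    {
        "title": "LURE: Alignment Evaluations to Reduce Evaluation Awareness",
        "signals": {"topic": 0.50, "liveness": 0.50, "contributability": 0.27,
                    "venue": 0.58, "direct": 1.00},
    },
]

for item in rank(items):
    print(f"{radar_score(item['signals']):.2f}  {item['title']}")
```

Because the weights sum to 1.0 and every signal lies between 0 and 1, the score also stays between 0 and 1, which is why a perfect topic match alone tops out at 0.30.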

Sources — LessWrong: error: HTTP 429 for https://www.lesswrong.com/graphql · Hacker News: ok (6 stories) · Reddit: ok (64 posts)

As of: 2026-06-05 14:00 ET
Showing: 23 items
New: 1 in last 48h
Refresh: Every 6 hours

  1. 0.57 · LessWrong · 2d
    LURE: Alignment Evaluations to Reduce Evaluation Awareness

    TLDR: Frontier models can detect when they're being evaluated and change their behavior, which risks compromising safety benchmarks. We introduce LURE (Live-Usage Replay Evaluations), a method that constructs alignment…

    why score 0.573
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.50   ×0.25   0.125
    contributability  0.27   ×0.15   0.040
    venue             0.58   ×0.10   0.058
    direct            1.00   ×0.20   0.200

    tier-1: evaluation awareness

  2. 0.55 · Hacker News · 1d · new
    Logits as a new monitor for evaluation awareness
    why score 0.548
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.70   ×0.25   0.175
    contributability  0.02   ×0.15   0.003
    venue             0.21   ×0.10   0.021
    direct            1.00   ×0.20   0.200

    tier-1: evaluation awareness

  3. 0.48 · LessWrong · 5d
    We Should Study the Analogy Between Inoculation Prompting Non-Robustness, Negation Neglect, and Backdoor Non-Robustness

    TL;DR Negation neglect is a recently discovered phenomenon where training on "the following is false: <claim>" makes the model believe that <claim> is true. Inoculation prompting is a method of reducing reward hacking…

    why score 0.478
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.19   ×0.25   0.047
    contributability  0.27   ×0.15   0.040
    venue             0.41   ×0.10   0.041
    direct            1.00   ×0.20   0.200

    tier-1: reward hacking

  4. 0.42 · LessWrong · 6d
    How a failed experiment broke (and fixed) my view on feature labels

    TL;DR In this document, I propose baez, a new feature label generation method that uses NLA explanations instead of activation examples. The codebase can be found here. In the experiment, the labels generated via baez,…

    why score 0.424
    signal            value  weight  points
    topic             1.00   ×0.30   0.300
    liveness          0.12   ×0.25   0.031
    contributability  0.27   ×0.15   0.040
    venue             0.53   ×0.10   0.053
    direct            0.00   ×0.20   0.000

    tier-2: feature, activation; 1 matching tag(s)

  5. 0.42 · LessWrong · 2d
    NLA Thought Anchors

    The following post seeks to look further into why NLA (Natural Language Autoencoders) contains the prediction more often when the original activations led to the correct output than incorrect output. Quick Summary:…

    why score 0.418
    signal            value  weight  points
    topic             0.75   ×0.30   0.225
    liveness          0.43   ×0.25   0.108
    contributability  0.27   ×0.15   0.040
    venue             0.45   ×0.10   0.045
    direct            0.00   ×0.20   0.000

    tier-2: activation; 1 matching tag(s)

  6. 0.40 · LessWrong · 7d
    Developmental Cognitive Interpretability: A Research Agenda for Modelling Generalisation and Predicting Agent Behaviour

    Summary: Safe deployment of an AI system requires that we can make confident claims about its behaviour on out-of-distribution deployment inputs on the basis of only pre-deployment evaluations. One approach to making…

    why score 0.399
    signal            value  weight  points
    topic             1.00   ×0.30   0.300
    liveness          0.09   ×0.25   0.022
    contributability  0.02   ×0.15   0.003
    venue             0.75   ×0.10   0.075
    direct            0.00   ×0.20   0.000

    2 matching tag(s)

  7. 0.38 · Hacker News · 2w
    Systematic Reward Hacking and Prime Sprints
    why score 0.380
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.01   ×0.25   0.001
    contributability  0.02   ×0.15   0.003
    venue             0.26   ×0.10   0.026
    direct            1.00   ×0.20   0.200

    tier-1: reward hacking

  8. 0.36 · LessWrong · 7d
    Claude Opus 4.8 Agents Engage in Exploitation and Psychological Profiling

    TL;DR: Like other models including its predecessor, Opus 4.8 frequently violates provisions of both the EU AI Act and data protection laws when deployed in an agentic simulation where carrying out its task would break…

    why score 0.365
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.09   ×0.25   0.023
    contributability  1.00   ×0.15   0.150
    venue             0.41   ×0.10   0.041
    direct            0.00   ×0.20   0.000

    1 matching tag(s)

  9. 0.34 · LessWrong · 5d
    Retrying vs Resampling in AI Control

    We’ve just released a new paper: Retrying vs Resampling in AI Control. We revisit the resampling protocols introduced in Ctrl-Z with an up-to-date setting and much stronger models, and compare them against “retrying”…

    why score 0.340
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.19   ×0.25   0.047
    contributability  0.42   ×0.15   0.064
    venue             0.80   ×0.10   0.080
    direct            0.00   ×0.20   0.000

    1 matching tag(s)

  10. 0.31 · LessWrong · 2d
    [Linkpost] Prefixing names with 'secure_' makes agents write more secure code

    The graphs are interactive and don't translate well to inline, so the full writeup with figures is in the link. We gave coding agents a three-step synthesis task: build a document management API, then extend it twice.…

    why score 0.311
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.37   ×0.25   0.092
    contributability  0.12   ×0.15   0.017
    venue             0.51   ×0.10   0.051
    direct            0.00   ×0.20   0.000

    1 matching tag(s)

  11. 0.30 · LessWrong · 6d
    AI as Biology's Digital Microscope

    This article is written as part of an ongoing research initiative by the AMIR Lab at Georgia Tech, exploring scientific discovery and mechanistic interpretability for biological AI models. Main results and discussion…

    why score 0.300
    signal            value  weight  points
    topic             0.75   ×0.30   0.225
    liveness          0.11   ×0.25   0.028
    contributability  0.02   ×0.15   0.003
    venue             0.45   ×0.10   0.045
    direct            0.00   ×0.20   0.000

    tier-2: mechanistic interpretability; 1 matching tag(s)

  12. 0.30 · LessWrong · 5d
    Visualize Cyclical Structure in Llama Model

    Summary: Research increasingly shows that various geometric structures emerge in the activation and behavior spaces of large language models. These structures are fascinating to me, and I find it worth exploring what…

    why score 0.295
    signal            value  weight  points
    topic             0.75   ×0.30   0.225
    liveness          0.17   ×0.25   0.041
    contributability  0.02   ×0.15   0.003
    venue             0.26   ×0.10   0.026
    direct            0.00   ×0.20   0.000

    tier-2: activation; 1 matching tag(s)

  13. 0.29 · LessWrong · 3d
    Can LLMs even teach? Exploring the Teacher Axis

    TLDR: As a passionate teacher, it has pained my heart to watch my students lose deeper critical thinking skills and independent reasoning. But attempting to build a constitutionally constrained AI using prompt…

    why score 0.287
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.35   ×0.25   0.086
    contributability  0.02   ×0.15   0.003
    venue             0.48   ×0.10   0.048
    direct            0.00   ×0.20   0.000

    1 matching tag(s)

  14. 0.27 · LessWrong · 2d
    Abstraction Boundaries and Bubbles of Legibility

    This post is crossposted from my Substack, Structure and Guarantees, where I explore how formal verification and related ideas might scale to more complex intelligent systems. Here I present what may seem like an ironic…

    why score 0.265
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.40   ×0.25   0.100
    contributability  0.02   ×0.15   0.003
    venue             0.13   ×0.10   0.013
    direct            0.00   ×0.20   0.000

    1 matching tag(s)

  15. 0.26 · LessWrong · 4d
    Outrunning your headlights

    This is exactly the right place to probe. Gromov-Wasserstein is genuinely dimension free. Partial and semi-relaxed are precisely the mechanisms for the abstention/coverage problem we have. Want me to make a new branch…

    why score 0.264
    signal            value  weight  points
    topic             0.25   ×0.30   0.075
    liveness          0.22   ×0.25   0.055
    contributability  0.42   ×0.15   0.064
    venue             0.70   ×0.10   0.070
    direct            0.00   ×0.20   0.000

    tier-2: probe

  16. 0.26 · LessWrong · 5d
    Why tuning fails: The AI has no self

    Epistemic status: Highly confident in the underlying mechanism. Moderately confident that the current paradigm won't shift without an external forcing function. Phoenix Ikner…

    why score 0.261
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.14   ×0.25   0.035
    contributability  0.27   ×0.15   0.040
    venue             0.37   ×0.10   0.037
    direct            0.00   ×0.20   0.000

    1 matching tag(s)

  17. 0.24 · LessWrong · 5d
    Features of SAEs are universal - but only up to an unknown random rotation

    Features of SAEs are universal - but only up to an unknown random rotation Cross-model decoder-column cosine says that two models learned the same features. Apply the SAE of one model to the activations of another, and…

    why score 0.237
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.17   ×0.25   0.041
    contributability  0.02   ×0.15   0.003
    venue             0.43   ×0.10   0.043
    direct            0.00   ×0.20   0.000

    1 matching tag(s)

  18. 0.23 · LessWrong · 7d
    When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability

    We've found a method that tells you: How functionally similar two neural networks are across ALL inputs, Computed solely from the weights (i.e. no data), Using a principled generalization of cosine similarity. There's…

    why score 0.230
    signal            value  weight  points
    topic             0.25   ×0.30   0.075
    liveness          0.09   ×0.25   0.024
    contributability  0.42   ×0.15   0.064
    venue             0.68   ×0.10   0.068
    direct            0.00   ×0.20   0.000

    tier-2: mechanistic interpretability

  19. 0.23 · r/MachineLearning · 2w
    I built a tool that shows you what GPT-2 is "thinking" in real-time as it generates 3D graph of concept activations per token [R]

    Been going down a mechanistic interpretability rabbit hole for the past few weeks and ended up building this thing called AXON. The idea: every time GPT-2 generates a token, its residual stream gets passed through a…

    why score 0.228
    signal            value  weight  points
    topic             0.75   ×0.30   0.225
    liveness          0.00   ×0.25   0.001
    contributability  0.02   ×0.15   0.003
    venue             0.00   ×0.10   0.000
    direct            0.00   ×0.20   0.000

    tier-2: mechanistic interpretability, sparse autoencoder, feature

  20. 0.21 · LessWrong · 6d
    System Prompts vs. Partner Adaptation in LLMs (or, when LLMs know you're an adult but keep talking like you're seven)

    TL;DR: I find qualitative evidence that frontier LLMs inconsistently balance system prompts and implicitly adapted models of the user. They sometimes detect inconsistencies and adapt to the user; sometimes they stick to…

    why score 0.206
    signal            value  weight  points
    topic             0.50   ×0.30   0.150
    liveness          0.11   ×0.25   0.028
    contributability  0.02   ×0.15   0.003
    venue             0.26   ×0.10   0.026
    direct            0.00   ×0.20   0.000

    1 matching tag(s)

  21. 0.15 · LessWrong · 6d
    Ablating Induction Heads Leads to an increase in Local Repetition

    This post is intended as a brief overview of an independent research project in mechanistic interpretability. I am open to feedback, criticism, and any thoughts on the work. This project started off as an exploration…

    why score 0.147
    signal            value  weight  points
    topic             0.25   ×0.30   0.075
    liveness          0.11   ×0.25   0.028
    contributability  0.02   ×0.15   0.003
    venue             0.41   ×0.10   0.041
    direct            0.00   ×0.20   0.000

    tier-2: mechanistic interpretability

  22. 0.11 · Hacker News · 7d
    Ask HN: Question for Startup Founders on tracking emotions and cognitive signals
    why score 0.108
    signal            value  weight  points
    topic             0.00   ×0.30   0.000
    liveness          0.09   ×0.25   0.023
    contributability  0.42   ×0.15   0.064
    venue             0.21   ×0.10   0.021
    direct            0.00   ×0.20   0.000

  23. 0.11 · Hacker News · 9d
    Chinese GPU maker sells out over 30k GPUs within 48h
    why score 0.106
    signal            value  weight  points
    topic             0.00   ×0.30   0.000
    liveness          0.04   ×0.25   0.009
    contributability  0.42   ×0.15   0.064
    venue             0.34   ×0.10   0.034
    direct            0.00   ×0.20   0.000