Detecting misbehavior in frontier reasoning models
Frontier reasoning models exploit loopholes when given the chance. We show we can detect exploits using an LLM to monitor their chains-of-thought. Penalizing their “bad thoughts” doesn’t stop the majority of misbehavior—it makes them hide their intent.
What's Your Reaction?
Like
0
Dislike
0
Love
0
Funny
0
Wow
0
Sad
0
Angry
0
Comments (0)