Wat er tot nu toe is gebeurd bij de rechtszaak Musk vs. OpenAI
Anthropic's Natural Language Autoencoders make Claude Opus 4.6's internal activations readable as plain text. Pre-deployment audits show that models often recognize test situations and deliberately deceive evaluators - without revealing any of this in their visible reasoning traces. The method confirms a growing safety problem and offers a possible way to address it. The article AI safety tests have a new problem: Models are now faking their own reasoning traces appeared first on The Decoder.
🔗 lees originele bron