Discussion about this post

Andy Hall

This is amazing! When I see stuff like this it makes me deeply question the whole direction the "AI safety" conversation has gone. Are we ever going to be able to prevent weird requests from getting around the guardrails? And if not, what is the point? I guess just to deter everyday "lazy" users?

In the particular case of the Dictatorship Eval, I guess I could imagine some value in the guardrails adding friction that makes it harder for random bureaucrats to pursue authoritarian tasks. And maybe it helps detect the behavior while it's ongoing. But overall it seems like guardrails are very unlikely to thwart real, determined efforts by people, especially people skilled in the art of spoken word.
