ASTPrompter: Preference-Aligned Automated Language Model Red-Teaming to Generate Low-Perplexity Unsafe Prompts
Amelia Hardy*, Houjun Liu*, Allie Griffith, Bernard Lange, Duncan Eddy, Mykel J Kochenderfer
We apply the Adaptive Stress Testing (AST) framework to language model red-teaming to identify prompts that are both effective for red-teaming and low in perplexity, that is, likely to arise during ordinary autoregressive generation.
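To make the "likely to occur" criterion concrete, here is a minimal sketch of how perplexity is conventionally computed from per-token log-probabilities. It is not code from the paper: the function name `perplexity` and the example log-probabilities are assumptions for illustration only. In practice the log-probabilities would come from a language model scoring the prompt.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from per-token natural-log probabilities.

    Standard definition: exp of the negative mean log-probability.
    Lower values mean the sequence is more likely under the model.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token probabilities for two candidate prompts
# (made-up numbers, not results from the paper).
fluent_prompt = [math.log(0.5), math.log(0.25), math.log(0.5)]
unlikely_prompt = [math.log(0.01), math.log(0.02), math.log(0.01)]

# The fluent prompt should score the lower perplexity of the two.
print(perplexity(fluent_prompt) < perplexity(unlikely_prompt))
```

Under this definition, a red-teaming prompt with low perplexity is one the model could plausibly produce or encounter on its own, which is the property the summary above pairs with red-teaming effectiveness.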
