Steering Without Side Effects: Improving Post-Deployment Control of Language Models
Stickland, Asa Cooper; Lyzhov, Alexander; Pfau, Jacob; Mahdi, Salsabila; Bowman, Samuel R
Year of Publication 20.06.2024
Journal Article
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
Berglund, Lukas; Tong, Meg; Kaufmann, Max; Balesni, Mikita; Stickland, Asa Cooper; Korbak, Tomasz; Evans, Owain
Year of Publication 21.09.2023
Journal Article
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
Rein, David; Hou, Betty Li; Stickland, Asa Cooper; Petty, Jackson; Pang, Richard Yuanzhe; Dirani, Julien; Michael, Julian; Bowman, Samuel R
Year of Publication 20.11.2023
Journal Article
Taken out of context: On measuring situational awareness in LLMs
Berglund, Lukas; Stickland, Asa Cooper; Balesni, Mikita; Kaufmann, Max; Tong, Meg; Korbak, Tomasz; Kokotajlo, Daniel; Evans, Owain
Year of Publication 01.09.2023
Journal Article
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
Sheshadri, Abhay; Ewart, Aidan; Guo, Phillip; Lynch, Aengus; Wu, Cindy; Hebbar, Vivek; Sleight, Henry; Stickland, Asa Cooper; Perez, Ethan; Hadfield-Menell, Dylan; Casper, Stephen
Year of Publication 22.07.2024
Journal Article
Robustification of Multilingual Language Models to Real-world Noise in Crosslingual Zero-shot Settings with Robust Contrastive Pretraining
Stickland, Asa Cooper; Sengupta, Sailik; Krone, Jason; Mansour, Saab; He, He
Year of Publication 10.10.2022
Journal Article