Research

I’m currently a member of technical staff on the Alignment Science team at Anthropic, where I think about how we might make safety cases for very powerful ML systems and evaluate their alignment-relevant properties. Previously, I was a PhD student in the Department of Statistics at the University of Oxford, where I was supervised by Arnaud Doucet and George Deligiannidis and worked on the theory of diffusion models. I’ve also spent time with the UK Frontier AI Taskforce (now the UK AI Safety Institute) and remain excited about building government capacity to understand and regulate frontier AI systems.

Projects

I’m currently looking for potential collaborators on two projects:

If you’re highly motivated to help make transformative AI go well, have a background in empirical ML research, AI control, or dataset curation, and want to work on these projects, reach out to me at joe [at] anthropic [dot] com with a description of your interests, or consider applying to the Anthropic Safety Fellowship.

Publications

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming. Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke et al. arXiv preprint arXiv:2501.18837, 2025.

Sabotage Evaluations for Frontier Models. Joe Benton, Misha Wagner, Eric Christiansen, Cem Anil, Ethan Perez, Jai Srivastav, Esin Durmus, Deep Ganguli, Shauna Kravec, Buck Shlegeris et al. arXiv preprint arXiv:2410.21514, 2024.

When Do Universal Image Jailbreaks Transfer Between Vision-Language Models? Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Cristóbal Eyzaguirre, Zane Durante, Joe Benton, Brando Miranda, Henry Sleight, John Hughes et al. NeurIPS 2024 Workshop on Responsibly Building the Next Generation of Multimodal Foundational Models, 2024.

Many-shot Jailbreaking. Cem Anil, Esin Durmus, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Nina Rimsky, Meg Tong, Jesse Mu, Daniel Ford et al. Advances in Neural Information Processing Systems, 2024.

From Denoising Diffusions to Denoising Markov Models. Joe Benton, Yuyang Shi, Valentin De Bortoli, George Deligiannidis, Arnaud Doucet. Journal of the Royal Statistical Society Series B: Statistical Methodology, 86(2):286–301, 2024.

Nearly d-Linear Convergence Bounds for Diffusion Models via Stochastic Localization. Joe Benton, Valentin De Bortoli, Arnaud Doucet, George Deligiannidis. International Conference on Learning Representations, 2024.

Error Bounds for Flow Matching Methods. Joe Benton, George Deligiannidis, Arnaud Doucet. Transactions on Machine Learning Research, February 2024.

Alpha-divergence Variational Inference Meets Importance Weighted Auto-Encoders: Methodology and Asymptotics. Kamélia Daudel, Joe Benton*, Yuyang Shi*, Arnaud Doucet. Journal of Machine Learning Research, 24(243):1–83, 2023.

Measuring Feature Sparsity in Language Models. Mingyang Deng, Lucas Tao, Joe Benton. NeurIPS 2023 Workshop on Socially Responsible Language Modelling Research, 2023.

A Continuous Time Framework for Discrete Denoising Models. Andrew Campbell, Joe Benton, Valentin De Bortoli, Tom Rainforth, George Deligiannidis, Arnaud Doucet. Advances in Neural Information Processing Systems, 2022.

Polysemanticity and Capacity in Neural Networks. Adam Scherlis, Kshitij Sachan, Adam S. Jermyn, Joe Benton, Buck Shlegeris. arXiv preprint arXiv:2210.01892, 2022.