Research
I’m currently a member of technical staff on the Alignment Science team at Anthropic, where I think about how we might make safety cases for very powerful ML systems and evaluate their alignment-relevant properties. Previously, I was a PhD student in the Department of Statistics at the University of Oxford, where I was supervised by Arnaud Doucet and George Deligiannidis and worked on the theory of diffusion models. I’ve also spent time with the UK Frontier AI Taskforce (now the UK AI Safety Institute) and remain excited about building government capacity to understand and regulate frontier AI systems.
Publications
Many-shot Jailbreaking. Cem Anil, Esin Durmus, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Nina Rimsky, Meg Tong, Jesse Mu, Daniel Ford et al. Advances in Neural Information Processing Systems, 2024.
When Do Universal Image Jailbreaks Transfer Between Vision-Language Models? Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Cristóbal Eyzaguirre, Zane Durante, Joe Benton, Brando Miranda, Henry Sleight, John Hughes et al. arXiv preprint arXiv:2407.15211, 2024.
From Denoising Diffusions to Denoising Markov Models. Joe Benton, Yuyang Shi, Valentin De Bortoli, George Deligiannidis, Arnaud Doucet. Journal of the Royal Statistical Society Series B: Statistical Methodology, 86(2):286–301, 2024.
Nearly d-Linear Convergence Bounds for Diffusion Models via Stochastic Localization. Joe Benton, Valentin De Bortoli, Arnaud Doucet, George Deligiannidis. International Conference on Learning Representations, 2024.
Error Bounds for Flow Matching Methods. Joe Benton, George Deligiannidis, Arnaud Doucet. Transactions on Machine Learning Research, February 2024.
Alpha-divergence Variational Inference Meets Importance Weighted Auto-Encoders: Methodology and Asymptotics. Kamélia Daudel, Joe Benton*, Yuyang Shi*, Arnaud Doucet. Journal of Machine Learning Research, 24(243):1–83, 2023.
Measuring Feature Sparsity in Language Models. Mingyang Deng, Lucas Tao, Joe Benton. NeurIPS 2023 Workshop on Socially Responsible Language Modelling Research, 2023.
A Continuous Time Framework for Discrete Denoising Models. Andrew Campbell, Joe Benton, Valentin De Bortoli, Tom Rainforth, George Deligiannidis, Arnaud Doucet. Advances in Neural Information Processing Systems, 2022.
Polysemanticity and Capacity in Neural Networks. Adam Scherlis, Kshitij Sachan, Adam S. Jermyn, Joe Benton, Buck Shlegeris. arXiv preprint arXiv:2210.01892, 2022.