Research

I currently help manage the scalable oversight team at Anthropic. Previously, I worked on building model organisms of misalignment, understanding chain-of-thought monitoring, and implementing control evaluations as part of Anthropic’s Alignment Science team. Before that, I was a PhD student in the Department of Statistics at the University of Oxford, where I was supervised by Arnaud Doucet and George Deligiannidis and worked on the theory of diffusion models. I’ve also spent time with the UK Frontier AI Taskforce (now the UK AI Safety Institute) and remain excited about building government capacity to understand and regulate frontier AI systems.

Publications

SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents. Jonathan Kutasov, Yuqi Sun, Paul Colognese, Teun van der Weij, Linda Petrini, Chen Bo Calvin Zhang, John Hughes, Xiang Deng, Henry Sleight, Tyler Tracy, Buck Shlegeris, Joe Benton. arXiv preprint arXiv:2506.15740, 2025.

Reasoning Models Don’t Always Say What They Think. Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan, Ethan Perez. arXiv preprint arXiv:2505.05410, 2025.

Inverse Scaling in Test-Time Compute. Aryo Pradipta Gema, Alexander Hägele, Runjin Chen, Andy Arditi, Jacob Goldman-Wetzler, Kit Fraser-Taliente, Henry Sleight, Linda Petrini, Julian Michael, Beatrice Alex, Pasquale Minervini, Yanda Chen, Joe Benton, Ethan Perez. arXiv preprint arXiv:2507.14417, 2025.

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety. Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Dan Hendrycks, et al. arXiv preprint arXiv:2507.11473, 2025.

Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning. Miles Turpin, Andy Arditi, Meihua Li, Joe Benton, Julian Michael. arXiv preprint arXiv:2506.22777, 2025.

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming. Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, et al. arXiv preprint arXiv:2501.18837, 2025.

Sabotage Evaluations for Frontier Models. Joe Benton, Misha Wagner, Eric Christiansen, Cem Anil, Ethan Perez, Jai Srivastav, Esin Durmus, Deep Ganguli, Shauna Kravec, Buck Shlegeris, et al. arXiv preprint arXiv:2410.21514, 2024.

When Do Universal Image Jailbreaks Transfer Between Vision-Language Models? Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Cristóbal Eyzaguirre, Zane Durante, Joe Benton, Brando Miranda, Henry Sleight, John Hughes, et al. NeurIPS 2024 Workshop on Responsibly Building the Next Generation of Multimodal Foundational Models, 2024.

Many-shot Jailbreaking. Cem Anil, Esin Durmus, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Nina Rimsky, Meg Tong, Jesse Mu, Daniel Ford, et al. Advances in Neural Information Processing Systems, 2024.

From Denoising Diffusions to Denoising Markov Models. Joe Benton, Yuyang Shi, Valentin De Bortoli, George Deligiannidis, Arnaud Doucet. Journal of the Royal Statistical Society Series B: Statistical Methodology, 86(2):286–301, 2024.

Nearly d-Linear Convergence Bounds for Diffusion Models via Stochastic Localization. Joe Benton, Valentin De Bortoli, Arnaud Doucet, George Deligiannidis. International Conference on Learning Representations, 2024.

Error Bounds for Flow Matching Methods. Joe Benton, George Deligiannidis, Arnaud Doucet. Transactions on Machine Learning Research, February 2024.

Alpha-divergence Variational Inference Meets Importance Weighted Auto-Encoders: Methodology and Asymptotics. Kamélia Daudel, Joe Benton*, Yuyang Shi*, Arnaud Doucet. Journal of Machine Learning Research, 24(243):1–83, 2023.

Measuring Feature Sparsity in Language Models. Mingyang Deng, Lucas Tao, Joe Benton. NeurIPS 2023 Workshop on Socially Responsible Language Modelling Research, 2023.

A Continuous Time Framework for Discrete Denoising Models. Andrew Campbell, Joe Benton, Valentin De Bortoli, Tom Rainforth, George Deligiannidis, Arnaud Doucet. Advances in Neural Information Processing Systems, 2022.

Polysemanticity and Capacity in Neural Networks. Adam Scherlis, Kshitij Sachan, Adam S. Jermyn, Joe Benton, Buck Shlegeris. arXiv preprint arXiv:2210.01892, 2022.