RESEARCH PAPERS

Browse recent AI safety research. Click any title to view the detailed page.

4 SELECTED PROJECTS

2025 - ACL 2025 Main Conference

M2S: Multi-turn to Single-turn jailbreak in Red Teaming for LLMs

Rule-based framework that converts high-cost multi-turn human jailbreak conversations into single-turn prompts (Hyphenize, Numberize, Pythonize). It achieves an attack success rate (ASR) of 70.6 ± 5.9%, up to 17.5 points higher than the original multi-turn attacks, while cutting token usage by about 60%. The results reveal contextual blindness in LLM guardrails and input-output safeguards.

Jailbreaks, Red Teaming, AI Safety, LLM Security, Alignment

2025 - Center for AI Safety & Scale AI

Humanity's Last Exam

Co-author on Humanity's Last Exam (HLE), a frontier multi-modal benchmark of 2,500 closed-ended academic questions designed to remain challenging for state-of-the-art LLMs across mathematics, the natural sciences, and the humanities. Contributed graduate-level math problems to the benchmark's question pool.

Benchmark, Evaluation, Reasoning, LLM Evaluation

2025 - NeurIPS 2025 Lock-LLM

X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates

Automated framework that discovers and optimizes multi-turn-to-single-turn (M2S) jailbreak templates via LLM-guided evolution, using a StrongREJECT-style judge with a calibrated success threshold to restore selection pressure and reveal structure-level vulnerabilities that transfer (but vary) across frontier models.

Jailbreaks, Red Teaming, AI Safety, LLM Security, Evolutionary Search, Evaluation

2025 - NeurIPS 2025 MTI-LLM

ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks

Benchmark that tests whether an LLM-as-a-Judge can recover the hidden objective of noisy multi-turn jailbreak conversations and know when that inference is reliable, revealing that state-of-the-art models often misinfer goals with high confidence and offering concrete guidance on how to deploy LLM judges safely.

LLM-as-a-Judge, Jailbreaks, Calibration, Metacognition, AI Safety, Evaluation