2025 - NeurIPS 2025 Lock-LLM
X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates
Automated framework that discovers and optimizes multi-turn-to-single-turn (M2S) jailbreak templates via LLM-guided evolution, using a StrongREJECT-style judge with a calibrated success threshold to restore selection pressure and reveal structure-level vulnerabilities that transfer (but vary) across frontier models.
TL;DR
Automated framework that discovers and optimizes multi-turn-to-single-turn (M2S) jailbreak templates via LLM-guided evolution, using a StrongREJECT-style judge with a calibrated success threshold to restore selection pressure and reveal structure-level vulnerabilities that transfer (but vary) across frontier models.
Research Note
Overview
Multi-turn-to-single-turn (M2S) prompts compress long red-teaming conversations into a single structured attack. Our earlier M2S work relied on three hand-crafted templates (Hyphenize, Numberize, Pythonize), which raised a natural question:
Instead of hand-writing a few templates, can we search the template space and automatically discover stronger single-turn jailbreak structures in a reproducible way?
X-Teaming Evolutionary M2S answers this question with an LLM-guided evolutionary framework that proposes, executes, and evaluates M2S templates under a calibrated StrongREJECT-style judge.
From manual templates to evolutionary search
Hand-crafted M2S templates have two main limitations:
- They cover only a tiny fraction of the possible design space.
- It is unclear whether template tweaks are actually improving robustness or just overfitting to a specific model/threshold.
To address this, X-Teaming Evolutionary M2S:
- Treats template structure as an explicit search space.
- Uses an evolutionary loop guided by LLM feedback to propose and refine templates.
- Restores selection pressure by calibrating the success threshold of the judge.
The goal is to make template discovery data-driven, auditable, and reproducible.
Problem setup
Formally, a multi-turn adversarial dialogue is written as
where is the user turn and is the model reply.
An M2S template deterministically consolidates the dialogue into a single prompt
by inserting user utterances into placeholders like .
A target model then produces a response
A StrongREJECT-style LLM-as-judge scores the pair on a normalized scale
and we declare the trial a success if .
In X-Teaming Evolutionary M2S, we set a strict threshold to avoid early saturation and maintain room for genuine evolution.
We seed the search with three base families:
- hyphenize
- numberize
- pythonize
and let evolution discover additional template families.
X-Teaming evolution loop
Each evolution run proceeds in generations:
1. Score aggregation
- For each template family, aggregate success rate, mean judge score, and length statistics over the current batch of prompts.
2. Template proposal (generator)
- Use an LLM “generator” to propose new template schemata that
- amplify patterns seen in successful templates, and
- avoid failure modes highlighted by the judge.
3. Schema validation
- Enforce a minimal schema (ID, name, template text, description, placeholder types) and require at least and for variable-length dialogues.
- Reject malformed candidates before any model calls.
4. Selection and next generation
- Keep top-performing families plus a curated subset of new proposals to form the next generation’s candidate set.
- Stop when success rates converge within a narrow variance band or after reaching a generation cap.
In the main study, we run five generations starting from the three base templates and discover two new evolved families.
Key results
1. Evolution under a strict threshold
On GPT-4.1 at :
- Overall success: 44.8% (103 / 230 trials).
- Mean normalized judge score: 0.439.
- The study progresses through 5 generations, starting from three base templates and discovering two new evolved families (Evolved_1, Evolved_2).
Per-template success rates at the same threshold:
- hyphenize: 52.0%
- pythonize: 52.0%
- Evolved_1: 47.5%
- Evolved_2: 37.5%
- numberize: 34.0%
Using (vs. 0.25) reduces raw success but prevents early saturation and preserves selection pressure for meaningful evolution.
2. Cross-model transfer and “immune” models
In the 2,500-trial cross-model panel:
- Structural advantages of evolved templates transfer, but template rankings are target-dependent.
- On GPT-4.1 and Qwen3-235B, evolved templates are competitive or leading.
- On Claude-4-Sonnet, numberize is unexpectedly strong.
- Two targets (GPT-5, Gemini-2.5-Pro) show zero successes at in-sample (suggestive, not a proof of robustness).
Limitations
- Using a fixed single judge (GPT-4.1) and the known length–score coupling can bias evaluations toward verbosity rather than pure harmfulness.
- Cross-model conclusions are based on finite samples and a chosen threshold; “zero success” only means failure under this experimental setup, not proven robustness.
- Responsible disclosure / dual-use constraints and provider policy drift can require redacting or altering artifacts, which may limit full public reproducibility and affect what is retained as “discovered” templates.
Resources
- Paper (arXiv)
X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates
https://arxiv.org/abs/2509.08729
- Code & Artifacts
M2S X-Teaming Evolution Pipeline (GitHub)