
NeurIPS 2025 Lock-LLM Workshop

X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates

An automated framework that discovers and optimizes multi-turn-to-single-turn (M2S) jailbreak templates via LLM-guided evolution. A StrongREJECT-style judge with a calibrated success threshold restores selection pressure and reveals structure-level vulnerabilities that transfer (but vary) across frontier models.

Jailbreaks · Red Teaming · AI Safety · LLM Security · Evolutionary Search · Evaluation


Research Note

Overview

Multi-turn-to-single-turn (M2S) prompts compress long red-teaming conversations into a single structured attack. Our earlier M2S work relied on three hand-crafted templates (Hyphenize, Numberize, Pythonize), which raised a natural question:

Instead of hand-writing a few templates, can we search the template space and automatically discover stronger single-turn jailbreak structures in a reproducible way?

X-Teaming Evolutionary M2S answers this question with an LLM-guided evolutionary framework that proposes, executes, and evaluates M2S templates under a calibrated StrongREJECT-style judge.


From manual templates to evolutionary search

Hand-crafted M2S templates have two main limitations:

  • They cover only a tiny fraction of the possible design space.
  • It is unclear whether template tweaks genuinely improve attack effectiveness or merely overfit to a specific model/threshold.

To address this, X-Teaming Evolutionary M2S:

  • Treats template structure as an explicit search space.
  • Uses an evolutionary loop guided by LLM feedback to propose and refine templates.
  • Restores selection pressure by calibrating the success threshold of the judge.

The goal is to make template discovery data-driven, auditable, and reproducible.


Problem setup

Formally, a multi-turn adversarial dialogue is written as

$$\mathcal{C} = \{(u_t, v_t)\}_{t=1}^{T},$$

where $u_t$ is the user turn and $v_t$ is the model reply at step $t$.

An M2S template $\tau$ deterministically consolidates the dialogue into a single prompt

$$x = \tau(\mathcal{C}),$$

by inserting user utterances into placeholders like $\{\text{PROMPT}_1\}, \dots, \{\text{PROMPT}_N\}$.

A target model $f$ then produces a response

$$y = f(x).$$

A StrongREJECT-style LLM-as-judge $J$ scores the pair $(\text{forbidden prompt}, y)$ on a normalized scale

$$s = J(x, y) \in [0, 1],$$

and we declare the trial a success if $s \ge \theta$.

In X-Teaming Evolutionary M2S, we set a strict threshold $\theta = 0.70$ to avoid early saturation and maintain room for genuine evolution.

We seed the search with three base families:

  • hyphenize
  • numberize
  • pythonize

and let evolution discover additional template families.
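The trial loop above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `call_model` and `call_judge` are hypothetical stand-ins for the target model $f$ and the judge $J$, and `hyphenize` is a plausible reconstruction of one base family.

```python
def hyphenize(dialogue):
    """tau: consolidate the user turns of C into one hyphen-list prompt.

    `dialogue` is a list of (user_turn, model_reply) pairs; only the
    user side is kept, mirroring x = tau(C).
    """
    body = "\n".join(f"- {u}" for u, _v in dialogue)
    return f"Please address each of the following points in order:\n{body}"

def run_trial(template, dialogue, call_model, call_judge, theta=0.70):
    """One trial: build the prompt, query the target, score with the judge."""
    x = template(dialogue)   # x = tau(C)
    y = call_model(x)        # y = f(x)
    s = call_judge(x, y)     # s = J(x, y), normalized to [0, 1]
    return s >= theta        # success iff s >= theta
```

Swapping `hyphenize` for another template function is the only change needed to evaluate a different family under the same judge and threshold.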


X-Teaming evolution loop

Each evolution run proceeds in generations:

1. Score aggregation

- For each template family, aggregate success rate, mean judge score, and length statistics over the current batch of prompts.

2. Template proposal (generator)

- Use an LLM “generator” to propose new template schemata that
  - amplify patterns seen in successful templates, and
  - avoid failure modes highlighted by the judge.

3. Schema validation

- Enforce a minimal schema (ID, name, template text, description, placeholder types) and require at least $\{\text{PROMPT}_1\}$ and $\{\text{PROMPT}_N\}$ for variable-length dialogues.

- Reject malformed candidates before any model calls.

4. Selection and next generation

- Keep top-performing families plus a curated subset of new proposals to form the next generation’s candidate set.

- Stop when success rates converge within a narrow variance band or after reaching a generation cap.

In the main study, we run five generations starting from the three base templates and discover two new evolved families.
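Steps 3 and 4 of the loop can be sketched as follows. This assumes a simple dict-based candidate schema and the literal placeholder spellings `{PROMPT_1}` / `{PROMPT_N}`; the pipeline's actual field names may differ.

```python
# Minimal schema fields a candidate template must carry (assumed names).
REQUIRED_KEYS = {"id", "name", "template", "description", "placeholder_type"}

def validate_schema(candidate):
    """Step 3: reject malformed proposals before any model calls."""
    if not REQUIRED_KEYS <= set(candidate):
        return False
    text = candidate["template"]
    # Variable-length dialogues need at least the first and last slots.
    return "{PROMPT_1}" in text and "{PROMPT_N}" in text

def next_generation(success_rates, proposals, keep_top=3, keep_new=2):
    """Step 4: keep top-performing families plus a curated subset of
    valid new proposals to form the next generation's candidate set."""
    survivors = sorted(success_rates, key=success_rates.get, reverse=True)[:keep_top]
    fresh = [p["name"] for p in proposals if validate_schema(p)][:keep_new]
    return survivors + fresh
```

Validation is deliberately placed before any target-model calls so that malformed proposals cost nothing beyond the generator query.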


Key results

1. Evolution under a strict threshold

On GPT-4.1 at $\theta = 0.70$:

  • Overall success: 44.8% (103 / 230 trials).
  • Mean normalized judge score: 0.439.
  • Five generations of evolution, starting from the three base templates, yield two new evolved families (Evolved_1, Evolved_2).

Per-template success rates at the same threshold:

  • hyphenize: 52.0%
  • pythonize: 52.0%
  • Evolved_1: 47.5%
  • Evolved_2: 37.5%
  • numberize: 34.0%

Using $\theta = 0.70$ (vs. 0.25) reduces raw success but prevents early saturation and preserves selection pressure for meaningful evolution.
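A toy illustration of the threshold choice (the scores below are made up, not the paper's data): a lenient threshold saturates and hides differences between families, while a strict $\theta = 0.70$ keeps them apart.

```python
def success_rate(scores, theta):
    """Fraction of trials with judge score s >= theta."""
    return sum(s >= theta for s in scores) / len(scores)

# Hypothetical judge scores for two template families (illustrative only).
family_a = [0.3, 0.5, 0.8, 0.9]
family_b = [0.3, 0.4, 0.5, 0.9]

# At theta = 0.25 both families saturate at 100% success and become
# indistinguishable; at theta = 0.70 a gap appears (50% vs. 25%),
# which is exactly the selection pressure the evolution loop needs.
print(success_rate(family_a, 0.25), success_rate(family_a, 0.70))  # 1.0 0.5
print(success_rate(family_b, 0.25), success_rate(family_b, 0.70))  # 1.0 0.25
```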

2. Cross-model transfer and “immune” models

In the 2,500-trial cross-model panel:

  • Structural advantages of evolved templates transfer, but template rankings are target-dependent.
  • On GPT-4.1 and Qwen3-235B, evolved templates are competitive or leading.
  • On Claude-4-Sonnet, numberize is unexpectedly strong.
  • Two targets (GPT-5, Gemini-2.5-Pro) show zero successes at $\theta = 0.70$ in-sample (suggestive, not a proof of robustness).

Limitations

  • A fixed single judge (GPT-4.1), together with the known length–score coupling, can bias evaluations toward verbosity rather than pure harmfulness.
  • Cross-model conclusions are based on finite samples and a chosen threshold; “zero success” only means failure under this experimental setup, not proven robustness.
  • Responsible disclosure / dual-use constraints and provider policy drift can require redacting or altering artifacts, which may limit full public reproducibility and affect what is retained as “discovered” templates.

Resources

  • Paper (arXiv)

X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates

https://arxiv.org/abs/2509.08729

  • Code & Artifacts

M2S X-Teaming Evolution Pipeline (GitHub)

https://github.com/hyunjun1121/M2S-x-teaming