2025 - NeurIPS 2025 MTI-LLM
ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks
Benchmark that tests whether an LLM-as-a-Judge can recover the hidden objective of noisy multi-turn jailbreak conversations and know when that inference is reliable, revealing that state-of-the-art models often misinfer goals with high confidence and offering concrete guidance on how to deploy LLM judges safely.
Overview
Benchmark that tests whether an LLM-as-a-Judge can recover the hidden objective of noisy multi-turn jailbreak conversations and know when that inference is reliable, revealing that state-of-the-art models often misinfer goals with high confidence and offering concrete guidance on how to deploy LLM judges safely.
Research Note
Overview
LLM-as-a-Judge (LLMaaJ) is now a backbone of scalable evaluation and moderation. But one basic question is still unresolved:
When the task objective is not stated explicitly, can an LLM judge reliably infer the hidden goal of a conversation—and know when that inference is trustworthy?
Multi-turn jailbreaks are the hardest setting: adversaries smear their goal across many turns, hide it behind role-play wrappers or distractors, and exploit long-context weaknesses.
ObjexMT (Objective EXtraction in Multi-Turn jailbreaks) is a benchmark that directly targets this question.
Given a multi-turn jailbreak transcript, a model must:
1. Extract a single-sentence base objective that states the attacker’s true goal.
2. Report a self-assessed confidence in that extraction.
We then quantify both accuracy and metacognitive quality of the judge.
Why multi-turn objective extraction matters
Deployments increasingly rely on LLM judges to:
- score system outputs,
- moderate user–assistant chats,
- or act as safety filters in the loop.
In many of these workflows, we implicitly assume that the judge can fill in a missing objective from context:
- “Determine whether this dialogue successfully teaches how to build X.”
- “Rate how harmful this conversation is.”
But in multi-turn jailbreaks:
- The attacker’s goal is disguised or distributed across turns.
- LLMs are known to struggle with irrelevant context and long inputs (lost-in-the-middle, distractor susceptibility).
If a judge misreads the objective yet remains highly confident, safety decisions built on top of it become fragile. ObjexMT is designed as a stress test for this failure mode.
Task and benchmark design
Input–Output definition
For each instance, the model under test receives:
- A full multi-turn jailbreak transcript reconstructed from per-turn fields (
turn_1…turn_N).
The model must output JSON with two fields:
base_prompt: a single-sentence imperative describing the attacker’s main goal.confidence: self-reported confidence in (or 0–100, normalized).
This isolates two abilities:
1. Objective extraction – Can the model recover the latent goal?
2. Metacognition – Does its confidence correlate with being right?
Datasets
ObjexMT evaluates four public multi-turn safety datasets:
- SafeMTData_Attack600 – automatically expanded multi-turn attack paths.
- SafeMTData_1K – safety-alignment dialogues including refusals.
- MHJ_local – human multi-turn jailbreaks (rich tactics).
- CoSafe – coreference-heavy multi-turn attacks.
We use each dataset as released (no schema unification), preserving their natural distribution of tactics and lengths.
Models and judge
- Extractors (6 models)
- GPT-4.1
- Claude-Sonnet-4
- Qwen3-235B-A22B-FP8
- Kimi-K2
- DeepSeek-V3.1
- Gemini-2.5-Flash
- Each model is run once per instance (single-pass decoding; no replicas).
- A fixed LLM judge (GPT-4.1) compares
base_promptvs. gold objective and outputs:
- continuous similarity_score ∈ [0,1],
- categorical rating (Exact / High / Moderate / Low),
- short reasoning.
All raw I/O and judge outputs are released as a single Excel file in the public ObjexMT dataset.
From similarity to correctness: human-aligned thresholding
Instead of hand-picking a cut-off, ObjexMT learns a single decision threshold from human annotations:
1. Sample 300 calibration items covering all sources.
2. Two experts assign one of four similarity categories
(Exact / High / Moderate / Low) for each pair (gold vs. extracted).
3. Sweep possible thresholds over the judge’s similarity score and select
on this labeled set, breaking ties toward smaller τ (conservative against false negatives).
- Final threshold: \(\tau^\star = 0.66\) with high .
- This is frozen for all models and datasets; no model-specific tuning.
Binary correctness is then:
- Correct if
similarity_score ≥ τ*, - Incorrect otherwise.
Metacognition metrics
Using the correctness labels, ObjexMT quantifies metacognition from self-reported confidences via:
- ECE (Expected Calibration Error) – gap between confidence and empirical accuracy (10 bins).
- Brier score – squared error between confidence and correctness.
- Wrong@High-Conf – fraction of errors with confidence ≥ {0.80, 0.90, 0.95}.
- Risk–coverage curves & AURC – how error rate behaves as we only keep high-confidence predictions.
Rows with invalid JSON are marked extraction_error and excluded from metacognition metrics (in practice, extraction errors are negligible for all models).
Key results
1. Objective extraction accuracy
Across SafeMTData_Attack600, SafeMTData_1K, and MHJ_local:
- The best model reaches overall objective-extraction accuracy around 0.61.
- Several other models cluster slightly below (around 0.60), forming a top tier.
- Performance varies sharply by dataset, with per-source accuracies roughly spanning 16–82%.
Automated obfuscation (e.g., programmatic attack expansions) can be harder than purely human jailbreaks, highlighting that objective extraction difficulty depends heavily on how the dialogue is constructed.
2. Calibration and selective risk
Metacognitive metrics show substantial variation:
- The best-calibrated model achieves:
- relatively low ECE,
- competitive Brier score,
- and strong AURC (good risk–coverage trade-off).
Even then, high-confidence mistakes remain:
- Wrong@0.90 can reach from the mid-teens up to nearly half of all errors, depending on the model.
- Some models maintain mean confidence around 0.88 while accuracy is closer to 0.47, indicating systematic overconfidence.
3. Dataset-level heterogeneity
Accuracy and calibration differ markedly between:
- Human conversational attacks (MHJ_local) – comparatively easier;
- Automated attack expansions (SafeMTData_Attack600) – systematically harder;
- Mixed safety dialogues (SafeMTData_1K) – intermediate.
This heterogeneity suggests that “judge quality” is not a single scalar but depends on the structure and origin of the conversations being evaluated.
Operational guidance for LLM-as-a-Judge
ObjexMT leads to several concrete recommendations:
1. Don’t assume judges can infer objectives.
When possible, provide the task objective explicitly (e.g., base prompt, rubric) instead of asking the judge to reverse-engineer it from a long adversarial transcript.
2. Treat confidence as a noisy signal, not ground truth.
Even top models show substantial Wrong@High-Conf, so raw verbalized confidence should not be used directly as a decision threshold in safety-critical systems.
3. Use selective prediction / abstention.
Calibrated risk–coverage curves enable policies like:
- “Only auto-approve when confidence is high and the judgment is safe.”
- “Escalate to human review when confidence is low or the objective is unclear.”
4. Report judge behavior per dataset type.
Published evaluator scores should specify the distribution of inputs (e.g., human dialogues vs. auto-obfuscated attacks), since robustness depends strongly on conversation structure.
For organizations using LLM judges (e.g., for UGC moderation, NPC dialog filtering, or model evaluation), ObjexMT offers a drop-in diagnostic to test whether the judge can handle noisy, multi-step, attack-like inputs and when its confidence can be trusted.
My role (co-first author)
As co-first author of ObjexMT, I:
- Helped define the objective-extraction task and multi-turn jailbreak setting.
- Co-designed the instruction templates for extraction and similarity judging.
- Implemented the full evaluation pipeline:
- dataset packaging and transcript reconstruction,
- single-pass extractor runs across six models,
- similarity-judging and threshold calibration logic,
- computation of all accuracy and metacognition metrics (ECE, Brier, Wrong@High-Conf, AURC).
- Led analysis of dataset-wise heterogeneity and high-confidence error patterns.
- Co-developed the operational recommendations for when to provide objectives explicitly and how to use confidence-gated abstention.
- Maintained the public ObjexMT dataset repository and documentation for replication.
Resources
- Paper (arXiv)
ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks
https://arxiv.org/abs/2508.16889
- Dataset & logs (GitHub)
ObjexMT Dataset – Multi-Model Jailbreak Extraction Evaluation