
AUDIENCE: Security engineering leads, AppSec leads, SOC leads, and platform teams evaluating AI for vulnerability work, triage, patch review, or security reasoning.

PROMISE: In 1–2 short test cycles, you will have a per-task scorecard, a validation log, and a clear decision on where smaller models work, where they fail, and what your surrounding system must enforce.

A Jagged Frontier Evaluation Kit for Security AI

A repeatable way to test models on your real security tasks and choose the workflow that produces validated results.

Use this when you must choose between a “bigger model” and a “better workflow”. Run a small, auditable evaluation. Keep evidence you can defend later.

## Why this matters (and what “jagged” means)

It is tempting to pick an AI model by size, price, or a single benchmark rank. For security work, that can be the wrong shortcut.

AISLE reports that it tested vulnerabilities highlighted in the Mythos announcement and found that smaller open-weights models could detect or reconstruct key issues, and that model rankings reshuffled from task to task. (Source: AI Cybersecurity After Mythos: The Jagged Frontier - AISLE https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jagged-frontier)

That uneven, task-dependent capability profile is what “jagged” means here. If rankings move by task, you cannot assume “best overall” equals “best for your workload”. The safer approach is to treat the model as one part of a system: what context you feed it, what you verify, and what you can prove downstream.

Frontier results are uneven too. The UK AI Security Institute reports that Claude Mythos Preview solved a cyber task end to end in 3 of 10 attempts, and completed an average of 22 of 32 steps across attempts. (Source: Our evaluation of Claude Mythos Preview's cyber capabilities https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities)

## Fast signals you need a per-task evaluation (not a single model choice)

- You do more than one kind of security work (triage, exploit reasoning, patch review, OWASP-style analysis).
- You cannot explain how a model output became a confirmed finding.
- False positives create busywork for your team or for maintainers.
- You only have aggregate scores, not per-task transcripts and validation results.
- Your “success” is downstream (accepted patch, confirmed report), not a nice-looking answer.
- You need to compare cost per validated finding, not cost per token.
- You suspect a smaller model might be “good enough” if your system is tight.

## The 7-step jagged frontier test (run this before you buy or standardize)

[01] Pick 4–6 tasks that match your real security workload

[02] Build a harness that logs prompts, outputs, tool calls, and validation decisions

[03] Define per-task scoring: validated rate, false positives, time-to-triage, patch outcome

[04] Run one small open model and one frontier model with identical context and rules

[05] Add confirmation: map claims to real code and a reproducible proof path

[06] Record downstream outcomes: maintainer response, accepted patch, confirmed issue

[07] Document where rankings flip, then route work by task (not hype)
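
Steps 02 and 04 can be sketched as a small harness. This is a minimal illustration under stated assumptions, not a reference implementation: `RunRecord`, `run_matrix`, `to_jsonl`, the task IDs, and the stub model callables are all hypothetical names. A real harness would call your actual model clients and populate `tool_calls` from them.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Callable, Dict, List


@dataclass
class RunRecord:
    """One model run on one task, with what is needed to audit it later."""
    task_id: str
    model: str
    prompt: str
    output: str
    tool_calls: List[str] = field(default_factory=list)
    validation: str = "unreviewed"  # set later: confirmed / partial / rejected


def run_matrix(tasks: Dict[str, str],
               models: Dict[str, Callable[[str], str]]) -> List[RunRecord]:
    """Send the identical prompt for each task to every model."""
    records = []
    for task_id, prompt in tasks.items():
        for model_name, call_model in models.items():
            records.append(
                RunRecord(task_id, model_name, prompt, call_model(prompt))
            )
    return records


def to_jsonl(records: List[RunRecord]) -> str:
    """Serialize records as one JSON object per line for an append-only log."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


# Stub callables stand in for real model clients (assumption for illustration).
tasks = {
    "triage-01": "Is this crash report exploitable? ...",
    "patch-review-01": "Does this diff fix the bug? ...",
}
models = {
    "small-open": lambda prompt: "stub answer A",
    "frontier": lambda prompt: "stub answer B",
}
records = run_matrix(tasks, models)
print(to_jsonl(records))
```

Because every model receives the same prompt string for a given task, any difference in the log comes from the model, not from the framing.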

## What to measure and what to save (so your results are defensible)

Group 1: Evidence to collect for every run

-> Model-by-model transcripts with the exact prompt and output for each task
-> Logs or screenshots from your validation pipeline marking confirmed, rejected, or partial findings
-> Per-task score sheets (avoid a single combined score)
-> Audit notes showing what code snippets, traces, or writeups were provided as input
-> Cost notes tied to validated outcomes (not just token counts)
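
The last item, cost tied to validated outcomes, comes down to one division. The sketch below assumes you track total spend per model for the evaluation, and ideally include reviewer time in that total. The function name and all numbers are illustrative, not measurements.

```python
from typing import Optional


def cost_per_validated_finding(total_cost: float,
                               validated: int) -> Optional[float]:
    """Return spend divided by confirmed findings, or None if none confirmed."""
    if validated < 0:
        raise ValueError("validated must be non-negative")
    if validated == 0:
        return None  # no confirmed findings: the ratio is undefined, not zero
    return total_cost / validated


# Illustrative numbers only.
runs = {
    "model-a": {"total_cost": 40.0, "validated": 8},
    "model-b": {"total_cost": 10.0, "validated": 1},
}
for name, run in runs.items():
    print(name, cost_per_validated_finding(run["total_cost"], run["validated"]))
```

In this made-up example, model-b costs less in total but twice as much per validated finding (10.0 versus 5.0). Token price alone would hide that.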

Group 2: Per-task metrics (keep it simple)

-> Validated finding rate (confirmed by tool or human)
-> False-positive rate (rejected after checking)
-> Partial chain rate (some steps right, proof path not complete)
-> Time-to-triage (from output to decision)
-> Downstream outcome (accepted patch, confirmed issue, or no action)
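
These metrics can be computed directly from a validation log. The sketch below assumes each log entry is a dict with `task`, `model`, `status` (`confirmed`, `partial`, or `rejected`), and `triage_minutes`. Those field names are hypothetical. Each rate here uses all reviewed findings for that task and model as the denominator, which is one reasonable definition among several.

```python
from collections import Counter, defaultdict
from statistics import median


def score_by_task(log):
    """Group log entries by (task, model) and compute per-task rates."""
    grouped = defaultdict(list)
    for entry in log:
        grouped[(entry["task"], entry["model"])].append(entry)

    scores = {}
    for key, entries in grouped.items():
        counts = Counter(e["status"] for e in entries)
        n = len(entries)
        scores[key] = {
            "runs": n,
            "validated_rate": counts["confirmed"] / n,
            "false_positive_rate": counts["rejected"] / n,
            "partial_rate": counts["partial"] / n,
            "median_triage_minutes": median(e["triage_minutes"] for e in entries),
        }
    return scores


# Illustrative log entries, not real results.
log = [
    {"task": "triage", "model": "small-open", "status": "confirmed", "triage_minutes": 10},
    {"task": "triage", "model": "small-open", "status": "confirmed", "triage_minutes": 20},
    {"task": "triage", "model": "small-open", "status": "rejected", "triage_minutes": 30},
    {"task": "triage", "model": "small-open", "status": "partial", "triage_minutes": 40},
]
print(score_by_task(log)[("triage", "small-open")])
```

Keeping the key as `(task, model)` avoids collapsing results into a single combined score.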

Group 3: Decision questions for the review meeting

-> Which tasks do we actually need: triage, exploit reasoning, patch suggestion, report drafting?
-> Do we need discovery volume, or validated findings that survive review?
-> Can we tolerate higher false positives if confirmation is cheap and fast?
-> Will we benchmark inside our own validation stack, or rely on vendor demos?
-> What counts as success here: confirmed report, accepted patch, or reduced triage time?

## The “system moat” blueprint (what must exist around the model)

Use this as an audit map. If a box is missing, your evaluation results will be hard to trust and hard to repeat.

1) Task framing and context control

├─ Use narrow task definitions (one goal per run)
├─ Control the input context (exact code region, versions, assumptions)
└─ Keep prompts stable across models to make comparisons fair

2) Validation pipeline (non-negotiable)

├─ Treat outputs as hypotheses until verified
├─ Force a check that maps the claim to real code and a reproducible proof path
└─ Separate true positives, partial chains, and false positives in your logs
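
A minimal sketch of that validation gate follows. It assumes the model's claim names a file, a line, and the code it believes is there, and that a separate reproduction step reports pass or fail. `Claim`, `Status`, and `classify` are hypothetical names. Treating "code location is real but the proof failed" as a partial chain is an assumption you may want to tighten.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Status(Enum):
    HYPOTHESIS = "hypothesis"  # default for any unverified model output
    CONFIRMED = "confirmed"
    PARTIAL = "partial"
    REJECTED = "rejected"


@dataclass
class Claim:
    file: str
    line: int                            # 1-based line the model points at
    quoted_code: str                     # code the model says is on that line
    repro_passed: Optional[bool] = None  # None means the proof was not run yet


def classify(claim: Claim, sources: Dict[str, List[str]]) -> Status:
    """Map a claim to real code first, then consult the proof result."""
    lines = sources.get(claim.file)
    if lines is None or not 1 <= claim.line <= len(lines):
        return Status.REJECTED  # file or line does not exist
    quoted = claim.quoted_code.strip()
    if not quoted or quoted not in lines[claim.line - 1]:
        return Status.REJECTED  # quoted code is not on the cited line
    if claim.repro_passed is None:
        return Status.HYPOTHESIS  # plausible, but still unverified
    return Status.CONFIRMED if claim.repro_passed else Status.PARTIAL


# Illustrative source snippet and claim.
sources = {"parse.c": ["int n = read_len(buf);", "memcpy(dst, buf, n);"]}
claim = Claim("parse.c", 2, "memcpy(dst, buf, n)")
print(classify(claim, sources).value)
```

Nothing reaches `CONFIRMED` without both a real code location and a passing reproduction, so a confident-sounding answer alone never counts as a finding.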

3) Human-in-the-loop points

├─ Define where a human must step in (uncertainty, exploit steps, remediation claims)
├─ Use reviewers to confirm or reject, not to rewrite everything
└─ Capture reviewer decisions as structured notes, not chat fragments
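
One way to keep reviewer decisions structured is a small record type that refuses free-form verdicts. The sketch below is an assumption about what such a note could contain. `ReviewNote`, its fields, and the allowed decision values are hypothetical and should be adapted to your own triage vocabulary.

```python
import json
from dataclasses import asdict, dataclass

ALLOWED_DECISIONS = {"confirmed", "partial", "rejected"}


@dataclass(frozen=True)
class ReviewNote:
    finding_id: str
    reviewer: str
    decision: str  # must be one of ALLOWED_DECISIONS
    reason: str    # short justification, required

    def __post_init__(self):
        if self.decision not in ALLOWED_DECISIONS:
            raise ValueError(f"unknown decision: {self.decision!r}")
        if not self.reason.strip():
            raise ValueError("a reason is required for every decision")

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


# Illustrative note.
note = ReviewNote(
    finding_id="triage-01/small-open/3",
    reviewer="alice",
    decision="rejected",
    reason="Length is bounds-checked two lines earlier.",
)
print(note.to_json())
```

A fixed decision vocabulary means the notes can be counted and compared across models without re-reading chat threads.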

4) Downstream impact tracking

├─ Track what happened after triage (issue filed, patch proposed, patch accepted)
├─ Prefer maintainer-facing quality signals over raw counts
└─ Use outcomes to route tasks to the best model for that task next time
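
Routing by outcome can be as simple as picking, per task, the model with the best recorded results. The sketch below assumes you already have per-task scores keyed by `(task, model)` with a `validated_rate` and a `false_positive_rate`. The function name, the tie-break rule, and the numbers are illustrative assumptions.

```python
def route_by_task(scores):
    """Pick one model per task: highest validated rate, then fewest false positives."""
    best = {}
    for (task, model), s in scores.items():
        rank = (s["validated_rate"], -s["false_positive_rate"])
        if task not in best or rank > best[task][0]:
            best[task] = (rank, model)
    return {task: model for task, (_, model) in best.items()}


# Illustrative scores, not real results.
scores = {
    ("triage", "small-open"): {"validated_rate": 0.6, "false_positive_rate": 0.2},
    ("triage", "frontier"): {"validated_rate": 0.6, "false_positive_rate": 0.3},
    ("exploit-reasoning", "small-open"): {"validated_rate": 0.1, "false_positive_rate": 0.5},
    ("exploit-reasoning", "frontier"): {"validated_rate": 0.4, "false_positive_rate": 0.3},
}
print(route_by_task(scores))
```

In this made-up data the ranking flips by task: the smaller model wins triage on the tie-break, while the frontier model wins exploit reasoning.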

## FAQ (to avoid the usual traps)

Does this mean smaller models are always better for security?

No. Use this as a decision lens. Performance can flip by task, so you should test your tasks and keep evidence. AISLE reports that rankings changed across tasks. (Source: AI Cybersecurity After Mythos: The Jagged Frontier - AISLE https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jagged-frontier)

What if a vendor will not share transcripts or per-task results?

Assume you cannot reproduce their claims. Ask for per-task transcripts, results on your exact task definitions, and a clear explanation of how they validate findings. If they cannot provide this, run your own small harness before you standardize anything.

Why track partial steps instead of just success or failure?

Partial progress shows where humans still must intervene. The UK AI Security Institute reports Mythos Preview completed an average of 22 of 32 steps across attempts, even when it did not fully finish. (Source: Our evaluation of Claude Mythos Preview's cyber capabilities https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities)
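
The arithmetic behind those two figures shows why both are worth tracking. The snippet uses only the numbers quoted from the UK AI Security Institute in this document. The variable names are illustrative.

```python
# Figures as quoted from the UK AI Security Institute evaluation.
attempts, full_solves = 10, 3
total_steps, avg_steps_completed = 32, 22

end_to_end_rate = full_solves / attempts
avg_step_completion = avg_steps_completed / total_steps

print(f"end-to-end success: {end_to_end_rate:.0%}")
print(f"average step completion: {avg_step_completion:.1%}")
```

A pass/fail score records 30% here, while step tracking shows that on average about 69% of the chain was completed. The gap between the two is the work a human or a validation step still has to cover.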

What is the biggest operational mistake teams make here?

They measure volume or model size and skip the validation layer. In vulnerability work, unverifiable exploit chains and false positives create real cost. Build confirmation into the workflow before you scale usage.