March 31, 2026 · Krunal Sabnis
We Scaled the Benchmark to 200 Prompts. The Model Wasn't the Problem.
Expanding from 60 prompts across 3 domains to 200 across 5 regulated verticals confirmed what we suspected: right-sizing the model matters, but the governance layer is what makes any of it production-safe.
Why We Expanded
In Part 1 and Part 2, we benchmarked a layered routing pipeline across 60 prompts in three domains — healthcare, finance, and telecom. The results were clear: deterministic rules handled 93% of routing decisions. A 1.5B SLM added marginal value on the remaining edge cases.
But 60 prompts across 3 domains is a controlled test. Enterprise environments are not controlled. Regulated industries have distinct vocabularies, different compliance constraints, and edge cases that don’t appear in a curated dataset.
We needed to know: does the architecture hold when the dataset gets harder?
What Changed
We expanded from 60 to 200 prompts across 5 regulated verticals:
| Domain | Prompts | Regulation Context |
|---|---|---|
| Healthcare | 40 | HIPAA — patient data, clinical decision support |
| Finance | 40 | PCI DSS, Basel IV — cardholder data, risk modelling |
| Telecom | 40 | EU AI Act — autonomous network management classified high-risk by August 2026 |
| Legal | 40 | Attorney-client privilege, eDiscovery, cross-border compliance |
| Defence | 40 | Classified handling, NATO operations, personnel clearance |
Legal and defence prompts are structurally different from the original three domains. Legal language is dense, context-dependent, and full of terms that look analytical but may be procedural. Defence prompts reference operational concepts, NATO-specific terminology, and scenarios where sensitivity isn’t determined by PII alone.
The benchmark tests routing accuracy (did the prompt reach the right model?), PII recall (was sensitive data detected before leaving the perimeter?), and latency across the same three modes: rules-only, rules+SLM hybrid, and SLM-only.
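The scoring logic behind those three metrics can be sketched in a few lines. This is an illustrative reconstruction, not the benchmark code itself; `LabeledPrompt`, `route_fn`, and `detect_pii_fn` are stand-ins for whichever mode is under test:

```python
from dataclasses import dataclass, field

@dataclass
class LabeledPrompt:
    text: str
    expected_route: str                            # ground-truth routing label
    pii_spans: list = field(default_factory=list)  # annotated sensitive spans

def score_mode(route_fn, detect_pii_fn, prompts):
    """Routing accuracy and PII recall for one mode (rules-only,
    rules+SLM, or SLM-only), given that mode's routing and PII functions."""
    correct = 0
    pii_found = pii_total = 0
    for p in prompts:
        if route_fn(p.text) == p.expected_route:
            correct += 1
        detected = detect_pii_fn(p.text)
        pii_total += len(p.pii_spans)
        pii_found += sum(1 for span in p.pii_spans if span in detected)
    accuracy = correct / len(prompts)
    recall = pii_found / pii_total if pii_total else 1.0
    return accuracy, recall
```

Latency percentiles come from timing `route_fn` per call; a sub-millisecond P50 for rules-only is what you would expect from pure string matching.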
The Results
| Metric | Rules-only | Rules+SLM | SLM-only |
|---|---|---|---|
| Routing accuracy | 92.0% | 84.5% | 32.0% |
| SLM calls | 0 (0%) | 37 (18.5%) | 200 (100%) |
| Avg latency | 10.3ms | 175.5ms | 840ms |
| P50 latency | <1ms | <1ms | 830ms |
| P95 latency | 10.9ms | 949ms | 967ms |
| PII recall | 95.0% | 95.0% | 0.0% |
Read that third column again. SLM-only: 32% accuracy. Zero PII recall. Consistent with our original findings — but now across five verticals instead of three.
The Finding We Didn’t Expect
In the original benchmark, the SLM added 1.7% accuracy. Marginal, but positive. We expected the same pattern at scale.
Instead, the 1.5B model degraded accuracy by 7.5 points when given the ambiguous cases. Rules-only: 92%. Rules+SLM: 84.5%.
The model isn’t just unhelpful on harder prompts. It’s confidently wrong. When the rules engine correctly routes a defence procurement query to cloud_standard, the SLM overrides it to local_slm. When a legal privilege review should go to cloud_premium, the SLM says “simple.”
This is the same failure mode we documented in Part 2 — the model optimises for helpfulness, not compliance. But on a diverse dataset, the damage compounds. Every domain the model hasn’t seen enough training data for becomes a liability.
Right-Sizing Is Necessary. It’s Not Sufficient.
Our Part 1 and Part 2 findings still hold: deterministic layers should handle what they can, and models should be right-sized to the task. A 1.5B model for classification behind a rules engine is the right architecture.
But the expanded benchmark exposed a harder truth: the model isn’t the variable that matters most. The governance layer is.
Consider what happens without it:
- The SLM routes a prompt containing a patient name and diagnosis to local handling. No PII detection runs. The data leaves the perimeter unmasked.
- The SLM classifies a defence procurement analysis as “simple.” It gets handled by a model without the context window or domain knowledge to produce a useful response. The user gets a confidently wrong answer.
- The SLM confirms it processed a financial transaction it never executed. The user acts on the confirmation.
In each case, the failure isn’t the model’s accuracy. It’s the absence of a layer that verifies the model’s decision before it takes effect.
What the Governance Layer Does
The architecture that survived the expanded benchmark:
┌─────────────────────────────────────┐
│ Deterministic Layer │
│ PII detection, format validation, │
│ domain policy enforcement │
│ (handles 92% of routing decisions) │
├─────────────────────────────────────┤
│ Governance Layer │
│ Action verification, audit trail, │
│ hallucination detection │
│ (deterministic safety net) │
├─────────────────────────────────────┤
│ SLM Layer │
│ Intent classification for │
│ ambiguous cases only │
│ (bounded, never trusted alone) │
├─────────────────────────────────────┤
│ Tool / Model Layer │
│ Route to appropriate endpoint │
│ (local SLM, cloud, premium) │
└─────────────────────────────────────┘
The governance layer sits above the SLM, not below it. Every routing decision — whether from rules or from the model — passes through verification before execution. If the model claims an action was taken, the governance layer checks whether tools were actually called. If the model routes a PII-containing prompt to local handling, the deterministic PII detector overrides it.
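That verification step can be sketched as follows. The decision shape and function names are assumptions for illustration, not the production API; the point is that both checks are deterministic and run after the model has spoken:

```python
def verify_decision(decision, prompt_text, detect_pii, tool_log):
    """Governance check applied to every routing decision before execution.
    detect_pii and tool_log stand in for the deterministic components."""
    overrides = []

    # PII gate: deterministic detection always runs, regardless of what
    # the SLM decided; detected entities must be masked before routing.
    entities = detect_pii(prompt_text)
    if entities and not decision.get("pii_masked"):
        decision = {**decision, "pii_masked": True, "mask_entities": entities}
        overrides.append("pii_mask_forced")

    # Action verification: a claimed action with no matching tool call
    # is treated as a hallucination and stripped before it reaches the user.
    claimed = decision.get("claimed_action")
    if claimed is not None and claimed not in tool_log:
        decision = {**decision, "claimed_action": None}
        overrides.append("hallucinated_action_blocked")

    return decision, overrides
```

Both checks are pure lookups against deterministic state, which is why the governance layer adds negligible latency compared with an SLM call.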
The model earns a place in the architecture. But it never gets the final word.
Per-Domain Breakdown
| Domain | Rules-only | Rules+SLM | SLM-only |
|---|---|---|---|
| Healthcare | 92.5% | 82.5% | 27.5% |
| Finance | 92.5% | 85.0% | 32.5% |
| Telecom | 90.0% | 82.5% | 35.0% |
| Legal | 92.5% | 87.5% | 30.0% |
| Defence | 92.5% | 85.0% | 35.0% |
Rules performance is consistent across all five verticals — 90-92.5%. The deterministic layer doesn’t degrade with domain diversity because it’s policy-driven: each domain has its own keyword set, PII entity list, and sensitivity configuration. Adding a new vertical means adding a policy file, not retraining a model.
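In code, "policy-driven" reduces to a lookup. The sketch below uses hypothetical keyword sets and route names to show the shape of the idea; the real policies are richer, but the mechanism is the same:

```python
# One hypothetical policy per vertical: keywords, PII entities, defaults.
POLICIES = {
    "healthcare": {
        "keywords": {"diagnosis": "cloud_premium", "appointment": "local_slm"},
        "pii_entities": ["PATIENT_NAME", "MRN"],
        "default_route": "cloud_standard",
    },
    "legal": {
        "keywords": {"privilege": "cloud_premium", "docket": "local_slm"},
        "pii_entities": ["CLIENT_NAME", "CASE_ID"],
        "default_route": "cloud_standard",
    },
}

def route_by_policy(domain, prompt_text):
    """First keyword hit wins; otherwise the domain default applies.
    Adding a vertical means adding a POLICIES entry, not retraining."""
    policy = POLICIES[domain]
    text = prompt_text.lower()
    for keyword, route in policy["keywords"].items():
        if keyword in text:
            return route
    return policy["default_route"]
```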
The SLM does the most damage in healthcare, where clinical terminology is dense and ambiguous: rules-only scores 92.5%, but the hybrid falls to 82.5%, a 10-point drop. Legal (87.5%) and defence (85.0%) hold up slightly better, where prompt structure is more formulaic.
What This Means for Enterprises Scaling AI
1. Test against the diversity you’ll see in production
A benchmark that only covers your initial use case will overestimate accuracy. The gap between our 60-prompt result (93%) and 200-prompt result (92%) was small for rules, but the SLM swung from adding 1.7 points to costing 7.5. Models are far more sensitive to distribution shift than deterministic layers.
2. The model is not the governance layer
Right-sizing the model is an optimisation decision. Building the governance layer is a compliance decision. They serve different purposes and have different owners. The engineering team picks the model. The security and compliance teams define the governance policies. Coupling them is the same mistake as letting an SLM make PII classification decisions.
3. Policy-driven architecture scales. Model-driven architecture doesn’t.
Adding legal and defence to the routing pipeline required two new policy YAML files — keyword lists, PII entities, sensitivity rules. No retraining. No prompt engineering. No GPU. The deterministic layer scales linearly with domains because it’s configuration, not computation.
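A policy file for a new vertical might look like the following. This is a hypothetical fragment in the spirit of the YAML files described above; the field names and keyword lists are illustrative, not the actual benchmark config:

```yaml
# policies/legal.yaml — illustrative, not the actual benchmark config
domain: legal
keywords:
  cloud_premium: [privilege, eDiscovery, cross-border]
  local_slm: [docket lookup, filing deadline]
pii_entities: [CLIENT_NAME, CASE_ID, OPPOSING_COUNSEL]
sensitivity:
  default: high
  unmatched_route: cloud_standard
```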
4. The audit trail is what proves it works
Every routing decision in the benchmark is logged: which rule fired, what confidence score, whether the SLM was called, what it returned, what PII was detected. When the SLM degraded accuracy, we knew exactly which prompts it got wrong and why. That same audit trail is what your compliance team reviews and your security team queries during an incident.
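A per-decision audit record along those lines might look like this. The field names are assumptions for illustration, chosen to mirror the fields the paragraph above describes:

```python
import json
import time
import uuid

def audit_record(prompt_id, rule_fired, confidence, slm_called,
                 slm_output, pii_entities, final_route):
    """One structured log line per routing decision; fields are illustrative."""
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "prompt_id": prompt_id,
        "rule_fired": rule_fired,    # which deterministic rule matched
        "confidence": confidence,
        "slm_called": slm_called,
        "slm_output": slm_output,    # None when rules decided alone
        "pii_entities": pii_entities,
        "final_route": final_route,
    }
    return json.dumps(record)
```

Emitting one JSON object per decision keeps the trail queryable: a compliance reviewer can filter on `pii_entities`, and an incident responder can replay exactly which rule fired for a given prompt.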
The Arc So Far
Part 1: 98% PII recall with deterministic detection. No LLM needed for the data boundary.
Part 2: Layered architecture beats model-only. The SLM adds value only at the edges.
This post: Scaled to 200 prompts, 5 verticals. The model isn’t the variable — the governance layer is.
The data boundary is one layer. But once prompts are routed correctly and PII is masked, the agent still needs to do something — call tools, access systems, execute actions. That’s a different boundary with its own governance gap.
Part 3: The tool boundary — your agent has access to 47 tools. Who approved that?
200-prompt benchmark across healthcare (HIPAA), finance (PCI DSS), telecom (EU AI Act), legal (attorney-client privilege), and defence (classified handling). Tested with qwen2.5:1.5b on Ollama. Full audit logs available. March 2026.
Facing these challenges in your AI stack? Get early access → or get in touch.