March 31, 2026 · Krunal Sabnis
We Scaled the Benchmark to 200 Prompts. The Model Wasn't the Problem.
Expanding from 60 prompts across 3 domains to 200 across 5 regulated verticals confirmed what we suspected: right-sizing the model matters, but the governance layer is what makes any of it production-safe.
Why We Expanded
In Part 1 and Part 2, we benchmarked a layered routing pipeline across 60 prompts in three domains — healthcare, finance, and telecom. The results were clear: deterministic rules handled 93% of routing decisions. A 1.5B SLM added marginal value on the remaining edge cases.
But 60 prompts across 3 domains is a controlled test. Enterprise environments are not controlled. Regulated industries have distinct vocabularies, different compliance constraints, and edge cases that don’t appear in a curated dataset.
We needed to know: does the architecture hold when the dataset gets harder?
What Changed
We expanded from 60 to 200 prompts across 5 regulated verticals:
| Domain | Prompts | Regulation Context |
|---|---|---|
| Healthcare | 40 | HIPAA — patient data, clinical decision support |
| Finance | 40 | PCI DSS, Basel IV — cardholder data, risk modelling |
| Telecom | 40 | EU AI Act — autonomous network management classified high-risk by August 2026 |
| Legal | 40 | Attorney-client privilege, eDiscovery, cross-border compliance |
| Defence | 40 | Classified handling, NATO operations, personnel clearance |
Legal and defence prompts are structurally different from the original three domains. Legal language is dense, context-dependent, and full of terms that look analytical but may be procedural. Defence prompts reference operational concepts, NATO-specific terminology, and scenarios where sensitivity isn’t determined by PII alone.
The benchmark tests routing accuracy (did the prompt reach the right model?), PII recall (was sensitive data detected before leaving the perimeter?), and latency across the same three modes: rules-only, rules+SLM hybrid, and SLM-only.
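The scoring logic behind those three metrics can be sketched in a few lines. This is an illustrative reconstruction, not the benchmark code itself; `LabeledPrompt`, `route_fn`, and `detect_pii_fn` are stand-ins for whichever mode is under test:

```python
from dataclasses import dataclass, field

@dataclass
class LabeledPrompt:
    text: str
    expected_route: str                            # ground-truth routing label
    pii_spans: list = field(default_factory=list)  # annotated sensitive spans

def score_mode(route_fn, detect_pii_fn, prompts):
    """Routing accuracy and PII recall for one mode (rules-only,
    rules+SLM, or SLM-only), given that mode's routing and PII functions."""
    correct = 0
    pii_found = pii_total = 0
    for p in prompts:
        if route_fn(p.text) == p.expected_route:
            correct += 1
        detected = detect_pii_fn(p.text)
        pii_total += len(p.pii_spans)
        pii_found += sum(1 for span in p.pii_spans if span in detected)
    accuracy = correct / len(prompts)
    recall = pii_found / pii_total if pii_total else 1.0
    return accuracy, recall
```

Latency percentiles come from timing `route_fn` per call; a sub-millisecond P50 for rules-only is what you would expect from pure string matching.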
The Results
| Metric | Rules-only | Rules+SLM | SLM-only |
|---|---|---|---|
| Routing accuracy | 92.0% | 84.5% | 32.0% |
| SLM calls | 0 (0%) | 37 (18.5%) | 200 (100%) |
| Avg latency | 10.3ms | 175.5ms | 840ms |
| P50 latency | <1ms | <1ms | 830ms |
| P95 latency | 10.9ms | 949ms | 967ms |
| PII recall | 95.0% | 95.0% | 0.0% |
Read that third column again. SLM-only: 32% accuracy. Zero PII recall. Consistent with our original findings — but now across five verticals instead of three.
The Finding We Didn’t Expect
In the original benchmark, the SLM added 1.7% accuracy. Marginal, but positive. We expected the same pattern at scale.
Instead, the 1.5B model degraded accuracy by 7.5 points when given the ambiguous cases. Rules-only: 92%. Rules+SLM: 84.5%.
The model isn’t just unhelpful on harder prompts. It’s confidently wrong. When the rules engine correctly routes a defence procurement query to cloud_standard, the SLM overrides it to local_slm. When a legal privilege review should go to cloud_premium, the SLM says “simple.”
This is the same failure mode we documented in Part 2 — the model optimises for helpfulness, not compliance. But on a diverse dataset, the damage compounds. Every domain the model hasn’t seen enough training data for becomes a liability.
Right-Sizing Is Necessary. It’s Not Sufficient.
Our Part 1 and Part 2 findings still hold: deterministic layers should handle what they can, and models should be right-sized to the task. A 1.5B model for classification behind a rules engine is the right architecture.
But the expanded benchmark exposed a harder truth: the model isn’t the variable that matters most. The governance layer is.
Consider what happens without it:
- The SLM routes a prompt containing a patient name and diagnosis to local handling. No PII detection runs. The data leaves the perimeter unmasked.
- The SLM classifies a defence procurement analysis as “simple.” It gets handled by a model without the context window or domain knowledge to produce a useful response. The user gets a confidently wrong answer.
- The SLM confirms it processed a financial transaction it never executed. The user acts on the confirmation.
In each case, the failure isn’t the model’s accuracy. It’s the absence of a layer that verifies the model’s decision before it takes effect.
What the Governance Layer Does
The architecture that survived the expanded benchmark:
┌─────────────────────────────────────┐
│ Deterministic Layer │
│ PII detection, format validation, │
│ domain policy enforcement │
│ (handles 92% of routing decisions) │
├─────────────────────────────────────┤
│ Governance Layer │
│ Action verification, audit trail, │
│ hallucination detection │
│ (deterministic safety net) │
├─────────────────────────────────────┤
│ SLM Layer │
│ Intent classification for │
│ ambiguous cases only │
│ (bounded, never trusted alone) │
├─────────────────────────────────────┤
│ Tool / Model Layer │
│ Route to appropriate endpoint │
│ (local SLM, cloud, premium) │
└─────────────────────────────────────┘
The governance layer sits above the SLM, not below it. Every routing decision — whether from rules or from the model — passes through verification before execution. If the model claims an action was taken, the governance layer checks whether tools were actually called. If the model routes a PII-containing prompt to local handling, the deterministic PII detector overrides it.
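That verification step can be sketched as follows. The decision shape and function names are assumptions for illustration, not the production API; the point is that both checks are deterministic and run after the model has spoken:

```python
def verify_decision(decision, prompt_text, detect_pii, tool_log):
    """Governance check applied to every routing decision before execution.
    detect_pii and tool_log stand in for the deterministic components."""
    overrides = []

    # PII gate: deterministic detection always runs, regardless of what
    # the SLM decided; detected entities must be masked before routing.
    entities = detect_pii(prompt_text)
    if entities and not decision.get("pii_masked"):
        decision = {**decision, "pii_masked": True, "mask_entities": entities}
        overrides.append("pii_mask_forced")

    # Action verification: a claimed action with no matching tool call
    # is treated as a hallucination and stripped before it reaches the user.
    claimed = decision.get("claimed_action")
    if claimed is not None and claimed not in tool_log:
        decision = {**decision, "claimed_action": None}
        overrides.append("hallucinated_action_blocked")

    return decision, overrides
```

Both checks are pure lookups against deterministic state, which is why the governance layer adds negligible latency compared with an SLM call.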
The model earns a place in the architecture. But it never gets the final word.
Per-Domain Breakdown
| Domain | Rules-only | Rules+SLM | SLM-only |
|---|---|---|---|
| Healthcare | 92.5% | 82.5% | 27.5% |
| Finance | 92.5% | 85.0% | 32.5% |
| Telecom | 90.0% | 82.5% | 35.0% |
| Legal | 92.5% | 87.5% | 30.0% |
| Defence | 92.5% | 85.0% | 35.0% |
Rules performance is consistent across all five verticals — 90-92.5%. The deterministic layer doesn’t degrade with domain diversity because it’s policy-driven: each domain has its own keyword set, PII entity list, and sensitivity configuration. Adding a new vertical means adding a policy file, not retraining a model.
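In code, "policy-driven" reduces to a lookup. The sketch below uses hypothetical keyword sets and route names to show the shape of the idea; the real policies are richer, but the mechanism is the same:

```python
# One hypothetical policy per vertical: keywords, PII entities, defaults.
POLICIES = {
    "healthcare": {
        "keywords": {"diagnosis": "cloud_premium", "appointment": "local_slm"},
        "pii_entities": ["PATIENT_NAME", "MRN"],
        "default_route": "cloud_standard",
    },
    "legal": {
        "keywords": {"privilege": "cloud_premium", "docket": "local_slm"},
        "pii_entities": ["CLIENT_NAME", "CASE_ID"],
        "default_route": "cloud_standard",
    },
}

def route_by_policy(domain, prompt_text):
    """First keyword hit wins; otherwise the domain default applies.
    Adding a vertical means adding a POLICIES entry, not retraining."""
    policy = POLICIES[domain]
    text = prompt_text.lower()
    for keyword, route in policy["keywords"].items():
        if keyword in text:
            return route
    return policy["default_route"]
```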
The SLM does the most damage in healthcare, where clinical terminology is dense and ambiguous: rules-only scores 92.5%, but the hybrid falls to 82.5%, a 10-point drop. Legal (87.5%) and defence (85.0%) hold up slightly better, where prompt structure is more formulaic.
What This Means for Enterprises Scaling AI
1. Test against the diversity you’ll see in production
A benchmark that only covers your initial use case will overestimate accuracy. The gap between our 60-prompt result (93%) and 200-prompt result (92%) was small for rules, but the SLM swung from adding 1.7 points to costing 7.5. Models are far more sensitive to distribution shift than deterministic layers.
2. The model is not the governance layer
Right-sizing the model is an optimisation decision. Building the governance layer is a compliance decision. They serve different purposes and have different owners. The engineering team picks the model. The security and compliance teams define the governance policies. Coupling them is the same mistake as letting an SLM make PII classification decisions.
3. Policy-driven architecture scales. Model-driven architecture doesn’t.
Adding legal and defence to the routing pipeline required two new policy YAML files — keyword lists, PII entities, sensitivity rules. No retraining. No prompt engineering. No GPU. The deterministic layer scales linearly with domains because it’s configuration, not computation.
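A policy file for a new vertical might look like the following. This is a hypothetical fragment in the spirit of the YAML files described above; the field names and keyword lists are illustrative, not the actual benchmark config:

```yaml
# policies/legal.yaml — illustrative, not the actual benchmark config
domain: legal
keywords:
  cloud_premium: [privilege, eDiscovery, cross-border]
  local_slm: [docket lookup, filing deadline]
pii_entities: [CLIENT_NAME, CASE_ID, OPPOSING_COUNSEL]
sensitivity:
  default: high
  unmatched_route: cloud_standard
```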
4. The audit trail is what proves it works
Every routing decision in the benchmark is logged: which rule fired, what confidence score, whether the SLM was called, what it returned, what PII was detected. When the SLM degraded accuracy, we knew exactly which prompts it got wrong and why. That same audit trail is what your compliance team reviews and your security team queries during an incident.
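A per-decision audit record along those lines might look like this. The field names are assumptions for illustration, chosen to mirror the fields the paragraph above describes:

```python
import json
import time
import uuid

def audit_record(prompt_id, rule_fired, confidence, slm_called,
                 slm_output, pii_entities, final_route):
    """One structured log line per routing decision; fields are illustrative."""
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "prompt_id": prompt_id,
        "rule_fired": rule_fired,    # which deterministic rule matched
        "confidence": confidence,
        "slm_called": slm_called,
        "slm_output": slm_output,    # None when rules decided alone
        "pii_entities": pii_entities,
        "final_route": final_route,
    }
    return json.dumps(record)
```

Emitting one JSON object per decision keeps the trail queryable: a compliance reviewer can filter on `pii_entities`, and an incident responder can replay exactly which rule fired for a given prompt.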
The Arc So Far
Part 1: 98% PII recall with deterministic detection. No LLM needed for the data boundary.
Part 2: Layered architecture beats model-only. The SLM adds value only at the edges.
This post: Scaled to 200 prompts, 5 verticals. The model isn’t the variable — the governance layer is.
The data boundary is one layer. But once prompts are routed correctly and PII is masked, the agent still needs to do something — call tools, access systems, execute actions. That’s a different boundary with its own governance gap.
Part 3: The tool boundary — your agent has access to 47 tools. Who approved that?
200-prompt benchmark across healthcare (HIPAA), finance (PCI DSS), telecom (EU AI Act), legal (attorney-client privilege), and defence (classified handling). Tested with qwen2.5:1.5b on Ollama. Full audit logs available. March 2026.
Facing these challenges in your AI stack? Get early access → or get in touch.